|IN THIS ISSUE|
|Friday, June 1, 2012|
|Big Data: Big Discovery or Big Disappointment?|
|Tags: Big Data|
|Posted By William Perlowitz, Chief Technology Officer, Wyle Information Systems|
In the public or private sector, the goal of Big Data is to evolve from hindsight to insight and make decisions in time to affect the outcome of events. The difference between making “Big Discoveries” (such as sequencing 3.2 billion base pairs in the human genome) and “Big Disappointments” (such as a stock trading company losing control of an algorithm and causing serious delays on an exchange) is cultural. To be successful, you must admit that you can’t understand your data or the relationships within it prior to analyzing it, and that you can’t predict what questions you’ll be trying to answer with your data in the future.
Whether you’re grappling with an exponentially increasing volume of data (minus an exponentially increasing budget) or carefully examining the differences between “Big Data,” “linked data,” and “real time data,” one thing is certain: the ability to manipulate larger and larger amounts of data is vital to your mission. If extreme data management were just about doing more of the same thing, about storing and processing higher and higher volumes of data, then the improvements we expect in digital technology might be good enough to continually meet our needs as the data volume increases. Unfortunately, maintaining the status quo doesn’t work, because achieving mission success for our users and our constituency, and thwarting our adversaries, means that we need to analyze more complex data more and more rapidly.
When we study and understand this need for speed, we discover that “Big Data,” the volume of data, is only the tip of the iceberg. The variety of data, the velocity at which it arrives, and the complexity of the data all combine to make achieving the positive impacts of Big Data on efficiency, cost, and productivity more difficult than we might have imagined.
You Don’t Know What You Don’t Know
In our tried and tested relational world, we’ve had the luxury to “know what we don’t know.” That is, we may not know the numerical answer we’re seeking, but we know what data to collect, which techniques to use to massage and store it, and how to convert the data we store into the information we need. And if we don’t store the data in the raw form in which we received it, or if we combine and summarize data, or if we throw away data that falls outside of some threshold, not only do we save on storage, but we can compute information more quickly since there is less data to process.
But data reduction will lead to a Big Disappointment.
Big Data is a new way to think about data. Getting from hindsight to insight is an iterative process: You ask a question, examine a data set, then evolve the question based on the examination, and look at more data sets. Put a different way, the use case for your data today does not anticipate the use cases of tomorrow because you literally “don’t know what you don’t know” tomorrow. If you tailor your data to the first use case that requires it, you will lose data that you need to create tomorrow’s information.
As a simple example, you could use any number of tools to generate detailed information about one of your websites. Today, you might just be interested in page views to see how popular the site is and predict when you might need another server, so you save that information and discard the rest. Three months from now, however, you may want more detailed usage and traffic statistics to help improve the site design. A year from now, you may want to look at the history of geographies accessing the site and correlate that to external events to justify the site’s existence. Since the only data you are retaining is page views and it is not possible to recreate the historical data, only page view analysis is open to you. What a Big Disappointment, right?
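The trade-off in the website example above can be sketched in a few lines of code. This is a hypothetical illustration, not any particular analytics tool: the log records, field names, and values are invented, but the point holds for any web log.

```python
# A minimal sketch of the retention trade-off: keeping raw records
# preserves tomorrow's use cases; keeping only an aggregate does not.
# All records and field names here are hypothetical.

raw_log = [
    {"page": "/home",  "country": "US", "bytes": 5120},
    {"page": "/home",  "country": "DE", "bytes": 5120},
    {"page": "/about", "country": "US", "bytes": 2048},
]

# Today's use case: page views, to gauge popularity and plan capacity.
page_views = {}
for record in raw_log:
    page_views[record["page"]] = page_views.get(record["page"], 0) + 1

# A year from now: traffic by geography, to correlate with external
# events. Answerable only because the raw records were retained.
by_country = {}
for record in raw_log:
    by_country[record["country"]] = by_country.get(record["country"], 0) + 1

# Had we stored only page_views and discarded raw_log, by_country
# could never be reconstructed.
```

If only the `page_views` dictionary had been kept, the geography question would be unanswerable: the aggregation is lossy, and the discarded fields cannot be recovered.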
The lesson is that to make Big Discoveries you must have access to all of your data in its original form, forever. If data is modified, you are likely to lose something you don’t yet know you need. If data is squirreled away on a tape in a ferrous mountain, it is unlikely that it will be included in your insight iterations.
Lose the Pretense
We’ve all been trained in hypothesis-driven science, in “the scientific method.” We make a prediction (a hypothesis), experiment to collect data, and test the fit of our data against our prediction.
We may suspect, for example, that unauthorized people are going to access our network and steal our data. We put countermeasures in place, predict that these are sufficient to prevent access and removal of our data, and collect continuous monitoring data to ensure that we can detect any anomalies.
But that’s not the end. Adversaries measure our countermeasures, develop and apply successful counter-countermeasures, and we detect a loss of data: a Big Disappointment. In response, we update our prediction, put counter-counter-countermeasures in place, and continuously monitor until we lose data again.
As we traverse each measure-countermeasure cycle we gather data about how information was exfiltrated. Eventually, we have enough data to go beyond firewalls and Intrusion Detection Systems and implement machine learning and temporal reasoning to detect and prevent future abnormal access to data. We rely on these machines to “learn” what is normal and what is not because there are too many actors performing too many actions in too many ways for a human to track.
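The machine-learning systems described above are far more sophisticated than any short example, but the core idea of learning “what is normal” can be sketched with a simple statistical baseline. This is a toy illustration under invented numbers, not a real intrusion-detection implementation: it flags an access count as abnormal when it sits far above the behavior observed so far.

```python
import statistics

# Hypothetical hourly request counts observed for one account;
# the values are invented for illustration.
baseline = [102, 98, 110, 95, 105, 101, 99, 107]

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def is_anomalous(count, threshold=3.0):
    """Flag counts more than `threshold` standard deviations above
    the learned baseline of normal behavior."""
    return count > mean + threshold * stdev

is_anomalous(104)  # within the normal range: no alert
is_anomalous(900)  # far above anything observed: flag for review
```

A production system would model many actors, many actions, and temporal patterns rather than a single count, but the principle is the same: the definition of “normal” is learned from the data instead of guessed in advance.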
By losing the pretense that every possibility can be guessed beforehand, we enable Big Discoveries. By losing the pretense, we acknowledge that the scientific process is fundamentally transformed because the volume of data and intricacy of relationships within it are more complex than any human could consider. With this data-driven approach, we start by collecting data and then see what it reveals, without attempting to guess every possibility beforehand.
Now For the Good Part: Discovery
Big Data has the potential to increase efficiency, improve the speed and accuracy of decisions, forecast the future, identify savings opportunities, increase transparency, and provide insight into the needs of the citizenry and your agency.
Whether you are already using Big Data analytics and visualization to make strategic decisions or thinking about ways to better analyze and display data you already collect, there are technical and management questions that you need to ask to progress from hindsight to insight:
So whether you think Big Data is what you’ve been doing for years or believe it will change science and IT as we have known them beyond recognition, the time to begin tough discussions about extreme data management is now. By addressing ownership issues, acquiring the new tools and skills Big Data requires, and explicitly defining your Data Management Plan, you will create the environment that avoids Big Disappointments and enables your agency to make Big Discoveries.