Panning for gold in your data stream
Data analytics have become part of the information lifeblood. Simple dashboards provide easy access to powerful business insights, almost in real time. Data, drawn from a variety of massive internal and external sources, reveals key performance indicators and other derived knowledge that plays an essential role in a company's strategic plans.
The technology has revolutionized how corporations are managed, and the amount of data that can be and is collected continues to increase. So using that data for a competitive advantage or even just to help focus the business will continue to be a critical task.
But what if you don’t know what you’re looking for?
Three kinds of information
One might argue that there are three classes of knowledge:
- Information we know that we know
- Information we know that we don’t know
- Information we don’t know that we don’t know
Traditionally, statisticians and analysts have worked from a data model mindset. You create a mathematical model that describes how a given input results in an observed output. You then test the model against datasets, typically computing a p-value to judge whether the observed fit could plausibly be due to chance. If the model doesn’t accurately predict the results, you tweak it (or throw it out and start over) until you find something that works.
Leo Breiman described an alternate approach in his 2001 paper, "Statistical Modeling: The Two Cultures." He advocated for a different method: Take a dataset with inputs and outputs, and then allow decision trees and neural nets to independently discover an algorithm that accurately predicts the results.
According to a global analytics lead at a Research Triangle-based data analytics firm, the big data approach has shifted dramatically in the direction of algorithm discovery. Instead of developing a model first, artificial intelligence approaches rely on massive training and test datasets. You set the neural networks or other machine learning tools to work on the training set. Once patterns are detected, you run them against the test set to see if they produce results that match known outcomes. In effect, you look to see if you can find any needle in your haystacks, and once you have found one, you test the needle's validity.
Unlike building a data model, this second approach doesn’t require that you come up with an explanation of how the inputs and outputs are connected. The discovery approach simply says that for a given input, this is the likely output.
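As a minimal sketch of this discovery approach, the Python below learns a one-rule "decision stump" from a training set and then checks it against a held-out test set. Everything in it is an assumption made for illustration: the synthetic data, the hidden cutoff of 0.6, the 5 percent label noise, and the function names. Real projects would use far larger datasets and richer learners such as decision trees or neural nets.

```python
import random


def make_dataset(n, seed):
    """Synthetic, made-up data: the label is 1 when x exceeds a hidden cutoff."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.6 else 0
        # Flip about 5% of labels to simulate noisy real-world outcomes.
        if rng.random() < 0.05:
            label = 1 - label
        rows.append((x, label))
    return rows


def fit_stump(train):
    """Try each observed x as a threshold; keep the one with fewest training errors."""
    best_threshold, best_errors = None, len(train) + 1
    for threshold, _ in train:
        errors = sum((x > threshold) != bool(y) for x, y in train)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def accuracy(threshold, rows):
    """Share of rows where the rule 'x > threshold' matches the known label."""
    correct = sum((x > threshold) == bool(y) for x, y in rows)
    return correct / len(rows)


data = make_dataset(400, seed=7)
train, test = data[:300], data[300:]   # the test rows are never used for fitting

threshold = fit_stump(train)
test_accuracy = accuracy(threshold, test)
print(f"learned threshold={threshold:.3f}, test accuracy={test_accuracy:.2%}")
```

The learner is never told why the cutoff exists. It only reports that, for a given input, a certain output is likely, and the test set is what tells you whether to trust that rule.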
One key tool in this process is signal detection. This refers to algorithms that can take apparently random data and detect patterns within that data. In other words, they detect a signal against the background noise. But we are still looking for a specific result set. It's not that we already know the answer; it's just that we are looking for answers to specific questions.
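One simple form of signal detection can be sketched with autocorrelation, which measures how strongly a series resembles a shifted copy of itself. The example below is an illustration under stated assumptions, not a description of any particular product: the data is synthetic (a 12-step cycle buried in random noise), and the search range of 2 to 20 steps is an assumed guess at plausible cycle lengths.

```python
import math
import random


def autocorrelation(series, lag):
    """Normalized correlation between the series and itself shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Synthetic data (an assumption for illustration): a 12-step cycle plus noise.
rng = random.Random(42)
series = [
    math.sin(2 * math.pi * t / 12) + rng.gauss(0, 0.8) for t in range(1000)
]

# Candidate cycle lengths of 2 to 20 steps are an assumed search range.
scores = {lag: autocorrelation(series, lag) for lag in range(2, 21)}
period = max(scores, key=scores.get)
strength = scores[period]
print(f"strongest repeat at lag {period} (autocorrelation {strength:.2f})")
```

Note that the question is still specific ("does anything repeat, and how often?"), even though the answer is not known in advance.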
Cast a wide network
Another data analytics expert interviewed points out that data in its own right “is dumb.” It is the variety of data that creates the value, when you can find ways to link different datasets. This linking process comes on top of the regular data preparation tasks to create a curated, clean, and interoperable set. It means finding meaningful and accurate ways to link unrelated datasets.
This often requires input from domain experts who are familiar with the data and its use in the field. False assumptions lead to misinterpretation of the data and the creation of improper linkages between datasets. This is quite different from the traditional GIGO situation, where garbage data as input leads to garbage outputs. The data may be accurate and appropriate, but if it is not interpreted correctly, the results can be very wrong.
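To make the risk concrete, here is a small hypothetical sketch. The store codes, store names, and amounts are all invented for illustration. Both datasets are accurate, but a false assumption about what identifies a store (the number alone, rather than region plus number) silently links a sale to the wrong location.

```python
# Hypothetical reference data: stores are identified by region AND number.
stores = {("NE", 42): "Harbor Street", ("SW", 42): "Canyon Road"}

# Hypothetical sales feed from a separate system, keyed by a text code.
sales = [("NE-0042", 1200.0), ("SW-0042", 450.0)]


def parse_store_code(code):
    """Split a code such as 'NE-0042' into ('NE', 42)."""
    region, number = code.split("-")
    return region, int(number)


# False assumption: the number alone identifies a store.
# Building this index quietly overwrites one store with the other.
stores_by_number = {number: name for (_, number), name in stores.items()}
naive_links = [
    (stores_by_number[parse_store_code(code)[1]], amount)
    for code, amount in sales
]

# Domain-informed rule: region and number together identify a store.
expert_links = [
    (stores[parse_store_code(code)], amount) for code, amount in sales
]

print("naive: ", naive_links)
print("expert:", expert_links)
```

The naive join raises no error and drops no rows, which is exactly why this kind of mistake tends to need a domain expert to catch it.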
Adding data to the system increases its value but also increases the project complexity. Often, adding sets requires accessing data from unconnected silos, which may require negotiations and assurances to stakeholders that the data will be used securely and responsibly.
Speed matters
Another consideration for the algorithm search is that you need to find a solution that can be performed in a reasonable length of time. Many industries face rapidly changing market conditions. Speed to market is an essential business concept: A product that ships on one date could be an industry success, but if that same product ships just six months later, it might be a total failure.
In the same way, information derived from data analytics can lose much of its value in a very short time. One classic example in the field was the Netflix Prize, in which the company offered $1 million to the first team that could improve the accuracy of its personalized movie recommendation engine by 10 percent. A team won the prize in 2009 by reaching that mark over the baseline system, but Netflix never implemented the winning solution because the array of algorithms was too unwieldy and expensive to set up. Worse, it was too slow. When subscribers rented DVDs by mail, the company had days to come up with recommendations. As streaming video took over the business, subscriber recommendations had to be made in seconds, and the Netflix Prize solution lost its value.
Pick your battlefield
It pays to spend some time in advance to identify the problems—or type of problems—you want to solve. One data analytics expert recommends looking for the highest value solutions that might be available. While you might be able to cut the costs for a certain business activity in half, that may not be a significant target if the total cost is just a fractional percentage of your budget. On the other hand, a small incremental improvement could result in an enormous return if it applies to a major portion of your business.
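The arithmetic behind this advice is easy to sketch. The figures below are invented for illustration (a $100 million budget, one activity at 0.5 percent of it, another at 40 percent), and `annual_savings` is a hypothetical helper, not part of any library.

```python
def annual_savings(budget, share_of_budget, reduction):
    """Savings from cutting one activity's cost by `reduction` (0.5 means 50%)."""
    return budget * share_of_budget * reduction


budget = 100_000_000  # hypothetical total annual budget, in dollars

# Halve the cost of an activity that is only 0.5% of the budget.
halve_minor = annual_savings(budget, share_of_budget=0.005, reduction=0.50)

# Trim just 2% from an activity that makes up 40% of the budget.
trim_major = annual_savings(budget, share_of_budget=0.40, reduction=0.02)

print(f"halving the minor activity saves ${halve_minor:,.0f}")
print(f"a 2% trim on the major activity saves ${trim_major:,.0f}")
```

Under these assumed numbers, the modest 2 percent improvement is worth more than three times the dramatic 50 percent cut, simply because of where it is applied.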
Meta Brown, author of “Data Mining for Dummies,” recommends looking for cost savings first instead of revenue increases. One reason for this is that implementing changes to increase revenues often requires buy-in from many different parts of an organization. Costs are typically managed by discrete units within the organization, which means changes can often be implemented faster and with less need to gain wide support.
Brown also recommends starting with low-risk projects as the target for your analytics. It’s better at first to try to improve a small project that is going badly rather than focus on the bread-and-butter operations of your business. An improvement there won’t have a major impact on the bottom line, but it can validate the process and lead to more ambitious assignments.
Project roadmap
Brown also recommends an open standard as a guideline for data mining projects. The Cross Industry Standard Process for Data Mining (CRISP-DM) was created by an industry consortium with backing from the European Union. The intent was to design a process that is “industry-, tool-, and application-neutral.” The final report was published in 2000.
The process is designed to be circular: each cycle returns to the starting point with new insights, which can lead to new and more focused questions. Each cycle includes six sequential phases:
- Business understanding: Determine objectives, assess situation, determine goals, and create a project plan
- Data understanding: Collect and explore the available data and verify its quality
- Data preparation: Select, clean, integrate, and format data
- Modeling: Choose model techniques, generate test design, and build and assess the model
- Evaluation: Evaluate results, review the process, and determine next steps
- Deployment: Plan deployment and monitoring of the solution, report results, and evaluate the project
Note that the fourth phase, modeling, can mean either building a traditional mathematical model or leaving it up to algorithms to detect the patterns without having to specifically describe the model’s inner workings. Brown underscores the importance of documenting everything involved in the project; this is the weak link in many projects, yet it is essential to being able to evaluate the process and its results.
Searching for what you don’t know that you don’t know may seem impossible, but it can be a path to uncovering valuable insights that can have a major positive impact for your company. This is a practice that is already widely used in medical research and likely to find broader application in other industries as leaders learn the value of such data analytics.
Mining for unexpected data: Lessons for leaders
- Tools exist to help you find valuable insights in data that might otherwise appear as a mountain of unrelated noise.
- Careful planning and documentation are the keys to a successful project.
- Data analytics can have a very short shelf life; insights often need to be created almost in real time if they are going to provide valuable guidance.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.