Manage your edge data before it becomes a problem
The amount of data generated at the edge is staggering. The World Economic Forum estimates that by 2025, 463 exabytes of data will be created daily across all use cases. Meanwhile, a Gartner report predicts that in the same time frame, three-quarters of all enterprise data will not only be created at the edge but also be processed there.
That's the good news.
Applications, sensors, customers, social media, telecom, and a wealth of IoT devices all play a part in creating this data deluge. Technology has adapted, allowing for analytics to be performed, in many cases, where the data is actually created. But a significant percentage of that data needs to be rolled up for analysis at a core location.
And given the huge amount of data involved, that's the bad news.
This is a dual analytics play: analysis at the edge and at the core in the cloud. Unfortunately, the resulting large data backhauls from edge to core can put a severe strain on networks: the cost of network bandwidth, congestion on potentially slow and expensive links, and the IT oversight needed to keep it all operational. To prevent problems, companies need to take proactive steps to reduce the data burden.
"Data backhaul costs money and time. It costs time because moving more data is slower than moving less data, and it costs money in every other dimension, from interconnect costs to power budget," says Shevek, chief technology officer at CompilerWorks, a company that automates how data landscapes are viewed, migrated, and managed.
Why you need to start minimizing data backhaul now
The challenge is in reducing data volumes in meaningful ways and at pace with edge data growth rates. With IDC predicting that the worldwide edge computing market will reach almost $251 billion in 2024, data creation at the edge will grow accordingly.
While it is increasingly simple to create data, extracting useful information from it and making sure it is available where it's needed will continue to be a challenge. The cost to manage and move data around can be significant, and finding ways to reduce that financial burden will need to take center stage.
"Ultimately, reducing the volume of data brought back to the cloud by intelligently selecting data at the edge reduces both mobile bandwidth bills and also the volume of data that then needs to be processed," says Daniel Warner, CEO of LGN, an edge AI company. "For AI workloads, this means less data to clean, annotate, and train on, all of which are also highly costly processes." This is, however, a highly nuanced process. More training data is always a good thing for data scientists. The focus should be on striking a balance to ensure there is enough recent data to detect drift and realign the models versus bringing everything back, which has a cost.
Intelligently scheduling backhaul at times when less data is being generated or transmission costs are reduced will also give you better control of the expenditures necessary to move data.
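As a minimal sketch of that scheduling idea, a gateway can defer uploads until an off-peak window unless its local buffer is about to overflow. The window hours and buffer limit here are hypothetical; in practice they would come from your carrier contract and observed traffic patterns.

```python
from datetime import datetime, time

# Hypothetical off-peak window; actual hours depend on your carrier
# contract and measured network congestion.
OFF_PEAK_START = time(1, 0)   # 01:00 local
OFF_PEAK_END = time(5, 0)     # 05:00 local

def in_off_peak(now: datetime) -> bool:
    """Return True if the current time falls inside the off-peak window."""
    return OFF_PEAK_START <= now.time() < OFF_PEAK_END

def should_transmit(now: datetime, buffer_bytes: int,
                    max_buffer: int = 512 * 1024 * 1024) -> bool:
    """Defer backhaul until off-peak, unless the local buffer is close
    to overflowing and data would otherwise be lost."""
    return in_off_peak(now) or buffer_bytes >= max_buffer

print(should_transmit(datetime(2023, 1, 1, 2, 30), 1024))   # off-peak: transmit
print(should_transmit(datetime(2023, 1, 1, 14, 0), 1024))   # peak, small buffer: wait
```

A real deployment would also account for time zones, retry logic, and priority tiers for urgent data, but the core decision is this simple.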
"Edge products and services are powering the next wave of digital transformation," says Dave McCarthy, research director for edge strategies at IDC. "Organizations of all types are looking to edge technology as a method of improving business agility and creating new customer experiences."
That means there is a lot of data that must be sorted, stored or deleted accordingly, and analyzed on demand. The more you can prune the data back first, without losing valuable information, the faster, better, and cheaper you can manage the backhaul.
How to reduce data backhaul
The goal is to reach a balance between cost and bandwidth. This is not a static state; the balance must be reassessed periodically as data volumes and costs change.
In some narrow cases, the choice of what data to move can be simple. Consider the issue of moving data from a Mars lander and asking the question, "Is there life on Mars?"
"We could reduce the entire communication from the Mars lander to one bit of information—yes or no—if we had the foresight to understand and know how to analyze all the data at the edge," Shevek says. "But we haven't, and so we make an intelligent split based on the cost of remote processing and the cost of data transmission, between what to do at the edge and what to do locally, without missing any potential information."
But in more common scenarios here on Earth, transparency and the ability to show why that bit was set become ever more critical as AI at the edge grows. The edge application might be a lending algorithm that has just rejected your loan application or a court algorithm that has set someone's bail very high. In either case, "Computer says no" is not a valid response. So although that data doesn't necessarily have business value, it has governance and moral value and should be kept.
This shows the limitations of the approach considered in the "life on Mars" example. The first question you would ask of the nonexistent data is, "Show me!" And in a situation where there are moral and legal concerns, as opposed to a simple binary choice, you will need to be able to provide the data to address those concerns.
In other words, data must be sorted at the edge before you can do anything else to lessen the load or optimize data transfers. But it matters which method you choose to do so.
"Technologies to compress or minimize data at the edge are often a Band-Aid that doesn't really solve the problem," says Erik Ottem, vice president of marketing at Cachengo, an edge AI platform provider. And while these fixes might solve today's bandwidth problem, they can rarely scale and are just putting off the need to find a long-term solution.
"The short-term answer to the backhaul problem is to only send the results of the analytic workload to the data center or cloud. Advances in edge-friendly system components and purpose-built software environments now make this achievable," Ottem adds. This approach doesn't negate the fact that you will still need to regularly move larger chunks of source data to validate that performance and capability are still within acceptable parameters and to influence other edge data collections, as required.
In most cases, enterprises should reexamine and revamp their existing data management strategies to minimize the impact from edge analytics overall.
"As the volume of data for AI applications increases in the data center, derived from ever larger deployments of edge AI applications into production, this type of workload and the need to optimize it are going to become increasingly critical for enterprise data management and data warehousing strategies," says LGN's Warner.
How to decide which data is valuable and which is just flotsam
"The most important thing is to discard data which does not contain information. Of that remaining information, discard the data that does not affect any business decision," Shevek says. But don't let the apparent simplicity of that answer distract you. Remember that all data contains information but maybe not relevant to one specific use case. Because organizations are very use case-centric in collecting data today, data scientists are struggling with new use cases that could easily have been managed had additional data been captured. Organizations don't always know what business decisions will be required in advance. Look at the effects of COVID-19: There was suddenly a huge wave of unforeseen decisions to be made, and by large, only those with the data in advance made good ones.
"A simple approach to step one would be to apply compression, since several compression schemes are close enough to the Shannon limit that they effectively reduce the data transfer to the minimum required," Shevek adds. But for a growing enterprise, this simply buys you time to implement a broader solution.
If you're training a machine learning model and keep exactly one instance of the correct or expected data and one instance of every unexpected variant, the model will have no sense of how often each case actually occurs. If the goal of some of this data is to train or refine models, then that strategy is flawed. An acceptable modification to this "avoid repetitive data" strategy would be to record counts and the distribution of occurrences so the patterns could be re-created, even using synthetic data.
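A minimal sketch of that modification, using a hypothetical stream of categorical sensor events: ship the counts rather than the raw records, then re-create the pattern downstream as synthetic data that preserves the original frequencies.

```python
from collections import Counter
import random

# Hypothetical event stream with heavy repetition: 10,000 records total.
events = ["ok"] * 9_500 + ["warn"] * 450 + ["fault"] * 50

# Instead of backhauling 10,000 records, ship the distribution:
# three labels and three counts.
distribution = Counter(events)
print(distribution)  # Counter({'ok': 9500, 'warn': 450, 'fault': 50})

# Downstream, synthetic training data can be drawn that preserves
# the original class frequencies.
random.seed(0)
synthetic = random.choices(
    population=list(distribution),
    weights=list(distribution.values()),
    k=1_000,
)
print(Counter(synthetic))
```

The counts are a few bytes where the raw stream was thousands of records, yet the class balance a model needs to learn survives intact.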
For example, smart thermostats may continuously report the temperature of a room or a refrigerated truck container. Rather than keeping every data point the thermostat generates every 10 milliseconds, it would collect far less data yet be "equally effective to report only when the room temperature changes. We may view this as either edge analytics or RLE compression, depending on perspective and background; either is valid," Shevek says. As with most decision processes here, this needs to apply to the specific use case for the data collected.
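A report-on-change filter of the kind Shevek describes can be sketched in a few lines. The sample stream and the optional deadband threshold here are hypothetical.

```python
def report_on_change(samples, threshold=0.0):
    """Keep only readings that differ from the last reported value.
    A nonzero threshold adds deadband filtering on top, suppressing
    changes smaller than the threshold."""
    reported = []
    last = None
    for t, value in samples:
        if last is None or abs(value - last) > threshold:
            reported.append((t, value))
            last = value
    return reported

# A room sampled every 10 ms barely changes between samples:
# 1,000 readings at 21.5, one step to 22.0, then 500 more at 22.0.
samples = ([(i, 21.5) for i in range(1000)]
           + [(1000, 22.0)]
           + [(1001 + i, 22.0) for i in range(500)])

compact = report_on_change(samples)
print(len(samples), "->", len(compact))  # 1501 -> 2
```

The 1,501-point stream collapses to two records, the initial reading and the single change, with no information lost about when the temperature actually moved.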
As to determining the value of the data remaining after all of the above considerations, that's mostly determined by your data strategy. Keep in mind there are different kinds of value that can be attached to data.
Analytics is commonly an iterative process. You will pick different training datasets to create your analytic algorithms. Then, to avoid human bias in training, you refine the features of the training sets and the weights of those features. Potentially, you can also add different data sources or fields to the mix (features) to enrich what you have and refine your training sets based on the results you get until you're happy with the result.
Proprietary data also has unique and contextual value
"This is not to mean trade secrets and intellectual property (which is often proprietary but seldom really data), but rather, data where the company is the only organization that has it, or it has added enough value to make it a unique business asset," writes Thomas Davenport, a senior adviser to Deloitte's AI practice, in a Harvard Business Review post.
"Proprietary data can be big or small, structured or unstructured, raw or refined. What's important is that it is not easily replicated by another entity. That's what makes it a powerful means of achieving offensive value from data management," Davenport says.
While it's important to sort data carefully, there is such a thing as fear hoarding, which should be avoided.
Fear of dumping or losing data that may prove useful down the road is common and, in many business contexts, generally unfounded. In others, such as medical research and fraud detection, it is well justified. Context is everything.
That concern is "rather far-fetched" for Miranda Yan, co-founder of VinPit, a vehicle identification number and license number search platform. Since the data passes through an algorithm, you can still save the outputs, which are the important bits anyway, she says. For example, filters could be location-based, minimizing extraneous information. Remember that much unfiltered data is of little value and a good bit of filtered data ages out. And in machine learning, users don't "choose" the weights or parameters. They may influence them, but ultimately, they are learned by the software.
Know your data
The key to understanding your backhaul requirements and finding the right way to minimize your bandwidth needs is to have a detailed understanding of the data being generated. While it is unlikely that such knowledge is available at the time your business starts collecting the data, ongoing analysis of the data and how it is being used will enable your company to fine-tune its business requirements.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.