How swarm learning provides data insights while protecting data sovereignty
AI is often hailed as a panacea, but in some industries it also presents a paradox.
Consider medical research. The data fed into an AI model is drawn from the vital signs, illnesses, genetic data, and other specifics of thousands of individuals. The more data, the better the model. In fact, some AI models might be fed every known detail of thousands of patients' medical histories, essential information for helping the algorithm find patterns or relationships that might point the way to a treatment.
But this presents a problem: all that data cannot, by law, be shared. It must remain within the closed system of a single model, which means other researchers cannot use it or build upon it. In recent years, that inability to share information has kept the field from advancing as quickly as it could.
Recently, a solution has been presented: swarm learning. "With swarm learning, raw data never leaves the boundary of the data possessor and his location," says Dr. Eng Lim Goh, senior vice president and chief technology officer for artificial intelligence at Hewlett Packard Enterprise. "Only insights from that data are shared between the nodes. The machine learning method is applied locally at the data source."
An article published in Nature in May 2021 details the concept of swarm learning, whose stated goal is "to facilitate the integration of any medical data from any data owner worldwide without violating privacy laws." The approach combines technologies such as edge computing and blockchain to process data without centralized coordination, preserving the confidentiality of the underlying information and enabling collaboration that wasn't previously possible.
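To make the idea concrete, the sketch below shows in Python what "only insights leave the node" can look like in practice. This is not the HPE Swarm Learning software or the Nature authors' code; the SwarmNode class, its simple logistic-regression training loop, and every name in it are assumptions made purely for illustration.

```python
import numpy as np

class SwarmNode:
    """Hypothetical swarm participant: raw records stay inside the object;
    only model parameters ("insights") ever cross its boundary."""

    def __init__(self, features: np.ndarray, labels: np.ndarray):
        self._features = features                     # private clinical data, never shared
        self._labels = labels
        self.weights = np.zeros(features.shape[1])    # the shareable "insight"

    def train_locally(self, lr: float = 0.1, epochs: int = 50) -> np.ndarray:
        """Run logistic-regression gradient descent on local data only."""
        for _ in range(epochs):
            preds = 1.0 / (1.0 + np.exp(-self._features @ self.weights))
            grad = self._features.T @ (preds - self._labels) / len(self._labels)
            self.weights -= lr * grad
        return self.weights.copy()                    # parameters go out, patient records do not
```

The key property is that a caller of train_locally gets back a weight vector, never the records the node was trained on.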
How to share, privately
It's the decentralized nature of these medical databases that presents the biggest challenge to collaboration. As the Nature report notes, "Medicine is inherently decentral," so there is no way to create a database with the necessary critical mass for an AI model unless information is gathered up from thousands of medical providers. At that point, questions of privacy, ownership, and transferability become inevitable—and difficult to solve.
"The beauty of swarm learning is that there is no central node which aggregates the data," says Goh. "The swarm network acts as a union by sharing insights directly with all participants of the respective learning. There is no central custodian collecting all learnings or insights."
In a typical cloud-based machine learning environment, all data and processing are handled on a public cloud service like Amazon Web Services or Azure. Swarm learning pushes everything to the edge and shares only the algorithms and parameters of the AI model with other nodes, each of which builds its own independent AI model. After each training iteration, the nodes exchange what they have learned via blockchain to produce merged parameters for these independent models, and the process repeats until the model is considered complete.
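The round structure described here, local training followed by a parameter merge that every node adopts, can be sketched as follows. This continues the hypothetical SwarmNode above and deliberately omits the blockchain machinery: the swarm_training_round function, the plain averaging used as the merge rule, and the toy two-node run are assumptions for illustration, not the actual Swarm Learning protocol.

```python
import numpy as np
# SwarmNode comes from the earlier sketch; the merge rule (a plain average)
# and the absence of any blockchain ledger are simplifications.

def swarm_training_round(nodes, epochs_per_round=10):
    """One swarm iteration: every node trains on its own data, the parameter
    sets are merged, and every node adopts the merged model. Only weights
    move between participants; raw data never does."""
    local_weights = [node.train_locally(epochs=epochs_per_round) for node in nodes]
    merged = np.mean(local_weights, axis=0)       # stand-in for the coordinated merge step
    for node in nodes:
        node.weights = merged.copy()              # the next round starts from the shared model
    return merged

# Toy run: two "hospitals" with private data sets, three merge rounds.
rng = np.random.default_rng(0)

def make_node(n=200):
    X = rng.normal(size=(n, 3))
    y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)
    return SwarmNode(X, y)

nodes = [make_node(), make_node()]
for _ in range(3):
    merged = swarm_training_round(nodes)
print("merged parameters after three rounds:", merged)
```

In the system the article describes, that merge step is coordinated over a blockchain rather than by a central server, which is what removes the need for a central custodian of the data or the learnings.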
Those nosy nation-states
Swarm learning is just one of the most visible examples of emerging technologies that preserve data sovereignty, an increasingly important concept at a time when localized privacy laws are growing more complex. GDPR was just the beginning of this trend. Law firm Morrison Foerster notes that as of January 2021, 133 jurisdictions around the world had localized privacy laws, 60 of which were enacted in the past 10 years, with more on the way. Doing business across national borders, and even across U.S. state lines, means that data sovereignty rules, which subject data to the regulations of the region in which it is collected, must be followed with even greater care. Swarm learning offers a way to end-run these complexities by ensuring data never has to leave the location where it is collected.
One of the most notable solutions (and arguably the most mature) to emerge in this arena is Gaia-X, a project initiated in the European Union with the goal of developing a decentralized data-sharing system that will protect the privacy and sovereignty of member data. Gaia-X is also designed with improved efficiency and security in mind; data that doesn't have to travel back and forth to the cloud is less expensive to process and inherently safer. More than 230 organizations, including HPE, have joined the project to date.
"Gaia-X is really tackling two things," says Nicholas McQuire, chief of research at CCS Insight. "For starters, it's tackling infrastructural requirements for European organizations so that they are not beholden to U.S. cloud companies. And more importantly, it's setting requirements for European organizations around data sharing."
McQuire says much of the effort, and the reason that various European governments are directly involved in Gaia-X, stems from the CLOUD Act, a U.S. law enacted in 2018 that requires domestic technology companies to provide data stored on their servers in response to a subpoena, even if the data is physically housed overseas. U.S.-based hyperscalers dominate the global market for cloud services (together they command a 58 percent share of worldwide cloud infrastructure spending), and most foreign entities rely on Amazon, Microsoft, and Google for at least some of their data storage and processing needs. If the U.S. government can easily gain access to foreign data, even when it's stored on Europe-based servers, that presents a thorny political problem. The solution is to detach this data from the cloud: "Gaia-X is like a virtual, decentralized hyperscaler," says Patrik Edlund, head of communication for Germany, Austria, and Switzerland at HPE. "It's the next generation of cloud platforms."
Collaboration amidst competition
The goal of Gaia-X isn't just to prevent data from being accessed by the U.S. government. It's also in large part about improving collaboration at the enterprise level, allowing competitors to work together without having to give their secrets to one another. Edlund points to the example of finding a parking space from behind the wheel. Numerous off-street parking spaces go unused because drivers don't know about them. Modern cars are full of sensors that can detect, among other things, vacant parking spots and share that information. Car manufacturers want to improve their drivers' experience, and it would be optimal if they all shared data on available spaces. Gaia-X is one way to do that.
"German car manufacturers are selling cars in France, for example, so they want their customers to be able to find a parking lot in Paris," says Edlund. "They won't be able to do that if they only have data on their own fleet. They need all the data from Peugeot, from Citroën, and so on."
Another enterprise example involves machine maintenance. Imagine a manufacturer that sells machines to thousands of customers across the world. The data generated by those machines could be used by customers to perform preventive maintenance, but those customers are unlikely to want to share their own production data with one another, as they may very well be competitors. Gaia-X provides a way for these companies to develop a model that incorporates competitive production data without having to share it publicly, so everyone can benefit from it. The Gaia-X project describes additional use cases on its website.
While the first Gaia-X solutions are currently in the works, with certification expected by the end of the year, the framework's future is far from settled. "It's still up in the air about what it will ultimately become," says McQuire. He notes it's especially unclear whether Gaia-X will build out its own native infrastructure or whether EU-based providers will get involved in some way. Similar projects, including the International Data Spaces initiative and the auto-centric Catena-X, are also emerging as options in this space.
One way or another, some of these new frameworks are likely to find a significant foothold in a variety of markets.
"The world is going to be decentralized," says Edlund, "so the majority of data will be created at the edge. That's the thing we have to solve in the future, because in order to get the value out of that data, you must share it somehow."
Lessons for leaders
- Restrictions on sharing data across jurisdictions don't necessarily mean you can't perform collaborative research on that data.
- Commercial applications exist in which data from different jurisdictions, from pooled parking data to machine telemetry, can be put to use without encountering privacy problems.
- The decentralized, edge-based model found in Gaia-X may be "the next generation of cloud platforms."
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.