The best ways to manage analytics in the data-driven enterprise
On March 13, 2020, the U.S. government declared COVID-19 a national emergency. Reacting quickly, the state of California became the first to issue a stay-at-home order. And within days after that, life and work as most of the world had known it ceased to exist as lockdown orders spread across the globe.
But even before the world shut down, a team of data scientists working with healthcare researchers at Johns Hopkins University in Baltimore was already tracking the novel coronavirus from its earliest cases in Wuhan, China, using advanced data analytics to project the spread of the disease across the U.S. and around the world. Such analytics workloads would guide the efforts to contain the disease and the distribution of the vaccines that will eventually end it.
Let's define analytics
Analytics is the heart of the data-driven enterprise. It is universally accepted as the best way for organizations to gain a competitive advantage by unlocking insights from their new and stored data. Yet what's needed to create and support analytics workloads is not as widely understood. For any enterprise seeking to be truly data driven, solving that puzzle is a major contributor to success.
In part, it's a matter of definition. Traditional data workloads tend to be short and transactional in nature and designed to deliver predictable results. A good example is an enterprise billing or order system. It may involve large quantities of data, but it is not designed to make sense of that data in any significant way.
Analytics workloads are exactly the opposite. They are designed not just to process but also to make sense of what is generally massive amounts of data. Analytics workloads approach the data holistically. They treat it multidimensionally, slicing and dicing it in different ways to uncover insights hidden within.
Analytics workloads are characterized by the variety and complexity of the data they process, the speed at which they process that data, and the unpredictability of the results.
Examples range from classic A/B analysis of test results and anomaly detection for fraud to, in the specific example cited from the current pandemic, projecting the global spread of the coronavirus to calculate the optimal production and geographic distribution of vaccines.
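As a concrete illustration of the first of those examples, a two-proportion z-test is one common way to decide whether an A/B test's variant outperformed the control. The sketch below uses only the Python standard library; the conversion counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is variant B's conversion rate
    significantly different from variant A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return z, p_value

# Hypothetical campaign: 120/2400 conversions on A vs. 165/2400 on B.
z, p = ab_test_z(conv_a=120, n_a=2400, conv_b=165, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

At enterprise scale the same statistic would be computed over millions of events, but the arithmetic is identical.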
Please read: The top tools data scientists use
Analytics workloads put more demands on enterprise data than traditional data processing systems were ever designed to meet. They require a new kind of architecture that provides users with the capability to explore and model information in multiple dimensions and in real time.
So, what's required to build, support, and ensure smooth performance for analytics workloads? For enterprises planning to make analytics a major part of their IT infrastructure, here are some best practices to consider.
Build in data redundancy and resiliency
Proper data storage is fundamental to successful analytical workloads. Central to planning and configuring that storage is building in redundancy to ensure that the analytics workload is never interrupted.
It's critical to ensure a fail-safe so that in the event one storage disk, volume, or data cluster involved with the analytics workload goes down, a replacement disk, volume, or cluster immediately spins up to take its place.
A best practice early adopters recommend is to build a data storage pool that balances the performance needs of analytics workloads with their disaster recovery needs. If the unexpected happens, that second, redundant storage system takes over automatically, allowing the workload to continue without missing a beat.
Experts recommend building a data storage system that replicates the data three times, with each copy sitting on a different node in the storage network. That way, even a failure that affects two of the copies leaves one copy available to the analytics workload.
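A toy model of that three-way replication scheme, in Python for illustration only (the node names and the simple round-robin placement are assumptions, not any storage product's API; real systems use rack-aware placement policies):

```python
import itertools

class ReplicatedStore:
    """Toy model of 3-way replication: every block is written to three
    different storage nodes, so it survives two node failures."""
    REPLICATION_FACTOR = 3

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self._ring = itertools.cycle(node_names)   # naive round-robin placement

    def put(self, key, value):
        # Place one replica on each of three consecutive nodes in the ring.
        for _ in range(self.REPLICATION_FACTOR):
            self.nodes[next(self._ring)][key] = value

    def get(self, key, failed=()):
        # Read from any surviving node that holds a replica.
        for name, blocks in self.nodes.items():
            if name not in failed and key in blocks:
                return blocks[key]
        raise KeyError(key)

store = ReplicatedStore(["node-1", "node-2", "node-3", "node-4"])
store.put("block-0", b"payload")
# Two nodes down, the block is still readable from the third replica.
print(store.get("block-0", failed={"node-1", "node-2"}))
```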
Deploy container technologies
Data to be used in analytics workloads in the modern enterprise can reside just about anywhere. Storing, finding, and retrieving that data—along with the potentially massive computational resources and flexibility needed for analytics—would overwhelm traditional data networking systems.
Container technologies offer the enterprise a better way. Already integral to DevOps, cloud computing, modern applications, and microservices, containers are fast becoming indispensable to analytics as well.
Please read: Data scientists take the mystery out of data fabrics
A best practice shared by early enterprise analytics leaders is to build analytics workloads with containers as a foundational layer of the data platform architecture. Among many advantages, containers provide the flexibility to scale a workload up or down quickly in response to changes or new insights uncovered in the data, as well as to move workloads quickly between systems.
Containers allow enterprises to turn on a dime and play with the data in completely different ways without having to reinvent the wheel every time. They vastly simplify and streamline management and operations, saving enterprises significant time and money while offering users a single access point and dashboard.
The container orchestration technology Kubernetes plays a significant role in analytics workloads as well. Kubernetes provides the flexibility to combine or move containers across different data sources without having to move the massive volumes of data those sources contain. In other words, Kubernetes decouples the data from the compute half of the analytics workload.
Using Kubernetes to build the architecture to support analytics workloads is a best practice that enables faster and more efficient data processing while saving the enterprise significant time and money. With Kubernetes, the data stays in the same place. The analytics workload moves to it like a bee to a flower.
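The "compute moves to the data" idea can be sketched as a toy data-locality scheduler, in the spirit of Kubernetes node affinity. Everything here is invented for illustration (node names, load figures, dataset placement); real Kubernetes expresses these preferences declaratively in pod specs, not in application code.

```python
def schedule(pod, nodes, data_location):
    """Toy data-locality scheduler: prefer the node that already holds
    the pod's dataset, so compute moves to the data, not vice versa."""
    local = [n for n in nodes if n["name"] == data_location.get(pod["dataset"])]
    candidates = local or nodes                 # fall back to any node
    return min(candidates, key=lambda n: n["load"])["name"]

nodes = [{"name": "node-a", "load": 0.7},
         {"name": "node-b", "load": 0.2},
         {"name": "node-c", "load": 0.5}]
data_location = {"clickstream": "node-c"}       # which node hosts which dataset

# The pod lands on node-c, where its data already lives, even though
# node-b is less loaded; a pod with no local data goes to node-b.
print(schedule({"dataset": "clickstream"}, nodes, data_location))
```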
Speed is of the essence
Speed is essential to the collection, storage, and retrieval of the data needed for analytics workloads. The volume of the data, the diversity of its sources—including real-time information from the cloud and edge devices—the complexity of the analysis, and other factors demand a 1 Gbps network speed at a minimum.
When it comes to speed with analytics workloads, however, one size does not fit all. How much speed is needed depends on what kinds of storage media the enterprise is using and the kinds of analytics being performed.
While 1 Gbps works well if the analytics workload primarily involves data collected from batch processing sources, it is quickly saturated by larger and more complex analytics that draw on real-time sources and projections. A best practice suggested by leaders in analytics technology is to plan for speeds of at least 2 Gbps.
Please read: How data and AI will shape the post-pandemic future
It is important to architect for future requirements and not just for an initial use case. Remember, the amount of data that must be processed keeps increasing, as does the performance capacity of data processing systems. A modern data platform running at speeds greater than 2 Gbps will easily accommodate the storage file and data transfer speeds of multiple analytical workloads and scale with them as the enterprise grows.
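A back-of-the-envelope calculation shows why that headroom matters. The 500 GB dataset size and the 80% link-efficiency factor below are illustrative assumptions, not benchmarks:

```python
def transfer_seconds(size_gb, link_gbps, efficiency=0.8):
    """Rough wall-clock time to move size_gb of data over a network link,
    derating the nominal rate by an assumed protocol-overhead efficiency."""
    bits = size_gb * 8 * 1e9                    # decimal GB -> bits
    return bits / (link_gbps * 1e9 * efficiency)

for gbps in (1, 2, 10):
    minutes = transfer_seconds(500, gbps) / 60
    print(f"{gbps:>2} Gbps link: ~{minutes:.0f} min to move 500 GB")
```

Under these assumptions, moving 500 GB takes well over an hour at 1 Gbps; each doubling of link speed halves that window, which is what makes practices like volume mirroring feasible.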
Building in higher data transfer speeds also makes data volume mirroring possible. Volume mirroring is almost exactly what it sounds like: It is a mirror image of the data captured in real time.
A best practice for disaster planning and recovery is to create identical analytics workloads, each one a direct mirror image or copy of the other, in two different geographical locations. One of these workloads is the production workload; its mirror twin is the disaster recovery workload. At regular, preprogrammed intervals, the enterprise replicates (mirrors) the data in the production workload and automatically transfers it to the disaster recovery workload using volume mirroring.
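That replicate-and-transfer cycle can be modeled in a few lines of Python. This is a toy sketch of the idea, not any vendor's mirroring API; in practice, mirroring runs on a schedule and ships block-level deltas rather than full snapshots.

```python
import copy
import hashlib
import json

class MirroredVolume:
    """Toy model of volume mirroring: the production volume's state is
    periodically snapshotted and shipped to a replica in another region."""

    def __init__(self):
        self.production = {}    # live data, written by the workload
        self.mirror = {}        # disaster-recovery copy in another region

    def write(self, key, value):
        self.production[key] = value

    def replicate(self):
        # Copy the full snapshot for clarity; real systems ship deltas.
        self.mirror = copy.deepcopy(self.production)

    def in_sync(self):
        # Compare content digests of the two volumes.
        def digest(d):
            return hashlib.sha256(
                json.dumps(d, sort_keys=True).encode()).hexdigest()
        return digest(self.production) == digest(self.mirror)

vol = MirroredVolume()
vol.write("orders/2020-03-13", {"count": 42})
vol.replicate()                 # scheduled mirror cycle
print(vol.in_sync())
```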
Analytics and the data-driven enterprise
All enterprises are on the journey to becoming data-driven, digital enterprises. Key to that goal is a modern data platform that incorporates analytics as a driver of the insights that result in true business value and competitive advantage. Building the architecture that unleashes the full power and capabilities of analytical workloads is central to that end state. Incorporating these best practices will help enterprises get there faster and more efficiently with less pain and more gain.
Lessons for leaders
- Data redundancy and resilience will ensure that analytics have fast, uninterrupted access to data.
- Modern analytics require high-speed networking of at least 2 Gbps.
- Container design, orchestrated with Kubernetes, is the best-practice architecture for analytics applications.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.