Understanding the SMACK stack for big data
Just as LAMP made it easy to create server applications, SMACK is making it simple (or at least simpler) to build big data programs. SMACK's role is to provide access to big data as fast as possible. In other words, developers can create big data applications without reinventing the wheel.
We don't discuss the LAMP stack much anymore. Once a buzzword for the technology underlying server and web hosting projects, LAMP (Linux, Apache, MySQL, and PHP/Python/Perl) was shorthand for the pieces used to build online infrastructure. The details may change (MariaDB in place of MySQL, Nginx for Apache, and so on), but the fundamental infrastructure doesn't.
It had a major impact. LAMP, with its combination of operating system, web front end, transactional data store, and server-side programming, enabled the birth of Web 2.0. Nowadays, LAMP doesn’t get a lot of dedicated attention because it’s taken for granted.
The premise was, and still is, a good one. A well-known set of technologies designed to easily integrate with one another can be a reliable starting point for creating larger, complex applications. While each component is powerful in its own right, together they become more so.
And thus today, Spark, Mesos, Akka, Cassandra, and Kafka (SMACK) have become the foundation for big data applications.
Among the technology influences driving SMACK adoption is the demand for real-time big data analysis. Apache Hadoop architectures, usually including Hadoop Distributed File System, MapReduce, and YARN, work well for batch or offline jobs, where data is captured and processed periodically, but they're inadequate for real-time analysis.
SMACK is a registered trademark of By the Bay, but the code of its components is open source software.
Most community tech initiatives begin with a pioneer and a lead innovator. In 2014, Apple engineer Helena Edelson wrote KillrWeather to show how easy it would be to integrate big data streaming and processing into a single pipeline. Edelson’s efforts got the attention of other San Francisco big data developers, some of whom organized tech conferences.
This quickly transformed into a movement. The programmers behind each component met in 2015 at a pair of West Coast developer conferences, where they defined the SMACK stack by doing and teaching. Among the interested parties was Mesosphere, a container and big data company, which has certainly contributed to popularizing SMACK.
Immediately after those conferences, Mesosphere announced its Mesosphere Infinity product. This pulled together the SMACK stack programs into a whole, with the aid of Cisco.
Mesosphere Infinity's purpose was to create "an ideal environment for handling all sorts of data processing needs—from nightly batch-processing tasks to real-time ingestion of sensor data, and from business intelligence to hard-core data science."
The SMACK stack quickly gained in popularity. It's currently employed in multiple big data pipeline architectures for stream processing.
As with LAMP, a developer or system administrator is not wedded to SMACK's main programs. You can replace individual components, just as some original LAMP users swapped out MySQL for MariaDB or Perl for Python. For instance, a SMACK developer can replace Mesos as the cluster scheduler with Apache YARN or use Apache Flink for batch and stream processing instead of Spark. But, as with LAMP, it’s a useful starting point for process and documentation as well as predictable toolsets.
Here are SMACK's basic pieces:
Apache Mesos is SMACK's foundation. Mesos, a distributed systems kernel, abstracts CPU, memory, storage, and other computational resources away from physical or virtual machines. On Mesos, you build fault-tolerant and elastic distributed systems. Mesos runs applications within its cluster. It also provides a highly available platform. In the event of a system failure, Mesos relocates applications to different cluster nodes.
This Mesos kernel provides the SMACK applications (and other big data applications, such as Hadoop) with the APIs they need for resource management and scheduling across data center, cloud, and container platforms. While many SMACK implementations use Mesosphere's Mesos Data Center Operating System (DC/OS) distribution, SMACK works with any version of Mesos or, with some elbow grease, other distributed systems.
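To make the idea of pooled resources and automatic relocation concrete, here is a minimal Python sketch. It is a toy model and not the Mesos API: the `Node`, `Task`, and `ToyCluster` names and the first-fit placement rule are assumptions made purely for illustration. The node sizes borrow the two-CPU, 32 GB figure used later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float


@dataclass
class Node:
    name: str
    cpus: float
    mem_gb: float
    alive: bool = True
    tasks: list = field(default_factory=list)

    def free_cpus(self):
        return self.cpus - sum(t.cpus for t in self.tasks)

    def free_mem_gb(self):
        return self.mem_gb - sum(t.mem_gb for t in self.tasks)


class ToyCluster:
    """Hypothetical stand-in for a cluster manager; not the Mesos API."""

    def __init__(self, nodes):
        self.nodes = nodes

    def launch(self, task):
        # First-fit placement: use the first live node with enough spare capacity.
        for node in self.nodes:
            if (node.alive
                    and node.free_cpus() >= task.cpus
                    and node.free_mem_gb() >= task.mem_gb):
                node.tasks.append(task)
                return node.name
        raise RuntimeError(f"no capacity for {task.name}")

    def fail(self, node_name):
        # Mark the node dead and relocate its tasks to the surviving nodes.
        node = next(n for n in self.nodes if n.name == node_name)
        node.alive = False
        orphans, node.tasks = node.tasks, []
        return {t.name: self.launch(t) for t in orphans}


cluster = ToyCluster([Node("node-1", 2, 32), Node("node-2", 2, 32), Node("node-3", 2, 32)])
placements = {name: cluster.launch(Task(name, 1, 8))
              for name in ("cassandra", "kafka", "spark")}
moved = cluster.fail("node-1")  # tasks from node-1 are placed elsewhere
```

The applications ask only for CPU and memory; which machine they land on, and where they go after a failure, is the cluster's decision rather than theirs.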
Next on the stack is Akka. Akka both brings data into a SMACK stack and sends it out to end-user applications.
The Akka toolkit aims to help developers build highly concurrent, distributed, and resilient message-driven applications for Java and Scala. It uses the actor model as its abstraction level to provide a platform to build scalable, resilient, and responsive applications.
The actor model is a conceptual model for working with concurrent computation. It defines general rules for how the system’s components should behave and interact. The best-known language using this abstraction is Erlang.
With Akka, all interactions work in a distributed environment; its actors interact purely by passing messages to one another asynchronously.
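The actor idea can be sketched in a few lines of plain Python. This is not Akka (which targets Java and Scala); the `CounterActor` class and its message names are hypothetical and exist only to show the pattern: private state, a mailbox, and asynchronous message passing instead of shared memory.

```python
import queue
import threading


class CounterActor:
    """A toy actor: private state, a mailbox, and one thread that drains it."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message, reply_to=None):
        # Fire-and-forget send: the caller never waits for the actor's work.
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are handled one at a time, so no locks are needed.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)
            elif message == "stop":
                break

    def join(self):
        self._thread.join()


def send_increments(actor, n):
    for _ in range(n):
        actor.tell("increment")


actor = CounterActor()
senders = [threading.Thread(target=send_increments, args=(actor, 100)) for _ in range(4)]
for s in senders:
    s.start()
for s in senders:
    s.join()

replies = queue.Queue()
actor.tell("get", reply_to=replies)
total = replies.get(timeout=5)
actor.tell("stop")
actor.join()
```

Four threads send messages concurrently, yet the counter is never corrupted, because only the actor's own thread ever touches its state.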
Apache Kafka is a distributed, partitioned, replicated commit log service. In SMACK, Kafka serves to provide messaging system functionality.
In a larger sense, Kafka decouples data pipelines and organizes data streams. With Kafka, data messages are byte arrays, which you can use to store objects in many formats, such as Apache Avro, JSON, and String. Kafka treats each set of data messages as a log—that is, an ordered set of messages. SMACK uses Kafka as a messaging system between its other programs.
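As a rough illustration of the log abstraction, the following Python sketch models a partitioned, append-only log whose messages are byte arrays. It is an in-memory toy, not the Kafka client API: the `ToyLog` class, its methods, and the CRC32-based partition choice are assumptions for illustration only.

```python
import json
import zlib


class ToyLog:
    """In-memory stand-in for a partitioned commit log; not the Kafka client API."""

    def __init__(self, partitions=3):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key: bytes, value: bytes):
        # Messages with the same key land in the same partition, preserving their order.
        p = zlib.crc32(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition: int, offset: int):
        # Reading never removes anything; each consumer tracks its own offset.
        return self.partitions[partition][offset:]


log = ToyLog()
for temp in (20.5, 21.0, 21.4):
    # The log only sees bytes; here they happen to be UTF-8 encoded JSON.
    payload = json.dumps({"sensor": "s1", "temp_c": temp}).encode("utf-8")
    partition, last_offset = log.append(b"s1", payload)

# Two independent consumers read the same partition from different offsets.
storage_view = [json.loads(m) for m in log.read(partition, 0)]
analytics_view = [json.loads(m) for m in log.read(partition, 2)]
```

Because consumers only read from an offset and never delete, a slow consumer and a fast one can share the same stream without coordinating, which is what decouples the stages of the pipeline.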
In SMACK, data is kept in Apache Cassandra, a well-known distributed NoSQL database that manages large amounts of structured data across multiple servers and underpins many high-availability applications. Cassandra can handle huge quantities of data across multiple storage devices and vast numbers of concurrent users and operations per second.
The job of actually analyzing the data goes to Apache Spark. This fast and general-purpose big data processing engine enables you to combine SQL, streaming, and complex analytics. It also provides high-level APIs for Java, Scala, Python, and R, along with an optimized engine that supports general execution graphs.
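One common way to spread data across servers while keeping it available is to hash each key onto a ring of nodes and store several copies. The Python sketch below illustrates that general idea only. Cassandra's actual partitioners and replication strategies are more involved, and the `ToyRing` class, the SHA-256 hash, and the node names here are assumptions for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    # Assumption: SHA-256 is used here only as a convenient, stable hash.
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ToyRing:
    """Hypothetical hash ring; a simplification, not Cassandra's implementation."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key: str):
        # Find the key's position on the ring, then take the next `rf` nodes clockwise.
        start = bisect.bisect_right(self.tokens, token(partition_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = ToyRing(nodes)
owners = ring.replicas("sensor-42")  # three distinct nodes hold this key
```

With three copies on three different nodes, losing any single server still leaves two replicas able to answer reads and accept writes for that key.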
Running through the SMACK pipeline
The smart bit, of course, is how all those pieces form a big data pipeline. There are many ways to install a SMACK stack using your choice of clouds, Linux distributions, and DevOps tools. Follow along with me as I create one to illustrate the process.
I start my SMACK stack by setting up a Mesos-based cluster. For SMACK, you need a minimum of three nodes, with two CPUs each and 32 GB of RAM. You can set this up on most clouds using any supported Linux distribution.
Next, I set up the Cassandra database from within Mesos or a Mesos distribution such as DC/OS.
That done, I set up Kafka inside Mesos.
Then I get Spark up and running in cluster mode. This way, when a task requires Spark, Spark instances are automatically spun up on available resources.
That's the basic framework.
But wait—the purpose here is to process data! That means I need to get data into the stack. For that, I install Akka. This program reads in data—data ingestion—from the chosen data sources.
As the data comes in from the outside world, Akka passes it on to Kafka. Kafka, in turn, streams the data to Akka, Spark, and Cassandra. Cassandra stores the data, while Spark analyzes it. All the while, Mesos is orchestrating all the components and managing system requirements. Once the data is stored and analyzed, you can query it, using Spark for further analysis with the Spark Cassandra Connector. You can then use Akka to move the data and analytic results from Cassandra to the end user.
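The flow just described can be traced with a small Python sketch. Nothing here uses the real components: each function and data structure is a hypothetical stand-in for one role in the pipeline, and the sensor readings are invented sample data.

```python
from collections import defaultdict
from statistics import mean

# Each structure below is a stand-in for one SMACK component's role.
kafka_topic = []                     # Kafka's role: an ordered stream of messages
cassandra_table = defaultdict(list)  # Cassandra's role: storage keyed by sensor


def ingest(readings):
    # Akka's role on the way in: accept outside data and hand it to the stream.
    for reading in readings:
        kafka_topic.append(reading)


def store(messages):
    # Cassandra's role: keep every message that arrives from the stream.
    for sensor, value in messages:
        cassandra_table[sensor].append(value)


def analyze(table):
    # Spark's role: compute over the stored data.
    return {sensor: mean(values) for sensor, values in table.items()}


def serve(results, sensor):
    # Akka's role on the way out: deliver results to an end-user application.
    return {"sensor": sensor, "mean": results[sensor]}


ingest([("s1", 20.0), ("s2", 30.0), ("s1", 22.0)])
store(kafka_topic)
results = analyze(cassandra_table)
response = serve(results, "s1")
```

In a real deployment each of these steps runs as its own distributed service, with Mesos deciding where they execute; the sketch only shows the order in which data changes hands.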
This is just an overview. For a more in-depth example, see The SMACK stack – hands on!
Who needs SMACK
Before you start to build a SMACK stack, ask whether it is the right tool.
The first question to ask is whether you need big data analysis in real time. If you don't, Hadoop-based batch approaches can serve you well. As Patrick McFadin, chief evangelist for Apache Cassandra at DataStax, explains in an interview, "Hadoop fits in the 'slow data' space, where the size, scope, and completeness of the data you are looking at is more important than the speed of the response. For example, a data lake consisting of large amounts of stored data would fall under this."
How much faster than Hadoop is SMACK's analysis engine, Spark? According to Natalino Busa, head of data science at Teradata, "Spark's multistage in-memory primitives provides performance up to 100 times faster for certain applications." Busa argues that by allowing user programs to load data into a cluster's memory and query it repeatedly, Spark works well with machine learning algorithms.
But when you do need fast big data, SMACK can deliver great performance. Achim Nierbeck, a senior IT consultant for Codecentric AG, explains, "Our requirements contained the ability to process approximately 130,000 messages per second. Those messages needed to be stored in a Cassandra and also be accessible via a front end for real-time visualization." With 15 Cassandra nodes on a fast Amazon Web Services-based Mesos cluster, Nierbeck says, "processing 520K [messages per second] was easily achieved."
Another major business win is that SMACK enables you to get the most from your hardware. As McFadin says, Mesos capabilities “allow potentially conflicting workloads to act in isolation from each other, ensuring more efficient use of infrastructure.”
Finally, SMACK provides a complete, open source toolkit for addressing real-time big data problems. Like LAMP, it provides all the tools needed for developers to create applications without getting bogged down in the details of integrating a new stack.
Today, most people still don't know what SMACK is. Tomorrow, expect it to become a commonplace set of tools.
SMACK: Lessons for leaders
- SMACK enables your company to quickly create big data analysis applications.
- Once built, those applications let you pull insights speedily from your real-time data.
- And because SMACK is flexible and makes efficient use of your server resources, you can do all the above with minimal hardware costs.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.