The top tools data scientists use
I go on half-hour walks. After a walk, thanks to data collection tools in my smartwatch and smartphone, I can see that I took 4,227 steps, just over two miles, in 35 minutes and 14 seconds, and that I burned 191 calories along the way. And Google, Samsung, and my doctor (thanks to a connection to her electronic health record software) know it too. Like it or lump it, we live in the world of big data.
According to Statista, a business data portal, the world reached a new high of 59 zettabytes of data in 2020 alone. How much is a zettabyte, you ask? It's 10 to the 21st power bytes, or a trillion gigabytes. In other words, the Library of Congress's total data is the merest fraction of the data constantly pouring into the Internet.
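To make the unit arithmetic concrete, here is a minimal Python sketch of the conversion. It assumes decimal (SI) units, where a gigabyte is 10 to the 9th power bytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
ZETTABYTE = 10**21

# One zettabyte in gigabytes: 10**21 / 10**9 = 10**12, i.e. a trillion.
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE

# The 59 ZB figure quoted above, converted to gigabytes.
total_2020_gigabytes = 59 * gigabytes_per_zettabyte

print(f"1 ZB  = {gigabytes_per_zettabyte:,} GB")
print(f"59 ZB = {total_2020_gigabytes:,} GB")
```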
And that's only the beginning. According to IDC, the world's data is growing at a rate of 66% per year. It's only going to accelerate. Thanks to the Internet of Things and edge computing, more and more data is being collected.
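For a sense of what a 66% annual rate means, the sketch below works out the implied doubling time. It assumes, purely for illustration, that the rate stays constant from year to year; it is arithmetic on the quoted figure, not a forecast, and the function names are made up for this example.

```python
import math

ANNUAL_GROWTH = 0.66  # the 66% per year rate quoted above


def doubling_time_years(rate: float) -> float:
    """Years needed for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


def growth_factor(rate: float, years: float) -> float:
    """Multiplier accumulated after compounding `rate` for `years` years."""
    return (1 + rate) ** years


print(f"Doubling time: {doubling_time_years(ANNUAL_GROWTH):.2f} years")
print(f"Growth over 5 years: {growth_factor(ANNUAL_GROWTH, 5):.1f}x")
```

At that constant rate, the total would double roughly every year and four months.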
Don't believe it? Look around your house. How many smart devices do you see? The computer in front of you, the phone by your side, the watch on your wrist, and the smart speaker you're listening to? The Netflix TV show you'll watch later tonight, perhaps even the lights above you, and soon, the car you'll drive to the grocery store?
While much of that data quickly grows cold and is never used, data science (DS) is growing increasingly adept at using all data. Paired with machine learning (ML) and artificial intelligence (AI), data science is quickly turning once-obscure data into valuable knowledge.
It's doing this with a wide variety of languages and tools. As Iveta Lohovska, senior data scientist in the global AI and data practice at Hewlett Packard Enterprise, says, "The data science ecosystem is quite broad and changing every minute. Tools and languages can be grouped in many different ways and categories: open source, enterprise, industry verticals, the complexity of the tool or platform, BI platforms, DS collaboration environments, analytical databases, DS frameworks, etc. Some of them have overlaps, and some are very niche and focused on what they solve."
So, without further ado, let's take a quick survey of the hottest data science tools.
While some related fields of data science, such as ML, are clearly dominated by open source, others, usually older, started with proprietary approaches.
The oldest of these is SAS. This program, which dates back to 1976, was designed from the ground up for statistical analysis. It began as a closed source proprietary program for enterprises, and that's still the case today.
SAS uses the SAS programming language for statistical modeling. It's a software ecosystem in its own right. It comes with numerous statistical libraries and tools for modeling and organizing data. For all its advantages, SAS tends to be costly, and several important DS libraries and packages are not available in the base pack. Getting them costs extra.
On the other hand, SAS Viya, the latest version, can run on a scalable, cloud-native architecture, where it can use either open source or SAS models. The very newest edition, SAS Viya 4, has been completely rearchitected and refactored for cloud-native deployment and runs as Kubernetes-orchestrated, container-based microservices. Although Viya is available on Amazon Web Services, SAS has announced that Microsoft Azure is "first among equals" in its public cloud deployments.
Speaking of Microsoft, Excel has long been an important hands-on DS program, even though you'll never find it in a continuous integration/continuous delivery (CI/CD) pipeline. And it still is today.
Why? Because Excel, with its various built-in tables and filters, functionality that lets you create customized functions and formulas, and the ability to hook in easily with SQL queries, is a good option for creating data visualizations and spreadsheets. Besides using it to visualize data, many people still use Excel for data cleaning; its easy-to-use, familiar GUI makes it ideal for preprocessing data.
Less familiar, unless you're already deep into data science, is MATLAB. This is a proprietary numerical computing environment for processing mathematical information. It facilitates matrix functions, algorithmic implementation, and statistical data modeling.
Data scientists use it for neural networks, fuzzy logic, and other ML/AI approaches such as deep learning. It's also used for image and signal processing, and its graphics library lends itself to creating data visualizations. This makes it a versatile tool for data scientists, who can use it to tackle everything from data cleaning and analysis to more advanced deep learning algorithms.
It's not as deep, but if you're looking for an open source alternative to MATLAB's plotting features, you should look at Matplotlib. This is a Python-based plotting and visualization library. It can be used as an alternative to MATLAB's graphic modules. In addition, the popular pyplot module gives the library a MATLAB-like interface.
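As a rough illustration of that MATLAB-like interface, here is a minimal pyplot sketch. The step readings are made-up sample data, and the choice of the non-interactive "Agg" backend is an assumption so the example can run without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up illustrative readings: cumulative steps at five-minute intervals.
minutes = [0, 5, 10, 15, 20, 25, 30, 35]
steps = [0, 600, 1210, 1800, 2420, 3010, 3630, 4227]

# pyplot keeps track of a "current figure", much as MATLAB does.
plt.figure()
plt.plot(minutes, steps, "o-")  # MATLAB-style format string: circle markers, solid line
plt.xlabel("Minutes")
plt.ylabel("Cumulative steps")
plt.title("Steps during a walk")
plt.grid(True)

# Render to an in-memory PNG; pass a file name instead to write to disk.
buffer = io.BytesIO()
plt.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

The calls act on an implicit current figure rather than on explicit objects, which is what makes the style feel familiar to MATLAB users.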
Finally, Wolfram Mathematica must also be mentioned. This is a broad technical computing system, which incorporates many useful data science approaches such as neural networks, ML, and data visualizations. Its power comes not so much from any specific program but from how its many different tools can be deployed to work on data science questions using its Wolfram Language.
As powerful as all those proprietary tools are, the newer open source family of data science-related programs is giving them a run for their money.
Near the top of that list has to be the R programming language. R is an open source language and environment for statistical computing and graphics. It's been called the lingua franca of data science because it lends itself so well to statistical and data modeling.
The R language is not very friendly to new users, although it's not as difficult to master as Wolfram or SAS. Fortunately, there's RStudio, an integrated development environment for R and Python. It comes with a console and a syntax-highlighting editor that supports direct code execution. It also includes tools for plotting, history, debugging, and workspace management.
However, R doesn't just look back to older technologies for support. It also works with Project Jupyter. Jupyter is an IPython-related open source tool that is often used for presenting data science results in live code, visualizations, and presentations. Jupyter Notebooks can also be used for data cleaning, statistical computation, and visualization, and to create predictive machine learning models.
Beyond base R, there's also ggplot2. This is an advanced R data visualization package. Ggplot2 is part of the tidyverse, a collection of R packages meant expressly for data science. It offers an alternative to R's native graphics package, making it easier to create useful visualizations from analyzed data.
Of course, Hadoop is also vital for today's data scientists. Hadoop is an open source framework for the distributed processing of large datasets across clusters of computers using simple programming models. It scales up from single servers to thousands of machines, each offering local computation and storage.
While recent surveys show that not all data scientists love Hadoop—some find it too slow—there's also no doubt that Hadoop is still important. Many data science projects use Hadoop to store data. After all, once the data is in Hadoop, you can ask questions of it regardless of the dataset's schema.
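The best-known of Hadoop's "simple programming models" is MapReduce. The sketch below imitates its map, shuffle, and reduce phases with a word count in plain Python. It runs in a single process and does not use any Hadoop API; the function names and sample records are hypothetical, chosen only to show the shape of the model.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Made-up sample input; on a real cluster each record could live on a different machine.
records = ["big data big tools", "data science tools"]
counts = word_count(records)
print(counts)
```

Because each map call looks at one record and each reduce call looks at one key, the work can be spread across many machines, which is the property that lets the model scale.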
A related program is Apache Spark. This is a unified computing engine and set of libraries for parallel data processing on computer clusters. It is used for managing and running big data queries. It supports multiple widely used programming languages, such as Python, Java, Scala, and R, and includes SQL and ML libraries. Like Hadoop, it scales up easily from single servers to massive clusters.
TensorFlow is essential for anyone working on AI/ML projects. TensorFlow has become the open source ML software stack. Its ecosystem of tools, libraries, and community resources is widely used for advanced machine learning algorithms such as deep learning. TensorFlow can run on CPUs, GPUs, and Tensor Processing Units (TPUs). The last is an AI accelerator application-specific integrated circuit (ASIC).
It's no exaggeration to say this Python-friendly open source set of programs defines modern ML development. In no small part, that's because TensorFlow is an end-to-end platform that makes it easy for both experienced and wet-behind-the-ears data scientists to build and deploy ML models.
The case for mastering data science
Make no mistake about it: Mastering data science programs isn't easy. Becoming an expert on any of the major DS tools will take considerable effort and time. That said, the knowledge you can extract from the data will be the coin of the IT realm in the decade to come.
On a purely pragmatic, personal level, there's money to be made from data science. Company review site Glassdoor named data scientist the second-best job in America in 2021. It held the top spot from 2015 to 2019. Harvard Business Review once declared data scientist the sexiest job of the 21st century. The U.S. Bureau of Labor Statistics has found data science to be one of the top 20 fastest-growing occupations and has projected 31 percent growth over the next decade.
OK, so being a data scientist may not improve your love life, but the field itself is growing almost as fast as the data it lives on. If you're looking for a hot IT career or a switch to one that's both challenging and lucrative, data science is for you.
Lessons for leaders
- While many of the best data science tools are open source and free, sometimes paying for proprietary tools makes sense.
- The amount of data businesses generate is overwhelming and growing. Without the best tools and proficiency in them, you can only fall behind.
- The centrality of data to almost all businesses makes data science a core skill for an organization.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.