Don't let your data lake leak

Big data insights are the reason to build a data lake, but lack of attention to security could lead to unpleasant surprises instead. Hint: Trouble often begins with proof-of-concept projects.

Your company needs an edge, and you know how to get it—big data analytics. So you build a data lake and toss in reams of data. Because time to value is critical, you give trusted users access to it over the Internet. Next, you sit back and wait, hoping those line-of-business (LoB) experts can find the strategic insights that will give your company an advantage.

Congratulations. You’ve just committed the most common data lake security blunders: Your data lake is unencrypted, it sits on the Internet, and it lacks strict access control. You have put at risk large volumes of sensitive information, corporate financial data, and the personally identifiable information of customers. Unauthorized access could not only result in financial loss, identity theft, and reputational damage, but could also put your organization in violation of compliance regulations. Then there is the danger of ransomware: imagine your data lake becoming inaccessible to your company until you pay a stiff fee to cybercriminals.

A data lake is a key first step in the big data journey for any organization. Usually consisting of the Hadoop Distributed File System (HDFS) running on industry-standard hardware, a data lake contains structured and unstructured (raw) data that data scientists and LoB executives can explore, often on a self-serve basis, to find relationships and patterns that could point the way to new business strategies. Having large amounts of diverse data in one place increases the likelihood that users will find hidden and potentially valuable patterns and insights. Often used alongside an enterprise data warehouse, a data lake is a low-cost way to store data indefinitely while it awaits analytical processing.

“Data lakes often bring together an enterprise’s most valuable data to perform analytics and/or build predictive models. Hackers would like nothing more than to engineer a single breach with access to all of it,” according to Forrester Research analysts Mike Gualtieri, John Kindervag, and Kelley Mak in the report, “Big Data Security Strategies for Hadoop Enterprise Data Lakes.”

Big data projects are proliferating rapidly across organizations of all kinds and in all industries. In 2016, 65 percent of data and analytics decision-makers had implemented big data solutions, according to Forrester. That number should approach 100 percent in the next few years.

Data lakes often start as proof-of-concept projects, and that’s where trouble can begin. “Most data lake breaches to date have been of systems that are connected to the Internet when they did not need to be. Some were systems that had been designed as prototypes and then moved into production without a review of the architecture,” says Bob Gourley, a partner at Cognitio Corp. and publisher of the Daily Threat Brief. “In some cases, the breaches were due to use of easy-to-guess passwords and default logins. As in so many other areas of security, big data security is frequently about getting the basics right.”

Data lake security tips

The tactics required to defend a data lake, such as authentication and encryption, are well known to corporate IT. They just need to be applied, whether the data lake is a pilot project or not. However, because data lakes are a relatively new item on the agenda of corporate IT, it’s not always clear just who should take responsibility. “It’s not well understood who should be in the lead. The chief data officer might be a good person,” says Rick Villars, vice president of data center and cloud at IDC. Whoever is responsible must step up and take charge. “There is value to having all your data in one place. But if you are serious about having a data lake, you need to have controls,” says Villars.

Although big data projects have typically been associated with larger companies, which often generate large amounts of data and have the resources to undertake a large-scale project, smaller companies have been getting into the act thanks to steadily falling storage costs. Cloud-based data lakes are also emerging as an attractive option for businesses small and large, because the elasticity of cloud services allows them to accommodate large amounts of data. The move to the cloud might also bode well for security, because it forces users to examine what data will go into the cloud and decide what security needs to be applied.

Forrester recommends adopting an approach in which all traffic is secured and inspected for malicious or hazardous behavior. Forrester calls this a “zero-trust” approach, because no user or network traffic is trusted by default. Zero trust is particularly valuable in combating insider threats, because insiders have traditionally been considered trusted, which experience has shown to be a costly mistake. According to Forrester, 38 percent of breaches reported by North American and European enterprise decision-makers are due to internal incidents, which could be human mistakes, negligence, or malicious activity.

The zero-trust approach places the focus on data-centric security, which goes beyond traditional perimeter firewalls and antivirus software. Data-centric security involves a number of steps, including understanding where the data is located and whether its loss would pose a risk to the organization. Encryption and identity management are also key elements of the data-centric approach. Because data can be on premises, in the cloud, or both, and might be manipulated by a number of different users, a data-centric approach mandates that security measures such as encryption travel with the data. It’s also critical to give users access to only the data they need and to inspect data usage patterns through continuous monitoring.
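
To make the idea concrete, here is a minimal Python sketch of data-centric protection, assuming the widely used third-party cryptography package: the record is encrypted before it is ever written to the lake, so the protection travels with the data wherever it is copied. The record contents and the key handling are purely illustrative; in practice, keys would be generated and stored in a key management system, as the checklist below notes.

```python
# Minimal sketch of data-centric protection: the record is encrypted before it
# lands in the lake, so the protection travels with the data. Key handling is
# deliberately simplified for illustration; real deployments use a KMS or HSM.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # illustrative only; keep real keys in a KMS/HSM
fernet = Fernet(key)

record = b'{"customer_id": 1042, "email": "jane@example.com"}'  # hypothetical record
protected = fernet.encrypt(record)   # this ciphertext is what gets written to the lake

# Only holders of the key, governed by identity management, can read it back.
assert fernet.decrypt(protected) == record
```

The point is less the particular cipher than the placement of the control: the data carries its own protection rather than relying solely on the perimeter around it.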

The consensus among industry experts is to apply to data lakes the same security best practices used for legacy database environments. These generally fall into six categories. Here is a checklist:

  • Administration. Data administrators must be able to set policies for all users across all data resources.
  • Authentication. End users must be able to prove who they are. Multifactor authentication is recommended.
  • Authorization. Different users are given permission to access specific resources and perform different tasks, according to roles, privileges, profiles, and resource limitations.  
  • Auditing. Administrators must maintain a record of what was done by monitoring and recording user actions, and they must ensure that those without permission do not access data resources. (A simple sketch combining a role check with audit logging appears after this list.)
  • Data protection. Sensitive data should be encrypted both in transit and at rest, wherever it is located, whether on premises or in the cloud. Encryption key management, including key generation, storage, and access, is essential.  
  • Backup. Ensuring that the data lake is securely backed up is particularly important to help guard against the potential cost and headache of ransomware attacks. Being able to restore data from backup removes the need to pay a cybercriminal to get it back.
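
As a rough illustration of the authorization and auditing items above, the sketch below checks a hypothetical role-based policy before granting a user access to a lake path and logs every decision. It is a deliberately simplified stand-in, not a replacement for a real Hadoop policy and audit layer such as Apache Ranger; the roles, paths, and policy structure are invented for the example.

```python
# Simplified illustration of authorization plus auditing: check a hypothetical
# role-based policy before granting access to a path, and log every decision.
import logging
from datetime import datetime, timezone

# Hypothetical role-to-resource policy: which roles may read which lake paths.
POLICY = {
    "analyst": {"/lake/marketing", "/lake/sales"},
    "data_scientist": {"/lake/marketing", "/lake/sales", "/lake/raw"},
}

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("lake.audit")

def read_allowed(user: str, role: str, path: str) -> bool:
    """Check the user's role against the policy and record the decision."""
    allowed = path in POLICY.get(role, set())
    audit.info("time=%s user=%s role=%s path=%s allowed=%s",
               datetime.now(timezone.utc).isoformat(), user, role, path, allowed)
    return allowed

if __name__ == "__main__":
    read_allowed("alice", "analyst", "/lake/raw")        # denied, and logged
    read_allowed("alice", "analyst", "/lake/marketing")  # allowed, and logged
```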

According to Gourley, one of the most common design vulnerabilities in a data lake is failure to encrypt the data. “Designers with no security experience frequently think it is OK to leave the data unencrypted and control access by giving out login credentials to trusted users,” he says. That’s like leaving the bank vault open, he contends. “Miscreants with even a basic knowledge of IT can use malicious tools to get around access controls and get directly to the data, making encryption of data at rest a key design criterion.”
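
For an HDFS-based lake, encryption at rest is typically handled with HDFS transparent encryption. The sketch below, a small Python wrapper around the standard hadoop key and hdfs crypto commands, creates a key in the Hadoop key management server and turns a directory into an encryption zone. It assumes a cluster with a KMS already configured and the CLIs on the path; the key name and directory are hypothetical.

```python
# Hypothetical helper that sets up HDFS transparent encryption for one
# directory. Assumes the hadoop/hdfs CLIs are available and a Hadoop KMS is
# configured; the key name and path below are placeholders.
import subprocess

def create_encryption_zone(key_name: str, path: str) -> None:
    # Create the encryption key in the Hadoop Key Management Server.
    subprocess.run(["hadoop", "key", "create", key_name], check=True)
    # Create the (empty) directory, then mark it as an encryption zone tied to that key.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", path], check=True)
    subprocess.run(["hdfs", "crypto", "-createZone", "-keyName", key_name,
                    "-path", path], check=True)

if __name__ == "__main__":
    create_encryption_zone("lake-pii-key", "/data/lake/sensitive")
    # Files written under /data/lake/sensitive are now encrypted at rest;
    # reading them requires access to the key through the KMS.
```

With an encryption zone in place, encryption and decryption happen transparently for authorized clients, which keeps the control close to the data rather than buried in application code.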

Villars stresses the need to maintain organizational compliance with industry regulations, such as PCI DSS for organizations that handle payment card data and HIPAA for healthcare organizations. Also critical for companies doing business in Europe, or with European customers, is the European Union’s General Data Protection Regulation (GDPR). Data in the lake that might contain personally identifiable information must be managed in accordance with the GDPR, or stiff penalties could result.

One best practice above all is eternal vigilance. “With security, there is always someone coming up with something new,” says Villars. The WannaCry ransomware is a good example. Ransomware is usually delivered through phishing attacks via email, but WannaCry spread on its own, like a worm. It wasn’t necessarily more effective, but it showed the willingness of cybercriminals to try new approaches. “There has to be a big focus to make sure you have added security into the design of these systems. It’s never a finished effort. You’ll always have to do more at different layers, possibly adding cognitive and machine learning,” advises Villars.

Bottom line: No big data project, including the creation of a data lake, should be undertaken without an airtight security strategy that is followed to the letter.

Data lake security: Lessons for leaders

  • Be aware of all the data types that are going into your data lake, including sources, dependencies, and levels of sensitivity. 
  • Apply the best security practices you would apply to legacy database implementations.
  • For governance, risk, and compliance purposes, be aware that a data lake might retain data for a long period of time, and that the data might be sent to the cloud and back.

This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.