Exploring what’s next in tech – Insights, information, and ideas for today’s IT and business leaders

What is data cleansing, and why does your company need it?

Data cleansing platforms can help make dirty data useful, but the process isn't as simple as it sounds.

Sometimes it feels like we're living in an all-you-can-eat data buffet. The volume of information doubles roughly every two years, and "data-driven decision-making" is the new mantra for business.

But while the amount of data is increasing exponentially, the ability of companies to make good use of that data is not. A big part of the reason is that data is inherently messy, inconsistent, or incomplete.

Recent studies by Experian show that only about half of all companies believe the data in their CRM or ERP systems is clean enough to use, and nearly a third believe at least some of their customer or prospect data is inaccurate. According to Gartner, low-quality data costs the average organization nearly $13 million a year. That can lead to inefficient decision-making, reputational damage, and missed opportunities, says David Sweenor, senior director of product marketing at Alteryx, an analytics automation platform.


"To have the right foundation for your business, you need trustworthy data you can base sound decisions on," says Sweenor. "Cleansing and improving your data quality is the starting point for everything that follows."

Typos, misspellings, inconsistent treatment of common terms, invalid entries, incorrect formatting, duplicate or incomplete records—there's a long list of things that can make good data go bad. And as enterprises increasingly rely on predictive analytics to drive business decisions, ensuring the reliability of data is critical.

"Even if you use state-of-the-art machine learning algorithms, low-quality data won't give you the desired results or accuracy," says Saravanan Natarajan, a data scientist at Hewlett Packard Enterprise. "When it comes to predictive models, quality is much more important than quantity."

In other words, data is what fuels your organization's AI engine. And as with an automobile, pumping in dirty fuel will eventually destroy its ability to move forward.

Data cleansing 101

Simply put, data cleansing, also known as data cleaning or data scrubbing, is the process used to identify and correct errors and inconsistencies in a dataset. It sounds simple enough in theory, but things can quickly get complicated.

There are nine different dimensions of data quality, from accessibility and accuracy to consistency and completeness, notes Stewart Bond, research director for IDC's data integration and intelligence software service. For example, are some entries abbreviated while others aren't—Inc. or Incorporated, Corp. or Corporation, International or Intl.? Your database may register the same company as two different entities.
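A minimal sketch of how such abbreviation variants might be canonicalized, assuming a small hand-maintained rule set (the function name and rules here are illustrative, not from any particular product):

```python
import re

# Hypothetical rules mapping common suffix variants to one canonical form.
SUFFIX_RULES = {
    r"\bincorporated\b": "inc",
    r"\bcorporation\b": "corp",
    r"\binternational\b": "intl",
}

def normalize_company(name: str) -> str:
    """Lowercase, strip punctuation, and canonicalize common suffixes
    so variants like 'Acme Inc.' and 'ACME Incorporated' compare equal."""
    key = name.lower().strip()
    key = re.sub(r"[.,]", "", key)           # drop periods and commas
    for pattern, canonical in SUFFIX_RULES.items():
        key = re.sub(pattern, canonical, key)
    return re.sub(r"\s+", " ", key)          # collapse repeated spaces

print(normalize_company("Acme Inc."))          # acme inc
print(normalize_company("ACME Incorporated"))  # acme inc
```

With a canonical key like this, both spellings hash to the same value, so the database no longer registers them as two different companies.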

Date, currency, and number formats will likely vary. Are all similar fields using the same units of measure? Inches or centimeters? Dollars or euros? 12-hour clocks or 24 hours?

Are there extra spaces inside a data field? Your brain automatically treats two values that differ only in spacing as the same entry, but a machine won't.
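The unit and whitespace problems above are usually solved by normalizing every row to one convention before comparison. A hedged sketch, assuming the data contains only inches and centimeters (function names are illustrative):

```python
def squeeze_spaces(field: str) -> str:
    """Collapse internal runs of whitespace and trim the ends,
    so ' Acme  Corp ' and 'Acme Corp' become identical strings."""
    return " ".join(field.split())

def normalize_length_cm(value: float, unit: str) -> float:
    """Convert a length into centimeters so all rows share one unit.
    Assumes only inch and centimeter variants appear in the data."""
    unit = unit.strip().lower()
    if unit in ("in", "inch", "inches"):
        return round(value * 2.54, 2)
    if unit in ("cm", "centimeter", "centimeters"):
        return float(value)
    raise ValueError(f"unknown unit: {unit}")

print(squeeze_spaces(" Acme  Corp "))    # Acme Corp
print(normalize_length_cm(10, "in"))     # 25.4
```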

Does your database say a customer is married but also that he's only 10 years old? At least one of those data points is incorrect, but which one?
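Contradictions like this can be surfaced by simple cross-field validation rules that flag records for human review rather than guessing which field is wrong. A minimal sketch with illustrative field names:

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of internal inconsistencies in one record.
    The rule below mirrors the married-10-year-old example: it flags
    the conflict but leaves the correction to a subject matter expert."""
    issues = []
    age = record.get("age")
    marital = record.get("marital_status")
    if age is not None and marital == "married" and age < 16:
        issues.append("married but age < 16: age or marital_status is wrong")
    return issues

print(validate_record({"age": 10, "marital_status": "married"}))
print(validate_record({"age": 34, "marital_status": "married"}))  # []
```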

Fields required to run a predictive model may be blank or incomplete. In that case, you may need to substitute average values or use machine-generated synthetic data to complete calculations.
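Substituting average values, the simpler of the two options, can be sketched in a few lines (synthetic-data generation is the heavier alternative and is not shown):

```python
from statistics import mean

def impute_mean(values: list) -> list:
    """Replace missing (None) entries with the mean of the observed
    values, so a predictive model can run on a complete column."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [v if v is not None else fill for v in values]

print(impute_mean([10.0, None, 20.0]))  # [10.0, 15.0, 20.0]
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is one reason synthetic data is sometimes preferred for model training.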


Data also degrades over time, adds Bond. People move, switch jobs, get married, change their names, and die. Organizations may need to rely on third-party data to ensure that their records are up to date.

There are specialized software programs for cleaning different types of databases—contact, location, business, product, and so on, says Bond. These apps increasingly rely on AI and machine learning to do some of the heavy lifting, like creating rules (delete extra spaces, change all instances of "California" to "CA," etc.) and making cleansing recommendations. But Bond is careful to note you can't automate the entire process. At some point, you'll need human subject matter experts to step in and make the tough calls.
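The kind of rules Bond describes can be modeled as an ordered list of pattern-replacement pairs applied to every field. A hedged sketch, using the two rules named in the text (the pipeline shape is illustrative, not any vendor's API):

```python
import re

# Each rule is a (pattern, replacement) pair, applied in order.
# These two mirror the examples above: collapse extra spaces,
# and change all instances of "California" to "CA".
RULES = [
    (re.compile(r"\s+"), " "),
    (re.compile(r"\bCalifornia\b"), "CA"),
]

def apply_rules(value: str) -> str:
    """Run every cleansing rule over a field, then trim the ends."""
    for pattern, replacement in RULES:
        value = pattern.sub(replacement, value)
    return value.strip()

print(apply_rules("  San  Jose,   California "))  # San Jose, CA
```

In practice the ML component suggests rules like these from patterns in the data; the human expert's job is to accept, reject, or refine them.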

When cleaning becomes destroying

A key element of data cleaning is identifying and correcting duplicate records. For example, if a mail order house has Howard R. Smith, H. Robert Smith, and Bob Smith in its customer database, and they're all the same age and live at the same address, there's a high likelihood they're the same person. That vendor could lower its marketing costs by merging those records and sending him one catalog every month instead of three. But the danger in merging seemingly duplicate records is that it could end up destroying data that might prove valuable later on, argues David Loshin, president of Knowledge Integrity, a data quality consultancy.
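A simple way to surface candidate duplicates like the three Smiths is to pair up records that agree on strong signals such as address and age, and leave the final same-person call to a human or a fuzzier matcher. A minimal sketch with illustrative field names:

```python
from itertools import combinations

def likely_duplicates(records: list) -> list:
    """Pair up records sharing an address and age. This only proposes
    candidates; deciding whether 'Howard R. Smith' and 'Bob Smith'
    are the same person is left to a person or a fuzzy matcher."""
    pairs = []
    for a, b in combinations(records, 2):
        if a["address"] == b["address"] and a["age"] == b["age"]:
            pairs.append((a, b))
    return pairs

customers = [
    {"name": "Howard R. Smith", "address": "12 Elm St", "age": 54},
    {"name": "H. Robert Smith", "address": "12 Elm St", "age": 54},
    {"name": "Jane Doe", "address": "9 Oak Ave", "age": 31},
]
print(len(likely_duplicates(customers)))  # 1
```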

"If I 'cleanse' that record by correcting it and storing it in another environment, I've changed the data," says Loshin, who is also director of the University of Maryland's Master of Information Management program. "And when you change data, you lose information."

Loshin says a better approach is to use identity resolution tools to create a registry of entities that map to all the different "Bobs." In that way, companies can gain a unified view of the customer without destroying any of the underlying information. This becomes especially important for compliance and risk mitigation.
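The registry Loshin describes can be sketched as a mapping layer: every source record keeps its original form, and the registry only records which entity each one resolves to. This class and its record IDs are hypothetical, for illustration:

```python
class IdentityRegistry:
    """Maps source record IDs to entity IDs without merging or
    altering the underlying records, preserving the audit trail."""

    def __init__(self):
        self._entity_of = {}   # record id -> entity id
        self._members = {}     # entity id -> set of record ids

    def link(self, record_id: str, entity_id: str) -> None:
        self._entity_of[record_id] = entity_id
        self._members.setdefault(entity_id, set()).add(record_id)

    def records_for(self, entity_id: str) -> set:
        """All source records that resolve to one entity: the
        unified customer view, with nothing destroyed."""
        return self._members.get(entity_id, set())

registry = IdentityRegistry()
for rid in ("crm:howard-r-smith", "mail:h-robert-smith", "web:bob-smith"):
    registry.link(rid, "entity:smith-001")

print(len(registry.records_for("entity:smith-001")))  # 3
```

Because nothing is merged, each original account survives intact, which is exactly what a fraud investigation or compliance audit needs.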


"If an individual creates multiple accounts in order to commit fraud, and you merge those accounts into a single entity, you've erased the footprint of the fraudster's activities," Loshin says. There may also be value in sending Bob Smith multiple catalogs, he adds. Other people in his household are more likely to see them, and the number of purchases per year might actually increase as a result.

"You can't presume that having a single representation of a customer will improve their lifetime value," he says. "You need to understand how datasets are being used and what business processes are being informed by them, and use that to drive the decisions you make around data quality."

Whose job is it?

Data cleansing is a dirty job that somebody has to do. The question is, who? Many organizations leave it to their data scientists. A 2020 survey by data science platform vendor Anaconda found that data scientists spend 45 percent of their time loading and cleaning data—a significant investment of a scarce resource.

Other organizations hand the responsibility to their already overloaded technology departments, notes Alteryx's Sweenor. "They think, 'Hey, our IT team can do it,' but their IT team will never be able to keep up," he says.

Sweenor argues that subject matter experts in each department—marketing, HR, operations, and so on—need to assume responsibility for the cleanliness and accuracy of their data. "They know the business; they know the business data," he says. "Empowering the people within those organizations to improve data quality is something enterprises don't think enough about."

This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.