
The 5 mistruths of data

Humans naturally skew or otherwise damage data so that it becomes untruthful and therefore useless. One problem with AI is that it assumes everyone tells the truth, which isn't always the case. It's difficult to fix these mistruths—but necessary, if we are to rely on AI and machine learning systems to make good decisions.

AI applications make decisions based on data. If the data has errors, then the result of any machine learning task is flawed. Datasets that contain mistruths present serious problems, some of which we can solve using new data acquisition and manipulation techniques, and some of which we can't.

Most people working with AI applications focus their attention on deep learning. That makes sense: it's the most visible form of AI, and it has a vast array of applications. Humans affect the data used in deep learning tasks, however, and can corrupt it, with potentially negative outcomes.

Deep learning isn't the only kind of AI. Whether we speak of cognitive computing, AI, machine learning, or deep learning, or the distinctions within machine learning itself (supervised learning, unsupervised learning, or reinforcement learning), they all share one attribute: reliance on data to make decisions. And if we can't trust the input, no matter how elegant the software implementation, we certainly cannot trust the output.

Garbage in, garbage out

People may take for granted that AI needs truthful data to do a good job during analysis, but “truth” can be an exceedingly hard thing to quantify. Humans instinctively see and consider mistruths, but an AI can’t. An algorithm only sees data—never truth or mistruth. The data is massaged in particular ways, but the interpretation of the data always comes from humans in some way.


To be useful, the data used in AI and machine learning applications must be properly vetted. Even so, humans must accept that some results will still be suspect because it’s not possible to avoid some level of misunderstanding.

The five types of mistruths are as follows:

Commission

The data contains a verifiable mistruth.

Sometimes people lie outright or shape data to fit their worldview. There are applications that try to find these untruths—for example, CaseWare IDEA helps automate the process.

However, the “commission” mistruth doesn’t necessarily mean that anyone or anything purposely lied. Often, there was an error in collecting the data. Dust on a camera, for instance, can cause mistruths of commission in the resulting data.

Changing the manner in which systems collect data can help reduce or eliminate mistruths of commission.
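While tools such as CaseWare IDEA automate this kind of screening, even a simple statistical check can flag records that deserve a second look. The sketch below is a minimal example, using invented camera readings and an arbitrary robust-deviation threshold, of flagging values that might be mistruths of commission (a dust-induced spike, say) for human review.

```python
import pandas as pd

# Hypothetical camera log; the column names and threshold are illustrative only.
readings = pd.DataFrame({
    "sensor_id": ["cam-1"] * 6,
    "brightness": [0.51, 0.49, 0.52, 0.98, 0.50, 0.48],  # 0.98 could be dust or glare
})

# Flag values that fall far outside the robust spread of the series.
median = readings["brightness"].median()
mad = (readings["brightness"] - median).abs().median()  # median absolute deviation
readings["suspect"] = (readings["brightness"] - median).abs() > 5 * mad

# Suspect rows go to a human (or a better collection process), not straight into training.
print(readings[readings["suspect"]])
```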

Omission

The data doesn’t contain any mistruths, but the data has a missing element, skewing how others view the data.

For example, think about an insurance company recording data about car accidents. Someone reporting an accident could say that a deer crossed their path, they were somewhat slow in applying the brakes, and that the road was slick so that the accident was worse than it might otherwise have been. That all may be accurate. However, if the driver does not mention that they were also texting at the time of the accident, it skews the data, although the account didn’t have a single mistruth. Unfortunately, the skewing of the data can create unrealistic rates that could hurt the customer and insurance company alike.

Not collecting all of the correct data can also cause mistruths of omission. As an example, much of our knowledge of Asian-Americans' health has been determined by studies in which investigators either grouped together Asian-American subjects or examined one subgroup alone (e.g., Asian Indian, Chinese, Filipino, Japanese, Korean, Vietnamese). When national health data is reported for Asian-American subjects, it is often reported for the aggregated group. This aggregation may mask differences among Asian-American subgroups.
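A small illustration of that masking effect: the numbers and subgroup labels below are entirely invented, but they show how a single aggregated rate can hide wide variation across subgroups.

```python
import pandas as pd

# Invented numbers and generic labels, purely to illustrate aggregation masking.
health = pd.DataFrame({
    "subgroup":   ["A", "B", "C", "D", "E", "F"],
    "cases":      [40, 15, 90, 20, 25, 70],
    "population": [1000, 1000, 1000, 1000, 1000, 1000],
})

# Per-subgroup rates show variation from 15 to 90 per 1,000...
health["rate_per_1000"] = health["cases"] / health["population"] * 1000
print(health[["subgroup", "rate_per_1000"]])

# ...while the single aggregated rate (about 43 per 1,000 here) hides it entirely.
aggregate = health["cases"].sum() / health["population"].sum() * 1000
print(f"Aggregated rate: {aggregate:.1f} per 1,000")
```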

However, not all mistruths of omission involve false or missing data input. A momentary glitch in any sensor can cause a mistruth of omission, even when the sensor normally performs flawlessly. If the glitch happens at the wrong time, such as during a power surge in an industrial setting, the AI may not know to fix the problem automatically; human intervention is then required, and significant damage can result. Finding mistruths of omission requires human vetting, but careful reviewers can usually spot and correct them.

Bias

The collection method doesn’t see all of the attributes required to make the data useful.

As an example of this particular kind of mistruth, imagine a programmer who spends hours looking for a glitch in a piece of code. Because so much time was spent staring at the code, the programmer develops a bias against seeing the error. Someone else can come along and find the error much faster. In the meantime, however, any data collected automatically about the development process is flawed because it doesn't reflect the actual source of the problem or the time required to fix it.

Another sort of bias occurs when someone tweaks a data analysis to produce the desired result rather than a true result. In one such case, a study by the U.S. Government Accountability Office showed that Federal Communications Commission statistics on the availability of Internet access are extremely flawed. It's impossible to make optimal decisions based on incorrect data.

Bias enters the data stream in all sorts of ways. If a camera isn't designed to collect infrared data, it's hardly surprising that the dataset misses the animal crossing its path in the dark. If this animal is a rat that's living in a company's storage shed, for example, the business can lose money as a result of damage caused by the rodent. Bias can also occur during data conditioning and in all sorts of other ways, but it always points to not seeing something that is there (perhaps obviously so).

Creating an environment that works to reduce bias is one way to correct this mistruth, but it can be quite hard to remove the bias entirely.
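One practical way to work toward that environment is to check, before training, whether the collection pipeline actually captured every attribute the analysis needs. The sketch below is a hypothetical check for the infrared-camera case above; the attribute names are invented for illustration.

```python
import pandas as pd

# Attributes we believe the downstream analysis needs; the names are hypothetical.
required_attributes = {"timestamp", "visible_light", "infrared", "location"}

# A log from a camera that was never designed to record infrared.
camera_log = pd.DataFrame({
    "timestamp": ["2023-01-01T22:14:00"],
    "visible_light": [0.02],
    "location": ["storage-shed"],
})

# Report any required attribute the collection method simply cannot see.
missing = required_attributes - set(camera_log.columns)
if missing:
    print(f"Possible collection bias: dataset lacks {sorted(missing)}")
```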

Perspective

The collection method isn’t in a position to see all the data.

Imagine a police officer collecting information from four people after a car accident. Even if no one has any reason to lie, each person tells a different story. The driver was looking at the dashboard and could feel the impact, and therefore provides information that the onlooker on the sidewalk can't. However, the onlooker could see when the driver applied the brake hard enough to cause the car to skid, as well as the little bit of black ice on the road that the driver missed. The pedestrian hit by the driver could see the driver's face, with its expression of complete surprise indicating that the accident wasn't malicious. The person looking out from a window could clearly see that the pedestrian wasn't looking when crossing the street and was partially hidden by some bushes. The officer may never arrive at the complete truth, because the best-case scenario uses only the common elements of all the stories and therefore lacks detail.

It's not possible to correct this mistruth completely, but better (more comprehensive) collection methods usually reduce it. For instance, finding additional information sources (a traffic camera, say), using multiple methods to query the information sources (multiple officers with different interview techniques), and cross-validating information sources (comparing this accident with other, similar accidents) can all help to reduce problems with perspective.
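To see what cross-validating sources might look like in data terms, here is a minimal sketch. The witness reports are reduced to a few invented boolean fields; the point is simply that fields on which every source agrees can be treated with more confidence, while disagreements are flagged for follow-up.

```python
# Hypothetical witness reports reduced to a few structured fields.
reports = {
    "driver":    {"braked_hard": True, "ice_on_road": False, "pedestrian_visible": False},
    "onlooker":  {"braked_hard": True, "ice_on_road": True,  "pedestrian_visible": False},
    "bystander": {"braked_hard": True, "ice_on_road": False, "pedestrian_visible": True},
}

# Cross-validate: keep fields where every source agrees, flag the rest for review.
for field in next(iter(reports.values())):
    values = {source: report[field] for source, report in reports.items()}
    if len(set(values.values())) == 1:
        print(f"{field}: all sources agree ({values['driver']})")
    else:
        print(f"{field}: sources disagree {values}")
```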

Frame of reference

Experience is an essential element in relating a truth.

It helps to explain things to people who share similar experiences. Imagine a developer who relays his experiences of designing exceptionally complex equipment in a nearly impossible time frame and then using it without testing. It’s easiest to share that story with a listener who also experienced that design scenario and is more likely to understand its effects. The mutual experience creates a frame of reference between the two people, which makes it possible to pass information in nonverbal ways. More important, some data need not be transferred; the other party already knows how a particular situation was resolved, or both can take some things for granted.

In data terms, a sensor can run afoul of frame-of-reference issues when it exists in an environment that isolates the sensor from the true experience. For example, it may not be possible to place a sensor in a position to obtain every bit of data required to understand an industrial accident. In that case, the data obtained would lack a frame of reference to the complete environment. It’s often not realistic to expose the sensor to the full experience due to issues with technology—for example, a fire or an irradiated environment could cause the sensor to malfunction. Destroying the sensor won’t garner data any more effectively than isolating it in a manner that causes a frame-of-reference issue.

Frame-of-reference mistruths are correctable only when someone with the required frame of reference reviews and vets the data. Sometimes, however, they aren't correctable at all.

Adding the human component

Humans complicate matters even further. In some cases, a human expects a mistruth as output, but an AI is unable to provide it. For example, an AI may analyze the data, mistruths and all, and come up with something completely unexpected and possibly unwanted. The article "Which outfit looks best? AI set to give shoppers smart styling tips" discusses this very issue. Imagine running a clothing store that uses this software to improve disappointing sales results, only to find that the software produces more unsatisfied customers instead.

Worse still, in some cases a truthful output is hurtful in a manner the AI can never understand. For example, some business settings are driven by ego, such as a startup built on the vision of one person. An AI could provide statistics showing that ego gets in the way of creating a great business, something the entrepreneur doesn't want to know and doesn't need to know. In fact, in this particular setting (possibly the only setting where such a case could be made), deflating that person's ego is definitely harmful. A human coming into contact with the information would probably keep it quiet, but the AI cannot understand that the information is hurtful.

Businesspeople regularly flout the rules, yet plenty of people stand ready to point out that the rules are supposedly important, which makes one wonder how Steve Jobs and Bill Gates became so successful. In the article "Truth is good, but knowing too much truth is harmful," the author explores the harm that too much truth can cause.

AI needs help

Some people worry that eventually, AIs will steal their jobs. The problem isn't one of stealing all the jobs but of retraining humans to have different skills. In some respects, the world is now in the same state it was in during the Industrial Revolution, which caused such major upheaval that the effects were felt a hundred or more years later. Coupling AI and humans is essential to create more truthful and useful output, and this cooperative work environment between AI and humans is already taking place. So, far from replacing humans, AI is providing a level of augmentation that will only make our lives more interesting and possibly more fun.

This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.