
The road to machine learning success is paved with failure

Failures in machine learning models generate major headlines, but this is how the technology is supposed to work.

You might remember the story that broke in 2018 about Amazon's early experiments in using machine learning and AI to aid its recruiting efforts. The system, which was built in 2014, was designed to give online job applicants a quality rating from 1 to 5, drawing on a decade's worth of resumes that had been fed into it as training data.

It wasn't long before developers noticed something curious about the job applicants the algorithm recommended: They were virtually all men. And the algorithm didn't stop there. Even when a resume didn't state the applicant's gender, the system penalized it if the word "women" appeared in the body or if the candidate had attended a women-only college. Even after attempts were made to remove bias from the system, Reuters reported that candidates were ultimately recommended for various jobs "almost at random."

The problem was soon revealed to be an issue with the training data—a majority of those resumes fed to the system had been from men, so the AI naturally gravitated in that direction—but the incident was widely held up as a massive failure of machine learning. "Here was a case of FML—Failed Machine Learning—that had the potential to negatively influence job prospects for women while breaking antidiscrimination laws," wrote algorithmic bias expert Joy Buolamwini shortly after the story broke.

When AI projects go wrong, they can go spectacularly wrong. But actually, that's kind of the point.

Failure is part of the process

"The primary source of difference between software engineering and machine learning engineering is that you should always expect failure with ML," says Glyn Bowden, CTO of the AI and data practice at Hewlett Packard Enterprise. "We need to expect failure and need to look out for it. The question becomes, how do we build the transparency, the monitoring capabilities, and the telemetry into the solutions that we're building so we can see when that starts to happen."

For Bowden, one of the core elements of a strong ML program is learning from mistakes like Amazon's. Rather than piling on to "AI is evil" headlines, researchers should be taking the opportunity to learn what went wrong and build those learnings into subsequent models. "When you notice drift sneaking in, it's important to ask how to head that off and retrain the model or reinforce the model so it remains as accurate as possible," he says. "The ML developers who are the most successful are the ones who build that thinking into their strategy."

The idea that ML is experimental is hardly new. Way back in 1988, Pat Langley, then at the University of California at Irvine, wrote about it in an editorial called "Machine Learning as an Experimental Science." In the paper, he notes that "unlike some empirical sciences, machine learning is fortunate enough to have experimental control over a wide range of factors, making it more akin to physics and chemistry than astronomy or sociology." Langley probably wasn't predicting the issues of bias that would be front and center 30 years later, but he does note that "machine learning occupies a fortunate position that makes systematic experimentation easy and profitable" and that "whether they lead to positive or negative results, experiments are worthwhile only to the extent that they illuminate the nature of learning mechanisms and the reasons for their success or failure."

Mistakes are natural—within reason

That's not to say that every instance of ML gone wrong is a natural part of the experience. A few years ago, a large AI vendor was slammed for giving bad medical advice to actual cancer patients. The issue was eventually traced to massive problems with the training data the algorithm had been fed: It was based on theoretical preferences rather than actual case studies, because scientists found it too difficult to keep up with changing guidelines.

"There was such confidence that the system was doing a good job that no one went back and checked, so it slowly fell behind," says Bowden. "The data is going to evolve, so the framework needs to evolve."

Practitioners need to understand the limitations of an ML model so that radical changes are spotted quickly and a model can be pulled from production before any real damage is done. "If you've done your homework up front, you know why the model behaves the way it behaves," says Bowden. As an example, he posits an ML image recognition system. If something suddenly changes with the way images are being categorized or processed, it's time to pause and look for problems. In many cases, it's not an algorithm that's gone haywire but a change in the data, such as lighting conditions or camera calibration. Once the environment has become polluted, it's time to fall back on other processes until a fix is deployed. "But you need the discipline to say, 'I'm worried about it,'" he says.
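
The image recognition example can be made concrete with a very small check. What follows is only a sketch under stated assumptions, not a method described by Bowden: it assumes each incoming image has already been reduced to a single mean-brightness value, and it compares a recent batch against a baseline recorded when the model was validated. The function names, the sample values, and the threshold of three standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """Return how many baseline standard deviations the recent mean has moved."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        # A constant baseline: any movement at all counts as unbounded drift.
        return 0.0 if mean(recent) == base_mean else float("inf")
    return abs(mean(recent) - base_mean) / base_std


def should_pause(baseline, recent, threshold=3.0):
    """Flag the model for review when the input data has shifted too far."""
    return drift_score(baseline, recent) > threshold


# Hypothetical mean-brightness values (0.0 = black, 1.0 = white).
baseline_brightness = [0.48, 0.52, 0.50, 0.49, 0.51, 0.50, 0.47, 0.53]
normal_batch = [0.49, 0.51, 0.50, 0.52]
dim_batch = [0.21, 0.19, 0.22, 0.20]  # e.g. a failed light on the line

print(should_pause(baseline_brightness, normal_batch))
print(should_pause(baseline_brightness, dim_batch))
```

A check like this says nothing about the model itself. It only signals that the data no longer looks like the data the model was validated on, which is the cue to fall back on other processes and investigate.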

Mistakes aren't always OK, of course, and data scientists should take steps to prevent catastrophic problems from developing. A self-driving car must never decide to run over a pedestrian because it has learned that a person is softer than the wall it would otherwise hit. As Bowden notes, ethics must also be built into models to prevent catastrophic failures like this.

Best practices for ML experimentation

"Machine learning models can and will fail, but you can put in place best practices from the onset that can get that ML experimentation on the right track," says Shomron Jacob, engineering manager for applied ML and AI at

Those best practices begin with dataset analysis and a careful rooting out of bias, before the algorithm is turned loose on it. "Check your data quality and check for an even spread of data across classes, which will help avoid classification issues," says Jacob. "Use fresh data that was never used during training to make sure ML model evaluations are actually true tests, and make sure the model is prepared to handle outliers when it's in a production environment."
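
Two of Jacob's checks, the spread of data across classes and evaluation on data never used in training, lend themselves to a short illustration. The sketch below uses only the Python standard library. The helper names, the 90/10 label split, and the 20 percent holdout fraction are assumptions made for the example, not figures from Jacob.

```python
import random
from collections import Counter


def class_balance(labels):
    """Return the share of the dataset that each class accounts for."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the largest class to the smallest; 1.0 is perfectly even."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def train_holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle once, then set aside rows that training never sees."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labels: 90 good parts and 10 defective ones.
labels = ["ok"] * 90 + ["defect"] * 10
print(class_balance(labels))
print(imbalance_ratio(labels))

train_rows, holdout_rows = train_holdout_split(range(100))
print(len(train_rows), len(holdout_rows))
```

A high imbalance ratio is not an error in itself, but it is a prompt to look at how the minority class is being evaluated before the model is trusted. The holdout rows should stay untouched until final evaluation.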

Bowden adds that training on outliers and faults must be done carefully. If a large volume of discolored products comes off the assembly line and they're immediately fed back into the model as examples of errors, the model can become biased toward those specific faults. "That's all the training set starts looking for," he says. "Instead, you need to capture training data for a longer term because you don't just want to find faults; you also want to know the frequency at which they show up."

"But even best practices can end with an ML model that isn't fit to move forward with," says Jacob. "To know whether an ML model has gone off the rails and needs to be pulled from production, you need to include a built-in feedback loop tool designed to track if and when the model fails. Remember that models are based on probability, and they're brittle, making reliance specific to the data itself. If data changes, the model also needs an update."

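The feedback loop Jacob describes could take many forms. One minimal sketch, assuming the true outcome for each prediction eventually becomes available, is a rolling window of recent results that flags the model once accuracy drops below a floor. The class name, window size, and accuracy floor here are hypothetical choices for illustration, not part of any particular tool.

```python
from collections import deque


class FeedbackMonitor:
    """Track recent prediction outcomes and flag sustained accuracy loss."""

    def __init__(self, window=100, min_accuracy=0.9):
        # deque(maxlen=...) silently discards the oldest outcome when full.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one prediction matched the outcome later observed."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None before any feedback."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """True once a full window of feedback falls below the floor."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = FeedbackMonitor(window=10, min_accuracy=0.8)
for _ in range(10):
    monitor.record("ok", "ok")       # ten correct predictions
print(monitor.needs_review())

for _ in range(5):
    monitor.record("ok", "defect")   # the data shifts; five misses in a row
print(monitor.accuracy(), monitor.needs_review())
```

Waiting for a full window before raising a flag is a deliberate choice in this sketch: it avoids pulling a model on the strength of one or two early misses, at the cost of a slower reaction.
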
And if that happens, don't despair. "The nature of machine learning is experimentation," Bowden reminds us. "That's why we build models: because we don't know."

This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.