One of the primary problems with artificial intelligence (AI) is the “artificial” part. The other is the “intelligence.” While we like to pretend that we’re setting robotic intelligences free from our human biases and other shortcomings, in reality we often transfer our failings into the AI, one dataset at a time.
Hannah Davis, a data scientist, calls this out, arguing that “a dataset is a worldview,” filled with subjective meanings. But rather than leave our AI hopes moribund, she also offers some ways we might improve the data that informs our AI.
AI has always been about people
It has become de rigueur to posture about how very “data driven” we are, and nowhere more so than in AI, which is completely dependent on data to be of use. One of the wonders of machine learning algorithms, for example, is how fast they can sift through mountains of data to uncover patterns and respond accordingly. Such models, however, must be trained, which is why data scientists tend to congregate around established, high-quality datasets.
Unfortunately, those datasets aren’t neutral, as Davis points out:
[A] dataset is a worldview. It encompasses the worldview of the people who scrape and collect the data, whether they’re researchers, artists, or companies. It encompasses the worldview of the labelers, whether they labeled the data manually, unknowingly, or through a third-party service like Mechanical Turk, which comes with its own demographic biases. It encompasses the worldview of the inherent taxonomies created by the organizers, which in many cases are corporations whose motives are directly incompatible with a high quality of life.
See the problem? Machine learning models are only as smart as the datasets that feed them, and those datasets are limited by the people shaping them. This could lead, as one Guardian editorial laments, to machines making the same mistakes we do, just more quickly: “The promise of AI is that it will imbue machines with the ability to spot patterns from data, and make decisions faster and better than humans do. What happens if they make worse decisions faster?”
Complicating matters further, our own errors and biases are, in turn, shaped by machine learning models. As Manjunath Bhat has written, “People consume facts in the form of data. However, data can be mutated, transformed, and altered—all in the name of making it easy to consume. We have no option but to live within the confines of a highly contextualized view of the world.” We’re not seeing data clearly, in other words. Our biases shape the data we feed into machine learning models that, in turn, shape the data available for us to consume and interpret.
Time to abandon hope, all we who enter here?
Data problems are people problems
Not necessarily. As Davis goes on to suggest, one key thing we can do is to set our datasets to expire:
Machine learning datasets are treated as objective. They’re treated as ground truth by both the machine learning algorithms and the creators. And datasets are hard, time-consuming, and expensive to make, so once a dataset is created, it is often in use for a long time. But there is no reason to be held to the past’s values when we as a society are moving forward; similarly, there is no reason to hold future society to our current conditions. Our datasets can and should have expiration dates.
At any given point in time, the people, places, or things that are top of mind will tend to find their way into our datasets. (Davis uses the example of ImageNet, created in 2009, which returns flip phones when “cell phone” is searched.) By setting datasets to expire, we force our models to keep up with society.
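What might an expiration date look like in practice? Davis doesn't prescribe a mechanism, so the following is only a minimal sketch under our own assumptions: a hypothetical `DatasetCard` metadata record, a hypothetical `require_fresh` check that a training script could call before it starts, and made-up dates chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical metadata record attached to a training dataset."""
    name: str
    collected_on: date  # when the data was gathered
    expires_on: date    # when the team agreed to re-collect or re-review it

    def is_expired(self, today: date) -> bool:
        # A dataset counts as expired on and after its expiration date.
        return today >= self.expires_on


def require_fresh(card: DatasetCard, today: date) -> None:
    """Refuse to proceed if the dataset is past its expiration date."""
    if card.is_expired(today):
        raise ValueError(
            f"{card.name} expired on {card.expires_on.isoformat()}; "
            "refresh or re-review it before training"
        )


# Illustrative values only; the name and dates are invented for this example.
card = DatasetCard(
    name="phone-images",
    collected_on=date(2009, 1, 1),
    expires_on=date(2014, 1, 1),
)
```

Calling `require_fresh(card, date.today())` at the top of a training job would then fail loudly rather than quietly training on a stale worldview. How long a dataset should live is a judgment call for the people maintaining it, not something code can decide.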
This points to another option, one suggested by McKinsey research: reintroducing people into AI. Whether by pre-processing the data or post-processing the model’s output, humans can step in to correct machine learning models. The math involved in the model may be impeccable, but adding humans (yes, with biases) can help to review the model’s outcomes and prevent biases from operating unchecked.
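One simple form of post-processing is to send some of a model's outputs to a person before anyone acts on them. The sketch below is an assumption on our part, not a method taken from the McKinsey research: the `Prediction` record, the `split_for_review` function, and the 0.9 threshold are all hypothetical. Low confidence is also only one possible trigger; a confidently biased model would sail past it, so a team might equally review a random sample or audit outcomes by group.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """Hypothetical record of a single model output."""
    item_id: str
    label: str
    confidence: float  # the model's own score, assumed to lie in [0, 1]


def split_for_review(
    predictions: List[Prediction], threshold: float = 0.9
) -> Tuple[List[Prediction], List[Prediction]]:
    """Separate outputs that pass automatically from those a human should check."""
    accepted: List[Prediction] = []
    needs_review: List[Prediction] = []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return accepted, needs_review
```

Everything in `needs_review` would go to a human reviewer, whose corrections could in turn feed the next version of the dataset.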
Unless we’re careful, Davis warns, “It is easy to accidentally cause harm through something as seemingly simple as collecting and labeling data.” But with extra care, we can gain many of the benefits of AI while minimizing the potential biases and other shortcomings that the machines inherit from us humans.
Copyright © 2020 IDG Communications, Inc.