In 2016, Richard Lee was applying for a passport online in New Zealand when the system objected to the image he uploaded: “subject’s eyes are closed”. Richard is an Asian male and his eyes in the photo were actually open, but the machine learning system had been trained mainly on Caucasian subjects, who have a different eye shape. No matter how hard Richard tries, he will never get his passport if the machine learning system is not properly trained.
This is just one example of how important it is to get data that fully represents the domain the neural network is meant to cover. No matter how good our algorithm or how robust the structure of our machine learning system, if the data we provide for training is biased, we will fail to achieve our goal.
In this article, based on Omosola Odetunde’s talk at Codemotion Rome 2019, we will explore how to avoid bias in training datasets, and include a couple of guides that can help us do it.
AI and ML APIs
Omosola Odetunde has been a software engineer since 2010. Now a technical and product advisor, she got a Master’s in artificial intelligence with a focus on natural language processing in 2014. She realised pretty soon that engineers treated machine learning and neural networks as neutral technology, free from any human bias. “I learned exactly how much that is not the case,” she says.
We can define machine learning as an algorithm that, based on data, builds a model to make predictions. It then validates those predictions against reality, obtaining feedback to improve the model with more data.
For example, let’s say that we want a system able to predict the price of a house based on its square metres. We have to choose a proper algorithm. In this simple case, we can choose linear regression, whose goal is to find the best line fitting the correlation between square metres (sqm) and prices. You provide the real cases (sqm and price) to train your system and you get the model: basically, the intercept and slope of the line you were looking for. Now you are able to predict the price of a house from its square metres; when the house is actually sold, you can add this information to the model to improve it.
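As a minimal sketch of this idea (the numbers below are hypothetical), a least-squares fit in Python yields exactly those two values, the slope and the intercept:

```python
import numpy as np

# Hypothetical training data: square metres and sale prices.
sqm = np.array([50, 70, 90, 110, 140], dtype=float)
price = np.array([150_000, 205_000, 260_000, 330_000, 410_000], dtype=float)

# Least-squares fit of: price ≈ slope * sqm + intercept.
slope, intercept = np.polyfit(sqm, price, deg=1)

# Predict the price of a new 100 sqm house.
print(f"predicted price for 100 sqm: {slope * 100 + intercept:,.0f}")

# When the house is actually sold, append the real (sqm, price) pair
# and refit: that feedback loop is the "improve the model" step.
```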
Generally speaking, we have two types of machine learning systems: supervised and unsupervised. In the first case, when we provide the data we also add information about the expected output, whereas in the latter we only provide the data and expect the system to find an answer by spotting correlations in it. We can take as an example a baby who has to learn to distinguish dogs from other animals.
In the first case, the parents will point at a dog every time they see one and teach the baby “that animal is a dog because…”, naming a feature such as its four paws or its tail. That is what a supervised system does.
In the unsupervised case, the baby will see thousands of different animals and hear other people naming them; the baby will then work out a link by itself to determine what is a dog and what is not. Nowadays, unsupervised systems are widely adopted because they can be trained on millions of examples, whereas supervised systems are usually limited to the smaller amounts of data that have been labelled.
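To make the distinction concrete, here is a sketch using scikit-learn on toy, made-up animal measurements: the supervised classifier receives a label with every example, while the clustering algorithm has to group the animals on its own:

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy, made-up animal data: [weight_kg, height_cm].
X = [[8, 30], [30, 60], [4, 25], [450, 160], [500, 170], [6, 28]]
y = [1, 1, 1, 0, 0, 1]  # labels from the "parents": 1 = dog, 0 = not a dog

# Supervised: every training example comes with the expected output.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[10, 35]]))      # -> [1]: classified as a dog

# Unsupervised: no labels at all; the system finds the groups itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                   # cluster assignments it discovered alone
```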
Omosola explains that the current era is being called an “AI gold rush” because artificial intelligence (AI) becomes more accessible every day. By 2021, organisations are forecast to spend $52.2 billion annually on AI products.
We could say that everything started in 2006, when Amazon launched AWS (Amazon Web Services), making incredibly cheap computing power available. In 2007, the iPhone arrived, bringing with it a flood of user data.
Since then, we have acquired computational power and a lot of data, but skills are still needed to build proper machine learning systems. For this reason, in the last few years, Google, Amazon and other big players have been providing pre-trained, ready-to-use models that make the power of artificial intelligence accessible to developers and small companies. These include, for example, in-app translation, speech translation, text-to-speech interaction, chatbots, sentiment analysis and object recognition.
Here comes the drawback: by leveraging these public machine learning APIs, small companies can build their own applications, but they also become heavily dependent on the data and the models trained by the large companies. If you build something on top of Google’s text-to-speech API, for example, you are working off the text-to-speech model that Google developed, which means your application is also based on that model and on the data originally used to train it. So if the model was trained using a selection of people with a US accent, it will probably speak with that accent.
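As an illustration of that dependency, here is roughly what a call to Google’s Cloud Text-to-Speech Python client looks like (based on the public quickstart; check the current docs for exact names). Notice that the developer only picks from the voices Google has already trained; the accent itself is not under their control:

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(text="Hello from my application")

# The application only selects among voices Google has already trained;
# accent and pronunciation are baked into Google's model and data.
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",  # a US-accented model
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
```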
Three main biases
Generally speaking, we can identify three main types of bias:
- reporting bias
- selection bias
- latent/implicit bias
Reporting bias is when an algorithm is affected by the way users interact with it. When machines are taught to learn from those around them, they often can’t decide which data they should keep and which they could discard; they can’t tell good data from bad, so they simply take all of it. Usually, the system adapts its behaviour to the way of thinking and reacting of the people it is interacting with. In this case, we end up with a system that records or reports things at a frequency that differs from reality. For example, facts related to strong feelings and emotions usually appear more frequently than others.
Selection bias arises when the dataset used to train a model is not representative of the population the model makes predictions on. Usually this happens because one group is over-represented in the dataset: the data doesn’t actually reflect the world the model is used to predict upon, but some bias that existed when the data was initially collected. As a result, the final classification is skewed towards this large, over-represented group.
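A simple way to catch this early is to compare each group’s share in the training set with its expected share in the population the model will serve. A minimal sketch, with hypothetical numbers:

```python
from collections import Counter

def representation_gap(train_groups, population_shares):
    """Compare each group's share in the training set with its share
    in the population the model will serve (hypothetical helper)."""
    counts = Counter(train_groups)
    total = len(train_groups)
    for group, target in population_shares.items():
        actual = counts.get(group, 0) / total
        flag = "  <-- under-represented" if actual < target * 0.8 else ""
        print(f"{group:>10}: train {actual:5.1%} vs population {target:5.1%}{flag}")

# Hypothetical face dataset skewed toward lighter-skinned subjects.
train = ["lighter"] * 800 + ["darker"] * 200
representation_gap(train, {"lighter": 0.55, "darker": 0.45})
```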
Latent/implicit bias is when a system learns to correlate certain ideas or qualities with categories such as gender, sexuality or country of origin.
For example, we can think of a system that, when searching for images of doctors, shows images of male doctors before images of female doctors, regardless of how many images of each it holds. This can happen because the system has learned from many outside representations, for example in American Hollywood media, that men are more often depicted as doctors than women. Even if the image dataset is almost 50% women and 50% men, the system has learned an implicit bias that exists outside of itself, in the society where the data was collected, which more often represents doctors as men.
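One way to surface this kind of bias is to compare the composition of what the system shows against the composition of the data it actually holds. A minimal sketch with made-up numbers:

```python
def share(items, value):
    """Fraction of items equal to the given value."""
    return sum(1 for x in items if x == value) / len(items)

# Hypothetical search index: gender label of each indexed doctor photo.
index_genders = ["f"] * 500 + ["m"] * 500   # the dataset itself is 50/50
# Genders of the first 20 results the system actually shows.
top20_genders = ["m"] * 17 + ["f"] * 3

print(f"women in index:  {share(index_genders, 'f'):.0%}")   # 50%
print(f"women in top 20: {share(top20_genders, 'f'):.0%}")   # 15%
# A large gap between the two shares signals a learned implicit bias
# in the ranking, independent of the dataset's composition.
```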
What can we do?
“The first and most important thing is acknowledging that the bias exists”.
Another important countermeasure is to create a more representative dataset and then remove unnecessary attributes that can cause implicit bias.
Finally, identify blind spots: balance your personal assumptions as a developer or data scientist against those of people who think differently, to build a system that is as close to neutral as possible.
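A sketch of the second countermeasure, using pandas with hypothetical column and file names: drop the sensitive attributes from the training features, but keep them aside for auditing:

```python
import pandas as pd

# Hypothetical loan-application dataset with columns the model
# has no business using as predictive signals.
df = pd.read_csv("applications.csv")  # hypothetical file

SENSITIVE = ["gender", "ethnicity", "country_of_origin"]

# Drop attributes that can encode implicit bias directly...
features = df.drop(columns=SENSITIVE + ["approved"])
labels = df["approved"]

# ...but keep the sensitive columns aside: they are still needed
# later to audit the model's behaviour per group.
audit_groups = df[SENSITIVE]
```

Note that dropping columns alone is not enough: other attributes, such as a postcode, can act as proxies for the removed ones, which is one more reason to review blind spots with people who think differently.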
The community is moving in this direction. Joy Buolamwini, a researcher at the MIT Media Lab, founded the Algorithmic Justice League, which focuses on raising awareness of bias in machine learning and on providing bias checks for applications. It also works on creating more representative datasets: Buolamwini worked on a project called “Gender Shades”, which provides a greater variety of faces to test image recognition software against, so that we avoid situations like Microsoft’s, whose system showed a 95% accuracy rate on lighter male faces against 35% accuracy on darker-skinned female faces.
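The gap cited above is exactly what a disaggregated evaluation makes visible: instead of one overall accuracy number, compute accuracy per subgroup. A minimal, framework-free sketch with made-up predictions:

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Disaggregated evaluation: overall accuracy can hide large
    gaps between subgroups (a minimal sketch)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical predictions from a face-analysis model.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0]
groups = ["lighter_male"] * 4 + ["darker_female"] * 4

print(accuracy_by_group(y_true, y_pred, groups))
# -> {'lighter_male': 1.0, 'darker_female': 0.5}
```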
There is also Google’s Inclusive ML guide, which offers a lot of good best practices for reducing the kinds of bias you could introduce into your dataset. Google itself, after some problematic public AI incidents, has been doing a lot of work to increase the inclusivity of its machine learning.