AI & Dataset Bias
Most content on social media is not made by you or me: several studies suggest that a small group of users (9-10%), mostly traditional media outlets, creates most of the original content. Opinion leaders then produce 18-20% of content by commenting on what these outlets publish. The general public then sees and interacts with this content.
Despite public perception, the new media landscape is shaped by a small fraction of users.
Now, if we turn our attention to AI, we can see that a similar problem may arise. If you’re an AI developer, you’ll probably go to kaggle.com or Statista, or leverage open government data, to train your model. The number of data sources is limited, which means a developer has to rely fully on these few sources. It’s likely that you cannot cross-check this input against other sources and, even if you could, how would you decide which dataset to choose if there were more than one? So, multiple data sources can present their own problems.
A developer can try different models, vary the number of features, and choose among several optimization techniques to arrive at a working predictive system. But for most of these models, the underlying data may be identical. No matter how sophisticated an AI engineer is, the model is only as good as the data it’s based on. This is why we need to understand what data we are actually getting.
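To make this concrete, here is a minimal sketch in Python. The numbers and names (`population`, `biased_sample`, `trimmed_mean`) are invented for illustration and do not come from any real dataset. It assumes a population made of two equally sized groups and a dataset in which one group is heavily over-represented, then runs three different estimators over both.

```python
from statistics import mean, median

# Hypothetical ground truth: two equally sized groups with different values.
population = [10.0] * 50 + [20.0] * 50

# Hypothetical skewed dataset: the first group is heavily over-represented.
biased_sample = [10.0] * 90 + [20.0] * 10


def trimmed_mean(data, trim=0.1):
    """Mean after dropping the lowest and highest `trim` share of values."""
    ordered = sorted(data)
    k = int(len(ordered) * trim)
    return mean(ordered[k:len(ordered) - k])


# Three different "models" that all estimate the typical value.
models = {"mean": mean, "median": median, "trimmed_mean": trimmed_mean}

true_value = mean(population)
for name, model in models.items():
    on_biased = model(biased_sample)
    on_full = model(population)
    print(f"{name}: biased data -> {on_biased}, "
          f"representative data -> {on_full}, truth = {true_value}")
```

In this toy setup every estimator recovers the true value from the representative data, and every one of them misses it on the skewed data. Switching the model does not remove the bias, because the bias comes from the input.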
COVID-19 provides an apt example. To create a worldwide model of how the virus spread, we have to rely on data sources from each country. However, given that there’s no uniform way to collect this data globally, we can expect a considerable margin of error.
Another example is global warming. Data from an NGO and data from a government source can paint completely different pictures, and those pictures can shape an array of nationwide decisions and company strategies. If the source isn’t correct, bias could spread across industries, alter public perception, and perhaps compound the negative impact of global warming itself.
With AI coming to power a vast array of services around us, dataset manipulation may cause misinformation to spread at a scale and with an impact greater than ever before. When an organization uses an AI system, it’s often a black box: we don’t know why it does what it does. Finding out whether a problem lies with the data or with the model itself is tough. AI is becoming directly integrated into the environment around us: in our home appliances, our mobile phones, and our shopping experience. As long as we must rely on the quality of whatever data is available, there will always be a problem with automated decision-making.
The problem of creating a universal data language is far from simple, as it raises questions of privacy and issues with protecting business secrets. Nevertheless, we can borrow from rich industry experience to develop a potential solution. First, as the cost of data collection and storage goes down, it’s better to have predefined data agreements that would allow us to stitch the data within the organization together with external sources. This resembles how network protocols evolved over time, from proprietary to open protocols. Second, companies can make more datasets publicly available, in line with greater expectations of corporate transparency. Third, companies need to become more responsible for the data they produce, how they store it, and how they disseminate it.
Organizations should proactively fit their datasets to public standards rather than waiting for data scientists to ask them to do so. This needs to be part of the broader PR strategy of any modern enterprise or government that seeks to build a positive reputation.
Datasets may be biased and unreliable. That doesn’t mean AI is wrong. It means we all need to work to clean up the bias.
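As a rough illustration of the first point, a predefined data agreement could be as simple as a shared list of required fields and types that both internal and external records are checked against before being combined. The sketch below is hypothetical: the field names, the `AGREEMENT` mapping, and the `conforms` and `stitch` helpers are assumptions made for this example, not an existing standard or library.

```python
# Hypothetical shared "data agreement": field name -> required Python type.
AGREEMENT = {"region": str, "date": str, "cases": int}


def conforms(record, agreement=AGREEMENT):
    """Return True if the record has every agreed field with the agreed type."""
    return all(isinstance(record.get(field), typ)
               for field, typ in agreement.items())


def stitch(internal, external, agreement=AGREEMENT):
    """Combine conforming records from both sources, tagging their origin."""
    accepted, rejected = [], []
    for source, records in (("internal", internal), ("external", external)):
        for record in records:
            if conforms(record, agreement):
                # Keep only the agreed fields, plus a note of where they came from.
                row = {field: record[field] for field in agreement}
                row["source"] = source
                accepted.append(row)
            else:
                rejected.append(record)
    return accepted, rejected


# Invented example records.
internal = [{"region": "A", "date": "2020-04-01", "cases": 12}]
external = [
    {"region": "B", "date": "2020-04-01", "cases": "7"},   # wrong type for cases
    {"region": "C", "date": "2020-04-01", "cases": 3, "notes": "estimate"},
]

accepted, rejected = stitch(internal, external)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Records that do not match the agreement are set aside rather than silently merged, so a mismatch between sources becomes visible before it reaches a model.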
Disclaimer: Any views and/or opinions expressed in this post by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Huawei Technologies.