Selecting non-biased data

Bias is one of the main problems in AI development, and it contravenes the fairness principle (see the “Fairness” section in the “Principles” chapter). Bias can arise from many sources. Gathered data may contain socially constructed biases, inaccuracies, errors and mistakes. Datasets can also be biased through malicious action: feeding malicious data into an AI tool may change its behavior, particularly in self-learning systems.[1] For instance, in the case of Tay, a chatbot developed by Microsoft, a large number of Internet users posted racist and sexist comments that fed the algorithm. As a result, Tay started sending racist and sexist tweets after just a few hours of operation. On other occasions, the main problem is that the dataset does not adequately represent the population under consideration for the intended purpose. It therefore contains hidden bias that is carried over to the trained tool, which may lead the model to produce incorrect or discriminatory results.[2]

Therefore, the composition of the databases used for training raises crucial ethical and legal issues, not merely issues of efficiency or of a technical nature. These issues need to be addressed before the algorithm is trained. AI models must “be trained using relevant and correct data and it must learn which data to emphasize. The model must not emphasize information relating to racial or ethnic origin, political opinion, religion or belief, trade union membership, genetic status, health status or sexual orientation if this would lead to arbitrary discriminatory processing.”[3] Identifiable and discriminatory bias should be removed in the dataset building phase where possible.
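One way to look for identifiable bias while the dataset is being built is to compare how often the target label is positive in each group. The following is a minimal sketch in Python, not a complete fairness audit. The field names `group` and `hired`, the toy records and the helper functions are hypothetical and serve only as illustration; what counts as an acceptable ratio is a legal and ethical judgment that the code does not settle.

```python
from collections import defaultdict


def positive_rate_by_group(records, group_key, label_key):
    """Return {group: share of records in that group with a truthy label}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[label_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Ratio of the lowest to the highest positive rate (1.0 means equal rates)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group has any positive label, so rates are equal
    return min(rates.values()) / highest


# Hypothetical toy training records (illustrative only, not real data).
records = [
    {"group": "A", "hired": 1},
    {"group": "A", "hired": 1},
    {"group": "A", "hired": 0},
    {"group": "A", "hired": 1},
    {"group": "B", "hired": 1},
    {"group": "B", "hired": 0},
    {"group": "B", "hired": 0},
    {"group": "B", "hired": 0},
]

rates = positive_rate_by_group(records, "group", "hired")
ratio = rate_ratio(rates)
print(rates, round(ratio, 2))
```

A ratio well below 1.0 does not prove discrimination by itself, but it signals that the labels differ markedly between groups and that the dataset deserves closer review before training.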

Box 18: Understanding biases: the gorilla case

In 2015, a software engineer, Jacky Alciné, denounced the image recognition algorithms used in Google Photos, which had classified some black people as “gorillas.” Google acknowledged the issue immediately and promised to fix it.

This glitch was produced by a serious mistake in the training phase. The algorithm was trained to recognize people with a dataset composed primarily of photographs of Caucasian people. As a consequence, the algorithm considered a black person to be much more similar to the object “gorilla”, which it had been trained to recognize, than to the object “human”. This example clearly shows the importance of data selection for training purposes.

Thus, to integrate ethical requirements into this phase, the AI developer should evaluate the ethical consequences of data selection in relation to diversity and make changes, if necessary. Indeed, the controller “should use appropriate mathematical or statistical procedures for the profiling, implement technical and organizational measures appropriate to ensure, in particular, that factors which result in inaccuracies in personal data are corrected and the risk of errors is minimized, secure personal data in a manner that takes account of the potential risks involved for the interests and rights of the data subject and that prevents, inter alia, discriminatory effects on natural persons on the basis of racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, or that result in measures having such an effect.”[4]
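As an illustration of evaluating data selection in relation to diversity, the sketch below compares each group's share in a training set with its share in a reference population and flags groups that fall short. It is a simplified example under stated assumptions: the group names, the reference shares, the 5 percentage point tolerance and the helper functions are all hypothetical, and a real assessment would need reference figures suited to the intended purpose of the tool.

```python
from collections import Counter


def representation_gaps(samples, reference_shares):
    """Return {group: dataset share minus reference share} for every group."""
    total = len(samples)
    if total == 0:
        raise ValueError("samples must not be empty")
    counts = Counter(samples)
    groups = set(counts) | set(reference_shares)
    return {
        group: counts.get(group, 0) / total - reference_shares.get(group, 0.0)
        for group in groups
    }


def underrepresented(gaps, tolerance=0.05):
    """List groups whose dataset share is below the reference by more than tolerance."""
    return sorted(group for group, gap in gaps.items() if gap < -tolerance)


# Hypothetical group membership of 100 training examples (illustrative only).
samples = ["group_a"] * 80 + ["group_b"] * 15 + ["group_c"] * 5

# Hypothetical shares of each group in the population the tool will serve.
reference_shares = {"group_a": 0.50, "group_b": 0.25, "group_c": 0.25}

gaps = representation_gaps(samples, reference_shares)
flagged = underrepresented(gaps)
print(flagged)
```

Groups flagged in this way are candidates for collecting additional data or rebalancing before training. The check only covers attributes that are actually recorded, so it complements rather than replaces the human review described in this section.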

Controllers should keep in mind what makes this issue distinctive: selecting a dataset for training involves decisions and choices that are sometimes made almost unconsciously (whereas coding a traditional, deterministic algorithm is always a deliberate operation). Whoever trains an algorithm in some way builds into it their own way of seeing the world and their values or, at the very least, the values more or less directly inherent in data gathered from the past.[5] This means that the teams in charge of selecting the data for the datasets should be composed so as to reflect the diversity that the AI development is expected to show. In any case, legal expertise on anti-discrimination regulation may be relevant here.


[1] High-Level Expert Group on AI (2019) Ethics guidelines for trustworthy AI. European Commission, Brussels, p.17. Available at: (accessed 15 May 2020).

[2] For a definition of direct and indirect discrimination, see, for instance, Article 2 of Council Directive 2000/78/EC of 27 November 2000 establishing a general framework for equal treatment in employment and occupation. See also Article 21 of the Charter of Fundamental Rights of the EU.

[3] Norwegian Data Protection Authority (2018) Artificial intelligence and privacy. Norwegian Data Protection Authority, Oslo. Available at: (accessed 15 May 2020).

[4] Recital 71 of the GDPR.

[5] CNIL (2017) How can humans keep the upper hand? The ethical matters raised by algorithms and artificial intelligence. Commission Nationale de l’Informatique et des Libertés, Paris, p.34. Available at: (accessed 15 May 2020).

