Data preparation


“The data preparation phase covers all activities to construct the final data set or the data that will be fed into the modeling tool(s) from the initial raw data. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools. The five steps in data preparation are the selection of data, the cleansing of data, the construction of data, the integration of data, and the formatting of data.”[1]

This stage includes all activities needed to construct the final dataset that is fed into the model, from initial raw data. It involves the following five tasks, not necessarily performed sequentially:

  1. Select data. Decide on the data to be used for analysis, based on relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types.
  2. Clean data. Raise data quality to the required level, for example by selecting clean subsets of the data, inserting defaults, and estimating missing data by modeling.
  3. Construct data. Build new data by producing derived attributes, new records, or transformed values for existing attributes.
  4. Integrate data. Combine data from multiple tables or records to create new records or values.
  5. Format data. Make syntactic modifications to data that might be required by the modeling tool.
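
As a rough illustration of how the five tasks can fit together, the following minimal sketch walks a tiny, made-up patient table through each of them using only the Python standard library. Every field name, value, and threshold here (for instance the 38.0 °C fever cut-off) is a hypothetical assumption chosen for the example, not something prescribed by CRISP-DM or by this scenario.

```python
from statistics import median

# Hypothetical raw records; field names and values are illustrative only.
raw_patients = [
    {"patient_id": "p1", "age": 54, "temp_c": 38.2, "notes": "n/a"},
    {"patient_id": "p2", "age": None, "temp_c": 37.1, "notes": "n/a"},
    {"patient_id": "p3", "age": 61, "temp_c": 39.0, "notes": "n/a"},
]
raw_tests = {"p1": "positive", "p2": "negative", "p3": "positive"}


def prepare(patients, tests):
    # 1. Select: keep only attributes relevant to the modeling goal.
    selected = [{k: r[k] for k in ("patient_id", "age", "temp_c")} for r in patients]

    # 2. Clean: estimate missing ages with the median of the known ages.
    known_ages = [r["age"] for r in selected if r["age"] is not None]
    fill_age = median(known_ages)
    for r in selected:
        if r["age"] is None:
            r["age"] = fill_age

    # 3. Construct: derive a new attribute (assumed 38.0 C fever threshold).
    for r in selected:
        r["fever"] = r["temp_c"] >= 38.0

    # 4. Integrate: combine with test results held in a second table.
    for r in selected:
        r["pcr_result"] = tests[r["patient_id"]]

    # 5. Format: emit fixed-order numeric rows, as a modeling tool might expect.
    return [
        [float(r["age"]), r["temp_c"], int(r["fever"]), int(r["pcr_result"] == "positive")]
        for r in selected
    ]


rows = prepare(raw_patients, raw_tests)
```

In practice the tasks are often revisited in a different order, and median imputation is only one of several possible ways to estimate missing values.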

Main actions that need to be addressed

Introducing safeguards foreseen in Art. 89 of GDPR

Since you are using data for scientific purposes, you must prepare them according to the safeguards foreseen in Article 89 of the GDPR. If the purposes of your research can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, for instance via pseudonymization, those purposes should be fulfilled in that manner. If this is not possible, you must introduce safeguards ensuring that technical and organizational measures are in place to implement the principle of data minimization adequately. Please consider the concrete rules established by your national regulation regarding safeguards. Consult with your DPO.
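
One possible way to combine pseudonymization with data minimization at this stage is sketched below. It is an assumption-laden illustration, not a compliance recipe: the key value, the field names, and the list of attributes treated as necessary are all hypothetical, and the choice of a keyed hash (HMAC-SHA256) is one common technique among several.

```python
import hashlib
import hmac

# Assumption: the key is stored separately from the research dataset,
# under the controller's access controls. This value is a placeholder.
SECRET_KEY = b"replace-with-a-securely-stored-key"

# Assumption: only these attributes are needed for the research purpose.
NEEDED_FIELDS = ("age", "temp_c", "pcr_result")


def pseudonymize_id(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash (HMAC-SHA256) of the identifier."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record: dict) -> dict:
    """Replace the direct identifier and drop attributes that are not needed."""
    out = {"pseudonym": pseudonymize_id(record["patient_id"])}
    out.update({k: record[k] for k in NEEDED_FIELDS})
    return out


# Hypothetical input record.
record = {
    "patient_id": "ES-000123",
    "name": "Jane Doe",
    "age": 54,
    "temp_c": 38.2,
    "pcr_result": "positive",
}
safe_record = minimize(record)
```

Because the same identifier always maps to the same pseudonym, records belonging to one patient can still be linked. Keep in mind that pseudonymized data remain personal data under the GDPR, so the remaining obligations continue to apply.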

Ensuring accuracy of processing of personal data

According to the GDPR, data must be accurate (see “Accuracy” section in “Principles” chapter).

This means that data are correct and up-to-date, but also refers to the accuracy of the analytics performed. The EDPB has highlighted the importance of the accuracy of the profiling or the (not exclusively) automated decision-making process at all stages (from the collection of the data to the application of the profile to the individual).[2]

Controllers are responsible for ensuring accuracy. Therefore, once you have finished collecting the data, you should implement adequate tools to guarantee their accuracy. This typically means making some fundamental decisions on the technical and organizational measures that will put this principle into practice (see “Related technical and organizational measures” subsection in the “Accuracy” section in “Principles” chapter). Since most of the data come from patients and most of them are quantitative, you can assume that they are accurate. In any case, accuracy requires an adequate implementation of measures devoted to facilitating the data subjects’ right to rectification (see “Right to rectification” section in “Data subjects’ rights” chapter).
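
Even where quantitative values are presumed accurate, a simple plausibility check can catch transcription errors before the data reach the model. The sketch below shows one way such a technical measure might look. The field names and ranges are illustrative assumptions, not clinical reference values, and flagged records would still need to be reviewed by a person.

```python
# Assumption: plausible ranges are illustrative, not clinical reference values.
PLAUSIBLE_RANGES = {
    "age": (0, 120),
    "temp_c": (30.0, 45.0),
    "spo2_pct": (50, 100),
}


def find_accuracy_issues(record: dict) -> list:
    """Return human-readable issues for missing or out-of-range values."""
    issues = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: missing")
        elif not low <= value <= high:
            issues.append(f"{field}: {value} outside [{low}, {high}]")
    return issues


# Hypothetical records; the second one contains an entry error and a gap.
records = [
    {"id": "a", "age": 54, "temp_c": 38.2, "spo2_pct": 93},
    {"id": "b", "age": 540, "temp_c": 37.0, "spo2_pct": None},
]

# Keep only the records that have at least one issue, for manual review.
flagged = {r["id"]: find_accuracy_issues(r) for r in records if find_accuracy_issues(r)}
```

Records flagged in this way can be corrected at source or excluded, and the same review channel can be used when a data subject exercises the right to rectification.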

Focusing on profiling issues

In the case of a database that will serve to train or validate an AI tool, there is a particularly relevant obligation to inform the data subjects that their data might be used for automated decision-making or profiling that affects them, unless you can guarantee that the tool will in no way produce these consequences. Even though automated decision-making can hardly happen in the context of research, developers should keep an eye on this issue. Profiling, on the other hand, might bring some problems to AI development.

According to Article 22(4) of the GDPR, automated decisions that involve special categories of personal data, such as the health data that you are using, are permitted only if the data subject has consented, or if they are conducted on a legal basis. This rule applies not only when the observed data fit into this category, but also if the alignment of different types of personal data can reveal sensitive information about individuals or if inferred data enter into that category.

The following additional actions can be extremely useful to avoid profiling where it is not needed:

  • Consider the system requirements necessary to support a meaningful human review from the design phase, particularly the interpretability requirements and an effective user-interface design to support human reviews and interventions;
  • Design and deliver appropriate training and support for human reviewers; and
  • Give staff the appropriate authority, incentives and support to address or escalate individuals’ concerns and, if necessary, override the AI tool’s decision.

If you proceed with profiling or automated decisions, you must inform the data subjects about your decision and provide all necessary information according to the GDPR and national regulation, if applicable.

Selecting non-biased data

Bias is one of the main issues in AI development, and it contravenes the fairness principle. It can have many different causes. When data are gathered, they may contain socially constructed biases, inaccuracies, errors and mistakes. Datasets can also become biased through malicious actions: feeding malicious data into an AI tool may change its behavior, particularly with self-learning systems.[3] Therefore, the composition of the databases used for training raises crucial ethical and legal questions, not only questions of efficiency or of a technical nature.

You need to address these issues prior to training the algorithm. Identifiable and discriminatory bias should be removed in the dataset building phase where possible. In the case of COVID-19, distinctions could be made between patients depending on their age, gender, or ethnic group, for instance. You must ensure that these factors are taken into consideration when you select the data. This also means that the teams in charge of selecting the data to be integrated in the datasets should be composed of people who reflect the diversity that the AI development is expected to show. Finally, always keep in mind that, if your data are mainly related to a concrete group, for example the Caucasian population more than forty years old, you must declare that the algorithm has been trained on this basis and, thus, that it might not work as well in other population groups.
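
A first, very coarse check on dataset composition is to measure how each population group is represented before training. The sketch below is one possible approach under stated assumptions: the attribute names, the group labels, and the 20% threshold are hypothetical, and a group that is entirely absent from the data will not be detected by this check.

```python
from collections import Counter


def group_shares(records, attribute):
    """Return each group's share of the records for one attribute."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def under_represented(records, attribute, min_share=0.2):
    """List groups whose share falls below min_share (a hypothetical threshold)."""
    shares = group_shares(records, attribute)
    return sorted(g for g, s in shares.items() if s < min_share)


# Hypothetical dataset: 90 records for patients over forty, 10 for younger ones.
dataset = (
    [{"age_band": "40+", "sex": "F"}] * 45
    + [{"age_band": "40+", "sex": "M"}] * 45
    + [{"age_band": "<40", "sex": "F"}] * 6
    + [{"age_band": "<40", "sex": "M"}] * 4
)

flagged_groups = under_represented(dataset, "age_band")
```

The resulting shares can also feed the declaration of which populations the algorithm was trained on. Balanced representation alone does not guarantee an unbiased model, so a check of this kind complements rather than replaces the review described above.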




[1] Colin Shearer, The CRISP-DM Model: The New Blueprint for Data Mining, p. 16.

[2] Guidelines on Automated individual decision-making and Profiling for the purposes of Regulation 2016/679 (wp251rev.01), 22/08/2018, p. 13; Ducato, Rossana, Private Ordering of Online Platforms in Smart Urban Mobility: The Case of Uber’s Rating System, CRIDES Working Paper Series no. 3/2020, 2 February 2020, updated on 26 July 2020, p. 20-21, at:

[3] High-Level Expert Group on AI, Ethics guidelines for trustworthy AI, 2019, p. 17. At: 15 May 2020

