Data preparation


“The data preparation phase covers all activities to construct the final data set or the data that will be fed into the modeling tool(s) from the initial raw data. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools. The five steps in data preparation are the selection of data, the cleansing of data, the construction of data, the integration of data, and the formatting of data.”[1]

This stage includes all activities needed to construct the final dataset that is fed into the model, from initial raw data. It involves the following five tasks, not necessarily performed sequentially:

  1. Select data: Decide on the data to be used for analysis, based on relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types.
  2. Clean data: Raise data quality to the required level, for example by selecting clean subsets of the data, inserting defaults, and estimating missing data by modeling.
  3. Construct data: Construct new data by producing derived attributes, new records, or transformed values for existing attributes.
  4. Integrate data: Combine data from multiple tables or records to create new records or values.
  5. Format data: Make syntactic modifications to data that might be required by the modeling tool.
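The five tasks above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the records, field names, and the `prepare` function are invented for this example and do not come from any real dataset or tool.

```python
from datetime import date

# Hypothetical raw records; every field name and value is invented for illustration.
incidents = [
    {"id": 1, "district": "north", "reported": "2020-01-03", "category": "Burglary"},
    {"id": 2, "district": "north", "reported": "2020-01-04", "category": None},
    {"id": 3, "district": "south", "reported": "2020-01-04", "category": "theft "},
    {"id": 4, "district": "east", "reported": "2020-01-05", "category": "Theft"},
]
# Hypothetical second table; "east" is deliberately missing.
district_population = {"north": 12000, "south": 8000}


def prepare(records, population):
    # 1. Select: keep only records from districts with known population data.
    selected = [r for r in records if r["district"] in population]

    prepared = []
    for r in selected:
        # 2. Clean: normalise category labels and insert a default for missing ones.
        category = (r["category"] or "unknown").strip().lower()
        # 3. Construct: derive a new attribute (weekday, Monday = 0) from the date.
        weekday = date.fromisoformat(r["reported"]).weekday()
        # 4. Integrate: join in district-level data from the second table.
        pop = population[r["district"]]
        # 5. Format: emit a flat row with a fixed column order for a modeling tool.
        prepared.append((r["id"], r["district"], category, weekday, pop))
    return prepared


rows = prepare(incidents, district_population)
for row in rows:
    print(row)
```

In practice the tasks are rarely this cleanly separated, and each choice (which records to drop, which default to insert) should be documented, because it shapes what the model can learn.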

Main actions that need to be addressed

Introducing the safeguards foreseen in Article 89 GDPR

Since you are using data for scientific purposes, you must prepare them according to the safeguards foreseen by the GDPR in Article 89. If the purposes of your research can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, i.e., via pseudonymization, those purposes should be fulfilled in that manner. If this is not possible, you must introduce safeguards ensuring that technical and organizational measures enable an adequate implementation of the principle of data minimization. Please consider the concrete rules established by your national regulation regarding safeguards. Consult with your DPO.
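One common pseudonymization technique is to replace direct identifiers with a keyed hash. The following is a minimal sketch under stated assumptions, not a complete safeguard: the record layout and the `pseudonymize` helper are hypothetical, and the secret key is assumed to be stored separately from the research dataset under its own access controls. Note that pseudonymized data remain personal data under the GDPR as long as re-identification with additional information is possible.

```python
import hashlib
import hmac
import secrets


def pseudonymize(identifier: str, key: bytes) -> str:
    """Return a stable pseudonym for an identifier using HMAC-SHA256."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: in a real project the key is generated once and kept apart
# from the dataset; it is generated inline here only to keep the sketch runnable.
key = secrets.token_bytes(32)

# Hypothetical source record with a direct identifier.
record = {"name": "Jane Doe", "district": "north", "category": "theft"}

# The research copy keeps only the pseudonym and the attributes needed.
research_record = {
    "subject": pseudonymize(record["name"], key),
    "district": record["district"],
    "category": record["category"],
}
print(research_record)
```

The same identifier always maps to the same pseudonym under a given key, so records can still be linked for research, while anyone without the key cannot recompute the mapping from names.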

Ensuring accuracy of processing of personal data

According to the GDPR, data must be accurate (see “Accuracy” section in “Principles” chapter). This means that the data processed must be correct and up to date. Controllers are responsible for ensuring accuracy. Therefore, once you have finished collecting the data, you should implement adequate tools to guarantee their accuracy. This typically requires you to make some fundamental decisions on the technical and organizational measures that will put this principle into practice (see “Related technical and organizational measures” subsection in the “Accuracy” section in “Principles” chapter). In the case of crime prediction, the data will probably come from quite different sources with no standardised quality requirements, and most of them will probably be qualitative, so you cannot assume that they are accurate per se. This is primarily because such data might be based on the individual ratings of different people, while the data subjects might not even know that this kind of data is stored about them.

In any case, accuracy requires an adequate implementation of measures devoted to facilitating the data subjects’ right to rectification (see “Right to rectification” section in “Data subjects’ rights” chapter).

Also ensure that the tools you implement produce results that are as accurate as possible. The types of false positives and false negatives should be defined in advance, during the data preparation phase. False results are one of the essential issues affecting individuals’ fundamental rights.
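To make the terms concrete, the sketch below counts false positives and false negatives for a binary prediction. It is an illustration only: the labels are invented, `1` is assumed to mean "flagged" or "event occurred", and `error_counts` is a hypothetical helper, not part of any specific tool.

```python
def error_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["tp"] += 1
        elif p == 1 and a == 0:
            counts["fp"] += 1  # flagged, but nothing happened
        elif p == 0 and a == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1  # not flagged, but the event occurred
    return counts


# Invented example labels.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]

c = error_counts(actual, predicted)
false_positive_rate = c["fp"] / (c["fp"] + c["tn"])
false_negative_rate = c["fn"] / (c["fn"] + c["tp"])
print(c, false_positive_rate, false_negative_rate)
```

Deciding in advance what counts as a "positive", and which of the two error types is less tolerable in a given use, determines how these rates should later be weighed.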

Focusing on profiling issues

In general, in the case of a database that will serve to train or validate an AI tool, there is a particularly relevant obligation to inform the data subjects that their data might be used for automated decision-making or profiling concerning them. Profiling is particularly problematic in AI development; this also holds for AI tools developed for the purposes of law enforcement agencies (LEAs).

According to Article 22(4) GDPR, automated decisions based on special categories of personal data (Article 9(1)), that is, data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, genetic data, biometric data processed for the purpose of uniquely identifying a natural person, data concerning health, and data concerning a natural person’s sex life or sexual orientation, are permitted only if the data subject has explicitly consented or if they are conducted on a legal basis. This restriction applies not only when the observed data fit into this category, but also if the alignment of different types of personal data can reveal sensitive information about individuals or if inferred data fall into that category.

In the case of crime prediction and prevention, explicit consent from the data subjects will normally only be applicable to voluntary human participants during the R&D phase. Special categories of personal data, for instance political opinions or religious beliefs, may belong to the core data of AI tools applied in the field of terrorism prevention.

Some additional actions that might be extremely useful to avoid automated decision-making if it is not needed are:

  • Consider the system requirements necessary to support a meaningful human review from the design phase. Particularly, the interpretability requirements and effective user-interface design to support human reviews and interventions;
  • Design and deliver appropriate training and support for human reviewers; and
  • Give staff the appropriate authority, incentives and support to address or escalate individuals’ concerns and, if necessary, override the AI tool’s decision.[2]

If you proceed with profiling or automated decisions, you must inform the data subjects about your decision and provide all necessary information according to the GDPR and national regulation, if applicable.

Selecting non-biased data

Bias is one of the main issues involved in AI development, and one that contravenes the fairness principle. It can have many different causes. When data is gathered, it may contain socially constructed biases, inaccuracies, errors and mistakes. Datasets may also be biased as a result of malicious actions. Feeding malicious data into an AI tool may change its behavior, particularly with self-learning systems.[3] Therefore, the composition of the databases used for training raises crucial ethical and legal issues, not only issues of efficiency or of a technical nature.

You need to address these issues before training the algorithm. Identifiable and discriminatory bias should be removed in the dataset-building phase where possible. As past experience has shown, the idea that certain groups of people (Black people, Arabs, Muslims, or foreigners in general) are convicted more often because they break the law more frequently is, in most cases, not valid. They are searched more often, are discriminated against more often by the police, more often encounter excessive violence, arbitrariness or hostility from the police, and therefore more often end up in problematic situations. This observation would most probably hold for any other subset of the population treated in the same way. Therefore, inferring a higher crime rate in areas where many foreigners live might become a self-fulfilling prophecy.

Another example is the assumption that an AI tool produces the right results as soon as they match those produced by humans. Decisions by humans are often biased as well, and the AI tool would most probably perpetuate such discriminatory practices instead of producing more objective results.

If the algorithm is biased, it may also increase the number of false positives or false negatives. False positives may have serious adverse effects on the individuals concerned; false negatives affect society and, of course, the victims of criminal or terrorist activities that could potentially have been avoided.
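One way to check whether errors fall unevenly on different groups is to compute the false positive rate separately per group. The sketch below is illustrative only: the group labels "A" and "B", the records, and the helper name are invented, and a real audit would need far more data and a careful choice of metrics.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """records: iterable of (group, actual, predicted) with binary labels.

    Returns {group: FPR}, or None for a group without any actual negatives.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    groups = set()
    for group, actual, predicted in records:
        groups.add(group)
        if actual == 0:
            negatives[group] += 1
            if predicted == 1:
                false_pos[group] += 1
    return {
        g: (false_pos[g] / negatives[g]) if negatives[g] else None
        for g in sorted(groups)
    }


# Invented evaluation records: (group, actual, predicted).
records = [
    ("A", 0, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 0, 1), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),
]

rates = false_positive_rate_by_group(records)
print(rates)
```

In this invented example, people in group B who did nothing are wrongly flagged twice as often as those in group A, which is the kind of disparity that should trigger a review of the training data.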

You must ensure that the algorithm assesses these factors accordingly when you select the data. This means that the teams in charge of selecting the data to be integrated in the datasets should be composed of people who ensure the diversity that the AI tool is expected to show. Finally, always keep in mind that, if your data mainly relate to one specific group, you must declare that the algorithm has been trained on this basis and that it might therefore not work as well for other population groups.
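A simple first check of whether a dataset is dominated by one group is to compare each group's share of the records with a reference share. The sketch below is a rough illustration under stated assumptions: the group labels, the reference shares, the 0.1 tolerance, and the function name are all hypothetical, and choosing a suitable reference population is itself a judgment that should be documented.

```python
from collections import Counter


def representation_gaps(groups, reference_shares):
    """Return {group: dataset share minus reference share}."""
    counts = Counter(groups)
    total = len(groups)
    return {
        g: counts.get(g, 0) / total - share
        for g, share in reference_shares.items()
    }


# Invented data: group label of each record in a training set.
sample_groups = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
# Assumed reference shares, e.g. from the population the tool will be used on.
reference = {"A": 0.4, "B": 0.3, "C": 0.3}

gaps = representation_gaps(sample_groups, reference)
# Flag groups whose share deviates by more than an (arbitrary) tolerance.
flagged = sorted(g for g, gap in gaps.items() if abs(gap) > 0.1)
print(gaps, flagged)
```

Flagged groups do not prove bias on their own, but they indicate where the declaration about the training basis is needed and where performance on other groups should be tested.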




[1] Colin Shearer, The CRISP-DM Model: The New Blueprint for Data Mining, p. 16.


[3] High-Level Expert Group on AI, Ethics guidelines for trustworthy AI, 2019, p. 17. At: 15 May 2020

