Data Understanding


“The data understanding phase starts with an initial data collection. The analyst then proceeds to increase familiarity with the data, to identify data quality problems, to discover initial insights into the data, or to detect interesting subsets to form hypotheses about hidden information. The data understanding phase involves four steps, including the collection of initial data, the description of data, the exploration of data, and the verification of data quality”.[1]

All of these steps are aimed at identifying the data available. At this stage, you need to know which data you will work with and start deciding how the main data protection principles will be implemented. You should consult the Ethics and data protection document from 14 November 2018[2] to comply with legal and ethical requirements. If you use data from social networks, the information provided in Box 4, Using ‘open source’ data, on page 13 is particularly relevant.

You should also be aware that databases containing personal data about prosecutions, criminal convictions and offences are sensitive, and that you, as a developer, will normally not be able to access them.

Main actions that need to be addressed

At this stage, a large number of fundamental issues related to the protection of personal data need to be addressed. The decisions made here determine whether principles such as data minimization, privacy by design or by default, lawfulness, fairness and transparency are adequately implemented. Communication between ethics and legal experts, on the one hand, and project developers, on the other, has to be established in order to realise the principles of “privacy by design” or “by default”.

Deciding on the types of data to be processed

According to the GDPR, the “controller shall implement appropriate technical and organizational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed. That obligation applies to the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility. In particular, such measures shall ensure that by default personal data are not made accessible without the individual’s intervention to an indefinite number of natural persons.”[3] (see “Data Protection by Design and by Default” in “Concepts” chapter). This requirement deserves particular attention at this stage, since decisions about the type of data to be used are often taken now.

Thus, check whether you really need vast amounts of data. Focused “smart data” might be much more useful than big data. Of course, using smart, well-prepared data might involve a huge effort in terms of unification, homogenization, etc., but it will help to implement the principle of data minimization much more efficiently. To this end, having expertise available to select relevant features is essential. This step also involves checking the necessity of processing for each category of data. This means demonstrating that no alternative measure or method that is less infringing from a data protection and human rights perspective could achieve the same result.

Furthermore, you should try to limit the resolution of the data to what is minimally necessary for the purposes pursued by the processing. You should also determine an optimal level of data aggregation before starting the processing (see “Adequate, relevant and limited” part of the “Data minimization” section in “Principles” chapter). In the case of AI applied to crime prediction, prevention or investigation, the possible level of data aggregation, i.e. anonymization of data, is undoubtedly limited, at least for later implementations and uses of the developed systems. As a primary objective is to identify (potential) perpetrators, it must at least be possible to (re-)personalize data on potential threats.
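As an illustration of limiting resolution before processing, the following minimal Python sketch snaps incident coordinates to a coarse grid cell and truncates timestamps to multi-hour blocks, then keeps only counts per cell and block. The function name `coarsen`, the cell size, the block length and the sample records are all hypothetical assumptions chosen for the example. The appropriate granularity depends on the purpose of the processing and has to be decided case by case.

```python
import math
from collections import Counter
from datetime import datetime
from typing import Tuple


def coarsen(lat: float, lon: float, when: datetime,
            cell_deg: float = 0.01, hour_block: int = 6) -> Tuple[Tuple[int, int], datetime]:
    """Reduce spatial and temporal resolution of a single record.

    cell_deg and hour_block are illustrative defaults, not recommended values.
    """
    # Replace exact coordinates by the index of the grid cell they fall in.
    cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
    # Replace the exact time by the start of its multi-hour block.
    block_start = when.replace(hour=(when.hour // hour_block) * hour_block,
                               minute=0, second=0, microsecond=0)
    return cell, block_start


# Hypothetical sample records: (latitude, longitude, timestamp).
incidents = [
    (52.5208, 13.4094, datetime(2020, 5, 15, 14, 37)),
    (52.5231, 13.4012, datetime(2020, 5, 15, 16, 5)),
    (52.5312, 13.3849, datetime(2020, 5, 15, 23, 50)),
]

# Only the aggregated counts are passed on; exact values are discarded.
counts = Counter(coarsen(lat, lon, when) for lat, lon, when in incidents)
```

Coarsening of this kind reduces detail but does not by itself guarantee anonymity, in particular where a cell and time block contain very few records.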

Data minimization might be complicated in the case of deep learning, where differentiation by features might be impossible. There is an efficient way to regulate the amount of data gathered and increase it only if it seems necessary: the learning curve. You should start by collecting and using a limited amount of training data, and then monitor the model’s accuracy as it is fed with new data. Once accuracy stops improving noticeably, collecting further data is hard to justify as necessary.
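The learning-curve approach can be sketched as follows. This is a minimal illustration, not a definitive procedure: the function `smallest_sufficient_size`, the callback `train_and_score`, the threshold `min_gain` and the toy accuracy curve are assumptions introduced for the example. In practice, the callback would train the model on the first n records and return its accuracy on a fixed held-out validation set.

```python
from typing import Callable, List, Sequence, Tuple


def smallest_sufficient_size(
    sizes: Sequence[int],
    train_and_score: Callable[[int], float],
    min_gain: float = 0.01,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Walk up the learning curve and stop once extra data stops helping.

    sizes: increasing candidate training-set sizes.
    train_and_score: hypothetical callback returning validation accuracy
        for a model trained on n records.
    min_gain: smallest accuracy gain that justifies using more data.
    """
    if not sizes:
        raise ValueError("sizes must not be empty")
    history: List[Tuple[int, float]] = []
    chosen = sizes[0]
    for n in sizes:
        score = train_and_score(n)
        if history and score - history[-1][1] < min_gain:
            history.append((n, score))
            break  # plateau reached: keep the previous, smaller size
        history.append((n, score))
        chosen = n
    return chosen, history


def toy_score(n: int) -> float:
    """Toy stand-in for a real training run: accuracy saturates as n grows."""
    return 0.9 - 0.4 / (1 + n / 500)


chosen, history = smallest_sufficient_size(
    [500, 1000, 2000, 4000, 8000, 16000], toy_score, min_gain=0.03
)
```

With the toy curve above, the loop stops at the first size whose gain falls below `min_gain` and returns the size before it. The recorded `history` can be kept as documentation of why no further data was collected.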

Checking legitimate dataset usage

Datasets can be obtained in different ways. Firstly, the developer might opt for acquiring or gaining access to a database that has already been built by someone else. If this is the case, you should be particularly careful, since acquiring access to a database raises many legal issues (see “Purchasing access to a database” section in “Actions and Tools” chapter).[4]

Secondly, the most common alternative to this consists of building a database. Quite obviously, in this case you have to ensure that you comply with all legal requirements imposed by the GDPR to create a database (see “Creating a database” section in “Main tools and actions” chapter).

Thirdly, you might choose an alternative path. You can mix licensed data from third parties with your own dataset so as to create a huge training dataset and another one for validation purposes. This could bring some issues, such as the possibility that the combination of different datasets provides additional information about the data subjects. For instance, it could allow you to identify data subjects, something that was not possible using only one of the datasets. That could involve de-anonymizing anonymized data and creating new personal information that was not contained in the original dataset. This situation would entail significant ethical and legal issues. For instance, “if data subjects gave informed consent for the processing of personal information in the original data sets for particular purposes, they did not necessarily by extension also give permission for the merging of data sets and for data mining that reveals new information. New information produced in this way may also be based on probabilities or conjectures, and therefore be false, or contain biases in the portrayal of persons.”[5] Therefore, you should try to avoid such consequences by ensuring that merging datasets does not work against data subjects’ rights and interests.
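One way to get a first indication of whether merging makes individuals easier to single out is to compare the size of the smallest group of records sharing the same combination of indirectly identifying attributes before and after the merge (a k-anonymity style check). The sketch below is a simplified illustration under stated assumptions: the helper `smallest_group`, the attribute names and the sample records are hypothetical, and a real assessment would need to consider further re-identification risks.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group(records: List[Dict[str, str]], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values.

    A result of 1 means at least one record is unique on these attributes.
    """
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical own dataset: no record is unique on postcode and age band.
own_data = [
    {"id": "1", "postcode": "1011", "age_band": "30-39"},
    {"id": "2", "postcode": "1011", "age_band": "30-39"},
    {"id": "3", "postcode": "1012", "age_band": "40-49"},
    {"id": "4", "postcode": "1012", "age_band": "40-49"},
]

# Hypothetical licensed third-party dataset keyed by the same identifier.
licensed_data = {
    "1": {"occupation": "nurse"},
    "2": {"occupation": "teacher"},
    "3": {"occupation": "driver"},
    "4": {"occupation": "driver"},
}

merged = [{**row, **licensed_data[row["id"]]} for row in own_data]

k_before = smallest_group(own_data, ["postcode", "age_band"])
k_after = smallest_group(merged, ["postcode", "age_band", "occupation"])
```

In this toy example the smallest group shrinks from two records to one after merging, which signals that the combined dataset singles out individuals who were not distinguishable in either source alone.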

Finally, if you use several datasets that pursue different purposes, you should implement adequate measures to separate the different processing activities. Otherwise, you could easily use data for a purpose for which it was not collected, which would conflict with the purpose limitation principle.
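A simple technical measure that supports this separation is to attach the permitted purposes to each dataset and to refuse any read that does not state one of them. The following Python sketch shows the idea. The class names, purpose labels and sample record are hypothetical, and such a check complements organisational measures rather than replacing them.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


class PurposeMismatch(PermissionError):
    """Raised when data is requested for a purpose it was not collected for."""


@dataclass(frozen=True)
class PurposeBoundDataset:
    name: str
    allowed_purposes: FrozenSet[str]
    records: Tuple[Dict[str, str], ...]

    def read(self, purpose: str) -> List[Dict[str, str]]:
        """Return copies of the records only if the stated purpose is permitted."""
        if purpose not in self.allowed_purposes:
            raise PurposeMismatch(
                f"dataset '{self.name}' was not collected for purpose '{purpose}'"
            )
        return [dict(record) for record in self.records]


# Hypothetical dataset collected with consent for testing a model only.
video_trial = PurposeBoundDataset(
    name="video_trial",
    allowed_purposes=frozenset({"model_testing"}),
    records=({"clip": "c-001"},),
)

rows = video_trial.read("model_testing")

try:
    video_trial.read("marketing")
    blocked = False
except PurposeMismatch:
    blocked = True
```

Requiring the caller to name a purpose on every read also produces a natural point at which access can be logged for accountability.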

Be aware that the above-mentioned measures are only sufficient for the research project execution phase. Informed consent will generally be of very limited use in the context of law enforcement activity. The same holds for the creation and use of dummy or synthetic data. The use of synthetic data may still involve issues of potential re-identification as well as the question of whether one can trust such data when training AI algorithms. All these measures may effectively help to mitigate or eliminate ethical or legal issues for the research phase. It is essential to ensure that the datasets needed for real-world implementations also comply with the ethical and legal requirements imposed by EU and national member state regulations; this also holds for the use of police- or government-owned datasets. Also be aware that it might be difficult or even impossible to get access to the sufficiently large real datasets required for practical training of the AI tool.

Selecting appropriate legal basis for processing

You must decide on the legal basis for processing before starting it, document your decision (along with the purposes) and include the reasons for your choice (see “Accountability” section in “Principles” chapter).

You should select the legal basis that most closely reflects the true nature of your processing of personal data. If human participants are involved, the relationship with the participants and the purpose of the processing must also be considered. This decision is key, since the legal basis for processing cannot be changed unless there are solid reasons that justify it (see “Purpose limitation” section in “Principles” chapter).

In the case of AI tools developed for the purpose of crime prediction, prevention, et cetera, you must again distinguish between the research phase and later implementations. For the research phase you may be able to use consent as the legal ground for processing (see “Consent” section in “Main Concepts” chapter), depending on the concrete involvement of human participants. Examples could be AI tools using biometric identification or the interpretation of video data, which require the involvement of human participants for testing. Consent could also form a valid legal ground if you are reusing data that was already gathered for another purpose and consent was the basis that allowed the primary use of the data. The GDPR allows the reuse of data for scientific purposes: article 5.1 (b) states that further processing for scientific research purposes shall not be considered to be incompatible with the initial purposes (‘purpose limitation’). Thus, in principle, you could reuse those data on the basis of the original consent. However, you must keep in mind that, according to article 9.4 of the GDPR, “Member States may maintain or introduce further conditions, including limitations, with regard to the processing of genetic data, biometric data or data concerning health.” Thus, your relevant national regulation may well introduce exceptions or specific conditions for the reuse of personal data. In any case, you should always remember that your information duties remain. You should provide the data subject, prior to any further processing of their data, with information on that other purpose and any further relevant information as referred to in paragraph 2 of Article 13 GDPR.

Please keep in mind that the above provisions only hold for conducting the research as such. Future uses of the developed systems need to conform to the applicable legislation of the EU and of member states concerning law enforcement activities. Also, be aware that developing technologies that do not comply with applicable regulations, ethics principles or European values would be a waste of effort and resources.

Reuse of data

At present, there is a lively discussion about the reuse of data for research purposes. According to article 5.1 (b) of the GDPR, further processing for scientific purposes shall not be considered incompatible with the initial purposes. Thus, unless your national regulation states otherwise, you can reuse the available data for research purposes, since these are deemed compatible with the purpose for which the data were originally collected.

However, the EDPS argues that, “in order to ensure respect for the rights of the data subject, the compatibility test under Article 6(4) should still be considered prior to the reuse of data for the purposes of scientific research, particularly where the data was originally collected for very different purposes or outside the area of scientific research. Indeed, according to one analysis from a medical research perspective, applying this test should be straightforward”.[6] According to this interpretation, you should only reuse personal data if the circumstances of article 6.4 apply. In this context, please also check the applicability of article 10: “Processing of personal data relating to criminal convictions and offences or related security measures based on Article 6(1) shall be carried out only under the control of official authority or when the processing is authorized by Union or Member State law providing for appropriate safeguards for the rights and freedoms of data subjects.”




[1] Colin Shearer, The CRISP-DM Model: The New Blueprint for Data Mining, p. 15.


[3] Article 25(2) GDPR.

[4] Yeong Zee Kin, Legal Issues in AI Deployment, At: Accessed 15 May 2020

[5] SHERPA, Guidelines for the Ethical Development of AI and Big Data Systems: An Ethics by Design approach, 2020, p. 38. At: Accessed 15 May 2020

[6] EDPS, A Preliminary Opinion on data protection and scientific research, 6 January 2020, p. 23.

