Data Understanding


“The data understanding phase starts with an initial data collection. The analyst then proceeds to increase familiarity with the data, to identify data quality problems, to discover initial insights into the data, or to detect interesting subsets to form hypotheses about hidden information. The data understanding phase involves four steps, including the collection of initial data, the description of data, the exploration of data, and the verification of data quality”.[1]

At this stage, initial data collection takes place, and an initial study of the data is performed. It involves four sequential tasks:

  • Collect initial data
  • Describe data
  • Explore data
  • Verify data quality.

All of these tasks are aimed at identifying the data available. At this stage, you need to be aware of the data you will have to work with and start making decisions on the way in which main principles related to data protection will be implemented.

Main actions that need to be addressed

At this stage, a large number of fundamental issues related to the protection of personal data need to be addressed. The decisions you make here will determine whether principles such as data minimisation, privacy by design and by default, lawfulness, fairness and transparency are properly implemented.

Type of collected data

According to the GDPR, you “shall implement appropriate technical and organizational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed. That obligation applies to the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility. In particular, such measures shall ensure that by default personal data are not made accessible without the individual’s intervention to an indefinite number of natural persons.”[2] (see “Data Protection by Design and by Default” in “Concepts” chapter). This must be kept in mind especially at this stage, since decisions about the type of data to be used are often made at this point.

In terms of data protection, the simplest way to build your AI tool would be to use X-ray images exclusively. Nonetheless, it might also be worthwhile to introduce data on previous pathologies, age or gender, for instance. One could additionally think about using data such as eating habits, zip/postal code, sporting habits, etc. Adding many new features to the model might increase its accuracy significantly, but it might equally fail to do so. You should weigh whether introducing data beyond the radiographic images makes the diagnosis sufficiently more accurate to justify their use. This might be difficult to assess in advance, but the training phase should at least clarify the issue. If the gain in accuracy does not justify the additional use of personal data, that use should be avoided.
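One way to make this weighing explicit is to record, for each optional group of features, the validation accuracy with and without it, and to keep only the groups whose gain clears a threshold agreed in advance. The following is a minimal sketch in Python, not a prescribed method: the names `AblationResult` and `justified_feature_groups`, the 0.02 threshold and the example figures are all hypothetical, and the threshold itself is a judgment that should be documented and justified.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AblationResult:
    """Validation accuracy with and without one optional feature group."""
    feature_group: str         # e.g. "age_and_sex" (hypothetical label)
    baseline_accuracy: float   # model trained on X-ray images only
    extended_accuracy: float   # model trained on images plus this group

    @property
    def gain(self) -> float:
        return self.extended_accuracy - self.baseline_accuracy


def justified_feature_groups(
    results: Iterable[AblationResult], min_gain: float = 0.02
) -> List[str]:
    """Return the feature groups whose accuracy gain reaches min_gain.

    min_gain is an assumed, project-specific threshold, not a legal standard.
    """
    return [r.feature_group for r in results if r.gain >= min_gain]


# Placeholder figures for illustration only, not measured results.
example = [
    AblationResult("age_and_sex", 0.90, 0.94),
    AblationResult("postal_code", 0.90, 0.905),
]
print(justified_feature_groups(example))
```

Feature groups that do not appear in the returned list are candidates for exclusion from collection altogether, which is where the data minimisation benefit lies.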

Thus, make sure that you really need huge amounts of data. Smart data might be much more useful than big data. Of course, using smart, well-prepared data might involve a considerable effort in terms of unification, homogenization, etc., but it will help to implement the principle of data minimization much more efficiently. To this end, having an expert who is able to select the relevant features might be extremely important.

Furthermore, you should try to limit the resolution of the data to what is minimally necessary for the purposes pursued by the processing. You should also determine an optimal level of data aggregation before starting the processing (see “Adequate, relevant and limited” part of the “Data minimisation” section in “Principles” chapter).
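As an illustration of limiting resolution, the sketch below coarsens two of the attributes mentioned earlier (age and postal code) before the data reach the training pipeline. It is only an example under assumptions: the field names, the ten-year bands and the two-character postal prefix are hypothetical choices, and the appropriate level of aggregation depends on the purpose of the processing.

```python
def generalise_age(age: int, band_width: int = 10) -> str:
    """Replace an exact age with a band, e.g. 47 -> "40-49"."""
    lower = (age // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"


def truncate_postal_code(code: str, keep: int = 2) -> str:
    """Keep only the first `keep` characters and mask the rest."""
    code = code.strip()
    return code[:keep] + "*" * max(len(code) - keep, 0)


def minimise_record(record: dict) -> dict:
    """Build a reduced record from an allow-list of coarsened fields.

    Any field not listed here (names, exact dates, etc.) is dropped.
    """
    return {
        "age_band": generalise_age(record["age"]),
        "postal_area": truncate_postal_code(record["postal_code"]),
        "sex": record["sex"],
    }
```

Building the output from an allow-list, rather than deleting known sensitive fields, means that attributes added to the source later are excluded by default.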

Data minimization might be complex in the case of deep learning, where selecting individual features might be impossible. There is, however, an efficient way to regulate the amount of data gathered and to increase it only if it seems necessary: the learning curve. You should start by gathering and using a restricted amount of training data, and then monitor the model’s accuracy as it is fed with new data. Once additional data no longer improve accuracy in a meaningful way, collecting more is hard to justify.
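The learning-curve approach can be sketched as a simple stopping rule: evaluate the model on increasing training-set sizes and stop as soon as accuracy has stalled for a few consecutive steps. This is a minimal sketch under assumptions, not a definitive procedure. The function `smallest_sufficient_size` and its parameters `min_gain` and `patience` are hypothetical, and `evaluate` stands for whatever routine trains the model on the first `n` examples and returns its validation accuracy.

```python
from typing import Callable, Sequence


def smallest_sufficient_size(
    sizes: Sequence[int],
    evaluate: Callable[[int], float],
    min_gain: float = 0.01,
    patience: int = 2,
) -> int:
    """Return the smallest training-set size beyond which accuracy stalls.

    sizes    -- increasing training-set sizes to try
    evaluate -- trains on n examples and returns validation accuracy
                (supplied by the caller; hypothetical here)
    min_gain -- smallest improvement over the best score that still counts
    patience -- number of consecutive stalled steps before stopping
    """
    best_size = sizes[0]
    best_score = evaluate(sizes[0])
    stalled = 0
    for size in sizes[1:]:
        score = evaluate(size)
        if score - best_score >= min_gain:
            best_size, best_score = size, score
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break  # larger sizes are never requested or evaluated
    return best_size
```

Because the loop stops early, the larger datasets do not need to be gathered at all, which is the point of using the learning curve as a data minimisation tool rather than as a purely diagnostic plot.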

Checking legitimate dataset usage

Datasets can be obtained in different ways. Firstly, the developer might opt for accessing a database that has already been built by someone else. If this is the case, you should be particularly careful, since gaining access to a database raises many legal issues (see “How to access to a database” section in “Actions and Tools” chapter).[3]

Secondly, the most common alternative to this consists of building a database. Quite obviously, in this case you have to ensure that you comply with all legal requirements imposed by the GDPR to create a database (see Creating a database section in Actions and Tools chapter).

Thirdly, you might choose an alternative path. You can combine different datasets so as to create a large training dataset and another one for validation purposes. This could raise some issues, such as the possibility that the combination of these personal data reveals additional information about the data subjects. For instance, it could allow you to identify data subjects, something that was not previously possible. That could involve de-anonymizing anonymized data and creating new personal information that was not contained in the original datasets, a circumstance that would raise serious ethical and legal issues. For instance, “if data subjects gave informed consent for the processing of personal information in the original data sets for particular purposes, they did not necessarily by extension also give permission for the merging of data sets and for data mining that reveals new information. New information produced in this way may also be based on probabilities or conjectures, and therefore be false, or contain biases in the portrayal of persons.”[4] Therefore, you should try to avoid such consequences by ensuring that merging datasets does not work against data subjects’ rights and interests.
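The re-identification risk described above can be illustrated with a simple check on group sizes. The sketch below counts how many records share each combination of indirectly identifying attributes, before and after a merge. This is a k-anonymity-style heuristic that we introduce here purely for illustration. It is not a legal test of anonymisation, and the function names, field names and toy records are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group_size(records: List[Dict], quasi_identifiers: Sequence[str]) -> int:
    """Size of the smallest group sharing the same quasi-identifier values.

    A result of 1 means at least one record is unique on those attributes.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def merge_on_key(left: List[Dict], right: List[Dict], key: str) -> List[Dict]:
    """Inner join of two record lists on a shared key (illustrative only)."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left if l[key] in right_by_key]


# Toy records: each dataset alone has no unique individual...
ages = [{"pid": 1, "age_band": "40-49"}, {"pid": 2, "age_band": "40-49"},
        {"pid": 3, "age_band": "50-59"}, {"pid": 4, "age_band": "50-59"}]
areas = [{"pid": 1, "postal_area": "48"}, {"pid": 2, "postal_area": "20"},
         {"pid": 3, "postal_area": "48"}, {"pid": 4, "postal_area": "20"}]

# ...but after merging, every combination of attributes is unique.
merged = merge_on_key(ages, areas, "pid")
print(smallest_group_size(ages, ["age_band"]))
print(smallest_group_size(merged, ["age_band", "postal_area"]))
```

Running a check of this kind before and after a merge gives an early warning that the combined dataset singles out individuals who were not identifiable in either source.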

Finally, if you use several datasets that pursue different purposes, you should implement adequate measures to separate the different processing activities. Otherwise, you could easily end up using data collected for one purpose in a different activity. This might raise issues related to the purpose limitation principle.

Selecting appropriate legal basis

You should decide on the legal basis for processing before you start, document your decision in the privacy notice (along with the purposes) and include the reasons why you made that choice (see “Accountability” section in “Principles” chapter).

You should select the legal basis that most closely reflects the true nature of your relationship with the individual and the purpose of the processing. This decision is key, since changing the legal basis for processing is not possible unless there are solid reasons that justify it (see “Purpose limitation” section in “Principles” chapter).

In the case of an AI tool involving patients’ data, developers are usually tempted to use consent as the legal basis for processing (see “Consent” section in “Main Concepts” chapter). This could make sense if you are re-using data that were already gathered for another purpose and consent was the basis that allowed the primary use of the data. Indeed, the GDPR allows the re-use of data for scientific purposes, and article 5.1 (b) states that further processing for scientific research purposes shall not be considered to be incompatible with the initial purposes (‘purpose limitation’). Thus, in principle, you could re-use those data on the basis of the original consent. However, you must keep in mind that, according to article 9.4 of the GDPR, “Member States may maintain or introduce further conditions, including limitations, with regard to the processing of genetic data, biometric data or data concerning health.” Thus, it might well happen that your national regulation introduces exceptions or specific conditions for the re-use of personal data. In any case, you should always remember that your information duties remain. Prior to any further processing of their data, you should provide the data subjects with information on that other purpose and any further relevant information as referred to in paragraph 2 of article 13 GDPR.

The discussion about the re-use of data

At present, there is a lively discussion about the re-use of data for research purposes. According to article 5.1 (b) of the GDPR, further processing for scientific purposes shall not be considered incompatible with the initial purposes. Thus, unless your national regulation states otherwise, you can re-use the available data for research purposes, since these are compatible with the original purpose for which the data were collected. However, the EDPS argued that, “in order to ensure respect for the rights of the data subject, the compatibility test under Article 6(4) should still be considered prior to the reuse of data for the purposes of scientific research, particularly where the data was originally collected for very different purposes or outside the area of scientific research. Indeed, according to one analysis from a medical research perspective, applying this test should be straightforward”[5]. According to this interpretation, you should only re-use personal data if the circumstances of article 6.4 apply.

This interpretation is somewhat at odds with that of the EDPB, which stated that Article 5(1)(b) GDPR provides that where data is further processed for scientific purposes, “these shall a priori not be considered as incompatible with the initial purpose, provided that it occurs in accordance with the provisions of Article 89, which foresees specific adequate safeguards and derogations in these cases. Where that is the case, the controller could be able, under certain conditions, to further process the data without the need for a new legal basis. These conditions, due to their horizontal and complex nature, will require specific attention and guidance from the EDPB in the future. For the time being, the presumption of compatibility, subject to the conditions set forth in Article 89, should not be excluded, in all circumstances, for the secondary use of clinical trial data outside the clinical trial protocol for other scientific purposes”[6].

Therefore, the situation remains unclear at this moment, even though we consider that the interpretation by the EDPB makes more sense and will probably prevail in the future.

If you can collect new data for your research, we recommend that you avoid consent as the legal basis, especially if the data are collected in a situation where patients are in need of urgent health care, for example when they are suffering from symptoms associated with COVID-19. In the context of clinical trials, the EDPB[7] has stated that “it must be kept in mind that even though conditions for an informed consent under the CTR are gathered, a clear situation of imbalance of powers between the participant and the sponsor/investigator will imply that the consent is not “freely given” in the meaning of the GDPR. As a matter of example, the EDPB considers that this will be the case when a participant is not in good health conditions, when participants belong to an economically or socially disadvantaged group or in any situation of institutional or hierarchical dependency. Therefore, and as explained in the Guidelines on consent of the Working Party 29, consent will not be the appropriate legal basis in most cases, and other legal bases than consent must be relied upon (see below alternative legal bases). Consequently, the EDPB considers that data controllers should conduct a particularly thorough assessment of the circumstances of the clinical trial before relying on individuals’ consent as a legal basis for the processing of personal data for the purposes of the research activities of that trial.”

From our point of view, this opinion might be extended to other scenarios where there is an imbalance of power. However, the relevant ethics committee might not share our view. Please be aware of this possibility and try to avoid problems in advance by consulting the committee and/or your DPO and, if need be, the supervisory authority.

[1] Colin Shearer, The CRISP-DM Model: The New Blueprint for Data Mining, p. 15.

[2] Article 25(2) of the GDPR.

[3] Yeong Zee Kin, Legal Issues in AI Deployment. Accessed 15 May 2020.

[4] SHERPA, Guidelines for the Ethical Development of AI and Big Data Systems: An Ethics by Design approach, 2020, p. 38. Accessed 15 May 2020.

[5] EDPS, A Preliminary Opinion on data protection and scientific research, 6 January 2020, p. 23.

[6] EDPB, Opinion 3/2019 concerning the Questions and Answers on the interplay between the Clinical Trials Regulation (CTR) and the General Data Protection regulation (GDPR) (art. 70.1.b)), adopted on 23 January 2019, p. 8.

[7] EDPB, Opinion 3/2019 concerning the Questions and Answers on the interplay between the Clinical Trials Regulation (CTR) and the General Data Protection regulation (GDPR).

