Making a decision about the type of data collected

According to the GDPR, the controller “shall implement appropriate technical and organizational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed. That obligation applies to the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility. In particular, such measures shall ensure that by default personal data are not made accessible without the individual’s intervention to an indefinite number of natural persons”[1] (see the “Data protection by design and by default” section in the “Concepts” chapter). This must be kept in mind particularly during this stage, since decisions about the type of data that will be used are often taken at this moment.

Controllers must consider that it is always better to avoid using personal data if this is possible. Indeed, according to the data minimization principle, the use of personal data should be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed (see the “Data minimization” section in the “Principles” chapter). Therefore, if the same purpose can be achieved without using personal data, their processing should be avoided.

As a second level of precaution, if developers need to use personal data, they should try to avoid using special category data. Sometimes this is feasible, sometimes it is not; it often depends on the area of application of the model. Working on a model that will be used to analyse the influence of epigenetics on human health is not the same as working on a model used to monitor an epidemic outbreak, or on one that will serve to target advertisements accurately. If special category data are ultimately used, controllers must consider the regulations applying to their processing and the need to apply appropriate safeguards capable of protecting the data subjects’ rights, interests and freedoms. Proportionality between the aim of the research and the use of special categories of data must be guaranteed. Furthermore, controllers must check whether their Member State’s legislation introduces further conditions or limitations on the processing of genetic, biometric or health data, since Member States are empowered to do so by the GDPR.

If personal data are necessary, the AI developer should, at least, try to reduce the amount of data considered as much as possible (see the “Data minimization” section in the “Principles” chapter). They should always remember that they can only process data if the processing is adequate and relevant. Therefore, they should avoid using excessive amounts of personal data. Often, this is easier to do than it seems. As the Norwegian Data Protection Authority states, “[i]t is worth noting that the quality of the training data, as well as the features used, can in many instances be substantially more important than the quantity. When training a model, it is important that the selection of training data is representative of the task to be solved later. Huge volumes of data are of little help if they only cover a fraction of what the model will subsequently be working on.”[2] Therefore, it is particularly important not to collect unnecessary data. Correct labelling can be an effective antidote to unnecessary collection. Note that if the data are already stored, selection involves deleting the unnecessary data elements.
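By way of illustration, the following minimal sketch shows this kind of selection and deletion on a hypothetical tabular dataset held in a pandas DataFrame; the file name, column names and label are assumptions made purely for the example, not part of any real processing operation.

```python
import pandas as pd

# Hypothetical training dataset; file and column names are illustrative only.
df = pd.read_csv("training_data.csv")

# Features that domain experts have judged adequate and relevant
# for the specific purpose of the processing.
relevant_features = ["age_band", "income_band", "region"]
label = "outcome"

# Data minimization: keep only what is necessary for the purpose.
minimized = df[relevant_features + [label]].copy()

# If the full dataset is already stored, the unnecessary elements
# should be deleted rather than merely ignored.
del df
minimized.to_csv("training_data_minimized.csv", index=False)
```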

The developer should always try to avoid the ‘Curse of Dimensionality’; that is, “a poor performance of algorithms and their high complexity associated with data frame having a big number of dimensions/features, which frequently make the target function quite complex and may lead to model overfitting as long as often the dataset lies on the lower dimensionality manifold.”[3] To this end, having an expert able to select relevant features can be extremely important, since it helps to significantly reduce the amount of personal data used without losing quality. This should not be difficult if the data scientist is well acquainted with the dataset and the meanings of its numerical features; under such conditions, it is easy to determine whether some of the variables are needed or not. However, such an approach is only possible where the dataset is easily interpreted and the dependencies between the variables are well known. Therefore, the developer will need a smaller amount of data if the data have been adequately classified: smart data might be much more useful than big data. Of course, this might involve a considerable effort in terms of unification, homogenization, etc., but it will help to implement the principle of data minimization (see the “Data minimization” section within Part II, “Principles”, of these Guidelines) in a much more efficient way.
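Where the dependencies between variables are less obvious, simple statistical feature selection can support the expert’s judgement. The following is a minimal sketch, assuming a labelled dataset and using scikit-learn’s mutual-information-based selector; the example dataset and the number of retained features are illustrative assumptions only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Example dataset only; in practice X and y would be the project's own data.
X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 most informative features (illustrative threshold) and discard
# the rest, reducing dimensionality and the amount of personal data retained.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # e.g. (569, 30) -> (569, 10)
```

Which features are discarded should still be reviewed by someone acquainted with the meaning of each variable, since a purely statistical criterion cannot assess whether retaining a feature is necessary or proportionate.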

Furthermore, controllers should try to limit the resolution of the data to what is minimally necessary for the purposes pursued by the processing. They should also determine an optimal level of data aggregation before starting the processing (see the “Adequate, relevant and limited” subsection of the “Data minimization” section in the “Principles” chapter).
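As a purely illustrative sketch of such resolution limits, the snippet below aggregates hypothetical records to coarser granularities before any further processing; the field names and the chosen levels of aggregation are assumptions, and the appropriate level always depends on the purpose pursued.

```python
import pandas as pd

# Hypothetical raw records; field names and values are illustrative only.
records = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-05-15 10:23:41", "2020-05-15 18:47:02"]),
    "age": [34, 58],
})

# Limit resolution: timestamps aggregated to the day, ages to 10-year bands.
records["day"] = records["timestamp"].dt.floor("D")
records["age_band"] = (records["age"] // 10) * 10

# Remove the high-resolution originals once the coarser values are in place.
aggregated = records.drop(columns=["timestamp", "age"])
print(aggregated)
```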

Data minimization might be complex in the case of deep learning, where selecting individual features might be impossible. There is, however, an efficient way to regulate the amount of data gathered and to increase it only if it proves necessary: the learning curve[4]. The developer should start by gathering and using a restricted amount of training data, and then monitor the model’s accuracy as it is fed with new data.
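A minimal sketch of this incremental approach, assuming a generic classification task and using scikit-learn’s learning_curve utility; the synthetic dataset, model and training-set sizes are illustrative assumptions, not a recommended configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the project's own training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Measure validation accuracy while the model is fed growing amounts of data.
train_sizes, _, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy"
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size} training examples -> validation accuracy {score:.3f}")
```

If the curve has flattened, feeding the model additional personal data is unlikely to improve its accuracy, which is a strong indication that further collection is not necessary.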

Box 16: A data minimization practice that was not adequately implemented

A tool developed by the Norwegian Tax Administration to filter tax returns for errors tested 500 variables in the training phase. However, only 30 were included in the final AI model, as they proved the most relevant to the task at hand. This means that the developers could probably have avoided collecting so much personal data if they had made a better selection of the relevant variables from the very beginning.[5]


References


[1] Article 25(2) of the GDPR.

[2] Norwegian Data Protection Authority (2018) Artificial intelligence and privacy. Norwegian Data Protection Authority, Oslo. Available at: https://iapp.org/media/pdf/resource_center/ai-and-privacy.pdf (accessed 15 May 2020).

[3] Oliinyk, H. (2018) Why and how to get rid of the curse of dimensionality right (with breast cancer dataset visualization). Towards Data Science, 20 March. Available at: https://towardsdatascience.com/why-and-how-to-get-rid-of-the-curse-of-dimensionality-right-with-breast-cancer-dataset-7d528fb5f6c0 (accessed 15 May 2020).

[4] Ng, R. (no date) Learning curve. Available at: www.ritchieng.com/machinelearning-learning-curve/ (accessed 15 May 2020).

[5] Norwegian Data Protection Authority (2018) Artificial intelligence and privacy. Norwegian Data Protection Authority, Oslo. Available at: https://iapp.org/media/pdf/resource_center/ai-and-privacy.pdf (accessed 15 May 2020).
