Data minimization - Guidelines Panelfit

The data minimization principle stipulates that personal data should be “adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed”.^[1] In the AI context, this means, in the first instance, that controllers should avoid using personal data if it is not necessary; that is, if the controller’s objective can be achieved without processing personal data (see the “Lawfulness, fairness and transparency” section within “Principles” in Part II of these Guidelines). Indeed, personal data can sometimes be replaced with non-personal data without affecting the research purposes. In such circumstances, the use of anonymized data is compulsory, according to Article 89(1) of the GDPR.

If anonymization is not possible, controllers should at least try to work with pseudonymized data. Ultimately, each controller needs to define which personal data are actually needed (and which are not) for the purpose of the processing, including the relevant data retention periods. Indeed, controllers must keep in mind that the necessity of processing must be proven in the case of most legal bases – including all those bases stated in Article 6 of the GDPR except consent, and most of the bases included in Article 9(2) regarding special categories of data. In other words, for the majority of legal bases for processing personal data, both the data minimization and the lawfulness principles require controllers to establish that the AI development cannot be carried out without using personal data.
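As a minimal sketch of what working with pseudonymized data can look like in practice, the following Python example replaces direct identifiers with a keyed hash (HMAC-SHA256). The field names, the sample record and the key are hypothetical and serve only as illustration. The key has to be stored separately from the data set, and the result remains personal data under the GDPR, since anyone holding the key can recompute the pseudonyms and re-link them to individuals.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256, hex encoded) of a direct identifier."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_record(record: dict, id_fields: set, secret_key: bytes) -> dict:
    """Replace the listed identifier fields with pseudonyms; keep other fields as they are."""
    return {
        field: pseudonymize(str(value), secret_key) if field in id_fields else value
        for field, value in record.items()
    }


# Hypothetical key and record, for illustration only.
key = b"store-this-key-separately"
record = {"name": "Jane Doe", "email": "jane@example.org", "age_band": "30-39"}
pseudonymized = pseudonymize_record(record, {"name", "email"}, key)
```

The same identifier always maps to the same pseudonym under a given key, so records belonging to one person can still be linked for training purposes without exposing the name or email address to the development team.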

The concept of necessity is, however, complex, and has an independent meaning in European Union law.^[2] In general, it requires that processing is a targeted and proportionate way of achieving a specific purpose. It is not enough to argue that processing is necessary because controllers have chosen to operate their business in a particular way. They must be able to demonstrate that the processing is necessary for the objective being pursued and is less intrusive than other options for achieving the same goal; not that it is a necessary part of their chosen methods.^[3] If there are realistic, less intrusive alternatives, the processing of personal data is not deemed necessary.^[4]

Therefore, the data minimization principle requires AI developers to opt for those tools whose development involves minimal use of personal data compared to the available alternatives. Once such a tool has been chosen, specific processes should be in place to prevent unnecessary personal data from being collected and/or transferred, reduce data fields and provide for automated deletion mechanisms.^[5] Data minimization may be especially complex in the case of deep learning, where it might be impossible to discriminate between features. Therefore, if alternative solutions can deliver the same outcomes, deep learning is best avoided.

Further, the CIPL notes that “what personal data is considered ‘necessary’ varies depending on the AI system and the objective for which it is used, but the governance of the GDPR in this area should prevent the perfect from being the enemy of the good for AI designers – the fact that the personal data must be limited does not mean that the AI system itself becomes useless, especially since not all AI systems need to provide a precise output.”^[6] In order to determine precisely the range and amount of personal data needed, having an expert able to select relevant features becomes extremely important. This should significantly reduce the risk to data subjects’ privacy – without losing quality.
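One way to support an expert’s choice of features is to screen candidate variables on a small pilot sample before any large-scale collection. The sketch below ranks candidates by their absolute Pearson correlation with the target. It is only an illustration under stated assumptions: the data are synthetic, the field names (`declared_income`, `shoe_size`, `error_score`) are hypothetical, and linear correlation is a crude screen that cannot replace domain expertise.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(rows, target, candidates):
    """Rank candidate features by absolute correlation with the target, strongest first."""
    targets = [row[target] for row in rows]
    scores = {
        name: abs(pearson([row[name] for row in rows], targets))
        for name in candidates
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic pilot sample (no real data): only "declared_income" drives the target.
rng = random.Random(0)
pilot = []
for _ in range(200):
    income = rng.gauss(0.0, 1.0)
    pilot.append({
        "declared_income": income,
        "shoe_size": rng.gauss(0.0, 1.0),
        "error_score": 2.0 * income + rng.gauss(0.0, 0.5),
    })

ranking = rank_features(pilot, "error_score", ["shoe_size", "declared_income"])
```

Variables that end up at the bottom of such a ranking are candidates for exclusion from further collection, subject to the expert’s review.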

There is an efficient tool to regulate the amount of data gathered and increase it only if it seems necessary: the learning curve.^[7] The controller should start by gathering and using a restricted amount of training data, and then monitor the model’s accuracy as it is fed with new data. This will also help a controller to avoid the ‘curse of dimensionality’; that is, “a poor performance of algorithms and their high complexity associated with data frame having a big number of dimensions/features, which frequently make the target function quite complex and may lead to model overfitting as long as often the dataset lies on the lower dimensionality manifold.”^[8]
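The learning-curve approach described above can be sketched as follows. This is a simplified illustration, not a reference implementation: the data are synthetic, the “model” is a toy nearest-centroid classifier, and the names (`learning_curve`, `min_gain`, the list of training sizes) are assumptions chosen for the example. The idea is to train on progressively larger subsets and to stop gathering data once the accuracy gain on held-out data falls below a chosen threshold.

```python
import random


def make_samples(n, rng):
    """Synthetic, balanced two-class data with one informative feature."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def fit_centroids(train):
    """Toy 'model': the mean feature value of each class."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def accuracy(centroids, samples):
    """Share of samples whose nearest centroid matches their label."""
    hits = sum(
        1 for x, y in samples
        if min(centroids, key=lambda label: abs(x - centroids[label])) == y
    )
    return hits / len(samples)


def learning_curve(pool, validation, sizes, min_gain=0.01):
    """Train on growing subsets; stop once extra data adds less than min_gain."""
    curve = []
    for size in sizes:
        score = accuracy(fit_centroids(pool[:size]), validation)
        curve.append((size, score))
        if len(curve) > 1 and score - curve[-2][1] < min_gain:
            break  # the latest batch of data did not improve accuracy enough
    return curve


rng = random.Random(42)
pool = make_samples(640, rng)        # data that could be gathered
validation = make_samples(500, rng)  # held-out data used for monitoring
curve = learning_curve(pool, validation, sizes=[20, 40, 80, 160, 320, 640])
for size, score in curve:
    print(f"{size:4d} training samples -> accuracy {score:.3f}")
```

In a real project the controller would gather each additional batch only after the previous step had shown a meaningful gain, so that the stopping point also bounds the amount of personal data collected.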

Some additional measures related to the minimization principle include:

limit the extension of the data categories (e.g. names, physical and addresses, fields about their health, work situation, beliefs, ideology, etc.)
limit the degree of detail or precision of the information, the granularity of the collection in time and frequency, and the age of the information used
limit the number of data subjects whose data are processed
limit the accessibility of the different categories of data to the staff of the controller/processor or even the end-user (if there are data from third parties in the AI models) at all stages of the processing.^[9]

Of course, adopting these measures might require a huge effort in terms of data unification, homogenization, etc., but it will contribute towards implementing the principle of data minimization in a much more efficient way.^[10]
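The first two measures in the list above (limiting data categories and limiting the degree of detail) can be sketched as a preprocessing step applied before data reach the training pipeline. The allow-list, the field names and the chosen granularity (ten-year age bands, three-character postcode areas, month-level dates) are hypothetical; what counts as sufficient detail depends on the purpose of the processing.

```python
from datetime import date, datetime

# Hypothetical allow-list: only these source fields may enter the training set.
ALLOWED_FIELDS = {"birth_date", "postcode", "last_visit", "diagnosis_code"}


def age_band(birth_date, today):
    """Coarsen an exact birth date into a ten-year age band."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def minimize(record, today):
    """Drop fields outside the allow-list and reduce the detail of the rest."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    result = {}
    if "birth_date" in kept:
        result["age_band"] = age_band(kept["birth_date"], today)
    if "postcode" in kept:
        result["postcode_area"] = kept["postcode"][:3]  # coarser location
    if "last_visit" in kept:
        result["last_visit_month"] = kept["last_visit"].strftime("%Y-%m")
    if "diagnosis_code" in kept:
        result["diagnosis_code"] = kept["diagnosis_code"]
    return result


# Hypothetical source record, for illustration only.
raw = {
    "name": "Jane Doe",
    "birth_date": date(1985, 6, 15),
    "postcode": "28013",
    "last_visit": datetime(2020, 3, 4, 14, 35),
    "diagnosis_code": "J45",
    "religion": "not needed for this purpose",
}
minimized = minimize(raw, today=date(2020, 5, 15))
```

For the sample record, the name and religion fields are dropped entirely, and the exact date of birth is reduced to the age band 30-39.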

Finally, it is useful to remember that controllers should avoid keeping large databases of historical data beyond the period required for normal business purposes or to fulfil legal obligations, simply because their analytic tool is able to produce a large amount of data and its storage capacity makes this possible. Instead, companies using big data must enforce appropriate retention schedules (see the “Storage limitation” section in the “Principles”, Part II of these Guidelines).
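A retention schedule of this kind can be enforced by a periodic job along the following lines. This is a sketch under assumptions: the categories, the retention periods and the record layout are hypothetical, and an actual deletion step (not shown) would have to cover backups and derived data sets as well.

```python
from datetime import date, timedelta

# Hypothetical retention schedule, per data category.
RETENTION = {
    "training_sample": timedelta(days=365),
    "access_log": timedelta(days=90),
}


def apply_retention(records, today):
    """Split records into those within their retention period and those due for deletion."""
    kept, expired = [], []
    for record in records:
        category = record["category"]
        if category not in RETENTION:
            # Force an explicit decision instead of keeping data indefinitely.
            raise ValueError(f"no retention period defined for {category!r}")
        age = today - record["collected_on"]
        (expired if age > RETENTION[category] else kept).append(record)
    return kept, expired


# Hypothetical records, for illustration only.
records = [
    {"id": 1, "category": "training_sample", "collected_on": date(2020, 1, 1)},
    {"id": 2, "category": "access_log", "collected_on": date(2020, 1, 1)},
    {"id": 3, "category": "training_sample", "collected_on": date(2018, 12, 31)},
]
kept, expired = apply_retention(records, today=date(2020, 5, 28))
```

Raising an error for categories without a defined period is a deliberate design choice in this sketch: data that nobody has assigned a retention period to should prompt a review rather than be stored by default.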

Box 6. An example of the benefits of data minimization in AI

An AI tool developed by the Norwegian tax administration to filter tax returns for errors tested five hundred variables in the training phase. However, only thirty were included in the final AI model, as they proved the most relevant to the task at hand. It is likely that the tool developers could have avoided collecting so many personal data, if they had performed a better selection of the relevant variables at the beginning of the development process.

Source: Norwegian Data Protection Authority (2018) Artificial intelligence and privacy. Norwegian Data Protection Authority, Oslo. Available at: https://iapp.org/media/pdf/resource_center/ai-and-privacy.pdf

Checklist: data minimization

☐ The controllers have ensured that they only use personal data if needed.

☐ The controllers have considered the proportionality between the amount of data and the accuracy of the AI tool.

☐ The controllers periodically review the data they hold, and delete anything they do not need.

☐ At the training stage of the AI system, the controllers remove all information not strictly necessary for such training.

☐ The controllers check if personal data are processed at the distribution stage of the AI system and delete them unless there is a justified need and legitimacy to keep them for other compatible purposes.

Additional information

ENISA (2015) Privacy by design in big data. European Union Agency for Cybersecurity, Athens / Heraklion, p.23. Available at: www.enisa.europa.eu/publications/big-data-protection

ICO (no date) Principle (c): data minimization. Information Commissioner’s Office, Wilmslow. Available at: https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/principles/data-minimisation/

Norwegian Data Protection Authority (2018) Artificial intelligence and privacy. Norwegian Data Protection Authority, Oslo. Available at: https://iapp.org/media/pdf/resource_center/ai-and-privacy.pdf

Pure Storage (2015) Big data’s big failure: the struggles businesses face in accessing the information they need. Pure Storage, Mountain View, CA. Available at: http://info.purestorage.com/rs/225-USM-292/images/Big%20Data%27s%20Big%20Failure_UK%281%29.pdf

References

¹Article 5(1)(c) of the GDPR.

²See CJEU, Case C‑524/06, Heinz Huber v Bundesrepublik Deutschland, 18 December 2008, para. 52.

³EDPS (2017) Necessity toolkit: assessing the necessity of measures that limit the fundamental right to the protection of personal data, p.5. European Data Protection Supervisor, Brussels. Available at: https://edps.europa.eu/data-protection/our-work/publications/papers/necessity-toolkit_en (accessed 15 May 2020); ICO (no date) Lawful basis for processing. Information Commissioner’s Office, Wilmslow. Available at: https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/lawful-basis-for-processing/ (accessed 15 May 2020).

⁴See CJEU, Joined Cases C‑92/09 and C‑93/09, Volker und Markus Schecke GbR and Hartmut Eifert v Land Hessen, 9 November 2010.

⁵ENISA (2015) Privacy by design in big data. European Union Agency for Cybersecurity, Athens / Heraklion, p.23. Available at: www.enisa.europa.eu/publications/big-data-protection (accessed 28 May 2020).

⁶CIPL (2020) Artificial intelligence and data protection: how the GDPR regulates AI. Centre for Information Policy Leadership, Washington DC / Brussels / London, p.13. Available at: www.informationpolicycentre.com/uploads/5/7/1/0/57104281/cipl-hunton_andrews_kurth_legal_note_-_how_gdpr_regulates_ai__12_march_2020_.pdf (accessed 15 May 2020).

⁷See: www.ritchieng.com/machinelearning-learning-curve/ (accessed 28 May 2020).

⁸Oliinyk, H. (2018) Why and how to get rid of the curse of dimensionality right (with breast cancer dataset visualization). Towards Data Science, 20 March. Available at: https://towardsdatascience.com/why-and-how-to-get-rid-of-the-curse-of-dimensionality-right-with-breast-cancer-dataset-7d528fb5f6c0 (accessed 15 May 2020).

⁹AEPD (2020) Adecuación al RGPD de tratamientos que incorporan Inteligencia Artificial. Una introducción. Agencia Española de Protección de Datos, Madrid, pp.39-40. Available at: www.aepd.es/sites/default/files/2020-02/adecuacion-rgpd-ia.pdf (accessed 15 May 2020).

¹⁰Norwegian Data Protection Authority (2018) Artificial intelligence and privacy. Norwegian Data Protection Authority, Oslo. Available at: https://iapp.org/media/pdf/resource_center/ai-and-privacy.pdf (accessed 15 May 2020).