Implementing the data minimization principle

According to the principle of purpose limitation (see the “Purpose limitation” section in the “Principles” chapter), personal data may only be collected for “specified, explicit and legitimate purposes” and may not be used in a way that is incompatible with those purposes. Controllers using AI tools must therefore determine the purpose of the AI tool’s use at the outset of its training or deployment, and re-assess that determination should the system’s processing produce unexpected results.

According to the data minimization principle, controllers must reduce the amount of data and/or the range of information about the data subject as soon as possible. Consequently, personal data used during the training phase must be purged of all information not strictly necessary for training the model (see the “Temporal aspect” subsection in the “Data minimization” section of the “Principles” chapter). There are multiple strategies for ensuring data minimization at the training stage, and techniques are continuously evolving. Some of the most common are listed below,[1] several of them illustrated with short code sketches after the list (see also the “Integrity and confidentiality” section in the “Principles” chapter):

  • Analysis of the conditions that the data must fulfil in order to be considered of high quality and high predictive capacity for the specific application.
  • Critical analysis of the types and extent of data used at each stage of the AI tool.
  • Deletion of unstructured data and unnecessary information collected during the pre-processing of the information.
  • Identification and suppression of those categories of data that do not have a significant influence on learning or on the outcome of the inference.
  • Suppression of irrelevant conclusions associated with personal information during the training process, for example, in the case of unsupervised training.
  • Use of verification techniques that require less data, such as cross-validation.
  • Analysis and configuration of algorithmic hyperparameters that influence the amount or extent of data processed, in order to minimize them.
  • Use of federated rather than centralized learning models.
  • Application of differential privacy strategies.
  • Training with encrypted data using homomorphic techniques.
  • Data aggregation.
  • Anonymization and pseudonymization, not only in data communication but also in the training data, in any personal data contained in the model, and in the inference processing (see the “Pseudonymization” section in the “Concepts” part of these Guidelines).
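
As an illustration of identifying and suppressing low-influence data categories, a minimal sketch using scikit-learn’s mutual-information scores; the dataset, column names and relevance threshold are hypothetical:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical training set; which columns actually matter is unknown upfront.
df = pd.DataFrame({
    "age":      [25, 40, 31, 58, 45, 33, 29, 52],
    "postcode": [1001, 1002, 1001, 1003, 1002, 1001, 1003, 1002],
    "salary":   [30e3, 55e3, 42e3, 70e3, 61e3, 45e3, 38e3, 66e3],
    "target":   [0, 1, 0, 1, 1, 0, 0, 1],
})
X, y = df.drop(columns="target"), df["target"]

# Estimate how much each feature contributes to predicting the target.
scores = mutual_info_classif(X, y, random_state=0)

# Drop (and stop collecting) features below an assumed relevance threshold.
THRESHOLD = 0.05  # assumption: tune per application
keep = [col for col, s in zip(X.columns, scores) if s >= THRESHOLD]
print("Columns retained for training:", keep)
```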
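
Cross-validation reuses the same records across training and validation folds, so a reliable performance estimate can be obtained without collecting a separate hold-out set. A minimal sketch with scikit-learn; the synthetic data and model choice are only placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a (minimized) training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold cross-validation: every record serves in turn for fitting and
# validation, so no additional hold-out data has to be collected.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Mean accuracy over 5 folds: {scores.mean():.2f}")
```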
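
Some hyperparameters directly bound how much information a model ingests. The sketch below assumes a text pipeline and caps the vocabulary with scikit-learn’s CountVectorizer, so that rare (and often most identifying) tokens never reach the model; the documents and limits are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "patient reports mild symptoms",
    "patient reports severe symptoms",
    "no symptoms reported today",
]

# max_features and min_df are hyperparameters bounding the extent of the
# processed data: tokens appearing in fewer than two documents are dropped.
vec = CountVectorizer(max_features=10, min_df=2)
X = vec.fit_transform(docs)
print("Vocabulary kept:", sorted(vec.vocabulary_))
```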
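
Federated learning keeps raw records on the participating clients and shares only model parameters with the server. A toy federated-averaging round in plain NumPy; the linear model and two-client setup are illustrative assumptions, not a production protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_step(w, X, y, lr=0.1):
    """One gradient step of linear regression on a client's private data."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

# Each client holds its own data; raw records never leave the client.
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(2)]

w_global = np.zeros(3)
for _ in range(20):  # federated rounds
    # Clients train locally and send back only their updated weights.
    local_weights = [local_step(w_global, X, y) for X, y in clients]
    # The server aggregates by averaging the parameter vectors.
    w_global = np.mean(local_weights, axis=0)
print("Aggregated global weights:", w_global)
```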
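
A minimal sketch of the Laplace mechanism, one of the basic building blocks of differential privacy, applied to a count query; the privacy budget epsilon is an assumed value, and production systems typically rely on vetted libraries rather than hand-rolled noise:

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon):
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    true_count = sum(predicate(v) for v in values)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [25, 40, 31, 58, 45, 33, 29, 52]
# epsilon = 1.0 is an assumed privacy budget; smaller means more private.
print("Noisy count of ages over 35:", dp_count(ages, lambda a: a > 35, 1.0))
```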
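
Fully homomorphic training remains an active research area, but the idea can be shown with the additively homomorphic Paillier scheme as implemented by the open-source python-paillier (`phe`) package; the values are arbitrary and the sketch only demonstrates computing on ciphertexts:

```python
from phe import paillier  # pip install phe

# Keypair generation; the private key stays with the data controller.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Encrypt training values; an untrusted party can still compute on them.
enc_a = public_key.encrypt(3.5)
enc_b = public_key.encrypt(1.5)

# Additive homomorphism: sums and scalar products work on ciphertexts
# without ever exposing the underlying values.
enc_sum = enc_a + enc_b
enc_scaled = enc_a * 2

print(private_key.decrypt(enc_sum))     # 5.0
print(private_key.decrypt(enc_scaled))  # 7.0
```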
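
Aggregation replaces individual-level records with group-level statistics before training; a sketch with pandas, with a hypothetical schema and grouping key:

```python
import pandas as pd

# Individual-level records (hypothetical schema).
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "age":    [25, 40, 31, 58, 45],
    "spend":  [120.0, 340.0, 90.0, 410.0, 275.0],
})

# Train on group-level statistics instead of per-person rows.
aggregated = df.groupby("region").agg(
    mean_age=("age", "mean"),
    mean_spend=("spend", "mean"),
    n_records=("age", "size"),
)
print(aggregated)
```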
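
Finally, pseudonymization can be applied before identifiers enter the training pipeline, for instance by replacing them with keyed hashes so that re-identification requires a separately held key. A sketch; the key handling shown is deliberately simplified, and a real deployment would use a key-management service:

```python
import hashlib
import hmac

# The key must be stored separately from the pseudonymized data;
# hard-coding it here is only for illustration.
SECRET_KEY = b"replace-with-securely-stored-key"

def pseudonymize(identifier: str) -> str:
    """Deterministic keyed hash: the same input maps to the same pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

print(pseudonymize("jane.doe@example.com"))
```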


References


[1] AEPD (2020) Adecuación al RGPD de tratamientos que incorporan Inteligencia Artificial. Una introducción. Agencia Española de Protección de Datos, Madrid, p. 40. Available at: www.aepd.es/sites/default/files/2020-02/adecuacion-rgpd-ia.pdf (accessed 15 May 2020).

