Modeling (training) - Guidelines Panelfit

Description

“In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase may be necessary. Modeling steps include the selection of the modeling technique, the generation of test design, the creation of models, and the assessment of models.”^[1]

This phase involves several key tasks. Overall, you must:

Select the modeling technique that will be used. Depending on the technique chosen, risks such as data inference, opacity or bias are more or less likely to arise.
Decide on the training tool to be used. This enables the developer to measure how well the model can predict history before using it to predict the future. Training always involves empirical testing with data. Sometimes, developers test the model with data that differ from those used to generate it, so at this stage one may have to deal with different types of datasets. Identifying the individuals to whom the training data relate can sometimes be difficult, which creates issues for fulfilling individuals’ rights that should be addressed appropriately.

Main actions that need to be addressed

Implementing data minimization principle

According to the data minimization principle, you must reduce the amount of data, and the range of information about the data subject that they reveal, as soon as possible. Consequently, you have to purge the data used during the training phase of all information not strictly necessary for training the model (see “Temporal aspect” subsection in “Data minimization” section in “Principles” chapter). There are multiple strategies to ensure data minimization at the training stage. You should start by erasing all personal data related to the X-rays that you use, but this is only a first step towards complying with the minimization principle; stronger measures should be carefully implemented. Techniques are continuously evolving, but some of the most common are^[2] (see also “Integrity and confidentiality” section in “Principles” chapter):

Analysis of the conditions that the data must fulfil in order to be considered of high quality and with a great predictive capacity for the specific application.
Critical analysis of the extent of the data typology used in each stage of the AI tool.
Deletion of unstructured data and unnecessary information collected during the pre-processing of the information.
Identification and suppression of those categories of data that do not have a significant influence on learning or on the outcome of the inference.
Suppression of irrelevant conclusions associated with personal information during the training process, for example, in the case of unsupervised training.
Use of verification techniques that require less data, such as cross-validation.
Analysis and configuration of algorithmic hyperparameters that could influence the amount or extent of data processed in order to minimize them.
Use of federated rather than centralized learning models.
Application of differential privacy strategies.
Training with encrypted data using homomorphic techniques.
Data aggregation.
Anonymization and pseudonymization, not only in data communication, but also in the training data, possible personal data contained in the model and in the processing of inference.
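
Two of the measures listed above, suppression of unnecessary data categories and pseudonymization of the training data, can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the field names, the allow-list `FEATURES_NEEDED` and the helper names are hypothetical, and keyed hashing (HMAC-SHA256) is only one of several possible pseudonymization techniques. Note that pseudonymized records remain personal data, and the secret key must be stored separately from the training set.

```python
import hashlib
import hmac

# Hypothetical allow-list: the only fields assumed to be needed for training.
FEATURES_NEEDED = {"age_band", "systolic_bp", "diagnosis"}


def pseudonymize_id(patient_id: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize_record(record: dict, secret_key: bytes) -> dict:
    """Keep only allow-listed features and add a pseudonym for the subject."""
    # Every field that is not explicitly needed is dropped, not merely masked.
    minimized = {k: v for k, v in record.items() if k in FEATURES_NEEDED}
    minimized["pseudonym"] = pseudonymize_id(record["patient_id"], secret_key)
    return minimized


if __name__ == "__main__":
    raw = {
        "patient_id": "P-0001",
        "name": "Jane Doe",
        "address": "1 Example Street",
        "age_band": "40-49",
        "systolic_bp": 128,
        "diagnosis": "none",
    }
    print(minimize_record(raw, secret_key=b"example-key-kept-elsewhere"))
```

Using an allow-list rather than a deny-list means that any field added to the source data later is excluded from training by default.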

Detecting and erasing biases

Even if mechanisms against biases have been adopted in previous stages (see previous section about training), it is still necessary to ensure that the results of the training phase minimize biases. This can be difficult, since some types of bias and discrimination are particularly hard to detect. The team members who curate the input data are sometimes unaware of them, and the users who are their subjects are not necessarily cognizant of them either. Thus, the monitoring systems implemented by the AI developer in the validation stage are extremely important for avoiding biases.

Many tools can help to detect biases, such as the Algorithmic Impact Assessment.^[3] You must consider their effective implementation.^[4] However, as the literature shows^[5], an algorithm cannot always be totally purged of all the different types of bias. You should nevertheless at least be aware of their existence and of the implications that they might bring (see “Lawfulness, fairness and transparency” and “Accuracy” sections in “Principles” chapter).
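
One simple monitoring check at the validation stage is to compare a model's behaviour across groups. The sketch below is not the Algorithmic Impact Assessment itself; it is a minimal, assumed example that computes the selection rate and the false positive rate per group and reports the largest gap. The function names and the toy labels are hypothetical, and which metric and which gap size matter depends on the application.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return the selection rate and false positive rate for each group.

    y_true and y_pred hold binary labels (0 or 1); groups holds the group
    label of each individual, in the same order.
    """
    stats = defaultdict(lambda: {"n": 0, "pred_pos": 0, "neg": 0, "false_pos": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["pred_pos"] += pred
        if truth == 0:
            s["neg"] += 1
            s["false_pos"] += pred
    return {
        g: {
            "selection_rate": s["pred_pos"] / s["n"],
            # Undefined when a group has no true negatives.
            "false_positive_rate": s["false_pos"] / s["neg"] if s["neg"] else None,
        }
        for g, s in stats.items()
    }


def max_gap(rates, metric):
    """Largest difference in a metric between any two groups (None if undefined)."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    if not values:
        return None
    return max(values) - min(values)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 0, 0, 1, 1]
    y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    rates = group_rates(y_true, y_pred, groups)
    print(rates)
    print(max_gap(rates, "false_positive_rate"))
```

A gap close to zero on one metric does not rule out disparities on another, which is one reason why awareness of residual bias remains necessary.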

Exercising data subjects’ rights

Sometimes, developers complete the available data through inference. For instance, if you do not have the actual data on a patient’s arterial pressure, you might use another algorithm to infer it from the rest of the data. However, this does not mean that these data can be considered entirely pseudonymized or anonymized. This is particularly true in the case of genomic data, since their anonymization is almost impossible. Thus, they continue to be personal data. Furthermore, inferred data must also be considered personal data. Therefore, data subjects have some fundamental rights over these data that you must respect.

Indeed, you must facilitate the exercise of all data subjects’ rights during the whole life cycle. In this specific stage, the rights of access, rectification and erasure are particularly sensitive and have certain characteristics that controllers need to be aware of. However, in the case of processing for scientific research purposes such as yours, the GDPR includes some safeguards and derogations (art. 89). You must be aware of the specific regulation in your Member State. According to the GDPR, Union or Member State law may provide for derogations from the rights referred to in articles 15, 16, 18 and 21 in so far as such rights are likely to render impossible or seriously impair the achievement of the specific purposes, and such derogations are necessary for the fulfilment of those purposes.

- Right of access (see “Right to access” section in “Data subject rights” chapter)

In principle, you must respond to data subjects’ requests for access to their personal data, provided that you have taken reasonable measures to verify the identity of the data subject and no other exceptions apply. However, you do not have to collect or maintain additional personal data to enable the identification of data subjects in training data for the sole purpose of complying with the regulation. If you cannot identify a data subject in the training data and the data subject cannot provide additional information that would enable their identification, you are not obliged to fulfil a request that is impossible to satisfy.
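
As an illustration of how an access request might be matched against pseudonymized training data, the sketch below assumes that each training row carries a `pseudonym` computed as an HMAC-SHA256 of an identifier that the data subject can supply (for example, a patient number). The function names, the row layout and the pseudonymization scheme are assumptions made for this example only. If no row matches and the subject cannot provide further information, the lookup simply returns nothing, and no additional personal data has to be collected to make it succeed.

```python
import hashlib
import hmac


def pseudonym_for(identifier: str, secret_key: bytes) -> str:
    """Recompute the pseudonym from an identifier supplied by the requester."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def find_subject_records(training_rows, identifier, secret_key):
    """Return the training rows whose pseudonym matches the supplied identifier.

    An empty list means the data subject could not be identified in the
    training data with the information provided.
    """
    target = pseudonym_for(identifier, secret_key)
    return [
        row for row in training_rows
        if hmac.compare_digest(row["pseudonym"], target)
    ]


if __name__ == "__main__":
    key = b"example-key-kept-elsewhere"
    rows = [
        {"pseudonym": pseudonym_for("P-0001", key), "systolic_bp": 128},
        {"pseudonym": pseudonym_for("P-0002", key), "systolic_bp": 141},
    ]
    print(find_subject_records(rows, "P-0001", key))
    print(find_subject_records(rows, "P-9999", key))
```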

- Right to rectification (see “Right to rectification” section in “Data subject rights” chapter)

You must guarantee the right to rectification of the data, especially those generated by the inferences and profiles drawn up by an AI tool. Even though the purpose of training data is to train models based on general patterns in large datasets, so that individual inaccuracies are less likely to have any direct effect on a data subject, the right to rectification cannot be limited. At most, you may extend the period for responding by two further months if the technical procedure is particularly complex (art. 12(3)).

- Right to erasure (see “Right to erasure” section in “Data subject rights” chapter)

Data subjects hold a right to request the deletion of their personal data. However, this right might be limited if some concrete circumstances apply. According to the ICO, “organizations may also receive requests for erasure of training data. Organizations must respond to requests for erasure, unless a relevant exemption applies and provided the data subject has appropriate grounds. For example, if the training data is no longer needed because the ML model has already been trained, the organization must fulfil the request. However, in some cases, where the development of the system is ongoing, it may still be necessary to retain training data for the purposes of re-training, refining and evaluating an AI tool. In this case, the organization should take a case-by-case approach to determining whether it can fulfil requests. Complying with a request to delete training data would not entail erasing any ML models based on such data, unless the models themselves contain that data or can be used to infer it.”^[6]
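
The case-by-case approach described in the quotation can be supported by a small amount of bookkeeping. The sketch below is an assumption-laden example, not a compliance procedure: it presumes that training rows carry a `pseudonym` field and that you keep a record (`model_trained_on`, a hypothetical name) of which pseudonyms were used in the last training run. It removes the subject's rows and flags whether the trained model should be reviewed to check if it contains the data or could be used to infer them.

```python
def erase_subject(training_rows, pseudonym, model_trained_on):
    """Remove a data subject's rows from the training set.

    Returns the remaining rows, the number of rows removed, and a flag
    indicating that the existing model was trained on the erased data and
    therefore needs a case-by-case review.
    """
    kept = [row for row in training_rows if row["pseudonym"] != pseudonym]
    removed = len(training_rows) - len(kept)
    # The flag only requests a review; it does not decide that the model
    # must be deleted or retrained.
    needs_model_review = removed > 0 and pseudonym in model_trained_on
    return kept, removed, needs_model_review


if __name__ == "__main__":
    rows = [
        {"pseudonym": "a1", "systolic_bp": 128},
        {"pseudonym": "b2", "systolic_bp": 141},
        {"pseudonym": "a1", "systolic_bp": 131},
    ]
    remaining, removed, review = erase_subject(rows, "a1", model_trained_on={"a1", "b2"})
    print(remaining, removed, review)
```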

References

¹ Shearer, Colin, The CRISP-DM Model: The New Blueprint for Data Mining, p. 17.

² AEPD, Adecuación al RGPD de tratamientos que incorporan Inteligencia Artificial. Una introducción, 2020, p. 40. At: https://www.aepd.es/sites/default/files/2020-02/adecuacion-rgpd-ia.pdf Accessed 15 May 2020.

³ Reisman, D., Crawford, K., Whittaker, M., Algorithmic impact assessments: A practical framework for public agency accountability, 2018. At: https://ainowinstitute.org/aiareport2018.pdf Accessed 15 May 2020.

⁴ ICO, Guidance on the AI auditing framework, draft for consultation. At: https://ico.org.uk/media/about-the-ico/consultations/2617219/guidance-on-the-ai-auditing-framework-draft-for-consultation.pdf Accessed 15 May 2020.

⁵ Chouldechova, Alexandra, Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments, Big Data, Volume 5, Issue 2, 1 June 2017, pp. 153-163. At: http://doi.org/10.1089/big.2016.0047

⁶ ICO, Enabling access, erasure, and rectification rights in AI tools. At: https://ico.org.uk/about-the-ico/news-and-events/ai-blog-enabling-access-erasure-and-rectification-rights-in-ai-systems/ Accessed 15 May 2020.