“In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase may be necessary. Modeling steps include the selection of the modeling technique, the generation of test design, the creation of models, and the assessment of models.”
This phase involves several key tasks. Overall, you must:
- Select the modeling technique that will be used. Depending on the technique chosen, consequences such as data inference, opacity or bias are more or less likely to occur.
- Decide on the training tool to be used. This enables the developer to measure how well the model can predict the past before using it to predict the future. In the case of crime prediction, this can be a problem in itself. It is not like predicting that someone who loves yoghurt will buy it again: we are talking about human beings and their chances in life. Assuming that someone will re-offend because they did something illegal in the past all but ignores the fact that we think of citizens as humans with free will and the chance to make a better decision next time. It is inherently problematic to assume that the future will be an extrapolation of the past. Depending on the individual and societal consequences, this might be less of a problem in some cases and unjustifiable in others.
Training always involves empirical testing with data. Sometimes, developers test the model with data that are different from those used to generate it. At this stage, one might therefore distinguish between different types of datasets, such as training, validation and test sets.
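As an illustration of testing a model on data other than those used to generate it, the following minimal sketch splits a set of records into a training part and a held-out test part. The function name `holdout_split` and its parameters are hypothetical and chosen only for this example; real projects typically rely on the splitting utilities of their machine learning library.

```python
import random

def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

# Illustrative data only: ten record identifiers.
records = list(range(10))
train, test = holdout_split(records, test_fraction=0.3, seed=42)
```

Here the model would be fitted on `train` only and evaluated on `test`, so that the measured performance is not inflated by records the model has already seen.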
Main actions that need to be addressed
Implementing data minimization principle
According to the data minimization principle, you must reduce the amount of data, and/or the range of information about the data subject that they provide, as soon as possible. Consequently, you have to purge the data used during the training phase of all information not strictly necessary for training the model (see “Temporal aspect” subsection in “Data minimization” section in “Principles” chapter). There are multiple strategies to ensure data minimization at the training stage, and techniques are continuously evolving. Some of the most common are the following (see also “Integrity and confidentiality” section in “Principles” chapter):
- Analysis of the conditions that the data must fulfil in order to be considered of high quality and with a great predictive capacity for the specific application.
- Critical analysis of the extent of the data typology used in each stage of the AI solution.
- Deletion of unstructured data and unnecessary information collected during the pre-processing of the information.
- Identification and suppression of those categories of data that do not have a significant influence on learning or on the outcome of the inference.
- Suppression of irrelevant conclusions associated with personal information during the training process, for example, in the case of unsupervised training.
- Use of verification techniques that require less data, such as cross-validation.
- Analysis and configuration of algorithmic hyperparameters that could influence the amount or extent of data processed, in order to minimize them.
- Use of federated rather than centralized learning models.
- Application of differential privacy strategies.
- Training with encrypted data using homomorphic techniques.
- Data aggregation.
- Anonymization and pseudonymization, not only in data communication but also in the training data, in any personal data contained in the model, and in the processing of inference.
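Three of the strategies listed above (suppression of unneeded categories of data, pseudonymization and differential privacy) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production implementation: the field names, the allowlist `NEEDED_FIELDS`, the key and the function names are all hypothetical. Note that pseudonymized data remain personal data, and that the key must be stored separately from the dataset.

```python
import hashlib
import hmac
import random

# Hypothetical allowlist: the only fields the model is assumed to need.
NEEDED_FIELDS = {"age_band", "prior_convictions"}

def pseudonymize(identifier, secret_key):
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def minimize(record, secret_key):
    """Keep only allowlisted fields and swap the name for a pseudonym."""
    reduced = {k: v for k, v in record.items() if k in NEEDED_FIELDS}
    reduced["pid"] = pseudonymize(record["name"], secret_key)
    return reduced

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(values, predicate, epsilon, rng):
    """Noisy count; a counting query has sensitivity 1, so scale = 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

key = b"example-key-kept-separately"  # hypothetical key, for illustration only
raw = {"name": "Jane Doe", "address": "1 Main St",
       "age_band": "30-39", "prior_convictions": 1}
reduced = minimize(raw, key)  # the address is dropped, the name is replaced
noisy = dp_count([0, 1, 2, 3], lambda n: n >= 2, epsilon=1.0, rng=random.Random(0))
```

A smaller `epsilon` adds more noise and therefore gives stronger privacy at the cost of accuracy. Choosing its value is a design decision that depends on the application.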
Detecting and erasing biases
Even if mechanisms against bias have been duly adopted in previous stages (see the section about training above), it is still necessary to ensure that the results of the training phase minimize biases. This can be difficult, since some types of bias and discrimination are particularly hard to detect. The team members who curate the input data are sometimes unaware of them, and the users who are subject to them are not necessarily cognisant of them either. Thus, the monitoring systems implemented by the AI developer in the validation stage are extremely important for avoiding biases.
There are many tools that might help to detect biases, such as the Algorithmic Impact Assessment. You must consider implementing them effectively. However, as the literature shows, an algorithm cannot always be purged of every type of bias. You should nevertheless try at least to be aware of their existence and of the implications that they might bring (see “Lawfulness, Fairness and Transparency” and “Accuracy” sections in “Principles” chapter).
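An Algorithmic Impact Assessment is a broad governance process rather than a single metric. As a complementary numerical check, one simple indicator is whether the model's false positive rate differs between groups. The sketch below assumes validation rows of the form (group, actual outcome, predicted outcome), with outcomes coded as 0 or 1. The function name and the data are hypothetical, and a gap in this one metric is only a signal that warrants further analysis, not proof of discrimination or of its absence.

```python
from collections import defaultdict

def false_positive_rates(rows):
    """Return the false positive rate per group.

    Each row is (group, actual, predicted), with actual/predicted as 0 or 1.
    Groups without any actual negatives are omitted from the result.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, actual, predicted in rows:
        if actual == 0:
            negatives[group] += 1
            if predicted == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / n for g, n in negatives.items() if n > 0}

# Illustrative validation results for two hypothetical groups.
rows = [
    ("A", 0, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 0, 1), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),
]
rates = false_positive_rates(rows)
gap = max(rates.values()) - min(rates.values())
```

In this toy data, group B is wrongly flagged twice as often as group A, which is the kind of disparity that should be investigated before deployment.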
Exercising data subjects’ rights
Sometimes, developers complete the available data through inference. For instance, if you do not have factual data on the political opinions of an offender, you might use another algorithm to infer them from the rest of the data, such as observed participation in demonstrations. However, this by no means implies that these data can be considered pseudonymized or anonymized: inferred data must also be regarded as personal data. Data subjects therefore have fundamental rights over these data that you must respect.
Indeed, you must respect data subjects’ rights during the whole life cycle. In this specific stage, the rights of access, rectification and erasure are particularly sensitive and have certain characteristics that controllers need to be aware of. However, in the case of research for scientific purposes, such as the research you are carrying out, the GDPR includes some safeguards and derogations relating to processing (Art. 89). You must be aware of the specific regulation in your Member State. According to the GDPR, Union or Member State law may provide for derogations from the main rights included in Articles 15 et seq. in so far as such rights are likely to render impossible or seriously impair the achievement of the specific purposes, and such derogations are necessary for the fulfilment of those purposes.
- Right of access (see “Right to access” section in “Data subject rights” chapter)
In principle, you shall respond to data subjects’ requests to gain access to their personal data, provided that you have taken reasonable measures to verify the identity of the data subject and no other exceptions apply. However, you do not have to collect or maintain additional personal data to enable the identification of data subjects in training data for the sole purpose of complying with the regulation. If you cannot identify a data subject in the training data and the data subject cannot provide additional information that would enable their identification, you are not obliged to fulfil a request that is impossible to satisfy.
- Right to rectification (see “Right to rectification” section in “Data subject rights” chapter)
You must guarantee the right to rectification of the data, especially those generated by the inferences and profiles drawn up by an AI tool. Even though the purpose of training data is to train models based on general patterns in large datasets, and individual inaccuracies are thus less likely to have any direct effect on a data subject, the right to rectification cannot be limited. At most, you could extend the response period by two further months if the technical procedure is particularly complex (Art. 12(3)).
- Right to erasure (see “Right to erasure” section in “Data subject rights” chapter)
Data subjects have the right to have their personal data erased. However, this right might be limited in certain specific circumstances. According to the British ICO, “organizations may also receive requests for erasure of training data. Organizations must respond to requests for erasure, unless a relevant exemption applies and provided the data subject has appropriate grounds. For example, if the training data is no longer needed because the ML model has already been trained, the organization must fulfil the request. However, in some cases, where the development of the system is ongoing, it may still be necessary to retain training data for the purposes of re-training, refining and evaluating an AI tool. In this case, the organization should take a case-by-case approach to determining whether it can fulfil requests. Complying with a request to delete training data would not entail erasing any ML models based on such data, unless the models themselves contain that data or can be used to infer it.”
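The technical side of an erasure request can be illustrated with a minimal sketch. It assumes that each training row carries a hypothetical pseudonymous identifier under the key `"pid"` and that the caller already knows whether the trained model itself stores training rows (as some model types, such as nearest-neighbour models, do). The function name and the structure of the result are illustrative only, and the legal assessment of whether a request must be fulfilled remains a separate, case-by-case decision.

```python
def erase_subject(training_rows, subject_pid, model_stores_examples):
    """Remove one data subject's rows and flag whether the model needs review.

    training_rows: list of dicts, each with a hypothetical "pid" key.
    model_stores_examples: True if the trained model itself retains training
    rows (an assumption supplied by the caller).
    """
    kept = [row for row in training_rows if row.get("pid") != subject_pid]
    removed = len(training_rows) - len(kept)
    return {
        "rows": kept,
        "removed": removed,
        # If the model contains the erased data, the model must be reviewed too.
        "model_review_needed": removed > 0 and model_stores_examples,
    }

# Illustrative data only.
rows = [{"pid": "a1", "x": 1}, {"pid": "b2", "x": 2}, {"pid": "a1", "x": 3}]
result = erase_subject(rows, "a1", model_stores_examples=True)
```

Keeping a pseudonymous identifier alongside each training row is itself a design choice: it makes such requests feasible to fulfil, but it also means the training data cannot be treated as anonymous.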
1 Colin Shearer, The CRISP-DM Model: The New Blueprint for Data Mining, p. 17.
2 AEPD, Adecuación al RGPD de tratamientos que incorporan Inteligencia Artificial. Una introducción, 2020, p. 40. At: https://www.aepd.es/sites/default/files/2020-02/adecuacion-rgpd-ia.pdf Accessed 15 May 2020.
3 Reisman, D., Crawford, K., Whittaker, M., Algorithmic impact assessments: A practical framework for public agency accountability, 2018. At: https://ainowinstitute.org/aiareport2018.pdf Accessed 15 May 2020.
5 Chouldechova, Alexandra, Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments, Big Data, Volume 5, Issue 2, June 1, 2017, 153-163. http://doi.org/10.1089/big.2016.0047
6 ICO, Enabling access, erasure, and rectification rights in AI tools. At: https://ico.org.uk/about-the-ico/news-and-events/ai-blog-enabling-access-erasure-and-rectification-rights-in-ai-systems/ Accessed 15 May 2020.