AI developers should know from the outset what they expect the tool to be capable of doing. The less accurate these expectations are, the more difficult it becomes to determine the precise purposes of the processing (see the “Prerequisites to lawfulness: specified, explicit purposes” subsection in the “Lawfulness, fairness and transparency” section of the “Principles” chapter). Since controllers must make the purposes of processing explicit, that is, “revealed, explained or expressed in some intelligible way”, accurate expectations are strongly recommended. However, one must distinguish between the stages of the AI development life cycle. At the training stage, the use of large amounts of data might be essential to estimate the concrete utility of the tool. Processing big datasets might therefore be acceptable even though the specific end (developing the AI tool) is not yet precisely defined. This would be far less acceptable at the last stage of the process, that is, the deployment and use of the tool. If, at that stage, the controller needed to use a large amount of data, a much more detailed justification would be required.
In any case, some key ideas must be kept in mind from the very beginning. For instance, deciding the level of predictive accuracy the project must reach to be considered a success is essential to assess the amount of data that will be needed to develop the AI tool, as well as the nature of that data. The level of predictability or precision of the algorithm, the validation criteria used to test it, the maximum quantity or the minimum quality of the data needed to use it in the real world, and so on, are fundamental features of an AI development. These key decisions should be considered from the first stage of the solution’s life cycle, which will be extremely helpful in implementing a data protection by design policy (see “Data protection by design” in the “Concepts” part of the Guidelines).
Thus, the AI developer should set acceptable thresholds or ranges for false positives/negatives, depending on the use case, and then perform a utility balance. The AI developer must be aware that the expected level of accuracy is closely linked to the amount of data needed. Developing a product for healthcare, for example, is not the same as developing one for recommending TV series. Even within the health sector, a tool that performs a first triage (that is, recommending whether a primary care physician or a specialist should intervene) differs from a solution that aims to support radiologists in diagnosis. Depending on what the mechanism is intended to do, higher or lower accuracy requirements will apply.
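The threshold-setting step described above can be sketched in code, purely for illustration. The function names and the numeric thresholds below are invented, not drawn from the Guidelines; a real project would set thresholds through its own risk assessment.

```python
def error_rates(tp, fp, tn, fn):
    """Compute false positive and false negative rates from confusion-matrix counts."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

def within_thresholds(tp, fp, tn, fn, max_fpr, max_fnr):
    """Check whether a model's error rates stay inside use-case-specific thresholds."""
    fpr, fnr = error_rates(tp, fp, tn, fn)
    return fpr <= max_fpr and fnr <= max_fnr

# Hypothetical thresholds for two very different use cases: a medical triage
# tool tolerates some false positives but almost no false negatives, while a
# TV-series recommender can be lenient on both.
TRIAGE = {"max_fpr": 0.20, "max_fnr": 0.02}
RECOMMENDER = {"max_fpr": 0.30, "max_fnr": 0.30}
```

The point of the sketch is that the same model evaluation can pass for one use case and fail for another, which is why thresholds must be fixed per use case before the utility balance is performed.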
If an acceptable level of accuracy can be reached with considerably less personal data than a higher level of accuracy would require, this option should be strongly considered. Furthermore, AI developers must keep in mind that a marginal increase in predictive accuracy sometimes calls for a significant increase in the amount of personal data needed. Therefore, if they are considering a fundamental modification of the required level of accuracy, they should carefully assess whether it is consistent with the data minimization principle (see “Data minimization principle” in the “Principles” part of the Guidelines).
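The trade-off between accuracy and the amount of personal data can be made concrete with a small sketch. The learning curve below is simulated with invented numbers solely to show the diminishing returns the text describes; in practice the `evaluate` callable would be the project's own train-and-validate pipeline.

```python
def minimal_dataset_size(evaluate, candidate_sizes, target_accuracy):
    """Return the smallest training-set size whose evaluated accuracy meets the
    target, or None if no candidate reaches it. This operationalizes data
    minimization: use the least personal data compatible with an acceptable
    level of accuracy."""
    for n in sorted(candidate_sizes):
        if evaluate(n) >= target_accuracy:
            return n
    return None

def simulated_accuracy(n):
    # Hypothetical learning curve with diminishing returns: each additional
    # record contributes less accuracy than the previous one.
    return 0.95 - 0.9 / (1 + n / 1000)

sizes = [1_000, 5_000, 20_000, 100_000]
```

With these invented numbers, an accuracy target of 0.90 is met with 20,000 records, while raising the target to 0.94 requires 100,000: a marginal accuracy gain demanding five times as much data, which is exactly the situation in which the data minimization principle must be weighed.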