“The initial business understanding phase focuses on understanding the project objectives from a business perspective, converting this knowledge into a data mining problem definition, and then developing a preliminary plan designed to achieve the objectives. In order to understand which data should later be analysed, and how, it is vital for data mining practitioners to fully understand the business for which they are finding a solution. The business understanding phase involves several key steps, including determining business objectives, assessing the situation, determining the data mining goals, and producing the project plan.”
This general objective involves four main tasks:
- Determine the Business Objectives. This means:
  - Uncover the primary business objective as well as the related questions the business would like to address.
  - Determine the measure of success.
- Assess the Situation:
  - Identify the resources available to the project, both material and personnel.
  - Identify what data is available to meet the primary business goal.
  - List the assumptions made in the project.
  - List the project risks, list potential solutions to those risks, create a glossary of business and data mining terms, and construct a cost-benefit analysis for the project.
- Determine the Data Mining Goals: decide what level of predictive accuracy must be reached for the project to be considered successful.
- Produce a Project Plan: describe the intended plan for achieving the data mining goals, including outlining specific steps and a proposed timeline, an assessment of potential risks, and an initial assessment of the tools and techniques needed to support the project.
Main actions that need to be addressed
Defining business objectives
The first thing to clarify when you want to create an AI tool is what you want to achieve. In the case of a tool that identifies a pathology from an X-ray, the tool may serve different purposes, for example:
- It is meant to serve as a support for the radiologist’s work.
- It may be used to support the work of a primary care physician, that is, to determine whether to refer the patient to a specialist.
- It can also be designed to replace the physician and make a diagnosis of, for example, COVID on its own.
- It can be used to perform a first triage (that is, recommending whether a primary care physician or a specialist should intervene).
Each of these scenarios has vastly different characteristics. Some of them require a higher level of accuracy than others. For example, if you intend to replace the health professional, the AI must reach a very high level of accuracy.
The ethical and legal implications of the different purposes are likewise vastly different. If the mechanism is to be used for automated decision-making, as in the third and fourth scenarios above, the processing of the data will be subject to a considerably stricter legal regime. In fact, in many countries such use may be outright illegal.
All these considerations must be borne in mind from the outset. The development process should not be initiated until you, as the controller, have clarified what results are to be achieved, because this issue is key in determining whether or not the planned data processing is in line with the GDPR. Deciding what level of predictive accuracy is required for the project to be considered successful is essential to assess the amount and nature of the data that will be needed to develop the AI tool. The level of predictability or precision of the algorithm, the validation criteria to test it, the maximum quantity or the minimum quality of the personal data that will be necessary to use it in the real world, etc., are fundamental features of an AI development.
These key development elements should be considered from the first stage of the solution’s life cycle. This will be extremely helpful in implementing a data protection by design policy. If an acceptable level of accuracy can be reached using considerably less personal data than a higher level requires, this option should be strongly considered. The less precise these assessments are, the more difficult it becomes to determine the precise purposes pursued by the processing (see “Prerequisites to lawfulness specified, explicit purposes” subsection in “Lawfulness, fairness and transparency” section in “Principles” chapter). Given that controllers must make the purposes of processing explicit, that is, “revealed, explained or expressed in some intelligible way”, accurate expectations are strongly recommended.
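The trade-off between accuracy and the amount of personal data can be made concrete with a small sketch. It assumes that a pilot study has already measured validation accuracy for several dataset sizes; the function name, the sizes and the accuracy figures below are hypothetical and only illustrate the reasoning.

```python
from typing import Mapping, Optional


def smallest_sufficient_sample(
    accuracy_by_size: Mapping[int, float],
    target_accuracy: float,
) -> Optional[int]:
    """Return the smallest evaluated dataset size that meets the target.

    Returns None when no evaluated size reaches the target accuracy.
    """
    sufficient = [n for n, acc in accuracy_by_size.items() if acc >= target_accuracy]
    return min(sufficient) if sufficient else None


# Hypothetical pilot results: number of X-rays used -> validation accuracy.
# These figures are illustrative assumptions, not real measurements.
pilot_results = {500: 0.81, 2_000: 0.88, 8_000: 0.91, 32_000: 0.92}

chosen_size = smallest_sufficient_sample(pilot_results, target_accuracy=0.90)
print(chosen_size)
```

If the accuracy target fixed at this stage is met by a smaller dataset, documenting that result supports the decision not to collect or process more personal data than needed.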
Opting for the technical solutions
In general, you should always favor more understandable algorithms over less understandable ones. Trade-offs between the explainability/transparency and best performance of the system must be appropriately balanced based on the context of use. Even though in healthcare the accuracy and performance of the system may be more important than its explainability, you should always keep in mind that explaining a recommendation could serve to train doctors, to provide adequate information to patients who have to choose between different possible treatments, or to justify a triage decision, for instance. Thus, if a very similar service can be offered either through an easy-to-understand algorithm or an opaque one, that is, when there is no trade-off between explainability and performance, you should opt for the one that is more interpretable (see “Lawfulness, fairness and transparency” section in “Principles” chapter).
Implementing training on ethical and legal issues
This action is one of the most important pieces of advice to be considered from the very first moment of an AI business development. Algorithm designers (developers, programmers, coders, data scientists, engineers), who occupy the first link in the algorithmic chain, are likely to be unaware of the ethical and legal implications of their actions. If all intervening staff are in close contact with the data subjects, ethical considerations are easier to implement. However, this will probably not be your case. Indeed, one of the main problems that an AI tool devoted to health care issues faces is that it generally uses personal data included in large datasets. This somehow blurs the relationship between the data and the data subject, leading to violations of the regulations that rarely occur when the controller and the subject have a direct relationship.
This could have serious consequences for compliance with data protection standards, mainly because special categories of data are at stake. It is paramount that these key workers have the fullest possible awareness of the ethical and social implications of their work, and of the very fact that these can even extend to societal choices, which they should not by rights be able to judge alone. A silo mentality must be actively countered.
To prevent a misunderstanding of the ethical and legal issues from causing unwanted consequences, two main courses of action can be adopted. First, developers might try to ensure that algorithm designers understand the implications of their actions, both for individuals and society, and are aware of their responsibilities by learning to show continued attention and vigilance. In that sense, optimal training for everyone involved in the project (developers, programmers, coders, data scientists, engineers, researchers), delivered even before it starts, could be one of the most efficient tools to save time and resources in terms of compliance with data protection regulation. Basic training programs should therefore be implemented that include at least the fundamentals of the Charter of Fundamental Rights, the principles set out in Article 5 of the GDPR, the need for a legal basis for processing (including contracts between the parties), etc.
However, training people who have never been in touch with data protection issues might be hard. An alternative policy is the involvement of an expert on data protection, ethical and legal issues in the development team, so as to create an interdisciplinary team. This might be done by hiring an expert for this purpose (an internal worker or an external consultant) to design the strategy and the subsequent decisions on personal data required by the development of the tools, with the close involvement of the Data Protection Officer.
Adopting adequate measures to ensure confidentiality is also strongly recommended (see “Measures in support of confidentiality” subsection in the “Integrity and confidentiality” section in “Principles” chapter).
Designing a legitimate data processing tool
According to Article 5(1)(b) of the GDPR, personal data shall be “collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes”. The concept of legitimacy is not well defined in the GDPR, but the Article 29 Working Party stated that legitimacy requires that data be processed “in accordance with the law”, and “law” should be understood as a broad concept that includes “all forms of written and common law, primary and secondary legislation, municipal decrees, judicial precedents, constitutional principles, fundamental rights, other legal principles, as well as jurisprudence, as such ‘law’ would be interpreted and taken into account by competent court”.
Therefore, it is a wider concept than lawfulness. It involves compliance with the main values of the applicable regulation and the main ethical principles at stake. For instance, some concrete AI developments will need the intervention of an ethics committee. In other cases, guidelines or any other kind of soft regulation might be applicable. You should ensure adequate compliance with this requirement by designing a plan from this preliminary stage of the lifecycle of the tool (see “Legitimacy and lawfulness” part in “Lawfulness, fairness and transparency” in “Principles” chapter). To this end, you should be particularly aware of the requirements posed by the applicable regulation at the national level. In many Member States, developing an algorithm related to health care will involve the intervention of Ethics Committees, most probably at a preliminary stage. Make sure that your research plan fits well with such requirements.
Adopting a risk-based thinking approach
Since the creation of your algorithm will involve the use of a huge amount of special categories of personal data, mainly health data, you must ensure that you implement appropriate measures to minimize the risks to data subjects’ rights and freedoms (see “Integrity and confidentiality” in “Principles” chapter). To this end, you must assess the risks to the rights and freedoms of individuals participating in the research and development process and judge what is appropriate to protect them. In all cases, you need to ensure that these measures comply with data protection requirements.
Risk-based thinking with regard to confidentiality of data, or a risk-based approach to questions of what harm may be done to people/data subjects, must be included from the first steps of the process. Considering it only at a later stage might have legal consequences for the data controller in relation to the obligations stipulated in the GDPR. Thus, you must identify the implicit threats to the planned data processing and assess the level of intrinsic risk involved. If you are planning to use software for processing purposes, you should ensure that adequate measures in support of confidentiality are implemented. If your AI will use third-party or off-the-shelf software, it is vital that functions that process personal data without a legal basis, or that are not compatible with the intended purposes, are excluded.
Whenever possible, try to avoid using data storage or software services that are located in a third country. If this is unavoidable, you must ensure that your data processing contracts with those third parties provide adequate GDPR-compliant protection or, if this is not the case, ensure that the research participants are fully aware of the privacy/security risks to their data. You should also be informed about the security measures implemented by data storage and software service providers, and be aware that omissions in security may result in a breach of the security of processing.
In addition, you must ensure that appropriate technical and organizational measures are implemented to eliminate, or at least mitigate, the risk, by reducing the probability that the identified threats will materialize or by reducing their impact. The security measures must become a part of your records of processing and all implemented measures will be part of the DPIA.
Once the selected measures are implemented, the remaining residual risk should be assessed and kept under control. Both the risk analysis and the DPIA are the tools that apply. In your concrete case, you must carry out a DPIA, since the creation of the AI tool will involve the processing on a large scale of special categories of data.
Finally, do not forget that when using big data and AI it is hard to foresee what the future risks will be, so a one-time assessment of ethical implications will not be sufficient to address all possible risks. Therefore, it is important to plan for a reassessment of risks, and it is also highly recommended to integrate a more dynamic way of assessing research risks. Do not hesitate to perform additional DPIAs at other stages of the process if need be.
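The reasoning in this section (identify a threat, apply measures that lower its probability or its impact, then assess the residual risk) can be sketched as follows. The 1 to 5 scales, the multiplication of likelihood by impact, the acceptance threshold and the example threat are all assumptions made for illustration; the GDPR does not prescribe a scoring formula, and your DPIA methodology may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (almost certain)
    impact: int      # hypothetical scale: 1 (negligible) to 5 (severe)

    def score(self) -> int:
        # Simple likelihood x impact matrix score (an assumed convention).
        return self.likelihood * self.impact


def apply_measure(risk: Risk, likelihood_reduction: int = 0, impact_reduction: int = 0) -> Risk:
    """Return the residual risk after a mitigation measure, floored at 1."""
    return Risk(
        name=risk.name,
        likelihood=max(1, risk.likelihood - likelihood_reduction),
        impact=max(1, risk.impact - impact_reduction),
    )


ACCEPTABLE_SCORE = 6  # hypothetical threshold set by the project

inherent = Risk("re-identification of patients from X-ray metadata", likelihood=4, impact=5)
residual = apply_measure(inherent, likelihood_reduction=2)  # e.g. metadata stripped

needs_further_measures = residual.score() > ACCEPTABLE_SCORE
print(inherent.score(), residual.score(), needs_further_measures)
```

Re-running such an evaluation whenever the datasets, the software or the purposes change is one simple way to keep the residual risk under control over time.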
Preparing the documentation of processing
Whoever processes personal data (including both controllers and processors) needs to document their activities, primarily for the use of the relevant Supervisory Authorities. You must do this through records of processing that are maintained centrally by your organization across all its processing activities, and additional documentation that pertains to an individual data processing activity (see “Documentation of Processing” section in “Main tools and actions” chapter). This preliminary stage is the perfect moment to set up a systematic way of collecting the necessary documentation, since it is the time when you can conceive and plan the processing activity.
Indeed, you should create a Data Protection Policy that allows the traceability of information (see “Economy of scale for compliance and its demonstration” subsection in “Accountability” section in “Principles” chapter); if approved codes of conduct exist, these should be implemented (see the same subsection). This Policy should also make clear the responsibilities assigned to processors, if you intend to involve them in your project, and include in the processing agreement the tasks that will be delegated to them in relation to the exercise of data subjects’ rights. You should always remember that Art. 32(4) GDPR clarifies that an important element of security is to ensure that employees act only on instruction and as instructed by you (see “Integrity and Confidentiality” section in “Principles” chapter).
The development of your AI tool might involve the use of different datasets. The traceability of the processing, the information about possible re-use of data, and the use of data pertaining to different datasets in different or in the same stages of the life cycle must be ensured by the records.
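As a minimal sketch of what such traceability could look like in practice, the structure below records which dataset is used in which life-cycle stage and whether that use is a re-use. The field names, stage labels and dataset identifiers are hypothetical and do not reproduce the content that the GDPR requires for records of processing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DatasetUse:
    dataset_id: str        # hypothetical internal identifier
    lifecycle_stage: str   # e.g. "training", "validation", "deployment"
    purpose: str
    is_reuse: bool         # True when the data were collected for another purpose


def stages_by_dataset(uses: List[DatasetUse]) -> Dict[str, List[str]]:
    """Index the life-cycle stages in which each dataset is used."""
    index: Dict[str, List[str]] = defaultdict(list)
    for use in uses:
        index[use.dataset_id].append(use.lifecycle_stage)
    return dict(index)


# Illustrative entries only.
log = [
    DatasetUse("hospital-a-xrays", "training", "train pathology classifier", is_reuse=True),
    DatasetUse("hospital-a-xrays", "validation", "estimate accuracy", is_reuse=True),
    DatasetUse("pilot-study-xrays", "validation", "external validation", is_reuse=False),
]

print(stages_by_dataset(log))
print([u.dataset_id for u in log if u.is_reuse])
```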
As stated in the “Requirements and acceptance tests for the purchase and/or development of the employed software, hardware, and infrastructure” subsection of the “Documentation of Processing” section, the risk evaluation and the decisions taken “have to be documented in order to comply with the requirement of data protection by design (of Art. 25 GDPR). Practically, this can take the form of:
- Data protection requirements specified for the purchase (e.g., a tender) or development of software, hardware and infrastructure,
- Acceptance tests that verify that the chosen software, systems and infrastructure are fit for purpose and provide adequate protection and safeguards.
Such documentation can be an integral part of the DPIA.”
Finally, you should always be aware that, according to Art. 32(1)(d) of the GDPR, data protection is a process. Therefore, you should test, assess, and evaluate the effectiveness of technical and organizational measures regularly. This is an excellent moment to build a strategy aimed at facing these challenges.
Regulatory framework usage
The GDPR includes a specific regulatory framework regarding processing for the purposes of scientific research (see “Data protection and scientific research” section in “Concepts” chapter). Your AI development constitutes scientific research, irrespective of whether it is created for profit or not. Therefore, “Union or Member State law may provide for derogations from the rights referred to in Articles 15, 16, 18 and 21 subject to the conditions and safeguards referred to in paragraph 1 of this Article in so far as such rights are likely to render impossible or seriously impair the achievement of the specific purposes, and such derogations are necessary for the fulfillment of those purposes” (Art. 89(2)). Furthermore, according to Article 5(1)(b), “further processing of the data gathered, in accordance with Article 89(1), would not be considered to be incompatible with the initial purposes (‘purpose limitation’). Some other particular exceptions to the general framework applicable to processing for research purposes (such as storage limitation) should also be considered”.
You certainly might profit from this favorable framework. Nevertheless, you must be aware of the concrete regulatory framework that applies to this research (mainly, the safeguards to be implemented). It might include important changes depending on respective national regulations. Consultation with your DPO is highly recommended for this purpose.
Defining adequate data storage policies
According to Article 5(1)(e) GDPR, personal data should be “kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed” (see “Storage limitation” section in “Principles” chapter). This requirement is twofold. On the one hand, it relates to identification: data should be stored in a form which permits identification of data subjects for no longer than necessary. Consequently, you should implement policies that remove the possibility of identification as soon as it is no longer necessary for processing. This involves adopting adequate measures to ensure that, at any moment, only the minimal degree of identification necessary to fulfill the purposes is used (see “Temporal aspect” subsection in “Storage limitation” section in “Principles” chapter).
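One possible way to reduce the degree of identification in a training dataset is to replace direct identifiers with keyed hashes. The sketch below uses HMAC-SHA256 from the Python standard library; the key, the identifier format and the record fields are hypothetical. Note that this is pseudonymization, not anonymization: as long as the key exists, the output remains personal data under the GDPR.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and record. In practice the key would be stored separately
# from the pseudonymized dataset, under restricted access.
key = b"example-key-kept-outside-the-dataset"
record = {"patient_id": "HOSP-000123", "finding": "opacity, left lower lobe"}

# Copy the record, replacing only the identifier with its pseudonym.
training_row = {**record, "patient_id": pseudonymize(record["patient_id"], key)}
print(training_row["patient_id"][:16])
```

The same identifier always maps to the same pseudonym under a given key, so records belonging to one patient can still be linked for training purposes without exposing the original identifier.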
On the other hand, storage limitation implies that data can only be stored for a limited period: the time that is strictly necessary for the purposes for which the data are processed. However, the GDPR permits ‘storage for longer periods’ if the sole purpose is scientific research (as in your concrete case).
This exception raises the risk that you decide to keep the data longer than strictly needed so that they remain available for purposes other than those for which they were originally collected. Do not do so unless there are good reasons for it (for instance, if X-rays come from a medical record, you must keep them in the patient’s clinical record). You must be aware that even though the GDPR might allow storage for longer periods, you should have a good reason to opt for such an extended period. Thus, if you do not need the data, and no legal obligation requires you to retain them, it is better to anonymize or delete them. This could also be an excellent moment to envisage time limits for erasure of the different categories of data and to document these decisions (see “Accountability Principle” in “Principles” chapter).
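Time limits for erasure can be documented as a simple retention schedule and checked automatically. The sketch below is only an illustration: the data categories and retention periods are hypothetical and would have to be set according to your purposes and the applicable national rules.

```python
from datetime import date, timedelta

# Hypothetical retention schedule: data category -> retention period in days.
RETENTION_DAYS = {
    "raw_xray_images": 365,
    "pseudonymized_training_set": 730,
}


def erasure_due(category: str, collected_on: date, today: date) -> bool:
    """Return True when the retention period for this category has elapsed."""
    deadline = collected_on + timedelta(days=RETENTION_DAYS[category])
    return today >= deadline


# Illustrative check with fixed dates so the result is reproducible.
print(erasure_due("raw_xray_images", date(2023, 1, 10), today=date(2024, 3, 1)))
print(erasure_due("pseudonymized_training_set", date(2023, 1, 10), today=date(2024, 3, 1)))
```

Data flagged by such a check would then be anonymized or deleted, unless a documented legal reason requires keeping them.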
Appointing a Data Protection Officer
According to Article 37 GDPR, you must appoint a DPO since you will process special categories of data pursuant to Article 9 on a large scale. In any case, key personnel within the data controller’s organization should define the role of the DPO in relation to the overall management of the project, ensuring that this role is not marginal but embedded in the decision-making processes of the organization/project. They should also make clear what that role could be in terms of oversight, decision making and similar.
Ensuring compliance with legal framework for medical devices
Even though these Guidelines are mainly oriented to data protection issues, you should be well aware from this preliminary stage that you must ensure adequate compliance with the legal framework related to medical devices. We are mainly referring to Regulation (EU) 2017/745, the Medical Devices Regulation (MDR), and Regulation (EU) 2017/746, the In Vitro Diagnostic Medical Devices Regulation (IVDR). Most probably, there will also be national regulations applicable to these issues. Please take the actions needed to comply with them. You can find helpful Guidelines for this purpose here: https://ec.europa.eu/docsroom/documents/40323
Regarding the regulation of health data at the Member State Level, this resource might be particularly relevant:
1 Shearer, Colin, The CRISP-DM Model: The New Blueprint for Data Mining, p. 14.
2 Ibid., p. 55.
3 This specific framework also includes historical research purposes or statistical purposes. However, ICT research is not usually related to these purposes. Therefore, we will not analyse them here.