“The data preparation phase covers all activities to construct the final data set or the data that will be fed into the modeling tool(s) from the initial raw data. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools. The five steps in data preparation are the selection of data, the cleansing of data, the construction of data, the integration of data, and the formatting of data.”
This stage includes all activities needed to construct, from the initial raw data, the final dataset that is fed into the model. It involves the following five tasks, which are not necessarily performed sequentially.
- Select data. Decide on the data to be used for analysis, based on relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types.
- Clean data. Raise data quality to the required level, for example by selecting clean subsets of the data, inserting defaults, and estimating missing data by modeling.
- Construct data. Construct new data by producing derived attributes, new records, or transformed values for existing attributes.
- Integrate data. Combine data from multiple tables or records to create new records or values.
- Format data. Make syntactic modifications to data that might be required by the modeling tool.
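The five tasks above can be illustrated with a minimal sketch. It uses only the Python standard library; the tables, field names, the median fill rule, and the `is_senior` attribute are all hypothetical choices made for illustration, not part of CRISP-DM itself.

```python
from statistics import median

# Hypothetical raw tables; every field name and value is invented for illustration.
customers = [
    {"id": 1, "age": 34, "country": "ES", "notes": "call back"},
    {"id": 2, "age": None, "country": "es", "notes": ""},
    {"id": 3, "age": 51, "country": "FR", "notes": "vip"},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 3, "amount": 10.0},
]

def select(rows, fields):
    """Select data: keep only the attributes relevant to the goal."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Clean data: fill missing ages with the median and normalize country codes."""
    fill = median(r["age"] for r in rows if r["age"] is not None)
    return [
        {**r,
         "age": r["age"] if r["age"] is not None else fill,
         "country": r["country"].upper()}
        for r in rows
    ]

def integrate(rows, order_rows):
    """Integrate data: add each customer's order total from a second table."""
    totals = {}
    for o in order_rows:
        totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]
    return [{**r, "total_spent": totals.get(r["id"], 0.0)} for r in rows]

def construct(rows):
    """Construct data: derive a new attribute from an existing one."""
    return [{**r, "is_senior": r["age"] >= 50} for r in rows]

def format_rows(rows, columns):
    """Format data: emit fixed-order tuples, as a modeling tool might require."""
    return [tuple(r[c] for c in columns) for r in rows]

selected = select(customers, ["id", "age", "country"])
cleaned = clean(selected)
integrated = integrate(cleaned, orders)
constructed = construct(integrated)
dataset = format_rows(constructed, ["age", "country", "total_spent", "is_senior"])
```

Here integration runs before construction, but as noted above the order is not fixed; a real project would typically iterate over these steps several times.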
The main actions that must be addressed at this stage are:
1 Shearer, C. (2000) ‘The CRISP-DM model: the new blueprint for data mining’, Journal of Data Warehousing 5(4): 13-23, p.16. Available at: https://mineracaodedados.files.wordpress.com/2012/04/the-crisp-dm-model-the-new-blueprint-for-data-mining-shearer-colin.pdf (accessed 15 May 2020).
Checklist: Data preparation
☐ The controllers have ensured that data are accurate, that is, correct and up to date.
☐ If profiling or automated decision making is foreseen:
☐ The controllers have sent individuals a link to their privacy statement when they have obtained their personal data indirectly.
☐ The controllers have explained how people can access details of the information that they used to create their profile.
☐ The controllers have informed the data subjects who provide them with their personal data how they can object to profiling.
☐ The controllers have introduced procedures for customers to access the personal data input into their profiles, so they can review and edit for any accuracy issues.
☐ The controllers have put additional checks in place for their profiling/automated decision-making systems to protect any vulnerable groups (including children).
☐ The controllers have ensured that they only collect the minimum amount of data needed and have a clear retention policy for the profiles that they create.
☐ The controllers have carried out a DPIA to consider and address the risks when they start any new automated decision-making or profiling.
☐ The controllers have involved the corresponding DPO in these activities.
☐ The controllers have considered, from the design phase, the system requirements necessary to support a meaningful human review, particularly the interpretability requirements and the effective user-interface design needed to support human reviews and interventions.
☐ The controllers have designed and delivered appropriate training and support for human reviewers.
☐ The controllers have given the staff involved in the processing the appropriate authority, incentives and support to address or escalate individuals’ concerns and, if necessary, override the AI system’s decision.
☐ The controllers have ensured that the teams in charge of selecting the data to be integrated in the datasets are composed of people who ensure the diversity that the AI development is expected to show.
☐ The controllers have ensured that factors which result in inaccuracies in personal data are corrected and the risk of errors is minimized.
☐ The controllers have implemented tools aimed at preventing discriminatory effects on natural persons on the basis of racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, or that result in measures having such an effect.
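
The data minimization and retention item in the checklist above can be partly supported in code during data selection. The following is only a sketch under stated assumptions: the allow-list `ALLOWED_FIELDS`, the one-year `RETENTION` period, and the record fields are hypothetical, and a real policy must come from the controller's documented purposes rather than from this example.

```python
from datetime import date, timedelta

# Hypothetical policy values, chosen only for illustration.
ALLOWED_FIELDS = {"id", "age", "country", "collected_on"}
RETENTION = timedelta(days=365)

def minimize(record):
    """Drop every attribute that is not on the allow-list."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

def within_retention(record, today):
    """Return True if the record is still inside the retention period."""
    return today - record["collected_on"] <= RETENTION

def prepare(records, today):
    """Return the minimized records still in retention and the ids of expired ones."""
    kept = [minimize(r) for r in records if within_retention(r, today)]
    expired = [r["id"] for r in records if not within_retention(r, today)]
    return kept, expired

# Invented sample records; "religion" stands in for an attribute the model does not need.
records = [
    {"id": 1, "age": 34, "country": "ES", "religion": "x",
     "collected_on": date(2020, 3, 1)},
    {"id": 2, "age": 51, "country": "FR", "religion": "y",
     "collected_on": date(2018, 1, 15)},
]
kept, expired = prepare(records, today=date(2020, 5, 15))
```

An allow-list is used instead of a block-list so that any newly added attribute is excluded by default until someone justifies its inclusion. The list of expired ids is returned separately so that deletion can be handled and logged by whatever process the retention policy prescribes.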