Data understanding

“The data understanding phase starts with an initial data collection. The analyst then proceeds to increase familiarity with the data, to identify data quality problems, to discover initial insights into the data, or to detect interesting subsets to form hypotheses about hidden information. The data understanding phase involves four steps, including the collection of initial data, the description of data, the exploration of data, and the verification of data quality”.[1]

At this stage, initial data collection takes place and an initial study of the data is performed. It involves four sequential tasks:

  • Collect initial data
  • Describe data
  • Explore data
  • Verify data quality

All of these tasks aim to identify the available data. Developers need to know which data they will work with and start deciding how the main data protection principles will be implemented.
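The "describe data" and "verify data quality" tasks can be illustrated with a minimal sketch. It assumes the data arrive as a list of Python dictionaries; the function name `describe_records` and the sample fields are hypothetical and not part of CRISP-DM. It only shows one way to see which fields exist and how complete they are before deciding which of them are needed.

```python
from collections import Counter


def describe_records(records):
    """Summarise records: row count, fields seen, and missing values per field."""
    # Collect every field name that appears in at least one record.
    fields = sorted({key for row in records for key in row})
    missing = Counter()
    for row in records:
        for field in fields:
            value = row.get(field)
            # Treat absent keys, None and empty strings as missing.
            if value is None or value == "":
                missing[field] += 1
    return {
        "rows": len(records),
        "fields": fields,
        "missing": {field: missing[field] for field in fields},
    }


# Hypothetical sample data, for illustration only.
sample = [
    {"age": 34, "postcode": "1050", "email": "a@example.org"},
    {"age": None, "postcode": "1000", "email": ""},
    {"age": 51, "postcode": "1050"},
]
summary = describe_records(sample)
print(summary)
```

A summary of this kind makes it easier to spot fields that are collected but rarely filled in, which may be candidates for removal.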

Many fundamental issues related to the protection of personal data need to be addressed at this stage. The decisions made here determine how well principles such as data minimization, privacy by design or by default, lawfulness, fairness and transparency are satisfied. The main actions to address at this stage are:

  • Making a decision about the type of data collected
  • Selecting an appropriate legal basis for processing
  • Checking legitimate dataset usage




[1] Shearer, C. (2000) ‘The CRISP-DM model: the new blueprint for data mining’, Journal of Data Warehousing 5(4): 13-23, p.15. Available at: (accessed 15 May 2020).


Checklist: data understanding

☐ The controllers have implemented appropriate technical and organisational measures for ensuring that, by default, only personal data that are necessary for each specific purpose of the processing are processed.

☐ The controllers have introduced policies that minimize the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility. Such measures ensure that by default personal data are not made accessible without the individual’s intervention to an indefinite number of natural persons.

☐ The controllers do not collect unnecessary data. If data is already stored, they have taken actions aimed at deleting unnecessary data elements.

☐ The controllers have limited the resolution of the data to what is minimally necessary for the purposes pursued by the processing.

☐ The controllers have selected the legal basis that most closely reflects the true nature of their relationship with the individual and the purpose of the processing.

☐ The controllers have carefully analysed whether processing involves de-anonymizing anonymized data and creating new personal information that was not contained in the original data set, and have taken adequate measures to address these challenges.

☐ The controllers have made sure that merging datasets does not create ethical or legal issues regarding data subjects’ rights and freedoms.
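
Several checklist items (not collecting unnecessary data, limiting data resolution, and checking that merged data do not single out individuals) can be illustrated with a short sketch. Everything here is an assumption for illustration: the field names, the `NEEDED_FIELDS` set, the ISO date format, and the choice to coarsen dates to years and postcodes to a two-character prefix. The group-size check is a simple k-anonymity-style heuristic, not a legal test of anonymity.

```python
from collections import Counter

# Hypothetical: the only fields the stated purpose actually requires.
NEEDED_FIELDS = {"birth_date", "postcode"}


def minimise(record):
    """Drop fields that are not needed and reduce the resolution of the rest."""
    kept = {k: v for k, v in record.items() if k in NEEDED_FIELDS}
    reduced = {}
    if "birth_date" in kept:
        # Assumes ISO 'YYYY-MM-DD' strings; keep the year only.
        reduced["birth_year"] = int(kept["birth_date"][:4])
    if "postcode" in kept:
        # Keep only a coarse area prefix instead of the full postcode.
        reduced["postcode_area"] = kept["postcode"][:2]
    return reduced


def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(record.get(q) for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical sample data, for illustration only.
raw = [
    {"name": "Ana", "birth_date": "1986-03-12", "postcode": "1050"},
    {"name": "Ben", "birth_date": "1986-11-02", "postcode": "1060"},
    {"name": "Cy", "birth_date": "1990-07-30", "postcode": "2000"},
]
reduced = [minimise(r) for r in raw]
k = smallest_group(reduced, ["birth_year", "postcode_area"])
print(reduced, k)
```

A smallest group size of 1 means at least one record is unique on the chosen attributes, which suggests a re-identification risk that deserves a closer look before datasets are merged or shared.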
