Checking legitimate dataset usage

Datasets can be obtained in different ways. First, the developer might acquire, or gain access to, a database that has already been built by someone else. In this case, the controller should be particularly careful, since acquiring access to a database raises a number of legal issues (see the “Purchasing access to a database” section in the “Actions and Tools” chapter).[1]

Second, the most common alternative is to build a database. In this case, controllers have to ensure that they comply with all the legal requirements that the GDPR imposes on the creation of a database (see the “Creating a database” section in “Main tools and actions”).

Third, developers sometimes choose an alternative path: they combine data licensed from third parties, either with each other or with the controller’s own dataset, to create a large training dataset and a separate one for validation. This can cause problems, for example because the combination of these personal data may reveal additional information about the data subjects. For instance, it could allow the controller to identify data subjects where this was previously not possible. That could mean de-anonymizing anonymized data and creating new personal information that was not contained in the original datasets, which would raise serious ethical and legal issues. Therefore, the risk of re-identification must be tested using techniques such as k-anonymity, l-diversity or t-closeness[2] (see the “Anonymization” section in the “Concepts” chapter).
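As a rough illustration of such a test, the sketch below computes k (the size of the smallest group of records sharing the same quasi-identifier values) and a simple distinct-values form of l-diversity for a merged dataset. The field names (`age_band`, `postcode`, `diagnosis`) and the sample records are hypothetical, and which attributes count as quasi-identifiers is an assumption the controller has to make case by case. It does not cover t-closeness.

```python
from collections import Counter, defaultdict


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values()) if groups else 0


# Hypothetical merged dataset (illustrative values only).
merged = [
    {"age_band": "30-39", "postcode": "481**", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode": "481**", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postcode": "482**", "diagnosis": "asthma"},
]

quasi = ["age_band", "postcode"]
k = k_anonymity(merged, quasi)               # 1: the third record is unique
l = l_diversity(merged, quasi, "diagnosis")  # 1: that group has one diagnosis
```

A value of k = 1 means at least one record is unique on its quasi-identifiers after the merge, which is a signal that the combined dataset may allow re-identification even if each source dataset did not.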

Another common issue is that the datasets being merged were originally processed on different legal bases. If a controller merges the datasets and one of those legal bases later ceases to apply, the controller is in a difficult position. For instance, if one of the databases was built on the basis of consent and some of the data subjects withdraw their consent, the controller will have to delete their data from the merged dataset. This can be very hard to do in practice, for example when merged records can no longer be traced back to their source.
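One way to keep such deletions feasible is to record, for every row, which source it came from and on which legal basis it was collected before merging. The following is a minimal sketch under that assumption; the `Record` type, its field names and the sample sources are hypothetical, and whether records held on another legal basis may be retained after a withdrawal is a legal question the code does not answer.

```python
from dataclasses import dataclass


@dataclass
class Record:
    subject_id: str
    source: str       # which original dataset the row came from
    legal_basis: str  # e.g. "consent", "contract"
    payload: dict


def merge(*datasets):
    """Concatenate datasets while keeping each record's provenance fields."""
    return [record for dataset in datasets for record in dataset]


def withdraw_consent(merged, subject_id):
    """Drop the subject's records that were collected on the basis of consent."""
    return [
        r for r in merged
        if not (r.subject_id == subject_id and r.legal_basis == "consent")
    ]


# Hypothetical source datasets.
survey = [
    Record("s1", "survey", "consent", {"answer": "yes"}),
    Record("s2", "survey", "consent", {"answer": "no"}),
]
crm = [Record("s1", "crm", "contract", {"plan": "basic"})]

merged = merge(survey, crm)
after = withdraw_consent(merged, "s1")
```

Because provenance travels with each record, the withdrawal can be applied to the merged dataset directly instead of requiring the merge to be rebuilt from the sources.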

Furthermore, new information produced in this way may be based on probabilities or conjectures, and may therefore be false or portray persons in a biased way (see the “Fairness data protection principle and biases” section in the part of General Exposition in AI).[3] Controllers should therefore try to avoid such consequences by ensuring that merging datasets does not work against data subjects’ rights and interests.

Finally, if controllers use several datasets that pursue different purposes, they should implement adequate measures to separate the different processing activities. Otherwise, they could easily use data collected for one purpose in a different activity. This might raise issues related to the purpose limitation principle (see the “Purpose limitation” section in the “Principles” chapter).
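One possible technical measure, sketched here only as an illustration, is to tag each record with the purposes for which it was collected and to have every processing activity read data through a filter on that tag. The `purposes` field, the purpose names and the sample records are hypothetical.

```python
def select_for_purpose(records, purpose):
    """Return only the records whose recorded purposes include `purpose`."""
    return [r for r in records if purpose in r["purposes"]]


# Hypothetical records, each tagged with its collection purposes.
records = [
    {"subject_id": "s1", "purposes": {"model_training"}},
    {"subject_id": "s2", "purposes": {"billing"}},
    {"subject_id": "s3", "purposes": {"billing", "model_training"}},
]

# The training activity only ever sees data collected for training.
training_set = select_for_purpose(records, "model_training")
```

A filter like this does not replace organisational separation of processing activities, but it makes it harder for data collected for one purpose to end up in another by accident.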

[1] Yeong, Z.K. (2019) Legal issues in AI deployment. Law Gazette, February. Available at: (accessed 15 May 2020).

[2] Rajendran, K., Jayabalan, M. and Rana, M.E. (2017) ‘A study on k-anonymity, l-diversity, and t-closeness techniques focusing medical data’, International Journal of Computer Science and Network Security 17(12): 172-177.

[3] SHERPA project (2019) Guidelines for the ethical development of AI and big data systems: an ethics by design approach. SHERPA, p.38. Available at: (accessed 15 May 2020).

