Publishing/complementing data in scientific papers

Aliuska Duardo (UPV/EHU)

Acknowledgements: The author gratefully acknowledges the useful advice and feedback on drafts from Igansi Labastida and Maria Grazia Procedda.

It is becoming increasingly common for scientific journals to require that raw data be deposited in a public repository (subject-based or institutional) as a prerequisite for publication[1]. This practice is aimed at ensuring the reproducibility, transparency and quality of research, while maximizing the benefits of sharing the results of scientific research. Such a requirement can, of course, only be imposed when data from quantitative research is relevant to the text of the paper. However, authors must be very careful when publishing or sharing data, so as not to affect the rights of the subjects involved in a study.

Legal basis for processing personal data in scientific research

Although the General Data Protection Regulation (GDPR) is not designed exclusively for research, any scientific research involving personal data must comply with its rules. In this sense, Article 89 recognizes the presence of a relevant “public interest” in research, which allows personal data to be processed provided that appropriate safeguards are adopted. The legitimacy of the data processing may therefore derive from the law, from a contractual obligation, from the data subject’s consent, from the performance of a task in the public interest, or from legitimate interest[2]. However, the GDPR imposes a very rigorous methodology involving not only European regulation but also national legislation.

Data sharing and informed consent

In accordance with the above, the consent of the data subject is not the only legal basis for the processing of personal data in the context of scientific research. However, it is most advisable to obtain the corresponding consent whenever possible. In any case, it is necessary to distinguish between consent to participate in research and consent to share or publish the data collected.

Firstly, when working with personal data in research, it is essential to carry out a proper informed consent procedure. This ensures that the data subjects are informed and give their consent to the way their personal information will be stored or transmitted. However, having consent does not mean that precautions should not be taken when sharing personal information.

Whenever possible, researchers should have informed consent both for participation in the research (always on a voluntary basis) and for the possible uses of the information collected. Thus, the consent form should take into account any future use of the data, such as exchange, preservation and long-term use. In this regard, researchers should:

– inform participants about how the research data will be stored, preserved and used in the long term

– inform the participants about the technical measures that will be taken to ensure the confidentiality of the data

– inform participants about how the results will be published

– obtain informed consent for the exchange/sharing of data

Without consent for data exchange, opportunities to share or even publish research data may be compromised. Data from participants who do not accept publication of potentially identifiable information should be removed from the data set.
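As an illustration of the last point, the following minimal Python sketch filters a data set down to participants who explicitly agreed to data sharing. The field name `consent_to_share` and the sample records are hypothetical; a real study would use whatever its consent form actually records.

```python
def keep_consenting(records, consent_field="consent_to_share"):
    """Return only the records whose participants explicitly agreed to sharing.

    A missing or non-True value is treated as "no consent" (an assumption
    chosen here so that the default is the most protective one).
    """
    return [r for r in records if r.get(consent_field) is True]


# Hypothetical sample data, for illustration only.
participants = [
    {"id": "P01", "age": 34, "consent_to_share": True},
    {"id": "P02", "age": 51, "consent_to_share": False},
    {"id": "P03", "age": 29},  # no recorded answer: treated as no consent
]

shareable = keep_consenting(participants)
print([r["id"] for r in shareable])
```

Treating an unrecorded answer as a refusal is a design choice of this sketch, not a legal rule; the applicable lawful basis still has to be assessed case by case.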

Preparing the data-set for publication

How can data be prepared for publication in a peer-reviewed journal while avoiding privacy issues?

When data is relevant to understanding the scientific methodology used in a research project, peer-reviewed journals usually require authors to provide a data set in a suitable format that allows statistical analysis to be performed, namely user-friendly data. Going beyond journals’ requirements, it is advisable from an ethical point of view that researchers respect the so-called FAIR data principles: research data should be findable, accessible, interoperable and re-usable. In this sense, researchers must provide supplementary materials and check for erroneous data, duplicates and missing information; they should provide sufficient information on each variable to allow other colleagues to replicate the study. Nonetheless, the publication of personal information that has been obtained, or inferred, from a research study will raise privacy issues. For this reason, the data set provided must contain the minimum level of detail necessary to reproduce all the numbers reported in the paper, while at the same time measures must be taken not to compromise the rights of the data subjects.
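The check for duplicates and missing information mentioned above can be partly automated before deposit. The sketch below is one possible approach using only the Python standard library; the column names `participant_code` and `score` are hypothetical.

```python
from collections import Counter


def quality_report(rows, key="participant_code"):
    """Summarize duplicated keys and empty cells in a list of row dicts."""
    counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in counts.items() if n > 1)

    missing = {}
    for row in rows:
        for field, value in row.items():
            # Treat None and the empty string as missing values.
            if value is None or value == "":
                missing[field] = missing.get(field, 0) + 1

    return {"duplicates": duplicates, "missing": missing}


# Hypothetical sample data, for illustration only.
rows = [
    {"participant_code": "A1", "score": 12},
    {"participant_code": "A2", "score": None},
    {"participant_code": "A1", "score": 12},
]

report = quality_report(rows)
print(report)
```

A report like this only flags candidates for review; deciding whether a repeated code is a true duplicate or a legitimate repeated measurement remains a judgment for the researcher.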

In this regard, Recital 156 of the GDPR points out a very valuable fact when publishing research data:

“The further processing of personal data for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes is to be carried out when the controller has assessed the feasibility to fulfil those purposes by processing data which do not permit or no longer permit the identification of data subjects, provided that appropriate safeguards exist (such as, for instance, pseudonymization of the data)”.

Data that allow a person to be identified can be direct or indirect identifiers. The key is the risk that people may be identified and their privacy compromised, and with it other rights and freedoms, so direct identifiers must be avoided in the publication of raw data.

On the other hand, a data set with several indirect identifiers could also lead to the identification of the subject. That is why the information should be processed with a high level of security by using some or all of the following techniques: pseudonymizing or aggregating data; separating data content according to security needs; removing personal information, such as names and addresses; encrypting data containing personal information before it is stored or transmitted.
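Two of these techniques, removing direct identifiers and pseudonymizing, can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: it replaces a record's identifier with a keyed hash (HMAC-SHA256), and the field names, the sample record and the placeholder key are all hypothetical. The secret key would have to be stored separately from the shared data, and pseudonymized data remain personal data under the GDPR.

```python
import hashlib
import hmac

# Hypothetical list of direct identifiers to strip from every record.
DIRECT_IDENTIFIERS = {"name", "address", "hospital_number"}


def pseudonymize(record, secret_key, id_field="hospital_number"):
    """Drop direct identifiers and add a keyed-hash pseudonym instead."""
    digest = hmac.new(
        secret_key,
        str(record[id_field]).encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # A truncated digest keeps the pseudonym short; this is a convenience choice.
    cleaned["pseudonym"] = digest[:16]
    return cleaned


# Hypothetical sample record and placeholder key, for illustration only.
record = {
    "name": "Jane Doe",
    "address": "1 Example Street",
    "hospital_number": "H123",
    "diagnosis": "asthma",
}
shared = pseudonymize(record, secret_key=b"replace-with-a-real-secret")
print(sorted(shared))
```

Using a keyed hash rather than a plain hash means that someone without the key cannot rebuild the pseudonyms simply by hashing a list of known hospital numbers.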

Where the consent of the data subjects cannot be obtained, or there is a risk of re-identification despite anonymization measures, a careful assessment must be made on a case-by-case basis, taking into account the public interest and the scientific imperative of publication. In such cases, it is recommended that authors consult the DPO of the research institution and the relevant Ethics Committees on the legal and ethical implications of publishing their raw data in a freely accessible repository before submitting it for publication. Where the relevant committee does not exist, consultation with an appropriate national advisory body is recommended.

Recommendations

  • Remove direct identifiers from data sets, such as names, initials, or hospital numbers.
  • Try to reduce the level of detail of a variable through aggregation (e.g. use area rather than village), by generalizing the meaning of a detailed text variable (e.g. occupational expertise), by restricting the upper and lower ranges of a variable to hide outliers (e.g. income, age), or by combining variables.
  • If you have not been able to obtain consent and/or there is a risk of identification of the subjects involved in the research, consult the corresponding DPO and Ethics committee or equivalent national advisory body about the consequences of publishing your data.
  • Obtaining a general consent without specifying the exact purpose of the data processing is not acceptable.
  • Alterations introduced in a data set in order to prevent identification of subjects must not distort scientific meaning.
  • Be sure you have a valid lawful basis to process/share personal data (Art. 6 GDPR).
  • Pay attention to your national laws and national authorities’ policies. You will probably have to meet additional conditions and safeguards set out in national law.
  • Check that your research data is FAIR: findable, accessible, interoperable and re-usable.
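
The advice above on reducing detail and hiding outliers can be illustrated with a short Python sketch. It groups ages into bands with an open-ended top band and then counts how many records share each combination of indirect identifiers. The band width, the top threshold and the sample values are assumptions made for the example; the helper names `age_band` and `smallest_group` are hypothetical.

```python
from collections import Counter


def age_band(age, width=10, top=80):
    """Generalize an exact age into a band; ages at or above `top` share one band."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same indirect identifiers."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


# Hypothetical sample data, for illustration only.
raw = [
    {"age": 34, "area": "North"},
    {"age": 37, "area": "North"},
    {"age": 92, "area": "South"},
    {"age": 85, "area": "South"},
]
generalized = [{"age_band": age_band(r["age"]), "area": r["area"]} for r in raw]

print(smallest_group(raw, ["age", "area"]))
print(smallest_group(generalized, ["age_band", "area"]))
```

A smallest group of size one means at least one person is unique on those variables and may be easier to re-identify. What group size is acceptable depends on the data and context, and any generalization should be checked against the requirement that it must not distort scientific meaning.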




[1] In other cases, journals ask for the link where the data is available.

[2] Determining which basis is appropriate will depend on the research purposes and the relationship with the data subject.
