Data Documentation

Data Management Plan

A Data Management Plan (DMP or DMSP) details how data will be collected, processed, analyzed, described, preserved, and shared during the course of a research project.

NIH, National Library of Medicine

The Data Management Plan is a formal document discussing how data will be handled during and after the project. It can be used as a key tool for communicating expectations with data stakeholders. It is also an evolving document that should be revisited and revised as necessary.

The Fundamental README

The README is the key file that describes a dataset and its metadata. A template can be found below.

Methods Section

The methods section is arguably one of the most important parts of a README. To the best of your ability, document your methods to improve the reproducibility of your results. With human data, however, privacy takes precedence over reproducibility.

Software and Data

Do any of these suggestions or better practices apply to your software?

YES.

Software that generates or processes data should be seen as an extension of the data itself. In fact, the FAIR principles have been extended to FAIR for Software.

Concept | Definition | Example
Findable | Software, and its associated metadata, is easy for both humans and machines to find. | Journal of Open Source Software (JOSS)
Accessible | Software, and its metadata, is retrievable via standardised protocols. | Hosting software on GitHub
Interoperable | Software interoperates with other software by exchanging data and/or metadata, and/or through interaction via application programming interfaces (APIs), described through standards. | Use standard file formats (e.g., csv)
Reusable | Software is both usable (can be executed) and reusable (can be understood, modified, built upon, or incorporated into other software). | Use of open-source licenses
FAIR concepts for Software

Implicit Metadata

Documentation needs to stand the test of time. It must be resilient to:

The four facets that can cause documentation to go out of date. [1] Icon by Iconsea, Freepik; [2] Icon by AGE, Freepik; [3] Icon by Haca Studio, Freepik.

Ontologies

Metadata essentially constitutes all of the data that describes your dataset. While setting appropriate metadata names yourself (spelled out and including units) is important, this can be taken a step further by using domain-appropriate ontologies.
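As a rough illustration of spelled-out metadata names that include units, the sketch below keeps a small data dictionary alongside a dataset and flags any column that has no description. Every field name, unit, and description here is hypothetical and chosen only for the example; a real project would substitute its own variables.

```python
# Hypothetical data dictionary: all names, units, and descriptions are illustrative.
DATA_DICTIONARY = {
    "air_temperature_celsius": {
        "unit": "degree Celsius",
        "description": "Air temperature measured at the sampling site",
    },
    "wind_speed_meters_per_second": {
        "unit": "meter per second",
        "description": "Mean wind speed over the sampling interval",
    },
    "site_identifier": {
        "unit": None,  # unitless fields state that explicitly
        "description": "Short code for the sampling site",
    },
}


def undocumented_columns(columns, dictionary):
    """Return the columns that have no entry or an empty description."""
    missing = []
    for column in columns:
        entry = dictionary.get(column)
        if entry is None or not entry.get("description"):
            missing.append(column)
    return missing


# "temp2" stands in for a cryptic column name that was never documented.
columns_in_file = ["site_identifier", "air_temperature_celsius", "temp2"]
print(undocumented_columns(columns_in_file, DATA_DICTIONARY))
```

A check like this could run whenever the dataset is updated, so that new columns do not slip in without a name a reader can interpret.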

An ontology is a formal dictionary of terms for a given industry or field that shows how the properties are related. Terms are stored as object-relationship pairs.

The key power of using a common ontology to describe your metadata is that it gives you human, machine, and dataset interoperability. When the same quantities are described by exactly the same predefined nomenclature across datasets, they can be seamlessly integrated and compared.
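To make the integration point concrete, here is a minimal sketch in which two datasets use different local column names for the same quantities, and each is mapped onto a shared set of terms before being combined. The `EX:` identifiers, column names, and values are placeholders invented for this example, not terms from a real ontology, and the sketch assumes both datasets already record values in the same units.

```python
# Hypothetical mappings from each dataset's local column names to shared terms.
# The "EX:..." identifiers are placeholders, not terms from a real ontology.
TERM_MAP_A = {"temp_c": "EX:air_temperature", "site": "EX:site_identifier"}
TERM_MAP_B = {"T_air": "EX:air_temperature", "station": "EX:site_identifier"}


def harmonize(rows, term_map):
    """Rename each row's keys to shared terms; columns without a mapping are dropped."""
    return [
        {term_map[key]: value for key, value in row.items() if key in term_map}
        for row in rows
    ]


dataset_a = [{"temp_c": 21.5, "site": "A1"}]
dataset_b = [{"T_air": 19.0, "station": "B7", "notes": "cloudy"}]

# Once both datasets use the same terms, their rows can be concatenated directly.
combined = harmonize(dataset_a, TERM_MAP_A) + harmonize(dataset_b, TERM_MAP_B)
for row in combined:
    print(row)
```

Dropping unmapped columns is a deliberate simplification here; in practice it may be preferable to report them so that nothing is lost silently.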

There are many fantastic resources available to get you started with understanding ontologies and how they might integrate with your system. A few we recommend:


Resources and References