Documentation and Metadata

To avoid errors, mix-ups and long search times later on, it is worth investing some time at the very start of a project in a systematically organized file and folder structure. This is especially important if you are collaborating with other research groups. Everyone involved in a project should agree on a scheme and stick to it. It is advisable to record the organizational and naming scheme in a document and to deposit that document alongside the published data.

  • Group related files in folders (e.g. for measurements, methods or project phases)
  • Use clear, unique folder names
  • Use a hierarchical folder structure (N.B.: too many nested levels result in long and complicated file paths)
  • Keep active and completed work in separate folders and delete any temporary files that are no longer required.

Make sure you use file names that are unique and are also meaningful for people who are not involved in the project. General elements that can form part of a name:

  • Creation date (YYYY-MM-DD)
  • Project reference/name
  • Description of the content
  • Name of creator (initials or whole name)
  • Name of research team/department
  • Version number

To avoid operating system constraints, use the following character/naming conventions:

  • Short names
  • No special characters (: & * % $ £ ] { ! @)
  • Use underscores _ rather than blank spaces or dots
  • Include a file suffix wherever possible (.txt, .xls, etc.)
  • Do not rely on uppercase/lowercase distinctions
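
The name elements and character conventions above can be combined in a small helper. The following is a minimal sketch, not a prescribed scheme: the function name `build_file_name`, the order of the elements and the choice to lowercase everything are assumptions made for illustration.

```python
import re
from datetime import date


def build_file_name(created, project, description, creator, version, suffix):
    """Assemble a name such as 2024-03-15_soil_study_ph_measurements_ab_v02.csv."""
    parts = [project, description, creator]
    # Replace every run of characters other than letters, digits and hyphens
    # (blanks, dots, special characters) with a single underscore.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "_", p).strip("_").lower() for p in parts]
    # ISO date first so that files sort chronologically; zero-padded version last.
    stem = "_".join([created.isoformat(), *cleaned, f"v{version:02d}"])
    return f"{stem}.{suffix.lstrip('.').lower()}"


name = build_file_name(date(2024, 3, 15), "Soil Study", "pH measurements", "AB", 2, ".csv")
print(name)  # 2024-03-15_soil_study_ph_measurements_ab_v02.csv
```

Lowercasing every element is one way of not relying on uppercase/lowercase distinctions. A project may equally agree on a different order or on fewer elements, as long as everyone uses the same scheme.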

The careful choice of a file format can ensure that files can still be used after many years and consequently greatly facilitate reuse of the research data. When choosing a suitable format, various factors should be taken into consideration:

  • Future-proofing: how many software products can read the data format?
  • Open access to documentation
  • No legal constraints (patents)
  • No technical constraints (encryption, DRM)
  • Established in community

The file formats for research data can vary widely depending on the discipline in question. The following file formats are recommended:

  • Images: TIFF, TIF
  • Documents: TXT, ASC, PDF/A
  • Tabular data: CSV
  • Audio files: WAV
  • Databases: SQL, XML
  • Structured data: XML, JSON, YAML

Further information about which file formats are recommended for long-term preservation can be found here.

It is essential to use version control, especially for datasets that change over the course of a project. Individual datasets should be named sequentially and the names should include the save date (YYYY-MM-DD) along with the version number. The final version should be indicated as such. Maintaining a version table in which all changes and new names are recorded can help keep track of the datasets.

Especially when working with a number of different people, it may be advisable to regularly save a milestone version of a file, which must not be changed or deleted afterwards.

To summarize, the recommendations are:

  • Use sequential numbering
  • Include the date and version number in the name
  • Use a version control table
  • Specify who is responsible for providing the final files
  • Use version control software for large data volumes
  • Save milestone versions
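
A version control table can be as simple as a CSV file that gains one row per saved version. The sketch below assumes a hypothetical column layout (`file_name`, `version`, `saved_on`, `changed_by`, `change_note`) and a hypothetical helper named `record_version`. Adapt both to whatever your project has agreed on.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical column layout for the version table.
FIELDS = ["file_name", "version", "saved_on", "changed_by", "change_note"]


def record_version(table_path, file_name, version, changed_by, change_note, saved_on=None):
    """Append one row to the version table, writing the header on first use."""
    table = Path(table_path)
    is_new = not table.exists()
    row = {
        "file_name": file_name,
        "version": version,
        # Save date in YYYY-MM-DD form; defaults to today.
        "saved_on": (saved_on or date.today()).isoformat(),
        "changed_by": changed_by,
        "change_note": change_note,
    }
    with table.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

A call such as `record_version("versions.csv", "2024-03-15_survey_v02.csv", "v02", "AB", "removed duplicate rows")` adds one line to the table. The file names here are purely illustrative.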

Further information and best practices

We recommend you back up your data using the university's IT system as it collects the data campus-wide and redundantly backs it up to two state-of-the-art tape libraries.

Click here for more information: Campus Backup/Archive

You should always adopt the 3-2-1 backup strategy:

  • 3 copies of the data (1 original + 2 backups)
  • Stored on 2 different types of media (external hard drives, USB sticks, SD cards, CDs, DVDs, Cloud)
  • 1 copy off-site

Backups should be automated and run at regular intervals. Check that each backup was successful and that the data can be retrieved again if necessary.
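
One way to check that a backup was successful is to compare checksums of the original files with those of the copies. The following is a minimal sketch under the assumption that the backup is a plain directory copy that can be read from the same machine. The helper names `sha256_of` and `find_backup_problems` are hypothetical, and the sketch does not apply to tape or other managed backup services, which provide their own verification tools.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_backup_problems(source_dir, backup_dir):
    """Return relative paths of files that are missing from or differ in the backup."""
    source, backup = Path(source_dir), Path(backup_dir)
    problems = []
    for original in sorted(source.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(source)
        copy = backup / relative
        # A file is a problem if the copy is absent or its content differs.
        if not copy.is_file() or sha256_of(copy) != sha256_of(original):
            problems.append(relative.as_posix())
    return problems
```

An empty list means every file in the source directory has an identical copy in the backup. Restoring a few files from the backup from time to time remains the most direct test that the data can actually be retrieved.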

Comprehensive documentation is essential to enable correct interpretation and reuse of the data at a later date. Among other things, the documentation should state when and where the data was collected; the methods, tools, software and statistical models used; the parameters chosen; how missing values are handled; and the nomenclature and acronyms used.

Metadata is information about data which is created in a structured and machine-readable form. The metadata helps other researchers find and reuse data. Depending on the particular discipline, there are various commonly used metadata standards and tools that can be used to describe datasets in different domains.

The repository of the University of Bern (BORIS) uses the Dublin Core metadata element set. This metadata is automatically generated by filling in a form when depositing a dataset in the repository.
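
To illustrate what structured, machine-readable metadata looks like, the sketch below writes a few Dublin Core elements as XML using Python's standard library. It is an illustration only: the wrapper element `metadata`, the helper name `dublin_core_record` and all example values are assumptions, and the output is not claimed to match the format BORIS generates from its deposit form.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Serialize a mapping of Dublin Core element names to lists of values."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, values in fields.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Example values are invented for illustration.
record = dublin_core_record({
    "title": ["Example dataset"],
    "creator": ["Doe, Jane", "Roe, Richard"],
    "date": ["2024-03-15"],
    "format": ["text/csv"],
    "rights": ["CC BY 4.0"],
})
print(record)
```

Each element may be repeated, which is why the values are given as lists; here the dataset has two creators.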