3. Data organization and storage

Modified: September 13, 2024

Course Overview

⏰ Time Estimation: X minutes
πŸ’¬ Learning Objectives:

  1. Organize your data and external resources efficiently
  2. Apply naming conventions for files and folders
  3. Define rules for naming results and figures accurately

So far, we have covered how to adhere to FAIR and Open Science standards, which primarily focus on sharing data after a project is completed. However, effective data management is just as essential while you are actively working on a project. Organizing data folders, raw and processed data, analysis scripts and pipelines, and results ensures long-term project success. Without a clear structure, accessing and understanding the data later becomes challenging, even more so for collaborators, and can quickly lead to chaos down the line.

Exercise
  • Have you ever had trouble finding data, results, figures, or specific scripts?
  • Do you maintain the same structure across different projects?
  • Have you ever discussed this topic with collaborators?

File structure and naming conventions

Applying a consistent file structure and naming conventions to your files will help you manage your data efficiently. Consider the following practices:

  • Folder structure: establish a logical and intuitive folder structure that mirrors the organization of research projects and experimental data. Employ descriptive folder names for easy identification and access to specific data files.
    • Subfolders: enhance the organization using subfolders to further categorize data based on their contents, such as workflows, scripts, results, reports, etc.
  • File naming conventions: implement a standardized file naming convention to maintain consistency and clarity. Use descriptive and informative names (e.g., specify data type: plots, results tables, etc.)

In this lesson, we will see a practical example of how you could organize your own files and folders.

Folder organization

Here we suggest the use of three main folders (a combined layout sketch follows the list):

  1. Shared project data folders:
  • This shared directory is designated for storing unprocessed sequencing data files, with each subfolder representing a separate project.
  • Each project folder contains raw data, corresponding metadata, and optionally pre-processed data like quality control reports and processed data.
    • Include the pipeline or workflow used for data processing, along with a metadata file.
  • Data in these folders should be locked and set to read-only to prevent unwanted modifications.
  2. Individual project folders:
  • This directory typically belongs to the researcher conducting bioinformatics analyses and encompasses all essential files for a specific research project (data, scripts, software, workflows, results, etc.).
  • A project may utilize data from various assays or results obtained from other projects. It’s important to avoid duplicating datasets; instead, link them from the original source to maintain data integrity and avoid redundancy.
  3. Resources and databases folders:
  • This (commonly) shared directory contains common repositories or curated databases that facilitate research (genomics, clinical data, imaging data, and more!). For instance, in genomics, it includes genome references (fasta files), annotations (gtf files) for different species, and indexes for various alignment algorithms.
  • Each folder corresponds to a unique reference or database version, allowing for multiple references from the same organism or different species.
    • Ensure each contains the version of the reference and a metadata file.
    • More subfolders can be created for different data formats.
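
Putting the three folders together, the overall layout could look like the sketch below (all names are illustrative):

shared_data/                  # 1. Shared project data (read-only)
├── <ASSAY_ID>/
│  ├── raw/
│  ├── processed/             # optional QC reports and processed data
│  ├── pipeline/              # workflow used for processing
│  └── metadata.yml
projects/                     # 2. Individual project folders
├── <PROJECT_ID>/
│  ├── data/                  # softlinks to shared_data, no copies
│  ├── scripts/
│  └── results/
resources/                    # 3. Resources and databases (shared)
└── <reference>/
   └── <version>/             # fasta, gtf, indexes + metadata file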
Verify the integrity of downloaded files!

Ensure that the person downloading the files uses checksums produced by cryptographic hash functions (MD5, SHA-1, SHA-256) to verify file integrity and ascertain that files are neither corrupted nor tampered with.

  • MD5 Checksum: Files with names ending in “.md5” contain MD5 checksums. For instance, “filename.txt.md5” holds the MD5 checksum of “filename.txt”.
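
For example, assuming both files sit in the same directory, a single command verifies the file against its companion checksum:

md5sum --check filename.txt.md5
# prints "filename.txt: OK" if the checksum matches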

A database is a structured repository for storing, managing, and retrieving information, forming the cornerstone of efficient data organization.

Create shortcuts to public datasets and assays!

The use of symbolic links, also referred to as softlinks, is a key practice in large labs where data might be used for different purposes and by multiple people.

  • They act as pointers, containing the path to the location of the target files/directories.
  • They avoid duplication and they are flexible and lightweight (do not occupy much disk space).
  • They simplify directory structures.
  • Extra use case: create symbolic links to executable files and libraries!
Exercise: create a softlink

Open your terminal and create a softlink using the following command. The first path is the target (directory or file) and the second one is where the symbolic link will be created.

ln -s path/to/dataset/<ASSAY_ID> /path/to/user/<PROJECT_ID>/data/

Now, access the target file/directory through the symbolic link:

ls /path/to/user/<PROJECT_ID>/data/

Follow this example if you need extra guidance (adjust the paths to your system!):

  1. Create the target/original file:
echo "This is the content of the original file." > /home/users/Documents/original_file.txt
  2. Create the symbolic link:
ln -s /home/users/Documents/original_file.txt /home/users/Desktop/original_file.txt
  3. Verify the symbolic link:
ls -l /home/users/Desktop/original_file.txt
  4. Access the file through the symbolic link:
cat /home/users/Desktop/original_file.txt

The last command will display the contents of the original file.

Template engine

Setting up folder structures manually for each new project can be time-consuming. Thankfully, tools like Cookiecutter offer a solution by allowing users to create project templates easily. These templates can ensure consistency across projects and save time. Additionally, using cruft alongside Cookiecutter can assist in maintaining older templates when updates are made (by synchronizing them with the latest version).
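
As a minimal sketch, creating a project from an existing template takes two commands (the template below is just a popular public example; swap in your own):

pip install cookiecutter
# Generate a new project folder structure from a public template (answers a short prompt)
cookiecutter https://github.com/drivendata/cookiecutter-data-science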

Cookiecutter templates

Quick tutorial on cookiecutter

Sandbox Tutorial

Learn how to create your own template here.

3. Resources and databases folder

Health databases are utilized for storing, organizing, and providing access to diverse health-related data, including genomic data, clinical records, imaging data, and more. These resources are regularly updated and released under different versions from various sources. To ensure data reproducibility, it’s crucial to manage and specify the versions and sources of data within these databases.

For example, preprocessing NGS data involves various genomic resources for tasks like aligning and annotating fastq files. Essential resources include reference genomes in FASTA format (e.g., human and mouse), indexed FASTA files for alignment tools like STAR and Bowtie, and GTF or GFF files for quantifying reads over genomic regions. The latest major human reference assembly is GRCh38; however, many past studies are based on GRCh37.

How can you keep track of your resources? Name the folder using the version, or use a reference genome manager such as refgenie.

Refgenie

Refgenie manages the storage, access, and transfer of reference genome resources. It provides command-line and Python interfaces to download pre-built reference genome “assets”, such as the indexes used by bioinformatics tools, and it can also build assets for custom genome assemblies. Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another. Check this tutorial to get started.
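
As a minimal sketch of the basic workflow (the configuration path is illustrative; see the refgenie documentation for details):

pip install refgenie
# Point refgenie at a configuration file and initialize it
export REFGENIE='genome_config.yaml'
refgenie init -c $REFGENIE
# Download a pre-built asset, e.g., the fasta for hg38
refgenie pull hg38/fasta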

Manual Download

Best practices for downloading data from the source while ensuring the preservation of information about the version and other metadata include:

  • Organizing data structure: Create a data structure that allows storing all versions in the same parent directory, and ensure that all lab members follow these practices.
  • Documentation and metadata preservation: Before downloading, carefully review the documentation provided by the database. Download files containing the data version and any associated metadata.
  • README.md: Record the version of the data in the README.md file.
  • Checksums: Check for and use checksums (MD5, SHA1, SHA256, …) provided by the database to verify the integrity of the downloaded data, ensuring that it hasn’t been corrupted during transfer. Do the exercise below to get more familiar with these files.
  • Verify file size: Check the file size provided by the source. This is not as secure as checksum verification, but discrepancies can indicate corruption.
  • Automated Processes: whenever possible, automate the download process to reduce the likelihood of errors and ensure consistency (e.g. use bash script or pipeline).

We recommend the use of md5sum to verify data integrity, especially if you are downloading large datasets, as it is commonly used. In this example, we use data from the HLA FTP Directory.

  1. Install md5sum (from the coreutils package)
#!/bin/bash
# On Ubuntu/Debian
sudo apt-get install coreutils
# On macOS (Homebrew installs the GNU tools with a "g" prefix, e.g., gmd5sum,
# unless you add the gnubin directory to your PATH)
brew install coreutils
  2. Create a bash script to download the target files (named “dw_resources.sh” in the data structure).
#!/bin/bash
# Important: go through the README before downloading! Check if a checksums file is included.

# 1. Create or change to the resources directory.

# 2. Check for a checksums file (e.g., md5checksum.txt), download it, and edit it so that it
#    only contains the checksums of the target files. The file will look like this:
#    7348fbef5ab204f3aca67e91f6c59ed2  hla_prot.fasta
# Finally, save its name:
md5file="md5checksum.txt"

# Define the URL of the file to download
url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_prot.fasta"

# (Optional 1) Save the original file name: filename=$(basename "$url")
# (Optional 2) Define a different filename to save the downloaded file (`wget -O "$out_filename"`)
# out_filename="imgt_hla_prot.fasta"

# Download the file and verify its checksum
wget "$url" && \
md5sum --status --check "$md5file"

We recommend using the argument `--status` **only** when you incorporate this sanity check into a pipeline, so that it only prints errors (there is no output on success).
  3. Folder structure
genomic_resources/
├── species1/
│  └── version/
│     ├── files.txt
│     └── indexes/
└── dw_resources.sh
  4. Create an md5sum file and share it with collaborators before sharing the data. This allows others to check the integrity of the files.
md5sum <data> > md5checksum.txt
Exercise

Download a file using md5sums. Choose a file from your favorite dataset or select one from the HLA database (for quick testing, consider using a text file such as Nomenclature_2009.txt).

Naming conventions

Consistent naming conventions play a crucial role in scientific research by enhancing organization and data retrieval. By adopting standardized naming conventions, researchers ensure that files, experiments, and datasets are labeled logically, making similar data easy to locate and compare. In fields like genomics or health data science, uniform naming of the files associated with a particular experiment or sample allows relevant data to be identified and compared quickly, streamlining the research process and contributing to the reproducibility of findings. Overall, it promotes efficiency, collaboration, and the integrity of scientific work.

General tips for file and folder naming

Remember to keep the folder structure simple.

  • Keep it short and meaningful (use understandable abbreviations only, e.g., Cor for correlations or LFC for Log Fold Change)
  • Consider including one of these elements: project name, category, descriptor, content, author…
    • Author-based: use initials
  • Use alphanumeric characters: letters (A-Z) and numbers (0-9)
  • Avoid special characters: ~ ! @ # $ % ^ & * ( ) ` " | ' : =
  • Date-based format: use YYYYMMDD format (year/month/day format helps with sorting and listing files in chronological order)
  • Use underscores and hyphens as delimiters and avoid spaces.
    • Not all search tools may work well with spaces (messy to indicate paths)
    • If the length is a concern, use capital letters to delimit words instead (camelCase).
  • Sequential numbering: Use a two-digit format for single-digit numbers (0-9) to ensure correct numerical sort order (for example, 01, not 1, if your sequence only goes up to 99)
  • Version control: Indicate the version (β€œV”) or revision (β€œR”) as the last element, using the two-digit format (e.g., v01, v02)
  • Write down your naming convention pattern and document it in the README file (see the example below)
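
As an illustration (the pattern itself is up to you; record it in your README):

Pattern:  <project>_<dataType>_<YYYYMMDD>_<descriptor>_v<NN>.<ext>
Examples: rnaseq_counts_20240913_rawMatrix_v01.tsv
          rnaseq_plot_20240913_pcaSamples_v02.png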
Create your own naming conventions

Consider the most common types of files and folders you will be working with, such as visualizations, results tables, and processed files. Develop a logical and clear naming system for these files based on the tips provided above. Aim for concise and straightforward names to avoid complexity.

Which naming conventions should not be used and why?
A. data_processing_carlo's.py
B. raw_sequences_V#20241111.fasta
C. differential_expression_results_clara.csv
D. Grant proposal final.doc
E. sequence_alignment$v1.py
F. data/gene_annotations_20201107.gff
G. alpha~1.0/beta~2.0/reg_2024-05-98.tsv
H. alpha=1.0/beta=2.0/reg_2024-05-98.tsv
I. run_pipeline:20241203.sh

A, B, D, E, H, I: they contain spaces or special characters (the apostrophe in A, # in B, spaces in D, $ in E, = in H, : in I) that can break paths and confuse shells and downstream tools.

Which file name is more readable?
1a. forecast2000122420240724.tsv
1b. forecast_2000-12-24_2024-07-24.tsv
1c. forecast_2000_12_24_2024_07_24.tsv
2a. 01_data_preprocessing.R
2b. 1_data_preProcessing.R
2c. 01_d4t4_pr3processing.R
3a. B1_2024-12-12_cond~pH7_temp~37C.fastq
3b. B1.20241212.pH7.37C.fastq
3c. b1_2024-12-12_c0nd~pH7_t3mp~37C.fastq

1b: easiest for both humans and machines; `_` separates the two dates, while `-` separates the fields within each date (year-month-day). This matters, for example, when using wildcards in Snakemake to build pipelines.

2a: starts with 01 for correct sorting, and is consistent in its use of lower/upper case and separators (`_` separates metadata)

3a: `~` indicates that the variable temperature is set to 37 °C. A hyphen could be mistaken for a negative temperature and is better reserved for separating values within dates.
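
A quick way to see why these separators pay off: since `_` splits fields and `-` stays inside each date, a shell wildcard can select files by a single field. A small illustration (hypothetical filenames):

# All forecasts starting on 2000-12-24, regardless of end date
ls forecast_2000-12-24_*.tsv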

Regular expressions are an incredibly powerful tool for string manipulation. We recommend checking out RegexOne to learn how to create smart file names that will help you parse them more efficiently. To learn more about naming conventions for NGS analysis and see additional examples, click here.
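
You can also test a pattern directly in the shell before answering the quiz below; a minimal sketch, assuming the files live under the current directory:

# Keep only the paths that match the regular expression
find . -type f | grep -E 'rna_seq.*\.tsv$'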

Which of the following regexps match the following filenames?

(the filenames in bold are the ones that SHOULD be matched):

  • rna_seq/2021/03/results/Sample_A123_gene_expression.tsv
  • proteomics/2020/11/Sample_B234_protein_abundance.tsv
  • rna_seq/2021/03/results/Sample_C345_normalized_counts.tsv
  • rna_seq/2021/03/results/Sample_D456_quality_report.log
  • metabolomics/2019/05/Sample_E567_metabolite_levels.tsv
  • rna_seq/2019/12/Sample_F678_raw_reads.fastq
  • rna_seq/2021/03/results/Sample_G789_transcript_counts.tsv
  • proteomics/2021/02/Sample_H890_protein_quantification.TSV

Regular Expressions:

rna_seq.*\.tsv
.*\.csv
.*/2021/03/.*\.tsv
.*Sample_.*_gene_expression.tsv
rna_seq/2021/03/results/Sample_.*_.*\.tsv

rna_seq.*\.tsv and rna_seq/2021/03/results/Sample_.*_.*\.tsv match exactly the same files

Wrap up

In this lesson, we have learned some practical tips and examples about how to organize your data and bring some order to chaos! It is now your responsibility to use and implement them in a reasonable way. Complete the practical tutorial on using cookiecutter as a template engine to be able to create your own templates and reuse them as much as you need.
