3. Data organization and storage

Modified: September 13, 2024

Course Overview

⏰ Time Estimation: X minutes
πŸ’¬ Learning Objectives:

  1. Organize your data and external resources efficiently
  2. Apply naming conventions for files and folders
  3. Define rules for naming results and figures accurately

So far, we have covered how to adhere to FAIR and Open Science standards, which primarily focus on sharing data after a project is completed. However, effective data management is just as essential while you are actively working on a project. Organizing data folders, raw and processed data, analysis scripts and pipelines, and results ensures long-term project success. Without a clear structure, accessing and understanding the data later becomes challenging, even more so for collaborators, and can quickly lead to chaos down the line.

Exercise
  • Have you ever had trouble finding data, results, figures, or specific scripts?
  • Do you maintain the same structure across different projects?
  • Have you ever discussed this topic with collaborators?

File structure and naming conventions

Applying a consistent file structure and naming conventions to your files will help you manage your data efficiently. Consider the following practices:

  • Folder structure: establish a logical and intuitive folder structure that mirrors the organization of research projects and experimental data. Employ descriptive folder names for easy identification and access to specific data files.
    • Subfolders: enhance the organization using subfolders to further categorize data based on their contents, such as workflows, scripts, results, reports, etc.
  • File naming conventions: implement a standardized file naming convention to maintain consistency and clarity. Use descriptive and informative names (e.g., specify data type: plots, results tables, etc.)

In this lesson, we will see a practical example of how you could organize your own files and folders.

Folder organization

Here we suggest the use of three main folders (a combined layout sketch follows the list):

  1. Shared project data folders:
  • This shared directory is designated for storing unprocessed sequencing data files, with each subfolder representing a separate project.
  • Each project folder contains raw data, corresponding metadata, and optionally pre-processed data like quality control reports and processed data.
    • Include the pipeline or workflow used for data processing, along with a metadata file.
  • Data in these folders should be locked and set to read-only to prevent unwanted modifications.
  2. Individual project folders:
  • This directory typically belongs to the researcher conducting bioinformatics analyses and encompasses all essential files for a specific research project (data, scripts, software, workflows, results, etc.).
  • A project may utilize data from various assays or results obtained from other projects. It’s important to avoid duplicating datasets; instead, link them from the original source to maintain data integrity and avoid redundancy.
  3. Resources and databases folders:
  • This (commonly) shared directory contains common repositories or curated databases that facilitate research (genomics, clinical data, imaging data, and more!). For instance, in genomics, it includes genome references (fasta files), annotations (gtf files) for different species, and indexes for various alignment algorithms.
  • Each folder corresponds to a unique reference or database version, allowing for multiple references from the same organism or different species.
    • Ensure each contains the version of the reference and a metadata file.
    • More subfolders can be created for different data formats.
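
Putting the three folders together, the overall layout could look like the sketch below (all names are illustrative):

shared_data/                  # 1. Shared project data (read-only)
├── <ASSAY_ID>/
│  ├── raw/
│  ├── processed/             # optional QC reports and processed data
│  ├── pipeline/              # workflow used for processing
│  └── metadata.yml
projects/                     # 2. Individual project folders
├── <PROJECT_ID>/
│  ├── data/                  # softlinks to shared_data, no copies
│  ├── scripts/
│  └── results/
resources/                    # 3. Resources and databases (shared)
└── <reference>/
   └── <version>/             # fasta, gtf, indexes + metadata file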
Verify the integrity of downloaded files!

Ensure that the person downloading the files uses checksums produced by cryptographic hash functions (MD5, SHA-1, SHA-256) to verify file integrity and ascertain that files are neither corrupted nor tampered with.

  • MD5 Checksum: Files with names ending in “.md5” contain MD5 checksums. For instance, “filename.txt.md5” holds the MD5 checksum of “filename.txt”.
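
For example, assuming both files sit in the same directory, a single command verifies the file against its companion checksum:

md5sum --check filename.txt.md5
# prints "filename.txt: OK" if the checksum matches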

A database is a structured repository for storing, managing, and retrieving information, forming the cornerstone of efficient data organization.

Create shortcuts to public datasets and assays!

The use of symbolic links, also referred to as softlinks, is a key practice in large labs where data might be used for different purposes and by multiple people.

  • They act as pointers, containing the path to the location of the target files/directories.
  • They avoid duplication and they are flexible and lightweight (do not occupy much disk space).
  • They simplify directory structures.
  • Extra use case: create symbolic links to executable files and libraries!
Exercise: create a softlink

Open your terminal and create a softlink using the following command. The first path is the target (directory or file) and the second one is where the symbolic link will be created.

ln -s path/to/dataset/<ASSAY_ID> /path/to/user/<PROJECT_ID>/data/

Now, access the target file/directory through the symbolic link:

ls /path/to/user/<PROJECT_ID>/data/

Follow this example if you need extra guidance (adjust the paths to your system!):

  1. Create the target/original file:
echo "This is the content of the original file." > /home/users/Documents/original_file.txt
  2. Create the symbolic link:
ln -s /home/users/Documents/original_file.txt /home/users/Desktop/original_file.txt
  3. Verify the symbolic link:
ls -l /home/users/Desktop/original_file.txt
  4. Access the file through the symbolic link:
cat /home/users/Desktop/original_file.txt

The last command will display the contents of the original file.

Template engine

Setting up folder structures manually for each new project can be time-consuming. Thankfully, tools like Cookiecutter offer a solution by allowing users to create project templates easily. These templates can ensure consistency across projects and save time. Additionally, using cruft alongside Cookiecutter can assist in maintaining older templates when updates are made (by synchronizing them with the latest version).
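
As a minimal sketch, creating a project from an existing template takes two commands (the template below is just a popular public example; swap in your own):

pip install cookiecutter
# Generate a new project folder structure from a public template (answers a short prompt)
cookiecutter https://github.com/drivendata/cookiecutter-data-science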

Cookiecutter templates

Quick tutorial on cookiecutter

Sandbox Tutorial

Learn how to create your own template here.

3. Resources and databases folder

Health databases are utilized for storing, organizing, and providing access to diverse health-related data, including genomic data, clinical records, imaging data, and more. These resources are regularly updated and released under different versions from various sources. To ensure data reproducibility, it’s crucial to manage and specify the versions and sources of data within these databases.

For example, preprocessing NGS data involves various genomic resources for tasks like aligning and annotating fastq files. Essential resources include reference genomes in FASTA format (e.g., human and mouse), indexed FASTA files for alignment tools like STAR and Bowtie, and GTF or GFF files for quantifying reads over genomic regions. The latest major human reference assembly is GRCh38; however, many past studies are based on GRCh37.

How can you keep track of your resources? Name the folder using the version, or use a reference genome manager such as refgenie.

Refgenie

Refgenie manages the storage, access, and transfer of reference genome resources. It provides command-line and Python interfaces to download pre-built reference genome “assets”, such as the indexes used by bioinformatics tools, and it can also build assets for custom genome assemblies. Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another. Check this tutorial to get started.
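
As a minimal sketch of the basic workflow (the configuration path is illustrative; see the refgenie documentation for details):

pip install refgenie
# Point refgenie at a configuration file and initialize it
export REFGENIE='genome_config.yaml'
refgenie init -c $REFGENIE
# Download a pre-built asset, e.g., the fasta for hg38
refgenie pull hg38/fasta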

Manual Download

Best practices for downloading data from the source while ensuring the preservation of information about the version and other metadata include:

  • Organizing data structure: Create a data structure that allows storing all versions in the same parent directory, and ensure that all lab members follow these practices.
  • Documentation and metadata preservation: Before downloading, carefully review the documentation provided by the database. Download files containing the data version and any associated metadata.
  • README.md: Record the version of the data in the README.md file.
  • Checksums: Check for and use checksums (MD5, SHA1, SHA256, …) provided by the database to verify the integrity of the downloaded data, ensuring that it hasn’t been corrupted during transfer. Do the exercise below to get more familiar with these files.
  • Verify file size: Check the file size provided by the source. This is not as secure as checksum verification, but discrepancies can indicate corruption.
  • Automated Processes: whenever possible, automate the download process to reduce the likelihood of errors and ensure consistency (e.g. use bash script or pipeline).

We recommend the use of md5sum to verify data integrity, especially if you are downloading large datasets, as it is commonly used. In this example, we use data from the HLA FTP Directory.

  1. Install md5sum (from the coreutils package)
#!/bin/bash
# On Ubuntu/Debian
sudo apt-get install coreutils
# On macOS (Homebrew installs the GNU tools with a "g" prefix, e.g., gmd5sum,
# unless you add the gnubin directory to your PATH)
brew install coreutils
  2. Create a bash script to download the target files (named “dw_resources.sh” in the data structure).
#!/bin/bash
# Important: go through the README before downloading! Check if a checksums file is included.

# 1. Create or change to the resources directory.

# 2. Check for a checksums file (e.g., md5checksum.txt), download it, and edit it so that it
#    only contains the checksums of the target files. The file will look like this:
#    7348fbef5ab204f3aca67e91f6c59ed2  hla_prot.fasta
# Finally, save its name:
md5file="md5checksum.txt"

# Define the URL of the file to download
url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_prot.fasta"

# (Optional 1) Save the original file name: filename=$(basename "$url")
# (Optional 2) Define a different filename to save the downloaded file (`wget -O "$out_filename"`)
# out_filename="imgt_hla_prot.fasta"

# Download the file and verify its checksum
wget "$url" && \
md5sum --status --check "$md5file"

We recommend using the argument `--status` **only** when you incorporate this sanity check into a pipeline, so that it only prints errors (there is no output on success).
  3. Folder structure
genomic_resources/
├── species1/
│  └── version/
│     ├── files.txt
│     └── indexes/
└── dw_resources.sh
  4. Create an md5sum file and share it with collaborators before sharing the data. This allows others to check the integrity of the files.
md5sum <data> > md5checksum.txt
Exercise

Download a file using md5sums. Choose a file from your favorite dataset or select one from the HLA database (for quick testing, consider using a text file such as Nomenclature_2009.txt).

Naming conventions

Consistent naming conventions play a crucial role in scientific research by enhancing organization and data retrieval. By adopting standardized naming conventions, researchers ensure that files, experiments, and datasets are labeled logically, making similar data easy to locate and compare. In fields like genomics or health data science, uniform naming of the files associated with a particular experiment or sample allows relevant data to be identified and compared quickly, streamlining the research process and contributing to the reproducibility of findings. Overall, it promotes efficiency, collaboration, and the integrity of scientific work.

General tips for file and folder naming

Remember to keep the folder structure simple.

  • Keep it short and meaningful (use understandable abbreviations only, e.g., Cor for correlations or LFC for Log Fold Change)
  • Consider including one of these elements: project name, category, descriptor, content, author…
    • Author-based: use initials
  • Use alphanumeric characters: letters (A-Z) and numbers (0-9)
  • Avoid special characters: ~ ! @ # $ % ^ & * ( ) ` " | ' : =
  • Date-based format: use YYYYMMDD format (year/month/day format helps with sorting and listing files in chronological order)
  • Use underscores and hyphens as delimiters and avoid spaces.
    • Not all search tools may work well with spaces (messy to indicate paths)
    • If the length is a concern, use capital letters to delimit words instead (camelCase).
  • Sequential numbering: Use a two-digit format for single-digit numbers (0-9) to ensure correct numerical sort order (for example, 01, not 1, if your sequence only goes up to 99)
  • Version control: Indicate the version (β€œV”) or revision (β€œR”) as the last element, using the two-digit format (e.g., v01, v02)
  • Write down your naming convention pattern and document it in the README file (see the example below)
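
As an illustration (the pattern itself is up to you; record it in your README):

Pattern:  <project>_<dataType>_<YYYYMMDD>_<descriptor>_v<NN>.<ext>
Examples: rnaseq_counts_20240913_rawMatrix_v01.tsv
          rnaseq_plot_20240913_pcaSamples_v02.png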
Create your own naming conventions

Consider the most common types of files and folders you will be working with, such as visualizations, results tables, and processed files. Develop a logical and clear naming system for these files based on the tips provided above. Aim for concise and straightforward names to avoid complexity.

Which naming conventions should not be used and why?
A. data_processing_carlo's.py
B. raw_sequences_V#20241111.fasta
C. differential_expression_results_clara.csv
D. Grant proposal final.doc
E. sequence_alignment$v1.py
F. data/gene_annotations_20201107.gff
G. alpha~1.0/beta~2.0/reg_2024-05-98.tsv
H. alpha=1.0/beta=2.0/reg_2024-05-98.tsv
I. run_pipeline:20241203.sh

A, B, D, E, H, I: they contain spaces or special characters (the apostrophe in A, # in B, spaces in D, $ in E, = in H, : in I) that can break paths and confuse shells and downstream tools.

Which file name is more readable?
1a. forecast2000122420240724.tsv
1b. forecast_2000-12-24_2024-07-24.tsv
1c. forecast_2000_12_24_2024_07_24.tsv
2a. 01_data_preprocessing.R
2b. 1_data_preProcessing.R
2c. 01_d4t4_pr3processing.R
3a. B1_2024-12-12_cond~pH7_temp~37C.fastq
3b. B1.20241212.pH7.37C.fastq
3c. b1_2024-12-12_c0nd~pH7_t3mp~37C.fastq

1b: easiest for both humans and machines; `_` separates the two dates, while `-` separates the fields within each date (year-month-day). This matters, for example, when using wildcards in Snakemake to build pipelines.

2a: starts with 01 for correct sorting, and is consistent in its use of lower/upper case and separators (`_` separates metadata)

3a: `~` indicates that the variable temperature is set to 37 °C. A hyphen could be mistaken for a negative temperature and is better reserved for separating values within dates.
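
A quick way to see why these separators pay off: since `_` splits fields and `-` stays inside each date, a shell wildcard can select files by a single field. A small illustration (hypothetical filenames):

# All forecasts starting on 2000-12-24, regardless of end date
ls forecast_2000-12-24_*.tsv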

Regular expressions are an incredibly powerful tool for string manipulation. We recommend checking out RegexOne to learn how to create smart file names that will help you parse them more efficiently. To learn more about naming conventions for NGS analysis and see additional examples, click here.
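
You can also test a pattern directly in the shell before answering the quiz below; a minimal sketch, assuming the files live under the current directory:

# Keep only the paths that match the regular expression
find . -type f | grep -E 'rna_seq.*\.tsv$'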

Which of the following regexps match the following filenames?

(the filenames in bold are the ones that SHOULD be matched):

  • rna_seq/2021/03/results/Sample_A123_gene_expression.tsv
  • proteomics/2020/11/Sample_B234_protein_abundance.tsv
  • rna_seq/2021/03/results/Sample_C345_normalized_counts.tsv
  • rna_seq/2021/03/results/Sample_D456_quality_report.log
  • metabolomics/2019/05/Sample_E567_metabolite_levels.tsv
  • rna_seq/2019/12/Sample_F678_raw_reads.fastq
  • rna_seq/2021/03/results/Sample_G789_transcript_counts.tsv
  • proteomics/2021/02/Sample_H890_protein_quantification.TSV

Regular Expressions:

rna_seq.*\.tsv
.*\.csv
.*/2021/03/.*\.tsv
.*Sample_.*_gene_expression.tsv
rna_seq/2021/03/results/Sample_.*_.*\.tsv

rna_seq.*\.tsv and rna_seq/2021/03/results/Sample_.*_.*\.tsv match exactly the same files

Wrap up

In this lesson, we have learned some practical tips and examples about how to organize your data and bring some order to chaos! It is now your responsibility to use and implement them in a reasonable way. Complete the practical tutorial on using cookiecutter as a template engine to be able to create your own templates and reuse them as much as you need.
