NGS Assay and Project metadata

Published

November 30, 2023

Modified

September 13, 2024

Section Overview

Time Estimation: X minutes

💬 Learning Objectives:

  1. Develop your metadata

The three tables below contain pre-filled metadata fields, each serving a distinct purpose: project metadata, sample metadata, and assay (experimental) metadata. Consider revisiting these examples after completing lesson 4 of the course material.

Project metadata fields

Here you will find a table with possible metadata fields that you can use to annotate and track your Project folders:

| Metadata field | Definition | Format | Ontology | Example |
|---|---|---|---|---|
| project | Project ID | `<surname>_et_al_2023` | NA | proks_et_al_2023 |
| author | Owner of the project | `<First name> <Surname>` | NA | Martin Proks |
| date | Date of creation | YYYYMMDD | NA | 20230101 |
| description | Short description of the project | Plain text | NA | This is a project describing the effect of Oct4 perturbation after pERK activation |
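
The Format column can be checked automatically before a project folder is registered. The following is a minimal sketch, not part of the course material: the function name `validate_project_metadata` is hypothetical, and the regular expression assumes a lowercase surname and a four-digit year, as in the example `proks_et_al_2023`.

```python
import re
from datetime import datetime

# Assumed pattern for the "<surname>_et_al_<year>" format (lowercase surname).
PROJECT_ID = re.compile(r"[a-z]+_et_al_\d{4}")


def validate_project_metadata(record: dict) -> list[str]:
    """Return a list of format problems found in a project metadata record."""
    problems = []

    project = record.get("project", "")
    if not PROJECT_ID.fullmatch(project):
        problems.append(f"project is not <surname>_et_al_<year>: {project!r}")

    date = record.get("date", "")
    try:
        # Require exactly eight digits first, because strptime on its own
        # also accepts one-digit months and days.
        if not re.fullmatch(r"\d{8}", date):
            raise ValueError("not eight digits")
        datetime.strptime(date, "%Y%m%d")
    except ValueError:
        problems.append(f"date is not a valid YYYYMMDD date: {date!r}")

    return problems


example = {
    "project": "proks_et_al_2023",
    "author": "Martin Proks",
    "date": "20230101",
    "description": "This is a project describing the effect of "
                   "Oct4 perturbation after pERK activation",
}
print(validate_project_metadata(example))  # prints []
```

An empty list means the record follows the formats in the table; each problem is reported as a separate message so that several issues can be fixed in one pass.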

Sample metadata fields

Some details might be specific to your samples, for example which samples are treated and which are controls, which tissue or cell type they come from, their age, and so on. Here is a list of possible metadata fields that you can use:

| Metadata field | Definition | Format | Ontology | Example |
|---|---|---|---|---|
| sample | Name of the sample | NA | NA | control_rep1, treat_rep1 |
| fastq_1 | Path to fastq file 1 | NA | NA | AEG588A1_S1_L002_R1_001.fastq.gz |
| fastq_2 | Path to paired fastq file, if it is a paired experiment | NA | NA | AEG588A1_S1_L002_R2_001.fastq.gz |
| strandedness | The strandedness of the cDNA library | `<unstranded OR forward OR reverse>` | NA | unstranded |
| condition | Variable of interest of the experiment, such as "control", "treatment", etc. | wordWord | camelCase | control, treat1, treat2 |
| cell_type | The cell type(s) known or selected to be present in the sample | NA | ontology field, e.g. EFO or OBI | NA |
| tissue | The tissue from which the sample was taken | NA | Uberon | NA |
| sex | The biological/genetic sex of the sample | NA | ontology field, e.g. EFO or OBI | NA |
| cell_line | Cell line of the sample | NA | ontology field, e.g. EFO or OBI | NA |
| organism | Organism origin of the sample | `<Genus species>` | Taxonomy | Mus musculus |
| replicate | Replicate number | `<integer>` | NA | 1 |
| batch | Batch information | wordWord | camelCase | 1 |
| disease | Any diseases that may affect the sample | NA | Disease Ontology or MONDO | NA |
| developmental_stage | The developmental stage of the sample | NA | NA | NA |
| sample_type | The type of the collected specimen, e.g. tissue biopsy, blood draw or throat swab | NA | NA | NA |
| strain | Strain of the species from which the sample was collected, if applicable | NA | ontology field, e.g. NCBITaxonomy | NA |
| genetic variation | Any relevant genetic differences from the specimen or sample to the expected genomic information for this species, e.g. abnormal chromosome counts, major translocations or indels | NA | NA | NA |
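
Sample metadata of this kind is often kept as a CSV sample sheet with one row per sample. The sketch below is only an illustration under stated assumptions: the helper `write_samplesheet` and the chosen subset of columns are hypothetical, and the file names of the second sample are made up for the example.

```python
import csv
import io

# Allowed values taken from the Format column of the strandedness field.
ALLOWED_STRANDEDNESS = {"unstranded", "forward", "reverse"}
# Hypothetical subset of the sample metadata fields used as CSV columns.
COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness",
           "condition", "organism", "replicate"]


def write_samplesheet(rows, handle):
    """Write sample records as CSV, rejecting unknown strandedness values."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        if row.get("strandedness") not in ALLOWED_STRANDEDNESS:
            raise ValueError(f"invalid strandedness for {row.get('sample')!r}")
        writer.writerow(row)


samples = [
    {"sample": "control_rep1",
     "fastq_1": "AEG588A1_S1_L002_R1_001.fastq.gz",
     "fastq_2": "AEG588A1_S1_L002_R2_001.fastq.gz",
     "strandedness": "unstranded", "condition": "control",
     "organism": "Mus musculus", "replicate": 1},
    # File names below are hypothetical placeholders.
    {"sample": "treat_rep1",
     "fastq_1": "treat_rep1_R1.fastq.gz",
     "fastq_2": "treat_rep1_R2.fastq.gz",
     "strandedness": "unstranded", "condition": "treat1",
     "organism": "Mus musculus", "replicate": 1},
]

buffer = io.StringIO(newline="")
write_samplesheet(samples, buffer)
print(buffer.getvalue())
```

Writing to a file instead of a buffer only requires passing `open("samplesheet.csv", "w", newline="")` as the handle; the file name is again an assumption.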

Assay metadata fields

Here you will find a table with possible metadata fields that you can use to annotate and track your Assay folders:

| Metadata field | Definition | Format | Ontology | Example |
|---|---|---|---|---|
| assay_ID | Identifier for the assay that is at least unique within the project | `<Assay-ID>_<keyword>_YYYYMMDD` | NA | CHIP_Oct4_20200101 |
| assay_type | The type of experiment performed, e.g. ATAC-seq or seqFISH | NA | ontology field, e.g. EFO or OBI | ChIPseq |
| assay_subtype | More specific type of assay, like bulk nascent RNAseq or single cell ATACseq | NA | ontology field, e.g. EFO or OBI | bulk ChIPseq |
| owner | Owner of the assay (who performed the experiment) | `<First Name> <Last Name>` | NA | Jose Romero |
| platform | The type of instrument used to perform the assay, e.g. Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform | NA | ontology field, e.g. EFO or OBI | Illumina |
| extraction_method | Technique used to extract the nucleic acid from the cell | NA | ontology field, e.g. EFO or OBI | NA |
| library_method | Technique used to amplify a cDNA library | NA | ontology field, e.g. EFO or OBI | NA |
| external_accessions | Accession numbers from external resources to which assay or protocol information was submitted | NA | e.g. protocols.io, AE, GEO accession number, etc. | GSEXXXXX |
| keyword | Keyword for easy identification | wordWord | camelCase | Oct4ChIP |
| date | Date of assay creation | YYYYMMDD | NA | 20200101 |
| nsamples | Number of samples analyzed in this assay | `<integer>` | NA | 9 |
| is_paired | Paired fastq files or not | `<single OR paired>` | NA | single |
| pipeline | Pipeline used to process data and version | NA | NA | nf-core/chipseq -r 1.0 |
| strandedness | The strandedness of the cDNA library | `<+ OR - OR *>` | NA | `*` |
| processed_by | Who processed the data | `<First Name> <Last Name>` | NA | Sarah Lundregan |
| organism | Organism origin | `<Genus species>` | Taxonomy name | Mus musculus |
| origin | Whether the data are internal or external (from a public resource) | `<internal OR external>` | NA | internal |
| path | Path to files | `</path/to/file>` | NA | NA |
| short_desc | Short description of the assay | plain text | NA | Oct4 ChIP after pERK activation |
| ELN_ID | ID of the experiment/assay in your Electronic Lab Notebook software, like labguru or benchling | plain text | NA | NA |
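
The `assay_ID` format combines an assay prefix, a keyword, and a creation date, so it can be generated and parsed rather than typed by hand. This is a small sketch under assumptions: `make_assay_id` and `parse_assay_id` are hypothetical helper names, and the pattern assumes the keyword contains only letters and digits (camelCase, no underscores).

```python
import re
from datetime import date, datetime

# Assumed shape of "<Assay-ID>_<keyword>_YYYYMMDD"; the keyword has no underscores.
ASSAY_ID = re.compile(
    r"(?P<assay>[A-Za-z0-9-]+)_(?P<keyword>[A-Za-z0-9]+)_(?P<date>\d{8})"
)


def make_assay_id(assay: str, keyword: str, created: date) -> str:
    """Build an assay ID from its three components."""
    return f"{assay}_{keyword}_{created:%Y%m%d}"


def parse_assay_id(assay_id: str) -> dict:
    """Split an assay ID into its components, converting the date."""
    match = ASSAY_ID.fullmatch(assay_id)
    if match is None:
        raise ValueError(f"not a valid assay ID: {assay_id!r}")
    parts = match.groupdict()
    parts["date"] = datetime.strptime(parts["date"], "%Y%m%d").date()
    return parts


print(make_assay_id("CHIP", "Oct4", date(2020, 1, 1)))  # CHIP_Oct4_20200101
print(parse_assay_id("CHIP_Oct4_20200101")["keyword"])  # Oct4
```

Generating the ID from the `keyword` and `date` fields keeps the three values consistent with each other in the metadata file.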

The metadata must include key details such as the project’s short description, author information, creation date, experimental protocol, assay ID, assay type, platform utilized, library details, keywords, sample count, paired-end status, processor information, organism studied, sample origin, and file path.
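
A completeness check along these lines can be sketched in a few lines of Python. The mapping from the details listed above to field names is an assumption made for this example (for instance, "library details" is mapped to `library_method`, and "experimental protocol" has no obvious field in the assay table, so it is left out), and `missing_fields` is a hypothetical helper.

```python
# Assumed mapping of the required details onto assay metadata field names.
REQUIRED_FIELDS = {
    "short_desc", "owner", "date", "assay_ID", "assay_type", "platform",
    "library_method", "keyword", "nsamples", "is_paired", "processed_by",
    "organism", "origin", "path",
}


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty, sorted by name."""
    return sorted(
        field for field in REQUIRED_FIELDS
        if record.get(field) in (None, "")
    )


# Record built from the Example column of the assay table;
# library_method and path are deliberately left out.
assay = {
    "assay_ID": "CHIP_Oct4_20200101",
    "assay_type": "ChIPseq",
    "owner": "Jose Romero",
    "platform": "Illumina",
    "keyword": "Oct4ChIP",
    "date": "20200101",
    "nsamples": 9,
    "is_paired": "single",
    "processed_by": "Sarah Lundregan",
    "organism": "Mus musculus",
    "origin": "internal",
    "short_desc": "Oct4 ChIP after pERK activation",
}
print(missing_fields(assay))  # ['library_method', 'path']
```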

If you built a database from the assay metadata files, the resulting table would look like this (each row corresponds to one assay):

| assay_ID | assay_type | assay_subtype | owner | platform | extraction_method | library_method | external_accessions | keyword | date | nsamples | is_paired | pipeline | strandedness | processed_by | organism | origin | path | short_desc | ELN_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RNA_oct4_20200101 | RNAseq | bulk RNAseq | Sarah Lundregan | NextSeq 2000 | NA | NA | NA | oct4 | 20200101 | 9 | paired | nf-core/rnaseq 3.12.0 | `*` | SL | Mus musculus | internal | NA | Bulk RNAseq of Oct4 knockout | 234 |
| CHIP_oct4_20200101 | ChIPseq | bulk ChIPseq | Jose Romero | NextSeq 2000 | NA | NA | NA | oct4 | 20200101 | 9 | single | nf-core/chipseq 2.3.1 | `*` | JARH | Mus musculus | internal | NA | Bulk ChIPseq of Oct4 overexpression | 123 |
| CHIP_med1_20190204 | ChIPseq | bulk ChIPseq | Martin Proks | NextSeq 2000 | NA | NA | NA | med1 | 20190204 | 12 | single | nf-core/chipseq 2.3.1 | `*` | MP | Mus musculus | internal | NA | Bulk ChIPseq of Med1 overexpression | 345 |
| SCR_humanSkin_20210302 | RNAseq | single cell RNAseq | Jose Romero | NextSeq 2000 | NA | NA | NA | humanSkin | 20210302 | 23123 | paired | nf-core/scrnaseq 1.8.2 | `*` | JARH | Homo sapiens | external | NA | scRNAseq analysis of human skin development | NA |
| SCR_humanBrain_20220610 | RNAseq | single cell RNAseq | Martin Proks | NextSeq 2000 | NA | NA | NA | humanBrain | 20220610 | 1234 | paired | custom | `*` | MP | Homo sapiens | external | NA | scRNAseq analysis of human brain development | NA |
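
One way such a table could be assembled is by collecting one metadata file per assay folder into a single CSV. The sketch below is illustrative only: it assumes each assay folder holds a JSON file named `metadata.json` (the file name and format are assumptions, not a convention stated here), uses a reduced set of columns, and builds a temporary folder tree so that it runs on its own.

```python
import csv
import json
import tempfile
from pathlib import Path

# Reduced, hypothetical column selection for the combined table.
COLUMNS = ["assay_ID", "assay_type", "owner", "date",
           "nsamples", "is_paired", "organism", "origin"]


def collect_assay_metadata(root: Path) -> list[dict]:
    """Read every metadata.json below root; one row per assay, sorted by ID."""
    rows = []
    for path in root.rglob("metadata.json"):
        record = json.loads(path.read_text(encoding="utf-8"))
        # Fields absent from a file are filled with "NA".
        rows.append({column: record.get(column, "NA") for column in COLUMNS})
    return sorted(rows, key=lambda row: row["assay_ID"])


def write_table(rows: list[dict], destination: Path) -> None:
    """Write the collected rows to a CSV file with a header line."""
    with destination.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


demo_records = [
    {"assay_ID": "RNA_oct4_20200101", "assay_type": "RNAseq",
     "owner": "Sarah Lundregan", "date": "20200101", "nsamples": 9,
     "is_paired": "paired", "organism": "Mus musculus", "origin": "internal"},
    {"assay_ID": "CHIP_oct4_20200101", "assay_type": "ChIPseq",
     "owner": "Jose Romero", "date": "20200101", "nsamples": 9,
     "is_paired": "single", "organism": "Mus musculus", "origin": "internal"},
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create one folder per assay, each with its own metadata file.
    for record in demo_records:
        folder = root / record["assay_ID"]
        folder.mkdir()
        (folder / "metadata.json").write_text(json.dumps(record), encoding="utf-8")

    rows = collect_assay_metadata(root)
    write_table(rows, root / "assays.csv")
    table_text = (root / "assays.csv").read_text(encoding="utf-8")

print([row["assay_ID"] for row in rows])
# ['CHIP_oct4_20200101', 'RNA_oct4_20200101']
```

Keeping the metadata in the assay folders and regenerating the table from them means the table never has to be edited by hand.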

Sources