NGS Assay and Project metadata

Published

November 30, 2023

Modified

November 6, 2024

Section Overview

⏰ Time Estimation: X minutes

💬 Learning Objectives:

Develop your metadata

Review the three tables below, which contain pre-filled metadata fields for three distinct purposes: project metadata, sample metadata, and assay (experimental) metadata. Consider revisiting these examples after completing lesson 4 of the course material.

Project metadata fields

Here you will find a table with possible metadata fields that you can use to annotate and track your Project folders:

Metadata field	Definition	Format	Ontology	Example
project	Project ID	<surname>_et_al_2023	NA	proks_et_al_2023
author	Owner of the project	<First name> <Surname>	NA	Martin Proks
date	Date of creation	YYYYMMDD	NA	20230101
description	Short description of the project	Plain text	NA	This is a project describing the effect of Oct4 perturbation after pERK activation
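
The formats in the table can be checked automatically before a project folder is registered. The following is a minimal sketch, not a prescribed tool: the function name `validate_project_metadata` and the regular expression for the project ID (lowercase surname, four-digit year) are assumptions made for illustration.

```python
import re
from datetime import datetime

# Assumed pattern for "<surname>_et_al_<year>": lowercase surname, 4-digit year.
PROJECT_ID = re.compile(r"^[a-z]+_et_al_\d{4}$")
REQUIRED_FIELDS = ("project", "author", "date", "description")


def validate_project_metadata(meta: dict) -> list:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not meta.get(name)]

    project = meta.get("project")
    if project and not PROJECT_ID.match(project):
        problems.append("project should look like <surname>_et_al_<year>")

    date = meta.get("date")
    if date:
        try:
            # Require exactly eight digits, then check it is a real calendar date.
            if not re.fullmatch(r"\d{8}", date):
                raise ValueError(date)
            datetime.strptime(date, "%Y%m%d")
        except ValueError:
            problems.append("date should be YYYYMMDD")
    return problems


example = {
    "project": "proks_et_al_2023",
    "author": "Martin Proks",
    "date": "20230101",
    "description": "Effect of Oct4 perturbation after pERK activation",
}
print(validate_project_metadata(example))  # prints []
```

Returning a list of messages rather than raising on the first problem lets you report every issue in a record at once.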

Sample metadata fields

Some details might be specific to your samples: for example, which samples are treated and which are controls, the tissue or cell type they come from, their age, and so on. Here is a list of possible metadata fields that you can use:

Metadata field	Definition	Format	Ontology	Example
sample	Name of the sample	NA	NA	control_rep1, treat_rep1
fastq_1	Path to fastq file 1	NA	NA	AEG588A1_S1_L002_R1_001.fastq.gz
fastq_2	Path to paired fastq file, if it is a paired experiment	NA	NA	AEG588A1_S1_L002_R2_001.fastq.gz
strandedness	The strandedness of the cDNA library	<unstranded OR forward OR reverse>	NA	unstranded
condition	Variable of interest of the experiment, such as "control", "treatment", etc	wordWord	camelCase	control, treat1, treat2
cell_type	The cell type(s) known or selected to be present in the sample	NA	ontology field- e.g. EFO or OBI	NA
tissue	The tissue from which the sample was taken	NA	Uberon	NA
sex	The biological/genetic sex of the sample	NA	ontology field- e.g. EFO or OBI	NA
cell_line	Cell line of the sample	NA	ontology field- e.g. EFO or OBI	NA
organism	Organism origin of the sample	<Genus species>	Taxonomy	Mus musculus
replicate	Replicate number	<integer>	NA	1
batch	Batch information	wordWord	camelCase	1
disease	Any diseases that may affect the sample	NA	Disease Ontology or MONDO	NA
developmental_stage	The developmental stage of the sample	NA	NA	NA
sample_type	The type of the collected specimen, eg tissue biopsy, blood draw or throat swab	NA	NA	NA
strain	Strain of the species from which the sample was collected, if applicable	NA	ontology field - e.g. NCBITaxonomy	NA
genetic variation	Any relevant genetic differences from the specimen or sample to the expected genomic information for this species, eg abnormal chromosome counts, major translocations or indels	NA	NA	NA
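
Sample metadata is often stored as a comma-separated sample sheet, and the controlled fields in the table lend themselves to a quick consistency check. The sketch below assumes a sheet with the columns `sample`, `fastq_1`, `fastq_2`, `strandedness`, `condition`, and `replicate`; the function name `check_samples` and the second data row are hypothetical and exist only to trigger the checks.

```python
import csv
import io

ALLOWED_STRANDEDNESS = {"unstranded", "forward", "reverse"}


def check_samples(csv_text: str) -> list:
    """Return one message per problem found in a comma-separated sample sheet."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 holds the column names.
    for line_no, row in enumerate(reader, start=2):
        for required in ("sample", "fastq_1"):
            if not (row.get(required) or "").strip():
                problems.append(f"line {line_no}: {required} is empty")
        if row.get("strandedness") not in ALLOWED_STRANDEDNESS:
            problems.append(
                f"line {line_no}: strandedness must be one of {sorted(ALLOWED_STRANDEDNESS)}"
            )
        if not (row.get("replicate") or "").isdigit():
            problems.append(f"line {line_no}: replicate must be an integer")
    return problems


sheet = """sample,fastq_1,fastq_2,strandedness,condition,replicate
control_rep1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,unstranded,control,1
treat_rep1,,,stranded,treat1,one
"""
for message in check_samples(sheet):
    print(message)
```

Here the first row passes, while the second row is reported for its empty `fastq_1`, its unrecognised strandedness value, and its non-integer replicate.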

Assay metadata fields

Here you will find a table with possible metadata fields that you can use to annotate and track your Assay folders:

Metadata field	Definition	Format	Ontology	Example
assay_ID	Identifier for the assay that is at least unique within the project	<Assay-ID>_<keyword>_YYYYMMDD	NA	CHIP_Oct4_20200101
assay_type	The type of experiment performed, eg ATAC-seq or seqFISH	NA	ontology field- e.g. EFO or OBI	ChIPseq
assay_subtype	More specific type of assay, such as bulk nascent RNAseq or single-cell ATACseq	NA	ontology field- e.g. EFO or OBI	bulk ChIPseq
owner	Owner of the assay (the person who performed the experiment)	<First Name> <Last Name>	NA	Jose Romero
platform	The type of instrument used to perform the assay, eg Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform	NA	ontology field- e.g. EFO or OBI	Illumina
extraction_method	Technique used to extract the nucleic acid from the cell	NA	ontology field- e.g. EFO or OBI	NA
library_method	Technique used to amplify a cDNA library	NA	ontology field- e.g. EFO or OBI	NA
external_accessions	Accession numbers from external resources to which assay or protocol information was submitted	NA	eg protocols.io, AE, GEO accession number, etc	GSEXXXXX
keyword	Keyword for easy identification	wordWord	camelCase	Oct4ChIP
date	Date of assay creation	YYYYMMDD	NA	20200101
nsamples	Number of samples analyzed in this assay	<integer>	NA	9
is_paired	Paired fastq files or not	<single OR paired>	NA	single
pipeline	Pipeline used to process data and version	NA	NA	nf-core/chipseq -r 1.0
strandedness	The strandedness of the cDNA library	<+ OR - OR *>	NA	*
processed_by	Who processed the data	<First Name> <Last Name>	NA	Sarah Lundregan
organism	Organism origin	<Genus species>	Taxonomy name	Mus musculus
origin	Whether the data is internal or external (from a public resource)	<internal OR external>	NA	internal
path	Path to files	</path/to/file>	NA	NA
short_desc	Short description of the assay	plain text	NA	Oct4 ChIP after pERK activation
ELN_ID	ID of the experiment/assay in your Electronic Lab Notebook software, like labguru or benchling	plain text	NA	NA
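
Because `assay_ID` follows a fixed pattern built from a short assay label, the `keyword`, and the `date`, it can be generated and parsed by code instead of typed by hand. The helpers below are a sketch under that assumption; the names `make_assay_id` and `split_assay_id` are illustrative, and the parser assumes the assay label itself contains no underscore.

```python
from datetime import date, datetime


def make_assay_id(prefix: str, keyword: str, created: date) -> str:
    """Build an ID of the form <prefix>_<keyword>_YYYYMMDD."""
    return f"{prefix}_{keyword}_{created:%Y%m%d}"


def split_assay_id(assay_id: str) -> tuple:
    """Split an assay ID back into (prefix, keyword, date)."""
    # The prefix is everything before the first underscore and the date is
    # everything after the last one, so the keyword may contain underscores.
    prefix, rest = assay_id.split("_", 1)
    keyword, stamp = rest.rsplit("_", 1)
    return prefix, keyword, datetime.strptime(stamp, "%Y%m%d").date()


assay_id = make_assay_id("CHIP", "Oct4", date(2020, 1, 1))
print(assay_id)  # prints CHIP_Oct4_20200101
print(split_assay_id(assay_id))
```

Deriving the ID from the other fields keeps the folder name, the `keyword`, and the `date` column from drifting apart.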

The metadata must include key details such as the project’s short description, author information, creation date, experimental protocol, assay ID, assay type, platform utilized, library details, keywords, sample count, paired-end status, processor information, organism studied, sample origin, and file path.

If you were to create a database from the metadata files, your table would look like this (each row corresponding to one assay):

assay_ID	assay_type	assay_subtype	owner	platform	extraction_method	library_method	external_accessions	keyword	date	nsamples	is_paired	pipeline	strandedness	processed_by	organism	origin	path	short_desc	ELN_ID
RNA_oct4_20200101	RNAseq	bulk RNAseq	Sarah Lundregan	NextSeq 2000	NA	NA	NA	oct4	20200101	9	paired	nf-core/rnaseq 3.12.0	*	SL	Mus musculus	internal	NA	Bulk RNAseq of Oct4 knockout	234
CHIP_oct4_20200101	ChIPseq	bulk ChIPseq	Jose Romero	NextSeq 2000	NA	NA	NA	oct4	20200101	9	single	nf-core/chipseq 2.3.1	*	JARH	Mus musculus	internal	NA	Bulk ChIPseq of Oct4 overexpression	123
CHIP_med1_20190204	ChIPseq	bulk ChIPseq	Martin Proks	NextSeq 2000	NA	NA	NA	med1	20190204	12	single	nf-core/chipseq 2.3.1	*	MP	Mus musculus	internal	NA	Bulk ChIPseq of Med1 overexpression	345
SCR_humanSkin_20210302	RNAseq	single cell RNAseq	Jose Romero	NextSeq 2000	NA	NA	NA	humanSkin	20210302	23123	paired	nf-core/scrnaseq 1.8.2	*	JARH	Homo sapiens	external	NA	scRNAseq analysis of human skin development	NA
SCR_humanBrain_20220610	RNAseq	single cell RNAseq	Martin Proks	NextSeq 2000	NA	NA	NA	humanBrain	20220610	1234	paired	custom	*	MP	Homo sapiens	external	NA	scRNAseq analysis of human brain development	NA
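
A table like the one above can be assembled by collecting one metadata record per assay and writing them out with a shared set of columns. The following sketch assumes each record has already been loaded into a Python dictionary (how the per-assay files are stored is not specified here); the function name `build_assay_table` is hypothetical, and fields missing from a record are filled with `NA`.

```python
import csv
import io

COLUMNS = [
    "assay_ID", "assay_type", "assay_subtype", "owner", "platform",
    "extraction_method", "library_method", "external_accessions", "keyword",
    "date", "nsamples", "is_paired", "pipeline", "strandedness",
    "processed_by", "organism", "origin", "path", "short_desc", "ELN_ID",
]


def build_assay_table(records: list) -> str:
    """Return a tab-separated table with one row per assay record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        delimiter="\t",
        restval="NA",            # value written for fields a record lacks
        extrasaction="ignore",   # silently drop keys that are not columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [
    {"assay_ID": "CHIP_oct4_20200101", "assay_type": "ChIPseq",
     "owner": "Jose Romero", "nsamples": 9, "is_paired": "single"},
    {"assay_ID": "CHIP_med1_20190204", "assay_type": "ChIPseq",
     "owner": "Martin Proks", "nsamples": 12, "is_paired": "single"},
]
print(build_assay_table(records))
```

Fixing the column order in one place means every assay contributes a row of the same shape, which makes the resulting file easy to load into a database or spreadsheet.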

Sources

Transcriptomics metadata standards and fields
Biological ontologies for data scientists, Bionty

Copyright

CC-BY-SA 4.0 license