NGS Assay and Project metadata
⏰ Time Estimation: X minutes
💬 Learning Objectives:
- Develop your metadata
You should consider revisiting these examples after completing lesson 4 in the course material. Please review these three tables containing pre-filled data fields for metadata, each serving distinct purposes: sample metadata, project metadata, and experimental metadata.
Project metadata fields
Here you will find a table with possible metadata fields that you can use to annotate and track your Project folders:
Metadata field | Definition | Format | Ontology | Example |
---|---|---|---|---|
project | Project ID | <surname\>_et_al_2023 | NA | proks_et_al_2023 |
author | Owner of the project | <First name\> <Surname\> | NA | Martin Proks |
date | Date of creation | YYYYMMDD | NA | 20230101 |
description | Short description of the project | Plain text | NA | This is a project describing the effect of Oct4 perturbation after pERK activation |
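The Format column for the project fields can be checked automatically before a metadata file is saved. The following is a minimal sketch, not a prescribed tool: the function names are hypothetical, and the project ID pattern assumes a lowercase surname followed by `_et_al_` and a four-digit year.

```python
import re
from datetime import datetime

# Assumption: lowercase surname, then "_et_al_", then a four-digit year.
PROJECT_ID = re.compile(r"[a-z]+_et_al_\d{4}")
REQUIRED_FIELDS = ("project", "author", "date", "description")


def is_yyyymmdd(text: str) -> bool:
    """Return True if text is exactly eight digits forming a real calendar date."""
    if not re.fullmatch(r"\d{8}", text):
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True


def validate_project_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problems."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(meta.get(field) or "").strip():
            problems.append(f"missing field: {field}")
    project = str(meta.get("project") or "")
    if project and not PROJECT_ID.fullmatch(project):
        problems.append("project should look like <surname>_et_al_<year>")
    date = str(meta.get("date") or "")
    if date and not is_yyyymmdd(date):
        problems.append("date should be a valid YYYYMMDD date")
    return problems


# Values taken from the Example column of the project metadata fields.
example = {
    "project": "proks_et_al_2023",
    "author": "Martin Proks",
    "date": "20230101",
    "description": "This is a project describing the effect of Oct4 "
                   "perturbation after pERK activation",
}
print(validate_project_metadata(example))  # prints []
```

Returning a list of problems rather than raising on the first one lets you report everything that needs fixing in a single pass.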
Sample metadata fields
Some details might be specific to your samples. For example, which samples are treated, which are controls, which tissue they come from, which cell type, the age, etc. Here is a list of possible metadata fields that you can use:
Metadata field | Definition | Format | Ontology | Example |
---|---|---|---|---|
sample | Name of the sample | NA | NA | control_rep1, treat_rep1 |
fastq_1 | Path to fastq file 1 | NA | NA | AEG588A1_S1_L002_R1_001.fastq.gz |
fastq_2 | Path to paired fastq file, if it is a paired experiment | NA | NA | AEG588A1_S1_L002_R2_001.fastq.gz |
strandedness | The strandedness of the cDNA library | <unstranded OR forward OR reverse \> | NA | unstranded |
condition | Variable of interest of the experiment, such as "control", "treatment", etc | wordWord | camelCase | control, treat1, treat2 |
cell_type | The cell type(s) known or selected to be present in the sample | NA | ontology field- e.g. EFO or OBI | NA |
tissue | The tissue from which the sample was taken | NA | Uberon | NA |
sex | The biological/genetic sex of the sample | NA | ontology field- e.g. EFO or OBI | NA |
cell_line | Cell line of the sample | NA | ontology field- e.g. EFO or OBI | NA |
organism | Organism origin of the sample | <Genus species\> | Taxonomy | Mus musculus |
replicate | Replicate number | <integer\> | NA | 1 |
batch | Batch information | wordWord | camelCase | 1 |
disease | Any diseases that may affect the sample | NA | Disease Ontology or MONDO | NA |
developmental_stage | The developmental stage of the sample | NA | NA | NA |
sample_type | The type of the collected specimen, eg tissue biopsy, blood draw or throat swab | NA | NA | NA |
strain | Strain of the species from which the sample was collected, if applicable | NA | ontology field - e.g. NCBITaxonomy | NA |
genetic_variation | Any relevant genetic differences from the specimen or sample to the expected genomic information for this species, eg abnormal chromosome counts, major translocations or indels | NA | NA | NA |
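Fields with a restricted format, such as `strandedness` and `replicate`, are easy to check when the sample metadata is stored as a CSV file. The sketch below assumes a CSV layout with one column per metadata field. The function name `check_samples` is hypothetical, the file names in the second row are made up, and that row is deliberately wrong to show what a report looks like.

```python
import csv
import io

# Allowed values taken from the Format column of the strandedness field.
ALLOWED_STRANDEDNESS = {"unstranded", "forward", "reverse"}


def check_samples(handle) -> list[str]:
    """Check each row of a sample metadata CSV and return readable problems."""
    problems = []
    # Line 1 of the file is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        name = row.get("sample") or f"line {line_no}"
        if not row.get("sample"):
            problems.append(f"{name}: sample name is empty")
        if row.get("strandedness") not in ALLOWED_STRANDEDNESS:
            allowed = ", ".join(sorted(ALLOWED_STRANDEDNESS))
            problems.append(f"{name}: strandedness must be one of {allowed}")
        if not (row.get("replicate") or "").isdigit():
            problems.append(f"{name}: replicate must be an integer")
    return problems


# In-memory stand-in for a sample sheet on disk.
sheet = io.StringIO(
    "sample,fastq_1,fastq_2,strandedness,condition,replicate\n"
    "control_rep1,AEG588A1_S1_L002_R1_001.fastq.gz,"
    "AEG588A1_S1_L002_R2_001.fastq.gz,unstranded,control,1\n"
    "treat_rep1,hypothetical_R1.fastq.gz,,stranded,treat1,one\n"
)
for problem in check_samples(sheet):
    print(problem)
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("samplesheet.csv", newline="")`, where the file name is again only an illustration.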
Assay metadata fields
Here you will find a table with possible metadata fields that you can use to annotate and track your Assay folders:
Metadata field | Definition | Format | Ontology | Example |
---|---|---|---|---|
assay_ID | Identifier for the assay that is at least unique within the project | <Assay-ID\>_<keyword\>_YYYYMMDD | NA | CHIP_Oct4_20200101 |
assay_type | The type of experiment performed, eg ATAC-seq or seqFISH | NA | ontology field- e.g. EFO or OBI | ChIPseq |
assay_subtype | More specific type of assay, such as bulk nascent RNAseq or single cell ATACseq | NA | ontology field- e.g. EFO or OBI | bulk ChIPseq |
owner | Owner of the assay (the person who performed the experiment) | <First Name\> <Last Name\> | NA | Jose Romero |
platform | The type of instrument used to perform the assay, eg Illumina HiSeq 4000 or Fluidigm C1 microfluidics platform | NA | ontology field- e.g. EFO or OBI | Illumina |
extraction_method | Technique used to extract the nucleic acid from the cell | NA | ontology field- e.g. EFO or OBI | NA |
library_method | Technique used to amplify a cDNA library | NA | ontology field- e.g. EFO or OBI | NA |
external_accessions | Accession numbers from external resources to which assay or protocol information was submitted | NA | eg protocols.io, AE, GEO accession number, etc | GSEXXXXX |
keyword | Keyword for easy identification | wordWord | camelCase | Oct4ChIP |
date | Date of assay creation | YYYYMMDD | NA | 20200101 |
nsamples | Number of samples analyzed in this assay | <integer\> | NA | 9 |
is_paired | Paired fastq files or not | <single OR paired\> | NA | single |
pipeline | Pipeline used to process data and version | NA | NA | nf-core/chipseq -r 1.0 |
strandedness | The strandedness of the cDNA library | <+ OR - OR *\> | NA | * |
processed_by | Who processed the data | <First Name\> <Last Name\> | NA | Sarah Lundregan |
organism | Organism origin | <Genus species\> | Taxonomy name | Mus musculus |
origin | Whether the data is internal or external (from a public resource) | <internal OR external\> | NA | internal |
path | Path to files | </path/to/file\> | NA | NA |
short_desc | Short description of the assay | plain text | NA | Oct4 ChIP after pERK activation |
ELN_ID | ID of the experiment/assay in your Electronic Lab Notebook software, such as Labguru or Benchling | plain text | NA | NA |
The metadata must include key details such as the project’s short description, author information, creation date, experimental protocol, assay ID, assay type, platform utilized, library details, keywords, sample count, paired-end status, processor information, organism studied, sample origin, and file path.
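Because `assay_ID` combines three pieces of information (`<Assay-ID>_<keyword>_YYYYMMDD`), it helps to generate it rather than type it by hand. The following sketch shows one way to build and parse such IDs. The function names are hypothetical, and it assumes that the prefix and keyword contain only letters and digits so that the underscore stays an unambiguous separator.

```python
import re
from datetime import date

# Assumption: prefix and keyword are alphanumeric; the date is eight digits.
ASSAY_ID = re.compile(
    r"(?P<prefix>[A-Za-z0-9]+)_(?P<keyword>[A-Za-z0-9]+)_(?P<date>\d{8})"
)


def make_assay_id(prefix: str, keyword: str, created: date) -> str:
    """Build an ID of the form <Assay-ID>_<keyword>_YYYYMMDD."""
    assay_id = f"{prefix}_{keyword}_{created:%Y%m%d}"
    if not ASSAY_ID.fullmatch(assay_id):
        raise ValueError(f"prefix and keyword must be alphanumeric: {assay_id!r}")
    return assay_id


def split_assay_id(assay_id: str) -> dict:
    """Split an assay ID back into its prefix, keyword, and date parts."""
    match = ASSAY_ID.fullmatch(assay_id)
    if match is None:
        raise ValueError(f"not a valid assay ID: {assay_id!r}")
    return match.groupdict()


print(make_assay_id("CHIP", "Oct4", date(2020, 1, 1)))  # prints CHIP_Oct4_20200101
print(split_assay_id("SCR_humanSkin_20210302"))
```

Keeping the keyword in camelCase, as the `keyword` field suggests, is what makes the underscore-based parsing reliable.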
If you were to create a database from the assay metadata files, your table would look like this (each row corresponding to one assay):
assay_ID | assay_type | assay_subtype | owner | platform | extraction_method | library_method | external_accessions | keyword | date | nsamples | is_paired | pipeline | strandedness | processed_by | organism | origin | path | short_desc | ELN_ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RNA_oct4_20200101 | RNAseq | bulk RNAseq | Sarah Lundregan | NextSeq 2000 | NA | NA | NA | oct4 | 20200101 | 9 | paired | nf-core/rnaseq 3.12.0 | * | SL | Mus musculus | internal | NA | Bulk RNAseq of Oct4 knockout | 234 |
CHIP_oct4_20200101 | ChIPseq | bulk ChIPseq | Jose Romero | NextSeq 2000 | NA | NA | NA | oct4 | 20200101 | 9 | single | nf-core/chipseq 2.3.1 | * | JARH | Mus musculus | internal | NA | Bulk ChIPseq of Oct4 overexpression | 123 |
CHIP_med1_20190204 | ChIPseq | bulk ChIPseq | Martin Proks | NextSeq 2000 | NA | NA | NA | med1 | 20190204 | 12 | single | nf-core/chipseq 2.3.1 | * | MP | Mus musculus | internal | NA | Bulk ChIPseq of Med1 overexpression | 345 |
SCR_humanSkin_20210302 | RNAseq | single cell RNAseq | Jose Romero | NextSeq 2000 | NA | NA | NA | humanSkin | 20210302 | 23123 | paired | nf-core/scrnaseq 1.8.2 | * | JARH | Homo sapiens | external | NA | scRNAseq analysis of human skin development | NA |
SCR_humanBrain_20220610 | RNAseq | single cell RNAseq | Martin Proks | NextSeq 2000 | NA | NA | NA | humanBrain | 20220610 | 1234 | paired | custom | * | MP | Homo sapiens | external | NA | scRNAseq analysis of human brain development | NA |
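A table like the one above can be assembled by scanning the assay folders and loading each metadata file into a small database. The sketch below uses SQLite from the Python standard library. It assumes, purely for illustration, that every assay folder holds a file named `metadata.json` whose keys match the assay metadata fields. The function name is hypothetical, and only a subset of the columns is stored to keep the example short.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# A subset of the assay metadata fields, in table column order.
COLUMNS = ["assay_ID", "assay_type", "owner", "date", "nsamples", "organism"]


def build_assay_table(root: Path, conn: sqlite3.Connection) -> int:
    """Load every <root>/*/metadata.json into an `assays` table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assays ("
        "assay_ID TEXT PRIMARY KEY, assay_type TEXT, owner TEXT, "
        "date TEXT, nsamples INTEGER, organism TEXT)"
    )
    rows = []
    for meta_file in sorted(root.glob("*/metadata.json")):
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
        # Fields missing from a file are stored as NULL.
        rows.append([meta.get(col) for col in COLUMNS])
    conn.executemany("INSERT OR REPLACE INTO assays VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Demonstration with a temporary folder standing in for a real Assays directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "CHIP_oct4_20200101"
    folder.mkdir()
    record = {"assay_ID": "CHIP_oct4_20200101", "assay_type": "ChIPseq",
              "owner": "Jose Romero", "date": "20200101",
              "nsamples": 9, "organism": "Mus musculus"}
    (folder / "metadata.json").write_text(json.dumps(record), encoding="utf-8")
    conn = sqlite3.connect(":memory:")
    build_assay_table(Path(tmp), conn)
    print(conn.execute("SELECT assay_ID, owner FROM assays").fetchall())
    # prints [('CHIP_oct4_20200101', 'Jose Romero')]
```

Using `assay_ID` as the primary key together with `INSERT OR REPLACE` means the script can be rerun after metadata changes without creating duplicate rows.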
Sources
- Transcriptomics metadata standards and fields
- Biological ontologies for data scientists, Bionty