name | description | naming convention | file format | example
---|---|---|---|---
.fastq | raw sequencing reads | | fastq | sampleID_run_read1.fastq
.fastqc | quality control report from FastQC | | fastqc | sampleID_run_read1.fastqc
.bam | aligned reads | | bam | sampleID_run_read1.bam
GTF | sequence annotation | | gtf | e.g., annotation files from https://www.gencodegenes.org/
GFF | sequence annotation | | gff | e.g., annotation files from https://www.gencodegenes.org/
.bed | genome locations | | bed | |
.bigwig | genome coverage | | bigwig | |
.fasta | sequence data (nucleotide/amino acid) | | fasta | e.g., reference files from https://www.gencodegenes.org/
MultiQC report | aggregated QC report | <assayID\>_YYYYMMDD.multiqc | multiqc | RNA_20200101.multiqc
Count matrix | final count matrix | <assayID\>_cm_<aligner\>_YYYYMMDD.tsv | tsv | RNA_cm_salmon_20200101.tsv
DEA | differential expression analysis results | DEA_<condition1-condition2\>_LFC<absolute_threshold\>_p<pvalue decimals\>_YYYYMMDD.tsv | tsv | DEA_treat-untreat_LFC1_p01_20200101.tsv
DBA | differential binding analysis results | DBA_<condition1-condition2\>_LFC<absolute_threshold\>_p<pvalue decimals\>_YYYYMMDD.tsv | tsv | DBA_treat-untreat_LFC1_p01_20200101.tsv
MA plot | MA plot | MAplot_<condition1-condition2\>_YYYYMMDD.jpeg | jpeg | MAplot_treat-untreat_20200101.jpeg
Heatmap plot | heatmap of any data type | heatmap_<type\>_YYYYMMDD.jpeg | jpeg | heatmap_sampleCor_20200101.jpeg
Volcano plot | volcano plot | volcano_<condition1-condition2\>_YYYYMMDD.jpeg | jpeg | volcano_treat-untreat_20200101.jpeg
Venn diagram | Venn diagram | venn_<type\>_YYYYMMDD.jpeg | jpeg | venn_consensus_20200101.jpeg
Enrichment table | enrichment results | | tsv | |
⏰ Time Estimation: X minutes
💬 Learning Objectives:
- NGS data strategies
- Examples of file naming conventions
Effective RDM Practices in NGS Analysis
In the data life cycle of Next Generation Sequencing (NGS) data, processing and analysis are critical phases that transform raw sequencing data into meaningful biological insights. Researchers apply computational methods and bioinformatics tools to extract valuable information from the vast amounts of sequencing data generated in NGS experiments. We will first explore the primary data types generated pre- and post-processing and the importance of detailed documentation. We will then focus on good practices for data analysis and software development.
Next Generation Sequencing (NGS), or high-throughput sequencing, has revolutionized genomics research. It encompasses advanced techniques for rapid and cost-effective analysis of DNA or RNA molecules. Unlike traditional methods, NGS can analyze millions of DNA fragments simultaneously, enhancing the speed, efficiency, and scale of sequencing and becoming integral to modern genomics and biomedical studies. As NGS technologies continue to advance and become more accessible, they will remain at the forefront of genomics research, driving innovations that contribute to our understanding of complex genetic interactions and their implications for human health and biology.
Applications
NGS is widely used in a variety of applications, including genomic sequencing, transcriptome analysis (RNA-Seq), epigenetic profiling (ChIP-Seq), metagenomics, and targeted sequencing. It also plays a crucial role in fields such as oncology, infectious disease research, and personalized medicine.
Data production
NGS workflows involve key steps, from sample preparation to data analysis. Samples undergo extraction and fragmentation, followed by the addition of unique identifiers, a step known as library preparation, that enables multiplexed sequencing. The fragments are then amplified and sequenced in parallel on state-of-the-art NGS platforms. Subsequent data analysis reconstructs the original sequence and identifies genetic variations, structural changes, or functional elements. The unique identifiers are specific adapter sequences that allow individual samples to be identified within a multiplexed sequencing run.
- Do you ensure that all the data you collect or generate is accompanied by metadata? Have you ever encountered missing information when reading a provided file?
- Do you utilize specific databases or repositories for storing and accessing your research data?
- What data formats do you typically encounter during data processing? Which formats do your analyses produce for visualization or further analysis?
- Do you document and track the workflows you use for data processing and analysis, including the software employed? How do you ensure reproducibility?
Practical tips for computational research
1. Experiments / raw data
Thoroughly document your datasets and the experimental setup to ensure reproducibility. Adhering to standards will also ensure interoperability. Examples of data types:
- Electronic Laboratory Notebook (ELN): digital description of the experimental design, and measurement devices. ELNs offer features like data entry, text editing, file attachments, collaboration tools, and search capabilities.
- Laboratory protocols: methodologies to prepare and manage samples.
- Samples: refers to the biological material (extraction of DNA, RNA, or proteins). Specification of sample identifier, sample type, source organism, etc.
- Sequencing: details on the platform (e.g., Illumina, Oxford Nanopore), library preparation method, coverage, quality control metrics (e.g., Phred score)…
- Raw sequencing data: sequences and quality scores (e.g., FASTQ files)
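To make the FASTQ structure concrete, here is a minimal Python sketch that reads four-line FASTQ records (header, sequence, separator, quality string) and computes the mean Phred quality per read. The file name follows the convention from the table above but is hypothetical, and standard Phred+33 encoding is assumed:

```python
def read_fastq(path):
    """Yield (read_id, sequence, quality_string) tuples from an uncompressed FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip()
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip()
            yield header[1:], sequence, quality

def mean_phred(quality):
    """Mean Phred score, decoding the Phred+33 ASCII offset."""
    return sum(ord(char) - 33 for char in quality) / len(quality)

# Hypothetical file name following the naming convention above.
for read_id, seq, qual in read_fastq("sampleID_run_read1.fastq"):
    print(read_id, len(seq), f"mean Q = {mean_phred(qual):.1f}")
```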
A metadata file is crucial during data analysis as it contains information about the experimental conditions (such as sequencing details, treatment, sample type, time points, tissue…).
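As an illustration, such a metadata file (often called a sample sheet) can be written as a TSV alongside the raw data. A minimal sketch using only the Python standard library; all column names and values below are hypothetical:

```python
import csv

# Hypothetical sample sheet: one row per sequenced sample.
samples = [
    {"sample_id": "S1", "condition": "treat", "tissue": "liver",
     "timepoint": "24h", "fastq": "S1_run1_read1.fastq"},
    {"sample_id": "S2", "condition": "untreat", "tissue": "liver",
     "timepoint": "24h", "fastq": "S2_run1_read1.fastq"},
]

with open("samplesheet.tsv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=list(samples[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(samples)
```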
2. Input / Pre- and post-processing data
Examples of data types generated during processing:
- Quality control metrics: used to filter out potential artifacts and ensure the reliability of downstream analyses (e.g., with bioinformatics tools like FastQC, and MultiQC to aggregate results; see the sketch after this list)
- Data alignments: in genomics to determine the location of the read in the genome and in transcriptomics to identify gene expression levels.
- DNA analysis results: such as variant calling, genome annotation, functional genomics, phylogenetics, metagenomics, etc. Results are usually presented in tabular format.
- RNA Expression analysis results: from differential gene expression, gene ontology (GO) enrichment, alternative splicing, pathway analysis, etc. Results are usually presented in tabular format.
- Epigenetic profiling outputs: to assess gene regulation and chromatin structure (e.g., ChIP-Seq). Usually presented in BED format.
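As a sketch of how such QC steps are typically invoked, the snippet below calls FastQC on each raw read file and then aggregates the reports with MultiQC via Python's subprocess module. It assumes both tools are installed and on your PATH; the file names are hypothetical:

```python
import subprocess
from pathlib import Path

qc_dir = Path("qc")
qc_dir.mkdir(exist_ok=True)

# Run FastQC on each raw read file (hypothetical file names).
fastqs = ["sampleID_run_read1.fastq", "sampleID_run_read2.fastq"]
for fastq in fastqs:
    subprocess.run(["fastqc", fastq, "--outdir", str(qc_dir)], check=True)

# Aggregate all per-sample reports into a single MultiQC report.
subprocess.run(["multiqc", str(qc_dir), "--outdir", str(qc_dir)], check=True)
```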
The interpretation of NGS data relies heavily on the results of data analysis, which are pivotal for understanding the biological significance of the findings and formulating hypotheses for further exploration. Clear and effective visualization methods are crucial for communicating and interpreting the vast amount of information generated by NGS experiments.
Knowledge databases
A knowledge database is a structured repository of biological information that categorizes and annotates genes, proteins, and their functions, facilitating comprehensive understanding and analysis of biological systems. Here are five examples of knowledge databases:
- Gene Ontology (GO): A comprehensive resource that classifies gene functions into defined terms, allowing for standardized annotation and comparison of genes across different organisms.
- Disease Ontology: A database that provides structured, standardized terminology for various diseases and their relationships, aiding in the systematic analysis of disease-related data.
- KEGG Pathways: A collection of manually curated pathway maps representing molecular interactions and reaction networks within cells, enabling the interpretation of high-throughput data in the context of biological systems.
- Reactome: An open-access database that offers curated descriptions of biological processes, including pathways, reactions, and molecular events, facilitating the interpretation of large-scale biological data.
- UniProt: An extensive protein knowledgebase that provides detailed information about proteins, including their sequences, functions, and related annotations, supporting a wide range of biological research endeavors.
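Many of these databases expose programmatic interfaces. For instance, UniProt provides a REST API; a minimal sketch, assuming the requests package is installed (the accession P04637, human p53, is used purely as an example):

```python
import requests

# Fetch a single UniProtKB entry as JSON via the REST API.
accession = "P04637"  # human p53, used here only as an example
url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
response = requests.get(url, timeout=30)
response.raise_for_status()
entry = response.json()

# Field names follow the UniProt JSON layout; .get() guards against
# fields that may be absent for some entries.
name = (entry.get("proteinDescription", {})
             .get("recommendedName", {})
             .get("fullName", {})
             .get("value", "unknown"))
print(entry.get("primaryAccession"), name)
```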
Visualizations
- Heatmaps: frequently used to visualize gene expression patterns, epigenetic modifications, or microbial abundances across samples/conditions
- Volcano plots: commonly used in differential gene expression analysis (see the sketch after this list)
- Genome browser snapshots: display alignments and genomic features in genomic regions (e.g., gene annotations, ChIP-Seq peaks)
- Network visualizations: used to explore gene regulatory networks or protein-protein interactions
- Genomic annotations: to annotate genetic variations (functional impact on genes, genomic regions, or regulatory elements)
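As a sketch, the snippet below draws a basic volcano plot with matplotlib from randomly generated, hypothetical differential expression results; the thresholds are illustrative only:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical DEA results: log2 fold changes and p-values.
rng = np.random.default_rng(0)
log2_fc = rng.normal(0, 2, 500)
pvalues = rng.uniform(1e-6, 1, 500)
neg_log10_p = -np.log10(pvalues)

# Flag genes passing illustrative significance thresholds.
significant = (np.abs(log2_fc) > 1) & (pvalues < 0.01)

plt.scatter(log2_fc[~significant], neg_log10_p[~significant],
            s=8, color="grey", label="not significant")
plt.scatter(log2_fc[significant], neg_log10_p[significant],
            s=8, color="red", label="|LFC| > 1, p < 0.01")
plt.axhline(-np.log10(0.01), linestyle="--", color="black", linewidth=0.5)
plt.xlabel("log2 fold change")
plt.ylabel("-log10(p-value)")
plt.legend()
# PNG here; saving as JPEG (per the naming convention) requires Pillow.
plt.savefig("volcano_treat-untreat_20200101.png", dpi=150)
```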
3. Software and code
Best practices for software and code management (don’t forget to read about FAIR software):
- Commenting your code: to enhance readability and comprehension
- Make your source code accessible using a repository (GitHub, GitLab, Bitbucket, SourceForge, etc.) that provides version control (VC) solutions. This step is one of the most important ones as version control systems (Git or SVN) track changes in your code over time and enable collaboration and easy version management. Most Danish institutions provide courses on Git/GitHub, check yours! We also highly recommend reading this paper (Perez-Riverol et al. 2016).
- README file: with comprehensive information about the project including installation instructions, usage examples or tutorials, licensing details, citation information, etc.
- Register your code in a research software registry and include a clear and accessible software usage license: enabling other researchers to discover and reuse software packages (alongside metadata). More recommendations here.
- Use domain-relevant community standards to ensure consistency and interoperability (e.g., CodeMeta).
Git/GitHub training resources at Danish institutions include:
- University of Copenhagen
- Aarhus University
- Aalborg University
- DTU Git guidelines

Find more resources on the Berkeley Library website.
4. Pipelines and workflows
You might use standard workflows or generate new ones during data processing and data analysis steps.
- Code notebooks: tools for data documentation (e.g. Jupyter Notebook, Rmarkdown) enabling the combination of code with descriptive text and visualizations.
- Literate programming and tracking tools (e.g., knitr for dynamic reports in R, or MLflow for tracking machine learning experiments).
- Pipeline frameworks or workflow management systems: designed to streamline and automate the various steps involved in data analysis (data extraction, transformation, validation, visualization, and more). They also contribute to interoperability by facilitating seamless integration and interaction between different components or stages. Two very popular systems are Nextflow and Snakemake (a minimal Snakemake example is sketched below).
A great example of community-curated workflows is the nf-core community. Nf-core is a collaborative and open-source initiative comprising bioinformaticians and researchers dedicated to developing and maintaining a collection of curated and reproducible Nextflow-based pipelines for NGS data analysis, ensuring standardized and efficient data processing workflows.
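To give a flavour of such systems, here is a minimal, hypothetical Snakemake rule. Snakefiles use a Python-based syntax; the paths are illustrative, and FastQC must be installed for the rule to run:

```python
# Snakefile: one rule that runs FastQC on any sample matching the pattern.
rule fastqc:
    input:
        "data/{sample}.fastq"
    output:
        "qc/{sample}_fastqc.html",
        "qc/{sample}_fastqc.zip"
    shell:
        "fastqc {input} --outdir qc"
```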
Take our course on Reproducible Research Practices LINK
File naming convention examples
The table at the top of this lesson lists the most common file formats and naming conventions used when working with NGS data.
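To show how such a convention can be enforced programmatically, here is a minimal Python sketch that assembles a DEA result file name from its components. The helper function and its arguments are hypothetical, not part of any established tool:

```python
from datetime import date

def dea_filename(cond1, cond2, lfc_threshold, p_decimals, run_date=None):
    """Build a DEA result file name following the convention above,
    e.g. DEA_treat-untreat_LFC1_p01_20200101.tsv."""
    run_date = run_date or date.today()
    return (f"DEA_{cond1}-{cond2}_LFC{lfc_threshold}"
            f"_p{p_decimals}_{run_date:%Y%m%d}.tsv")

print(dea_filename("treat", "untreat", 1, "01", date(2020, 1, 1)))
# -> DEA_treat-untreat_LFC1_p01_20200101.tsv
```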
Data types summary
Select appropriate file formats that balance data accessibility, storage efficiency, and compatibility with downstream analysis tools. Standardized file formats facilitate data sharing and collaboration among researchers in the scientific community.
- BAM/SAM: stores the alignment information (binary and text-based respectively)
- FASTA: stores nucleotide or amino acid sequences, commonly used for reference sequences or assembled contigs.
- Gene Transfer Format (GTF) and General Feature Format (GFF): annotates genomic features such as genes, exons, and transcripts.
- Alignment indexes: data structures for efficient and rapid mapping of sequencing reads to a reference.
- Variant Call Format (VCF): stores genetic variation such as single nucleotide variants (SNVs), insertions, deletions, and structural variants (and their position, quality score, etc.)
- Count matrix: quantifies the abundance of RNA transcripts or genomic features across samples
- BED/BEDGraph: represent genomic intervals or coverage information (e.g., peak calling identifies regions of enriched signal intensity)
- WIG/BigWig: store genome-wide data
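As an illustration of how simple some of these text-based formats are, here is a minimal Python sketch that parses a FASTA file into a dictionary; the file name is hypothetical:

```python
def read_fasta(path):
    """Parse a FASTA file into a {sequence_id: sequence} dictionary."""
    sequences = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                header = line[1:].split()[0]  # keep the ID, drop the description
                sequences[header] = []
            elif header is not None:
                sequences[header].append(line)
    return {name: "".join(parts) for name, parts in sequences.items()}

# Hypothetical reference file, e.g. downloaded from GENCODE.
genome = read_fasta("reference.fasta")
print(f"{len(genome)} sequences loaded")
```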
General formats
- Tabular formats: File formats like CSV, TSV, and XLSX are used to store data in rows and columns for easy data analysis and sharing
- Image formats: File formats such as PNG and SVG are used to store graphical visualizations, making them easily viewable and shareable
- Binary formats: File formats like NPZ and H5 are used to store large datasets, ensuring efficient data access and storage
- JSON: A lightweight data-interchange format for storing hierarchical data structures, commonly used in bioinformatics tools
- HTML: A format used to create interactive reports that include both visualizations and textual descriptions of analysis results
- Code notebooks: Interactive documents combining code, visualizations, and explanatory text, aiding in data analysis reproducibility and documentation
- Scripts: Text files containing sets of commands or code instructions for automating data processing and analysis tasks
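For instance, a small analysis summary can be serialized to JSON with Python's standard library; all keys and values below are hypothetical:

```python
import json

# Hypothetical record of analysis parameters and key results.
summary = {
    "assay": "RNA",
    "aligner": "salmon",
    "date": "2020-01-01",
    "contrast": "treat-untreat",
    "thresholds": {"abs_log2_fc": 1, "p_value": 0.01},
    "n_significant_genes": 42,
}

with open("RNA_analysis_summary_20200101.json", "w") as out:
    json.dump(summary, out, indent=2)
```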
Explore more data types at the UCSC webpage. Check out this tutorial for more detailed explanations.
Wrap up
In this lesson, we have taken a look at the vast and diverse landscape of bioinformatics data.