6. Processing and analyzing biodata

Modified: September 13, 2024

Code and Pipelines for Data Analysis

In this section, we explore essential elements of reproducibility and efficiency in computational research, highlighting techniques and tools for creating robust and transparent code and workflows. By prioritizing reproducibility and replicability, researchers can enhance the credibility and impact of their findings while fostering collaboration and knowledge dissemination within the scientific community.

Before you start…
  1. Choose a folder structure (e.g., using cookiecutter; see the sketch after this list)
  2. Choose a file naming system
  3. Add a README describing the project (and the naming conventions)
  4. Install and set up version control (e.g., Git and GitHub)
  5. Choose a coding style!
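
As an illustration of step 1, the snippet below creates one possible project layout. It is a minimal sketch with a hypothetical folder structure, not a prescribed standard; tools like cookiecutter can generate richer, community-maintained templates.

```python
# Minimal sketch: create a simple, hypothetical project layout.
# Adjust the folder names to your own conventions (or use a cookiecutter template).
from pathlib import Path

folders = ["data/raw", "data/processed", "scripts", "results", "docs"]

for folder in folders:
    Path(folder).mkdir(parents=True, exist_ok=True)  # create nested folders if missing

# Placeholder README describing the project and naming conventions (step 3).
Path("README.md").touch()
```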

Reproducibility and Replicability

Through techniques such as scripting, containerization (e.g., Docker), and virtual environments, researchers can create reproducible analyses that enable others to validate and build upon their work. Documenting data processing steps, parameters, and results ensures transparency and accountability in research outputs. To write clear and reproducible code: write functions, code defensively (input validation, error handling, and so on), add comments, test your code, and maintain proper documentation.
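
As a small illustration of these practices, here is a minimal sketch of a documented function with input validation, explicit error handling, and an accompanying test. The function, file format, and names are hypothetical and not tied to any particular pipeline.

```python
# Minimal sketch of defensive, documented, testable code (hypothetical example).
from pathlib import Path


def read_sample_ids(path: str) -> list[str]:
    """Read one sample ID per line from a text file.

    Raises FileNotFoundError if the file is missing and ValueError if it is empty.
    """
    file = Path(path)
    if not file.is_file():  # validate the input before using it
        raise FileNotFoundError(f"Sample sheet not found: {path}")

    ids = [line.strip() for line in file.read_text().splitlines() if line.strip()]
    if not ids:
        raise ValueError(f"No sample IDs found in {path}")
    return ids


# A small test (e.g., run with pytest) that documents the expected behavior.
def test_read_sample_ids(tmp_path):
    sheet = tmp_path / "samples.txt"
    sheet.write_text("sample_A\nsample_B\n")
    assert read_sample_ids(str(sheet)) == ["sample_A", "sample_B"]
```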

Tools for reproducibility:

  • Code notebooks: Utilize tools like Jupyter Notebook and R Markdown to combine code with descriptive text and visualizations, enhancing data documentation.
  • Literate programming and experiment-tracking tools: Consider tools such as knitr (dynamic reports in R) or MLflow (tracking of experiments and models) to streamline the documentation of code, parameters, and results.
  • Pipeline frameworks or workflow management systems: Implement systems like Nextflow and Snakemake to automate data analysis steps (including data extraction, transformation, validation, and visualization). They also support interoperability by providing well-defined interfaces between the different components or stages of an analysis.

Computational notebooks for interactive analysis

Computational notebooks (e.g., Jupyter, R Markdown) provide researchers with a versatile platform for exploratory and interactive data analysis. These notebooks facilitate sharing insights with collaborators and documentation of analysis procedures.

Pipeline Frameworks and Workflow Management Systems

Tools such as Nextflow and Snakemake streamline and automate various data analysis steps, enabling parallel processing and seamless integration with existing tools. Write portable code and use relative paths so that pipelines can be transferred between users and systems.

  • Nextflow: Offers scalable and portable NGS data analysis pipelines, facilitating data processing across diverse computing environments.
  • Snakemake: Utilizing Python-based scripting, Snakemake allows for flexible and automated NGS data analysis pipelines, supporting parallel processing and integration with other tools (a minimal example follows this list).
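
To give a flavour of how such frameworks describe an analysis, below is a minimal, hypothetical Snakefile; the sample names, file paths, and line-counting step are placeholders rather than a real NGS workflow. Note the relative paths, which keep the pipeline portable.

```python
# Snakefile — minimal, hypothetical Snakemake workflow (placeholder steps).
SAMPLES = ["sample_A", "sample_B"]

rule all:
    # The final targets that Snakemake should produce.
    input:
        expand("results/{sample}.counts.txt", sample=SAMPLES)

rule count_reads:
    # Placeholder step: count lines in each input file (relative paths only).
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.counts.txt"
    shell:
        "wc -l {input} > {output}"
```

Running snakemake --cores 4 in the project folder would then build all targets, executing independent steps in parallel.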

Once your scientific computational workflow is ready to be shared, publish it on WorkflowHub.

Computational environment

Each computer or HPC (High-Performance Computing) platform has a unique computational environment that includes its operating system, installed software, versions of software packages, and other features. If a research project is moved to a different computer or platform, the analysis might not run or produce consistent results if it depends on any of these factors.

For research to be reproducible, the original computational environment must be recorded so others can replicate it. There are several methods to achieve this:

  • Containerization platforms (e.g., Docker, Singularity): allow the researcher to package their software and dependencies into a standardized container image.
  • Virtual machines (e.g., VirtualBox): can share an entire virtualized computing environment (operating system, software, and dependencies).
  • Environment managers: provide an isolated environment with specific packages and dependencies that can be installed without affecting the system-wide configuration. These environments are particularly useful for managing conflicting dependencies and ensuring reproducibility. Configuration files can automate the setup of the computational environment:
    • conda: allows users to export environment specifications (software and dependencies) to YAML files, enabling easy recreation of the environment on another system.
    • Python virtualenv: a tool for creating isolated environments to manage dependencies specific to a project.
    • requirements.txt: a plain-text file listing the Python packages (ideally with pinned versions) that a project depends on; pip can install them all with pip install -r requirements.txt. Package managers can likewise be used to install, upgrade, and manage packages.
    • R’s renv: The ‘renv’ package creates isolated environments in R.
  • Environment descriptors
    • sessionInfo() or devtools::session_info(): In R, these functions provide detailed information about the current session.
    • In Python, interpreter and platform details can be reported with the standard library, and some libraries, such as pandas, provide a show_versions() function to display package and dependency versions (see the sketch after this list).
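
A minimal Python sketch of such an environment record, roughly analogous to R's sessionInfo(), could look like this (assuming pandas is installed):

```python
# Minimal sketch: record the computational environment of a Python analysis.
import platform
import sys

print("Python:", sys.version)
print("Platform:", platform.platform())

# Some libraries provide their own reporting helpers, e.g. pandas:
import pandas as pd
pd.show_versions()  # prints pandas and its dependency versions
```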

Environment managers are easy to use and share across different systems, and they are lightweight and efficient, offering fast start-up times. Docker containers, on the other hand, provide full environment isolation (including the operating system), which ensures consistent behavior across different systems.

Connecting data organization and documentation

To maintain clarity and organization in the data analysis process, adopt best practices such as:

  • Data documentation: create a README.md file to provide an overview of the project and its structure, and metadata for understanding the context of your analysis.
  • Annotate your pipelines and comment your code (look for tutorials and templates such as this one from freeCodeCamp).
  • Use coding style guides (code lay-out, whitespace in expressions, comments, naming conventions, annotations…) to maintain consistency.
  • Label files numerically to organize the entire data analysis process (scripts, notebooks, pipelines, etc.).
    • 00.preprocessing., 01.data_analysis_step1., etc.
  • Provide environment files for reproducing the computational environment (such as ‘requirements.txt’ for Python or ‘environment.yml’ for Conda). The simplest approach is to document the dependencies by listing the packages and versions used to run your analysis (see the sketch after this list).
  • Data and code versioning: use version control systems (e.g., Git) and upload your code to a code repository (see Lesson 5).
  • Integrated development environments (e.g., RStudio, PyCharm) offer tools and features for writing, testing, and debugging code.
  • Use git submodules for code and software that are reused across several projects.
  • Leverage curated pipelines such as the ones developed by the nf-core community, further ensuring adherence to community standards and guidelines.
  • Use Software Heritage, an archive for software source code, to ensure long-term accessibility and reproducibility.
  • Add a LICENSE file and keep it up to date, clarifying usage permissions and facilitating collaboration.
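
As one way to document Python dependencies (see the bullet on environment files above), the sketch below writes the packages installed in the current environment, with pinned versions, to a requirements-style file; conda users would instead export an environment.yml (e.g., with conda env export).

```python
# Minimal sketch: export installed Python packages with pinned versions
# to a requirements-style file that collaborators can reinstall from.
from importlib.metadata import distributions

pins = sorted(
    {
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in distributions()
        if dist.metadata["Name"]  # skip distributions without a recorded name
    }
)

with open("requirements.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(pins) + "\n")
```
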
Practical HPC pipes

We provide a hands-on workshop on computational environments and pipelines. Keep an eye on the upcoming events on the Sandbox website. If you’re interested in delving deeper, check out the HPC best practices module we’ve developed here.

Wrap up

This lesson emphasized the importance of reproducibility in computational research and provided practical techniques for achieving it. Using annotated notebooks, pipeline frameworks, and community-curated pipelines, such as those developed by the nf-core community, enhances reproducibility and readability.

Sources