6. Processing and analyzing biodata

Modified: September 13, 2024

Code and Pipelines for Data Analysis

In this section, we explore essential elements of reproducibility and efficiency in computational research, highlighting techniques and tools for creating robust and transparent code and workflows. By prioritizing reproducibility and replicability, researchers can enhance the credibility and impact of their findings while fostering collaboration and knowledge dissemination within the scientific community.

Before you start…
  1. Choose a folder structure (e.g., using cookiecutter; see the sketch after this list)
  2. Choose a file naming system
  3. Add a README describing the project (and the naming conventions)
  4. Install and set up version control (e.g., Git and GitHub)
  5. Choose a coding style!
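
As an illustration of step 1, the snippet below creates one possible project layout. It is a minimal sketch with a hypothetical folder structure, not a prescribed standard; tools like cookiecutter can generate richer, community-maintained templates.

```python
# Minimal sketch: create a simple, hypothetical project layout.
# Adjust the folder names to your own conventions (or use a cookiecutter template).
from pathlib import Path

folders = ["data/raw", "data/processed", "scripts", "results", "docs"]

for folder in folders:
    Path(folder).mkdir(parents=True, exist_ok=True)  # create nested folders if missing

# Placeholder README describing the project and naming conventions (step 3).
Path("README.md").touch()
```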

Reproducibility and Replicability

Through techniques such as scripting, containerization (e.g., Docker), and virtual environments, researchers can create reproducible analyses that enable others to validate and build upon their work. Documenting data processing steps, parameters, and results ensures transparency and accountability in research outputs. To write clear and reproducible code: write functions, code defensively (input validation, error handling, and so on), add comments, test your code, and maintain proper documentation.
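
As a small illustration of these practices, here is a minimal sketch of a documented function with input validation, explicit error handling, and an accompanying test. The function, file format, and names are hypothetical and not tied to any particular pipeline.

```python
# Minimal sketch of defensive, documented, testable code (hypothetical example).
from pathlib import Path


def read_sample_ids(path: str) -> list[str]:
    """Read one sample ID per line from a text file.

    Raises FileNotFoundError if the file is missing and ValueError if it is empty.
    """
    file = Path(path)
    if not file.is_file():  # validate the input before using it
        raise FileNotFoundError(f"Sample sheet not found: {path}")

    ids = [line.strip() for line in file.read_text().splitlines() if line.strip()]
    if not ids:
        raise ValueError(f"No sample IDs found in {path}")
    return ids


# A small test (e.g., run with pytest) that documents the expected behavior.
def test_read_sample_ids(tmp_path):
    sheet = tmp_path / "samples.txt"
    sheet.write_text("sample_A\nsample_B\n")
    assert read_sample_ids(str(sheet)) == ["sample_A", "sample_B"]
```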

Tools for reproducibility:

  • Code notebooks: Utilize tools like Jupyter Notebook and R Markdown to combine code with descriptive text and visualizations, enhancing data documentation.
  • Literate programming and experiment-tracking tools: Consider tools such as knitr (dynamic reports in R) or MLflow (tracking of experiments and models) to streamline the documentation of code, parameters, and results.
  • Pipeline frameworks or workflow management systems: Implement systems like Nextflow and Snakemake to automate data analysis steps (including data extraction, transformation, validation, and visualization). They also support interoperability by providing well-defined interfaces between the different components or stages of an analysis.

Computational notebooks for interactive analysis

Computational notebooks (e.g., Jupyter, R Markdown) provide researchers with a versatile platform for exploratory and interactive data analysis. These notebooks facilitate sharing insights with collaborators and documentation of analysis procedures.

Pipeline Frameworks and Workflow Management Systems

Tools such as Nextflow and Snakemake streamline and automate various data analysis steps, enabling parallel processing and seamless integration with existing tools. Write portable code and use relative paths so that pipelines can be transferred between users and systems.

  • Nextflow: Offers scalable and portable NGS data analysis pipelines, facilitating data processing across diverse computing environments.
  • Snakemake: Utilizing Python-based scripting, Snakemake allows for flexible and automated NGS data analysis pipelines, supporting parallel processing and integration with other tools (a minimal example follows this list).
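
To give a flavour of how such frameworks describe an analysis, below is a minimal, hypothetical Snakefile; the sample names, file paths, and line-counting step are placeholders rather than a real NGS workflow. Note the relative paths, which keep the pipeline portable.

```python
# Snakefile — minimal, hypothetical Snakemake workflow (placeholder steps).
SAMPLES = ["sample_A", "sample_B"]

rule all:
    # The final targets that Snakemake should produce.
    input:
        expand("results/{sample}.counts.txt", sample=SAMPLES)

rule count_reads:
    # Placeholder step: count lines in each input file (relative paths only).
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.counts.txt"
    shell:
        "wc -l {input} > {output}"
```

Running snakemake --cores 4 in the project folder would then build all targets, executing independent steps in parallel.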

Once your scientific computational workflow is ready to be shared, publish it on WorkflowHub.

Computational environment

Each computer or HPC (High-Performance Computing) platform has a unique computational environment that includes its operating system, installed software, versions of software packages, and other features. If a research project is moved to a different computer or platform, the analysis might not run or produce consistent results if it depends on any of these factors.

For research to be reproducible, the original computational environment must be recorded so others can replicate it. There are several methods to achieve this:

  • Containerization platforms (e.g., Docker, Singularity): allow the researcher to package their software and dependencies into a standardized container image.
  • Virtual machines (e.g., VirtualBox): can share an entire virtualized computing environment (operating system, software, and dependencies).
  • Environment managers: provide an isolated environment with specific packages and dependencies that can be installed without affecting the system-wide configuration. These environments are particularly useful for managing conflicting dependencies and ensuring reproducibility. Configuration files can automate the setup of the computational environment:
    • conda: allows users to export environment specifications (software and dependencies) to YAML files, enabling easy recreation of the environment on another system.
    • Python virtualenv: a tool for creating isolated environments to manage dependencies specific to a project.
    • requirements.txt: a plain-text file listing the Python packages (ideally with pinned versions) that a project depends on; pip can install them all with pip install -r requirements.txt. Package managers can likewise be used to install, upgrade, and manage packages.
    • R’s renv: The ‘renv’ package creates isolated environments in R.
  • Environment descriptors
    • sessionInfo() or devtools::session_info(): In R, these functions provide detailed information about the current session.
    • In Python, interpreter and platform details can be reported with the standard library, and some libraries, such as pandas, provide a show_versions() function to display package and dependency versions (see the sketch after this list).
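
A minimal Python sketch of such an environment record, roughly analogous to R's sessionInfo(), could look like this (assuming pandas is installed):

```python
# Minimal sketch: record the computational environment of a Python analysis.
import platform
import sys

print("Python:", sys.version)
print("Platform:", platform.platform())

# Some libraries provide their own reporting helpers, e.g. pandas:
import pandas as pd
pd.show_versions()  # prints pandas and its dependency versions
```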

Environment managers are easy to use and share across different systems, and they are lightweight and efficient, offering fast start-up times. Docker containers, on the other hand, provide full environment isolation (including the operating system), which ensures consistent behavior across different systems.

Connecting data organization and documentation

To maintain clarity and organization in the data analysis process, adopt best practices such as:

  • Data documentation: create a README.md file to provide an overview of the project and its structure, and metadata for understanding the context of your analysis.
  • Annotate your pipelines and comment your code (look for tutorials and templates such as this one from freeCodeCamp).
  • Use coding style guides (code lay-out, whitespace in expressions, comments, naming conventions, annotations…) to maintain consistency.
  • Label files numerically to organize the entire data analysis process (scripts, notebooks, pipelines, etc.).
    • 00.preprocessing., 01.data_analysis_step1., etc.
  • Provide environment files for reproducing the computational environment (such as ‘requirements.txt’ for Python or ‘environment.yml’ for Conda). The simplest approach is to document the dependencies by listing the packages and versions used to run your analysis (see the sketch after this list).
  • Data and code versioning: use version control systems (e.g., Git) and upload your code to a code repository (see Lesson 5).
  • Integrated development environments (e.g., RStudio, PyCharm) offer tools and features for writing, testing, and debugging code.
  • Use git submodules for code and software that are reused across several projects.
  • Leverage curated pipelines such as the ones developed by the nf-core community, further ensuring adherence to community standards and guidelines.
  • Use Software Heritage, an archive for software source code, to ensure long-term accessibility and reproducibility.
  • Add a LICENSE file and keep it up to date, clarifying usage permissions and facilitating collaboration.
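
As one way to document Python dependencies (see the bullet on environment files above), the sketch below writes the packages installed in the current environment, with pinned versions, to a requirements-style file; conda users would instead export an environment.yml (e.g., with conda env export).

```python
# Minimal sketch: export installed Python packages with pinned versions
# to a requirements-style file that collaborators can reinstall from.
from importlib.metadata import distributions

pins = sorted(
    {
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in distributions()
        if dist.metadata["Name"]  # skip distributions without a recorded name
    }
)

with open("requirements.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(pins) + "\n")
```
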
Practical HPC pipes

We provide a hands-on workshop on computational environments and pipelines. Keep an eye on the upcoming events on the Sandbox website. If you’re interested in delving deeper, check out the HPC best practices module we’ve developed here.

Wrap up

This lesson emphasized the importance of reproducibility in computational research and provided practical techniques for achieving it. Using annotated notebooks, pipeline frameworks, and community-curated pipelines, such as those developed by the nf-core community, enhances reproducibility and readability.

Sources