6. Processing and analyzing biodata
Code and Pipelines for Data Analysis
In this section, we explore essential elements of reproducibility and efficiency in computational research, highlighting techniques and tools for creating robust and transparent code and workflows. By prioritizing reproducibility and replicability, researchers can enhance the credibility and impact of their findings while fostering collaboration and knowledge dissemination within the scientific community.
- Choose a folder structure (e.g., using cookiecutter)
- Choose a file naming system
- Add a README describing the project (and the naming conventions)
- Install and set up version control (e.g., Git and GitHub)
- Choose a coding style!
- Python: Python’s PEP 8 or Google’s style guide
- R: Google’s style guide or Tidyverse’s style guide
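As a minimal sketch of the setup steps above (the folder names and README wording are illustrative assumptions, not a prescribed layout), a project skeleton can be scaffolded in a few lines of Python:

```python
from pathlib import Path

# Hypothetical minimal project skeleton; adapt folder names to your own conventions
project = Path("my_project")
for sub in ["data/raw", "data/processed", "scripts", "results", "docs"]:
    (project / sub).mkdir(parents=True, exist_ok=True)

# A README documenting the project and its naming conventions
(project / "README.md").write_text(
    "# My Project\n\n"
    "Folder layout:\n"
    "- data/raw: original, read-only input files\n"
    "- data/processed: derived data\n"
    "- scripts: numbered analysis scripts (00.preprocessing, 01.data_analysis_step1, ...)\n"
    "- results: figures and tables\n"
)
```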
Code formatting
Numerous tools can automatically format code to maintain style consistency, which is crucial for code quality and collaboration. Most modern integrated development environments (IDEs) and text editors, such as the JetBrains IDE suite and VS Code, support automatic code formatting, either natively or through plugins. Additionally, many languages have specific formatting tools that integrate with these editors.
Language | Formatting tools |
---|---|
Python | Black, yapf |
R | formatR |
Snakemake | Snakefmt |
Bash/Shell | ShellIndent |
C/C++ | GNUIndent, GreatCode |
Perl | PerlTidy |
Javascript | beautifier |
MATLAB/Octave | MISS_HIT |
Java | Google Java format, JIndent |
CSS | CSSTidy |
HTML | Tidy |
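As an illustrative example of what such a formatter does, running Black on a hastily written Python one-liner rewrites it into the standard layout (a hedged sketch; the function itself is a made-up example):

```python
# Before formatting (written in a hurry):
#   def mean(xs):return sum(xs)/len(xs)

# After running `black script.py`, Black produces:
def mean(xs):
    return sum(xs) / len(xs)
```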
Quick Tip: If you use VS Code as your main text editor, you can enable automatic code formatting on save. Open your settings in JSON mode (settings.json) and add:
```json
"editor.formatOnSave": true
```
Reproducibility and Replicability
Through techniques such as scripting, containerization (e.g., Docker), and virtual environments, researchers can create reproducible analyses that enable others to validate and build upon their work. Emphasizing the documentation of data processing steps, parameters, and results ensures transparency and accountability in research outputs. To write clear and reproducible code, take the following approach: write functions, code defensively (input validation, error handling, and so on), add comments, test your code, and maintain proper documentation.
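As a minimal, hedged sketch of this approach (the function and its test are hypothetical examples, not code from any particular project):

```python
def normalize_counts(counts):
    """Scale a list of read counts to proportions that sum to 1.

    Defensive checks raise ValueError on empty, negative, or all-zero input.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not all be zero")
    return [c / total for c in counts]


# A minimal test documenting the expected behavior
assert normalize_counts([2, 2, 4]) == [0.25, 0.25, 0.5]
```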
Tools for reproducibility:
- Code notebooks: Utilize tools like Jupyter Notebook and R Markdown to combine code with descriptive text and visualizations, enhancing data documentation.
- Integrated development environments and supporting tools: consider using tools such as knitr or MLflow to streamline code development and documentation processes (see the sketch after this list).
- Pipeline frameworks or workflow management systems: Implement systems like Nextflow and Snakemake to automate data analysis steps (including data extraction, transformation, validation, visualization, and more). Additionally, they contribute to ensuring interoperability by facilitating seamless integration and interaction between different components or stages.
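For instance, a minimal hedged sketch of documenting parameters and results with MLflow (assuming the mlflow package is installed; the parameter names and values are illustrative):

```python
import mlflow

# Log parameters and results so the analysis is documented and reproducible
with mlflow.start_run():
    mlflow.log_param("normalization", "tpm")
    mlflow.log_param("min_count", 10)
    mlflow.log_metric("n_genes_retained", 14230)
```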
Computational notebooks for interactive analysis
Computational notebooks, such as Jupyter and R Markdown, offer researchers a flexible platform for interactive and exploratory data analysis, making it easier to document procedures and share insights with collaborators. These notebooks enable users to combine text, images, equations, and executable code within a single document. Typically, text is written using the simple and intuitive Markdown language. In R, R Markdown files can be created within the RStudio interface, while Python users often work with Jupyter notebooks for a similar experience.
Documents with live code
Link | Description |
---|---|
Introduction to Markdown | Markdown for R in RStudio |
Jupyter notebooks | Create interactive documents with Python code. You can also write R code in a Jupyter notebook using the Python package rpy2 |
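As a brief, hedged illustration of the rpy2 route (assuming the rpy2 package and R are both installed), Jupyter can run R cells via the extension's %%R cell magic:

```python
# In a Jupyter notebook cell, load the rpy2 extension:
%load_ext rpy2.ipython

# In a subsequent cell, the %%R magic (on the first line of the cell) runs R code:
# %%R
# summary(c(1, 2, 3))
```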
Pipeline Frameworks and Workflow Management Systems
Tools such as Nextflow and Snakemake streamline and automate various data analysis steps, enabling parallel processing and seamless integration with existing tools. Remember to create portable code and use relative paths to ensure transferability between users.
- Nextflow: offers scalable and portable NGS data analysis pipelines, facilitating data processing across diverse computing environments.
- Snakemake: utilizing Python-based scripting, Snakemake allows for flexible and automated NGS data analysis pipelines, supporting parallel processing and integration with other tools (see the minimal example below).
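As a minimal, hedged Snakemake sketch (the file names and the line-counting step are hypothetical placeholders), each rule declares its inputs and outputs and Snakemake infers the execution order from them:

```python
# Snakefile (Snakemake's Python-based syntax); file names are illustrative
rule all:
    input:
        "results/summary.txt"

# Snakemake runs this rule because "rule all" requests its output
rule summarize:
    input:
        "data/raw.csv"
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```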
Once your scientific computational workflow is ready to be shared, publish it on WorkflowHub.
Computational environment
Each computer or HPC (High-Performance Computing) platform has a unique computational environment that includes its operating system, installed software, versions of software packages, and other features. If a research project is moved to a different computer or platform, the analysis might not run or produce consistent results if it depends on any of these factors.
For research to be reproducible, the original computational environment must be recorded so others can replicate it. There are several methods to achieve this:
- Containerization platforms (e.g., Docker, Singularity): allow the researcher to package their software and dependencies into a standardized container image.
- Virtual machines (e.g., VirtualBox): can share an entire virtualized computing environment (operating system, software, and dependencies).
- Environment managers: provide an isolated environment with specific packages and dependencies that can be installed without affecting the system-wide configuration. These environments are particularly useful for managing conflicting dependencies and ensuring reproducibility. Configuration files can automate the setup of the computational environment:
- conda: allows users to export environment specifications (software and dependencies) to YAML files, enabling easy recreation of the environment on another system.
- Python virtualenv: a tool for creating isolated environments to manage dependencies specific to a project. A requirements.txt file may list the packages to install (e.g., with pip for Python packages or apt-get for system-level dependencies), along with system settings and environment variables; package managers can then be used to install, upgrade, and manage these packages.
- R’s renv: the renv package creates isolated, project-specific environments in R.
- Environment descriptors: in R, sessionInfo() or devtools::session_info() provide detailed information about the current session; similarly, in Python, libraries such as NumPy and Pandas offer show_versions() methods to display package versions (see the sketch below).
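As a small, hedged Python sketch of recording package versions for documentation (the package list is an illustrative assumption):

```python
from importlib.metadata import version, PackageNotFoundError

# Record the versions of the packages used in the analysis (illustrative list)
for pkg in ["numpy", "pandas"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```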
While environment managers are easy to use, simple to share across different systems, and lightweight and efficient with fast start-up times, Docker containers provide full environment isolation (including the operating system), which ensures consistent behavior across different systems.
Connecting data organization and documentation
To maintain clarity and organization in the data analysis process, adopt best practices such as:
- Data documentation: create a README.md file to provide an overview of the project and its structure, and metadata for understanding the context of your analysis.
- Annotate your pipelines and comment your code (look for tutorials and templates such as this one from freeCodeCamp).
- Use coding style guides (code lay-out, whitespace in expressions, comments, naming conventions, annotations…) to maintain consistency.
- Label files numerically to organize the entire data analysis process (scripts, notebooks, pipelines, etc.).
- 00.preprocessing., 01.data_analysis_step1., etc.
- Provide environment files for reproducing the computational environment (such as ‘requirements.txt’ for Python or ‘environment.yml’ for Conda). The simplest way is to document the dependencies by reporting the packages and their versions used to run your analysis.
- Versioning: use version control systems (e.g., Git) and upload your code to a code repository (see Lesson 5).
- Integrated development environments (e.g., RStudio, PyCharm) offer tools and features for writing, testing, and debugging code.
- Use git submodule for code and software that is reused in several projects.
- Leverage curated pipelines, such as those developed by the nf-core community, to ensure adherence to community standards and guidelines.
- Use Software Heritage, an archive of software source code, for long-term accessibility and reproducibility.
- Add a LICENSE file and keep it updated to clarify usage permissions and facilitate collaboration.
We provide a hands-on workshop on computational environments and pipelines; keep an eye on the upcoming events on the Sandbox website. If you’re interested in delving deeper, check out the HPC best practices module we’ve developed here.
Wrap up
This lesson emphasized the importance of reproducibility in computational research and provided practical techniques for achieving it. Using annotated notebooks, pipeline frameworks, and community-curated pipelines, such as those developed by the nf-core community, enhances reproducibility and readability.
Sources
- The Turing Way: Guide for Reproducible Research
- RDMkit (ELIXIR Research Data Management Kit): Data Analysis
- Code documentation by the Johns Hopkins Sheridan Libraries, covering best practices for code documentation, style guides, R Markdown, Jupyter Notebook, version control, and code repositories
- A Guide to Reproducible Code in Ecology and Evolution
- Best Practices for Scientific Computing
- ELIXIR Software Best Practices
- FAIR Cookbook: workflows
- Atlassian software development tutorials