5. Version Control with Git and GitHub

Modified

November 6, 2024

Course Overview

Time Estimation: X minutes
💬 Learning Objectives:

  1. Version control essentials and practices
  2. Git and Github repositories
  3. Create repositories
  4. GitHub page to showcase your data analysis reports

Best Practices in Data Analysis

This lesson introduces version control with Git and Github and its significance in research. You will gain the ability to create Git repositories, and skills to build GitHub pages for showcasing data analysis.

Version control is a systematic approach to tracking project changes, and documenting alterations to understand the project’s evolution. It is crucial in research data management, software development, and data analysis, offering numerous advantages. By enabling multiple contributors to sync their changes without conflicts, version control tools facilitate collaboration. These tools allow users to commit changes with descriptions, set and assign project objectives, address software issues, and automatically test code for bugs before they impact users. Whether for teams or individual projects, implementing version control is considered a best practice and a standard approach to project management.

Advantages of using version control
  1. Document Progress: Detailed change history aids understanding of project development and modifications.
  2. Ensure Data Integrity: Prevents accidental data loss or corruption, with each change tracked for easy recovery.
  3. Facilitate Collaboration: Enables seamless collaboration among team members, allowing multiple individuals to work concurrently without conflicts.
  4. Reproducibility: Preserves project state for accurate validation and analysis.
  5. Branching and Experimentation: Allows the creation of alternative project versions for experimentation, without altering the main branch.
  6. Global Accessibility: Platforms like GitHub provide visibility for sharing, feedback, and contribution to open science.

Version control using Git

Git is a widely adopted version control system that empowers developers and researchers to efficiently manage their project’s history, collaborate seamlessly, track changes, and ensure data integrity. Git operates on core principles and mechanisms:

  1. Local Repository: Each user maintains a local repository on their computer, storing the complete project history for independent work.
  2. Snapshots, Not Files: Git captures snapshots of the entire project at different points instead of tracking individual file changes, ensuring data consistency.
  3. Commits: Users create ‘commits’ as snapshots of the project at specific moments, recording changes made to files along with explanatory commit messages.
  4. Branching: Git supports branching, enabling users to create separate lines of development for new features or bug fixes without affecting the main branch.
  5. Merging: Changes from one branch can be merged into another, facilitating the incorporation of new features or bug fixes back into the main project with a smooth merging process.
  6. Distributed Architecture: Git’s distributed nature means each user’s local repository is a complete copy of the project, enabling offline work and ensuring data redundancy.
  7. Remote Repositories: Users can connect and synchronize their local repositories with remote repositories hosted on platforms like GitHub, facilitating collaboration and project sharing.
  8. Push and Pull: Users ‘push’ their local changes to a remote repository to share with others and ‘pull’ changes made by others into their local repository to stay updated.
  9. Conflict Resolution: Git provides tools to resolve conflicts manually in cases of conflicting changes, ensuring data integrity during collaboration.
  10. Versioning and Tagging: Git offers versioning and tagging capabilities, allowing users to mark specific points in history such as major releases or significant milestones.

GitHub Hosting for Git

In addition to exploring Git, we will also explore GitHub, a collaborative platform for hosting Git repositories. GitHub enhances Git’s capabilities by offering features like issue tracking, security measures to protect repositories, and GitHub Pages for creating project websites. Additionally, GitHub provides the option to set repositories as private until you are ready to share your work publicly.

The difference between Git and GitHub is that Git is a version control system used to track changes in code, while GitHub is a cloud-based platform that hosts Git repositories and facilitates collaboration. Essentially, GitHub serves as an online access point for managing and sharing repositories.

Alternatives flows for collaborative projects

We will focus on GitHub for the remainder of this lesson due to its widespread usage and compatibility.

Warning

We will discuss repositories for archiving experimental or large datasets in lesson 7. However, if you are interested in version control large files, we recommend the use of git annex. It is also important to archive files with a checksum (MD5, SHA1, SHA256) to verify that files are not altered or corrupted buy recomputing their signature.

From Project folders to Git repositories

Moving from Git to GitHub involves transitioning from a local version control setup to a remote hosting platform. You will need a GitHub account for the exercise in this section.

Create a GitHub account
  • If you don’t have a GitHub account yet, click here
  • Install Git from Git webpage

You have two options when it comes to creating a repository for your project. First, you can start from scratch by creating a new repository and adding files to it as your project progresses. Alternatively, if you already have an existing folder structure for your project, you can initialize a repository directly from that folder. It is crucial to initiate version control in the early stages of a project to facilitate easy tracking of changes and effective management of the project’s version history from the beginning.

Converting Folders to Git Repositories

If you completed all the exercises in lesson 3, you should have a project data structure prepared. Otherwise, consider using one of your existing projects or creating a small toy example for practice using cookiecutter (see practical_workshop).

Github documentation link
Exercise 1: initialize a repository from an existing folder:
  1. Initialize the repository: Begin by running the command git init in your project directory. This command sets up a new Git repository in the current directory and is executed only once, even for collaborative projects. See (git init) for more details.
  2. Create a remote repository: Once the local repository is initialized, create am empty new repository on GitHub.
  3. Connect the remote repository: Add the GitHub repository URL to your local repository using the command git remote add origin <URL>. This associates the remote repository with the name “origin.”
  4. Commit changes: If you have files you want to add to your repository, stage them using git add ., then create a commit to save a snapshot of your changes with git commit -m "add local folder".
  5. Push to GitHub: To synchronize your local repository with the remote repository and establish a tracking relationship, push your commits to the GitHub repository using git push -u origin main.
Setting Up a Git Repository and copying an existing folder

Alternatively to converting folders to repositories, you can create a new repository remotely, and then clone (git clone) it locally. Here, git init is not needed. You can move the files into the repository locally (git add, git commit, and git push). If you are creating a collaborative repository, you can now share it with your colleagues.

Tips to write good commit messages

Write useful and clear Git commits. Check out this post for tips.

Github pages

After setting up your repository on GitHub, take advantage of the opportunity to enhance it by adding your data analysis reports. Whether they are in Jupyter Notebooks, R Markdown files, or HTML reports, you can showcase them on a GitHub Page.

Once you have created your repository (and put it in GitHub), you have now the opportunity to add your data analysis reports that you created, in either Jupyter Notebooks, R Markdown files, or HTML reports, in a GitHub Page website. Creating a GitHub page is very simple, and we recommend that you follow the nice tutorial that GitHub has put for you.

For simplicity, we recommend using Quarto or MkDocs. Visit their websites and follow the instructions to get started.

Tutorial links

Step-by-Step Setup Guide

We provide an example of setting up Git, Quarto, and a GitHub account, enabling you to replicate the process independently! (see Exercise 5 in the practical material)

Wrap up

In this lesson, we explored version control and utilized Git and GitHub to establish data analysis repositories from our Project folders. Additionally, we delved into creating a GitHub organization and leveraging GitHub Pages to showcase data analysis scripts and notebooks publicly. Remember to complete the corresponding exercise from the practical workshop to reinforce your knowledge.

Sources

Take our Git & Github course at KU

If you’re interested in delving deeper, explore our course on Git and GitHub.

Alternatively, here are some examples and online resources to expand your understanding: