HPC Lab

HPC file transfers

1. File integrity verification

We recommend using md5sum to verify data integrity, particularly when downloading large datasets, as it is widely available and commonly used for this purpose. All data and files archived on Zenodo include an MD5 hash. Let’s have a look at the contents of fastmixture, a recently developed tool that estimates individual ancestry proportions from genotype data.

Exercise: checksums
  1. Open this Zenodo link
  2. Enter the DOI of the repo (for all versions):
  3. Zenodo offers an API at https://zenodo.org/api/, which functions similarly to the DOI API. This allows you to retrieve a BibTeX-formatted reference for a specific record (e.g., records/14106454) using either curl or wget.
Terminal
# ------curl-------
curl -LH 'Accept: application/x-bibtex' https://zenodo.org/api/records/14106454 \
     --output meisner_2024.bib

# ------wget-------
wget --header="Accept: application/x-bibtex" -q \
     https://zenodo.org/api/records/14106454 -O meisner_2024.bib

Does the content of your *.bib file look like this?

@misc{meisner_2024_14106454,
  author       = {Meisner, Jonas},
  title        = {Supplemental data for reproducing "Faster model-
                   based estimation of ancestry proportions"},
  month        = nov,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {v0.93.4},
  doi          = {10.5281/zenodo.14106454},
  url          = {https://doi.org/10.5281/zenodo.14106454},
}
  4. Scroll down to Files and download the software zip file (fastmixture-0.93.4.zip) using the command below:
Terminal
curl https://zenodo.org/records/14106454/files/fastmixture-0.93.4.zip \
--output fastmixture.zip 
  5. Compute the MD5 hash and enter the value (no whitespace).

  6. Is your value the same as the one shown on Zenodo?

  7. Finally, compute the SHA-256 digest (with the program sha256sum) and enter the value.

Hint: Solution
md5sum fastmixture.zip
sha256sum fastmixture.zip
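
In a script, the comparison with the published value can be automated rather than done by eye. The following is a minimal sketch, not part of the original exercise: `verify_md5` is a hypothetical helper name, and a small temporary text file stands in for the downloaded `fastmixture.zip`.

```shell
#!/bin/bash
# Sketch: compare a computed MD5 digest with an expected value.
# `verify_md5` is a hypothetical helper name, not part of md5sum.
verify_md5() {
    local file="$1" expected="$2" actual
    # md5sum prints "<digest>  <file>"; keep only the digest
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}

# Stand-in for a downloaded file (assumption: use fastmixture.zip in practice)
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/example.txt"

# MD5 of the six bytes "hello\n"
verify_md5 "$workdir/example.txt" "b1946ac92492d2347c6235b4d2611184"
```

For the exercise itself, you would pass `fastmixture.zip` and the MD5 value copied from the Zenodo record page instead.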
Tip: Bonus exercise

We will be using the HLA database for this exercise. Click on this link or search for IMGT HLA > Download. Important: go through the README before downloading! Check whether a checksums file is included.

  1. Download and open the md5checksum.txt (HLA FTP Directory)
  2. Look for the hash of the file hla_prot.fasta
  3. Create a bash script named “dw_resources.sh” in your current directory to download the target files.
#!/bin/bash
md5file="md5checksum.txt"

# Define the URL of the file to download
url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_prot.fasta"

# Keep the original file name so that it matches the entry in the checksums file
filename=$(basename "$url")
# (Optional) To save the file under a different name, pass that name to `wget -O`
# instead, e.g. "imgt_hla_prot.fasta". Note that `md5sum --check` looks files up
# by the names listed in $md5file, so a renamed file will not be verified.

# Download the file and, if the download succeeds, verify its checksum
wget "$url" -O "$filename" && \
md5sum --quiet --ignore-missing --check "$md5file"

We recommend using the --quiet argument in your pipelines so that md5sum prints only errors (it prints nothing on success). The --ignore-missing argument is useful because it lets us use the raw checksums file while skipping the files we did not download.

Did you get any errors?

  4. Generate the md5 hash and compare it to the one in the original md5checksum.txt.
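
To see how --quiet and --ignore-missing behave without downloading anything, here is a small local sketch. The file names are made up for illustration, and it assumes GNU coreutils md5sum (--ignore-missing is available from version 8.25).

```shell
#!/bin/bash
# Sketch with made-up file names; requires GNU coreutils md5sum.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create two files and record their checksums, as a database might provide them
printf 'MKV\n' > wanted.fasta
printf 'MAA\n' > unwanted.fasta
md5sum wanted.fasta unwanted.fasta > md5checksum.txt

# Simulate downloading only one of the listed files
rm unwanted.fasta

# Passes silently: the missing file is skipped and the present one matches
md5sum --quiet --ignore-missing --check md5checksum.txt
status_ok=$?

# Corrupt the file: md5sum now reports the failure and exits non-zero
printf 'corrupted\n' >> wanted.fasta
md5sum --quiet --ignore-missing --check md5checksum.txt
status_bad=$?

echo "exit status intact: $status_ok, corrupted: $status_bad"
```

Because the exit status is non-zero on a mismatch, the check can be chained with `&&` so that later pipeline steps run only on verified files.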

2. Synchronisation and transfer with rsync

To explore all rsync options would require a workshop on its own. Check the manual to learn more about the command: https://linux.die.net/man/1/rsync.

1.1. Create a local folder structure containing rsync/data (inside a folder called hpcLaunch) and navigate to the rsync folder.

mkdir -p hpcLaunch/rsync/data
cd hpcLaunch/rsync

1.2. Generate 100 files for each of the extensions fastq and log in the data folder:

touch data/file{1..100}.fastq data/file{1..100}.log

1.3. Check the data directory:

ls data

Local-to-local copy

We are going to use rsync to create a backup copy of the data we just generated.

Note

The syntax of rsync is pretty simple:

rsync OPTIONS ORIGIN(s) DESTINATION

An archive (incremental) copy can be done with the -a option. You can show progress during the transfer with the -P option. In this exercise, we want to exclude some files from the backup: we want to keep only those with the fastq extension.

Run the following command:

rsync -aP --exclude="*.log" data backup

This will copy all the fastq files into backup/data. Check the new folders with ls in a terminal.

Warning

Using data will copy the entire folder, while data/ will copy only its content! This is common to many other UNIX tools.

Append some text to the first ten fastq files:

for i in {1..10}; do { echo ATGC; echo TCCA; echo NNNN; echo NNNN; } >> data/file$i.fastq; done

Inspect one of the modified files with the less file reader.

Tip: Not familiar with less?

less is ideal for exploring large text files: you can scroll using the arrow keys and exit by pressing q.

Check the documentation (man less or less --help) to learn how to search for specific text within a file.

Then open the file with less, explore its contents, and check which lines contain an N.

less data/file1.fastq

While inside less, type /N and press enter. Is some text highlighted?

Finally, count how many lines contain at least one N in file1.fastq using the command grep. How many are there?

Hint: Solution
grep -c 'N' data/file1.fastq
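
Note that grep -c counts matching lines, not individual matches. The sketch below recreates the four lines appended in the exercise in a temporary file (an assumption, so that it runs on its own) and contrasts the two counts.

```shell
#!/bin/bash
# Sketch: recreate the four lines appended in the exercise in a temporary file
tmpfile=$(mktemp)
{ echo ATGC; echo TCCA; echo NNNN; echo NNNN; } > "$tmpfile"

# Number of lines containing at least one N
lines_with_n=$(grep -c 'N' "$tmpfile")

# Total number of N characters: -o prints every match on its own line
total_n=$(grep -o 'N' "$tmpfile" | wc -l)

echo "lines with N: $lines_with_n"   # 2
echo "N characters: $total_n"        # 8
```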

We also want to preserve earlier versions of any files that get updated. To do this, create a backup directory named with the current date and time (it will appear in your backup directory):

rsync -aP --exclude="*.log" \
      --backup \
      --backup-dir=versioning_$(date +%F_%T) \
      data \
      backup
Tip

If you create a folder called backup in your project folder, you can use versioning to store your analysis and results with incremental changes.

Transfer between local and remote

You can use the same approach to transfer and back up data between a local machine (your PC/laptop) and another remote system (in this case, UCloud). You need Linux, Mac or WSL/MobaXterm on the local host to perform rsync.

Let’s transfer the fastq files to UCloud. In this case, we want the content of the data folder to be transferred and not the folder itself (PATH_TO/data/).

rsync -aP --exclude="*.log" -e "ssh -i ~/.ssh/id_rsa -p <port>" PATH_TO/data/ ucloud@ssh.cloud.sdu.dk:/work/hpcLaunch/data

Go to UCloud and check the content in /work/hpcLaunch/data.

The opposite can also be done: downloading data from UCloud to your computer. For example:

# LOCAL_PATH can be . if you are running the command from the rsync folder you created before
rsync -aP --exclude="*.log" -e "ssh -i ~/.ssh/id_rsa -p <port>"  ucloud@ssh.cloud.sdu.dk:/work/hpcLaunch/day1 LOCAL_PATH

Do you now have all files generated in the previous exercise locally?

You will have to type your password if you do not use SSH keys!

3. Session management using tmux

tmux was originally designed as a keyboard-only tool. However, you can also configure it to allow switching between windows and panes using the mouse. To enable this, add the following setting to the configuration file:

echo "set -g mouse" >> ~/.tmux.conf

You can start a tmux session anywhere. Sessions are easier to navigate if you give them a name.

  1. Start a session called example1 (or choose a different name!):
tmux new -s example1

The command attaches you to the new session automatically. You are now in session example1 and have one window, which you are currently using.

  2. Split the window into multiple terminals (panes).

Split the window horizontally and vertically; you will then be running a total of 3 terminals.

Ctrl + b, then %

Ctrl + b, then "

Ctrl + b, then arrow keys to change pane!
Tip: Using the mouse
  • Right-click with the mouse to choose the split.
  • Left-click to change pane.
  • Right-click on the window bar to create a new window.
  3. Create a new window with Ctrl + b, then c
  4. Change between windows with Ctrl + b, then n

Now you have 2 windows, with three panes running in one of them.

In the new window, let’s look at which tmux sessions and windows are open. Run

tmux ls

The output will tell you that session example1 is in use (attached) and has 2 windows.

example1: 2 windows (created Wed Apr  2 16:12:54 2025) (attached)
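
The same layout can also be built non-interactively, which is handy in setup scripts. The following is a sketch under stated assumptions: tmux is installed, and the socket name `demo_socket` and session name `scripted` are made up so that the demo does not touch your own sessions.

```shell
#!/bin/bash
# Sketch: build the layout from the exercise without key bindings.
# Assumptions: tmux is installed; "demo_socket" and "scripted" are made-up names.
# A separate socket (-L) keeps this demo away from your own sessions.
if command -v tmux >/dev/null 2>&1; then
    tmux -L demo_socket new-session -d -s scripted -x 200 -y 50  # detached, one window
    tmux -L demo_socket split-window -h -t scripted              # like Ctrl + b, then %
    tmux -L demo_socket split-window -v -t scripted              # like Ctrl + b, then "
    tmux -L demo_socket new-window -t scripted:                  # like Ctrl + b, then c

    # Count the windows in the session and the panes across all of its windows
    window_count=$(tmux -L demo_socket list-windows -t scripted | wc -l)
    pane_count=$(tmux -L demo_socket list-panes -s -t scripted | wc -l)
    echo "windows: $window_count, panes: $pane_count"

    tmux -L demo_socket kill-server   # clean up the demo session
else
    echo "tmux not found; skipping" >&2
fi
```

The pane count covers both windows: three panes in the first window plus the single pane of the new one.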
Tip: Bonus exercise

Launching separate downloads at the same time

Start a new session without attaching to it (-d option), and call it downloads:

tmux new-session -d -s downloads

Verify that the session is there with tmux ls.

Warning

If you want to start a new session and attach to it, you first need to detach from the current session with Ctrl + b, then d.

Create a text file listing a few example files for this workshop to be downloaded.

curl -s https://api.github.com/repos/hds-sandbox/GDKworkshops/contents/Examples/rsync | jq -r '.[] | .download_url' > downloads.txt

The script below launches a download for each URL in the list, each in a new window of the downloads session. Each window closes when its download finishes. Whenever fewer than K downloads are active, a new one starts, until the list is exhausted. Once all downloads have been launched, you can close your terminal: the downloads will keep going. The window names are printed so you can keep an eye on the current downloads. Try it out and use it whenever you have a massive number of files to download.

mkdir -p downloaded
K=2  # Maximum number of concurrent downloads
while read -r url; do
    # Wait until the number of active tmux windows in the "downloads" session is less than K
    while [ "$(tmux list-windows -t downloads | wc -l)" -ge "$((K+1))" ]; do     
        sleep 1
    done

    # Extract the filename from the URL
    filename=$(basename "$url")

    # Start a new tmux window for the download
    tmux new-window -t downloads -n "$filename" "wget -c $url -O downloaded/$filename && tmux kill-window"
    tmux list-windows -t downloads -F "#{window_name}"   
done < downloads.txt

You are done for now! It’s time to stop the job: hold the Stop application button to do so.

Copyright

CC-BY-SA 4.0 license