snRNA session
Here you find instructions to install the software and run the analysis for the snRNA-seq session. You should already have followed the Access instructions and be in the folder singleCellMoleculeCourse/snRNA/NAME on the GenomeDK cluster.
Checking if the package manager is installed
First of all, check whether conda is installed on the cluster. Right-click in your folder and choose Open Terminal Here. In the terminal, type conda and press Enter. You should see a series of options and commands related to conda. If you get an error message instead, conda is not installed on the cluster. In this case, install it by following the instructions in the next section.
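If you prefer a plain yes/no answer over reading the help text, the check can be sketched as a small shell function. The function name check_tool is made up for this example; command -v is the standard shell way to ask whether a program is on your PATH.

```shell
# Report whether a program is available on the PATH.
# check_tool is a hypothetical helper name used only in this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found"
    else
        echo "$1 not found"
    fi
}

# Ask for conda specifically.
check_tool conda
```

If the answer is "conda not found", continue with the installation section.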
Installing conda
If you got an error message in the previous step, install conda by copying and pasting the following commands into the terminal:
wget -O ~/Miniforge3.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash ~/Miniforge3.sh -b -p "${HOME}/conda"
source "${HOME}/conda/etc/profile.d/conda.sh"
conda init
After running these commands, close the terminal and open a new one. You should now be able to use conda by typing conda and pressing Enter.
Creating the conda environment
We need to create a conda environment. An environment is a separate space where you can install software without affecting other users and the rest of the computing cluster. I prepared an environment file with the list of packages. Download it using the following command in the terminal:
wget https://github.com/hds-sandbox/AdvancedSingleCell/raw/refs/heads/main/Modules/Wells_data_analysis/Environment/environment.yml
You should now see a new file called environment.yml in your folder. Give it to conda in the terminal to create an environment:
conda env create -f environment.yml -p ./snrnaAnalysis -y
The installation might take some time. Once it is done, activate the environment with all the software it contains:
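conda activate ./snrnaAnalysis
The environment was created in the folder ./snrnaAnalysis, so it is activated by its path. You should now see the environment in parentheses at the beginning of the line in the terminal (either snrnaAnalysis or the full path to that folder), which means that you are now using the snrnaAnalysis environment.
From now on, you only need to run conda activate ./snrnaAnalysis from this folder in the terminal to use the software installed in this environment!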
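To double-check which environment your terminal is using, you can look at the CONDA_PREFIX variable, which conda sets to the environment's folder on activation. The sketch below wraps that check in a function; the name active_env is made up for this example.

```shell
# Print which conda environment, if any, is active in this terminal.
# CONDA_PREFIX is set by "conda activate" to the folder of the environment.
# active_env is a hypothetical helper name used only in this sketch.
active_env() {
    if [ -n "${CONDA_PREFIX:-}" ]; then
        echo "active environment: ${CONDA_PREFIX}"
    else
        echo "no conda environment is active"
    fi
}

active_env
```

After a successful activation, the printed path should end with snrnaAnalysis.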
Some software has to be installed manually in R. Type R in the terminal and press Enter to open the R command line. Paste the following commands to install the gene network software:
# install Bioconductor
install.packages("BiocManager")
BiocManager::install()
# install from GitHub
devtools::install_github('smorabit/hdWGCNA', ref='dev', upgrade="never")
# Show R in jupyterlab
IRkernel::installspec(name = 'scsm', displayname = 'SCSM')
Select any download source when asked. When the installations are successful, you can exit R by typing q() and pressing Enter. When asked to save, say no.
Alignment of your data
Now we want to align the data to a reference genome. We will use the software STAR for this step. Use the virtual desktop and the file explorer to open the file align_scrna_star_trimmed.slurm; you will find it under singleCellMoleculeCourse/scRNA/shared.
This file is a script that contains the commands to align the data. The important part, at the bottom of the file, is the command that runs the alignment. It looks like this:
STAR \
--runThreadN "${SLURM_CPUS_PER_TASK:-8}" \
--genomeDir "$STAR_INDEX" \
--readFilesIn "$R2" "$R1" \
--readFilesCommand zcat \
--soloType Droplet \
--soloBarcodeReadLength 0 \
--soloCBwhitelist None \
--soloUMIstart 1 --soloUMIlen 6 \
--soloCBstart 7 --soloCBlen 8 \
--soloFeatures Gene GeneFull \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outSAMtype None \
--soloMultiMappers EM \
--quantMode GeneCounts \
--outFileNamePrefix "$SAMPLE_DIR/$SAMPLE."
The first line calls STAR, while the rest is a list of options that configure the alignment. The table below explains what they mean.
| Option | Description |
|---|---|
| --runThreadN | Number of threads to use for the alignment. We set it to the number of CPUs allocated for the job, or to 8 if none are given. |
| --genomeDir | Path to the STAR index of the reference genome. |
| --readFilesIn | Paths to the input FASTQ files. The file with the cDNA reads comes first, the file with the barcode reads (cell barcode and UMI) comes second. |
| --readFilesCommand | Command to decompress the input files. We use zcat because the input files are compressed in gzip format. |
| --soloType | Type of single cell data. We use Droplet because we have droplet-based data. |
| --soloUMIlen | Length of the UMI. We set it to 6 because the UMI is 6 bases long. |
| --soloCBstart | Position of the first base of the cell barcode. We set it to 7 because the cell barcode starts at the 7th base of the read. |
| --soloCBlen | Length of the cell barcode. We set it to 8 because the cell barcode is 8 bases long. |
| --soloFeatures | Which features to quantify. We set it to Gene GeneFull to count both reads that overlap exons only (Gene) and reads that overlap the whole gene body, introns included (GeneFull). |
| --alignIntronMin | Minimum intron length. We set it to 20 because we want to allow for short introns. |
| --alignIntronMax | Maximum intron length. We set it to 1000000 because we want to allow for long introns. |
| --alignMatesGapMax | Maximum gap between mates. We set it to 1000000 because we want to allow for long gaps between mates. |
| --outSAMtype | Output format of the alignment. We set it to None because we don’t need the alignments themselves as SAM or BAM output. |
| --soloMultiMappers | How to handle multi-mapping reads. We set it to EM to use the Expectation-Maximization algorithm to assign multi-mapping reads to genes. |
| --quantMode | Which quantification mode to use. We set it to GeneCounts to count the reads per gene. |
| --outFileNamePrefix | Prefix for the output files. We set it to $SAMPLE_DIR/$SAMPLE. to save the output files in the sample directory with the sample name as prefix. |
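To make the barcode options concrete, the sketch below cuts a barcode read into its two parts using the positions given in the command: the UMI occupies bases 1 to 6, and the cell barcode occupies the 8 bases that start at position 7, that is bases 7 to 14. The sequence is invented for illustration and is not real data; STAR does this internally, so you never need to run this step yourself.

```shell
# Hypothetical barcode read, invented for this illustration (not real data).
barcode_read="ACGTACTTGCATGCAAAA"

# --soloUMIstart 1 --soloUMIlen 6  -> bases 1 to 6
umi=$(printf '%s\n' "$barcode_read" | cut -c1-6)

# --soloCBstart 7 --soloCBlen 8    -> bases 7 to 14
cell_barcode=$(printf '%s\n' "$barcode_read" | cut -c7-14)

echo "UMI:          $umi"
echo "Cell barcode: $cell_barcode"
```

Reads that share the same cell barcode are assigned to the same cell, and reads with the same UMI within a cell and gene are counted once.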
Run the alignment and check the final summary
The script above is the setup of our alignment. Now we tell GenomeDK to run it using the resources requested in the first lines of the script. Right-click in your personal folder and open a new terminal. In the terminal, type the following command to submit the job to the cluster:
sbatch ../shared/align_scrna_star_trimmed.slurm ../shared/star_index_dir b ./raw_data/DATA_1.fq.gz ./raw_data/DATA_2.fq.gz ./aligned/
Here you must substitute the file names of your own data.
You will see a message in the terminal with the job ID, which is a number that identifies your job on the cluster. You can check the status of your job by typing squeue -u $USER in the terminal. When your job is finished, you will see a new folder called aligned in your directory, which contains the output files of the alignment.
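The message printed by sbatch has the form "Submitted batch job" followed by the job ID. If you want to follow only that job instead of all your jobs, the ID can be pulled out of the message as in this sketch. The function name job_id_from_message and the number 12345 are placeholders made up for the example.

```shell
# Extract the job ID, which is the last word of the sbatch message.
# job_id_from_message is a hypothetical helper name used only in this sketch.
job_id_from_message() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Placeholder message; on the cluster the number will be different.
message="Submitted batch job 12345"
job_id=$(job_id_from_message "$message")
echo "Job ID: $job_id"

# On the cluster you could then look at this job only:
#   squeue -j "$job_id"
```

When the job no longer appears in the squeue output, it has finished.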
Inside the folder ./aligned/ you should find a folder whose name ends with .Solo.out, which contains the results of the alignment and quantification. Inside this folder, you will find a file called GeneFull/Summary.csv. Open it and read the statistics: did you recover a good number of cells? Are there many transcripts per cell? How many reads were mapped to the genome? These are some of the questions you can answer by looking at the summary file.
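The summary file is a plain two-column table (metric name, value), so single values can also be read in the terminal. The sketch below is one possible way to do it. The function name get_metric is made up, and the metric name "Estimated Number of Cells" is an assumption: check the first column of your own file for the exact names.

```shell
# Print the value of one metric from a two-column "name,value" CSV file.
# get_metric is a hypothetical helper name used only in this sketch.
get_metric() {
    awk -F',' -v name="$2" '$1 == name { print $2 }' "$1"
}

# Look for the GeneFull summary file under ./aligned, ignoring upper/lower case.
summary_file=$(find ./aligned -ipath '*GeneFull/summary.csv' 2>/dev/null | head -n 1)

if [ -n "$summary_file" ]; then
    # Assumed metric name; adapt it to what your file contains.
    get_metric "$summary_file" "Estimated Number of Cells"
else
    echo "No summary file found under ./aligned yet"
fi
```

To see every metric at once, simply open the file in the file explorer or print it with cat.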
Analysis tutorial
Follow the tutorial notebooks to analyze your own data. Notebooks are interactive documents with text, images, code and results from the code.
Running the notebooks
You can run the notebooks on the cluster using jupyterlab, a simple interface for coding. Before starting, copy the two .ipynb files from the shared folder into your personal folder.
Then open a new terminal in your personal folder. Start your conda environment:
conda activate ./snrnaAnalysis
To start jupyterlab, run the following command in the terminal:
jupyter lab --port=$UID --ip=$(hostname)
This will start a jupyterlab session on the cluster. If the browser does not open automatically, you will see a message in the terminal with a URL of the type: http://cn-1033:12345/lab?token=.... Copy the whole URL and paste it in the web browser (which you will open in the virtual desktop).
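The command above uses your numeric user ID as the port number, presumably so that each course participant gets a different port. If jupyterlab complains about the port, a quick check like the sketch below shows whether your user ID falls in the range of ports that ordinary users can open (1024 to 65535). The function name is_usable_port is made up for this example.

```shell
# Succeed if the argument is a whole number between 1024 and 65535.
# is_usable_port is a hypothetical helper name used only in this sketch.
is_usable_port() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
    esac
    [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]
}

# id -u prints the same number that bash stores in $UID.
port=$(id -u)
if is_usable_port "$port"; then
    echo "Port $port can be used for jupyterlab"
else
    echo "Port $port is outside 1024-65535; choose another value for --port"
fi
```

If you pick another value for --port, the number in the URL printed by jupyterlab changes accordingly.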
You should see the jupyterlab interface as in the figure below, where you can navigate to your folder and open the notebooks. For each notebook of this course, you need to choose the kernel SCSM to run the code (the kernel links the R language to a notebook). You can do this by clicking on the top right corner of the notebook, and selecting SCSM from the dropdown menu.