Breast Cancer Proteomics Module

Author

Jacob Fredegaard Hansen

Published

October 16, 2024

!!! note "Section Overview"

🕰 **Total Time Estimation:** 8 hours

💬 **Learning Objectives:**    
    1. Understand the workflow of the study. <br>
    2. Retrieve and understand the study design and data from selected papers. <br>
    3. Download and preprocess proteomics data using FragPipe. <br>
    4. Set up and use FragPipe for TMT-labeled MS data analysis. <br>
    5. Create accurate annotation files for data analysis.

Preliminary work

For this work, we will use the data analyzed in the paper *Breast cancer quantitative proteome and proteogenomic landscape* by Johansson et al., which compares subgroups of breast cancer tumors from the Oslo2 cohort.

Before delving into the actual analysis of the data in FragPipe, we must first:

  1. Read and understand the study design of the paper.
  2. Understand the data being used from the paper.

Questions for Understanding the Paper

To better understand the paper, we have formulated some questions that should help clarify the study design, aim, and overall scope of the work. These questions are listed below:

Supplementary Data

Before we download the data available from the paper, we must first look at some of the details in Supplementary Data 1. This is a large table containing the quantitative proteome data from the Oslo2 breast cancer cohort, which includes 45 tumors from different breast cancer subgroups and relates to Figures 1-6 in the paper.

To better understand the supplementary data, we have prepared guiding questions to aid in interpreting the table. You can find the supplementary data in the paper here.

Question 1: Provide a brief description of the content presented in the table.

Question 2: What information does the tumor ID represent?

Question 3: Briefly describe TMT-labeled mass spectrometry proteomics data and explain the experimental procedure involved.

TMT10-Tags and Tumor IDs

Fill in the table below by entering the corresponding TMT10-tags and Tumor IDs. Once submitted, you’ll get instant feedback on whether your input is correct.

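| TMT/Plex Set | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 |
|--------------|-------|-------|-------|-------|-------|
| TMT126       |       |       |       |       |       |
| TMT127N      |       |       |       |       |       |
| TMT127C      |       |       |       |       |       |
| TMT128N      |       |       |       |       |       |
| TMT128C      |       |       |       |       |       |
| TMT129N      |       |       |       |       |       |
| TMT129C      |       |       |       |       |       |
| TMT130N      |       |       |       |       |       |
| TMT130C      |       |       |       |       |       |
| TMT131       |       |       |       |       |       |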
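The channel-to-sample mapping collected in the table above is the kind of information an annotation file records. As a minimal sketch, assuming an annotation file is a plain-text file with one line per channel (channel label first, then a sample name), the snippet below writes such a file for one plex set. The file name, the output location, and the sample names (`set1_sample_01`, ...) are placeholders invented for illustration. Replace them with the Tumor IDs you identified, and check the FragPipe documentation for the exact format and channel labels your version expects.

```bash
#!/usr/bin/env bash
set -euo pipefail

# The ten TMT10 channels, in the same order as the table rows.
channels=(126 127N 127C 128N 128C 129N 129C 130N 130C 131)

# Hypothetical output location; a temporary directory keeps the sketch harmless.
out_dir="$(mktemp -d)"
annotation_file="$out_dir/set1_annotation.txt"

# Write one "<channel><TAB><sample name>" line per channel.
# The sample names are placeholders, not real Tumor IDs.
: > "$annotation_file"
i=1
for ch in "${channels[@]}"; do
  printf '%s\tset1_sample_%02d\n' "$ch" "$i" >> "$annotation_file"
  i=$((i + 1))
done

cat "$annotation_file"
```

One file per plex set keeps the mapping for each set separate, which mirrors the one-column-per-set layout of the table.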

Question 4: What is in TMT131?

Question 5: What is the purpose of using this type of sample?

Now that we better understand the workflow of the study and the content of the data, we are ready to move on to the analysis performed by FragPipe in the next section.

Analysis of MS Data Using FragPipe

In this section of the teaching module, we will work with data from the paper. The first task is to download sample files from the paper, guided by the questions provided below:

Question 6: Where can the data be found?

Question 7: What is the ProteomeXchange database?

Question 8: What accession code is used for the data deposited in ProteomeXchange?

By examining the accession code for the data deposited in ProteomeXchange, we can access and download the data using FTP.

Question 9: What is FTP, and what is its functionality?

To download the data, we will use the Proteomics Sandbox Application on UCloud. This platform provides both the storage capacity and the computational power required for this process.

The Proteomics Sandbox Application is a virtual environment that includes multiple software tools, including FragPipe for analyzing proteomics data.

You can find the Proteomics Sandbox Application on UCloud here.

First, we will download the data for the sample files to be used in FragPipe. Then, we will launch FragPipe to run the first analysis of the data. Before doing so, we have some questions regarding FragPipe and its usability:

Question 10: What is FragPipe, and what are its applications?

Question 11: If FragPipe were not used for this part of the teaching module, which alternative software tools could be employed? Please provide a few examples.

Question 12: What are the benefits of using FragPipe?

Simple analyses in FragPipe may only require 8 GB of RAM, while large-scale or complex analyses may require 24 GB of memory or more (FragPipe Documentation), which is why we will allocate 24 GB for this exercise.

In UCloud, the settings should look like this:

SCREENSHOT HERE

Before submitting the job, it is also recommended to create a personal folder where you can store both the data and the results generated by FragPipe. You can follow the step-by-step guide below:

SCREENSHOT HERE

CAUTION!!!

Make sure to allocate the right number of hours before submitting the job. If the time runs out, the job will be canceled, and all progress will be lost. However, you can always extend the job duration if more time is required after submission.

Time can pass quickly when working, so we recommend initially allocating 2 hours for the job. Now, we are ready to submit the job and launch the virtual environment of the Proteomics Sandbox Application.

Download Data from the Paper

Initially, we will need to download the paper’s data. For this exercise, we will only use one sample file from each Plex Set/Pool.

We will use the terminal in the virtual environment for downloading the data. First, we need to update and download the necessary packages. You can do that by typing the following code:

```bash
sudo apt-get update
sudo apt-get install lftp
```

Question 13: What does the code above do? Please explain its functionality and purpose.

Now, we can access the FTP server where the data is located. You will need the address of the FTP server, which is listed on the ProteomeXchange page for the accession code XXX that you visited earlier. The address appears at the bottom of that page.

Question 14: Please locate the address.

This address gives access to the data used in the study. We can connect to the server with the lftp package that we just installed, using the following code:

```bash
lftp [insert the address of the FTP server here]
lftp ftp://ftp.......
```

Question 15: We now have access to the data stored on the FTP server. Please provide a brief description of the contents of the folder on the FTP server.

To download one sample file from each of the Plex Sets, you can use the following code in the terminal:

CODE HERE

Question 16: Please explain what the code is doing by describing the functions used.
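As a generic illustration of how a non-interactive lftp download can be put together (this is not the module's own download code), the sketch below builds one `lftp` command per file and only prints it. The server address, destination folder, and file names are hypothetical placeholders; the real values come from the FTP address you located and the folder listing you inspected.

```bash
#!/usr/bin/env bash
set -euo pipefail

# All values below are hypothetical placeholders, not the real dataset.
server="ftp://ftp.example.org/path/to/dataset"
dest_dir="raw_files"
files=("set1_example.raw" "set2_example.raw")

for f in "${files[@]}"; do
  # "get -O <dir> <file>" saves the remote file into <dir>;
  # "bye" closes the lftp session once the transfer is done.
  cmd=(lftp -e "get -O $dest_dir $f; bye" "$server")

  # Dry run: print the command instead of executing it.
  printf '%q ' "${cmd[@]}"
  printf '\n'
done
```

To run the transfer for real, create the destination folder first (`mkdir -p "$dest_dir"`) and replace the two `printf` lines with `"${cmd[@]}"`.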

If you added your own private folder to the UCloud session, you can now move the data into that folder to keep the files you are working with organized.
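Moving the files can be done from the same terminal. The sketch below is only an illustration: the folder names are assumptions, and a temporary directory with empty stand-in files takes the place of the real download location so that the commands can be tried safely. In your session, point `download_dir` at the folder holding the downloaded files and `personal_dir` at your own mounted folder.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical locations; a temporary directory stands in for the session.
work_dir="$(mktemp -d)"
download_dir="$work_dir/downloads"       # where the .raw files were saved (assumed)
personal_dir="$work_dir/my_folder/raw"   # your own mounted folder (assumed)

# Empty stand-in files representing the downloaded data.
mkdir -p "$download_dir"
touch "$download_dir/set1_example.raw" "$download_dir/set2_example.raw"

# Create the target folder if needed, then move every .raw file into it.
mkdir -p "$personal_dir"
mv -v "$download_dir"/*.raw "$personal_dir"/

# Confirm that the files arrived.
ls -l "$personal_dir"
```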

Next, we can launch FragPipe, which is located on the desktop. In this tutorial, we are using FragPipe version XX.YY in the June 2024 version of the Proteomics Sandbox Application.

Now that we have launched FragPipe, we need to configure the settings before running the analysis. The guiding questions below will help you set up FragPipe:

Getting started with FragPipe

Go to the “Workflow” tab to set up the workflow for the analysis and import the data you have just downloaded.

Question 17: Which workflow should you select? HINT: How many TMT tags are listed in the table in Supplementary Data 1?

Click “Load workflow” after you have found and selected the correct workflow to be used.

Next, add your files by clicking on “Add files” and locate them in the designated folder for your raw files that you previously created.

Next, go to the “Database” tab. Here you can either download a database file or browse for an existing one. In this case, we will simply download the latest database file.

Question 18: What is the purpose of the database file used in FragPipe, and why is it important?

Question 19: Which organism should you choose when downloading the database file?

Question 20: Describe the relationship between decoys and false discovery rate (FDR) by answering the following questions:

  • What are decoys?
  • Why should you include decoys?
  • What role do decoys play in estimating the FDR?
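The arithmetic behind a target-decoy estimate can be tried on a toy example. The sketch below assumes a simplified setting: each line is one match with a score and a protein name, decoys are recognized by a `rev_` prefix, and the FDR above a score threshold is estimated as the number of decoy matches divided by the number of target matches. The scores, protein names, prefix, and threshold are all made up for illustration, and the filtering that FragPipe actually performs is more elaborate than this.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy search results: "<score> <protein>" per line (invented values).
psm_file="$(mktemp)"
cat > "$psm_file" <<'EOF'
95 protein_A
90 protein_B
88 protein_C
85 protein_D
80 rev_protein_E
78 protein_F
72 protein_G
65 rev_protein_H
60 protein_I
55 rev_protein_J
EOF

threshold=70

# Count targets and decoys at or above the threshold, then divide.
fdr="$(LC_ALL=C awk -v t="$threshold" '
  $1 >= t { if ($2 ~ /^rev_/) decoys++; else targets++ }
  END { if (targets > 0) printf "%.3f", decoys / targets; else printf "NA" }
' "$psm_file")"

echo "Estimated FDR at score >= $threshold: $fdr"
```

With these toy values, six targets and one decoy pass the threshold, so the estimate is 1/6, printed as 0.167. Raising the threshold to 85 removes the decoy and the estimate drops to zero, which is the trade-off between sensitivity and error rate that the threshold controls.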

Next, you can go to the MSFragger tab to adjust the parameter settings for the search and matching of the theoretical and experimental peptide spectra.

Most of the settings used for MSFragger can be obtained from the paper NAME OF PAPER, which is referred to in the Methods and Materials section.

When all settings have been obtained, MSFragger should look something like this:

Question 21: What is MSFragger?

Question 22: How does MSFragger operate?

Question 23: Why is it essential to run MSFragger for this analysis?

Finally, we can navigate to the “Run” tab and run the analysis. For that, we must choose an output directory for the results of the search made by FragPipe. Once you have adjusted that, you are ready to click on “Run”.

This process might take some time, so make sure that you still have enough hours allocated to your job on UCloud; otherwise, the job will be terminated. Meanwhile, you can answer these questions:

Question 24: What are your expectations regarding the output results? Consider the implications of the number of files provided for this search in your response.

Question 25: Can the output from this analysis be reliably used for downstream applications given the limited number of sample files? Justify your answer.

Question 26: What does it signify that the sample tissues have been fractionated as described in PAPER?

  • Outline the fractionation process utilized.
  • Explain the study design associated with this research.
  • In your opinion, will increasing the number of fractions improve proteome coverage? Justify your reasoning.

When the run in FragPipe is done, please locate the output results and get an overview of the output.
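One quick way to get such an overview from the terminal is to count the files in the output directory by extension. The sketch below is an illustration only: a temporary directory with a few empty, hypothetically named files stands in for the real output directory, so set `out_dir` to the directory you chose on the “Run” tab.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the FragPipe output directory (file names are hypothetical).
out_dir="$(mktemp -d)"
mkdir -p "$out_dir/set1"
touch "$out_dir/example_table.tsv" "$out_dir/set1/example_table.tsv" "$out_dir/example_run.log"

# List all files, keep only the extension, and count each extension.
summary="$(find "$out_dir" -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn)"
echo "$summary"

# Show the directory layout down to two levels.
find "$out_dir" -maxdepth 2 | sort
```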

Question 27: What types of output are generated by FragPipe?

For the downstream analysis, we will use the output from the list of combined proteins, which we will explore further in the following section.

Further Interpretation and Analysis of FragPipe Results

For this part, we will use an output file from a FragPipe run on all sample files (i.e., 5x72 raw files). That file can be downloaded here???

Now, we will look at the output from FragPipe, where we will use the file named combined_proteins.tsv. First, we will explore the contents of the file locally. To do so, download the file from UCloud and open it in a spreadsheet program such as Excel.

You can download the file by clicking on the file in your output directory in the UCloud interface, from where you can choose to download it.

Question 28: Provide a concise overview of the table’s contents. What information is represented in the rows and columns?

For the downstream analysis, we will use the columns containing the TMT intensities across the proteins identified.
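The relevant columns can also be located from the terminal before loading the table into another tool. The sketch below works on a tiny made-up table: the column names, the word `Intensity` used to recognize them, and the values are assumptions for illustration, so inspect the header of your own file and adjust the pattern accordingly.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Tiny stand-in table (tab-separated; header names and values are invented).
tsv="$(mktemp)"
printf 'Protein\tGene\tset1_126 Intensity\tset1_127N Intensity\n' > "$tsv"
printf 'P04637\tTP53\t1200.5\t980.1\n' >> "$tsv"

# Print the header one field per line, numbered, and keep the intensity columns.
intensity_cols="$(head -n 1 "$tsv" | tr '\t' '\n' | grep -n 'Intensity')"
echo "$intensity_cols"

# Turn the column numbers into a comma-separated list for cut.
col_list="$(cut -d: -f1 <<< "$intensity_cols" | paste -sd, -)"

# Extract the protein identifier (column 1) plus the intensity columns.
cut -f "1,$col_list" "$tsv"
```

Selecting columns by header pattern rather than by fixed position keeps the command valid if the number of samples, and therefore the column layout, changes between runs.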

For that, we will use OmicsQ, a toolkit for quantitative proteomics that facilitates the processing of quantitative data from omics-type experiments. It also serves as an entry point to apps such as PolySTest [SCHWAMMLE20201396] for statistical testing, VSClust for clustering, and ComplexBrowser for investigating the behavior of protein complexes.

Data screening, multivariate analysis and clustering

For the downstream workflow, we will follow the tutorial from the Jupyter Notebook embedded below.

Personal Details

Name:
Email:
Course/Program:
Date: