How to annotate data in proteomics
This short training material focuses on proteomics data annotation and the SDRF format. It is expected to take 1.5 hours to complete.
Learning goals
- Understand the concept, purpose and significance of data annotation in proteomics
- Learn about the scope and structure of SDRF
- Get practical experience with lesSDRF and learn what it can and cannot do
Part 1: Understanding study design and experimental setup
In this exercise, we will use the data analyzed in the paper “Breast cancer quantitative proteome and proteogenomic landscape” by Johansson et al., which compares subgroups of breast cancer tumors from the Oslo2 Landscape cohort.
Before delving into the actual analysis of the data in FragPipe, we must first:
- Read and understand the study design of the paper.
- Understand the data being used from the paper.
To save time, we do not include any questions about the study, but please read the article closely enough to understand the samples and the data collected.
Before downloading the data that accompany the paper, we need to look into some of the details in the Supplementary Data. Your first task is to look through the Supplementary Data and find the annotation relating the tumor types to the isotopic labels used.
How did you find the required information? Describe the steps you took. How much time did it take?
If you have not been able to find the annotations, open Supplementary Data 1, and then go to the tab called “Tumor annotations”.
To better understand the supplementary data, we have prepared guiding questions to aid in interpreting the table. You can find the supplementary data in the paper here.
Provide a brief description of the content presented in the table.
What information does the tumor ID represent?
Briefly describe TMT-labeled mass spectrometry proteomics data and explain the experimental procedure involved.
Part 2: Data annotation with SDRF
Suppose you want to reproduce the figures from the article. What do you need for that, except the experimental data files?
SDRF metadata format
We are going to work with the Sample and Data Relationship Format for Proteomics (SDRF-Proteomics), a standardized table-based format for capturing experimental metadata. An SDRF file is a TSV table describing a proteomics dataset. Rows in the table correspond to samples, and columns correspond to sample characteristics and experimental parameters pertaining to data acquisition and processing. Column names and values in SDRF are thoroughly standardized to enable automated creation, validation and re-use of SDRF files.
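Because an SDRF file is plain tab-separated text, it can be inspected with a few lines of Python. The sketch below is only an illustration: the column names follow the SDRF-Proteomics naming style, but the two rows and all their values (sample_A, run_1, example_file.raw and so on) are made-up placeholders and are not taken from the breast cancer dataset.

```python
import csv
import io

# Hypothetical two-row SDRF fragment. All values are placeholders,
# not taken from the dataset used in this exercise.
SDRF_TEXT = (
    "source name\tcharacteristics[organism]\tcharacteristics[organism part]\t"
    "assay name\tcomment[label]\tcomment[data file]\n"
    "sample_A\tHomo sapiens\tbreast\trun_1\tTMT126\texample_file.raw\n"
    "sample_B\tHomo sapiens\tbreast\trun_1\tTMT127N\texample_file.raw\n"
)


def read_sdrf(handle):
    """Return (header, rows) from a tab-separated SDRF file handle."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]  # skip empty lines
    return header, rows


header, rows = read_sdrf(io.StringIO(SDRF_TEXT))
print(len(header), "columns,", len(rows), "rows")
for row in rows:
    record = dict(zip(header, row))
    print(record["source name"], record["comment[label]"], record["comment[data file]"])
```

To read a real file instead of the inline text, pass an open file handle, for example `open("my_annotation.sdrf.tsv", newline="", encoding="utf-8")`, where the file name is hypothetical.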
Read the general description of SDRF and take a look at the specification. Then answer the following questions.
What is the general layout of an SDRF file?
What is the scope of information contained in an SDRF file?
What columns would capture the most important sample characteristics for the dataset you are working with?
How are the valid values defined for different columns?
We will now create an annotation for a small subset of our data according to SDRF-Proteomics standard.
Go to lesSDRF, a web application for easy creation of SDRF files. Start a new SDRF annotation with the human template. When asked for file names, input only the first RAW file name from the annotation table found in the supporting information. After that, proceed to step 2, labeling.
Carefully select the list of labels corresponding to the dataset, then click “Submit selection”, and proceed to specify that every label is present in “ALL” files. Then click “Ready” and proceed to step 3.
How many rows does the SDRF table have now? How many would it have if we annotated the entire dataset?
Fill in the first three required columns one by one: source name, organism, and organism part.
What would be a good sample identifier for the “source name” column?
Fill in the next column, cell type. Consider that we are dealing with cancer samples. For the next columns, ancestry category and age, you can select “not available”. Then, fill in the “sex” column.
It is time to fill in the disease column.
How many different values for disease can we possibly have in our annotation? What do they correspond to?
How many different values will we actually use when annotating the selected subset of data?
Proceed to fill in the disease column, then fill in the rest of the columns, up to and including “instrument”.
How many different values should you use in the “assay name” column when annotating the subset? Why? How many different values should there be in the entire annotation?
When you have only four columns left (cleavage agent details, modifications, precursor mass tolerance and fragment mass tolerance), skip to step 4 and fill in the factor value column.
What is the meaning of factor value, and what should it be in this case?
After that, download the resulting file. Copy the cleavage agent, modifications and mass tolerance information from the partial annotation provided by FragPipe into your file, using Excel or similar software. Congratulations! You now have a complete annotation according to the SDRF standard, but only for one of the raw files in the dataset. If you need a grade for this course, submit your SDRF file together with your answers.
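If you prefer a script to a spreadsheet, the copy step can be sketched in Python. This is a minimal illustration under several assumptions: the four columns carry the standard SDRF names listed in WANTED, the settings are the same for every row (so the first row of the partial annotation is used as the source), and factor value columns are kept at the end of the table by convention. The function names and the tiny in-memory tables are hypothetical placeholders.

```python
import csv

# Assumed standard SDRF names of the columns to copy.
WANTED = {
    "comment[cleavage agent details]",
    "comment[modification parameters]",
    "comment[precursor mass tolerance]",
    "comment[fragment mass tolerance]",
}


def read_tsv(path):
    """Read a tab-separated file into (header, rows) as plain lists."""
    with open(path, newline="", encoding="utf-8") as handle:
        table = [row for row in csv.reader(handle, delimiter="\t") if row]
    return table[0], table[1:]


def write_tsv(path, header, rows):
    """Write header and rows as a tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def copy_settings(header, rows, src_header, src_rows, wanted=WANTED):
    """Insert the wanted source columns before the first factor value column.

    Lists are used instead of dicts because SDRF allows repeated column
    names, e.g. several comment[modification parameters] columns.
    """
    if not src_rows:
        raise ValueError("source table has no data rows")
    take = [i for i, name in enumerate(src_header) if name.strip().lower() in wanted]
    # Drop same-named columns already present in the target to avoid duplicates.
    keep = [i for i, name in enumerate(header) if name.strip().lower() not in wanted]
    kept_names = [header[i] for i in keep]
    split = next(
        (pos for pos, name in enumerate(kept_names)
         if name.strip().lower().startswith("factor value")),
        len(kept_names),
    )
    new_header = kept_names[:split] + [src_header[i] for i in take] + kept_names[split:]
    settings = [src_rows[0][i] for i in take]  # assumption: same settings for all rows
    new_rows = []
    for row in rows:
        kept = [row[i] for i in keep]
        new_rows.append(kept[:split] + settings + kept[split:])
    return new_header, new_rows


# Tiny in-memory demonstration with placeholder values.
header = ["source name", "comment[data file]", "factor value[disease]"]
rows = [["sample_A", "example_file.raw", "placeholder"]]
src_header = [
    "comment[data file]",
    "comment[cleavage agent details]",
    "comment[modification parameters]",
    "comment[modification parameters]",
    "comment[precursor mass tolerance]",
]
src_rows = [["example_file.raw", "cleavage placeholder", "mod placeholder 1",
             "mod placeholder 2", "tolerance placeholder"]]

new_header, new_rows = copy_settings(header, rows, src_header, src_rows)
print(new_header)
print(new_rows[0])
```

With real files, `read_tsv` would load your lesSDRF download and the FragPipe partial annotation, and `write_tsv` would save the merged table. Check the result by eye afterwards, since the sketch does not validate the values it copies.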
How would you go about making a full dataset annotation?
You can try this Python notebook as a start for semi-automated creation of the full SDRF table. Download it to a directory of your choosing and put the supplementary table with tumor annotations next to it. Install Jupyter, Pandas and other dependencies in a virtual environment and run the notebook. Submit a full SDRF file with your answers.
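The notebook's own code is not reproduced here. As a rough idea of what semi-automated creation can look like, the following standard-library sketch expands a small lookup table into SDRF rows, one row per combination of labeled sample and data file. Everything in it is an assumption made for illustration: the CHANNELS and FILES_BY_SET structures, the file names, the sample names and the disease values are placeholders, and only a handful of columns are shown.

```python
import csv
import io

# Hypothetical excerpt of a tumor annotation table: one entry per
# (TMT set, label) pair. Real values come from the supplementary table.
CHANNELS = [
    {"set": "set1", "label": "TMT126", "sample": "tumor_X", "disease": "subtype placeholder 1"},
    {"set": "set1", "label": "TMT127N", "sample": "tumor_Y", "disease": "subtype placeholder 2"},
]

# Hypothetical raw files belonging to each TMT set.
FILES_BY_SET = {"set1": ["set1_fraction01.raw", "set1_fraction02.raw"]}

# Values shared by every row; the remaining columns are omitted for brevity.
SHARED = {"characteristics[organism]": "Homo sapiens"}


def build_rows(channels, files_by_set, shared):
    """Create one row per (labeled sample, data file) combination."""
    rows = []
    for channel in channels:
        for data_file in files_by_set[channel["set"]]:
            row = {"source name": channel["sample"]}
            row.update(shared)
            row["characteristics[disease]"] = channel["disease"]
            row["comment[label]"] = channel["label"]
            row["comment[data file]"] = data_file
            rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-separated text, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows(CHANNELS, FILES_BY_SET, SHARED)
sdrf_text = to_tsv(rows)
print(sdrf_text)
```

The point of the design is that the per-sample information is typed once, in the lookup table, and the repetitive rows are generated from it, which reduces copy-and-paste errors compared with filling in every row by hand.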
What are your thoughts about this approach to data annotation?
Name | |
Course/Program | |
Date | |