The Cancer resource - methods summary

Summary

The cancer resource consists of two parts: 1) information on the association between genome-wide RNA expression levels and the survival of cancer patients, based on 8384 cancer patients representing 21 types of cancer, and 2) examples of protein expression patterns in cancer tissues, based on IHC for 216 tumors representing the 20 most common forms of human cancer.

Key publications

Yuan M et al. (2025) “The Human Pathology Atlas for deciphering the prognostic features of human cancers.” EBioMedicine 111:105495

Uhlen M et al. (2017) “A pathology atlas of the human cancer transcriptome.” Science 357 (6352): aan2507

What can you learn from the Cancer resource?

Learn about:

  • if the mRNA expression of a gene is prognostic for patient survival in each of the cancer types
  • if a gene is enriched in a particular cancer type (specificity)
  • the catalogue of genes elevated in each of the cancer types
  • protein expression in a cancer compared to corresponding normal tissue

Data overview

Data type | Count | Data | Coverage (nr genes)
Prognostic data | 21 | Analysis of association between TCGA RNA expression and cancer survival in 21 cancers | 14039
Prognostic data | 10 | Validation of association between RNA expression and cancer survival in 10 cancers | 14039
Protein expression (IHC) | 20 | IHC estimated protein expression in 20 cancer types | 15302
RNA expression (TCGA) | 31 | Classification of RNA expression based on 8384 samples from 31 prognostic and validation cancers corresponding to 21 cancer types | 19973
RNA expression (TCGA) | 21 | RNA expression for 8384 samples from 21 cancer types | 19973
Protein expression (CPTAC MS) | 11 | Differential expression between cancerous and matched normal tissues across 11 cancer types | 13814

How has the data been generated?

Cancer tissues used for protein expression analysis were obtained from the Department of Pathology, Uppsala University Hospital, Uppsala, Sweden as part of the sample collection governed by the Uppsala Biobank. Cases were selected after microscopical examination of representative HE sections. Cores with 1 mm diameter were subsequently obtained from corresponding tissue blocks and transferred into cancer tissue microarrays. All human tissue samples used in the present study were anonymized in accordance with approval and advisory report from the Uppsala Ethical Review Board.

Cancer patient samples used for mRNA expression and survival analysis were collected from The Cancer Genome Atlas (TCGA) project, using the initial release of the Genomic Data Commons (GDC) on June 6, 2016. Information regarding sex, age and other clinical parameters can be found here. RNA-seq data from 21 cancer types having a corresponding major cancer type with in-house IHC data were included to allow for comparisons between the protein staining data from the Human Protein Atlas and the RNA-seq data from TCGA. Only samples with both clinical information and transcriptomic data available at that time point were used in this study. For ten of the cancer types, additional RNA-seq data obtained from other sources were used to validate the findings of the survival analysis based on the TCGA data. The number of samples for the 21 TCGA cancer types and the 10 cancer types used for validation can be found here, and more information about the selection and origin of the cancer samples can be found in the Materials & Methods section of Yuan M et al. (2025).

Mass spectrometry-based quantitative proteomic data is sourced from CPTAC and the Proteomic Data Commons (PDC), part of the National Cancer Institute’s Cancer Research Data Commons (CRDC). This section displays relative protein quantities obtained from TMT11-quantified proteomic datasets from 11 cancer types. Alongside proteomic data, clinical and demographic information, such as age, sex, tumor stage, and recurrence status, is available for download from the PDC. The data from the Proteomic Data Commons (PDC) is used under the CC-BY 4.0 license, allowing sharing and adaptation with proper attribution and indication of any changes.

How has the data been analyzed?

For protein expression analysis, sections from cancer tissue microarrays were immunohistochemically stained and corresponding slides scanned to generate digital images. All images were then analyzed by pathologists and annotated with respect to staining intensity and fraction of positive cancer cells for all approved antibodies. The result of immunohistochemistry-based protein expression was then summarized as high, medium, low or not detected.

The raw sequencing data from TCGA was downloaded from this site, mapped using the Ensembl gene id available from TCGA and pTPM values were calculated. The pTPM for each gene was subsequently used for quantification of expression with a detection threshold of 1 pTPM. Genes were classified according to pTPM levels across the 21 cancer types into the following categories: (1) Cancer enriched: pTPM in a particular cancer type at least four times higher than in any other cancer type; (2) Group enriched: pTPM in a group of 2-5 cancer types at least four times higher than in any other cancer type; (3) Cancer enhanced: pTPM in one or several cancer types at least four times the mean expression of all cancer types; (4) Low cancer specificity: pTPM ≥ 1 in at least one cancer type but not elevated in any cancer type; (5) Not detected: pTPM < 1 in all cancer types.
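The five categories above can be expressed as a short decision procedure. The following is a minimal sketch in Python, not the pipeline actually used for the resource: the function name classify_gene, its parameters and the cancer type names in the example are hypothetical, and the order in which the rules are checked (not detected first, then enriched, group enriched, enhanced) is an assumption.

```python
def classify_gene(ptpm_by_cancer, fold=4.0, detection=1.0, max_group=5):
    """Assign a specificity category from one gene's pTPM per cancer type.

    ptpm_by_cancer: dict mapping cancer type name -> pTPM (hypothetical input layout).
    """
    values = sorted(ptpm_by_cancer.values(), reverse=True)

    # (5) Not detected: below the detection threshold in every cancer type.
    if values[0] < detection:
        return "Not detected"

    # (1) Cancer enriched: the top cancer type is at least `fold` times every other one.
    if len(values) > 1 and values[0] >= fold * values[1]:
        return "Cancer enriched"

    # (2) Group enriched: the weakest member of a group of 2..max_group cancer types
    # is at least `fold` times the strongest cancer type outside the group.
    for k in range(2, max_group + 1):
        if k < len(values) and values[k - 1] >= fold * values[k]:
            return "Group enriched"

    # (3) Cancer enhanced: at least one cancer type reaches `fold` times the mean.
    mean = sum(values) / len(values)
    if values[0] >= fold * mean:
        return "Cancer enhanced"

    # (4) Low cancer specificity: detected somewhere but not elevated anywhere.
    return "Low cancer specificity"


example = {"liver": 100.0, "colon": 1.0, "breast": 1.0, "lung": 1.0}
print(classify_gene(example))
```

Sorting the values once keeps each rule a simple comparison between neighbouring positions in the sorted list.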

Based on the pTPM value of each gene, patients were classified into two expression groups and the correlation between expression level and patient survival was examined. The prognosis of each group of patients was examined by Kaplan-Meier survival estimators, and the survival outcomes of the two groups were compared by log-rank tests. To choose the best pTPM cut-off for grouping the patients, every pTPM value from the 20th to the 80th percentile was used in turn to split the patients into two groups, the difference in survival outcome between the groups was tested, and the value yielding the lowest log-rank P value was selected. Both median and maximally separated Kaplan-Meier plots are presented in the Human Protein Atlas, and genes with log-rank P values less than 0.001 in the maximally separated Kaplan-Meier analysis were defined as prognostic genes. If the group of patients with high expression of a selected prognostic gene has more observed events than expected events, the gene is an unfavorable prognostic gene; otherwise, it is a favorable prognostic gene. Genes with a median expression below 1 pTPM were considered lowly expressed and were classified as unprognostic in the database even if they showed a significant prognostic effect in the survival analysis.
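The cut-off selection described above can be illustrated with a small, self-contained sketch. It is an illustration under stated assumptions rather than the code behind the resource: the function names (logrank, best_cutoff, prognostic_label) are hypothetical, patients with pTPM strictly above the cut-off are assumed to form the high expression group, and the 20th to 80th percentile range is approximated by index positions in the sorted pTPM values. The log-rank test is written out with the standard library so that the example runs without extra packages.

```python
import math
from statistics import median


def logrank(times, events, high):
    """Two-group log-rank test; returns (p_value, observed_high, expected_high)."""
    obs_high = exp_high = var = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for x in times if x >= t)                          # at risk, all
        n1 = sum(1 for x, h in zip(times, high) if h and x >= t)     # at risk, high
        d = sum(1 for x, e in zip(times, events) if e and x == t)    # events at t
        d1 = sum(1 for x, e, h in zip(times, events, high) if e and h and x == t)
        obs_high += d1
        exp_high += d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0, obs_high, exp_high
    chi2 = (obs_high - exp_high) ** 2 / var
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2)), obs_high, exp_high


def best_cutoff(ptpm, times, events, lo=0.20, hi=0.80):
    """Scan candidate cut-offs and keep the one with the lowest log-rank P value."""
    ranked = sorted(ptpm)
    n = len(ranked)
    candidates = sorted(set(ranked[int(lo * n): int(hi * n) + 1]))
    best = None
    for cut in candidates:
        high = [x > cut for x in ptpm]  # assumption: strictly above cut-off = high
        if all(high) or not any(high):
            continue
        p, observed, expected = logrank(times, events, high)
        if best is None or p < best["p"]:
            best = {"cutoff": cut, "p": p,
                    "observed_high": observed, "expected_high": expected}
    return best


def prognostic_label(ptpm, times, events, alpha=0.001, min_median=1.0):
    """Apply the P < 0.001 and median pTPM >= 1 rules described in the text."""
    result = best_cutoff(ptpm, times, events)
    if result is None or result["p"] >= alpha or median(ptpm) < min_median:
        return "unprognostic", result
    if result["observed_high"] > result["expected_high"]:
        return "unfavorable", result
    return "favorable", result
```

Because the cut-off is chosen to minimize the P value, the reported P value is the most optimistic one over all candidate splits, which is one reason a strict threshold such as 0.001 is a sensible choice.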

For the mass spectrometry datasets, CPTAC harmonized the proteomic datasets using consistent workflows across all cancer types. Briefly, protein abundance was quantified using Tandem Mass Tag (TMT11) labeling, with pre-calculated protein assemblies and peptide ratios used for downstream analysis. Normal samples were included for certain cancers to enable comparison with healthy tissue. Protein expression data was analyzed using log-transformed ratios obtained from the protein builds of each respective dataset. Samples of poor quality and samples annotated as excluded by CPTAC were removed. Differential expression between cancerous and matched normal tissues was calculated as fold changes and evaluated by Wilcoxon rank-sum tests, with p-values adjusted for multiple testing (Bonferroni). Proteins with significant differences were identified, revealing cancer-specific dysregulation.
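The differential expression step can be sketched as follows. This is a simplified illustration, not the CPTAC or Human Protein Atlas implementation: the function names and the input layout (a dict from protein name to a list of log2 ratios per sample) are hypothetical, the fold change is assumed to be the difference between the median log2 ratios of tumor and normal samples, and the Wilcoxon rank-sum test uses the normal approximation without tie correction of the variance. In practice a library routine such as scipy.stats.ranksums would typically be used instead of the hand-written test.

```python
import math
from statistics import median


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum P value (normal approximation, average ranks for ties)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2))


def differential_expression(tumor, normal):
    """Per-protein log2 fold change, raw P value and Bonferroni-adjusted P value."""
    proteins = sorted(set(tumor) & set(normal))
    n_tests = len(proteins)
    results = {}
    for protein in proteins:
        # Inputs are already log2 ratios, so a difference is a log2 fold change.
        log2_fc = median(tumor[protein]) - median(normal[protein])
        p = rank_sum_p(tumor[protein], normal[protein])
        results[protein] = {
            "log2_fc": log2_fc,
            "p": p,
            "p_adj": min(1.0, p * n_tests),  # Bonferroni correction
        }
    return results
```

The Bonferroni correction multiplies each P value by the number of proteins tested and caps the result at 1, which is conservative when thousands of proteins are compared.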

What is presented in the resource?

Kaplan-Meier survival plots were generated to show the prognostic association between the RNA expression of each protein-coding gene and patient survival in each of the 31 cancer types in the TCGA and validation sets. A summary of significant prognostic results is provided on the gene summary page. In addition, the Kaplan-Meier survival plots as well as a scatter plot showing the correlation between RNA expression of the gene and patient survival in a specific cancer type are shown on a cancer type specific gene summary page. The page is interactive, and users can select a subgroup of patients based on, for example, tumor stage and immediately generate plots for the selected subgroup on the website. The user can also apply any specific expression cut-off (pTPM value) to produce different Kaplan-Meier and scatter plots. An example of cancer specific Kaplan-Meier and scatter plots for a gene is shown below.



The RNA expression levels were summarized across 31 cancer types from the TCGA and validation sets for 19973 protein-coding genes. The results are presented as exemplified below for a gene enriched in liver cancer.


Similarly, the protein levels were determined across 20 cancer types for all protein-coding genes, and the results are presented as shown for the same gene as above in the bar plot below.


Moreover, to exemplify protein expression patterns both within one cancer type and between different types of cancer, a multitude of IHC images for 15302 protein-coding genes in 20 human cancer types are also provided in this resource. An example of an IHC image for a gene from a selected cancer patient is shown below.


Results from the CPTAC pan-cancer proteomic analyses are visualized on the Human Protein Atlas website as boxplots and interactive volcano plots that compare protein expression between cancers and normal tissues.

Clicking on a cancer type in the Cancer proteomes box displays a volcano plot comparing protein expression between that cancer type and the corresponding normal tissue, based on tandem mass tag (TMT) mass spectrometry analysis from the CPTAC dataset, as exemplified for colon cancer in the plot below. The x-axis represents the log2 fold change of reporter intensity between cancer and normal, with positive values indicating proteins upregulated in cancer tissue and negative values indicating those downregulated. The y-axis represents the -log10 adjusted p-value from a Wilcoxon rank-sum test, where higher values indicate stronger statistical significance after multiple testing correction.

Proteins highlighted in red are significantly upregulated in cancer tissue, while those in blue are significantly downregulated compared to normal tissue. Examples from the colon cancer dataset include a set of upregulated proteins (GMPS, HSPA14, TAOK1, TBCE, and FEN1) and a set of downregulated proteins (CLEC4M, CLEC4G, ITGA9, OIT3 and TNS2). Gray points represent non-significant proteins based on the log2 fold change and adjusted p-value thresholds.
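How a single protein is placed and coloured in such a volcano plot can be sketched in a few lines. The thresholds used by the resource are not stated here, so the values below (an absolute log2 fold change of at least 1 and an adjusted p-value below 0.05) are purely illustrative assumptions, and the function name volcano_point is hypothetical.

```python
import math


def volcano_point(log2_fc, p_adj, fc_threshold=1.0, p_threshold=0.05):
    """Return (x, y, colour) for one protein; both thresholds are assumed values."""
    x = log2_fc
    # y-axis: -log10 of the adjusted p-value; a p-value of 0 maps to infinity.
    y = math.inf if p_adj == 0 else -math.log10(p_adj)
    if p_adj < p_threshold and log2_fc >= fc_threshold:
        colour = "red"    # significantly upregulated in cancer tissue
    elif p_adj < p_threshold and log2_fc <= -fc_threshold:
        colour = "blue"   # significantly downregulated in cancer tissue
    else:
        colour = "gray"   # fails the fold change or the p-value threshold
    return x, y, colour


for fc, p in [(2.0, 0.001), (-1.5, 0.01), (0.3, 0.001), (2.0, 0.5)]:
    print(volcano_point(fc, p))
```

A protein must pass both thresholds to be coloured, so a large fold change with a weak adjusted p-value, or a strong p-value with a small fold change, remains gray.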