Cell Line - Methods summary


The Cell Lines section contains information on genome-wide RNA expression profiles of human protein-coding genes in human cell lines. The transcriptomics analysis covers 1206 human cell lines, corresponding to 28 cancer types, one non-cancerous group and one uncategorised group of cell lines, and includes classification based on specificity, distribution and expression clusters. Based on the transcriptomics profiles, cell lines were evaluated for their consistency with the corresponding TCGA (The Cancer Genome Atlas) disease cohort to help researchers select the best cell lines as in vitro models for cancer research. In addition, based on biological data mining, the relative activity of 14 cancer-related pathways and 43 cytokines was inferred for each cell line and is presented to characterize the phenotype of the cell line.

Key publication

Jin H et al. (2023) "Systematic transcriptional analysis of human cell lines for gene expression landscape and tumor representation" Nat Commun 14, 5417.

What can you learn from the Cell Lines section?

Learn about:

  • if a gene is enriched in cell lines from a particular cancer type (specificity)
  • which genes have a similar expression profile across the cell lines (expression cluster)
  • the catalogue of genes elevated in each of the cell lines
  • which cell line has the expression profile most consistent with its corresponding TCGA disease cohort (i.e., the best cell lines for cancer study)
  • cancer-related pathway and cytokine activity of each cell line

How has the data been generated?

A genome-wide expression analysis of 1206 human cell lines, including 1132 cancer cell lines, was performed using RNA-seq with early-split samples as duplicates. Here, RNA-seq profiles of cell lines generated by the HPA (n = 69) and the Cancer Cell Line Encyclopedia (CCLE 2019; n = 1019) were integrated; for the 33 cell lines common to both sources, gene expression was averaged.
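
The integration step described above can be illustrated with a minimal Python sketch. It assumes each source is held as a nested dictionary of the form {cell line: {gene: expression value}}; the function name merge_expression, the data layout and the toy values are hypothetical and are not taken from the HPA pipeline.

```python
from statistics import mean

def merge_expression(hpa, ccle):
    """Combine two {cell_line: {gene: value}} tables.

    Cell lines present in both sources get the per-gene mean of the two
    profiles; all other cell lines are copied through unchanged.
    """
    merged = {}
    for cell_line in sorted(set(hpa) | set(ccle)):
        profiles = [src[cell_line] for src in (hpa, ccle) if cell_line in src]
        # Assumption: only genes quantified in every available profile are kept.
        genes = set(profiles[0]).intersection(*profiles[1:])
        merged[cell_line] = {g: mean(p[g] for p in profiles) for g in sorted(genes)}
    return merged

# Toy values for illustration only, not real measurements.
hpa = {"A-549": {"TP53": 10.0, "EGFR": 40.0}, "HEK293": {"TP53": 8.0, "EGFR": 2.0}}
ccle = {"A-549": {"TP53": 14.0, "EGFR": 44.0}, "MCF-7": {"TP53": 6.0, "EGFR": 1.0}}
merged = merge_expression(hpa, ccle)
```

In this toy example only "A-549" occurs in both tables, so it is the only cell line whose values are averaged.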

How has the data been analyzed?

The transcript abundance of each protein-coding gene was estimated using the average TPM value of the individual samples for each cell line. The transcriptomics data was then used to

  • (i) classify the gene expression specificity in different cancer types and the distribution across all cell lines
  • (ii) evaluate the consistency between the cell lines and the corresponding TCGA disease cohort
  • (iii) estimate the cancer-related pathway (PROGENy) and cytokine (CytoSig) activity (with non-protein-coding genes included for calculation)
  • (iv) identify the most highly correlated genes and classify all genes according to their cell line-specific expression

What is presented in the section?

The RNA expression levels were determined for all protein-coding genes (n = 20162) across the 1206 human cell lines and the results are presented on the gene summary page of the Cell Lines section as exemplified in the figure below.

The cell line category-specific pages, which are accessed by clicking on the pie chart or the colored boxes on the Cell Line section page, display plots of the cancer-related pathway (PROGENy) and cytokine (CytoSig) activity relative to a baseline defined as the average expression of all analyzed cell lines.
For 26 TCGA disease cohorts, a ranked list of the cell lines based on gene expression similarity to the corresponding disease cohort is shown. This can serve as a reference for cell line selection for in vitro experiments when studying a specific cancer type.

How has the classification of all protein-coding genes been done?

A genome-wide classification of the protein-coding genes with regard to distribution across all cancer cell lines, as well as specificity across the 28 cancer types, has been performed using between-sample normalized data (nTPM). The results can serve as a reference for researchers interested in expression profiles of human cell lines at both the disease level and the cell line level. The genes were classified according to specificity into (i) cancer enriched genes, with at least four-fold higher expression levels in one cell line cancer type as compared with any other analyzed cell line cancer type; (ii) group enriched genes, with enriched expression in a small number of cell line cancer types (2 to 10); and (iii) cancer enhanced genes, with only moderately elevated expression. In addition, all genes were classified according to distribution, in which each gene is scored by its presence (expression level higher than a cut-off) in the cell lines. The cell line cancer enriched and group enriched genes are displayed in the interactive plot below, in which clicking on the red and orange circles results in gene lists for the corresponding enriched and group enriched genes, respectively.

Finally, a new classification has been introduced in which genes are clustered based on similarity in expression across the cell lines. The results are presented as an interactive UMAP plot in which mouse-over displays general information for the clusters, and clicking on a cluster displays more information and plots for that specific cluster, as well as a clickable list of all clusters.
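
The specificity and distribution rules in this section can be illustrated with a small Python sketch. Only the four-fold criterion for cancer enriched genes and the group size of 2 to 10 are taken from the text. The exact group enriched rule used here (lowest group member at least four-fold above the highest type outside the group), the default cut-off of 1.0, and the names classify_specificity, detected_fraction, fold, max_group and cutoff are assumptions made for illustration. Cancer enhanced genes are not modelled because no threshold is given in the text.

```python
def classify_specificity(expr_by_type, fold=4.0, max_group=10):
    """Classify one gene from its expression per cancer type.

    expr_by_type: {cancer_type: nTPM}, with at least two cancer types.
    Returns (category, list_of_enriched_cancer_types).
    """
    ranked = sorted(expr_by_type.items(), key=lambda kv: kv[1], reverse=True)
    values = [v for _, v in ranked]
    # Cancer enriched: the top type is >= `fold` times every other type.
    if values[0] > 0 and values[0] >= fold * values[1]:
        return "cancer enriched", [ranked[0][0]]
    # Group enriched (assumed rule): smallest group of 2..max_group types whose
    # lowest member is >= `fold` times the highest type outside the group.
    for size in range(2, min(max_group, len(values) - 1) + 1):
        if values[size - 1] > 0 and values[size - 1] >= fold * values[size]:
            return "group enriched", [name for name, _ in ranked[:size]]
    # Cancer enhanced and the remaining categories are not modelled here.
    return "other", []

def detected_fraction(expr_by_cell_line, cutoff=1.0):
    """Fraction of cell lines in which the gene exceeds the cut-off.

    The default cut-off of 1.0 is an assumption; the text only says "a cut-off".
    """
    detected = sum(1 for v in expr_by_cell_line.values() if v > cutoff)
    return detected / len(expr_by_cell_line)

# Toy values for illustration only.
single = classify_specificity({"lung": 80.0, "breast": 10.0, "skin": 5.0})
group = classify_specificity({"lung": 80.0, "breast": 60.0, "skin": 5.0, "liver": 2.0})
flat = classify_specificity({"lung": 10.0, "breast": 9.0, "skin": 8.0})
fraction = detected_fraction({"a": 5.0, "b": 0.2, "c": 1.5, "d": 0.0})
```

Sorting the cancer types once by expression keeps both checks simple: each rule only compares one value with its immediate successor in the ranked list.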

How was the similarity of the cell lines to the corresponding TCGA cancer cohorts analysed?

The 1132 cancer cell lines were analyzed for how well they represent the corresponding TCGA disease cohorts. Because tumor samples also harbor immune and other cells, TCGA samples with a low tumor purity score (< 0.7, https://www.nature.com/articles/ncomms3612) were excluded from the analysis. The similarity between cell lines and the corresponding TCGA cohort was estimated by two different approaches:

  • (i) Spearman’s correlation coefficient (ρ) between every cancer cell line and its corresponding TCGA cohort was estimated at the gene level. For this, the nTPM values of each gene were averaged per TCGA cohort. Then, for each TCGA cohort, Spearman’s ρ was calculated between the averaged nTPM values and the nTPM values of each disease-matched cell line, based on the 20,053 protein-coding genes in common.
  • (ii) The enrichment of the TCGA cohort elevated genes (i.e., the union of enriched, group enriched, and enhanced genes in the TCGA cohort) in cell lines was evaluated by gene set enrichment analysis (GSEA). The concept is that genes with elevated expression in a TCGA cohort can be considered the cohort signature, and their high expression should be reflected by cell line models. To test this, gene expression was first averaged per disease, resulting in the mean expression for each of the 28 cell line cancer types. These per-disease means were then averaged again to give the disease baseline expression. Next, for every cell line, the fold change of every gene relative to the disease baseline expression was calculated and log2-transformed. Finally, for each cell line, the gene log2 fold changes were sorted from high to low, and GSEA of the TCGA cohort elevated genes was run against the sorted gene list. Cell lines showing high concordance with the matched TCGA cancer type are expected to present high log2 fold changes of the elevated genes of that TCGA cohort relative to the disease baseline expression. The results are represented as the normalized enrichment score (NES), with a positive value showing high consistency between a cell line and its disease-matched TCGA cohort.

The cell lines were then ranked from high to low by Spearman’s ρ and by NES, respectively. Finally, the two ranking lists were combined, and the cell lines were reordered according to their average rank.
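
The numerical steps of the two approaches can be sketched in Python using only the standard library. This is an illustrative sketch, not the HPA implementation: all function names are hypothetical, the pseudocount in the fold change is an assumption added to avoid division by zero, and GSEA itself is not implemented, so the NES values are assumed to come from a separate GSEA tool.

```python
import math

def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def disease_baseline(mean_by_disease):
    """Average the per-disease mean profiles {disease: {gene: value}} per gene."""
    genes = next(iter(mean_by_disease.values())).keys()
    n = len(mean_by_disease)
    return {g: sum(p[g] for p in mean_by_disease.values()) / n for g in genes}

def log2_fold_changes(cell_line, baseline, pseudocount=1.0):
    """log2 fold change per gene relative to the disease baseline.

    The pseudocount is an assumption, not something stated in the text.
    """
    return {g: math.log2((cell_line[g] + pseudocount) / (baseline[g] + pseudocount))
            for g in cell_line}

def combined_ranking(rho_by_line, nes_by_line):
    """Rank by rho and by NES (high to low), then order by average rank."""
    lines = list(rho_by_line)
    rho_rank = ranks([-rho_by_line[c] for c in lines])
    nes_rank = ranks([-nes_by_line[c] for c in lines])
    avg = {c: (r + n) / 2 for c, r, n in zip(lines, rho_rank, nes_rank)}
    return sorted(lines, key=lambda c: avg[c])

# Toy values for illustration only; cell line names are placeholders.
baseline = disease_baseline({"lung": {"G": 2.0}, "skin": {"G": 4.0}})
lfc = log2_fold_changes({"G": 7.0}, {"G": 1.0})
order = combined_ranking({"A": 0.9, "B": 0.8, "C": 0.7},
                         {"A": 1.0, "B": -0.5, "C": 2.5})
```

In the toy ranking, cell line "A" is first by ρ and second by NES, giving it the best average rank, while "C" overtakes "B" because of its higher NES.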

How has the pathway and cytokine analysis been done?


For all 1206 analyzed cell lines, the activity of a total of 14 cancer-related pathways was inferred using PROGENy, a package that relies on biological data mining of publicly available data to obtain cancer-related pathway responsive genes for human and mouse (Schubert M et al. (2018)). For this, read counts for HPA and CCLE cell lines quantified by Kallisto were re-analyzed without filtering out the non-protein-coding genes, to ensure broader coverage of cancer pathway responsive genes. The read counts of the 1206 cell lines were normalized by DESeq2 with respect to the size factor of each cell line and were further transformed by variance stabilizing transformation into log2 space. To calculate the relative pathway activities across all cell lines, the normalized values were centered by subtracting the mean value per gene. Then, the R package decoupleR was used to calculate the relative pathway activities based on the top 100 signature genes per pathway obtained from the R package progeny (Schubert M et al. (2018)). By default, decoupleR was executed using the top-performing benchmarked methods (i.e., mlm for multivariate linear model, ulm for univariate linear model, and wsum for weighted sum), and the results were integrated to obtain a consensus z-score representing the pathway activity. Here, a consensus z-score above 1 or below -1 was considered significant.
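
The per-gene centering and the significance threshold can be illustrated with a short Python sketch. It is not decoupleR and does not reproduce its mlm, ulm or consensus calculations. The weighted sum below is a simplified stand-in in the spirit of the wsum method, and the function names, data layout, gene names and weights are all hypothetical.

```python
from statistics import mean

def center_per_gene(expr):
    """expr: {gene: {cell_line: normalized value}}.

    Subtract each gene's mean across cell lines, so values become relative
    to the average of all analyzed cell lines.
    """
    centered = {}
    for gene, by_line in expr.items():
        m = mean(by_line.values())
        centered[gene] = {c: v - m for c, v in by_line.items()}
    return centered

def weighted_sum_activity(centered, weights):
    """Sum of weight * centered expression over a pathway's signature genes.

    Simplified stand-in for a weighted-sum score; signature genes missing
    from the expression table are skipped.
    """
    lines = next(iter(centered.values())).keys()
    return {c: sum(w * centered[g][c] for g, w in weights.items() if g in centered)
            for c in lines}

def is_significant(consensus_z, threshold=1.0):
    """Apply the rule from the text: z above 1 or below -1 is significant."""
    return abs(consensus_z) > threshold

# Toy values for illustration only.
expr = {"G1": {"A": 1.0, "B": 3.0}, "G2": {"A": 4.0, "B": 0.0}}
centered = center_per_gene(expr)
activity = weighted_sum_activity(centered, {"G1": 1.0, "G2": -0.5})
```

Because the values are centered per gene before scoring, a positive score means higher activity than the average cell line rather than high activity in absolute terms.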


The activity of 43 CytoSig cytokines was inferred from the gene expression profiles of the 1206 cell lines using the package CytoSig (Jiang P et al. (2021)). Gene expression data were processed in the same way as for the PROGENy analysis, and the DESeq2-normalized expression values were centered per gene as suggested. The CytoSig program was executed with 10,000 permutations, and the results are presented as z-scores representing the relative cytokine activities, with a p-value < 0.05 considered significant.
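
To show how a permutation-based z-score and p-value can be obtained in general, the following Python sketch shuffles gene labels and compares the observed score with the resulting null distribution. It is not the CytoSig algorithm: a plain dot product between the centered expression and a signature replaces CytoSig's regression model, and the function name, arguments and toy values are hypothetical.

```python
import random
from statistics import mean, pstdev

def permutation_test(expression, signature, n_perm=10000, seed=0):
    """Permutation z-score and two-sided empirical p-value.

    expression, signature: equal-length lists aligned by gene.
    The score is a dot product (an assumption standing in for a regression
    coefficient); the null is built by shuffling expression across genes.
    """
    rng = random.Random(seed)
    observed = sum(e * s for e, s in zip(expression, signature))
    shuffled = list(expression)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(sum(e * s for e, s in zip(shuffled, signature)))
    z = (observed - mean(null)) / pstdev(null)
    p = sum(1 for v in null if abs(v) >= abs(observed)) / n_perm
    return z, p

# Toy centered expression values and signature weights, for illustration only.
expression = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0, 0.5, -0.5]
signature = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 0.0, 0.0]
z, p = permutation_test(expression, signature)
```

In this toy case the expression values line up perfectly with the signature, so very few shuffles reach the observed score and the result passes the p < 0.05 rule described above.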