Cell Line - Methods summary
The Cell Lines section contains information on genome-wide RNA expression profiles of human protein-coding genes in human cell lines. The transcriptomics analysis covers 1206 human cell lines, corresponding to 28 cancer types, one non-cancerous group and one uncategorised group of cell lines, and includes classification based on specificity, distribution and expression clusters. Based on the transcriptomics profiles, cell lines were evaluated for their consistency with the corresponding TCGA (The Cancer Genome Atlas) disease cohort to help researchers select the best cell lines as in vitro models for cancer research. In addition, based on biological data mining, the relative activity of 14 cancer-related pathways and 43 cytokines was inferred for each cell line and is presented to characterize the phenotype of the cell line.
Jin H et al. (2023) "Systematic transcriptional analysis of human cell lines for gene expression landscape and tumor representation" Nat Commun 14, 5417.
What can you learn from the Cell Lines section?
How has the data been generated?
A genome-wide expression analysis of 1206 human cell lines, including 1132 cancer cell lines, was performed using RNA-seq with early-split samples as duplicates. Here, RNA-seq profiles of cell lines generated by the HPA (n = 69) and the Cancer Cell Line Encyclopedia (CCLE 2019; n = 1019) were integrated, with the 33 common cell lines averaged for their gene expression.
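The integration step described above, in which cell lines profiled by both sources are averaged, can be sketched as follows. This is a minimal illustration and not the pipeline used for the atlas: the function name `integrate_sources`, the gene-by-cell-line table layout and the toy values are all assumptions made for the example.

```python
import pandas as pd


def integrate_sources(hpa: pd.DataFrame, ccle: pd.DataFrame) -> pd.DataFrame:
    """Combine two gene x cell-line expression tables (hypothetical helper).

    Cell lines present in both tables are averaged per gene;
    cell lines present in only one table are kept unchanged.
    """
    # Place the two tables side by side, aligned on the gene index.
    stacked = pd.concat([hpa, ccle], axis=1)
    # Group columns that share a cell-line name and average them.
    return stacked.T.groupby(level=0).mean().T


# Toy data (assumed values, for illustration only).
genes = ["GENE1", "GENE2"]
hpa = pd.DataFrame({"A-431": [10.0, 0.0], "HeLa": [4.0, 2.0]}, index=genes)
ccle = pd.DataFrame({"HeLa": [6.0, 4.0], "K-562": [1.0, 8.0]}, index=genes)

merged = integrate_sources(hpa, ccle)
print(merged)
```

In this toy case only HeLa is shared, so its two profiles are averaged while A-431 and K-562 pass through unchanged.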
How has the data been analyzed?
The transcript abundance of each protein-coding gene was estimated using the average TPM value of the individual samples for each cell line. The transcriptomics data was then used to
What is presented in the section?
The RNA expression levels were determined for all protein-coding genes (n = 20162) across the 1206 human cell lines and the results are presented on the gene summary page of the Cell Lines section as exemplified in the figure below.
The cell line category-specific pages, which are accessed by clicking on the pie chart or the colored boxes on the Cell Line section page, display plots of the cancer-related pathway (PROGENy) and cytokine (CytoSig) activity, with the average expression of all analyzed cell lines as the baseline.
How has the classification of all protein-coding genes been done?
A genome-wide classification of the protein-coding genes with regard to distribution across all cancer cell lines, as well as specificity across the 28 cancer types, has been performed using between-sample normalized data (nTPM). The results can serve as a reference for researchers interested in expression profiles of human cell lines at both the disease level and the cell line level. The genes were classified according to specificity into (i) cancer enriched genes, with at least four-fold higher expression levels in one cell line cancer type as compared with any other analyzed cell line cancer type; (ii) group enriched genes, with enriched expression in a small number of cell line cancer types (2 to 10); and (iii) cancer enhanced genes, with only moderately elevated expression. In addition, all genes were classified according to distribution, in which each gene is scored according to its presence (expression level higher than a cut-off) in the cell lines. The cell line cancer enriched and group enriched genes are displayed in the interactive plot below, in which clicking on the red and orange circles results in gene lists for the corresponding enriched and group enriched genes, respectively.
Finally, a new classification has been introduced in which genes are clustered based on similarity in expression across the cell lines. The results are presented as an interactive UMAP plot in which mouse-over displays general information for the clusters, and clicking on a cluster displays more information and plots for that specific cluster, as well as a clickable list of all clusters.
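As an illustration of the specificity classification described above, the four-fold rule for cancer enriched genes can be sketched as follows. This is a simplified example under stated assumptions: the function name `cancer_enriched_type`, the input format (one nTPM value per cancer type) and the toy values are hypothetical, and the group enriched and cancer enhanced categories are not implemented because their exact thresholds are not given here.

```python
from typing import Dict, Optional

FOLD = 4.0  # four-fold threshold stated in the text


def cancer_enriched_type(ntpm_by_type: Dict[str, float]) -> Optional[str]:
    """Return the cancer type in which a gene is enriched, or None.

    A gene counts as cancer enriched when its expression in one cancer
    type is at least FOLD times its expression in every other type,
    which is checked against the second-highest value.
    """
    ranked = sorted(ntpm_by_type.items(), key=lambda kv: kv[1], reverse=True)
    (top_type, top_value), (_, second_value) = ranked[0], ranked[1]
    if top_value > 0 and top_value >= FOLD * second_value:
        return top_type
    return None


# Toy values (assumed, for illustration only).
gene_a = {"Leukemia": 40.0, "Lymphoma": 8.0, "Glioma": 1.0}
gene_b = {"Leukemia": 40.0, "Lymphoma": 12.0, "Glioma": 1.0}

print(cancer_enriched_type(gene_a))  # 40 >= 4 * 8, so enriched in Leukemia
print(cancer_enriched_type(gene_b))  # 40 < 4 * 12, so not cancer enriched
```

Comparing only against the second-highest value is sufficient, since any type that passes that comparison also passes it for all lower-expressed types.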
How was the similarity of the cell lines to the corresponding TCGA cancer cohorts analysed?
The 1132 cancer cell lines were analyzed for how well they represent the corresponding TCGA disease cohorts. Because tumor samples also harbor immune and other cells, TCGA samples with a low tumor purity score (< 0.7, https://www.nature.com/articles/ncomms3612) were excluded from the analysis. The similarity between cell lines and the corresponding TCGA cohort was estimated by two different approaches:
How has the pathway and cytokine analysis been done?
For all 1206 analyzed cell lines, the activity of a total of 14 cancer-related pathways was inferred using PROGENy, a package that relies on biological data mining of publicly available data to obtain cancer-related pathway-responsive genes for human and mouse (Schubert M et al. (2018)). For this, read counts for HPA and CCLE cell lines quantified by Kallisto were re-analyzed without filtering out the non-protein-coding genes, to ensure broader coverage of cancer pathway-responsive genes. The read counts of the 1206 cell lines were normalized by DESeq2 with respect to the size factor of each cell line and were further transformed by variance stabilizing transformation into log2 space. To calculate the relative pathway activities across all cell lines, the normalized values were centered by subtracting the mean value per gene. Then, the R package decoupleR was used to calculate the relative pathway activities based on the top 100 signature genes per pathway obtained from the R package progeny (Schubert M et al. (2018)). decoupleR was executed with its default set of top-performing benchmarked methods (i.e., mlm for multivariate linear model, ulm for univariate linear model, and wsum for weighted sum), and the results were integrated to obtain a consensus z-score representing the pathway activity. Here, a consensus z-score above 1 or below -1 was considered significant.
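Two of the steps above, the per-gene centering and the significance cut-off on the consensus z-score, can be illustrated with a short NumPy sketch. It does not reproduce DESeq2, decoupleR or the consensus calculation itself: the plain weighted sum below is only a simplified stand-in for a signature-based score, and all function names, signature weights and values are assumptions made for the example.

```python
import numpy as np


def center_per_gene(expr: np.ndarray) -> np.ndarray:
    """Subtract each gene's mean across cell lines (rows are genes)."""
    return expr - expr.mean(axis=1, keepdims=True)


def weighted_sum_scores(centered: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Simplified stand-in score: signature weights (genes x pathways)
    applied to centered expression (genes x cell lines)."""
    return weights.T @ centered


def flag_significant(z: np.ndarray, cutoff: float = 1.0) -> np.ndarray:
    """Mark scores strictly above +cutoff or strictly below -cutoff."""
    return np.abs(z) > cutoff


# Toy normalized expression: 2 genes x 3 cell lines (assumed values).
expr = np.array([[1.0, 2.0, 3.0],
                 [4.0, 4.0, 4.0]])
# Toy signature weights: 2 genes x 1 pathway (assumed values).
weights = np.array([[1.0],
                    [-1.0]])

centered = center_per_gene(expr)
scores = weighted_sum_scores(centered, weights)

# Consensus z-scores would come from decoupleR; these are placeholders.
consensus_z = np.array([-1.5, 0.2, 1.0])
significant = flag_significant(consensus_z)
print(centered, scores, significant, sep="\n")
```

Centering means that each score is read relative to the average of all analyzed cell lines, so a gene with identical expression in every cell line contributes nothing to any score.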
The activity of 43 CytoSig cytokines was inferred from the gene expression profiles of the 1206 cell lines using the package CytoSig (Jiang P et al. (2021)). Gene expression data were processed in the same way as for the PROGENy analysis, and the DESeq2-normalized expression values were likewise centered per gene, as suggested. The CytoSig program was executed with 10,000 permutations, and the results are presented as z-scores representing the relative cytokine activities, with a p-value < 0.05 considered significant.