Normalization of transcriptomics data

For both the HPA and GTEx transcriptomics datasets, the average TPM value of all individual samples for each human tissue or human cell type was used to estimate the gene expression level. To be able to combine the datasets into consensus transcript expression levels, a pipeline was set up to normalize the data for all samples. In brief, all TPM values per sample were scaled to a sum of 1 million TPM (denoted pTPM) to compensate for the non-coding transcripts that had been previously removed. Next, all TPM values of all samples within each data source (HPA + GTEx human tissues, HPA immune cell types, HPA cell lines) were normalized separately using Trimmed mean of M values (TMM) to allow for between-sample comparisons. The resulting normalized transcript expression values, denoted nTPM, were calculated for each gene in every sample. nTPM values below 0.1 are not visualized on the Atlas sections.
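
The pTPM rescaling step described above can be sketched as follows. This is a minimal illustration only, not the Atlas pipeline: the function name to_ptpm and the example gene names are hypothetical, and the subsequent TMM normalization is not reproduced here.

```python
def to_ptpm(tpm_by_gene):
    """Rescale one sample's TPM values so that they sum to 1,000,000 (pTPM)."""
    total = sum(tpm_by_gene.values())
    if total <= 0:
        raise ValueError("sample has no expression signal to rescale")
    factor = 1_000_000 / total
    return {gene: value * factor for gene, value in tpm_by_gene.items()}


# Hypothetical sample whose values sum to 800,000 after non-coding
# transcripts were removed.
sample = {"GENE_A": 300_000.0, "GENE_B": 100_000.0, "GENE_C": 400_000.0}
ptpm = to_ptpm(sample)
```

After rescaling, the values of each sample again sum to 1 million, so samples that lost different amounts of non-coding signal are put on the same scale before TMM is applied.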

For the brain dataset, an additional normalization step corrected for inter-individual variation by linear regression, using the removeBatchEffect function in the R package limma with subject as the batch parameter. To reduce the technical variation between the MGI and Illumina platforms, 19 reference samples were included and run on both platforms. Intensity normalization based on these reference samples was then performed to minimize technical variation between the two platforms.

Consensus transcript expression levels for each gene were summarized in 51 human tissues based on transcriptomics data from the two sources HPA and GTEx. The consensus nTPM value for each gene and tissue type represents the maximum nTPM value based on HPA and GTEx. For tissues with multiple sub-tissues (brain regions, immune cells, lymphoid tissues and intestine), the maximum of all sub-tissues is used for the tissue type. The total number of tissue types in the human tissue consensus set is 37.
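
The consensus rule above (maximum across the two sources, then maximum across sub-tissues) can be sketched for a single gene as follows. This is an illustrative sketch under stated assumptions: the function name consensus_ntpm, the sub-tissue mapping and the example values are all hypothetical.

```python
def consensus_ntpm(hpa, gtex, tissue_groups):
    """Return consensus nTPM per tissue type for one gene.

    hpa, gtex: dicts mapping tissue (or sub-tissue) name to nTPM.
    tissue_groups: dict mapping a sub-tissue name to its tissue type;
    tissues that are not listed are treated as their own tissue type.
    """
    consensus = {}
    for source in (hpa, gtex):
        for tissue, value in source.items():
            group = tissue_groups.get(tissue, tissue)
            # Keep the maximum over both sources and over all sub-tissues.
            consensus[group] = max(consensus.get(group, 0.0), value)
    return consensus


# Hypothetical values for one gene.
hpa = {"liver": 12.0, "amygdala": 3.0, "cerebellum": 8.5}
gtex = {"liver": 15.5, "amygdala": 4.0}
groups = {"amygdala": "brain", "cerebellum": "brain"}
result = consensus_ntpm(hpa, gtex, groups)
```

In this example the liver value comes from GTEx (15.5), while the brain value is the maximum over both brain regions and both sources (8.5).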

The FANTOM5 dataset was normalized separately on the sample level using TMM. The normalized Tags Per Million for each gene were calculated based on the average of all individual samples for each human tissue.

Mouse and pig transcriptomic data generated by the HPA in collaboration with BGI were normalized separately, according to the same procedure used for human tissues and cell types; no limma adjustment was performed on the mouse and pig data. Consensus transcript expression levels are summarized into 13 brain regions for mouse brain and 15 regions for pig brain, where sub-regional samples were combined and the maximum of the sub-regions was used for the brain region.

Single cell type clusters were normalized separately from other transcriptomics datasets using TMM. To generate expression values per cell type, clusters were aggregated per cell type by first calculating the weighted mean nTPM in all cells with the same cluster annotation within a dataset. The values for the same cell type in different datasets were then averaged to a single aggregated value. Only clusters with medium or high reliability were included, and clusters containing mixed cell types, Neutrophils and Platelets were excluded.
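
The two-step aggregation above can be sketched for one gene as follows. The text does not state what the weights are; this sketch assumes that each cluster is weighted by its number of cells. The function name aggregate_cell_types and the record fields (dataset, cell_type, ntpm, n_cells) are hypothetical, and the reliability and exclusion filters are assumed to have been applied beforehand.

```python
from collections import defaultdict


def aggregate_cell_types(clusters):
    """Aggregate cluster-level nTPM values to one value per cell type.

    clusters: list of dicts with keys "dataset", "cell_type", "ntpm", "n_cells".
    Step 1: weighted mean per (dataset, cell type), weighted by cell count
            (the choice of weight is an assumption).
    Step 2: plain mean of the per-dataset values for each cell type.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [weighted sum, total cells]
    for c in clusters:
        key = (c["dataset"], c["cell_type"])
        sums[key][0] += c["ntpm"] * c["n_cells"]
        sums[key][1] += c["n_cells"]

    per_type = defaultdict(list)
    for (_dataset, cell_type), (weighted_sum, n_cells) in sums.items():
        per_type[cell_type].append(weighted_sum / n_cells)

    return {ct: sum(vals) / len(vals) for ct, vals in per_type.items()}


# Hypothetical clusters for one gene.
clusters = [
    {"dataset": "liver", "cell_type": "hepatocytes", "ntpm": 10.0, "n_cells": 100},
    {"dataset": "liver", "cell_type": "hepatocytes", "ntpm": 20.0, "n_cells": 300},
    {"dataset": "other", "cell_type": "hepatocytes", "ntpm": 12.5, "n_cells": 50},
]
aggregated = aggregate_cell_types(clusters)
```

Here the two liver clusters give a weighted mean of 17.5, which is then averaged with the single value of 12.5 from the second dataset.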

Classification of transcriptomics data

The consensus transcriptomics data was used to classify all genes by their tissue-specific, single cell type-specific, brain region-specific, immune cell-specific or cell line-specific expression according to two different schemas: specificity category and distribution category. These are defined based on the total set of all nTPM values in 40 tissues, 154 single cell types, 13 main regions of each mammalian brain, 18 immune cell types or 1132 cell lines grouped into 28 cancer types, using a cutoff value of 1 nTPM as the limit of detection across all tissues or cell types.

Explanation of the specificity category

Category	Description
Enriched	nTPM in a particular tissue/region/cell type at least four times that of any other tissue/region/cell type
Group enriched	nTPM in a group (of 2-5 tissues, brain regions, single cell types or cell lines, or 2-10 immune cell types) at least four times that of any other tissue/region/cell line/immune cell type/cell type
Enhanced	nTPM in one or several tissues, brain regions, cell lines, immune cell types or single cell types at least four times the mean of all tissues/regions/cell types
Low specificity	nTPM ≥ 1 in at least one tissue/region/cell type but not elevated in any tissue/region/cell type
Not detected	nTPM < 1 in all tissue/region/cell types
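
The specificity categories in the table above can be sketched as a decision rule for one gene. This is a hedged illustration, not the Atlas implementation: the function name specificity_category is hypothetical, and the table does not say how the nTPM of a group is measured. The sketch assumes that every member of the group must be detected and that the lowest member must be at least four times the highest value outside the group. The default group size of 5 would be 10 for immune cell types.

```python
def specificity_category(ntpm, fold=4.0, detection=1.0, max_group=5):
    """Assign a specificity category from a dict of tissue -> nTPM."""
    if len(ntpm) < 2:
        raise ValueError("need at least two tissues or cell types")
    values = sorted(ntpm.values(), reverse=True)

    if values[0] < detection:
        return "Not detected"
    # Enriched: the top tissue is at least `fold` times the runner-up.
    if values[0] >= fold * values[1]:
        return "Enriched"
    # Group enriched (assumed reading): the lowest member of the top-k
    # group is detected and at least `fold` times the next value.
    for k in range(2, max_group + 1):
        if k >= len(values):
            break
        if values[k - 1] >= detection and values[k - 1] >= fold * values[k]:
            return "Group enriched"
    # Enhanced: the top tissue is at least `fold` times the overall mean.
    mean = sum(values) / len(values)
    if values[0] >= fold * mean:
        return "Enhanced"
    return "Low specificity"


# Hypothetical profiles.
enriched = specificity_category({"liver": 100.0, "kidney": 10.0, "brain": 5.0})
grouped = specificity_category({"a": 100.0, "b": 80.0, "c": 10.0, "d": 5.0})
low = specificity_category({"a": 5.0, "b": 4.0, "c": 3.0})
```

The checks run from the strictest category to the least strict, so a gene that satisfies several rules is assigned the first one that matches.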

An additional category "elevated", containing all genes in the first three categories (tissue/cell line/cell type enriched, group enriched and tissue/cell line/cell type enhanced), has been used for some parts of the analysis. The TS/CS-score (Tissue Specificity/Cell Specificity score) is calculated for "elevated" tissues/cell lines as the fold change between the tissue/cell line with the highest RNA level and the tissue/cell line with the second highest RNA level.
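
As a worked example of the score described above, the fold change between the highest and second highest value can be computed as follows. The function name ts_score and the example values are hypothetical, and the handling of a zero denominator is an assumption, since the text does not specify it.

```python
def ts_score(ntpm):
    """Fold change between the highest and second highest nTPM value."""
    if len(ntpm) < 2:
        raise ValueError("need at least two tissues or cell lines")
    highest, second = sorted(ntpm.values(), reverse=True)[:2]
    if second == 0:
        # Assumption: the source does not say how a zero value is handled.
        return float("inf")
    return highest / second


# Hypothetical profile: liver is highest, kidney second highest.
example = {"liver": 120.0, "kidney": 24.0, "brain": 3.0}
score = ts_score(example)
```

For this profile the score is 120 / 24 = 5.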

Explanation of the distribution category

Category	Description
Detected in single	Detected in a single tissue/region/cell type
Detected in some	Detected in more than one but fewer than one third of tissues/regions/cell types
Detected in many	Detected in at least a third but not all tissues/regions/cell types
Detected in all	Detected in all tissues/regions/cell types
Not detected	nTPM < 1 in all tissues/regions/cell types
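
The distribution categories in the table above can be sketched as a simple count of detected tissues for one gene. The function name distribution_category is hypothetical, and "detected" is taken to mean nTPM at or above the 1 nTPM cutoff stated earlier.

```python
def distribution_category(ntpm, detection=1.0):
    """Assign a distribution category from a dict of tissue -> nTPM."""
    n_total = len(ntpm)
    n_detected = sum(1 for value in ntpm.values() if value >= detection)

    if n_detected == 0:
        return "Not detected"
    if n_detected == n_total:
        return "Detected in all"
    if n_detected == 1:
        return "Detected in single"
    # "Fewer than one third", written with integers to avoid rounding.
    if 3 * n_detected < n_total:
        return "Detected in some"
    return "Detected in many"


# Hypothetical profiles over nine tissues.
names = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"]
some = distribution_category(dict(zip(names, [5.0, 2.0] + [0.0] * 7)))
many = distribution_category(dict(zip(names, [5.0, 2.0, 1.0] + [0.0] * 6)))
```

With nine tissues, two detected tissues fall below one third ("Detected in some"), while three detected tissues reach one third exactly ("Detected in many").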