Transcriptomics

HPA RNA-seq data overview

In total, 1206 cell lines, 40 human tissues, 193 samples from micro-dissected areas and regions of the human brain, and 18 immune cell types as well as total peripheral blood mononuclear cells (PBMC) have been analyzed by RNA-seq to estimate the transcript abundance of each protein-coding gene. Additionally, 19 mouse tissue samples and 32 pig tissue samples, collected from the brain and retina of the animals, were analyzed by RNA-seq.

Normal tissue specimens were collected with consent from patients and all samples were anonymized in accordance with approval from the local ethics committee (ref #2011/473) and Swedish rules and legislation. All tissues were collected from the Uppsala Biobank and RNA samples were extracted from frozen tissue sections.

For a total of 186 normal tissue samples, mRNA sequencing was performed on Illumina HiSeq2000 and 2500 machines (Illumina, San Diego, CA, USA) using the standard Illumina RNA-seq protocol with a read length of 2x100 bases.

Normalization of transcriptomics data

For both the HPA and GTEx transcriptomics datasets, the average TPM value of all individual samples for each human tissue or human cell type was used to estimate the gene expression level. To be able to combine the datasets into consensus transcript expression levels, a pipeline was set up to normalize the data for all samples. In brief, all TPM values per sample were scaled to a sum of 1 million TPM (denoted pTPM) to compensate for the non-coding transcripts that had been previously removed. Next, all TPM values of all samples within each data source (HPA + GTEx human tissues, HPA immune cell types, HPA cell lines) were normalized separately using Trimmed mean of M values (TMM) to allow for between-sample comparisons. The resulting normalized transcript expression values, denoted nTPM, were calculated for each gene in every sample. nTPM values below 0.1 are not visualized on the Atlas sections.
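The rescaling step described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `to_ptpm` and the example gene names are hypothetical, and the subsequent TMM normalization that produces nTPM is not shown.

```python
def to_ptpm(tpm_by_gene):
    """Rescale one sample's TPM values so that they sum to 1,000,000 (pTPM)."""
    total = sum(tpm_by_gene.values())
    if total == 0:
        raise ValueError("sample has no expression signal")
    return {gene: value * 1_000_000 / total for gene, value in tpm_by_gene.items()}


# Hypothetical sample: after removing non-coding transcripts,
# the remaining protein-coding TPM values sum to 800,000.
sample = {"GENE_A": 200_000.0, "GENE_B": 600_000.0}
ptpm = to_ptpm(sample)
print(ptpm)
```

After rescaling, the values of each sample again sum to 1 million, so samples that lost different amounts of non-coding signal are on the same scale before TMM is applied.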

For the brain dataset, an additional normalization step corrected for inter-individual variation by linear regression, using the removeBatchEffect function in the R package limma with subject as the batch parameter. To reduce technical variation between the MGI and Illumina platforms, 19 reference samples were run on both platforms, and intensity normalization based on these reference samples was applied.

Consensus transcript expression levels for each gene were summarized in 51 human tissues based on transcriptomics data from the two sources HPA and GTEx. The consensus nTPM value for each gene and tissue type represents the maximum nTPM value based on HPA and GTEx. For tissues with multiple sub-tissues (brain regions, immune cells, lymphoid tissues and intestine), the maximum of all sub-tissues is used for the tissue type. The total number of tissue types in the human tissue consensus set is 37.
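The maximum rule for one gene can be illustrated with a short Python sketch. The function name `consensus_ntpm`, the grouping table and the example values are hypothetical and only serve to show how the per-source maximum and the sub-tissue maximum combine.

```python
def consensus_ntpm(sources, tissue_of):
    """Return the maximum nTPM per tissue type for a single gene.

    sources   : list of dicts mapping (sub-)tissue name -> nTPM, one per data source
    tissue_of : dict mapping a sub-tissue name to its tissue type; names that
                are not listed are treated as their own tissue type
    """
    consensus = {}
    for source in sources:
        for sub_tissue, value in source.items():
            tissue = tissue_of.get(sub_tissue, sub_tissue)
            consensus[tissue] = max(consensus.get(tissue, value), value)
    return consensus


# Hypothetical nTPM values for one gene in two sources.
hpa = {"liver": 12.0, "duodenum": 3.0, "small intestine": 5.0}
gtex = {"liver": 20.0, "colon": 7.0}
groups = {"duodenum": "intestine", "small intestine": "intestine", "colon": "intestine"}

result = consensus_ntpm([hpa, gtex], groups)
print(result)
```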

The FANTOM5 dataset was normalized separately on the sample level using TMM. The normalized Tags Per Million for each gene were calculated based on the average of all individual samples for each human tissue.

Mouse and pig transcriptomic data, generated by the HPA in collaboration with BGI, were normalized separately according to the same procedure used for human tissues and cell types; no limma adjustment was performed on the mouse and pig data. Consensus transcript expression levels are summarized into 13 brain regions for mouse brain and 15 regions for pig brain, where sub-regional samples were combined and the maximum of the sub-regions was used for the brain region.

Single cell type clusters were normalized separately from other transcriptomics datasets using TMM. To generate expression values per cell type, clusters were aggregated per cell type by first calculating the weighted mean nTPM in all cells with the same cluster annotation within a dataset. The values for the same cell type in different datasets were then averaged to a single aggregated value. Only clusters with medium and high reliability were included; clusters containing mixed cell types, neutrophils or platelets were excluded.
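A minimal Python sketch of this two-step aggregation for a single gene follows. It assumes that the weights of the weighted mean are the numbers of cells per cluster, which the text does not state explicitly; the function name, record layout and example values are hypothetical.

```python
from collections import defaultdict


def aggregate_cell_types(clusters):
    """Aggregate cluster-level nTPM of one gene to one value per cell type.

    clusters: list of dicts with keys "dataset", "cell_type", "n_cells", "ntpm".
    Step 1: weighted mean per (dataset, cell type), weighted by n_cells (assumption).
    Step 2: plain mean of the per-dataset values for each cell type.
    """
    weighted_sum = defaultdict(float)
    cell_count = defaultdict(int)
    for c in clusters:
        key = (c["dataset"], c["cell_type"])
        weighted_sum[key] += c["ntpm"] * c["n_cells"]
        cell_count[key] += c["n_cells"]

    per_dataset = defaultdict(list)
    for (dataset, cell_type), total in weighted_sum.items():
        per_dataset[cell_type].append(total / cell_count[(dataset, cell_type)])

    return {ct: sum(vals) / len(vals) for ct, vals in per_dataset.items()}


# Hypothetical clusters: two T-cell clusters in dataset d1, one in dataset d2.
example = [
    {"dataset": "d1", "cell_type": "T-cells", "n_cells": 100, "ntpm": 10.0},
    {"dataset": "d1", "cell_type": "T-cells", "n_cells": 300, "ntpm": 20.0},
    {"dataset": "d2", "cell_type": "T-cells", "n_cells": 50, "ntpm": 12.5},
]
aggregated = aggregate_cell_types(example)
print(aggregated)
```

In this example the weighted mean within d1 is 17.5, the value in d2 is 12.5, and the aggregated value for the cell type is their mean, 15.0.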

Classification of transcriptomics data

The consensus transcriptomics data was used to classify all genes according to their tissue-specific, single cell type-specific, brain region-specific, immune cell-specific or cell line-specific expression using two different schemas: specificity category and distribution category. These are defined based on the total set of all nTPM values in 40 tissues, 154 single cell types, 13 main regions of each mammalian brain, 18 immune cell types or 1132 cell lines grouped into 28 cancer types, using a cutoff value of 1 nTPM as the limit of detection across all tissues or cell types.

Explanation of the specificity category

Category | Description
Enriched | nTPM in a particular tissue/region/cell type at least four times that in any other tissue/region/cell type
Group enriched | nTPM in a group (of 2-5 tissues, brain regions, single cell types or cell lines, or 2-10 immune cell types) at least four times that in any other tissue/region/cell line/immune cell type/cell type
Enhanced | nTPM in one or several tissues, brain regions, cell lines, immune cell types or single cell types that is at least four times the mean of all tissues/regions/cell types
Low specificity | nTPM ≥ 1 in at least one tissue/region/cell type but not elevated in any tissue/region/cell type
Not detected | nTPM < 1 in all tissues/regions/cell types


An additional category "elevated", containing all genes in the first three categories (tissue/cell line/cell type enriched, group enriched and tissue/cell line/cell type enhanced), has been used for some parts of the analysis. A TS/CS-score (Tissue Specificity/Cell Specificity score) is calculated for “elevated” tissues/cell lines as the fold change from the tissue/cell line with the highest RNA level to the tissue/cell line with the second highest RNA level.
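The specificity rules can be sketched in Python for a single gene. This is one possible reading of the definitions, not the Atlas implementation: it assumes that a group is compared by its mean level against the highest level outside the group, and that every group member must itself be detected. The function names and example values are hypothetical; `max_group` would be 10 for immune cell types.

```python
def classify_specificity(ntpm, fold=4.0, detect=1.0, max_group=5):
    """Assign a specificity category to one gene from its nTPM per tissue."""
    ranked = sorted(ntpm.values(), reverse=True)
    if ranked[0] < detect:
        return "Not detected"
    # Enriched: the top tissue is at least `fold` times any other tissue.
    if len(ranked) == 1 or ranked[0] >= fold * ranked[1]:
        return "Enriched"
    # Group enriched: top-k tissues (k = 2..max_group) versus the next tissue.
    # Assumption: the group mean is compared with the highest value outside it.
    for k in range(2, max_group + 1):
        if k >= len(ranked):
            break
        group, highest_outside = ranked[:k], ranked[k]
        if min(group) >= detect and sum(group) / k >= fold * highest_outside:
            return "Group enriched"
    # Enhanced: at least one tissue is `fold` times the mean of all tissues.
    if ranked[0] >= fold * (sum(ranked) / len(ranked)):
        return "Enhanced"
    return "Low specificity"


def ts_score(ntpm):
    """Fold change from the highest to the second highest nTPM value."""
    ranked = sorted(ntpm.values(), reverse=True)
    return ranked[0] / ranked[1]  # undefined if the second value is 0


enriched = {"liver": 100.0, "kidney": 10.0, "lung": 5.0}
grouped = {"a": 50.0, "b": 40.0, "c": 5.0, "d": 4.0}
enhanced = dict(zip("abcdefghi", [40.0, 20.0, 10.0, 8.0, 6.0, 5.0, 0.0, 0.0, 0.0]))
flat = {"a": 10.0, "b": 9.0, "c": 8.0}

for profile in (enriched, grouped, enhanced, flat):
    print(classify_specificity(profile))
print(ts_score(enriched))
```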

Explanation of the distribution category

Category | Description
Detected in single | Detected in a single tissue/region/cell type
Detected in some | Detected in more than one but less than one third of tissues/regions/cell types
Detected in many | Detected in at least a third but not all tissues/regions/cell types
Detected in all | Detected in all tissues/regions/cell types
Not detected | nTPM < 1 in all tissues/regions/cell types
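The distribution categories translate directly into a small Python function. The sketch below uses the 1 nTPM detection cutoff stated earlier; the function name and the example profiles are hypothetical.

```python
def classify_distribution(ntpm, detect=1.0):
    """Assign a distribution category to one gene from its nTPM per tissue."""
    total = len(ntpm)
    detected = sum(1 for value in ntpm.values() if value >= detect)
    if detected == 0:
        return "Not detected"
    if detected == 1:
        return "Detected in single"
    if detected == total:
        return "Detected in all"
    # Integer comparison avoids rounding issues around exactly one third.
    if 3 * detected < total:
        return "Detected in some"
    return "Detected in many"


def profile(values):
    """Build a hypothetical tissue -> nTPM mapping from a list of values."""
    return {f"tissue_{i}": v for i, v in enumerate(values)}


print(classify_distribution(profile([5, 0, 0, 0, 0, 0, 0, 0, 0])))
print(classify_distribution(profile([5, 2, 0, 0, 0, 0, 0, 0, 0])))
print(classify_distribution(profile([5, 2, 3, 0, 0, 0, 0, 0, 0])))
print(classify_distribution(profile([5, 2, 3, 4, 1, 1, 1, 1, 1])))
```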

Gene clustering of transcriptomics data

The RNA expression data has been used to classify protein-coding genes into expression clusters for tissues, single cell types, immune cells, and cell lines.

Clustering | Number of tissues, cell types or cell lines | Sample aggregation level
Tissue | 78 | Averaged nTPM expression per tissue type (40 HPA and 38 GTEx tissue types)
Single cell type | 1175 | Averaged nCPM expression per cell type cluster
Cell lines | 1206 | nTPM expression of individual cell lines
Immune cells | 103 | Averaged nTPM expression per immune cell
Brain | 193 | Averaged nTPM expression per brain region


Data preprocessing

For each dataset, genes with an expression level > 1 in at least one of the samples were selected. The data was scaled gene-wise to Z-scores to account for differences in dynamic range between genes across samples. After scaling, the expression data was projected into a lower-dimensional space using Principal Component Analysis (PCA), where the number of components was selected to satisfy Kaiser’s rule (eigenvalue ≥ 1) and to explain at least 80% of the total variance. Gene-to-gene distances were calculated from the Spearman correlation of gene expression across samples, transformed to Spearman distance (1 - Spearman correlation).
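The individual preprocessing steps can be illustrated with a standard-library Python sketch. Several points are assumptions: the PCA decomposition itself is not shown and its eigenvalues are taken as given, the number of components is read as the larger of the Kaiser count and the 80% variance count, and the sample standard deviation is used for the Z-score. Whether the correlation is computed on scaled values or on PCA scores is not stated in the text; `spearman_distance` accepts either. All function names are hypothetical.

```python
from statistics import mean, stdev


def zscore(values):
    """Scale one gene's expression values to Z-scores (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [0.0] * len(values) if s == 0 else [(v - m) / s for v in values]


def select_components(eigenvalues, min_variance=0.8):
    """Number of components meeting Kaiser's rule and the variance threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    kaiser = sum(1 for ev in ordered if ev >= 1)
    total, cumulative, by_variance = sum(ordered), 0.0, len(ordered)
    for i, ev in enumerate(ordered, start=1):
        cumulative += ev
        if cumulative / total >= min_variance:
            by_variance = i
            break
    return max(kaiser, by_variance)  # assumption: both criteria must hold


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_distance(x, y):
    """1 - Spearman correlation (Pearson correlation of the ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return 1.0 - cov / (sx * sy)  # undefined for constant vectors


print(zscore([1.0, 2.0, 3.0]))
print(select_components([3.0, 0.9, 0.6, 0.5]))
print(spearman_distance([1, 2, 3, 4], [10, 20, 30, 40]))
```

The distance ranges from 0 for genes with identical expression ranking across samples to 2 for genes with exactly opposite ranking.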


Gene clustering

Based on the distances, a k-nearest neighbors (kNN) graph was computed using the 20 nearest neighbors, which was subsequently used to find clusters of similarly expressed genes via Louvain clustering. To account for the stochasticity of the Louvain algorithm, the clustering was performed 100 times. The results were later collapsed into a single consensus clustering. Confidence of the gene-to-cluster assignment was calculated as the fraction of times that the gene was assigned to the cluster.
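Two parts of this procedure can be sketched with the Python standard library: building the kNN graph from a distance matrix, and computing the assignment confidence over repeated runs. The Louvain step itself is not implemented here. The text does not say how the runs are collapsed into a consensus, so the sketch assumes a per-gene majority vote over runs whose cluster labels have already been matched to each other; function names and example data are hypothetical.

```python
from collections import Counter


def knn_edges(distances, k=20):
    """Undirected kNN graph edges from a square distance matrix (list of lists)."""
    edges = set()
    for i, row in enumerate(distances):
        others = [j for j in range(len(row)) if j != i]
        for j in sorted(others, key=lambda j: row[j])[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def consensus_with_confidence(runs):
    """Majority-vote cluster per gene and the fraction of runs that agree.

    runs: list of dicts mapping gene -> cluster label, with labels assumed
    to be comparable across runs (an assumption of this sketch).
    """
    result = {}
    for gene in runs[0]:
        label, count = Counter(run[gene] for run in runs).most_common(1)[0]
        result[gene] = (label, count / len(runs))
    return result


# Four hypothetical genes forming two tight pairs.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(knn_edges(dist, k=1))

# Four hypothetical clustering runs; gene B changes cluster once.
runs = [
    {"A": 0, "B": 1},
    {"A": 0, "B": 1},
    {"A": 0, "B": 0},
    {"A": 0, "B": 1},
]
print(consensus_with_confidence(runs))
```

With 100 runs, as described above, a confidence of 0.75 would mean that the gene landed in its consensus cluster in 75 of the runs.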


Cluster annotation

The clustering generated for each of the datasets is manually annotated to assign a specificity and function to each cluster. The annotation is based on overrepresentation analysis towards biological databases, including Gene Ontology, Reactome, PanglaoDB, TRRUST, and KEGG, as well as HPA classifications including subcellular location, protein class, secretion location and classification, and specificity toward tissues, single cell types, immune cells, brain regions, and cell lines. A reliability score is manually set for each cluster indicating the confidence of specificity and function assignment.

Clustering visualization

The clustering results are visualized in a UMAP. Colored polygons were generated to represent the main contiguous masses of genes corresponding to the same cluster. First, for each cluster, the two-dimensional density was estimated in the UMAP, and an area enveloping 95% of the total density was determined. The areas were moderated to include contiguous areas corresponding to at least 5% of the total area in the UMAP space. Finally, the contiguous areas were converted to two-dimensional polygons for each cluster.


GTEx RNA-seq data

The Genotype-Tissue Expression (GTEx) project collects and analyzes multiple human post mortem tissues. RNA-seq data from 36 of their tissue types was mapped based on RSEM v1.3.0 (v8) and the resulting TPM values have been included in the Human Protein Atlas for all corresponding genes that could be mapped from Gencode v26 to Ensembl version 109. The GTEx retina data are based on EyeGEx data from Ratnapriya et al., Nature Genetics 2019, and transcript abundance estimation was performed with Kallisto v0.48.0 using Ensembl version 109 as the reference.

Tissue | GTEx tissue | Number of samples
Adipose tissue | Adipose - Subcutaneous | 714
Adipose tissue | Adipose - Visceral (Omentum) | 587
Adrenal gland | Adrenal Gland | 295
Amygdala | Brain - Amygdala | 181
Blood vessel | Artery - Aorta | 472
Blood vessel | Artery - Coronary | 268
Blood vessel | Artery - Tibial | 691
Breast | Breast - Mammary Tissue | 514
Caudate | Brain - Caudate (basal ganglia) | 300
Cerebellum | Brain - Cerebellar Hemisphere | 277
Cerebellum | Brain - Cerebellum | 266
Cerebral cortex | Brain - Anterior cingulate cortex (BA24) | 233
Cerebral cortex | Brain - Cortex | 270
Cerebral cortex | Brain - Frontal Cortex (BA9) | 269
Cervix | Cervix - Ectocervix | 24
Cervix | Cervix - Endocervix | 23
Colon | Colon - Sigmoid | 419
Colon | Colon - Transverse | 479
Endometrium | Uterus - Endometrium | 27
Esophagus | Esophagus - Mucosa | 614
Fallopian tube | Fallopian Tube | 29
Heart muscle | Heart - Atrial Appendage | 461
Heart muscle | Heart - Left Ventricle | 452
Hippocampus | Brain - Hippocampus | 255
Hypothalamus | Brain - Hypothalamus | 257
Kidney | Kidney - Cortex | 104
Kidney | Kidney - Medulla | 11
Liver | Liver | 262
Lung | Lung | 604
Nucleus accumbens | Brain - Nucleus accumbens (basal ganglia) | 285
Ovary | Ovary | 193
Pancreas | Pancreas | 362
Pituitary gland | Pituitary | 313
Prostate | Prostate | 282
Putamen | Brain - Putamen (basal ganglia) | 254
Retina | Retina | 105
Salivary gland | Minor Salivary Gland | 181
Skeletal muscle | Muscle - Skeletal | 818
Skin | Skin - Not Sun Exposed (Suprapubic) | 651
Skin | Skin - Sun Exposed (Lower leg) | 754
Small intestine | Small Intestine - Terminal Ileum | 207
Spinal cord | Brain - Spinal cord (cervical c-1) | 204
Spleen | Spleen | 277
Stomach | Stomach | 407
Substantia nigra | Brain - Substantia nigra | 183
Testis | Testis | 414
Thyroid gland | Thyroid | 684
Urinary bladder | Bladder | 77
Vagina | Vagina | 170

FANTOM5 CAGE data

The Functional Annotation of Mammalian Genomes 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type specific transcriptomes using Cap Analysis of Gene Expression (CAGE) (Takahashi H et al. (2012)), which is based on a series of full-length cDNA technologies developed at RIKEN. CAGE data for 60 of their tissues was obtained from the FANTOM5 repository and mapped to Ensembl version 109.

Tissue | FANTOM5 tissue | Sample description | FANTOM5 sample id
Adipose tissue | Adipose tissue | 65,65,76 years, mixed | FF:10010-101C1
Amygdala | Amygdala | 76 years, female | FF:10151-102I7
Appendix | Appendix | 29 years, male | FF:10189-103D9
Breast | Breast | 77 years, female | FF:10080-102A8
Caudate | Caudate nucleus | 76 years, female | FF:10164-103B2
Cerebellum | Cerebellum | 22-68 years, mixed | FF:10083-102B2
Cerebellum | Cerebellum | 76 years, female | FF:10166-103B4
Cervix | Cervix | 40,46,57,65 years, female | FF:10013-101C4
Colon | Colon | 62,83,84 years, mixed | FF:10014-101C5
Corpus callosum | Corpus callosum | 24-68 years, mixed | FF:10042-101F6
Ductus deferens | Ductus deferens | 24 years, male | FF:10196-103E7
Endometrium | Uterus | 23-63 years, female | FF:10100-102D1
Epididymis | Epididymis | 24 years, male | FF:10197-103E8
Esophagus | Esophagus | 68,74,75 years, mixed | FF:10015-101C6
Frontal lobe | Frontal lobe | 32-61 years, mixed | FF:10040-101F4
Gallbladder | Gall bladder | 57 years, male | FF:10198-103E9
Globus pallidus | Globus pallidus | 76 years, female | FF:10161-103A8
Globus pallidus | Globus pallidus | 60 years, female | FF:10175-103C4
Heart muscle | Heart | 70,73,74 years, mixed | FF:10016-101C7
Heart muscle | Left ventricle | 73 years, female | FF:10078-102A6
Heart muscle | Left atrium | 40 years, male | FF:10079-102A7
Hippocampus | Hippocampus | 76 years, female | FF:10153-102I9
Hippocampus | Hippocampus | 60 years, female | FF:10169-103B7
Insular cortex | Insula | 20-68 years, mixed | FF:10039-101F3
Kidney | Kidney | 60,62,63 years, female | FF:10017-101C8
Liver | Liver | 64,69,70 years, mixed | FF:10018-101C9
Locus coeruleus | Locus coeruleus | 76 years, female | FF:10165-103B3
Locus coeruleus | Locus coeruleus | 60 years, female | FF:10182-103D2
Lung | Lung | 46,65,94 years, mixed | FF:10019-101D1
Lung | Lung - right lower lobe | 29 years, male | FF:10075-102A3
Lymph node | Lymph node | 30 years, male | FF:10077-102A5
Medial frontal gyrus | Medial frontal gyrus | 76 years, female | FF:10150-102I6
Medial temporal gyrus | Medial temporal gyrus | 76 years, female | FF:10156-103A3
Medial temporal gyrus | Medial temporal gyrus | 60 years, female | FF:10183-103D3
Medulla oblongata | Medulla oblongata | 18-64 years, mixed | FF:10038-101F2
Medulla oblongata | Medulla oblongata | 76 years, female | FF:10155-103A2
Medulla oblongata | Medulla oblongata | 60 years, female | FF:10174-103C3
Nucleus accumbens | Nucleus accumbens | 23-56 years, mixed | FF:10037-101F1
Occipital cortex | Occipital cortex | 76 years, female | FF:10163-103B1
Occipital lobe | Occipital lobe | 27 years, male | FF:10076-102A4
Occipital pole | Occipital pole | 22-68 years, mixed | FF:10036-101E9
Olfactory bulb | Olfactory region | 87 years, female | FF:10195-103E6
Ovary | Ovary | 47,75,84 years, female | FF:10020-101D2
Pancreas | Pancreas | 52 years, male | FF:10049-101G4
Paracentral gyrus | Paracentral gyrus | 22-69 years, mixed | FF:10035-101E8
Parietal lobe | Parietal lobe | 35-89 years, mixed | FF:10034-101E7
Parietal lobe | Parietal lobe | 76 years, female | FF:10157-103A4
Parietal lobe | Parietal lobe | 60 years, female | FF:10171-103B9
Pituitary gland | Pituitary gland | 76 years, female | FF:10162-103A9
Placenta | Placenta | female | FF:10021-101D3
Pons | Pons | 18-54 years, mixed | FF:10033-101E6
Postcentral gyrus | Postcentral gyrus | 44-52 years, mixed | FF:10032-101E5
Prostate | Prostate | 73,79,93 years, male | FF:10022-101D4
Putamen | Putamen | 60 years, female | FF:10176-103C5
Retina | Retina | 24-65 years, mixed | FF:10030-101E3
Salivary gland | Salivary gland | 16-60 years, mixed | FF:10093-102C3
Salivary gland | Parotid gland | 23 years, male | FF:10199-103F1
Salivary gland | Submaxillary gland | 24 years, male | FF:10202-103F4
Seminal vesicle | Seminal vesicle | 24 years, male | FF:10201-103F3
Skeletal muscle | Skeletal muscle | 55,79,79 years, mixed | FF:10023-101D5
Skeletal muscle | Skeletal muscle - soleus muscle | male | FF:10282-104F3
Small intestine | Small intestine | 15,40,85 years, mixed | FF:10024-101D6
Smooth muscle | Smooth muscle | 20-68 years, male | FF:10048-101G3
Spinal cord | Spinal cord | 76 years, female | FF:10159-103A6
Spinal cord | Spinal cord | 60 years, female | FF:10181-103D1
Spleen | Spleen | 39,50,70 years, male | FF:10025-101D7
Substantia nigra | Substantia nigra | 76 years, female | FF:10158-103A5
Temporal cortex | Temporal lobe | 32-61 years, mixed | FF:10031-101E4
Testis | Testis | 34,53,86 years, male | FF:10026-101D8
Testis | Testis | 14-64 years, male | FF:10096-102C6
Thalamus | Thalamus | 76 years, female | FF:10154-103A1
Thymus | Thymus | 0.5,0.5,0.83 years (infant), male | FF:10027-101D9
Thyroid gland | Thyroid | 67,68,78 years, mixed | FF:10028-101E1
Tongue | Tongue | 28 years, male | FF:10203-103F5
Tonsil | Tonsil | 22-61 years, mixed | FF:10047-101G2
Urinary bladder | Bladder | 55,58,79 years, mixed | FF:10011-101C2
Vagina | Vagina | 68 years, female | FF:10204-103F6