Transcriptomics
HPA RNA-seq data overview
In total, 1206 cell lines, 40 human tissues, 193 samples from micro-dissected areas and regions of the human brain, and 18 immune cell types as well as total peripheral blood mononuclear cells (PBMC) have been analyzed by RNA-seq to estimate the transcript abundance of each protein-coding gene. Additionally, 19 mouse tissue samples and 32 pig tissue samples collected from the brain and retina of the animals were analyzed by RNA-seq.
The HPA Human brain sample set (n=193) is based on micro-dissected areas and regions of the human brain. The analysis is a collaboration with the Human Brain Tissue Bank (HBTB; Semmelweis University, Budapest) in accordance with approval from the Committee of Science and Research Ethic of the Ministry of Health Hungary (ETT TUKEB: 189/KO/02.6008/2002/ETT) and the Semmelweis University Regional Committee of Science and Research Ethic (No. 32/1992/TUKEB) to remove, collect, store and use human brain tissue samples for research. Samples were collected by Prof. Palkovits and RNA was extracted from frozen brain punches. The human brain dataset is based on 966 samples of 193 regions analyzed using the MGI DNBSEQ-T7 platform. The human prefrontal cortex dataset includes 165 samples from 3 male and 3 female donors, providing a detailed overview of protein expression in 17 subregions of the prefrontal cortex and 3 reference cortical regions, and was analyzed using the Illumina sequencing platform.
Mouse brain
For mouse tissue, samples were collected and handled in accordance with Swedish laws and regulations, and all experiments were approved by the local ethical committee (Stockholms Norra Djurförsöksetiska Nämd N183/14). The animal experiments conformed to the European Communities Council Directive (86/609/EEC), and all efforts were made to minimize the suffering and the number of animals used.
WT male (n = 2) and female (n = 2) C57BL/6J mice (2 months old) were obtained from Charles River Laboratories and maintained under standard conditions on a 12-hour day/night cycle, with water and food ad libitum. After washing out the blood, brains, pituitary gland, and spinal cord were quickly removed from the skull and spine and placed in ice-cold sterile PBS to make the tissue stiff and easier to dissect. The entire brain was carefully dissected into 17 sub-regions on an ice-cold surface. Retina samples were collected by separating the retina from the pigment layer in warm (37°C) PBS, pH 7.4. All dissected regions were placed in a 1.5 ml Eppendorf tube and snap-frozen in liquid nitrogen. Samples were stored at -80°C until further processing for RNA extraction. Transcript expression of all brain regions, pituitary and retina was analysed. Tissue was homogenized mechanically using a TissueLyser LT (Qiagen) and total RNA was prepared using the RNeasy Mini isolation kit (Qiagen). This generated high-quality RNA, with 84% of the samples having RNA Integrity Number (RIN) values higher than 8.0 and only one sample removed due to a very low RIN value (less than 6.0). In total, 75 samples were subsequently used for library construction with Illumina TruSeq Stranded mRNA reagents. The Illumina HiSeq2500 platform was used for sequencing at a depth of approximately 20 million reads.
Pig brain
The pig tissue samples were collected and analyzed in collaboration with BGI. Pig brains used for mRNA analysis were collected and handled in accordance with national guidance for large experimental animals and under permission of the local ethical committee (ethical permission numbers No.44410500000078 and BGI-IRB18135), and the work was conducted in line with European directives and regulations. The experimental minipigs (Chinese Bama Minipig) were provided by the Peral Lab Animal Sci & Tech Co.,Ltd (Permit number SYXK2017-0123).
Male (n = 2) and female (n = 2) Chinese Bama minipigs (1 year old) were housed in a specific pathogen-free stable facility under standard conditions. The brain was cut in coronal slabs at the level of 1) frontal lobe/olfactory tract, 2) optic chiasm and 3) between hypothalamus and cerebral peduncle. Slabs were divided into 2 hemispheres, exposing all main brain structures. For mRNA analysis, pieces of cerebral cortex and cerebellum were collected using a sampling strategy that yields a representative sample containing all cell layers. All other regions were dissected and collected completely. Two samples (somatosensory cortex and periaqueductal gray) are missing from female 1 because these two regions could not be identified with certainty and were therefore excluded. Duplicate samples were taken from the olfactory bulb of female 2, resulting in a total of 119 brain samples and 8 additional samples (retina and pituitary gland), 127 samples in all. All samples were stored at -80°C until RNA was extracted within one month.
Allen Mouse brain ISH dataset
The Allen Brain Atlas (ABA) is an open access database focusing on the brain, and includes both human and mouse expression data. The ABA is a part of the Allen Institute for Brain Science, which is one of the three branches of the Allen Institute. The Mouse brain In situ hybridization (ISH) data provides information on where in the adult mouse brain each gene is expressed (Lein ES et al. (2007)). We have imported the expression values available through the ABA API (© 2004 Allen Institute for Brain Science, Allen Mouse Brain Atlas) and show the regional expression grouped in the same manner as the other datasets visualized on the HPA Brain Atlas. The Allen mouse brain ISH data was mapped to the mouse gene annotation of Ensembl version 109 using the probe nucleotide sequences provided through the Allen mouse brain API together with the blast program package.
The mouse genes were then mapped to human genes using Ensembl orthologue data with a one-to-one restriction.
Normalization of transcriptomics data
For both the HPA and GTEx transcriptomics datasets, the average TPM value of all individual samples for each human tissue or human cell type was used to estimate the gene expression level. To be able to combine the datasets into consensus transcript expression levels, a pipeline was set up to normalize the data for all samples. In brief, all TPM values per sample were scaled to a sum of 1 million TPM (denoted pTPM) to compensate for the non-coding transcripts that had been previously removed. Next, all TPM values of all samples within each data source (HPA + GTEx human tissues, HPA immune cell types, HPA cell lines) were normalized separately using Trimmed mean of M values (TMM) to allow for between-sample comparisons. The resulting normalized transcript expression values, denoted nTPM, were calculated for each gene in every sample. nTPM values below 0.1 are not visualized on the Atlas sections. For the brain dataset, an additional normalization was performed using linear regression to correct for inter-individual variation, using the removeBatchEffect function in the R package Limma with subject as the batch parameter. To reduce the technical variation between the MGI and Illumina platforms, 19 reference samples were included and run on both platforms. Intensity normalization based on these reference samples was conducted to minimize technical variation between the two platforms. Consensus transcript expression levels for each gene were summarized in 51 human tissues based on transcriptomics data from the two sources HPA and GTEx. The consensus nTPM value for each gene and tissue type represents the maximum nTPM value based on HPA and GTEx.
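The pTPM rescaling and the maximum-based consensus described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the HPA pipeline: the function names and the dictionary layout are hypothetical, and the TMM step that turns pTPM into nTPM is deliberately left out.

```python
def scale_to_ptpm(tpm_by_gene):
    """Rescale one sample's TPM values so that they sum to 1,000,000.

    tpm_by_gene is a hypothetical layout: a dict mapping gene ID to the TPM
    that remains after non-coding transcripts have been removed.
    """
    total = sum(tpm_by_gene.values())
    if total == 0:
        raise ValueError("sample has no expression signal")
    return {gene: value * 1_000_000 / total for gene, value in tpm_by_gene.items()}


def consensus_max(hpa_ntpm, gtex_ntpm):
    """Keep, for every (gene, tissue) key, the larger nTPM of the two sources.

    A key that is present in only one source is taken from that source
    (the missing value is treated as 0.0, which is an assumption).
    """
    keys = set(hpa_ntpm) | set(gtex_ntpm)
    return {key: max(hpa_ntpm.get(key, 0.0), gtex_ntpm.get(key, 0.0)) for key in keys}


# Illustrative values only, not real expression data.
sample = {"GENE_A": 300_000.0, "GENE_B": 100_000.0}
ptpm = scale_to_ptpm(sample)

hpa = {("GENE_A", "liver"): 12.0, ("GENE_A", "lung"): 3.0}
gtex = {("GENE_A", "liver"): 9.5, ("GENE_A", "lung"): 4.2}
consensus = consensus_max(hpa, gtex)
```

Because the rescaling only divides by the per-sample total, the relative abundance of genes within a sample is unchanged; comparability between samples comes from the TMM step, which this sketch does not implement.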
For tissues with multiple sub-tissues (brain regions, immune cells, lymphoid tissues and intestine) the maximum of all sub-tissues is used for the tissue type and the total number of tissue types in the human tissue consensus set is 37. The FANTOM5 dataset was normalized separately on the sample level using TMM. The normalized Tags Per Million for each gene were calculated based on the average of all individual samples for each human tissue. Mouse and pig transcriptomic data generated by the HPA in collaboration with BGI were normalized separately, according to the same procedure used for human tissues and cell types. No Limma adjustment was performed on the mouse and pig data. Consensus transcript expression levels are summarized into 13 brain regions for mouse brain and 15 regions for pig brain, where sub-regional samples were combined and the maximum of the sub-regions was used for the brain region. Single cell type clusters were normalized separately from other transcriptomics datasets using TMM. To generate expression values per cell type, clusters were aggregated per cell type by first calculating the weighted mean nTPM in all cells with the same cluster annotation within a dataset. The values for the same cell types in different datasets were then averaged to a single aggregated value. Only clusters with medium and high reliability were included, and clusters containing mixed cell types, Neutrophils and Platelets were excluded.
Classification of transcriptomics data
The consensus transcriptomics data was used to classify all genes according to their tissue-specific, single cell type-specific, brain region-specific, immune cell-specific or cell line-specific expression into two different schemas: specificity category and distribution category.
These are defined based on the total set of all nTPM values in 40 tissues, 154 single cell types, 13 main regions of each mammalian brain, 18 immune cell types or 1132 cell lines grouped into 28 cancer types and using a cutoff value of 1 nTPM as a limit for detection across all tissues or cell types.
Explanation of the specificity category
Explanation of the distribution category
Gene clustering of transcriptomics data
The RNA expression data has been used to classify protein-coding genes into expression clusters for tissues, single cell types, immune cells, and cell lines.
Data preprocessing
For each dataset, genes with expression level > 1 in at least one of the samples were selected. The data was scaled gene-wise to Z-scores to account for differences in dynamic range between genes across samples. After scaling, the expression data was projected into a lower-dimensional space using Principal Component Analysis (PCA), where the number of components was selected to satisfy Kaiser’s rule (eigenvalue ≥ 1) and to explain at least 80% of the total variance. Gene-to-gene distances were calculated from the Spearman correlation of gene expression across samples, transformed to Spearman distance (1 - Spearman correlation).
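The filtering, gene-wise scaling and distance steps above can be sketched in plain Python. This is a minimal illustration, not the HPA implementation: all function names are hypothetical, the population standard deviation is an assumed choice for the Z-score, the PCA step is omitted, and the Spearman distance is computed directly on the supplied vectors, which may differ from the vectors used in the actual pipeline.

```python
from statistics import mean, pstdev


def passes_filter(values, cutoff=1.0):
    """Keep a gene if its expression exceeds the cutoff in at least one sample."""
    return any(v > cutoff for v in values)


def zscore(values):
    """Gene-wise Z-score across samples (population standard deviation assumed)."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (vx * vy) ** 0.5


def spearman_distance(x, y):
    """1 - Spearman correlation, i.e. 1 - Pearson correlation of the ranks."""
    return 1.0 - pearson(average_ranks(x), average_ranks(y))
```

The resulting distance ranges from 0 for genes whose expression rises and falls together across samples to 2 for genes with exactly opposite rank order.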
Gene clustering
Based on the distances, a k-nearest neighbors (kNN) graph was computed using the 20 nearest neighbors, which was subsequently used to find clusters of similarly expressed genes via Louvain clustering. To account for the stochasticity in the Louvain algorithm, the clustering was performed 100 times. The results were later collapsed into a single consensus clustering. Confidence of the gene-to-cluster assignment was calculated as the fraction of times that the gene was assigned to the cluster.
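The confidence measure above can be illustrated with a short Python sketch. It rests on two simplifying assumptions that the text does not confirm: cluster labels are taken to be already matched across the repeated runs, and a per-gene majority vote stands in for the consensus step, whose actual method is not described here. The function name and input layout are hypothetical.

```python
from collections import Counter


def consensus_and_confidence(runs):
    """Collapse repeated clustering runs into one label per gene.

    runs is a list with one dict per clustering run, each mapping
    gene ID -> cluster label (labels assumed comparable across runs).
    Returns gene ID -> (majority label, fraction of runs with that label).
    """
    result = {}
    for gene in runs[0]:
        counts = Counter(run[gene] for run in runs)
        label, hits = counts.most_common(1)[0]
        result[gene] = (label, hits / len(runs))
    return result


# Four illustrative runs instead of 100; gene names and labels are made up.
example_runs = [
    {"GENE_A": 0, "GENE_B": 1},
    {"GENE_A": 0, "GENE_B": 1},
    {"GENE_A": 0, "GENE_B": 0},
    {"GENE_A": 0, "GENE_B": 1},
]
assignments = consensus_and_confidence(example_runs)
```

In practice, community labels from independent Louvain runs are arbitrary, so a real implementation needs a label-matching or co-assignment step before a fraction like this is meaningful.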
Cluster annotation
The clustering generated for each of the datasets is manually annotated to assign a specificity and function to each cluster. The annotation is based on overrepresentation analysis towards biological databases, including Gene Ontology, Reactome, PanglaoDB, TRRUST, and KEGG, as well as HPA classifications including subcellular location, protein class, secretion location and classification, and specificity toward tissues, single cell types, immune cells, brain regions, and cell lines. A reliability score is manually set for each cluster, indicating the confidence of the specificity and function assignment.
Clustering visualization
The clustering results are visualized in a UMAP. Colored polygons were generated to represent the main contiguous masses of genes corresponding to the same cluster. First, for each cluster, the two-dimensional density was estimated in the UMAP, and an area enveloping 95% of the total density was determined. The areas were moderated to include contiguous areas corresponding to at least 5% of the total area in the UMAP space. Finally, contiguous areas were converted to two-dimensional polygons for each cluster.
GTEx RNA-seq data
The Genotype-Tissue Expression (GTEx) project collects and analyzes multiple human post mortem tissues. RNA-seq data from 36 of their tissue types was mapped based on RSEM v1.3.0 (v8) and the resulting TPM values have been included in the Human Protein Atlas for all corresponding genes that could be mapped from Gencode v26 to Ensembl version 109. The GTEx retina data are based on EyeGEx data from Ratnapriya et al., Nature Genetics 2019, and transcript abundance estimation was performed with Kallisto v0.48.0, using Ensembl version 109 as the reference genome.
FANTOM5 CAGE data
The Functional Annotation of Mammalian Genomes 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type specific transcriptomes using Cap Analysis of Gene Expression (CAGE) (Takahashi H et al. (2012)), which is based on a series of full-length cDNA technologies developed in RIKEN. CAGE data for 60 of their tissues was obtained from the FANTOM5 repository and mapped to Ensembl version 109.