Single cell type - Methods summary

The single cell type atlas aims to create a comprehensive map of protein-coding gene expression across cell types found in the major adult tissues. We achieved this by systematically integrating and re-analyzing single cell and single nucleus RNA sequencing datasets from 34 healthy tissues under a common pipeline. This section summarizes the data processing and analysis methods for 36 scRNA-seq or snRNA-seq datasets, 1175 individual cell type clusters, and 154 final cell types.

Method details
Access method details for a thorough description of the data workflow and analysis procedures.

Key publications

Karlsson M et al. (2021) "A single cell type transcriptomics map of human tissues" Sci Adv.

Shi M et al. (2025) "A resource for whole-body gene expression map of human tissues based on integration of single cell and bulk transcriptomics" Genome Biol.

How has the data been generated?

Collection of scRNA-seq and snRNA-seq data

The 34 scRNA-seq/snRNA-seq datasets included were systematically selected through an extensive literature search of single cell transcriptomic databases and studies featuring healthy adult human tissue. These datasets were retrieved from the Single Cell Expression Atlas, the Human Cell Atlas, the Gene Expression Omnibus (GEO), EMBL-EBI BioStudies, and Tabula Sapiens. A complete list of datasets and their references is presented here.

Gene expression quantification

For most datasets, we obtained the raw sequencing files (FASTQ files). To obtain the gene counts, we mapped the reads in the FASTQ files to the human reference genome, with gene annotations based on Ensembl 109. To correct for residual RNA in solution inside droplets, each sample went through an ambient RNA correction procedure (SoupX). We then applied a multistep quality control procedure to remove poor-quality cells and technical artifacts. This involved removal of probable doublets (two cells inside one droplet), removal of droplets with high mitochondrial content, and removal of high RNA content outliers. Detection thresholds were adjusted so that low RNA content cells were retained and rare cell types remained represented.
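The cell-level filters described above can be illustrated with a minimal sketch. The thresholds (`max_mito_fraction`, `mad_cutoff`), the `MT-` gene-name prefix convention, the function name, and the toy counts are assumptions for illustration only; the actual cutoffs are not stated here, and doublet removal and SoupX correction are not shown.

```python
from statistics import median


def filter_cells(cells, max_mito_fraction=0.2, mad_cutoff=5.0):
    """Return ids of cells passing two simple filters (illustrative thresholds).

    cells: dict mapping cell id -> dict of gene name -> raw count.
    """
    totals = {cid: sum(counts.values()) for cid, counts in cells.items()}
    mito = {
        cid: sum(n for gene, n in counts.items() if gene.startswith("MT-"))
        for cid, counts in cells.items()
    }

    # Upper bound on total counts from the median absolute deviation (MAD).
    # Only an upper bound is applied, so low RNA content cells are kept.
    med = median(totals.values())
    mad = median(abs(t - med) for t in totals.values())
    upper = med + mad_cutoff * mad

    kept = []
    for cid, total in totals.items():
        if total == 0:
            continue  # empty droplet
        if mito[cid] / total > max_mito_fraction:
            continue  # high mitochondrial content
        if total > upper:
            continue  # high RNA content outlier
        kept.append(cid)
    return kept


# Toy data: c2 has high mitochondrial content, c5 is a high-count outlier.
cells = {
    "c1": {"MT-CO1": 5, "ACTB": 95},
    "c2": {"MT-CO1": 60, "ACTB": 40},
    "c3": {"MT-CO1": 2, "ACTB": 108},
    "c4": {"MT-CO1": 3, "ACTB": 87},
    "c5": {"MT-CO1": 50, "ACTB": 4950},
}
kept_cells = filter_cells(cells)
```

With these toy values, `kept_cells` contains `c1`, `c3`, and `c4`. A MAD-based bound is only one of several ways to flag count outliers and is used here because it needs no external library.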

Clustering and cell type annotation

Following cell quality control, each dataset was individually processed through a cell type annotation procedure based on cell clustering. In brief, cell clustering works by reducing the complex, high-dimensional gene expression data to its principal components and then running a clustering algorithm (Leiden) to group cells under a generic label. The gene expression profile of each cell cluster is then compared against known cell-specific marker genes to identify the underlying cell type. We assigned each cluster two levels of classification: (1) a detailed cell type annotation that prioritises resolution within a tissue (e.g., Arterial endothelial cells), and (2) a main cell type annotation that harmonizes cluster labels across all included datasets (e.g., Vascular endothelial cells).

Normalisation and data integration

After cluster labeling, we used a pseudobulk approach to integrate expression profiles across all tissues, creating a single, unified reference for each cell type. The process followed these steps:

Pseudobulk aggregation: We averaged the raw gene counts across all cells within a cluster to create a single pseudobulk profile per cluster. This resulted in 1175 distinct cell cluster expression profiles.
Normalisation: We selected protein-coding genes only and subsequently scaled the pseudobulked cluster data into counts per million (CPM), resulting in pCPM expression values. The pCPM values were then normalized using TMM normalization (Trimmed Mean of M-values). TMM corrects for technical bias by calculating scaling factors based on the expression of shared genes between clusters. The resulting TMM-normalized pCPM values are referred to as nCPM.
Cross-Tissue Integration: To obtain a single expression profile per cell type, we followed a two-step integration procedure:
- Within-Dataset Aggregation: Clusters belonging to the same cell type within the same tissue dataset were integrated using a weighted mean (weighted by the cell count per cluster) to obtain one nCPM profile per cell type per tissue dataset.
- Cross-Dataset Aggregation: The final, unique cell type expression profile was computed by averaging the nCPM profiles for the same cell type across all tissue datasets. This resulted in 154 unique cell types.
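The aggregation steps above can be sketched as follows. This is a simplified illustration, not the pipeline's actual code: all function names, field names (`dataset`, `cell_type`, `n_cells`, `ncpm`), and numbers are hypothetical, and TMM normalization is omitted, so the example profiles passed to `integrate` are assumed to be already normalized.

```python
from collections import defaultdict


def pseudobulk(cell_counts):
    """Average raw counts per gene across the cells of one cluster."""
    n = len(cell_counts)
    return {g: sum(c[g] for c in cell_counts) / n for g in cell_counts[0]}


def to_cpm(profile):
    """Scale a profile so that its values sum to one million."""
    total = sum(profile.values())
    return {g: v * 1e6 / total for g, v in profile.items()}


def weighted_mean(profiles, weights):
    """Gene-wise weighted mean of several expression profiles."""
    total_w = sum(weights)
    return {
        g: sum(p[g] * w for p, w in zip(profiles, weights)) / total_w
        for g in profiles[0]
    }


def integrate(clusters):
    """Two-step integration: within each dataset, then across datasets."""
    grouped = defaultdict(list)
    for c in clusters:
        grouped[(c["cell_type"], c["dataset"])].append(c)

    per_cell_type = defaultdict(list)
    for (cell_type, _dataset), members in grouped.items():
        # Step 1: weight clusters by their cell count within one dataset.
        per_cell_type[cell_type].append(
            weighted_mean([m["ncpm"] for m in members],
                          [m["n_cells"] for m in members])
        )
    # Step 2: plain (unweighted) average across datasets.
    return {
        ct: weighted_mean(profiles, [1] * len(profiles))
        for ct, profiles in per_cell_type.items()
    }


# Pseudobulk and CPM for one toy cluster of two cells.
pcpm = to_cpm(pseudobulk([{"A": 1, "B": 3}, {"A": 3, "B": 5}]))

# Toy cluster profiles (treated as nCPM) for one cell type in two datasets.
clusters = [
    {"dataset": "lung", "cell_type": "T cells", "n_cells": 100,
     "ncpm": {"A": 10.0, "B": 0.0}},
    {"dataset": "lung", "cell_type": "T cells", "n_cells": 300,
     "ncpm": {"A": 30.0, "B": 4.0}},
    {"dataset": "liver", "cell_type": "T cells", "n_cells": 50,
     "ncpm": {"A": 40.0, "B": 2.0}},
]
cell_type_profiles = integrate(clusters)
```

In this toy case the two lung clusters first combine to A = 25 and B = 3, and averaging with the liver profile gives A = 32.5 and B = 2.5. Because the second step is unweighted, a dataset with few cells contributes as much to the final profile as a large one.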

Cell type hierarchical organisation

To facilitate effective data presentation and usability, we introduced a hierarchical structure to the cell type annotations: cell types (n = 154), cell type groups (n = 53), and cell classes (n = 15). The cell type and cell type group levels contain expression data used for downstream gene classification. In contrast, the cell classes serve only an organizational purpose.

The cell type group expression profile is derived from the cell type data using a max pooling aggregation method. For every gene within a cell type group, the maximum nCPM value observed across constituent cell types is retained.
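As a minimal sketch of the max pooling step, the function below keeps, for each gene, the highest value seen among the cell types of a group. The function name, the group label, and the expression values are made up for illustration.

```python
def max_pool(cell_type_profiles, group_of):
    """Build group profiles by taking the gene-wise maximum over cell types.

    cell_type_profiles: cell type -> {gene: nCPM}
    group_of: cell type -> cell type group name
    """
    pooled = {}
    for cell_type, profile in cell_type_profiles.items():
        group = pooled.setdefault(group_of[cell_type], {})
        for gene, value in profile.items():
            # Keep the larger of the stored and the new value.
            group[gene] = max(value, group.get(gene, value))
    return pooled


# Hypothetical nCPM values and grouping.
profiles = {
    "Vascular endothelial cells": {"PECAM1": 120.0, "PROX1": 2.0},
    "Lymphatic endothelial cells": {"PECAM1": 80.0, "PROX1": 95.0},
}
group_of = {
    "Vascular endothelial cells": "Endothelial cells",
    "Lymphatic endothelial cells": "Endothelial cells",
}
group_profiles = max_pool(profiles, group_of)
```

Here the group profile takes PECAM1 from the vascular cells (120) and PROX1 from the lymphatic cells (95), so a gene that is high in only one member cell type stays high at the group level.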

Immunohistochemistry on tissue microarrays

To confirm scRNA-seq profiles and cell type specificity at the protein level, antibody-based protein expression profiles of normal human tissues were generated using immunohistochemistry (IHC) on tissue microarrays (TMAs), as described in more detail in the Tissue section.

How has the classification of all protein-coding genes been done?

Gene Classification and Specificity Scoring

We analyzed the processed nCPM data to classify every protein-coding gene based on its expression pattern. This was done independently on both the cell type and cell type group data.

Specificity Categories: Genes were assigned one of five categories (Enriched, Group Enriched, Enhanced, Low Specificity, or Not Detected) based on how much higher their nCPM was in a particular cell type compared to all other cell types.
Distribution Categories: Genes were also classified based on their breadth of expression (Detected in Single, Some, Many, or All) across the cell types.
Tau Score: A quantitative score was calculated for every gene to numerically assess its cell type specificity. The tau score ranges from 0, reflecting equal expression of a gene across all cell types, to 1 for perfectly specific genes. Visit the detailed methods for a more in-depth explanation of the categorisation criteria and calculation procedure.
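To make the tau score concrete, here is a minimal sketch using the commonly cited definition, tau = sum(1 - x_i / x_max) / (n - 1), where x_i is the expression in cell type i and n is the number of cell types. Whether the atlas applies this exact formula, or transforms the nCPM values first, is an assumption; the detailed methods are authoritative. The function name and input values are illustrative.

```python
def tau_score(values):
    """Tau specificity index for one gene across n cell types.

    Returns None when the score is undefined (fewer than two cell types,
    or no expression in any cell type).
    """
    n = len(values)
    if n < 2:
        return None
    peak = max(values)
    if peak <= 0:
        return None
    return sum(1 - v / peak for v in values) / (n - 1)


uniform = tau_score([5.0, 5.0, 5.0, 5.0])    # equal expression everywhere
specific = tau_score([0.0, 0.0, 8.0, 0.0])   # expressed in one cell type only
graded = tau_score([10.0, 5.0, 0.0])         # intermediate case
```

The uniform gene scores 0, the single-cell-type gene scores 1, and the graded example scores 0.75, matching the range described above.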

Gene clustering

The processed single cell type data was used to cluster genes according to their expression across cell types. Genes detected in at least one cell type (nCPM > 1) were included. The procedure involved gene-wise scaling of the nCPM expression values and extracting the principal components (PCA). Based on these principal components, the gene-to-gene Spearman distances were calculated. Subsequently, a neighborhood graph was computed, and the Louvain clustering algorithm was run 100 times on this graph. The consensus clusters calculated from the 100 iterations were taken as the final cluster assignment. This procedure resulted in 110 expression clusters. These gene clusters were subsequently manually annotated, based on overrepresentation analysis across diverse biological databases and our own specificity annotations.

The results

The clustering of 19294 genes with expression above the cutoff in the single cell type data resulted in 110 expression clusters, which have been manually annotated to describe common features in terms of function and specificity. The result of the cluster analysis is presented as a UMAP based on gene expression, where each cluster is summarized as a colored area containing most of the cluster genes. The interactive results are available here.
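The gene-to-gene Spearman distance used in the clustering procedure can be sketched as 1 minus the Spearman rank correlation between two numeric vectors (in the procedure above, the principal component scores of two genes). This is an illustration under that assumption; the helper names are hypothetical, and the neighborhood graph, Louvain runs, and consensus step are not shown.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_distance(x, y):
    """Distance in [0, 2]: 0 for identical rank order, 2 for reversed."""
    return 1 - pearson(average_ranks(x), average_ranks(y))


same_order = spearman_distance([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
reversed_order = spearman_distance([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
```

Two vectors with the same rank order have a distance close to 0, and two with fully reversed order a distance close to 2, regardless of the absolute scale of the values.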

What is presented?

The data is presented as interactive UMAP plots and summarizing bar plots, displaying the expression of each gene in each cluster or single cell type, including information on cell type specificity from a body-wide perspective. The data is linked to protein expression profiles in the Tissue section, presenting the single cell type specificity as high-resolution histological images.

What is the difference between cell type, cell group, and cell class?

Below is an example illustrating the hierarchy of cell class, grouped cell type, cell type and cell type detail.

Cell type clusters often have detailed names, such as arterial, capillary or venous endothelial cells. There are 1175 cell type clusters in total.
The data includes 154 different cell types, such as vascular endothelial cells and lymphatic endothelial cells. The specificity category is available at the cell type level, as well as at the grouped cell type level.
Cell types are grouped into 53 cell type groups, which provide a general overview (the plot on gene summary pages) and result in a specificity category with less noise.
There are 15 cell classes, of which endothelial and mural cells is one. The cell class is used for the color codes and the knowledge-based summary pages.

All cluster annotations and cell type information, including the respective cell type group, are listed in the cluster data list. More details and an interactive way to explore the cell type hierarchy are found here.

Data overview

The data we present here encompasses the transcriptomes of 1,217,972 cells after cell filtering and quality control. The individual cells were grouped into 1175 cell clusters and manually annotated with both detailed and main cell type names. Clusters that met quality and confidence criteria were then integrated to define 154 final cell types, which were in turn grouped into 53 broader cell type groups.

Several lists are available for download, providing a complete overview and expression data at the different levels of detail.

Data type	Count	Description	Coverage (number of genes)
RNA expression	36	RNA read count for genes per cell across 36 datasets	20162
RNA expression	1175	RNA expression for genes across 1175 clusters	20162
RNA expression	53	RNA expression levels per gene and cell type group	20162
RNA expression	154	RNA expression levels per gene and cell type	20162