Single cell type atlas - Methods

Single cell transcriptomics datasets

In total, 34 different datasets were analyzed. These datasets were respectively retrieved from the Single Cell Expression Atlas, the Human Cell Atlas, the Gene Expression Omnibus (GEO), EMBL-EBI Biostudies and the Tabula Sapiens. The complete list of datasets and their references is shown here .

Inclusion criteria

The goal of this single cell type atlas is to comprehensively map and compare the distribution and specificity of protein-coding gene expression within cell types across the entire human body. The included datasets are based on a systematic literature search on single cell RNA sequencing (scRNAseq) studies and databases featuring healthy adult human tissue. The following data selection criteria were applied:

  1. Single cell transcriptomic datasets were limited to those based on the 10X Genomics Chromium platform (v2 or v3 chemistry);
  2. Single cell RNA sequencing was performed on single cell suspension from tissues without pre-enrichment of cell types;
  3. Only studies with > 4,000 cells and 20 million read counts were included. Crucially, we chose datasets that best portrayed the cellular diversity within each tissue.

An exception was made for the adrenal gland (~ 1.9 million reads, snRNAseq), due to the reduced availability of datasets for this tissue.

We prioritized scRNAseq datasets, however, we included single nucleus RNA sequencing (snRNAseq) datasets for the brain, heart, skeletal muscle, adipose tissue, kidney, and pituitary gland, as they provided a more accurate representation of the cell type diversity.

Data aquisition

We obtained and processed raw sequencing data for most datasets, with the exception of the brain and kidney datasets. For datasets sourced from the GEO repository (retina, pituitary gland, small intestine, colon, rectum, liver, pancreas, testis, epididymis, breast, fallopian tube, endometrium, heart, skeletal muscle, adipose tissue, skin), we extracted FASTQ files using SRA Toolkit’s fasterq-dump. FASTQ files from the Tabula Sapiens consortia (eye, lung, tongue, salivary gland, bladder, prostate, vasculature, blood, none marrow, lymph node, spleen, thymus) were directly downloaded from their S3 AWS Dataset Registry. FASTQ files of the adrenal gland, ovary, and placenta were directly downloaded through Biostudies repository; the Esophagus FASTQ files were directly downloaded from the Human Cell Atlas’ Data Explorer repository. The stomach dataset was kindly aligned for us by the original publishers (Tsubosaka et al.) under the same speficiation we explain below.

The representation of the brain we present under the single cell type atlas is based on the datasets of the cerebral cortex, cerebellum, midbrain, basal ganglia, hypothalamus, and hippocampus published by Siletti et al. From each region, we randomly sampled 20,000 neurons and 20,000 non-neuron cells. Additionally, we manually included all the neurons labelled by the pubslishers with the cluster_id 314 (to include Purkinje cells), 400 (cholinergic neurons), and 724, 725 (magnocellular secretory cells). We split the sampled datasets into two: a neuron dataset and a non-neuron dataset.

RNA quantification and cell quality control

For RNA quantification, the FASTQ files were mapped to the human reference GRCh38.p13 (Homo_sapiens.GRCh38.dna.toplevel.fa) using Ensemble 109 (Homo_sapiens.GRCh38.109.gtf). Exceptions to these alignment steps were the Brain and Kidney datasets, for which we accessed processed count data directly. FASTQ files were aligned using STAR (v2.7.11a), using the following flags: --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloCellFilter EmptyDrops_CR --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR --soloFeatures Velocyto. For each sample inside each dataset, the ambient RNA was estimated and corrected using the SoupX (v1.6.2) automated method under standard settings based on the sum-combined spliced and unspliced matrices.

Our quality control pipeline was performed using Scanpy (v1.10.1) in Python (v3.10.13). We applied a three-step tissue-specific quality control and filtering process, demanded by the integration of diverse sequencing technologies and datasets. (1) We removed the top 5% of droplets with the highest doublet score, determined by scrubulet (v0.2.3), and cells with high mitochondrial content ( >60%). For most datasets, cells under 1000 UMI counts were removed, but due to variations in sequencing depth and the risk of excluding rare cell types or those with inherently low transcriptomic content, the minimum UMI counts threshold was adjusted for several tissues: 900 UMI counts for rectum, pituitary gland, small intestine, stomach, kidney, heart, and esophagus, 800 UMI counts for breast, skin; 500 UMI counts for ovary. The adrenal gland was set at a 300 UMI count threshold due to low sequencing depth. For snRNAseq datasets: adrenal gland, heart, adipose tissue, and skeletal muscle; droplets with an unspliced transcript composition under 40%, 50%, 59%, 71% respectively were removed. (2) We filtered the top 0.1% of droplets by gene counts and UMI counts to remove technical outliers and the top 5% of droplets by mitochondrial content. (3) The third filtering step involved removing droplets with 5% or more hemoglobin-related genes (defined by GO:0005833) and the top 5% of cells by ribosomal RNA content. For the bone marrow, lymph node, spleen and blood, cells beloning to the erythrocytes and megakaryocytes lineage, as well as lymphoid and myeloid cells, were searched among the excluded cells and included back into the dataset, to characterize also those cells that have inherently low RNA content.

Cell clustering

For each dataset, the valid cells were subsequently clustered and annotated in a semi-supervised manner. Initial processing involved excluding genes not detected in at least three cells and removing the non-coding RNA MALAT1. Counts were then normalized using scanpy’s normalize_total function under standard settings, followed by a loge(X+1) transformation. The highly variable genes were selected using scanpy’s highly_variable_genes function, and based on these genes, PCA was computed. Based on the first 40 principal components, a k-nearest neighbors graph based on 15 nearest neighbors was calculated, which was subsequently used to compute the UMAP and Leiden clusters. Leiden clustering ran under standard parameters; resolution was set to 1. To obtain a more refined annotation, selected clusters were subclustered by re-running the clustering pipeline on the isolated subset of cells. If multiple cell types were revealed through sub-clustering, these were consequently annotated. If remaining doublets were revealed through clustering based these were manually removed. The clustering pipeline would then be re-run to ensure reproducible clusters based on the final set of cell barcodes.

Cell type annotation

The 1175 identified cell clusters were manually annotated based on an extensive survey of >600 well-known tissue and cell type-specific markers, including both markers from the original publications, and additional markers used in pathology diagnostics. Two levels of annotation were manually assigned based on marker expression: a detailed cell type annotation aimed for prioritising resolution inside a given dataset and a more general main cell type annotation designed for harmonization across tissue datasets. Clusters that could not be unambiguously assigned a main cell type were excluded from downstream aggregation and gene classification. The expression of the most relevant markers are presented on a heatmap under each tissue, to clarify cell type annotation.

Data normalisation

After clustering, all the cells of all 34 datasets were integrated throughout a downstream analysis pipeline based on pseudobulk. The raw counts per gene were averaged across all cells within a cluster. This resulted in 1175 pseudobulk profiles, one for each cluster. We filtered for protein-coding genes only (n = 20162), and these were scaled to counts per million (CPM), resulting in what we call protein counts per million (pCPM). The pCPM values were then TMM normalized to allow for between-cluster comparisons. TMM normalization (Trimmed means of M values) is based on NOISeq’s (v.2.46.0) implementation, specifying refColumn to ‘median’, logratioTrim to 0.3, sumTrim to 0.3, and doWeighting to FALSE. Clusters annotated as erythrocytes or platelets were excluded from the TMM normalisation factor calculations due to their low RNA content, and manually assigned a TMM factor of 1 to ensure a more accurate factor estimation. The resulting TMM normalized pCPM values we refer to as nCPM, and they are the final expression values shown in the single cell type atlas.

Data integration

To obtain a single expression value per cell type, we followed a within-dataset aggregation, followed by a cross-dataset aggregation:

  1. Within-Dataset aggregation: We computed the weighted mean of nCPM counts for clusters assigned to the same cell type within a single dataset, using the number of cells within each cluster as the weighting factor. More formally (Formula 1), x^j represents the aggregated vector of nCPM values for a cell type within a dataset. Here, xj is the nCPM vector for cluster j; n is the total number of clusters being aggregated; and wj is the cell count used as the weighting factor for cluster j.

    x ^ j = ∑ j = 1 n w j x j ∑ j = 1 n w j
    (1)
  2. Cross-dataset aggregation: The same cell types in different datasets were averaged across datasets to obtain a final transcript profile per cell type. This resulted in 154 cell types across all tissues.

Excluded from the cross-dataset aggregation and subsequent gene classification were clusters with mixed cell types, clusters with low cell type annotation confidence, and cell types within a tissue that comprised less than 30 cells or their aggregated profile contained fewer than 10,000 detected genes. We retained, however, a small number of clusters below the 30-cell threshold, provided they demonstrated more than 10,000 detected genes, to preserve representation of rare cell types. A total of 161 clusters out of 1175 clusters were excluded from the cross dataset integration and downstream analysis.

Cell type hierarchical organisation

To facilitate effective data presentation and usability, we introduced a three-level hierarchical structure to the cell type annotations: Cell Types (n = 154), Cell Type Groups (n = 53), and Cell Classes (n = 15). The Cell Type and Cell Type Group levels contain expression data used for downstream gene classification. In contrast, the Cell Classes serve an organizational purpose.

The Cell Type Group expression profile is derived from the Cell Type data using a max pooling aggregation method. For every gene within a Cell Type Group, the maximum nCPM value observed across constituent Cell Types is retained.

Gene Classification

Our downstream gene classification was performed independently on two data sets: the nCPM expression profiles at the Cell Type level and the max-pooled from the Cell Type Group level, as explained in the previous section. This classification involved assigning to each protein coding gene one of five specificity categories, one of five distribution categories and an associated Tau score. The criteria used are defined in bellow.

Gene specificity classification criteria

Category Description
Enriched nCPM in a particular cell type/cell type group at least four times any other cell type/cell type group
Group enriched nCPM in a group (of 2-15 cell types, 2-5 cell type groups) at least four times any other cell type/cell type group
Enhanced nCPM in one or several cell type/cell type group that has at least four times the mean of all cell type/cell type group
Low specificity nCPM ≥ 1 in at least one cell type/cell type group but not elevated in any cell type/cell type group
Not detected nCPM < 1 in all cell type/cell type group


Gene distribution classification criteria

Category Description
Detected in single Detected in a single cell type/cell type group
Detected in some Detected in more than one but less than one third of cell type/cell type group
Detected in many Detected in at least a third but not all cell type/cell type group
Detected in all Detected in all cell type/cell type group
Not detected nCPM < 1 in all cell type/cell type group


Tau score

A tau (Ï„) specificity score was calculated for each protein coding gene following the formula proposed by Ynai et al. Formally, it's calculation is presented bellow (Formula 2). The calculation uses the loge(nCPM+1) transformed expression levels, where n represents the total number of cell types or cell type groups, xi represents the loge(nCPM+1) expression level of gene i and x^i represents the max scaled gene expression of gene i.

τ = ∑ i = 1 n 1 - x ^ i n - 1 ; x ^ i = x i max 0 ≤ i ≤ n x i
(2)

Cell type dendrogram

The cell type dendrogram presented on the Single Cell Type resource shows the relationship between the single cell type groups based on genome-wide expression. The dendrogram is based on agglomerative clustering of 1 - Spearman's rho between cell types using Ward's criterion. The dendrogram was then transformed into a hierarchical graph, and link distances were normalized to emphasize graph connections rather than link distances. Link width is proportional to the distance from the root, and links are colored according to cell type group if only one cell type group is present among connected leaves.


Gene clustering of transcriptomics data

The RNA expression data has been used to classify protein-coding genes into expression clusters for tissues, single cell types, immune cells, and cell lines.

Clustering Number of tissues, cell types or cell lines Sample aggregation level
Tissue 78 Averaged nTPM expression per tissue type (40 HPA and 38 GTEX tissue types)
Single cell type 1175 Averaged nCPM expression per cell type cluster
Cell lines 1206 nTPM expression of individual cell lines
Immune cells 103 Averaged nTPM expression per immune cell
Brain 193 Averaged nTPM expression per brain region


Data preprocessing

For each dataset, genes with expression level > 1 in at least one of the samples were selected. The data was genewise scaled to Z-scores to account for differences in dynamic ranges between genes across samples. After scaling, the expression data was projected into a lower dimensional space using Principal Component Analysis (PCA), where a number of components were selected to satisfy Kaiser’s rule (eigenvalue ≥ 1) and explaining at least 80% of the total variance. Gene to gene distances were calculated as the Spearman correlation of gene expression across samples, and transformed to Spearman distance (1 - Spearman correlation).


Gene clustering

Based on the distances, a k-nearest neighbors (kNN) graph was computeted based on 20 nearest neighbors, which was subsequently to find clusters of similarly expressed genes via Louvain clustering. To account for the stochasticity in the louvain algorithim, the clustering was performed 100 times. The results were later collapsed into a single consensus clustering. Confidence of the gene-to-cluster assignment was calculated as the fraction of times that the gene was assigned to the cluster.


Cluster annotation

The clustering generated for each of the datasets is manually annotated to assign a specificity and function to each cluster. The annotation is based on overrepresentation analysis towards biological databases, including Gene Ontology, Reactome, PanglaoDB, TRRUST, and KEGG, as well as HPA classifications including subcellular location, protein class, secretion location and classification, and specificity toward tissues, single cell types, immune cells, brain regions, and cell lines. A reliability score is manually set for each cluster indicating the confidence of specificity and function assignment.

Clustering visualization

The clustering results are visualized in a UMAP. Colored polygons were generated to represent the main contiguous masses of genes corresponding to the same cluster. First, for each cluster, the two-dimensional density was estimated in the UMAP, and an area enveloping 95% of the total density was determined. The areas were moderated to include contiguous areas corresponding to at least 5% of the total area in the UMAP space. Finally, contiguous areas were converted to two-dimensional polygons per each cluster.