Comparing RNA and Protein

The relationship between RNA and protein expression has been investigated and debated for decades. Although the central dogma of molecular biology suggests that RNA abundance should directly translate into protein abundance, the relationship is often far more complex and highly context-dependent. Differences in RNA and protein stability, half-life, degradation rates, translational efficiency, and post-translational regulation can all contribute to substantial discrepancies between RNA and protein levels, even within the same cell type.

Technical factors also influence RNA–protein comparisons. Differences in sampling procedures, detection sensitivity, dynamic range, and experimental platforms can contribute to variation between RNA-seq and proteomics datasets. As a result, some genes show strong and predictable RNA–protein concordance, whereas others display weak or inconsistent relationships that may reflect biological regulation, protein secretion, transport, or technical limitations.

When comparing RNA and protein data, it is important to consider the technical differences between the datasets and the biological material included in each sample. Additional details regarding sample preparation and technical considerations are available on the DVP details page.

Despite these limitations, the integrated dataset contains numerous examples of strong agreement between RNA and protein, as well as clear examples of discordance, further highlighting the complex and context-dependent relationship between transcription and protein abundance.

Two complementary strategies for comparing RNA and protein

Systematic comparison of RNA and protein data remains an important challenge in molecular biology. Integrating these complementary data modalities improves our understanding of gene regulation, cellular specialization, and tissue biology.

Direct comparison between RNA and protein is inherently difficult, and most studies report low-to-moderate correlations between transcript and protein abundance. To facilitate exploration of the integrated datasets, this resource provides gene-centric comparisons in which protein intensity profiles are displayed alongside matched single-cell or single-nucleus RNAseq data on gene summary pages. Two complementary strategies are used for global comparison between RNA and protein: specificity overlap and agreement scores, each described in its own section below.

These approaches provide different perspectives on RNA–protein concordance and can be used together to explore both elevated and broadly expressed genes. In addition, users can investigate individual cell types and genes directly using the search and filtering functions.

For example, using the search fields, users can identify 6874 proteins detected in alveolar cell types, of which 66 are found exclusively in alveolar cells (relative to the 24 DVP cell types included in the classification). Furthermore, 126 proteins are classified as alveolar cell enriched, and 54 of these show overlap or partial overlap with RNAseq data.

One example is ATP-binding cassette subfamily A member 3 (ABCA3), which is detected across multiple tissues and cell types but shows consistent enrichment in lung alveolar cells at both the RNA and protein level. Interestingly, despite this shared enrichment, ABCA3 has relatively low RNA–protein agreement scores (detection agreement = 0.33; expression agreement = 0.58), indicating that relying solely on correlation or agreement metrics would give an incomplete view of biological consistency. The resource therefore enables users to explore complementary comparative strategies that integrate both qualitative overlap and quantitative agreement across data modalities.

Specificity overlap

HPA cell type specificity categories were applied to the DVP proteomics dataset and analogously to the matched RNA dataset used for comparison. Because specificity classification depends on the set of included cell types, classifications in this comparison may differ from those shown in the general Single Cell resource, which is based on a broader set of tissues and cell type clusters. For category-level comparisons, the three elevated specificity categories (cell type enriched, group enriched, and cell type enhanced) are combined into a general elevated category. This comparison is limited to genes classified as elevated in both the RNAseq and DVP datasets. Specificity overlap is classified as:

Figure 1. Bar chart showing the number of genes in each overlap category. Click a bar to view the full gene list for that category.

Figure 2. Specificity categories for DVP data (protein).

Figure 3. Specificity categories for single cell/nuclei RNAseq data (RNA).

Gene-centric classification based on RNA or protein data resulted in a few take-home messages:

Cell type comparison - rule exceptions

Most cell types contain examples of both strong and weak RNA–protein concordance. However, a small number of biologically and technically motivated exceptions were included in the overlap definitions:

Examples of overlapping profiles

Biglycan (BGN) is a leucine-rich proteoglycan and a major component of the extracellular matrix, where it contributes to collagen organization and tissue structural integrity. It is particularly abundant in connective tissues such as bone, cartilage, and vasculature, and also plays roles in inflammation, wound healing, and cellular signaling. Elevated BGN expression has been associated with tissue remodeling and fibrotic or inflammatory processes in several diseases. We see overlapping profiles in RNA and protein for BGN, classified as smooth muscle enriched in both RNA and protein data.

Asparaginase Like 1 (ASRGL1) encodes an enzyme involved in amino acid metabolism through its asparaginase and β-aspartyl peptidase activities. The protein is expressed in several tissues and is thought to contribute to protein turnover and cellular metabolic homeostasis. We see overlapping profiles in RNA and protein data for ASRGL1, classified as enriched in fallopian tube cells by both RNA and protein data.

Bactericidal Permeability Increasing Protein (BPI) is an antimicrobial protein primarily produced by neutrophils and plays an important role in innate immune defense. It binds bacterial lipopolysaccharide (LPS), neutralizes endotoxin activity, and promotes bacterial killing by increasing membrane permeability. Because of its strong antibacterial and anti-inflammatory properties, BPI is often associated with mucosal immunity and inflammatory responses. We see overlapping profiles in RNA and protein data for BPI, classified as neutrophil enriched.

Agreement scores

Two agreement scores were calculated for genes present in both datasets across all included cell types:

The detection agreement score binarizes each cell type as either detected or not detected independently for RNAseq and DVP data, and reports the proportion of cell types with concordant detection states. The expression agreement score is calculated similarly, but instead classifies each cell type as above or below the median expression of the gene within each modality. Both scores range from 0 (fully discordant) to 1 (fully concordant). Agreement scores are shown on gene summary pages and can be searched using three intervals (0–0.6, 0.6–0.8 and 0.8–1).

Agreement scores provide an alternative comparison strategy, particularly for genes classified as low cell type specificity in one or both datasets. This is especially relevant for housekeeping genes, which are broadly expressed and therefore rarely classified as elevated. In total, 5808 proteins have an expression agreement score below 0.6 and 5331 have a detection agreement score below 0.6, whereas 2295 proteins score above 0.8 for expression agreement and 4903 score above 0.8 for detection agreement.

For example, housekeeping genes such as GAPDH and LDHA show strong detection agreement despite differences in detection sensitivity and dynamic range between RNAseq and DVP data. Many housekeeping proteins involved in metabolism, translation, or protein production are classified as elevated based on protein data but not RNA data. This likely reflects the broader dynamic range and sensitivity of proteomics measurements for highly abundant proteins. GAPDH is detected in all samples and scores 1.0 for detection agreement, but only 0.33 for expression agreement because of variation in expression values. This disagreement in expression levels arises because the single-nucleus RNAseq datasets have a lower dynamic range than single-cell RNAseq, whereas the DVP data have a high dynamic range.
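
The two agreement scores described above can be sketched for a single gene as follows. This is a minimal illustration, not the resource's actual implementation. It uses the detection rules stated elsewhere on this page (an RNAseq cutoff of nCPM = 1, and protein detection defined as a non-missing intensity), but the function names, the inclusive cutoff (`>=`), the strict "above median" comparison, and the treatment of missing protein values as zero are all assumptions made for the example. The toy values are invented and do not correspond to any real gene.

```python
from statistics import median

# Assumption: the nCPM = 1 cutoff is inclusive; the page does not say.
RNA_DETECTION_CUTOFF = 1.0


def detection_agreement(rna, protein):
    """Fraction of shared cell types with the same detected/not-detected state.

    rna: dict mapping cell type -> nCPM value.
    protein: dict mapping cell type -> intensity, or None if missing.
    """
    cell_types = sorted(set(rna) & set(protein))
    if not cell_types:
        raise ValueError("no shared cell types to compare")
    concordant = sum(
        (rna[ct] >= RNA_DETECTION_CUTOFF) == (protein[ct] is not None)
        for ct in cell_types
    )
    return concordant / len(cell_types)


def _above_median(values):
    """Flag each cell type whose value is strictly above the per-gene median."""
    med = median(values.values())
    return {ct: value > med for ct, value in values.items()}


def expression_agreement(rna, protein):
    """Fraction of shared cell types with the same above/below-median state."""
    cell_types = sorted(set(rna) & set(protein))
    if not cell_types:
        raise ValueError("no shared cell types to compare")
    rna_values = {ct: rna[ct] for ct in cell_types}
    # Assumption: a missing protein value counts as zero intensity.
    protein_values = {
        ct: protein[ct] if protein[ct] is not None else 0.0 for ct in cell_types
    }
    rna_high = _above_median(rna_values)
    protein_high = _above_median(protein_values)
    concordant = sum(rna_high[ct] == protein_high[ct] for ct in cell_types)
    return concordant / len(cell_types)


# Invented toy profile for one hypothetical gene across four cell types.
toy_rna = {"type_a": 50.0, "type_b": 10.0, "type_c": 5.0, "type_d": 2.0}
toy_protein = {"type_a": 1.0e6, "type_b": 8.0e6, "type_c": 5.0e6, "type_d": 2.0e5}

print(detection_agreement(toy_rna, toy_protein))   # 1.0
print(expression_agreement(toy_rna, toy_protein))  # 0.5
```

In this toy profile the gene is detected everywhere in both modalities, so detection agreement is 1.0, while the ranking of cell types around the median differs between RNA and protein, so expression agreement drops to 0.5. This mirrors the pattern described for broadly expressed genes, where detection agreement is high but expression agreement is not.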

RNA and protein levels for enzymes often show only modest agreement because they are regulated on different timescales and through multiple post-transcriptional and post-translational mechanisms. In addition, proteins have a higher dynamic range and are buffered by stability and cellular demand, meaning changes in RNA abundance do not translate linearly into protein levels. This is observed for lactate dehydrogenase A (LDHA), which supports high-energy-demand states. LDHA is detected in all cell types, so its detection agreement score is 1, whereas its expression agreement score is 0.58. However, LDHA is classified as elevated only in the DVP data (in skeletal myofibers), whereas the RNAseq data suggest low cell type specificity.

Among the 1374 genes classified as low cell type specificity in both RNA and protein datasets, almost all (1345) show a detection agreement score above 0.6, and 463 show an expression agreement score above 0.6. Many housekeeping proteins, such as RPL24 and RPLP0, are found in this group. RPL24 shows broad expression across most cell types, although expression levels are generally slightly lower in single-nucleus RNA-seq data (brain, kidney, heart muscle and skeletal muscle). RPLP0 is classified as elevated in pancreatic islets and epithelial cells at the protein level, likely reflecting high protein production activity in these cell types.

Limitations and considerations

As described above, a small number of cell types were treated as overlap exceptions despite differences between RNA and protein localization. These exceptions primarily reflect technical limitations of DVP sampling and tissue complexity. DVP samples are enriched for specific cell types but are not composed exclusively of a pure single cell population. Small tissue fragments and neighboring cell types may therefore contribute to the detected protein signal.

Users are encouraged to inspect the associated high-resolution confocal images and tissue masks to better understand the sampled material and potential sources of overlap or discordance. The matched Single Cell Type data used for RNA-to-protein comparison is limited by the fact that several tissue types are represented by single-nucleus rather than single-cell RNAseq data. This applies to brain, kidney, heart muscle and skeletal muscle, where complex (brain), large (muscle) or tightly bound (kidney) cell types make unbiased extraction of whole cells impossible.

Technical limitations

All expression analyses require arbitrary thresholds for defining detection and specificity. For RNAseq, nCPM = 1 is used as the detection cutoff. However, genes expressed below this threshold may still be biologically relevant. Similarly, proteomics data are influenced by peptide detectability, protein abundance, dynamic range, and instrument sensitivity. Some proteins are consistently difficult to detect, whereas highly abundant proteins may dominate measurements. These technical limitations should be considered when interpreting RNA–protein concordance and specificity classifications.

Sample variation is another technical factor. For DVP, a protein value was set to missing for a cell type if more than half of its replicates lacked a measurement for that protein. In general, protein detection was defined as a non-missing intensity value. This means that there are cases where individual samples show detection although the value reported for the cell type is not detected. The downloadable data include sample-level values for users who wish to explore this further.

Blood proteins

Blood proteins frequently show weak overlap between RNA and protein localization, which is expected due to protein transport through blood circulation. For example, plasma proteins may be transcribed predominantly in hepatocytes while the corresponding proteins are detected broadly across vascularized tissues.

Integration of DVP data with single cell RNAseq data demonstrates that RNA–protein concordance varies substantially across different classes of plasma proteins: ALB and APOA4 show highly liver-specific RNA expression but widespread protein detection, PROC and APOA5 show intermediate concordance, and ASGR1 and PPY display strong RNA–protein agreement. Together, these examples highlight the heterogeneous relationship between RNA expression and circulating plasma protein abundance.

Mismatched blood proteins

Well-known blood proteins such as albumin (ALB) are highly expressed at the RNA level in hepatocytes, while the protein is detected throughout the whole body, with high levels seen in vascularized tissues such as lung and muscle.

Blood proteins with correlating RNA and protein

A total of 67 proteins secreted into blood show overlap between RNA and protein specificity profiles. One example is ASGR1, which demonstrates strong concordance between hepatocyte RNA expression and protein abundance.