Uniform Manifold Approximation and Projection (UMAP) is a technique for reducing the dimensionality of a dataset (Becht E et al. (2018)). The Cell Atlas UMAP Explorer is generated from the large collection of confocal microscopy images showing the subcellular localization patterns of human proteins. A machine learning model trained to classify the subcellular locations in these images is used to extract 1024 features from each image in the Subcellular Section of the Human Protein Atlas (Ouyang W et al. (2019)). The dimensionality of this dataset is then reduced by UMAP, and the result is displayed in a two- or three-dimensional scatter plot in which each data point represents one image. This tool provides a new way to visualize and explore the high-dimensional protein localization data that makes up the Subcellular Section. When the data points are colored according to subcellular localization, it is evident that images of proteins localizing to the same compartment tend to cluster together. Overlaying the UMAP projection with different data can help you find new features and identify interesting groups of genes in a large and complex dataset.
Clicking a data point in the plot displays the corresponding image together with information about gene name, cell line, annotated subcellular location(s), and antibody. The legend below the UMAP can be used to toggle the different subcellular locations on and off in the UMAP. Click on one location in the legend to only display data points for images with an annotation of that structure. You can select multiple subcellular locations at the same time. Clicking again on one of the selected subcellular locations will deselect it, while clicking on Clear filter will reset and display all data points in the UMAP again. Images with annotations of multiple locations, representing multilocalizing proteins, are shown in grey.
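The legend behavior described above can be sketched in a few lines of Python. This is only an illustration of the described logic, not the Explorer's actual code: the color palette, the gene names, and the helper names point_color, toggle, and visible are hypothetical. The sketch also assumes that, when several locations are selected, an image is shown if it is annotated with any of them; the text does not specify this.

```python
GREY = "grey"

# Hypothetical legend colors; the Explorer's real palette is not given in the text.
LOCATION_COLORS = {"Nucleoplasm": "blue", "Mitochondria": "red", "Cytosol": "green"}


def point_color(locations):
    """Single-location images get that location's color; multilocalizing ones are grey."""
    if len(locations) == 1:
        # Grey is also the fallback for locations missing from this illustrative palette.
        return LOCATION_COLORS.get(locations[0], GREY)
    return GREY


def toggle(selected, location):
    """Clicking a legend entry selects it; clicking it again deselects it."""
    updated = set(selected)
    if location in updated:
        updated.remove(location)
    else:
        updated.add(location)
    return updated


def visible(images, selected):
    """Show everything when nothing is selected (the 'Clear filter' state);
    otherwise show images annotated with any selected location (assumed)."""
    if not selected:
        return list(images)
    return [img for img in images if selected & set(img["locations"])]


# Placeholder records; gene names and annotations are invented for illustration.
images = [
    {"gene": "GENE_A", "locations": ["Nucleoplasm"]},
    {"gene": "GENE_B", "locations": ["Mitochondria"]},
    {"gene": "GENE_C", "locations": ["Nucleoplasm", "Cytosol"]},
]

selected = toggle(set(), "Nucleoplasm")      # select one location
shown = visible(images, selected)            # GENE_A and GENE_C
selected = toggle(selected, "Nucleoplasm")   # click again to deselect
shown_all = visible(images, selected)        # all images are shown again
```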
A strength of the HPA database is the gene-centric integration of a large collection of different datasets. The Search function allows you to search for an individual gene, but also to perform complex filtering of the data points in the UMAP. Using pre-defined search terms, images can be filtered based on general gene information (e.g. gene name or chromosome location) as well as data from all the different sections of the HPA (e.g. tissue expression or prognostic cancer association). Read more about how to use the Search function here.