
Draft:Clustergrammer

From Wikipedia, the free encyclopedia

Clustergrammer


Clustergrammer is a web-based interactive tool designed for visualizing and analyzing high-dimensional data through heatmaps. It was developed by the Ma'ayan Laboratory at the Icahn School of Medicine at Mount Sinai. The tool addresses the limitations of static heatmaps by integrating interactive features, facilitating the analysis of complex biological datasets, including genomics and proteomics data.

Introduction


Clustergrammer is a visualization tool specifically designed for high-dimensional data commonly encountered in computational biology and data science [1]. Unlike traditional static heatmaps, it enables users to explore data interactively by zooming, panning, clustering, and reordering rows and columns. The tool is applicable across various domains, including gene expression analysis, protein interaction networks, and single-cell data visualization. By leveraging web-based technologies, Clustergrammer provides accessible and shareable visualizations that simplify the interpretation of complex datasets [2].

Features


Interactive Heatmaps


Clustergrammer enables users to create interactive heatmaps that allow for dynamic exploration of data [3]. Features include:

  • Zooming and Panning: Users can navigate large datasets efficiently.
  • Filtering and Reordering: Rows and columns can be reordered by hierarchical clustering, sum, variance, or labels.
  • Search and Highlighting: Specific rows or columns can be located quickly using search functions.

An interactive heatmap generated using Clustergrammer to visualize gene expression data from the Cancer Cell Line Encyclopedia (CCLE).
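
The reordering options listed above can be illustrated outside the tool. The following is a minimal sketch in pandas, not Clustergrammer's own implementation; the matrix values, the labels, and the helper name reorder_rows are hypothetical.

```python
import pandas as pd

# Toy matrix: 4 genes (rows) x 3 samples (columns); values are illustrative only.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [10.0, 10.0, 10.0],
     [0.0, 5.0, -5.0],
     [2.0, 2.0, 2.5]],
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
    columns=["s1", "s2", "s3"],
)


def reorder_rows(matrix: pd.DataFrame, by: str) -> pd.DataFrame:
    """Return the matrix with rows sorted by 'sum', 'variance', or 'label'."""
    if by == "label":
        return matrix.sort_index()
    if by == "sum":
        key = matrix.sum(axis=1)
    elif by == "variance":
        key = matrix.var(axis=1)
    else:
        raise ValueError(f"unknown ordering: {by}")
    # Largest values first
    return matrix.loc[key.sort_values(ascending=False).index]


by_sum = reorder_rows(df, "sum")
by_variance = reorder_rows(df, "variance")
print(list(by_sum.index))
print(list(by_variance.index))
```

The same idea applies to columns by sorting along the other axis.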

Interactive Dimensionality Reduction

An interactive Clustergrammer heatmap after applying PCA to the CCLE data.

Dimensionality reduction techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) simplify high-dimensional data for visualization. Clustergrammer enhances this process by allowing users to filter rows based on sum or variance, focusing on the most informative data points. This interactive filtering helps identify how specific dimensions affect clustering patterns. For smaller datasets, it uses animations to show the impact of these changes, aiding in data interpretation.
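
How a variance filter feeds into PCA can be sketched with NumPy alone. This is an illustration under stated assumptions and not the tool's internal code: the random matrix, the helper names top_variance_rows and pca_samples, and the choice of 20 retained rows are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical matrix: 200 features (rows) x 30 samples (columns)
X = rng.normal(size=(200, 30))
X[:20] *= 5.0  # make the first 20 rows far more variable than the rest


def top_variance_rows(matrix, n_keep):
    """Indices of the n_keep rows with the largest variance."""
    variances = matrix.var(axis=1)
    return np.argsort(variances)[::-1][:n_keep]


def pca_samples(matrix, n_components=2):
    """Project samples (columns) onto the top principal components via SVD."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)  # center each feature
    # Samples are columns, so decompose the samples x features matrix
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


keep = top_variance_rows(X, 20)
coords = pca_samples(X[keep], n_components=2)  # one 2-D point per sample
print(coords.shape)
```

Changing n_keep and recomputing the projection mimics, in a static way, what the interactive filter does when a user moves the row-filter slider.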

Clustering Algorithms

An interactive Clustergrammer heatmap after applying clustering to the CCLE data.

Clustergrammer employs hierarchical clustering algorithms, with support for additional methods such as K-means clustering. Users can visualize dendrograms, toggle between clustering levels, and extract enriched clusters.

Interactive Dendrograms: Clustergrammer employs interactive dendrograms to represent hierarchical clustering of data rows and columns. Instead of displaying the entire tree, it shows one slice at a time using gray trapezoids. Users can adjust the dendrogram slider to explore different clustering levels, revealing larger or smaller clusters. Interacting with these trapezoids highlights specific clusters, provides detailed information, and allows exporting of row or column names. For gene-level data, users can send clustered genes to Enrichr for enrichment analysis, facilitating deeper biological insights.
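
The idea of viewing one slice of a hierarchical tree at a time can be sketched with SciPy. The example below builds a tree once and cuts it at two different levels; the synthetic data, the linkage settings, and the helper name slice_tree are assumptions made for illustration and are not taken from Clustergrammer's source.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Hypothetical data: three well-separated groups of 10 rows each
rows = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(10, 5)),
    rng.normal(loc=5.0, scale=0.3, size=(10, 5)),
    rng.normal(loc=10.0, scale=0.3, size=(10, 5)),
])

# Build the full tree once; each slice is then a cheap cut of the same tree
tree = linkage(rows, method="average", metric="euclidean")


def slice_tree(tree, n_clusters):
    """Cut the tree so that at most n_clusters flat clusters remain."""
    return fcluster(tree, t=n_clusters, criterion="maxclust")


coarse = slice_tree(tree, 2)  # a high slice: fewer, larger clusters
fine = slice_tree(tree, 3)    # a lower slice: more, smaller clusters
print(sorted(set(coarse.tolist())), sorted(set(fine.tolist())))
```

Moving a dendrogram slider corresponds to choosing a different cut of the same precomputed tree.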

Customization Options


The tool provides various customization features:

  • Users can adjust the opacity, highlight categories, and crop data subsets for detailed exploration.
  • Integrations with external APIs, such as Enrichr, allow for enrichment analysis directly within the visualization.

Applications


1. High-Dimensional Data Visualization

Clustergrammer is a powerful tool for analyzing large and complex datasets by creating interactive heatmaps. These visualizations enable researchers to examine high-dimensional data intuitively, even when datasets contain thousands of rows and columns. This makes it particularly useful for summarizing, filtering, and interpreting large-scale experiments or studies.

2. Gene Expression Analysis

Widely used in genomics, Clustergrammer aids in analyzing gene expression data, including single-cell RNA sequencing (scRNA-seq) [4]. By visualizing relationships among genes or samples, the tool helps researchers identify meaningful patterns, clusters, and correlations, offering insights into underlying biological processes or gene functions.

3. Biological Network Visualization

The tool is applied to represent biological networks such as protein-protein interactions, metabolic pathways, or gene regulatory networks. Clustergrammer’s clustering capabilities help pinpoint highly interconnected nodes or significant components, which are often critical in understanding the system's overall function or discovering key biomarkers.
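
A network can be shown as a heatmap once it is written as an adjacency matrix. The sketch below converts an undirected edge list into such a matrix with pandas; the edge list is a small illustrative example, and the helper name adjacency_matrix is hypothetical.

```python
import pandas as pd

# Illustrative undirected edges between proteins
edges = [("TP53", "MDM2"), ("TP53", "EP300"), ("MDM2", "EP300"), ("EGFR", "GRB2")]


def adjacency_matrix(edge_list):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    nodes = sorted({node for edge in edge_list for node in edge})
    adj = pd.DataFrame(0, index=nodes, columns=nodes)
    for a, b in edge_list:
        adj.loc[a, b] = 1
        adj.loc[b, a] = 1
    return adj


adj = adjacency_matrix(edges)
degree = adj.sum(axis=1)  # number of partners per node
print(adj)
print(degree.sort_values(ascending=False))
```

Clustering the rows and columns of such a matrix groups nodes that share interaction partners.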

4. Hierarchical Clustering

Clustergrammer supports hierarchical clustering, a method for organizing data into groups based on similarity. This is essential for categorizing features like genes, conditions, or samples into clusters, revealing relationships and structures within the data. Such clustering is especially valuable in understanding biological datasets, where interconnectedness is common.
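
The effect of hierarchical clustering on a heatmap is mainly a reordering of rows and columns so that similar items sit next to each other. The following sketch shows this with SciPy; the cosine distance, average linkage, toy data, and helper name cluster_order are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
pattern_a = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
pattern_b = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Interleave rows from two groups so the input order hides the structure
rows, labels = [], []
for i in range(3):
    rows.append(pattern_a + rng.uniform(0.01, 0.1, size=6))
    labels.append(f"a{i}")
    rows.append(pattern_b + rng.uniform(0.01, 0.1, size=6))
    labels.append(f"b{i}")
matrix = pd.DataFrame(rows, index=labels, columns=[f"s{j}" for j in range(6)])


def cluster_order(df):
    """Reorder rows and columns by the leaf order of their dendrograms."""
    row_order = leaves_list(linkage(df.values, method="average", metric="cosine"))
    col_order = leaves_list(linkage(df.values.T, method="average", metric="cosine"))
    return df.iloc[row_order, col_order]


ordered = cluster_order(matrix)
print(list(ordered.index))
```

After reordering, the rows of each group are contiguous, which is what makes blocks visible in a clustered heatmap.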

5. Single-Cell Data Analysis

In single-cell studies, Clustergrammer is instrumental in exploring datasets derived from technologies like 10X Genomics. It allows researchers to classify cells based on gene expression signatures, visualize population structures, and assess how cells relate to one another, helping to uncover novel cell types or states.

6. Comparative Data Analysis

Clustergrammer facilitates the comparison of multiple datasets or experimental conditions. By visualizing and contrasting data in heatmaps, researchers can quickly identify similarities or differences between groups, aiding in hypothesis generation or validation.

Technical details


Architecture


Clustergrammer operates on a modular architecture comprising:

  • Backend: Built using Python, with key libraries such as NumPy and SciPy for data processing.
  • Frontend: Employs JavaScript and D3.js for rendering interactive visualizations.
  • Integration: The tool supports integration with Jupyter Notebooks and REST APIs, enabling seamless workflow incorporation.

The core libraries are Clustergrammer-PY and Clustergrammer-JS.

Clustergrammer2


Clustergrammer2 is a specialized Jupyter widget that enables interactive visualization of high-dimensional biological data. Developed using widget-ts-cookiecutter [5] and the regl WebGL library [6], it focuses on analyzing single-cell datasets, particularly RNA sequencing data. The tool supports exploration of large-scale data, such as gene expression patterns across thousands of cells [7]. For example, researchers have used it to examine 2,700 PBMCs and identify cell types based on gene expression signatures.

Clustergrammer-JS


Clustergrammer-JS is a front-end JavaScript visualization library that generates interactive heatmaps in web browsers. Built on D3.js and SVG technology, it renders complex data in an explorable format with features like:

  • Data filtering options in three main categories: value-based filters (threshold-based row/column filtering, numerical criteria, and removal of sparse data points), category-based filters (grouping by metadata, toggling the visibility of specific groups, and filtering on clustering outcomes), and interactive selections (manual row/column control, visualization of data subsets, and dynamic reordering)
  • Customizable information displays on hover
  • Seamless web application integration
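
The value-based and category-based filters described above can be mimicked on a small table. The sketch below is written in Python with pandas for consistency with the other examples on this page and does not reflect the Clustergrammer-JS API; the data, the category labels, and the helper names are hypothetical.

```python
import pandas as pd

# Toy matrix with some sparse rows; values are illustrative only
df = pd.DataFrame(
    [[0.0, 0.0, 3.0, 0.0],
     [1.5, 2.0, 0.5, 4.0],
     [0.0, 0.0, 0.0, 0.0],
     [2.0, 0.0, 1.0, 3.5]],
    index=["g1", "g2", "g3", "g4"],
    columns=["c1", "c2", "c3", "c4"],
)
# Hypothetical row metadata used for category-based filtering
row_category = pd.Series(
    {"g1": "kinase", "g2": "kinase", "g3": "receptor", "g4": "receptor"}
)


def drop_sparse_rows(matrix, min_nonzero):
    """Value-based filter: keep rows with at least min_nonzero non-zero entries."""
    keep = (matrix != 0).sum(axis=1) >= min_nonzero
    return matrix.loc[keep]


def keep_category(matrix, categories, wanted):
    """Category-based filter: keep rows whose metadata label equals wanted."""
    mask = categories.reindex(matrix.index) == wanted
    return matrix.loc[mask]


dense = drop_sparse_rows(df, min_nonzero=3)
receptors = keep_category(dense, row_category, "receptor")
print(list(dense.index), list(receptors.index))
```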

The library works with JSON data produced by Clustergrammer-PY and provides developers the tools to embed dynamic visualizations in their web projects. Its source code and installation details are available in its GitHub repository [8].

Clustergrammer-PY


Clustergrammer-PY is a backend Python package that enables users to create dynamic heatmap visualizations through automated data analysis. The tool processes input data to generate JSON files that power interactive web-based displays via Clustergrammer-JS.

Key features include:

  • Data preprocessing capabilities like hierarchical clustering and multiple normalization options
  • Support for both file-based and DataFrame inputs
  • Integration with major scientific Python libraries: NumPy for matrix operations, Pandas for DataFrame processing, SciPy for statistical analysis, and scikit-learn for machine learning
  • Cross-version compatibility: support for both Python 2.7 and Python 3.x

The package handles data transformation and prepares structured JSON output suitable for visualization. Users can access it through the source code repository [9].
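
The article does not list the normalization options, so the following sketch shows one common choice, row-wise Z-scoring, written with pandas. It is an illustration of the preprocessing idea and not the package's own code; the data and the helper name zscore_rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [4.0, 4.0, 4.0],
     [10.0, 0.0, 5.0]],
    index=["r1", "constant", "r3"],
    columns=["s1", "s2", "s3"],
)


def zscore_rows(matrix):
    """Subtract each row's mean and divide by its standard deviation."""
    mean = matrix.mean(axis=1)
    std = matrix.std(axis=1, ddof=0)
    std = std.replace(0, 1.0)  # constant rows become all zeros instead of NaN
    return matrix.sub(mean, axis=0).div(std, axis=0)


z = zscore_rows(df)
print(z.round(3))
```

Z-scoring puts rows with very different scales on a comparable footing before they are clustered or colored.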

Implementation Guide


Clustergrammer is accessible through multiple platforms, including its web-based interface, Python API, and Jupyter Notebook integration. Below is a step-by-step guide to implementing Clustergrammer in various scenarios:

1. Using the Web Interface


The easiest way to use Clustergrammer is through its web interface:

  1. Visit the Clustergrammer Web Tool.[10]
  2. Upload a CSV or TSV file containing your high-dimensional data.
  3. Use the interactive heatmap to explore, filter, and cluster your data dynamically.
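
A file for upload can be prepared with pandas. The sketch below writes a small labeled matrix as a tab-separated file and reads it back; it assumes that the web tool accepts a plain matrix with row labels in the first column and column labels in the first row, and the file name and values are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical matrix: 5 genes x 3 samples, rounded for a compact file
matrix = pd.DataFrame(
    rng.normal(size=(5, 3)).round(3),
    index=[f"gene_{i}" for i in range(5)],
    columns=[f"sample_{j}" for j in range(3)],
)

# Write a tab-separated file with row and column labels
out_path = os.path.join(tempfile.mkdtemp(), "example_matrix.tsv")
matrix.to_csv(out_path, sep="\t")

# Read it back to confirm that labels and values survive the round trip
reloaded = pd.read_csv(out_path, sep="\t", index_col=0)
print(reloaded)
```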

2. Python API: Clustergrammer-PY


The Python API provides advanced users with full control over preprocessing and visualization. Follow these steps to use the API:

Step 1: Installation

Install the Clustergrammer-PY library using pip (the package is published on PyPI under the name clustergrammer):

pip install clustergrammer

Step 2: Import the Library

Start by importing the Clustergrammer-PY module:

from clustergrammer import Network

Step 3: Load and Preprocess Data

Initialize the Network object and load the data, where data is a pandas DataFrame whose index and columns hold the row and column labels:

net = Network()
net.load_df(data)

Step 4: Apply Clustering

Use the built-in clustering algorithms:

net.cluster()

Step 5: Save and Visualize Results

Save the clustered data as a JSON file for visualization:

net.write_json_to_file('viz', 'clustergrammer_output.json')

3. Jupyter Notebook Integration


To visualize Clustergrammer heatmaps directly within Jupyter Notebooks, use the Clustergrammer2 widget:

1. Install the clustergrammer2 package:

pip install clustergrammer2

2. Import and use the widget in a Jupyter Notebook:

# Class and method names as used in the clustergrammer2 example notebooks;
# they may differ between package versions
from clustergrammer2 import Network, CGM2

# Initialize a Network object backed by the Clustergrammer2 widget
net = Network(CGM2)

# Load the data (a pandas DataFrame) into the network
net.load_df(data)

# Display the interactive heatmap
net.widget()

This integration allows for seamless interaction with heatmaps during data exploration.

4. Integration with REST APIs


Clustergrammer supports REST API endpoints for automation:

  1. Prepare a JSON-formatted data file as described in the Clustergrammer documentation.
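  2. Use tools like curl or Python’s requests library to send POST requests to the API:

import requests

# Placeholder endpoint: replace with the actual URL from the Clustergrammer documentation
url = "https://clustergrammer_api_url"

# data is assumed to be a pandas DataFrame; to_json() serializes it to a JSON string
payload = {"data": data.to_json()}

# Send the POST request and stop on HTTP errors
response = requests.post(url, json=payload)
response.raise_for_status()

# Retrieve the clustered data
clustered_data = response.json()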
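
When a DataFrame is placed inside a JSON payload, passing the string returned by to_json() directly leads to a quoted string nested inside the request body. The sketch below shows one way to embed the matrix as nested JSON instead, without sending any request. The payload layout, the "split" orientation, and the helper name build_payload are assumptions; the schema actually expected by the API is not given here and should be taken from the Clustergrammer documentation.

```python
import json

import pandas as pd

data = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=["gene_a", "gene_b"],
    columns=["s1", "s2"],
)


def build_payload(df):
    """Embed the matrix as nested JSON (labels plus values), not as a quoted string."""
    # orient="split" yields the keys "columns", "index", and "data"
    return {"data": json.loads(df.to_json(orient="split"))}


payload = build_payload(data)
body = json.dumps(payload)  # what an HTTP client would send as the request body

# The receiving side can rebuild the labeled matrix from the same three keys
restored = pd.DataFrame(**json.loads(body)["data"])
print(restored)
```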

Case studies


1) Visium Spatial Transcriptomics Data from 10X Genomics[11]


This case study examines a tool for analyzing high-dimensional spatial transcriptomics data using Clustergrammer2, bqplot, and voila, focusing on the V1_Mouse_Brain_Sagittal_Anterior Visium dataset from 10x Genomics. It integrates spatial tissue data with high-dimensional gene expression analysis, offering researchers an approach to studying the mouse brain's cellular and molecular organization. By associating spatial patterns with gene expression variability, this case study is relevant for research in neuroscience and genomics.

The study combines spatial and high-dimensional data through interactive panels. The left panel displays spatial tissue data and includes a UMAP-based clustering view to organize spots by gene expression similarity. The right panel shows the top 250 variable genes across ~2,500 spots, excluding ribosomal and mitochondrial genes. Hierarchical clustering, supported by Clustergrammer2, enables the identification of co-expressed genes, visualization of tissue-specific expression, and interaction with heatmaps to explore relationships between genes, cell clusters, and spatial locations.
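
Selecting the most variable genes while excluding ribosomal and mitochondrial genes can be sketched as follows. The prefixes used to recognize those genes ("mt-", "Rps", "Rpl") are an assumption based on common mouse gene naming, and the expression values and the helper name top_variable_genes are hypothetical; the notebook used in the case study may select genes differently.

```python
import pandas as pd

# Assumed naming conventions for mouse mitochondrial and ribosomal genes
EXCLUDED_PREFIXES = ("mt-", "Rps", "Rpl")

# Toy expression matrix: genes (rows) x spots (columns)
expr = pd.DataFrame(
    [[0.0, 50.0, 0.0, 50.0],
     [9.0, 1.0, 9.0, 1.0],
     [2.0, 8.0, 2.0, 8.0],
     [0.0, 10.0, 0.0, 10.0],
     [1.0, 3.0, 1.0, 3.0],
     [5.0, 5.0, 5.0, 5.1]],
    index=["mt-Co1", "Rps3", "Rpl4", "Gad1", "Slc17a7", "Actb"],
    columns=["spot_1", "spot_2", "spot_3", "spot_4"],
)


def top_variable_genes(matrix, n_top):
    """Drop excluded genes, then keep the n_top rows with the highest variance."""
    keep = [gene for gene in matrix.index if not gene.startswith(EXCLUDED_PREFIXES)]
    filtered = matrix.loc[keep]
    ranked = filtered.var(axis=1).sort_values(ascending=False)
    return filtered.loc[ranked.index[:n_top]]


selected = top_variable_genes(expr, n_top=2)
print(list(selected.index))
```

In the case study the same kind of selection would use n_top=250 on roughly 2,500 spots.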

For enhanced interpretation, this case study incorporates single-cell RNA-seq data (~14,000 cortical cells) from the Allen Institute as a reference, facilitating cell type annotation of the Visium data. This integration aligns spatial gene expression patterns with expected cell type distributions, aiding the identification of functional cell populations and regulatory networks. By integrating spatial and high-dimensional data, this case study highlights the utility of interactive visualization tools in biological and medical research.

2) CODEX Single Cell Multiplexed Imaging Dashboard[12]


This case study examines the application of Clustergrammer2 and CODEX, a highly multiplexed cytometric approach developed by Goltsev et al. [13], to analyze spatially resolved single-cell data from mouse spleens. The dataset includes ~5,000 single cells derived from a segmented spleen image, where ~30 surface markers were measured. This combination of spatial resolution and high-dimensional data allows for a detailed examination of the cellular composition and organization within the spleen.

Clustergrammer2 was used to hierarchically cluster cells based on their marker profiles, identifying patterns of co-expression and cellular heterogeneity. Spatial context was incorporated using the Jupyter Widget bqplot, which visualized single-cell locations through Voronoi plots. The heatmap generated by Clustergrammer2 was linked to the spatial map via a dashboard built with voila, converting Jupyter notebooks into interactive web-based dashboards. This linkage enabled users to interact dynamically with the heatmap and highlight corresponding cells in the spatially resolved map, facilitating the exploration of relationships between cellular phenotypes and their spatial distribution.
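
The geometry behind a Voronoi plot of cell locations can be computed with SciPy. The sketch below derives one polygon per cell from hypothetical centroid coordinates; it only illustrates the tessellation and does not reproduce the bqplot or voila dashboard. The coordinates and the helper name cell_polygon are assumptions.

```python
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(0)
# Hypothetical cell centroids in image coordinates
cell_xy = rng.uniform(0, 100, size=(50, 2))
vor = Voronoi(cell_xy)


def cell_polygon(voronoi, cell_index):
    """Vertices of one cell's Voronoi region, or None if the region is unbounded."""
    region = voronoi.regions[voronoi.point_region[cell_index]]
    if len(region) == 0 or -1 in region:
        return None  # cells on the border have regions that extend to infinity
    return voronoi.vertices[region]


polygons = [cell_polygon(vor, i) for i in range(len(cell_xy))]
bounded = [p for p in polygons if p is not None]
print(f"{len(bounded)} of {len(polygons)} cells have a closed polygon")
```

In a linked view, selecting a cluster in the heatmap would then amount to recoloring the polygons of the cells in that cluster.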

This case study underscores the value of linked views for analyzing spatially resolved, high-dimensional single-cell data. The integration of clustering and visualization tools allows researchers to uncover meaningful biological patterns and spatial relationships. The dashboard, hosted on MyBinder, offers a replicable and accessible platform for data exploration, showcasing the potential of interactive visualization tools in advancing spatial multi-omics research.

3) scRNA-seq Gene Expression 2,700 PBMC[14]


This case study examines the application of single-cell RNA sequencing (scRNA-seq) to analyze gene expression across thousands of individual cells, offering insights into cellular heterogeneity. The dataset, consisting of 2,700 peripheral blood mononuclear cells (PBMCs) obtained from 10X Genomics, includes thousands of gene expression measurements per cell, facilitating high-dimensional analysis.

Clustergrammer2 was utilized to explore the dataset interactively. Bulk gene expression signatures from CIBERSORT were used to assign tentative cell type labels to each cell. This approach enabled the clustering of cells based on gene expression profiles and the identification of patterns of co-expression among genes, providing insights into the diversity and functionality of immune cell populations within the PBMC dataset.
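
One simple way to assign tentative labels from bulk signatures is to correlate each cell with every signature and take the best match. The sketch below does this with pandas. The article does not say which scoring method the notebook uses, so the Pearson correlation, the toy values, and the helper name assign_labels are assumptions.

```python
import pandas as pd

# Hypothetical bulk signatures: genes (rows) x cell types (columns)
signatures = pd.DataFrame(
    {"T cell": [5.0, 5.0, 0.0, 0.0], "B cell": [0.0, 0.0, 5.0, 5.0]},
    index=["CD3E", "CD3D", "CD79A", "MS4A1"],
)
# Hypothetical single-cell expression: genes (rows) x cells (columns)
cells = pd.DataFrame(
    {"cell_1": [3.0, 4.0, 0.0, 1.0, 7.0], "cell_2": [0.0, 1.0, 6.0, 2.0, 6.0]},
    index=["CD3E", "CD3D", "CD79A", "MS4A1", "ACTB"],
)


def assign_labels(cell_expr, signature_expr):
    """Label each cell with the signature it correlates with most strongly."""
    shared = cell_expr.index.intersection(signature_expr.index)
    c = cell_expr.loc[shared]
    s = signature_expr.loc[shared]
    # Z-score each column so that the scaled dot product is a Pearson correlation
    cz = (c - c.mean()) / c.std(ddof=0)
    sz = (s - s.mean()) / s.std(ddof=0)
    corr = cz.T.dot(sz) / len(shared)  # cells x cell types
    return corr.idxmax(axis=1)


labels = assign_labels(cells, signatures)
print(labels)
```

Genes that appear in only one of the two tables, such as ACTB here, are ignored by the intersection step.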

The study highlights the value of Clustergrammer2's dynamic visualization capabilities in combination with scRNA-seq data for uncovering biologically relevant patterns and relationships. The data and analysis workflow are accessible on GitHub through clustergrammer2-notebooks, allowing researchers to replicate and expand upon the analysis, underscoring its utility as a resource for studying immune system dynamics.

4) Lung Cancer Post-Translational Modification and Gene Expression Regulation[15]


This case study examines the regulation of lung cancer at the post-translational modification (PTM) level, focusing on processes such as phosphorylation, acetylation, and methylation, which play critical roles in cellular signaling and tumor progression. Collaborators at Cell Signaling Technology, Inc. used Tandem Mass Tag (TMT) mass spectrometry to measure differential PTMs across 42 lung cancer cell lines and non-cancerous lung tissues. Additionally, gene expression data for 37 of these lung cancer cell lines were sourced from the Cancer Cell Line Encyclopedia (CCLE), facilitating a comprehensive and integrative analysis.

The dataset was analyzed interactively using the Jupyter notebook CST_Data_Viz.ipynb, where Clustergrammer2 enabled visualization of PTM data, gene expression data, and their combined profiles. Through hierarchical clustering, researchers identified co-regulated clusters of PTMs and genes, which were linked to distinct lung cancer subtypes. These clusters revealed molecular pathways potentially contributing to cancer heterogeneity. Enrichment analysis further elucidated the biological processes and signaling pathways associated with these clusters, offering deeper insights into the mechanisms driving tumor progression.
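
Because the PTM and expression tables cover different sets of cell lines, a combined profile has to be restricted to the cell lines they share. The sketch below shows one way to do this with pandas; the cell line names, feature names, values, row prefixes, and the helper name combine_profiles are illustrative and are not taken from CST_Data_Viz.ipynb.

```python
import pandas as pd

# Hypothetical PTM measurements: PTM sites (rows) x cell lines (columns)
ptm = pd.DataFrame(
    [[1.2, -0.3, 0.8], [0.1, 0.9, -1.1]],
    index=["EGFR_pY1068", "AKT1_pS473"],
    columns=["H1650", "H1975", "A549"],
)
# Hypothetical expression data covering a different set of cell lines
expr = pd.DataFrame(
    [[2.0, 0.5, 1.0], [0.3, 1.8, 0.2], [1.1, 1.0, 0.9]],
    index=["EGFR", "KRAS", "TP53"],
    columns=["H1975", "A549", "H460"],
)


def combine_profiles(ptm_df, expr_df):
    """Stack both tables over their shared cell lines, tagging each row by data type."""
    shared = ptm_df.columns.intersection(expr_df.columns)
    ptm_part = ptm_df.loc[:, shared].rename(index=lambda name: f"PTM: {name}")
    expr_part = expr_df.loc[:, shared].rename(index=lambda name: f"Expr: {name}")
    return pd.concat([ptm_part, expr_part], axis=0)


combined = combine_profiles(ptm, expr)
print(combined)
```

Prefixing the row names keeps the data type visible when PTM and expression rows are clustered together in one heatmap.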

By integrating PTM and gene expression data, this case study highlights the potential to uncover novel molecular patterns and pathways specific to lung cancer subtypes. This dynamic approach advances the understanding of tumor biology and aids in identifying potential therapeutic targets, demonstrating the importance of visualization tools like Clustergrammer2 in cancer research.

References

  1. ^ Clustergrammer documentation: https://clustergrammer.readthedocs.io/
  2. ^ Fernandez, Nicolas F.; Gundersen, Gregory W.; Rahman, Adeeb; Grimes, Mark L.; Rikova, Klarisa; Hornbeck, Peter; Ma’ayan, Avi (2017). "Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data". Scientific Data. 4: 170151. doi:10.1038/sdata.2017.151.
  3. ^ "Clustergrammer Documentation". Read the Docs. Retrieved 2024-11-19.
  4. ^ Jovic, D.; Liang, X.; Zeng, H.; Lin, L.; Xu, F.; Luo, Y. (2022). "Single-cell RNA sequencing technologies and applications: A brief overview". Clinical and Translational Medicine. 12 (3): e694. doi:10.1002/ctm2.694. PMC 8964935. PMID 35352511.
  5. ^ "widget-ts-cookiecutter". GitHub.
  6. ^ "regl". GitHub.
  7. ^ "Clustergrammer2 GitHub Repository". GitHub. Icahn School of Medicine at Mount Sinai. Retrieved 2024-11-19.
  8. ^ "Clustergrammer-JS GitHub Repository". GitHub. Retrieved 2024-11-19.
  9. ^ "Clustergrammer-PY GitHub Repository". GitHub. MaayanLab. Retrieved 2024-11-19.
  10. ^ "Clustergrammer Web Tool".
  11. ^ "Visium Spatial Transcriptomics Data from 10X Genomics". GitHub.
  12. ^ "CODEX Single Cell Multiplexed Imaging Dashboard". GitHub.
  13. ^ Goltsev, Y.; Samusik, N.; Kennedy-Darling, J.; Bhate, S.; Hale, M.; Vazquez, G.; Black, S.; Nolan, G. P. (2018). "Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging". Cell. 174 (4): 968–981.e15. doi:10.1016/j.cell.2018.07.010. PMC 6086938. PMID 30078711.
  14. ^ "(scRNA-seq)". GitHub.
  15. ^ "Lung Cancer Post-Translational Modification and Gene Expression Regulation". GitHub.