Single-cell RNA sequencing (scRNA-seq) allows to investigate transcriptional programs of individual cell as well as the composition and dynamics of cell populations. In cancer studies, scRNA-seq is the fundamental method for comprehending cellular heterogeneity of tumors. Sequenced mRNA reads, identified by barcodes or unique molecular identifiers (UMI), are aligned against the reference genome and the count matrix is calculated, which captures the expression levels of genes across individual cells. Here we demonstrate the exploration of single-cell transcriptomics data using the Seurat software package [36].

We obtained scRNA-Seq data of matched blood, normal, para-cancer, polyp, and colorectal cancer tissues for four patients from the study of Zheng et al. [37] (accession code in the GEO database “GSE161277”). The data were available in the Market Exchange (MEX) format (alternatively, the HDF5 format can also be used). This dataset was used as input for the Seurat software pipeline depicted in Figures 9, 12, and 13. Additionally, a metadata file was provided containing the assignments of samples to five tissue origins - blood, normal, para-cancer, adenoma, and carcinoma – according to the annotations obtained from the GEO database. The first step in the Seurat pipeline (Figure 9) is preprocessing, which essentially means removing low quality cells. A typical filtering step, among many others, involves excluding from consideration any cells possessing fewer than a specific count of features (i.e., genes) as well as features expressed in fewer than a certain count of cells. As seen in Figure 9, we used two integer variables to control preprocessing: “minimum number of cells = 3” and “minimum number of features = 200”. Figure 10 shows the distribution of the number of genes expressed by individual cells across samples.

 

Objectives

  • Streamline Single-Cell Analysis for Colorectal Cancer: Provide a simplified, user-friendly interface through g.nome for researchers and clinicians to perform single-cell transcriptomics analysis of colorectal cancer. This objective focuses on enabling users to conduct in-depth cellular heterogeneity studies without the need for extensive bioinformatics skills, thereby accelerating research outputs.
  • Leverage Public Data To Analyze Single-Cell Data to Unravel CRC Complexity: Utilize publicly available single-cell RNA sequencing datasets from studies on colorectal cancer to deepen our understanding of the cellular composition and dynamics within CRC tissues. By leveraging g.nome’s capabilities, this initiative aims to dissect the transcriptomic layers of CRC, leading to a better understanding of the disease's molecular basis.
  • Characterize Cell Populations in CRC: Utilize g.nome to analyze publicly available single-cell RNA sequencing datasets from colorectal cancer studies. This approach aims to reveal the cellular composition and dynamics within CRC tissues, enhancing the understanding of how different cell populations contribute to cancer progression and response to treatment.
  • Define and Compare Cellular States in CRC: Employ g.nome to conduct comparative analyses of cell populations in colorectal cancer versus normal tissue. The goal is to identify distinct cellular states and their transitions within CRC, which can illuminate mechanisms of disease progression and potential intervention points.

 

Workflow

  1. Data Acquisition:
    • Source single-cell RNA sequencing data from existing studies, focusing on multiple tissue types related to colorectal cancer.
  2. Quality Control and Preprocessing:
    • Remove low-quality data points and normalize the remaining data to facilitate accurate comparative analysis.
  3. Feature Identification and Integration:
    • Identify key genetic features that vary significantly across cells and integrate data from different sources to ensure a cohesive analysis framework.
  4. Data Analysis and Interpretation:
    • Perform advanced clustering to group similar cell types and use bioinformatics tools to annotate these clusters based on known genetic markers.
    • Interpret the results to identify distinct cellular behaviors and their implications in colorectal cancer.

 

Single-cell transcriptomics analysis of colorectal cancer

Single-cell RNA sequencing (scRNA-seq) allows to investigate transcriptional programs of individual cell as well as the composition and dynamics of cell populations. In cancer studies, scRNA-seq is the fundamental method for comprehending cellular heterogeneity of tumors. Sequenced mRNA reads, identified by barcodes or unique molecular identifiers (UMI), are aligned against the reference genome and the count matrix is calculated, which captures the expression levels of genes across individual cells. Here we demonstrate the exploration of single-cell transcriptomics data using the Seurat software package [36].

We obtained scRNA-Seq data of matched blood, normal, para-cancer, polyp, and colorectal cancer tissues for four patients from the study of Zheng et al. [37] (accession code in the GEO database “GSE161277”). The data were available in the Market Exchange (MEX) format (alternatively, the HDF5 format can also be used). This dataset was used as input for the Seurat software pipeline depicted in Figures 9, 12, and 13. Additionally, a metadata file was provided containing the assignments of samples to five tissue origins - blood, normal, para-cancer, adenoma, and carcinoma – according to the annotations obtained from the GEO database. The first step in the Seurat pipeline (Figure 9) is preprocessing, which essentially means removing low quality cells. A typical filtering step, among many others, involves excluding from consideration any cells possessing fewer than a specific count of features (i.e., genes) as well as features expressed in fewer than a certain count of cells. As seen in Figure 9, we used two integer variables to control preprocessing: “minimum number of cells = 3” and “minimum number of features = 200”. Figure 10 shows the distribution of the number of genes expressed by individual cells across samples.

Screenshot 2024-05-13 at 3.50.12 PM

Figure 1. Initial Seurat scRNA-seq analysis workflow: preprocessing, normalization, identification of variable features, integration/removal of batch effects, and scaling of data.

 

Screenshot 2024-05-13 at 3.50.25 PM

Figure 2. Distribution of the number of genes expressed by individual cells, highlighting low-quality cells and potential batch effects. A dashed line marks the threshold of 200 genes, considered a suitable minimum.

 

Next, data needs to be normalized to account for sequencing depth and to enable the comparison of gene expression between different cells. Expression values of individual genes (feature counts) for each cell are divided by the total expression for that cell, multiplied by a scaling factor, and then log transformed. The most informative features (genes) that exhibit high cell-to-cell variation are identified using the Seurat function FindVariableFeatures. In Figure 11 the top 2000 most variable features are shown in red while genes that are not significantly variable between cells are marked black. For the top 10 most variable features gene names are also provided; nine of these genes are immunoglobulins and DEF5A is also involved in immune response.  To find shared cell types and states single-cell datasets obtained from different patients, batches, and conditions must be harmonized using the integration step of the Seurat pipeline. This is achieved by pinpointing the pairwise correspondences of cells between individual datasets and then transforming datasets into a shared space [38]. Finally, the data gets scaled by the standard deviation of each feature and centered to have a mean of zero.

Screenshot 2024-05-13 at 3.55.46 PM

Figure 3. Visualization of the most variable genes, with the top 2000 shown in red. The most variable features include several immunoglobulins and the gene DEF5A, related to immune response.

 

The second part of the Seurat workflow implemented in g.nome (Figure 12) is aimed at identifying cellular populations by grouping similar cells based on the expression patterns of the 2000 most variable genes. This means that the dimensionality of the problem has already been reduced roughly ten-fold, and further dimensionality reduction is typically conducted by principal component analysis (PCA). The next two steps – FindNeighbours and FindClusters  – are used to find the closest data points to each cell and then to group neighboring cells into clusters. Finally, these groups are visualized by way of an unannotated single cell transcriptomic map (shown in Figure 14, left panel) using mathematical techniques such as t-distributed stochastic neighbor embedding (t-SNE) [39] or uniform manifold approximation and projection (UMAP) [40]. In our case, this map contains 37 different cell clusters. 

Screenshot 2024-05-13 at 3.50.35 PM

Figure 4. Workflow for identifying cellular populations: grouping cells by expression patterns of variable genes using dimensionality reduction techniques like PCA.

 

Screenshot 2024-05-13 at 3.50.40 PM

Figure 5. Single-cell transcriptomic map visualizing clusters of similar cells using techniques such as t-SNE or UMAP. This map shows 37 distinct cell clusters.

 

Obtaining clusters of cells is relatively straightforward. A much more challenging task is to associate the resulting clusters with biological states. Using the singleR package [41] it is possible to automatically annotate cell identities by correlating reference transcriptomic data sets of pure cell types with single-cell gene expression (Figure 14, right panel). In our case we used the Human Primary Cell Atlas [42] as the reference. SingleR identifies variable genes among the cell types in the reference set and then computes Spearman correlation between the input single-cell transcriptome with each sample in the reference set. The resulting annotation is a useful first approximation, but it is entirely generic and has limited relevance for cancer cell analysis. To derive CRC-specific cell annotation we applied the marker-based strategy, which exploits known relationships between a set of marker genes and cell types. Following the approach described by Zheng et al. [37] we investigated the expression of 43 canonical CRC-related marker genes, which are uniquely or predominantly expressed in CRC cells. For example, genes GUCA2A, MUC2, and DCN are known to exhibit high expression levels in enterocytes, goblets, and fibroblasts in colorectal cancer samples. The resulting t-SNE characterization of CRC cells is presented in Figure 15.

image-png-May-06-2024-06-57-55-6253-PM image-png-May-06-2024-06-58-01-9783-PM

 

Figure 6. Cell type annotation process using the SingleR package, which correlates known transcriptomic data sets with the single-cell data to assign cell identities.

 

image-png-May-06-2024-06-58-22-2446-PM

 

Figure 7. Detailed t-SNE plot annotated based on the expression of 43 canonical colorectal cancer (CRC) marker genes, highlighting cells like enterocytes and fibroblasts associated with CRC.

 

Conclusions

The study effectively utilized g.nome to analyze single-cell RNA sequencing data from colorectal cancer tissues, revealing distinct cellular populations and states. By comparing these to normal tissues, we enhanced our understanding of cellular dynamics and their roles in cancer progression.