Single-cell RNA sequencing (scRNA-seq) allows to investigate transcriptional programs of individual cell as well as the composition and dynamics of cell populations. In cancer studies, scRNA-seq is the fundamental method for comprehending cellular heterogeneity of tumors. Sequenced mRNA reads, identified by barcodes or unique molecular identifiers (UMI), are aligned against the reference genome and the count matrix is calculated, which captures the expression levels of genes across individual cells. Here we demonstrate the exploration of single-cell transcriptomics data using the Seurat software package [36].
We obtained scRNA-Seq data of matched blood, normal, para-cancer, polyp, and colorectal cancer tissues for four patients from the study of Zheng et al. [37] (accession code in the GEO database “GSE161277”). The data were available in the Market Exchange (MEX) format (alternatively, the HDF5 format can also be used). This dataset was used as input for the Seurat software pipeline depicted in Figures 9, 12, and 13. Additionally, a metadata file was provided containing the assignments of samples to five tissue origins - blood, normal, para-cancer, adenoma, and carcinoma – according to the annotations obtained from the GEO database. The first step in the Seurat pipeline (Figure 9) is preprocessing, which essentially means removing low quality cells. A typical filtering step, among many others, involves excluding from consideration any cells possessing fewer than a specific count of features (i.e., genes) as well as features expressed in fewer than a certain count of cells. As seen in Figure 9, we used two integer variables to control preprocessing: “minimum number of cells = 3” and “minimum number of features = 200”. Figure 10 shows the distribution of the number of genes expressed by individual cells across samples.
Discover more insights in our full CRC whitepaper, including additional methodologies like differential expression and survival analysis:
Workflow
Single-cell transcriptomics analysis of colorectal cancer
Single-cell RNA sequencing (scRNA-seq) allows to investigate transcriptional programs of individual cell as well as the composition and dynamics of cell populations. In cancer studies, scRNA-seq is the fundamental method for comprehending cellular heterogeneity of tumors. Sequenced mRNA reads, identified by barcodes or unique molecular identifiers (UMI), are aligned against the reference genome and the count matrix is calculated, which captures the expression levels of genes across individual cells. Here we demonstrate the exploration of single-cell transcriptomics data using the Seurat software package [36].
We obtained scRNA-Seq data of matched blood, normal, para-cancer, polyp, and colorectal cancer tissues for four patients from the study of Zheng et al. [37] (accession code in the GEO database “GSE161277”). The data were available in the Market Exchange (MEX) format (alternatively, the HDF5 format can also be used). This dataset was used as input for the Seurat software pipeline depicted in Figures 9, 12, and 13. Additionally, a metadata file was provided containing the assignments of samples to five tissue origins - blood, normal, para-cancer, adenoma, and carcinoma – according to the annotations obtained from the GEO database. The first step in the Seurat pipeline (Figure 9) is preprocessing, which essentially means removing low quality cells. A typical filtering step, among many others, involves excluding from consideration any cells possessing fewer than a specific count of features (i.e., genes) as well as features expressed in fewer than a certain count of cells. As seen in Figure 9, we used two integer variables to control preprocessing: “minimum number of cells = 3” and “minimum number of features = 200”. Figure 10 shows the distribution of the number of genes expressed by individual cells across samples.
Figure 1. Initial Seurat scRNA-seq analysis workflow: preprocessing, normalization, identification of variable features, integration/removal of batch effects, and scaling of data. |
Figure 2. Distribution of the number of genes expressed by individual cells, highlighting low-quality cells and potential batch effects. A dashed line marks the threshold of 200 genes, considered a suitable minimum. |
Next, data needs to be normalized to account for sequencing depth and to enable the comparison of gene expression between different cells. Expression values of individual genes (feature counts) for each cell are divided by the total expression for that cell, multiplied by a scaling factor, and then log transformed. The most informative features (genes) that exhibit high cell-to-cell variation are identified using the Seurat function FindVariableFeatures. In Figure 11 the top 2000 most variable features are shown in red while genes that are not significantly variable between cells are marked black. For the top 10 most variable features gene names are also provided; nine of these genes are immunoglobulins and DEF5A is also involved in immune response. To find shared cell types and states single-cell datasets obtained from different patients, batches, and conditions must be harmonized using the integration step of the Seurat pipeline. This is achieved by pinpointing the pairwise correspondences of cells between individual datasets and then transforming datasets into a shared space [38]. Finally, the data gets scaled by the standard deviation of each feature and centered to have a mean of zero.
Figure 3. Visualization of the most variable genes, with the top 2000 shown in red. The most variable features include several immunoglobulins and the gene DEF5A, related to immune response. |
The second part of the Seurat workflow implemented in g.nome (Figure 12) is aimed at identifying cellular populations by grouping similar cells based on the expression patterns of the 2000 most variable genes. This means that the dimensionality of the problem has already been reduced roughly ten-fold, and further dimensionality reduction is typically conducted by principal component analysis (PCA). The next two steps – FindNeighbours and FindClusters – are used to find the closest data points to each cell and then to group neighboring cells into clusters. Finally, these groups are visualized by way of an unannotated single cell transcriptomic map (shown in Figure 14, left panel) using mathematical techniques such as t-distributed stochastic neighbor embedding (t-SNE) [39] or uniform manifold approximation and projection (UMAP) [40]. In our case, this map contains 37 different cell clusters.
Figure 4. Workflow for identifying cellular populations: grouping cells by expression patterns of variable genes using dimensionality reduction techniques like PCA. |
Figure 5. Single-cell transcriptomic map visualizing clusters of similar cells using techniques such as t-SNE or UMAP. This map shows 37 distinct cell clusters. |
Obtaining clusters of cells is relatively straightforward. A much more challenging task is to associate the resulting clusters with biological states. Using the singleR package [41] it is possible to automatically annotate cell identities by correlating reference transcriptomic data sets of pure cell types with single-cell gene expression (Figure 14, right panel). In our case we used the Human Primary Cell Atlas [42] as the reference. SingleR identifies variable genes among the cell types in the reference set and then computes Spearman correlation between the input single-cell transcriptome with each sample in the reference set. The resulting annotation is a useful first approximation, but it is entirely generic and has limited relevance for cancer cell analysis. To derive CRC-specific cell annotation we applied the marker-based strategy, which exploits known relationships between a set of marker genes and cell types. Following the approach described by Zheng et al. [37] we investigated the expression of 43 canonical CRC-related marker genes, which are uniquely or predominantly expressed in CRC cells. For example, genes GUCA2A, MUC2, and DCN are known to exhibit high expression levels in enterocytes, goblets, and fibroblasts in colorectal cancer samples. The resulting t-SNE characterization of CRC cells is presented in Figure 15.
Figure 6. Cell type annotation process using the SingleR package, which correlates known transcriptomic data sets with the single-cell data to assign cell identities. |
Figure 7. Detailed t-SNE plot annotated based on the expression of 43 canonical colorectal cancer (CRC) marker genes, highlighting cells like enterocytes and fibroblasts associated with CRC. |
Conclusions
The study effectively utilized g.nome to analyze single-cell RNA sequencing data from colorectal cancer tissues, revealing distinct cellular populations and states. By comparing these to normal tissues, we enhanced our understanding of cellular dynamics and their roles in cancer progression.
Interested in more? Discover additional insights by downloading the full document.