Colorectal cancer (CRC) is a leading cause of cancer-related mortality worldwide, accounting for almost 10% of cancer deaths.¹ The incidence of CRC is rising due to population aging and various dietary, lifestyle, and environmental factors.¹ Genetic factors, including mutations in tumor suppressor genes and oncogenes, play a crucial role in CRC pathophysiology and often drive changes in gene expression.² Studying these transcriptomic changes provides insights into the molecular mechanisms underlying CRC and enables the identification of differentially expressed genes, which can serve as biomarkers for diagnosis, prognosis, and treatment response.³
Public RNA-seq data are an invaluable resource for such studies: re-analyzing them allows researchers to validate findings and discover novel insights. However, acquiring the data, running a pipeline at scale, and creating visualizations to derive insights can be challenging. Accessing large public datasets requires efficient data management and substantial computational resources.⁴ Running bioinformatics pipelines involves integrating various tools and ensuring reproducibility and scalability, which can be complex and time-consuming. Additionally, creating meaningful visualizations that accurately represent the data and uncover significant patterns requires advanced skills and tools.⁵
The Almaden Genomics g.nome® platform makes it easier to manage and analyze large RNA-seq datasets. g.nome streamlines the end-to-end process, from data acquisition to visualization, helping researchers quickly uncover valuable insights into gene expression patterns. To demonstrate this, we conducted a differential gene expression analysis on RNA-seq data from CRC patients, re-analyzing four transcriptomics datasets from the Sequence Read Archive (SRA) referenced in the study by Hosseini & Nemati.⁵ These datasets included samples of colorectal tumor tissues and adjacent normal tissues, providing a rich resource for understanding the molecular mechanisms of CRC. This re-analysis highlights the value of public data in validating findings and discovering new insights.
Differential expression analysis helps to uncover the molecular mechanisms underlying cancer progression by revealing genes that are significantly upregulated or downregulated in cancerous tissues compared to normal tissues. We analyzed differential gene expression between tumor and adjacent normal tissues of CRC patients, based on the four transcriptomics datasets from the Sequence Read Archive (SRA) used in the recent publication of Hosseini & Nemati⁵ (Table 1).
| SRA Accession Number | # of Colorectal Tumor Tissues | # of Adjacent Normal Tissues | Study Name |
| --- | --- | --- | --- |
| SRP219837 | 5 | 5 | Epigenomics landscape of colorectal cancer |
| SRP301216 | 5 | 6 | Identification of genes associated with the onset of colorectal cancer by transcriptomic analyses of the adenoma-carcinoma sequence |
| SRP344867 | 5 | 4 | RNA Sequencing in Adenoma-Cancer Transition |
| SRP245232 | 3 | 3 | RNA-seq of CRC tissues |

Table 1. RNA-seq datasets used in this study.
Figure 1. Selecting data for the guided workflow in g.nome.
The g.nome platform provides a user-friendly, guided workflow to simplify the process of downloading data for bioinformatics analysis. Here we see the setup for executing the nf-core fetchngs workflow. This workflow facilitates the efficient fetching of NGS (Next-Generation Sequencing) data from public repositories.
By using g.nome's guided workflow, researchers can streamline data acquisition processes, reduce setup time, and minimize errors, making it easier to manage large-scale RNA-seq datasets for further analysis.
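For readers who want to see what the data acquisition step looks like outside the guided interface, the following is a minimal sketch that assumes the nf-core fetchngs workflow is run directly from the command line rather than through g.nome. It builds the accession list from Table 1 and assembles the launch command without executing it. The file name `ids.csv`, the output directory, and the `docker` profile are illustrative assumptions, not settings taken from the platform.

```python
# Study-level SRA accessions listed in Table 1.
STUDY_ACCESSIONS = ["SRP219837", "SRP301216", "SRP344867", "SRP245232"]


def id_list_text(accessions):
    """Return the accession list as text, one ID per line.

    fetchngs reads its input as a plain list of database IDs,
    one per line, so no header row is written.
    """
    return "\n".join(accessions) + "\n"


def build_fetchngs_command(id_file, outdir, profile="docker"):
    """Assemble the launch command as an argument list.

    Nothing is executed here; the list could be handed to
    subprocess.run() on a machine with Nextflow installed.
    The profile name is an assumption for illustration.
    """
    return [
        "nextflow", "run", "nf-core/fetchngs",
        "--input", str(id_file),
        "--outdir", str(outdir),
        "-profile", profile,
    ]


if __name__ == "__main__":
    print(id_list_text(STUDY_ACCESSIONS), end="")
    print(" ".join(build_fetchngs_command("ids.csv", "fetchngs_results")))
```

Keeping the accession list in a file rather than on the command line makes the download step easy to version and rerun alongside the analysis.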
Figure 2. Using the guided workflow to process the data.
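Here we demonstrate how to use the g.nome platform for bulk RNA-seq data analysis. The step-by-step process, described in the remainder of this article, covers read pre-processing, alignment to the human genome, gene-level quantification, and differential expression analysis.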
Using the g.nome platform, researchers can efficiently conduct comprehensive RNA-seq data analyses, from data acquisition to visualization, thereby accelerating the discovery of new insights in genomic research.
Figure 3. MA plots, heatmaps, and volcano plots are among the visualizations used for differential expression analysis.
Pre-processing of read files was performed using fastp. According to the fastp report, approximately 95% and 90% of bases in the sequencing run SRR16832117 had base quality values better than 20 and 30, respectively. fastp reports the number of reads that passed the quality filter as well as the number of reads that failed due to low quality, too many unresolved bases, or insufficient length. It also reports the number of reads with adapters trimmed and the number of bases trimmed due to adapters.
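To make the base quality percentages concrete, here is a small sketch of how the fraction of bases at or above a Phred threshold can be computed from FASTQ quality strings. It assumes the standard Phred+33 encoding and counts bases at or above the threshold. It is not fastp's implementation, and the quality strings are made up for illustration. Because every base that reaches Q30 also reaches Q20, the Q30 fraction can never exceed the Q20 fraction.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def fraction_at_or_above(quality_strings, threshold, offset=33):
    """Fraction of all bases whose Phred score is >= threshold."""
    total = 0
    passing = 0
    for qs in quality_strings:
        for q in phred_scores(qs, offset):
            total += 1
            if q >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Hypothetical quality strings: 'I' is Q40, '5' is Q20, '#' is Q2.
example_qualities = ["IIII5555", "II##"]
q20_rate = fraction_at_or_above(example_qualities, 20)
q30_rate = fraction_at_or_above(example_qualities, 30)

if __name__ == "__main__":
    print(f"Q20 rate: {q20_rate:.3f}, Q30 rate: {q30_rate:.3f}")
```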
Figure 4. Read quality assessment by FastQC. The graph shows per base sequence quality for forward reads corresponding to the sequencing run SRR16832117 from the study of Orouji et al.¹²
Filtered forward and reverse reads were aligned against the pre-indexed reference sequence of the human genome using the STAR spliced read alignment algorithm. The alignments are stored in the widely used BAM (Binary Alignment Map) format. Using SRR16832117 as an example, there were 39 million reads with an average length of 294 bases, of which 92% were uniquely mapped to the genome sequence and 4.6% were mapped to multiple loci; the remaining reads could not be mapped for various reasons. In the final processing step, g.nome uses Picard tools to verify mate-pair information between reads and to mark duplicates originating from a single fragment of DNA.
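The mapping statistics quoted above are of the kind STAR writes to its `Log.final.out` summary, where each line holds a label and a value separated by a `|` character. The sketch below parses such text and derives the share of reads that were neither uniquely nor multiply mapped. The log excerpt is a mock-up built from the rounded figures in the text, not the real log for SRR16832117, and the helper names are hypothetical.

```python
def parse_star_log(text):
    """Parse 'label | value' lines from a STAR Log.final.out-style summary."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Read a percentage field such as '92.00%' as a float."""
    return float(stats[key].rstrip("%"))


# Mock-up excerpt using the rounded values quoted in the text.
EXAMPLE_LOG = """\
                          Number of input reads |\t39000000
                      Average input read length |\t294
                        Uniquely mapped reads % |\t92.00%
             % of reads mapped to multiple loci |\t4.60%
"""

stats = parse_star_log(EXAMPLE_LOG)
unique = percent(stats, "Uniquely mapped reads %")
multi = percent(stats, "% of reads mapped to multiple loci")
# Everything else: reads mapped to too many loci or left unmapped.
remainder = round(100.0 - unique - multi, 2)

if __name__ == "__main__":
    print(f"unique={unique}%  multi={multi}%  other={remainder}%")
```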
The high-quality alignments stored in BAM files serve as input to the main workflow for differential expression analysis. Note that the pre-processing and alignment workflow is used here as a sub-workflow. This illustrates a useful feature of g.nome: simple sub-workflows can be combined hierarchically into more complex processing pipelines, and specific workflows can be reused in different contexts. The featureCounts algorithm quantifies the abundance of RNA transcripts in sequencing data by assigning reads to genes based on their alignment to the human genome. The resulting gene count matrix is used by DESeq2 to compare gene expression levels across experimental conditions or groups and to identify differentially expressed genes.
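Before counts from different samples can be compared, they have to be corrected for differences in sequencing depth. By default, DESeq2 does this with median-of-ratios size factors. The following is a simplified pure-Python sketch of that idea, written for illustration only. It is not DESeq2's code, and the toy count matrix is invented.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a gene-by-sample count matrix.

    counts: list of rows (one per gene), each a list of per-sample counts.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy matrix: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts, toy_factors)

if __name__ == "__main__":
    print(toy_factors)
```

In the toy matrix, the second sample's size factor is twice the first's, so the depth difference cancels out after normalization and the first three genes show equal normalized counts in both samples.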
Key metrics include the log2 fold change (indicating the magnitude and direction of the change in gene expression) and the adjusted p-value (measuring the statistical significance of the observed change after correction for multiple testing). The top 10 differentially expressed genes identified in this analysis are highlighted in the volcano plot (Figure 5) and listed in Table 2. These genes have known roles in CRC; for example, KRT23 promotes CRC cell proliferation, and ETV4 enhances tumor progression and metastasis.
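The two metrics can be illustrated with a short sketch. The fold change function below uses a simple ratio of mean normalized counts with a pseudocount, which is an assumption made for illustration; DESeq2 itself estimates fold changes from a negative binomial model. The p-value adjustment follows the Benjamini-Hochberg procedure, which is the default adjustment method in DESeq2. The input values are invented.

```python
import math


def log2_fold_change(mean_tumor, mean_normal, pseudocount=1.0):
    """Log2 ratio of mean normalized counts (tumor over normal).

    The pseudocount avoids division by zero; it is an illustrative
    choice, not how DESeq2 computes its estimates.
    """
    return math.log2((mean_tumor + pseudocount) / (mean_normal + pseudocount))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
example_p = [0.01, 0.04, 0.03, 0.005]
example_padj = benjamini_hochberg(example_p)

if __name__ == "__main__":
    print(log2_fold_change(7, 1))
    print(example_padj)
```

A positive log2 fold change means higher expression in tumor tissue; each unit corresponds to a doubling.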
Figure 5. Volcano plot visualizing the genes differentially expressed between CRC samples and normal adjacent tissues. The top ten differentially expressed genes listed in Table 2 are highlighted by larger circles.
| Gene Name | Log2 Fold Change | Adjusted p-value | Protein Name | Known Roles in Colorectal Cancer |
| --- | --- | --- | --- | --- |
| KRT23 | 6.7 | 6.9e-23 | Keratin, type I | Promotes CRC growth by activating human telomerase reverse transcriptase, associated with CRC cell proliferation and migration. [6] |
| ETV4 | 4.0 | 1.7e-22 | ETS translocation variant 4 | Enhances CRC cell proliferation, invasion, and metastasis, influences the tumor microenvironment. [7, 8] |
| CPNE7 | 4.6 | 8.0e-22 | Copine-7 | Suggested as a prognostic factor and therapeutic target in CRC. [9] |
| GRIN2D | 4.1 | 6.4e-21 | Glutamate receptor ionotropic | Angiogenic tumor endothelial marker specific to CRC vessels, correlated with improved survival in CRC patients. [10] |
| BEST4 | 6.1 | 3.1e-20 | Bestrophin-4 | Suppresses epithelial-to-mesenchymal transition in CRC cells. [11] |
| STRA6 | 4.5 | 6.6e-19 | Receptor for retinol uptake | Plays a key role in colon cancer stem cell maintenance and contributes to high-fat diet-induced colon carcinogenesis. [12] |
| CDH3 | 5.8 | 2.0e-18 | Cadherin-3 | Elevated expression in CRC, demethylation linked to advanced CRC. [13, 14] |
| KLK6 | 9.5 | 1.9e-17 | Kallikrein-6 | Prognostic biomarker due to dramatic upregulation in CRC, key role in colon cancer cell migration and invasiveness. [15, 16] |
| LARGE2 | 3.7 | 3.1e-17 | Xylosyl- and glucuronyltransferase | Increased levels in CRC compared to benign colonic epithelial cells, involved in CRC cell migration and adhesion. [17] |
| MMP7 | 7.5 | 1.2e-16 | Matrilysin | Required for CRC tumor formation, affects drug resistance. [18, 19] |

Table 2. Top 10 genes differentially expressed between tumor and normal tissues.
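To help interpret the values in Table 2, the sketch below converts each log2 fold change to a linear fold change and each adjusted p-value to the negative log10 scale commonly used on the vertical axis of a volcano plot. The gene values are copied from Table 2. The helper names are hypothetical, and the choice of axes is an assumption about the usual volcano plot convention rather than a description of how Figure 5 was drawn.

```python
import math

# (gene, log2 fold change, adjusted p-value) as reported in Table 2.
TABLE_2 = [
    ("KRT23", 6.7, 6.9e-23),
    ("ETV4", 4.0, 1.7e-22),
    ("CPNE7", 4.6, 8.0e-22),
    ("GRIN2D", 4.1, 6.4e-21),
    ("BEST4", 6.1, 3.1e-20),
    ("STRA6", 4.5, 6.6e-19),
    ("CDH3", 5.8, 2.0e-18),
    ("KLK6", 9.5, 1.9e-17),
    ("LARGE2", 3.7, 3.1e-17),
    ("MMP7", 7.5, 1.2e-16),
]


def linear_fold_change(log2_fc):
    """Convert a log2 fold change to a plain expression ratio."""
    return 2.0 ** log2_fc


def volcano_point(log2_fc, padj):
    """Return (x, y) coordinates under the usual volcano plot convention."""
    return (log2_fc, -math.log10(padj))


# Gene with the largest fold change in the table.
largest = max(TABLE_2, key=lambda row: row[1])

if __name__ == "__main__":
    for gene, lfc, padj in TABLE_2:
        x, y = volcano_point(lfc, padj)
        print(f"{gene}: {linear_fold_change(lfc):.0f}-fold, y={y:.1f}")
```

For example, the log2 fold change of 9.5 for KLK6 corresponds to roughly 724-fold higher expression in tumor tissue than in adjacent normal tissue.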
Re-analyzing public RNA-seq data using the g.nome platform demonstrates the potential of leveraging existing datasets to gain new insights into cancer biology. This approach not only validates previous findings but also allows for the discovery of novel biomarkers and therapeutic targets. The g.nome platform's ease of use and robust analytical capabilities make it an invaluable tool for researchers aiming to perform complex bioinformatics analyses with minimal programming effort.