Survival analysis is vital in cancer research, offering insights into patient outcomes and treatment efficacy. Key clinical endpoints include Overall Survival (OS), which measures the time until death from any cause. While OS is straightforward to ascertain, it also counts deaths unrelated to cancer, which can skew study results. Consequently, measures like the progression-free interval (PFI) are often preferred in clinical trials. The Cancer Genome Atlas (TCGA) project has compiled extensive clinicopathological and molecular data for over 11,000 tumors across 33 cancer types.



  1. Streamline Survival Analysis for Colorectal Cancer: Provide a simplified, user-friendly interface through g.nome® for researchers and clinicians to perform survival analysis of colorectal cancer patients. This objective focuses on enabling users to conduct in-depth survival studies without the need for extensive bioinformatics skills, thereby accelerating research outputs.
  2. Leverage Public Data for Survival Analysis to Unravel CRC Complexity: Utilize publicly available TCGA datasets from studies on colorectal cancer to deepen our understanding of the genetic and molecular factors influencing patient survival. By leveraging g.nome’s capabilities, this initiative aims to dissect the prognostic layers of CRC, leading to a better understanding of the disease's molecular basis.
  3. Identify Prognostic Biomarkers in CRC: Utilize g.nome to analyze publicly available gene expression data from colorectal cancer studies. This approach aims to reveal key biomarkers that influence patient survival, enhancing the understanding of how different genetic factors contribute to cancer prognosis and response to treatment.
  4. Compare Survival Outcomes Based on Genetic Expression: Employ g.nome to conduct comparative analyses of survival outcomes in colorectal cancer patients with varying levels of gene expression. The goal is to identify distinct genetic markers and their impact on patient prognosis, which can illuminate mechanisms of disease progression and potential intervention points.



We analyzed clinicopathologic annotations and gene expression profiles from 459 colorectal cancer patients, sourced from the Integrated TCGA Pan-Cancer Clinical Data Resource. Using the survival R package, we split patients into high and low gene expression groups, generated Kaplan-Meier curves for each group, and compared survival times between the groups with log-rank tests.
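To make the estimator behind these curves concrete, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit calculation. It is an illustration only, not the survival package's implementation or the g.nome pipeline; the function name `kaplan_meier` and the toy follow-up data are invented for this example.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a death was observed.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving this time point.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data (hypothetical, not from TCGA): six patients, two of them censored.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored patients lower the number at risk without producing a step in the curve, which is why the estimate drops only at observed death times.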



  1. Data Acquisition:
    • Source gene expression data and clinicopathologic annotations from TCGA for colorectal cancer patients.
  2. Quality Control and Preprocessing:
    • Remove low-quality data points and normalize the remaining data to facilitate accurate survival analysis.
  3. Feature Identification and Integration:
    • Identify key genetic features that vary significantly across patients and integrate data from different sources to ensure a cohesive analysis framework.
  4. Data Analysis and Interpretation:
    • Group patients by gene expression level, perform Kaplan-Meier survival analysis on the resulting groups, and use log-rank tests to identify significant differences in survival times.
    • Interpret the results to identify genetic markers associated with prognosis and their implications in colorectal cancer.


Data Acquisition

Gene expression data and clinicopathologic annotations for colorectal cancer patients were obtained from The Cancer Genome Atlas (TCGA) via the Pan-Cancer Atlas initiative. This resource, accessible through the GDC Pan-Cancer Atlas page, offers a comprehensive and standardized dataset across multiple cancer types. For this study, we extracted data specific to colorectal cancer, encompassing detailed gene expression profiles and clinical information for 459 patients. The data were preprocessed to eliminate low-quality entries and normalize expression levels, providing a robust foundation for subsequent survival analysis.

PCA Results

The Principal Component Analysis (PCA) plot in Figure 1 depicts the variation in gene expression data among colorectal cancer patients. Each point represents an individual patient, with their gene expression profile projected onto the first two principal components. The plot reveals a central clustering of points, indicating that the majority of patients exhibit similar gene expression patterns and suggesting a common molecular signature within the cohort. At the same time, the spread of points along both principal components shows that expression profiles still vary among patients. This variability may correspond to different molecular subtypes of colorectal cancer or to differences in disease progression and treatment response. Identifying these patterns is crucial for understanding the heterogeneity of colorectal cancer and could inform the development of personalized therapeutic strategies.


Figure 1. Principal Component Analysis (PCA) of Gene Expression Data in Colorectal Cancer Patients. The plot illustrates a dense central cluster of patients with similar gene expression patterns, alongside a distribution indicating variability within the cohort.
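As a rough illustration of the projection shown in Figure 1, the sketch below computes scores on the first two principal components with the standard library only. It is not the implementation used in the case study: the function name `pca_scores`, the power-iteration approach, and the four-patient toy matrix are all assumptions made for this example.

```python
import math


def pca_scores(matrix, n_components=2, n_iter=200):
    """Project rows (patients) onto the top principal components.

    Runs power iteration with deflation on the patient-by-patient Gram
    matrix of the gene-centred data, so no external library is required.
    """
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in matrix]
    gram = [[sum(a * b for a, b in zip(centred[i], centred[k])) for k in range(n)]
            for i in range(n)]

    scores = [[0.0] * n_components for _ in range(n)]
    for c in range(n_components):
        v = [math.sin(i + 1 + c) for i in range(n)]  # deterministic start vector
        eigval = 0.0
        for _ in range(n_iter):
            w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to explain
                eigval = 0.0
                break
            v = [x / norm for x in w]
            eigval = norm
        for i in range(n):
            scores[i][c] = v[i] * math.sqrt(eigval)
        # Deflate: remove the component just found before searching for the next.
        gram = [[gram[i][k] - eigval * v[i] * v[k] for k in range(n)]
                for i in range(n)]
    return scores


# Toy matrix (hypothetical): 4 patients x 3 genes, forming two obvious groups.
expression = [
    [1.0, 2.0, 0.5],
    [1.1, 2.1, 0.4],
    [5.0, 6.0, 0.6],
    [5.2, 5.9, 0.5],
]
scores = pca_scores(expression)
for pc1, pc2 in scores:
    print(f"PC1={pc1:+.3f}  PC2={pc2:+.3f}")
```

In this toy case the first component separates the two groups of patients. For a cohort of several hundred patients and thousands of genes, a numerical library would be the practical choice; the pure-Python version is only meant to show the calculation.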


Analysis of Multiple Samples from the Same Patient

In a further analysis, we examined data points where multiple samples were taken from the same colorectal cancer patient. The PCA plot in Figure 2 highlights one such patient, indicated by the red points. This visualization allows us to assess the consistency and variation within samples from the same individual.

The PCA plot reveals two distinct clusters, suggesting the presence of batch effects. Batch effects are systematic variations in the data due to differences in sample processing or experimental conditions, rather than biological differences. These effects can obscure true biological signals and lead to misleading interpretations if not properly accounted for. Therefore, to ensure data integrity and eliminate potential biases, samples with duplicates were removed.


Figure 2. Principal Component Analysis (PCA) of Gene Expression Data with Multiple Samples from the Same Colorectal Cancer Patient Highlighted. Red points indicate samples from the same patient. The two distinct clusters observed are likely due to batch effects, emphasizing the need to account for them in the analysis.
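The duplicate-handling step can be sketched as below. This is one possible reading of the text, not a confirmed description of the pipeline: it assumes that every sample from a patient with more than one sample is dropped (the case study does not say whether one representative sample was kept instead), and it relies on the TCGA barcode convention that the first 12 characters identify the patient. The function name and the barcodes are hypothetical.

```python
from collections import defaultdict


def drop_multi_sample_patients(sample_barcodes):
    """Keep only samples whose patient contributes exactly one sample."""
    by_patient = defaultdict(list)
    for barcode in sample_barcodes:
        patient_id = barcode[:12]  # e.g. "TCGA-XX-0001" (project-site-participant)
        by_patient[patient_id].append(barcode)
    return [
        barcode
        for samples in by_patient.values()
        if len(samples) == 1
        for barcode in samples
    ]


# Made-up barcodes: patient 0002 has two samples and is excluded entirely.
barcodes = [
    "TCGA-XX-0001-01A",
    "TCGA-XX-0002-01A",
    "TCGA-XX-0002-01B",
    "TCGA-XX-0003-01A",
]
kept = drop_multi_sample_patients(barcodes)
print(kept)  # ['TCGA-XX-0001-01A', 'TCGA-XX-0003-01A']
```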


Kaplan-Meier Survival Analysis

The Kaplan-Meier survival curve in Figure 3a shows the overall survival of colorectal cancer patients stratified by high and low expression of the gene IFNB1. The red line represents patients with high IFNB1 expression, and the blue line represents those with low IFNB1 expression. Survival differs significantly between the two groups, with high IFNB1 expression associated with poorer outcomes (P = 7.77e-04). A larger fraction of patients with low IFNB1 expression remain alive over time, indicating that IFNB1 may be a prognostic biomarker for colorectal cancer.

Figure 3b presents a box plot of IFNB1 expression levels among colorectal cancer patients, divided into high and low expression groups. The high expression group shows a wider range of expression levels, with several outliers, while the low expression group exhibits minimal variation and consistently low expression levels. This clear separation corroborates the stratification used in the survival analysis and underscores the potential relevance of IFNB1 expression in prognostic assessments.


Figure 3. (a) Kaplan-Meier Survival Curve for IFNB1 Expression in Colorectal Cancer Patients. High IFNB1 expression is associated with worse overall survival compared to low IFNB1 expression. (b) Box Plot of IFNB1 Expression Levels in Colorectal Cancer Patients. The high expression group (left) shows wide variation, whereas the low expression group (right) has uniformly low levels, indicating distinct expression profiles between the two groups.
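The group comparison behind Figure 3a can be sketched as a two-group log-rank test. This is a hedged illustration only: the case study does not state how the high/low cutoff was chosen, so the median split here is an assumption, and the function name `logrank_test` and all data values are invented. The p-value uses the fact that a chi-square statistic with one degree of freedom has upper tail `erfc(sqrt(x / 2))`.

```python
import math
from statistics import median


def logrank_test(times, events, in_high_group):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)  # patients still at risk
        n1 = sum(1 for ti, g in zip(times, in_high_group) if ti >= t and g)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, in_high_group)
                 if ti == t and e and g)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


# Toy cohort (hypothetical values, not IFNB1 measurements from TCGA).
expression = [9.1, 8.7, 7.9, 8.3, 1.2, 0.8, 1.5, 0.9]
times = [2, 3, 4, 5, 10, 12, 14, 15]
events = [1, 1, 1, 1, 0, 1, 0, 0]

cutoff = median(expression)  # assumed cutoff; the source does not specify one
in_high_group = [value > cutoff for value in expression]
chi2, p_value = logrank_test(times, events, in_high_group)
print(f"chi2={chi2:.2f}  p={p_value:.4f}")
```

A small p-value indicates that the observed number of deaths in the high-expression group differs from what would be expected if both groups shared the same survival curve.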



Almaden Genomics’ g.nome is a comprehensive data analysis solution for life sciences research. This case study showcases its application in transcriptome analysis for colorectal cancer. The platform enables the construction of complex computational pipelines through modular connections, allowing quick results and visualizations. It supports tertiary analyses such as pathway enrichment and provides robust version control for traceability and reproducibility. Because the platform is cloud-based, it facilitates the transition from discovery to production without additional infrastructure or extensive coding.