In genomics research, accurately identifying rare genetic variants through genotype imputation has long been a stumbling block in understanding the genetic basis of diseases and improving the precision of polygenic risk scores. Traditional methods have often fallen short, especially on large and diverse datasets.
In the study, "Empowering GWAS Discovery through Enhanced Genotype Imputation," the team at SelfDecode introduces a novel genotype imputation method named Selphi, which demonstrates superior accuracy compared to existing methods, especially for rare variants. Using comprehensive datasets from the 1000 Genomes Project, TOPMed, and UK Biobank, the research shows significant improvements in the detection of genetic associations within Genome-Wide Association Studies (GWAS). This enhanced accuracy matters both for understanding the genetic basis of diseases and for improving the prediction accuracy of polygenic risk scores, which could lead to better disease prediction, personalized medicine, and targeted healthcare interventions.
The methodology behind Selphi focuses on improving genotype imputation accuracy, particularly for rare genetic variants. It uses a modified Li and Stephens Hidden Markov Model, enhanced with the Positional Burrows-Wheeler Transform (PBWT) for better identification of haplotype matches, together with a heuristic approach to filter out less likely matches. To handle large-scale genomic data, Selphi relies on vectorized matrix operations and parallel processing. This enables the rapid imputation of vast numbers of genotypes while maintaining high levels of accuracy, which is particularly useful for datasets from sources like the 1000 Genomes Project or UK Biobank. By utilizing these computational strategies, Selphi can process extensive genomic information more quickly and effectively than traditional methods.
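To make the PBWT idea more concrete, here is a minimal sketch of the standard positional prefix and divergence arrays, which let a tool find long shared haplotype segments without comparing every pair of haplotypes. This is the textbook construction, not Selphi's code; the function name `pbwt_arrays` and the toy input are assumptions made for illustration.

```python
def pbwt_arrays(haps):
    """Build the final positional prefix and divergence arrays.

    haps : list of equal-length 0/1 sequences, one per haplotype
           (hypothetical toy input, not a real reference panel).
    Returns (ppa, div):
      ppa[i] is the index of the haplotype at sorted position i, where
             haplotypes are ordered by their reversed prefixes;
      div[i] is the start site of the longest match, ending at the last
             site, between ppa[i] and ppa[i - 1].
    """
    n = len(haps)
    m = len(haps[0]) if n else 0
    ppa = list(range(n))
    div = [0] * n
    for k in range(m):
        zeros, ones = [], []          # haplotypes carrying allele 0 / 1 at site k
        div_zeros, div_ones = [], []  # their divergence values
        p = q = k + 1                 # k + 1 means "no match ending at site k"
        for idx, start in zip(ppa, div):
            # Track the latest match start seen since the last 0 (p) or 1 (q).
            if start > p:
                p = start
            if start > q:
                q = start
            if haps[idx][k] == 0:
                zeros.append(idx)
                div_zeros.append(p)
                p = 0
            else:
                ones.append(idx)
                div_ones.append(q)
                q = 0
        # Stable partition: allele-0 haplotypes first, then allele-1.
        ppa = zeros + ones
        div = div_zeros + div_ones
    return ppa, div


toy_panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
]
ppa, div = pbwt_arrays(toy_panel)
match_lengths = [len(toy_panel[0]) - d for d in div]
```

In this toy panel, haplotypes 0 and 2 are identical, so they end up adjacent in `ppa` with a divergence value of 0, meaning the match spans every site. A matching step like the one described above would keep such long neighbors and discard short ones.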
A significant factor in the rapid development and testing of Selphi was the use of Almaden Genomics' g.nome platform, which facilitated the testing of over 30 different parameter combinations, a step crucial for optimizing Selphi's performance. The platform significantly reduced the iteration timeline, allowing the team to perform extensive testing in a fraction of the usual time. Additionally, g.nome's plug-and-play functionality and visually intuitive interface made it accessible to team members without extensive coding expertise, broadening participation within the team. Typically, setting up and iterating over bioinformatics pipelines can take weeks or even months; with g.nome, however, the team experienced a 70% reduction in runtime and a 25-fold reduction in runtime costs. This efficiency enabled the rapid deployment and optimization of Selphi, significantly accelerating the research process and allowing for the timely generation of high-quality data.
Selphi's workflow and benchmarking. (a) The first step in the workflow merges the reference panel and target data into a unified PBWT data structure (1). The algorithm then scans the reference panel searching for matches to reference haplotypes of a minimum length (2). At each marker, the algorithm retains the longest matches, prioritizing haplotypes with more total matches across the chromosome (3-4). Dynamic haplotype selection follows, where the matches are mapped and filtered to adjust the number of retained matches at each marker, based on the distribution of match lengths (5). An HMM forward-backward algorithm is employed (6). Transitions between variant states are utilized to compute weights for each haplotype at each marker. The weights aid in determining the significance of each haplotype within the population (7). The final step interpolates allele probabilities with the haplotypes from the reference panel (8). (b) Relative enrichment (red background) and depletion (green background) of error counts with respect to average for Beagle5.4 (blue), IMPUTE5 (magenta), Minimac4 (yellow) and Selphi (green) across chromosomes 1-22 of the 1000 Genomes Project (1KG) and for chromosome 20 of the TOPMed dataset. (c) Relative enrichment and depletion of error counts with respect to average and error count per sample for chromosomes 1-22 of the UK Biobank dataset. Map shows improvement in imputation accuracy across UK counties against Beagle5.4. Image, caption source: Marino, Adriano & Mahmoud, Abdallah & Bohn, Sandra & Lerga-Jaso, Jon & Novkovic, Biljana & Manson, Charlie & Loguercio, Salvatore & Terpolovsky, Andrew & Matushyn, Mykyta & Torkamani, Ali & Yazdi, Puya. (2023). Empowering GWAS Discovery through Enhanced Genotype Imputation. 10.1101/2023.12.18.23300143. Creative Commons license
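Steps 6 and 7 of the workflow above rest on an HMM forward-backward pass that assigns a weight to each reference haplotype at each marker. The sketch below shows the classic Li and Stephens version of that computation, assuming a constant switch probability `rho` and mismatch probability `err`; both values, and the function name, are hypothetical. Selphi's modified model and its transition-based weights are not reproduced here.

```python
import numpy as np


def li_stephens_posteriors(ref, target, rho=0.01, err=0.001):
    """Posterior copying probabilities for one target haplotype.

    ref    : (K, M) array of 0/1 alleles, K reference haplotypes, M typed markers
    target : (M,) array of 0/1 alleles observed in the target haplotype
    rho    : assumed constant probability of switching haplotype between markers
    err    : assumed probability that the copied allele is observed incorrectly
    """
    ref = np.asarray(ref)
    target = np.asarray(target)
    K, M = ref.shape

    # Emission probability: 1 - err on a match, err on a mismatch.
    emit = np.where(ref == target[None, :], 1.0 - err, err)

    # Forward pass, normalized at each marker for numerical stability.
    fwd = np.empty((K, M))
    fwd[:, 0] = emit[:, 0] / K
    fwd[:, 0] /= fwd[:, 0].sum()
    for m in range(1, M):
        # Stay on the same haplotype, or switch uniformly to any haplotype.
        prior = (1.0 - rho) * fwd[:, m - 1] + rho / K
        fwd[:, m] = emit[:, m] * prior
        fwd[:, m] /= fwd[:, m].sum()

    # Backward pass with the same transition model.
    bwd = np.empty((K, M))
    bwd[:, -1] = 1.0
    for m in range(M - 2, -1, -1):
        msg = emit[:, m + 1] * bwd[:, m + 1]
        bwd[:, m] = (1.0 - rho) * msg + rho * msg.sum() / K
        bwd[:, m] /= bwd[:, m].sum()

    # Per-marker weights: each column sums to 1.
    post = fwd * bwd
    post /= post.sum(axis=0, keepdims=True)
    return post


ref_panel = np.array([
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
])
target_hap = np.array([1, 1, 0, 0])
weights = li_stephens_posteriors(ref_panel, target_hap)
```

Each column of `weights` gives the probability that the target is copying a given reference haplotype at that marker. An allele probability at an untyped marker can then be formed as a weighted sum of the reference alleles there, which is the general idea behind the interpolation in step 8.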
The benchmarking of Selphi's performance was conducted by comparing it with other leading imputation tools: Beagle5.4, IMPUTE5, and Minimac4. This comparison utilized chromosomes 1-22 from the 1000 Genomes Project, which is often used as a standard dataset for testing imputation accuracy due to its diverse representation of global populations.
Selphi demonstrated superior performance, registering the lowest number of errors across all minor allele frequency (MAF) intervals and ancestral backgrounds. Results were particularly strong for rare (MAF 0.05-2%) and ultra-rare (MAF 0.05-0.1%) variants, with average improvements of 13% and 21%, respectively. Notably, Selphi's improvement was especially pronounced in East Asian and African super-populations, indicating its enhanced utility in diverse genomic backgrounds.
GWAS and PRS power analysis. (a) Number of GWAS hits in which Selphi or Beagle obtained higher significance, plotted by ratio bin. Variants that surpassed GWAS suggestive threshold (P < 10⁻⁵) were analyzed. A ratio below 1.05 was considered as an equivalent result for both Beagle and Selphi. (b) Squared correlation (r²) for betas and P values obtained from imputed sets and compared to hc-WGS across 50 UK Biobank phenotypes by MAF. Nominally significant (P < 0.05) trait-associated hits collected by the GWAS Catalog were retrieved. Lower and upper limits of the forest plot represent the confidence interval from bootstrap resampling. (c) GWAS examples of imputed sets along with hc-WGS results. Red diamond indicates known GWAS signals. (d) PRS drop in accuracy when comparing imputed sets with hc-WGS, assessed through relative risk and area under the curve (AUC). Image, caption source: Marino, Adriano & Mahmoud, Abdallah & Bohn, Sandra & Lerga-Jaso, Jon & Novkovic, Biljana & Manson, Charlie & Loguercio, Salvatore & Terpolovsky, Andrew & Matushyn, Mykyta & Torkamani, Ali & Yazdi, Puya. (2023). Empowering GWAS Discovery through Enhanced Genotype Imputation. 10.1101/2023.12.18.23300143. Creative Commons license
The benchmarking also evaluated each method's accuracy using squared correlation (r²), concordance (P0), and F-score metrics. Across all these metrics and ancestries, Selphi remained the most accurate method, indicating the robustness of its genotype imputation.
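For readers unfamiliar with these metrics, the sketch below computes one common form of each for a single variant, comparing imputed dosages with true genotypes coded as 0, 1, or 2 alternate alleles. The exact definitions used in the study may differ. In particular, treating the F-score as precision and recall over alternate-allele carriers is an assumption, and the function name and sample data are hypothetical.

```python
import numpy as np


def imputation_metrics(true_gt, dosage):
    """Return (r2, concordance, f_score) for one variant.

    true_gt : true genotypes, coded 0/1/2 alternate-allele counts
    dosage  : imputed expected alternate-allele counts in [0, 2]
    """
    true_gt = np.asarray(true_gt, dtype=float)
    dosage = np.asarray(dosage, dtype=float)

    # Squared Pearson correlation between truth and imputed dosage.
    # Undefined (nan) if either vector is constant.
    r2 = np.corrcoef(true_gt, dosage)[0, 1] ** 2

    # Best-guess genotype: dosage rounded to the nearest integer.
    calls = np.rint(dosage)
    concordance = float(np.mean(calls == true_gt))

    # F-score over "carries at least one alternate allele" (assumed definition).
    tp = np.sum((calls > 0) & (true_gt > 0))
    fp = np.sum((calls > 0) & (true_gt == 0))
    fn = np.sum((calls == 0) & (true_gt > 0))
    denom = 2 * tp + fp + fn
    f_score = float(2 * tp / denom) if denom else float("nan")

    return float(r2), concordance, f_score


truth = [0, 0, 1, 2, 0, 1]
imputed = [0.1, 0.0, 0.9, 1.8, 0.6, 0.2]
r2, concordance, f_score = imputation_metrics(truth, imputed)
```

Concordance can look high for rare variants simply because most samples are homozygous reference, which is one reason benchmarks tend to report r² and F-score alongside it.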
Selphi by SelfDecode addresses a critical need in genomics research for better genotype imputation, particularly for rare genetic variants. Its methodological innovations and proven accuracy across diverse datasets enhance Genome-Wide Association Studies (GWAS), offering new insights into the genetic basis of diseases. The application of Selphi has the potential to improve disease prediction and the development of personalized medicine. As genomics research continues to evolve, tools like Selphi are essential for making significant progress in understanding complex genetic factors and advancing healthcare tailored to individual genetic profiles.