Pan-genome Sequence Alignment Algorithm Research
Abstract
With the rapid development of sequence alignment technologies, their role in responding to public health emergencies and epidemic prevention has become increasingly prominent. Compared to traditional linear reference sequences, the pan-genome effectively reduces reference bias and misalignment. However, when facing long reference sequences with abundant repetitive regions, candidate positions may proliferate, thereby compromising alignment efficiency and accuracy. This paper proposes a novel alignment algorithm that integrates elastic degenerate strings with the longest common substring strategy. Single nucleotide polymorphisms (SNPs) are linearly encoded using elastic degenerate symbols, while minimizer indexing is combined with a longest common substring interval–based seed filtering strategy to effectively reduce candidate positions. Experimental validation on simulated datasets demonstrates that the proposed method significantly improves both recall and precision rates, while also achieving high accuracy in fragment localization. Further experiments on real datasets reveal that this method outperforms existing mainstream alignment tools in terms of alignment sensitivity, indicating that the longest common substring–based filtering strategy is well suited for complex genomic regions. Overall, this approach provides an alternative technical pathway for enhancing the accuracy of pan-genome sequence alignment and offers a feasible algorithmic framework for subsequent research on pan-genome alignment and viral mutation analysis.
Introduction
At the end of 2019, the outbreak of pneumonia caused by the novel coronavirus (SARS-CoV-2) rapidly spread worldwide, leading to severe economic losses and significant healthcare burdens[1][2]. Existing studies have identified single-nucleotide polymorphism (SNP) sites and related genes potentially associated with pneumonia symptoms, and some mutation sites have even been shown to influence vaccine efficacy, thereby posing a potential risk of further enhancing the transmissibility of COVID-19[3][4]. When dealing with highly variable viruses such as SARS-CoV-2, researchers must rapidly collect a large number of samples from infected individuals. Using high-throughput sequencing and sequence alignment technologies, they are able to analyze the pathogenic mechanisms, transmission routes, and evolutionary relationships of the virus in depth. These studies provide crucial scientific evidence for vaccine development and epidemic control. However, performing pairwise sequence alignment between short reads and long reference genomes presents substantial computational challenges, requiring not only algorithmic efficiency but also accuracy in short-read alignment. Therefore, improving the efficiency and accuracy of existing sequence alignment algorithms is of significant research value and practical importance for advancing viral genomics research and epidemic surveillance.
With the advancement of sequencing technologies, and given the length and complexity of genomic sequences, an increasing number of alignment algorithms construct auxiliary index structures for reference genomes to accelerate the alignment process. Based on the type of index used, alignment algorithms can generally be divided into two categories: suffix-based indexing and hash-based indexing. Suffix-based methods typically employ suffix arrays[5], the Burrows-Wheeler Transform (BWT)[6], and the FM-index[7] for alignment, and are suitable for large-scale genomic data such as the human reference genome. Although such methods do not require additional positioning operations, they consume substantial memory resources.
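To make the two index families concrete, the following is a minimal Python sketch, not the implementation of any cited tool. It builds a naive suffix array and a k-mer hash table over a toy reference and looks up exact seed positions in each. The function names (`build_suffix_array`, `find_occurrences`, `build_kmer_index`) and the toy sequence are illustrative assumptions, and the quadratic suffix array construction is chosen for brevity rather than for genome-scale use.

```python
from collections import defaultdict


def build_suffix_array(text: str) -> list[int]:
    """Return suffix start positions sorted by suffix (naive, for illustration)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Binary-search the suffix array for all exact occurrences of pattern."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def build_kmer_index(text: str, k: int) -> dict[str, list[int]]:
    """Hash-based alternative: map every k-mer to its start positions."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


reference = "ACGTACGT"  # toy reference, assumed for illustration
sa = build_suffix_array(reference)
suffix_hits = find_occurrences(reference, sa, "ACG")
hash_hits = build_kmer_index(reference, 3)["ACG"]
```

In this sketch the suffix array supports patterns of any length with one structure, whereas the hash table fixes the seed length `k` at construction time.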
Conclusion
The proposed algorithm achieves a linear representation of SNP variants by introducing elastic-degenerate symbols, and combines minimizer indexing with a seed filtering strategy based on longest common substring intervals. This approach allows large-scale read alignment tasks to balance sensitivity and accuracy.
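The two building blocks named above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the paper's implementation: each reference position is modeled as a set of allowed bases (so a SNP site simply holds more than one base), and minimizers are taken as the lexicographically smallest k-mer in each window of `w` consecutive k-mers on a plain sequence. The names `encode_snps`, `matches_at`, and `minimizers`, and all example sequences, are hypothetical.

```python
def encode_snps(reference: str, snps: dict[int, set[str]]) -> list[frozenset[str]]:
    """Encode a reference plus SNPs as one degenerate symbol per position.

    snps maps a 0-based position to the alternative bases seen there.
    """
    return [frozenset({base} | snps.get(i, set()))
            for i, base in enumerate(reference)]


def matches_at(ed: list[frozenset[str]], read: str, pos: int) -> bool:
    """True if read matches the degenerate sequence starting at pos."""
    if pos < 0 or pos + len(read) > len(ed):
        return False
    return all(base in ed[pos + j] for j, base in enumerate(read))


def minimizers(seq: str, k: int, w: int) -> list[tuple[str, int]]:
    """Return the (k-mer, position) minimizers of seq, ordered by position.

    For every window of w consecutive k-mers, the lexicographically smallest
    k-mer is selected (leftmost on ties); duplicates are collapsed.
    """
    picked = set()
    n_kmers = len(seq) - k + 1
    for start in range(max(0, n_kmers - w + 1)):
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        picked.add(min(window))
    return sorted(picked, key=lambda kp: kp[1])


# Toy example: position 2 of the reference carries a G/T SNP.
ed = encode_snps("ACGTAC", {2: {"T"}})
ref_allele_ok = matches_at(ed, "CGT", 1)
alt_allele_ok = matches_at(ed, "CTT", 1)
unknown_allele_ok = matches_at(ed, "CAT", 1)
seeds = minimizers("CCCACCC", k=2, w=3)
```

Here a read carrying either allele matches the same linear coordinate, and the six k-mers of the second toy sequence are reduced to two seeds, which is the sense in which minimizers shrink the index before any filtering is applied.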
Testing on simulated datasets demonstrated that the proposed method achieves high recall and precision across reads of varying lengths, with significantly lower positional errors compared to other alignment tools. This indicates that the method can more accurately map short reads back to the reference genome, thereby improving alignment accuracy. Further evaluation on real SARS-CoV-2 sequencing datasets showed that the proposed method outperforms other tools in terms of sensitivity on most datasets, with its advantage becoming particularly pronounced in long-read alignment tasks. These results validate the effectiveness of the proposed method for viral short sequence alignment and suggest that it provides a viable computational approach for large-scale genome alignment, as well as for pan-genome-based alignment and viral variant analysis.
However, this work primarily focuses on aligning sequencing reads to reference genomes. Extracting biological interpretations from the differences that alignment reveals between reads and the reference requires close integration of biological and bioinformatics analyses to fully leverage the data. Additionally, while the dynamic programming-based seed filtering strategy enables optimal semi-global alignment, computational efficiency remains an area for improvement. Future work will focus on optimizing the algorithm’s structure and implementation to enhance its generalizability and runtime performance, thereby broadening its applicability in complex genomic environments.
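As a rough illustration of why the dynamic programming step dominates runtime, the sketch below computes a semi-global alignment score in Python. It assumes unit costs for mismatches and gaps and a plain (non-degenerate) reference window, neither of which is specified in this section; the function name `semiglobal_edit_distance` is hypothetical. The read must be aligned end to end, while leading and trailing reference bases are free.

```python
def semiglobal_edit_distance(read: str, ref: str) -> tuple[int, int]:
    """Best edit distance between read and any substring of ref.

    Returns (distance, end), where end is the exclusive end position in ref
    of the leftmost best-scoring alignment.
    """
    # Row 0 is all zeros: the alignment may start anywhere in ref for free.
    prev = [0] * (len(ref) + 1)
    for i in range(1, len(read) + 1):
        # Column 0: aligning i read bases against nothing costs i.
        curr = [i] + [0] * len(ref)
        for j in range(1, len(ref) + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match or mismatch
                          prev[j] + 1,         # read base unaligned
                          curr[j - 1] + 1)     # ref base skipped
        prev = curr
    # The alignment may also end anywhere in ref: take the best final cell.
    best_end = min(range(len(ref) + 1), key=lambda j: prev[j])
    return prev[best_end], best_end


exact = semiglobal_edit_distance("GATT", "ACGATTACA")
one_error = semiglobal_edit_distance("GACT", "ACGATTACA")
```

The table has one cell per (read base, reference base) pair, so the cost grows with the product of read length and candidate window length, which is why reducing the number of candidate windows before this step matters for overall efficiency.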