Conference Paper

Constructing suffix array during decompression

Mahmoud M.
Abouelhoda M.I.
Kandil A.
Elbialy A.

The suffix array is an indexing data structure used in a wide range of applications in Bioinformatics. Biological DNA sequences are available to download from public servers in the form of compressed files, where the popular lossless compression program gzip [1] is employed. The straightforward method to construct the suffix array for this data involves decompressing the sequence file, storing it on disk, and then calling a suffix array construction program to build the suffix array. This scenario, albeit feasible, requires disk access and throws away valuable information in the compressed file. In this paper, we present an algorithm that constructs the suffix array during the decompression requiring no disk access and making use of the decompression information to construct the suffix array. © 2008 IEEE.