TR18-141 Authors: Sandip Sinha, Omri Weinstein

Publication: 13th August 2018 01:54

The Burrows-Wheeler Transform (BWT) is among the most influential discoveries in text compression and DNA storage. It is a \emph{reversible} preprocessing step that rearranges an $n$-letter string into runs of identical characters (by exploiting context regularities), resulting in highly compressible strings, and is the basis for the ubiquitous \texttt{bzip} program. Alas, the decoding process of BWT is inherently sequential and requires $\Omega(n)$ time even to retrieve a \emph{single} character.
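To make the transform concrete, here is a minimal Python sketch of the textbook BWT and its inverse. It is not the construction studied in this work: the function names `bwt` and `inverse_bwt` are illustrative, the forward transform sorts all rotations naively, and the sentinel `$` is assumed to be absent from the text and smaller than every text character. The inverse follows the last-to-first mapping one character at a time, which illustrates why plain decoding is sequential.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text+sentinel, keep the last column.

    Assumption: `sentinel` does not occur in `text` and compares smaller
    than every character of `text`.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last):
    """Invert the BWT by walking the last-to-first (LF) mapping.

    Each step recovers one character and depends on the previous step,
    so recovering even a single position takes a chain of up to n steps.
    """
    n = len(last)
    # A stable sort of positions by character yields the first column;
    # order[j] is the position in `last` of the j-th first-column character.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for j, i in enumerate(order):
        lf[i] = j
    # Row 0 is the rotation starting with the sentinel, so its last
    # character is the final character of the original text.
    out = []
    row = 0
    for _ in range(n - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

On the input `banana` the transform groups equal characters together (`annb$aa`), which is what the later compression stages exploit.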

We study the succinct data structure problem of locally decoding short substrings of a given text under its \emph{compressed} BWT, i.e., with small redundancy $r$ over the \emph{Move-to-Front} based (\texttt{bzip}) compression. The celebrated BWT-based FM-index (FOCS '00), and other related literature, gravitate toward a tradeoff of $r=\tilde{O}(n/\sqrt{t})$ bits, when a single character is to be decoded in $O(t)$ time. We give a near-quadratic improvement, $r=\tilde{O}(n\cdot \lg t/t)$. As a by-product, we obtain an \emph{exponential} (in $t$) improvement on the redundancy of the FM-index for counting pattern-matches on compressed text. In the interesting regime where the text compresses to $n^{1-o(1)}$ bits, these results provide an $\exp(t)$ \emph{overall} space reduction. For the local decoding problem, we also prove an $\Omega(n/t^2)$ cell-probe lower bound for ``symmetric'' data structures.
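For readers unfamiliar with how counting pattern matches reduces to Rank queries over the BWT, the following is a minimal sketch of the standard backward-search idea behind the FM-index. It stores uncompressed prefix-count tables, so it shows only the query logic and none of the space bounds discussed above. The names `build_tables` and `count_matches` are illustrative, and the input string `annb$aa` is the BWT of `banana` with an assumed sentinel `$`.

```python
def build_tables(last):
    """Build the two tables used by backward search.

    C[c]      = number of characters in `last` strictly smaller than c
    occ[c][i] = number of occurrences of c in last[:i]  (a Rank query)
    Both are stored explicitly here, with no compression.
    """
    alphabet = sorted(set(last))
    C = {}
    total = 0
    for c in alphabet:
        C[c] = total
        total += last.count(c)
    occ = {c: [0] * (len(last) + 1) for c in alphabet}
    for i, ch in enumerate(last):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_matches(pattern, C, occ, n):
    """Count occurrences of `pattern` using two Rank queries per character."""
    lo, hi = 0, n  # half-open range of sorted rotations still matching
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


last = "annb$aa"  # BWT of "banana" with sentinel "$" (assumed input)
C, occ = build_tables(last)
ana_count = count_matches("ana", C, occ, len(last))
```

Each pattern character costs two Rank lookups, so the cost of a counting query is governed by how quickly Rank can be answered on the compressed representation.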

We achieve our main result by designing a compressed Rank (partial-sums) data structure over BWT. The key component is a locally-decodable Move-to-Front (MTF) code: with only $O(1)$ extra bits per block of length $n^{\Omega(1)}$, the decoding time of a single character can be decreased from $\Omega(n)$ to $O(\lg n)$. This result is of independent interest in algorithmic information theory.
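As background for the Move-to-Front step, here is a minimal sketch of the classical MTF encoder and decoder. It is not the locally-decodable code described above; the function names and the explicit `alphabet` argument are illustrative assumptions. Because the decoder must replay the list updates from the start of the stream, recovering one symbol depends on the entire prefix, which is the sequential behaviour the locally-decodable code is designed to avoid.

```python
def mtf_encode(text, alphabet):
    """Classical Move-to-Front: emit each symbol's current list index,
    then move that symbol to the front of the list."""
    table = list(alphabet)
    codes = []
    for ch in text:
        i = table.index(ch)
        codes.append(i)
        table.insert(0, table.pop(i))
    return codes


def mtf_decode(codes, alphabet):
    """Invert MTF by replaying the same list updates from the beginning.

    Decoding the symbol at position k requires processing codes[0..k],
    since the list state depends on every earlier symbol.
    """
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


codes = mtf_encode("aaabbb", "ab")
roundtrip = mtf_decode(mtf_encode("annb$aa", "$abn"), "$abn")
```

Runs of identical characters, such as those produced by the BWT, turn into runs of zeros under MTF, which a subsequent entropy coder can compress well.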