50 years Univerity of Lübeck

Institute for Theoretical Computer Science


On the Complexity of Optimal Grammar-Based Compression

Jan Arpe and Rüdiger Reischuk

The task of grammar-based compression is to find a small context-free grammar generating exactly one given string.
We investigate the relationship between grammar-based compression of strings over unbounded and bounded alphabets.
Specifically, we show how to transform a grammar for a string over an unbounded alphabet into a grammar for a block
coding of that string over a fixed bounded alphabet and vice versa. From these constructions, we obtain asymptotically
tight relationships between the minimum grammar sizes for strings and their block codings. Finally, we exploit an
improved bound of our construction for overlap-free block codings to show that a polynomial time algorithm for
approximating the minimum grammar for binary strings within a factor of c yields a polynomial time algorithm for
approximating the minimum grammar for strings over arbitrary alphabets within a factor of 24c+ε (for arbitrary ε>0).
Since the latter problem is known to be NP-hard to approximate within a factor of 8569/8568, we provide a first step
towards solving the long standing open question whether minimum grammar-based compression of binary strings is
NP-complete.