TLDR: DEFLATE decompressor in 3K of RAM
For a Pebble app I've been writing, I need to send images from the phone to
the watch and cache them in persistent storage on the watch. Since the
persistent storage is very limited (and the Bluetooth connection is relatively
slow) I need these to be as small as possible, and so my original plan was to
use the PNG format and
gbitmap_create_from_png_data(). However, I
discovered that this function is not supported on the earlier firmware used by
the Pebble Classic. Since PNGs are essentially DEFLATE compressed bitmaps, my
next approach was to manually compress the bitmap data. This meant that I
needed a decompressor implementation ("inflater") on the watch.
The major constraint for Pebble watchapps is memory. On the Pebble Classic, apps have 24K of RAM available for the compiled code (.text), global and static variables (.bss) and the heap (malloc()). There is an additional 2K for the stack (local variables). The decompressor implementation needed to have both small code size and low memory usage. I discovered tinf, which seemed to fit the bill, and tried to get it working.
Initially, trying to decompress something simply crashed the app. It took some debug prints to determine that the code in tinf_uncompress() wasn't even being executed, and I realised that it was exceeding the 2K stack limit. I changed the TINF_DATA struct to be allocated on the heap to get past this. At this stage it was using 1.2K of .text, 1.4K of .bss, 1K of stack, and 1.2K of heap (total 4.8K). I set about optimising the implementation for memory usage.
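The fix for the stack overflow can be sketched as follows. The inflate_state struct here is a stand-in for tinf's TINF_DATA, and its fields are illustrative rather than the library's actual layout:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for tinf's TINF_DATA: around a kilobyte of decode state,
 * far too large for the 2K stack shared with everything else. */
typedef struct {
    unsigned short counts[16];
    unsigned short symbols[288];
    unsigned char lengths[288 + 32];
} inflate_state;

int inflate_on_heap(void)
{
    /* Allocate the state on the heap instead of as a local variable. */
    inflate_state *d = malloc(sizeof(*d));
    if (d == NULL)
        return -1;              /* out of memory */
    memset(d, 0, sizeof(*d));
    /* ... run the decompression loop against *d ... */
    free(d);
    return 0;
}
```

The heap allocation trades a malloc()/free() pair per decompression for staying under the stack limit.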
Huffman coding is a method to represent frequently used symbols with fewer
bits. It uses a tree (otherwise referred to as a dictionary) to convert symbols
to bits and vice versa. DEFLATE can use Huffman coding in two modes: dynamic
and fixed. In dynamic mode, the compressor constructs an optimal tree based on
the data being compressed. This results in the smallest representation of the
actual input data; however, it has to include the computed tree in the output
in order for a decompressor to know how to decode the data. In some cases the
space used to serialise the tree negates the improvement in the input
representation. In this case the compressor can use fixed mode, where it uses
a static tree defined by the DEFLATE spec. Since the decompressor knows what
this static tree is, it doesn't need to be serialised in the output.
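The fixed literal/length code is fully determined by the spec (RFC 1951, section 3.2.6), so a decompressor can regenerate its code lengths from a few lines rather than storing them:

```c
/* Fill in the code lengths for DEFLATE's fixed literal/length
 * Huffman tree, as defined in RFC 1951, section 3.2.6. */
void build_fixed_lengths(unsigned char lengths[288])
{
    int i;
    for (i = 0; i < 144; ++i) lengths[i] = 8;   /* literals 0-143    */
    for (; i < 256; ++i)      lengths[i] = 9;   /* literals 144-255  */
    for (; i < 280; ++i)      lengths[i] = 7;   /* end-of-block, lengths */
    for (; i < 288; ++i)      lengths[i] = 8;   /* codes 280-287     */
}
```

(The fixed distance tree is even simpler: all 32 codes are 5 bits.)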
The original tinf implementation builds this fixed tree during initialisation and caches it in global variables. Whenever it encounters a block using the fixed tree, it has the tree immediately available. This makes sense when you have memory to spare, but in this case we can make another tradeoff: store the fixed tree in the same space used for the dynamic tree, and rebuild it every time it is needed. This saves 1.2K of .bss at the expense of some additional CPU usage.
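The tradeoff can be sketched like this; the struct layout and function names are illustrative, not tinf's actual API:

```c
#include <string.h>

/* Illustrative tree storage: per-length counts plus symbol table. */
struct tree {
    unsigned short counts[16];
    unsigned short symbols[288];
};

struct inflate_state {
    struct tree ltree;  /* literal/length tree, shared by both modes */
    struct tree dtree;  /* distance tree */
};

/* Rebuild the spec-defined fixed trees into the same storage the
 * dynamic trees use, instead of caching them in 1.2K of .bss.
 * (Real code would fill these in from the RFC 1951 fixed lengths.) */
static void build_fixed_trees(struct tree *lt, struct tree *dt)
{
    memset(lt, 0, sizeof(*lt));
    memset(dt, 0, sizeof(*dt));
    /* ... populate counts/symbols from the fixed code lengths ... */
}

void decode_block_header(struct inflate_state *d, int btype)
{
    if (btype == 1) {
        /* Fixed mode: pay a little CPU each time to rebuild. */
        build_fixed_trees(&d->ltree, &d->dtree);
    }
    /* btype == 2 (dynamic mode) decodes the trees from the stream
     * into the same d->ltree / d->dtree storage, as before. */
}
```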
The dynamic trees are themselves serialised using Huffman encoding (yo dawg). tinf_decode_trees() needs to first build the code tree used to deserialise the dynamic trees, which the original implementation loads into a local variable on the stack. There is an intermediate step between the code tree and the dynamic tree however (the bit length array), and so we can borrow the space for the dynamic tree instead of using a new local variable. This saves 0.6K of stack.
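The reuse can be sketched as follows; again, the names and layout are illustrative rather than tinf's exact structures:

```c
#include <string.h>

struct tree {
    unsigned short counts[16];
    unsigned short symbols[288];   /* sizeof(struct tree) is ~0.6K */
};

struct inflate_state {
    unsigned char lengths[288 + 32]; /* intermediate bit-length array */
    struct tree ltree;               /* literal/length tree */
    struct tree dtree;               /* distance tree */
};

void decode_trees(struct inflate_state *d)
{
    /* Build the temporary code tree in the literal tree's storage
     * instead of in a ~0.6K stack local: it is only needed while the
     * bit-length array is being decoded, before ltree itself exists. */
    struct tree *code_tree = &d->ltree;
    memset(code_tree, 0, sizeof(*code_tree));
    /* ... use code_tree to decode d->lengths from the stream ... */
    /* ... then build the real ltree/dtree from d->lengths; this
     * overwrites code_tree only after its last use. */
}
```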
With the stack saving I was able to move the heap allocation back to the stack. (Since the stack memory can't be used for anything else, it's effectively free, and moving the allocation there lets the non-stack memory be used for something else.) The end result is 1.2K of .text, 0.2K of .bss and 1.6K of stack (total 3.0K), with only 1.4K counting against the 24K limit. That stack usage is pretty tight though (trying to use app_log() inside tinf causes a crash) and is going to depend on the caller using limited stack. My modified implementation will therefore allocate 1.2K on the heap by default, unless you define a compile-time option to use the stack instead. Support for zlib or gzip adds 0.4K of .text. You can find the code on bitbucket.
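For a sense of what the optional zlib support involves: the framing adds only a couple of header bytes to validate (RFC 1950), plus an Adler-32 checksum at the end. A header check might look like this sketch (not tinf's actual code):

```c
/* Check a zlib stream header per RFC 1950: the compression method
 * must be 8 (DEFLATE) and the two header bytes, read as a big-endian
 * 16-bit value, must be a multiple of 31. */
int zlib_header_ok(const unsigned char *src)
{
    unsigned cmf = src[0], flg = src[1];
    if ((cmf & 0x0f) != 8)
        return 0;                       /* not DEFLATE */
    if ((cmf * 256 + flg) % 31 != 0)
        return 0;                       /* header check failed */
    if (flg & 0x20)
        return 0;                       /* preset dictionary unsupported */
    return 1;
}
```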