pebble

Memory optimised decompressor for Pebble Classic

TLDR: DEFLATE decompressor in 3K of RAM

For a Pebble app I've been writing, I need to send images from the phone to the watch and cache them in persistent storage on the watch. Since the persistent storage is very limited (and the Bluetooth connection is relatively slow) I need these to be as small as possible, and so my original plan was to use the PNG format and gbitmap_create_from_png_data(). However, I discovered that this function is not supported on the earlier firmware used by the Pebble Classic. Since PNGs are essentially DEFLATE compressed bitmaps, my next approach was to manually compress the bitmap data. This meant that I needed a decompressor implementation ("inflater") on the watch.

The constraint

The major constraint for Pebble watchapps is memory. On Pebble Classic apps have 24K of RAM available for the compiled code (.text), global and static variables (.data and .bss) and heap (malloc()). There is an additional 2K for the stack (local variables). The decompressor implementation needed to have both small code size and variable usage. I discovered tinf which seemed to fit the bill, and tried to get it working.

Initially, trying to decompress something simply crashed the app. It took some debug prints to determine that code in tinf_uncompress() wasn't even being executed, and I realised that it was exceeding the 2K stack limit. I changed the TINF_DATA struct to be allocated on the heap to get past this. At this stage it was using 1.2K of .text, 1.4K of .bss, 1K of stack, and 1.2K of heap (total 4.8K). I set about optimising the implementation for memory usage.

Huffman trees

Huffman coding is a method to represent frequently used symbols with fewer bits. It uses a tree (otherwise referred to as a dictionary) to convert symbols to bits and vice versa. DEFLATE can use Huffman coding in two modes: dynamic and fixed. In dynamic mode, the compressor constructs an optimal tree based on the data being compressed. This results in the smallest representation of the actual input data; however, it has to include the computed tree in the output in order for a decompressor to know how to decode the data. In some cases the space used to serialise the tree negates the improvement in the input representation. In this case the compressor can used fixed mode, where it uses a static tree defined by the DEFLATE spec. Since the decompressor knows what this static tree is, it doesn't need to be serialised in the output.

The original tinf implementation builds this fixed tree in tinf_init() and caches it in global variables. Whenever it encounters a block using the fixed tree it has the tree immediately available. This makes sense when you have memory to spare, but in this case we can make another tradeoff. Instead we can store the fixed tree in the same space used for the dynamic tree, and rebuild it every time it is needed. This saves 1.2K of .bss at the expense of some additional CPU usage.

The dynamic trees are themselves serialised using Huffman encoding (yo dawg). tinf_decode_trees() needs to first build the code tree used to deserialise the dynamic tree, which the original implementation loads into a local variable on the stack. There is an intermediate step between the code tree and dynamic tree however (the bit length array), and so we can borrow the space for the dynamic instead of using a new local variable. This saves 0.6K of stack.

The result

With the stack saving I was able to move the heap allocation back to the stack. (Since the stack memory can't be used for anything else it's kind of free because it allows the non-stack memory to be used for something else.) The end result is 1.2K of .text, 0.2K of .bss and 1.6K of stack (total 3.0K), with only 1.4K counting against the 24K limit. That stack usage is pretty tight though (trying to use app_log() inside tinf causes a crash) and is going to depend on the caller using limited stack. My modified implementation will allocate 1.2K on the heap by default, unless you define TINF_NO_MALLOC. Using zlib or gzip adds 0.4K of .text. You can find the code on bitbucket.

Syndicate content