Unishox
A hybrid encoder for Short Unicode Strings


General-purpose compression utilities such as zip and gzip do not compress short strings well and often expand them. They also use a lot of memory, which makes them unusable in constrained environments such as Arduino. The Unishox algorithm was developed to compress (and decompress) short strings individually.

Note: The present byte-code version is 2 and it replaces Unishox 1. Unishox 1 is still available as unishox1.c, but it will have to be compiled manually if it is needed.

Applications

  • Compression for low memory devices such as Arduino and ESP8266
  • Compression of chat application text exchanges, including emojis
  • Storing compressed text in databases
  • Faster retrieval speed when used as join keys
  • Bandwidth and storage cost reduction for Cloud


How it works

Unishox is a hybrid encoder that combines entropy, dictionary and delta coding. It assigns a fixed prefix-free code to each letter in the Character Set (entropy coding). It also encodes repeating letter sets separately (dictionary coding). For Unicode characters, delta coding is used.

The model used for arriving at the prefix-free code is shown below:

[Figure: model used for arriving at the prefix-free code]

The complete specification can be found in this article: A hybrid encoder for compressing Short Unicode Strings.
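
To make the delta coding idea mentioned above more concrete, here is a minimal sketch in C. It is illustrative only and is not Unishox's actual bit layout or code: the function names delta_encode and delta_decode are hypothetical, and the sketch assumes the input has already been decoded from UTF-8 into an array of code points. It shows why delta coding suits Unicode text: consecutive characters from the same script tend to have nearby code points, so the differences are small numbers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Unishox.
 * Stores each code point as its signed difference from the previous
 * code point. The "previous" value starts at 0, so the first delta
 * is the first code point itself. */
void delta_encode(const uint32_t *cp, size_t n, int32_t *deltas) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        /* Code points never exceed 0x10FFFF, so they fit in int32_t. */
        deltas[i] = (int32_t)cp[i] - (int32_t)prev;
        prev = cp[i];
    }
}

/* Hypothetical helper, not part of Unishox.
 * Rebuilds the code points by accumulating the deltas. */
void delta_decode(const int32_t *deltas, size_t n, uint32_t *cp) {
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += deltas[i];
        cp[i] = (uint32_t)prev;
    }
}
```

For the four code points U+0BA4, U+0BAE, U+0BBF, U+0BB4 (all in the Tamil block), only the first delta is large; the remaining three are 10, 17 and -11, which need far fewer bits than the full code points.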

Compiling

To compile, just use make or use gcc as follows:

gcc -std=c99 -o test_unishox2 test_unishox2.c unishox2.c

For testing the compiled program, use:

./test_unishox2 -t

API

int unishox2_compress_simple(const char *in, int len, char *out);
int unishox2_decompress_simple(const char *in, int len, char *out);

Usage

To see Unishox in action, simply try to compress a string:

./test_unishox2 "Hello World"

To compress and decompress a file, use:

./test_unishox2 -c <input_file> <compressed_file>
./test_unishox2 -d <compressed_file> <decompressed_file>

Unishox does not achieve good compression ratios on large files or binary files.

Character Set

Unishox supports the entire Unicode character set. As of now it supports UTF-8 as input and output encoding.

Projects that use Unishox

Credits

Issues

In case of any issues, please email the author (Arundale Ramanathan) at arun@siara.cc or create a GitHub issue.