Lossless compression - Huffman Coding & RLE

03.Feb.2024, Yuriy Georgiev

Intro

Lossless compression is a form of data compression that reduces data/file sizes without sacrificing any information in the process. No details are lost along the way.
 
In this tutorial, I will introduce you to the Huffman coding and Run-Length encoding (RLE) algorithms.
 
Huffman coding is used in conventional compression formats like GZIP, BZIP2, PKZIP, JPEG, MPEG-2, etc.
It’s good to note that there is also another very popular algorithm called LZ77 which many other compression utilities and algorithms fork from.
LZ77 stands for Lempel-Ziw 1977. These are the names of the inventors — Abraham Lempel and Jacob Ziv, invented in 1977.
We also know the LZW (Lempel-Ziv-Welch), which is a version of LZ78 improved by Terry Welch.
 
There is also the widely spread Deflate algorithm, which is a combination of LZ77 and Huffman coding.  It was originally developed by Phil Katz, for version 2 of his PKZIP archiving tool and the ZIP compression format.
 
For the sake of completeness, I just want to mention the lossy compression and its use in modern computing.
Lossy compression is typically used when a file can afford to lose some data. Such as images or videos. Here, an algorithm scans image files (or video frames, which essentially are images) and reduces their size by discarding information considered less important or undetectable to the human eye.

Run-Length Encoding (RLE)

RLE is a simple method of compressing data by specifying the number of times a character (or byte for that matter), repeats followed by the value of the byte. The aim is to reduce the number of bits used to represent a set of data.

Example:
raw data: AAAAABBBCCCCNNNN = 16 bytes (128 bits)
RLE encoded: 5A3B4C4N = 8 bytes (64 bits)

In this example “A” repeats 5 times, “B” is repeated 3 times, etc. You get the idea.

Huffman Coding

The Huffman coding makes use of a priority queue and a binary tree to build a data structure representing the codes to be used during the encoding process.

 

Priority Queue

A priority queue is a type of queue that arranges elements based on their priority values.

First, the data is scanned and each byte occurrence is counted. This information is saved in a priority queue.
In the context of the Huffman coding tree, each entry of the priority queue has a priority value equal to the number of times it occurs in the data.

The entries that have the higher number of occurrences are saved at the beginning of the priority queue, followed by the ones with fewer occurrences in descending order.

For example, the string “this is a string” should have the following representation in a priority queue (note that I use the symbol _ for whitespace just for visibility purposes):

String: this_is_a_string

Sorted by num. of occurrence:

i = 3
s = 3
_ = 3
t = 2
h = 1
a = 1
r = 1
n = 1
g = 1

The priority queue looks like this: is_tharng

 

Building the Huffman tree

The priority queue is used to build a binary tree, starting from the bottom (the leafs) to the top (the root) instead of the commonly used opposite approach.

The tree is built by traversing the pqueue from back to front, taking the entries, and summing their priority values. The result is a new node that becomes their root and has the so-called “weight” which is the sum of the priority values of the leafs.

A typical node structure in C should look like this:

struct node
{
struct node *parent;
struct node *left;
struct node *right;
unsigned int weight;
unsigned char byte;
bool isRoot;
bool isLeaf;
}

Following the example string above, the process looks like this:

The new node with weight 2 becomes the root of the two leafs with weight 1 each.

The leafs now hold the actual byte data (string character). Only the leaf nodes are the ones that keep the data. All nodes above them, up to the top root node, are only weight nodes.

The result (the node with weight 2) is added back to the priority queue at the appropriate position, and the two entries that were processed (the characters “n” and “g”) are removed from the queue.

Next, we process the next two smaller values from the priority queue which are “r” and “a” with values of 1. The resulting weight node is again moved back to the priority queue and the processed two entries are removed.

So, at this point, we end up with a priority queue that looks like this: is_(2)(2)th

Where (2) represents an entry of a small tree with a root node with weight of 2.

The next step is to process the following smallest entries on our priority queue. These are the letter “h” with a value of 1, and the letter “t” with a value of 2.

Following this algorithm, we can summarize the whole process like this:

Note step 6 above. At some point we reach two entries that are different by content, one is a character “i” and the other one is a small tree. We simply proceed by merging them both into one root node.

Important note:

Have a look at the position of the nodes when merging. The one with the higher weight goes to the left of the root node, while the smaller one goes to the right.
This is an important detail because it will be later used to build a code that makes the compression efficient. It doesn’t matter if the higher value will go to the left or the right as long as it’s consistent throughout all the nodes.

 

Ultimately, we end up with a complete tree.
There is no definitive “right” way to build the tree as long as you follow the rule to always take the entries with lower values from the end of the priority queue and add the result back to the appropriate position in the queue (keeping it sorted) until you exhaust the list of entries.

At this point, we can introduce the encoding.

Each branch is encoded with simply 0 or 1. The left branches have the value of 0, and the right branches have the value of 1. This is the essence behind the compression in the Huffman coding.

Here is the tree that we just built, with the codes applied to each branch:

The encoding

To create a unique code for each character (or byte, depending on the data), you just follow the branches and write their codes.

For example, the letter “A” has the code 0110. And that will be the binary that will be written in the compressed version of the data.

Note that I’m talking about binary here. Each character is 1 byte in size, which is 8 bits. The Huffman coding algorithm enables you to compress it to a smaller size — 4 bits in the case of the character “A”. This is a 50% compression ratio.

If we follow the tree above, we can build a code for each letter, which essentially is the compression of the data.

i = 11
s = 000
_ = 001
t = 100
h = 101
a = 0110
r = 0111
n = 0100
g = 0101

The final code for the encoded string in binary is as follows:
100 101 11 000 001 11 000 001 0110 001 000 100 0111 11 0100 0101 = 50 bits (7 bytes (rounded))

If we convert the original string “this_is_a_string” to raw binary we get: 
01110100 01101000 01101001 01110011 01011111 01101001 01110011 01011111 01100001 01011111 01110011 01110100 01110010 01101001 01101110 01100111

which is 128 bits, or 16 bytes.

Here is the place to explain the purpose of the higher weighted nodes to be always on the left (or right, depending on your implementation) side of each sub-tree.

The higher the weight, the more occurrences the byte has in the original data. The closer to the root a leaf node is in the Huffman tree, the shorter the code for that byte will be produced.
This means that the bytes/characters with the most occurrences in the original data will be compressed with fewer bits, which contributes to the final efficiency of the compression algorithm.
If you are consistent in the order, you position the new nodes, the tree will be more balanced, reducing its depth and ultimately producing shorter codes for each byte of data.

Basically, the algorithm focuses on reducing the size of the duplicates.

The Decoding

I’m sure that at this point you already ask yourself how this binary encoded string can be reversed to the original data out of just zeroes and ones.

In fact, it’s very straightforward. There is one single prerequisite — you need the Huffman tree that is built out of the original data.
If you work with files (compressing and decompressing files), it should be saved at the beginning or end of the file (or another separate file just for the tree itself) where you can read it before starting the decoding.

If you read each bit from the array of the encoded string and use it for taking a direction in the Huffman tree, you will inevitably reach a leaf node. At the point you reach a leaf node, you retrieve the data out of it and start from the root of the tree with the next bit in the array of the encoded string.

The downside of Huffman coding

The Huffman coding is a fairly straightforward and useful algorithm.

However, it has its downsides. If a tree becomes too deep (which often does when you’re working with big chunks of data), the codes for some bytes may become longer than 8 bits, which not only defeats the purpose but compromises the goal.
Some of the time the compression of the most repeated bytes will compensate for the longer codes, but that’s not always the case. You may end up having negative results.
I had a case with an executable that had a -168% compression ratio — it actually became bigger after the encoding.
However, calc.exe on Windows 11 got 25% compression with pure Huffman encoding, without any extra improvements on the algorithm, nor preprocessing (other compression methods applied prior to applying Huffman coding).

Huffman coding is usually the final step of a compression procedure.
Usually, a compression consists of several phases where the data is preprocessed with different algorithms or stripped from commonly found metadata, headers (like the MZ header in Windows executables), etc. before passed to the Huffman coding phase.

An idea for optimization

You have 256 possible combinations for a value of byte. A 7 levels-deep Huffman tree can save up to 128 bytes if balanced properly before it starts building codes longer than 8 bits.

An idea I have is to build two Huffman trees with 128 bytes, 7 levels deep, and set the root nodes to respectively 0 and 1. The first bit of the code will determine which tree will be used.
This way you can exhaust all 256 values that a byte can have, ultimately providing a compression no matter what.
The tree may be predefined, then a new tree can be built based on the occurrence of the bytes and optimized in a way such it serves its purpose of providing shorter codes for the most repeated bytes.

Modding and optimizing the algorithm is pretty much a must if you want optimal results and depends on your needs and your goal.

Conclusion

Data compression is a fun topic, no doubt about it.

It’s used everywhere. In data transfer over the Internet and local networks. Keeping archives of your valuable documents, storing assets for games, etc.
It’s a big topic also. I cannot exhaust it in a single tutorial, but I hope I gave you some idea of how it works in its simplest form.

Have fun and good luck.