This weeks challenge is to build your own command like tool to compress text files. This is a challenge I first did in 1998 when my employer didn’t have much work for me to do between projects so suggested I pick a skill and polish it.
I picked C and decided to implement a few of the data structures and algorithms from my university course again in C (we did Pascal on the course). At the same time I wanted to start factoring in some of the ‘mechanical sympathy’ that I’d been reading about in Michael Abrash's excellent Graphics Programming Black Book.
Despite the name the book covers a lot more than just graphics programming - it really digs into understanding the architecture of a computer and how that has an impact on the performance of your code. For example when I re-wrote of part the compression tool that this challenge is about to take into account the hard disk buffer size (back before SSDs) the performance on a one megabyte file jumped 100x!
These days the equivalent would be understand the cache hierarchies in a modern CPU and how programming with an awareness of them can radically change the performance of your code. Scott Meyers did great talk that digs into this. You can find the slides on his website here.
The Challenge - Building A Huffman Encoder/Decoder
In the early 1950s David Huffman developed an algorithm to find the optimal prefix code that can be used for lossless data compression.
Given there is usually an unequal distribution of character occurrences in text this can then be used to compress data by giving the most commonly occurring characters the shortest prefix.
For example if we have the string aaabbc, it would normally take up 6 bytes, but if we assign each character a variable length code, with the most frequently occurring character has the shortest code we might give them the following codes:
a: 1
b: 01
c: 10
and we could reduce the string aaabbc (six bytes) to 111010110 (nine bits). It’s not quite that simple though as we need to ensure that the codes are prefix-free, that is the bit string representing one character is not a prefix of a bit string representing another.
These prefix-free codes are generated by creating a binary tree. Before we get into that let’s consider the steps involved in compression using Huffman Codes:
Read the text and determine the frequency of each character occurring.
Build the binary tree from the frequencies.
Generate the prefix-code table from the tree.
Encode the text using the code table.
Encode the tree - we’ll need to include this in the output file so we can decode it.
Write the encoded tree and text to an output field
Step Zero
In most (Fortran and COBOL are exceptions) programming languages we index arrays from the element zero onwards because arrays are represented as a pointer to the beginning and an offset from that beginning - i.e. if you’re familiar with C then array[index] is equivalent to *(array + index).
As usually, I’ll leave you to setup your IDE / editor of choice and programming language of choice. After that here’s what I’d like you to do to be ready to test your solution.
After that, please download Les Misérables, by Victor Hugo from Project Gutenberg here: https://www.gutenberg.org/files/135/135-0.txt which will be our test data for this challenge.
Step 1
In this step your goal is to build your tool to accept a filename as input. It should return an error if the file is not valid, otherwise your program should open it, read the text and determine the frequency of each character occurring within the text.
As a debugging aid you might like to optionally log out the table. Here’s some test values to check. There are 333 occurrences of ‘X’ and 223000 of ‘t’.
An alternate to logging the frequency table is to write some unit tests that take in some prepared data and then check that the counting code returns the correct values. Check your chosen programming language’s documentation for it’s unit testing library and if you’ve never tried it before consider giving test driven development a go.
Step 2
In this step your goal is to use the frequencies that you calculated in step 1 to build the binary tree. There’s a good explanation of how to do this complete with a visualisation here: https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/Huffman.html
The examples used for the visualisation would be a good starting point for unit tests.
Step 3
In this step your goal is to use the tree to generate the prefix-code table. Once again there’s an explanation and visualisation at: https://opendsa-server.cs.vt.edu/ODSA/Books/CS3/html/Huffman.html
Step 4
In this step your goal is to write a header section to the output file (you’ll want a command line option to specify the filename). The header section will include enough information for your program to be decode the compressed file.
One way of doing this is to write out the tree, another option is to write out the character frequency table. Don’t forget you’ll need some way of knowing when the header ends and when the compressed data starts.
Step 5
This is the moment you’ve been building up to! In this step your goal is to encode the text using the code table and write the compressed content of the source text to the output file file, appending it after the header. Don’t forget translate the prefixes into bit strings and pack them into bytes to achieve the compression.
Step 6
In this step your goal is to read in the header of the encoded file and rebuild the prefix table ready to decode the text. In essence you’re going to do the reverse of step 4.
Step 7
In this step you will read in the remainder of the file and decode it using the prefix table before writing the decoded text back to the specified output file - the reverse of step 5.
At this point you should be able to compress the file with your tool, compare the file size of the input and output see that the output is smaller and then decompress the output file to create an identical copy of the input.
If so, congratulations you’ve written a working Huffman encode/decode tool!
Share Your Solutions!
Even though this is only the first coding challenge I’ve already been asked to share model solutions in as many programming languages as possible - which is an awesome idea, but I just don’t have the time and don’t know even half the programming languages out there so I’d like your help!
If you think your solution is an example of the developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know - ping me a message via Twitter or LinkedIn or just post about it there and tag me.
I’ll take a look at every solution and I’ll mention the best solution in the follow week’s newsletter.
Request for Feedback
I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback greatly appreciated.
You can reach me on Twitter, LinkedIn or through SubStack
Thanks and happy coding!
John
Really nice exercise. Also I got to learn a new data structure of Huffman tree. Here is my solution in Java (Converted to command line utility using picocli)
https://github.com/varunu28/JCompress