Coding Challenge #114 - Gzip
This challenge is to build your own gzip.
Hi, this is John with this week’s Coding Challenge.
🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉
If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧
Coding Challenge #114 - Gzip
This challenge is to build your own version of gzip, the widely used file compression utility.
Gzip has been a cornerstone of computing since 1992. It’s used everywhere, compressing files on the command line, serving web content over HTTP, packaging up tarballs for distribution, and much more. Under the hood, gzip uses the DEFLATE compression algorithm (a combination of LZ77 and Huffman coding) wrapped in a simple file format defined by RFC 1952. Building your own gzip will give you a deep understanding of how data compression works, how file formats are structured, and how command-line tools handle the many options users expect.
If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.
The Challenge - Building Gzip
In this challenge you’re going to build your own version of the gzip compression tool. Your tool will compress and decompress files using the DEFLATE algorithm, producing output that is fully compatible with the standard gzip and gunzip utilities.
You can use a library for the DEFLATE algorithm if you want, but I’d really encourage you to implement it yourself by reading the spec. Data compression is one of those topics that sounds intimidating but becomes surprisingly approachable once you break it down. You’ll learn far more about how compression actually works by building LZ77 and Huffman coding from scratch than by calling a library function. The RFCs are well written and this challenge is structured to walk you through it incrementally.
Step Zero
In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.
Choose your target platform and programming language. I’d encourage you to pick a language that gives you access to low-level byte manipulation, as you’ll be working with binary data, checksums, and bit-level operations.
Before you start coding, read through RFC 1952 (the gzip file format) and RFC 1951 (the DEFLATE compressed data format). Don’t worry about understanding every detail right now, we’ll work through the key parts step by step. The important thing is to get a feel for how the pieces fit together: gzip is a wrapper format, and DEFLATE is the compression algorithm inside it.
Step 1
In this step your goal is to produce a valid .gz file using DEFLATE stored (uncompressed) blocks.
Before you tackle actual compression, get the gzip file format right first. Your tool should take a filename as an argument, wrap its contents in a valid .gz file with the correct header and trailer, and write it out with the .gz extension appended. After writing the compressed file, remove the original (this is the default gzip behaviour).
The gzip format (RFC 1952) requires a 10-byte header containing the magic number (1f 8b), the compression method (08 for DEFLATE), and flags. It also requires a trailer containing the CRC-32 checksum of the original data and the original file size (modulo 2^32).
For the DEFLATE payload, use stored blocks (block type 00 in RFC 1951, section 3.2.4). A stored block simply contains the raw data with a small header — no actual compression. This lets you get the gzip wrapper, CRC-32 calculation, and file handling working correctly before you add real compression. Your output will be valid gzip, just larger than the input.
Testing: Create a test file and compress it with your tool, then decompress it with the system gunzip to verify your format is correct:
echo "Hello, World!" > test.txt
ccgzip test.txt
gunzip test.txt.gz
cat test.txt
You should see Hello, World! and the file should be identical to the original. The .gz file will be slightly larger than the original since you’re not compressing yet — that’s fine, the important thing is that gunzip accepts it.
Step 2
In this step your goal is to implement LZ77, the first half of the DEFLATE algorithm.
LZ77 works by sliding a window over the input data and looking for sequences that have already appeared. When it finds a match, instead of storing the bytes again, it stores a (length, distance) pair — “copy 5 bytes from 12 positions back”. This is how DEFLATE eliminates repeated patterns.
Implement a sliding window (up to 32,768 bytes as per the spec) and a match-finding algorithm. For each position in the input, search the window for the longest match. If you find a match of 3 bytes or more, emit a (length, distance) pair. Otherwise, emit the literal byte.
At this point, encode your LZ77 output using DEFLATE fixed Huffman codes (RFC 1951, section 3.2.6). Fixed codes use a predefined Huffman table built into the spec, so you don’t need to build your own trees yet — you just need to emit the right bit sequences for literals, lengths, and distances.
Testing: Compress a file with your tool and decompress with gunzip:
echo "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc" > test.txt
ccgzip test.txt
gunzip test.txt.gz
cat test.txt
The .gz file should now be smaller than the original for repetitive data. Compare the compressed size to what the system gzip produces — yours won’t be as good yet (we’ll get there), but it should be noticeably smaller than the stored blocks from Step 1. Try it on a larger text file to see a more dramatic difference.
Step 3
In this step your goal is to implement Huffman coding, the second half of DEFLATE.
Fixed Huffman codes work, but they’re not optimal for every input. Dynamic Huffman codes (RFC 1951, section 3.2.7) let you build custom Huffman trees tailored to the actual frequency of symbols in your data, which gives much better compression.
Build a Huffman tree from the frequencies of the literal/length and distance symbols in your LZ77 output. Encode the tree itself into the DEFLATE block header (the spec describes exactly how to do this using code length codes), then encode the data using your custom tree.
This is the trickiest part of the challenge, take it slow and test frequently. The encoding of the Huffman tree in the block header is fiddly, with its own mini-Huffman encoding for the code lengths. The RFC walks through it methodically; follow it closely.
Testing: Compress files with your tool and compare the sizes to the system gzip:
echo "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc" > test.txt
cp test.txt test-original.txt
ccgzip test.txt
ls -la test.txt.gz
gunzip test.txt.gz
diff test.txt test-original.txt
Rename one of them to compare both. Your compressed output should now be close to what the system gzip produces. Test with a variety of files — text, source code, binary — and decompress each with gunzip to make sure everything round-trips correctly.
Step 4
In this step your goal is to decompress a .gz file using the -d flag.
When your tool is called with -d, it should read a .gz file, validate the gzip header, decompress the DEFLATE data, and write the original content to a new file with the .gz extension removed. After decompression, the .gz file should be removed.
You’ll need to reverse everything you built in the previous steps: parse the DEFLATE block headers, reconstruct the Huffman trees (for dynamic blocks), decode the symbols, and replay the LZ77 back-references to reconstruct the original data.
Your tool should validate the CRC-32 checksum and original file size stored in the gzip trailer. If the checksum doesn’t match the decompressed data, your tool should report an error and exit with a non-zero status code — this is how gzip detects corrupt files.
Testing: Compress a file with the system gzip, then decompress it with your tool:
echo "Testing decompression" > test.txt
gzip test.txt
ccgzip -d test.txt.gz
cat test.txt
You should see Testing decompression. Also test with a corrupted file — use a hex editor to change a byte in a .gz file and verify your tool reports a CRC error.
Step 5
In this step your goal is to support reading from standard input and writing to standard output, along with the -c, -k, and -f flags.
When no filename is provided, your tool should read data from stdin, compress it, and write the compressed output to stdout. When decompressing, it should read compressed data from stdin and write the decompressed output to stdout.
Implement the -c flag, which writes compressed or decompressed output to stdout instead of to a file, leaving the original file unchanged. Implement the -k flag, which keeps (does not delete) the original file after compression or decompression. And implement the -f flag, which forces overwriting of existing output files without prompting.
Testing: Test piping data through your tool:
echo "Piped data" | ccgzip | gunzip
echo "Piped data" | gzip | ccgzip -d
Both should output Piped data. Test -k and -f:
echo "Keep me" > test.txt
ccgzip -k test.txt
ls test.txt test.txt.gz
ccgzip -f test.txt
The -k flag should leave both files in place. The second compression should succeed because -f forces the overwrite of the existing .gz file.
Step 6
In this step your goal is to support compressing and decompressing multiple files and to implement compression levels.
Your tool should accept multiple filenames as arguments and process them one at a time. If any file fails, your tool should report the error and continue processing the remaining files.
Implement compression levels from -1 (fastest, least compression) to -9 (slowest, best compression), with -6 as the default. These levels control how aggressively the LZ77 stage searches for matches — at level 1, you might limit the search to short look-aheads and a smaller window, whilst at level 9, you search more thoroughly for the longest possible matches.
Testing: Test multiple files:
echo "File one" > a.txt
echo "File two" > b.txt
echo "File three" > c.txt
ccgzip a.txt b.txt c.txt
ls *.gz
You should see a.txt.gz, b.txt.gz, and c.txt.gz. Test compression levels on a larger file:
ccgzip -1 -k test.txt && mv test.txt.gz test-fast.gz
ccgzip -9 -k test.txt && mv test.txt.gz test-best.gz
ls -la test-fast.gz test-best.gz
For a sufficiently large file, the -9 version should be smaller than the -1 version.
Step 7
In this step your goal is to implement the -l, -t, and -v flags.
Implement the -l flag, which displays compression statistics for a .gz file without decompressing it. The output should include the compressed size, uncompressed size, compression ratio, and the original filename. The format should match the standard gzip output:
compressed uncompressed ratio uncompressed_name
73 26 42.3% test.txt
Implement the -t flag, which tests the integrity of a .gz file by decompressing it and validating the CRC-32 checksum without writing the output to disk. If the file is valid, it exits silently with status 0. If it’s corrupt, it reports an error.
Implement the -v flag, which displays the name and compression ratio for each file as it is processed. This is useful when compressing multiple files so you can see progress.
Testing: Test the -l flag:
gzip -k test.txt
ccgzip -l test.txt.gz
gzip -l test.txt.gz
Compare the output of both commands — they should show the same statistics. Test -t on a valid and a corrupted file. Test -v by compressing multiple files and checking that each one shows its name and ratio.
Step 8
In this step your goal is to implement recursive directory compression, preserve file metadata, and ensure full compatibility with the standard gzip tools.
Implement the -r flag, which recursively traverses directories and compresses (or decompresses) all files found within them.
The gzip header has optional fields for the original filename and the modification timestamp of the source file. Your tool should store these when compressing and restore the modification timestamp when decompressing.
Your output should be fully compatible with the system gzip and gunzip , any file compressed by your tool should decompress correctly with gunzip, and any file compressed by gzip should decompress correctly with your tool. The .gz extension should be added on compression and removed on decompression.
Testing: Test recursive compression:
mkdir -p testdir/subdir
echo "Root file" > testdir/file1.txt
echo "Sub file" > testdir/subdir/file2.txt
ccgzip -r testdir
find testdir -name "*.gz"
You should find testdir/file1.txt.gz and testdir/subdir/file2.txt.gz. Test timestamp preservation:
touch -t 202301151200.00 test.txt
ccgzip -k test.txt
ccgzip -d test.txt.gz
stat test.txt
The modification timestamp should match the original. Test full compatibility by compressing a variety of files (text, binary, empty) with both your tool and the system gzip, and cross-decompressing them to verify they produce identical output.
Going Further
Here are some ideas to take your gzip implementation further:
Add support for concatenated gzip streams (multiple gzip members in a single file)
Implement the
-rsyncableoption which makes the compressed output more friendly to rsync’s delta transfer algorithmAdd support for the
-suffixoption to use a custom file extensionBuild a parallel compression mode (like
pigz) that uses multiple CPU cores to compress data fasterAdd support for decompressing other formats that gzip can handle, such as compress (
.Z) filesExperiment with different match-finding strategies (hash chains, binary trees, optimal parsing) and measure the compression ratio and speed trade-offs
P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!
Share Your Solutions!
If you think your solution is an example other developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know via Bluesky or LinkedIn or just post about it there and tag me. Alternately please add a link to it in the Coding Challenges Shared Solutions Github repo
Request for Feedback
I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback is greatly appreciated.
You can reach me on Bluesky, LinkedIn or through SubStack
Thanks and happy coding!
John

