Coding Challenge #120 - md5sum
This challenge is to build your own md5sum.
Hi, this is John with this week’s Coding Challenge.
🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉
If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧
This challenge is to build your own version of md5sum, the command-line utility that computes and verifies MD5 message digests.
MD5 (Message-Digest Algorithm 5) was designed by Ronald Rivest in 1991 and published as RFC 1321. For decades it was the go-to hash function for verifying file integrity - you’d download a file, run md5sum on it, and compare the output to the hash published alongside it to make sure nothing got corrupted or tampered with.
These days MD5 is considered cryptographically broken (collisions can be generated in seconds on a modern laptop), but it’s still widely used for checksums, cache keys, and non-security purposes. More importantly for Coding Challenges, MD5 is a simple hash function to implement from scratch. Building it yourself is a wonderful way to demystify how hash functions work.
If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.
The Challenge - Building Your Own MD5sum
In this challenge you’re going to build your own version of md5sum. There are two tracks through this challenge. Pick the one that suits you, or do both.
Track 1 (Steps 1 - 3) gets you to a fully working md5sum clone using your language’s standard library or a third-party library for the hash computation. You’ll focus on the command-line interface, file handling, and the check mode.
Track 2 (Steps 4 - 6) takes you deeper. You’ll implement the MD5 algorithm itself from scratch by following RFC 1321, replacing the library you used in Track 1 with your own code. This is where you’ll learn how hash functions actually work - message padding, Merkle-Damgård construction, and the compression function that sits at the heart of MD5 (and SHA-1, and SHA-2).
Both tracks produce the same tool. The only difference is whether the hashing happens inside a library or inside your own code.
Step Zero
In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.
Choose your target platform and programming language. Any language will work for Track 1. For Track 2, pick a language that gives you easy access to 32-bit unsigned integer arithmetic and bitwise operations.
Before you start coding, have a play with the system md5sum so you get a feel for how it behaves:
echo -n "" | md5sum
echo -n "Hello, World!" | md5sum
The output format is the 32-character hex digest, two spaces, and the filename (or - for stdin). Note the -n flag on echo: without it, echo appends a newline, which changes the hash. This is a common gotcha when testing.
If you’re planning to do Track 2, open RFC 1321. It’s short, readable, and includes reference C code and a full test suite in the appendix. You won’t need it until Step 4, but it’s worth a skim now.
Step 1
In this step your goal is to hash the contents of a file and print the result in the standard md5sum output format.
Your tool should accept one or more filenames as command-line arguments, compute the MD5 hash of each file’s contents, and print one line per file in the format:
<32-char hex digest>  <filename>
Note the two spaces between the digest and the filename; this is the md5sum convention for text mode. You can use your language’s built-in MD5 library or any third-party package for the hash computation. The focus in this step is on reading files, formatting the output correctly, and handling errors (for example, printing a message to stderr and continuing if a file doesn’t exist).
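As a rough idea of what Step 1 involves, here is a minimal Python sketch that leans on the standard library's hashlib. The names md5_of_file and hash_files, the 64 KiB chunk size, and the wording of the error message are my own assumptions for illustration, not part of md5sum.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:  # always read raw bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files(paths, out=sys.stdout, err=sys.stderr):
    """Print one '<digest>  <filename>' line per file and return an exit status."""
    status = 0
    for path in paths:
        try:
            print(f"{md5_of_file(path)}  {path}", file=out)
        except OSError as exc:
            # Report the problem on stderr and carry on with the remaining files.
            print(f"ccmd5: {path}: {exc.strerror}", file=err)
            status = 1
    return status
```

Reading in chunks keeps memory use flat for large files. A command-line entry point could simply call hash_files(sys.argv[1:]) and exit with the returned status.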
Testing: Create a test file and compare your output to the system md5sum:
echo -n "Coding Challenges" > test.txt
md5sum test.txt
ccmd5 test.txt
Both should produce the same hash. Test with multiple files:
echo -n "File one" > a.txt
echo -n "File two" > b.txt
ccmd5 a.txt b.txt
md5sum a.txt b.txt
The output should match line for line. Also test what happens when you pass a file that doesn’t exist - your tool should print an error message for that file and still process the remaining files.
Step 2
In this step your goal is to support reading from standard input and to handle binary files correctly.
When no filenames are given, your tool should read from stdin, compute the hash, and print the result with - as the filename. When - is given explicitly as a filename argument, it should also read from stdin.
Your tool should also support the -b flag for binary mode. In binary mode the output uses " *" (a space followed by an asterisk) between the digest and the filename instead of two spaces. On modern systems the hash is computed the same way regardless of mode; the flag only affects the output format and is a holdover from systems where text and binary file reads differed. But md5sum supports it, so yours should too.
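One way to handle stdin and the binary-mode separator is sketched below in Python. The function names and the optional stdin parameter (there so the function can be tested without a real pipe) are assumptions of mine.

```python
import hashlib
import sys


def md5_of_stream(stream, chunk_size=65536):
    """Hash a binary stream (an open file or stdin) chunk by chunk."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def format_line(hex_digest, name, binary=False):
    """'<digest>  <name>' in text mode, '<digest> *<name>' in binary mode."""
    separator = " *" if binary else "  "
    return f"{hex_digest}{separator}{name}"


def hash_name(name, binary=False, stdin=None):
    """Hash a named file, treating '-' as standard input."""
    if name == "-":
        # sys.stdin.buffer is the raw byte stream behind sys.stdin.
        stream = stdin if stdin is not None else sys.stdin.buffer
        return format_line(md5_of_stream(stream), "-", binary)
    with open(name, "rb") as handle:
        return format_line(md5_of_stream(handle), name, binary)
```

When no filenames are given, the caller would treat the argument list as ["-"]. Note that the hash itself never depends on the binary flag; only format_line does.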
Testing:
echo -n "Hello" | ccmd5
echo -n "Hello" | md5sum
Both should output the same hash followed by -. Test binary mode:
ccmd5 -b test.txt
md5sum -b test.txt
The output should show * before the filename instead of two spaces. Test with a binary file too -- an image, a compiled executable, or /bin/ls -- and verify your hash matches the system md5sum.
Step 3
In this step your goal is to implement check mode with the -c flag.
When called with -c, your tool should read a file containing previously generated checksums (one per line, in the same format your tool produces) and verify each one. For each line, it should read the named file, compute its hash, and compare it to the stored hash. It should print the filename followed by OK or FAILED for each entry.
At the end, if any checksums failed, your tool should print a summary line to stderr saying how many didn’t match, and exit with a non-zero status code. If all checksums match, it should exit with status 0.
Implement the --quiet flag, which suppresses the OK lines and only shows failures. And implement the --status flag, which suppresses all output and only sets the exit code -- useful in scripts.
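Here is a minimal sketch of check mode in Python, again using hashlib. The name check_file, the regular expression, and two behaviours are my assumptions rather than documented md5sum behaviour: malformed lines are silently skipped, and a file that can't be read counts as FAILED.

```python
import hashlib
import re
import sys

# '<32 hex chars>' + two spaces (text) or ' *' (binary) + '<filename>'
LINE_RE = re.compile(r"^([0-9a-fA-F]{32}) [ *](.+)$")


def check_file(checksum_path, quiet=False, status_only=False,
               out=sys.stdout, err=sys.stderr):
    """Verify every entry in a checksum file; return 0 if all match, else 1."""
    failed = 0
    with open(checksum_path, "r", encoding="utf-8") as listing:
        for line in listing:
            match = LINE_RE.match(line.rstrip("\n"))
            if match is None:
                continue  # assumption: silently skip malformed lines
            expected, name = match.group(1).lower(), match.group(2)
            try:
                with open(name, "rb") as handle:
                    actual = hashlib.md5(handle.read()).hexdigest()
            except OSError:
                actual = None  # assumption: an unreadable file counts as a failure
            ok = actual == expected
            if not ok:
                failed += 1
            # --status prints nothing; --quiet hides only the OK lines.
            if not status_only and not (ok and quiet):
                print(f"{name}: {'OK' if ok else 'FAILED'}", file=out)
    if failed and not status_only:
        plural = "checksum" if failed == 1 else "checksums"
        print(f"ccmd5: WARNING: {failed} computed {plural} did NOT match", file=err)
    return 1 if failed else 0
```

The return value is meant to be passed straight to sys.exit, so scripts can rely on the exit code alone.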
Testing: Generate a checksum file, then verify it:
ccmd5 a.txt b.txt > checksums.md5
ccmd5 -c checksums.md5
You should see:
a.txt: OK
b.txt: OK
Now tamper with one of the files and re-check:
echo -n "Changed" > a.txt
ccmd5 -c checksums.md5
You should see:
a.txt: FAILED
b.txt: OK
ccmd5: WARNING: 1 computed checksum did NOT match
Test --quiet (only the FAILED line should appear) and --status (no output, but echo $? should show a non-zero exit code).
If you’re happy with a working md5sum clone and aren’t interested in implementing the hash algorithm itself, skip ahead to Going Further. Otherwise, read on.
Step 4
In this step your goal is to implement MD5 message padding and preprocessing, the first stage of the algorithm.
From here on you’re replacing the library hash with your own implementation. By the end of Step 6, your tool should produce identical output using code you wrote yourself.
MD5 operates on the input message in 512-bit (64-byte) blocks. Before processing, the message must be padded so its length is a multiple of 512 bits. The padding works like this:
Append a single 1 bit to the message (in practice, append the byte 0x80).
Append zero bytes until the message length is 8 bytes short of a multiple of 64 (i.e., length mod 64 equals 56).
Append the original message length in bits as a 64-bit little-endian integer.
This padding scheme means the final block always has room for the length field, and the 0x80 byte ensures the padding is unambiguous -- you can always tell where the original message ended.
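The three padding rules above can be sketched in a few lines of Python. The function names md5_pad and blocks are mine, and this version pads the whole message in memory for clarity; a streaming implementation would pad only the final block.

```python
def md5_pad(message: bytes) -> bytes:
    """Pad a message so its length is a multiple of 64 bytes."""
    bit_length = (len(message) * 8) % (1 << 64)   # the length field wraps at 2**64
    padded = message + b"\x80"                     # the single 1 bit, then seven 0 bits
    padded += b"\x00" * ((56 - len(padded)) % 64)  # zero-fill until length mod 64 == 56
    padded += bit_length.to_bytes(8, "little")     # original length in bits, little-endian
    return padded


def blocks(padded: bytes):
    """Yield the padded message as consecutive 64-byte (512-bit) blocks."""
    for offset in range(0, len(padded), 64):
        yield padded[offset:offset + 64]
```

A message that is already 56 bytes long has no room left for the 0x80 byte plus the length field in its block, so the padding spills into a second block. That edge case is worth a test of its own.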
Implement the padding, then split the padded message into 512-bit blocks. For now, just verify your padding is correct by checking it against the test vectors in RFC 1321 (Appendix A.5). You should also initialise the four 32-bit state variables (A, B, C, D) to the values specified in the RFC:
A = 0x67452301
B = 0xefcdab89
C = 0x98badcfe
D = 0x10325476
These are the starting values of the hash state, sometimes called the initialisation vector. They look arbitrary, but written out in little-endian byte order they are simply the bytes 01 23 45 67 89 ab cd ef fe dc ba 98 76 54 32 10.
Testing: The empty string "" has a length of 0 bits. After padding, you should have exactly one 64-byte block: 0x80, followed by 55 zero bytes, followed by the 64-bit length (0) in little-endian. Print your padded block as hex and verify it looks right.
The string "a" has a length of 8 bits. After padding: 0x61 0x80, 54 zero bytes, then 0x08 0x00 0x00 0x00 0x00 0x00 0x00 0x00.
Step 5
In this step your goal is to implement the MD5 compression function, the core of the algorithm.
The compression function processes one 512-bit block at a time and updates the four state variables. For each block:
Break the 64-byte block into sixteen 32-bit words (little-endian).
Initialise working variables a, b, c, d to the current state A, B, C, D.
Run 64 rounds, divided into four groups of 16. Each group uses a different auxiliary function; every round applies that function to three of the four working variables, adds in one of the message words and a round constant, then rotates the result.
The four auxiliary functions are:
F (rounds 0-15): F(B, C, D) = (B AND C) OR (NOT B AND D) - a bitwise conditional: “if B then C else D”
G (rounds 16-31): G(B, C, D) = (B AND D) OR (C AND NOT D) - same idea, different arrangement
H (rounds 32-47): H(B, C, D) = B XOR C XOR D - a parity function
I (rounds 48-63): I(B, C, D) = C XOR (B OR NOT D) - another nonlinear mixing function
Each round computes:
a = b + left_rotate((a + func(b,c,d) + message_word + round_constant), shift_amount)
Then the variables are rotated: the newly computed value becomes the new b, the old b becomes the new c, the old c becomes the new d, and the old d becomes the new a. The shift amounts and round constants are specified in the RFC -- there are 64 of each, and they’re fixed values you can hard-code as a table. The RFC also fixes which of the sixteen message words each round uses: rounds 0-15 take them in order, and the later groups step through them in a different fixed order.
After all 64 rounds, add the working variables back into the state: A += a, B += b, C += c, D += d. All additions are modulo 2^32. This addition step is what makes the construction iterative: each block’s output becomes the next block’s input.
The final hash is the state variables A, B, C, D concatenated in little-endian byte order to produce the 128-bit (16-byte) digest, which is then printed as 32 hex characters.
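Putting the pieces of this step together, here is a compact Python sketch of the compression function and a whole-message driver. Treat it as one possible reading of RFC 1321, not a reference implementation. The names compress, md5_hex, SHIFTS and CONSTANTS are mine. The sine-based formula for the round constants and the message-word schedule (i, 5i+1, 3i+5, 7i, all mod 16) come from the RFC. Python integers are unbounded, so every result is masked back to 32 bits.

```python
import math
import struct

MASK = 0xFFFFFFFF  # keep every result within 32 bits

# Per-round left-rotation amounts: four values per group, repeated four times.
SHIFTS = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
          [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)

# Round constants: floor(abs(sin(i + 1)) * 2**32), as defined in RFC 1321.
CONSTANTS = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]


def left_rotate(value, amount):
    """Rotate a 32-bit value left by 'amount' bits."""
    value &= MASK
    return ((value << amount) | (value >> (32 - amount))) & MASK


def compress(state, block):
    """Mix one 64-byte block into the (A, B, C, D) state tuple."""
    words = struct.unpack("<16I", block)  # sixteen little-endian 32-bit words
    a, b, c, d = state
    for i in range(64):
        if i < 16:
            f, g = (b & c) | (~b & d), i                  # F
        elif i < 32:
            f, g = (b & d) | (c & ~d), (5 * i + 1) % 16   # G
        elif i < 48:
            f, g = b ^ c ^ d, (3 * i + 5) % 16            # H
        else:
            f, g = c ^ (b | ~d), (7 * i) % 16             # I
        total = (a + f + words[g] + CONSTANTS[i]) & MASK
        # Rotate the variables: the new value lands in b, the old d moves to a.
        a, b, c, d = d, (b + left_rotate(total, SHIFTS[i])) & MASK, b, c
    # Add the working variables back into the incoming state.
    return tuple((s + v) & MASK for s, v in zip(state, (a, b, c, d)))


def md5_hex(message: bytes) -> str:
    """Pad the message, compress each block, and return the hex digest."""
    bit_length = (len(message) * 8) % (1 << 64)
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack("<Q", bit_length)
    state = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)
    for offset in range(0, len(padded), 64):
        state = compress(state, padded[offset:offset + 64])
    return struct.pack("<4I", *state).hex()  # A, B, C, D in little-endian order
```

In a language with native 32-bit unsigned integers the masking disappears, because the arithmetic wraps on its own.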
Testing: RFC 1321 provides test vectors in Appendix A.5. Your implementation should produce these exact hashes:
MD5("") = d41d8cd98f00b204e9800998ecf8427e
MD5("a") = 0cc175b9c0f1b6a831c399e269772661
MD5("abc") = 900150983cd24fb0d6963f7d28e17f72
MD5("message digest") = f96b697d7cb7938d525a2f31aaf161d0
MD5("abcdefghijklmnopqrstuvwxyz") = c3fcd3d76192e4007dfb496cca67e13b
Work through these one at a time. If a hash doesn’t match, check your byte ordering; little-endian handling is the most common source of bugs. Once all five test vectors pass, swap your library hash for your own implementation and verify that your ccmd5 tool still produces the same output as the system md5sum for every file you test.
Step 6
In this step your goal is to add support for SHA-256, so your tool can operate as both md5sum and sha256sum.
SHA-256 follows the same Merkle-Damgård structure as MD5 - pad the message, split into blocks, process each block through a compression function - but with different parameters. The blocks are still 512 bits, but the state is eight 32-bit words instead of four, the compression function runs 64 rounds with different operations (Ch, Maj, and two sigma functions instead of F/G/H/I), and the output is 256 bits instead of 128. The message length in the padding is big-endian rather than little-endian, as are the 32-bit words read from each block and written to the digest.
Add a --algorithm flag (or -a) that accepts md5 or sha256, defaulting to md5. When SHA-256 is selected, the output format should match the system sha256sum: same layout, just a longer digest.
If you’ve structured your code well, the padding, block processing loop, and I/O code should be shared between both algorithms, with only the compression function and initialisation differing. This is a good test of how cleanly you’ve separated concerns.
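To make that separation concrete, here is one hypothetical way to structure it in Python. HashSpec, pad and digest are names I've made up for this sketch. The idea is that each algorithm supplies only its initial state, its byte order, and its compression function, while the driver stays the same.

```python
from typing import Callable, NamedTuple, Tuple


class HashSpec(NamedTuple):
    """Everything that differs between the two algorithms (hypothetical layout)."""
    initial_state: Tuple[int, ...]
    byteorder: str  # "little" for MD5, "big" for SHA-256
    compress: Callable[[Tuple[int, ...], bytes], Tuple[int, ...]]


def pad(message: bytes, byteorder: str) -> bytes:
    """Shared padding: 0x80, zero fill, then the 64-bit bit length."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + ((len(message) * 8) % (1 << 64)).to_bytes(8, byteorder)


def digest(spec: HashSpec, message: bytes) -> bytes:
    """Shared Merkle-Damgård loop: feed each 64-byte block to spec.compress."""
    state = spec.initial_state
    padded = pad(message, spec.byteorder)
    for offset in range(0, len(padded), 64):
        state = spec.compress(state, padded[offset:offset + 64])
    # Serialise the state words in the algorithm's own byte order.
    return b"".join(word.to_bytes(4, spec.byteorder) for word in state)
```

With this shape, adding SHA-256 means writing one new compression function and one new HashSpec; the -a flag then just picks which spec to pass to digest.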
Testing: Verify against the system sha256sum and well-known test vectors:
echo -n "" | ccmd5 -a sha256
echo -n "" | sha256sum
Both should produce e3b0c44298fc1c149afbf4c8996fb924... - (the SHA-256 of the empty string).
echo -n "Hello, World!" | ccmd5 -a sha256
echo -n "Hello, World!" | sha256sum
Test check mode with SHA-256 too -- generate a checksum file with -a sha256 and verify it with -c.
Going Further
Here are some ideas to take your implementation further:
Add support for SHA-1, SHA-384, and SHA-512 to build a complete family of hash tools
Add the --tag output format (BSD-style: MD5 (filename) = digest) which is what macOS md5 uses by default
Implement HMAC-MD5 and HMAC-SHA256 using your hash functions - HMAC is how hash functions are used for message authentication in protocols like TLS
Benchmark your implementation against the system md5sum on a large file and see how close you can get - then try optimising with SIMD intrinsics or by processing multiple blocks in parallel
Read about the MD5 collision attacks (Wang et al., 2004) and try to understand how they exploit weaknesses in the auxiliary functions - it’s a fascinating bit of cryptographic history
Implement SHA-3 (Keccak), which uses a completely different construction (a sponge function) rather than Merkle-Damgård - comparing the two designs is a great way to understand why the cryptographic community moved on from MD5’s family of designs
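If you fancy the HMAC idea, the construction is small enough to sketch. This Python version uses hashlib as a stand-in for your own hash functions. The name hmac_digest is mine, and the 64-byte block size is hard-coded on the assumption that you only use it with MD5 or SHA-256.

```python
import hashlib


def hmac_digest(key: bytes, message: bytes, hash_name: str = "md5") -> bytes:
    """HMAC: H((key XOR opad) + H((key XOR ipad) + message))."""
    block_size = 64  # bytes; the block size of both MD5 and SHA-256
    if len(key) > block_size:
        key = hashlib.new(hash_name, key).digest()  # long keys are hashed first
    key = key.ljust(block_size, b"\x00")            # then zero-padded to one block
    inner_key = bytes(b ^ 0x36 for b in key)        # ipad
    outer_key = bytes(b ^ 0x5C for b in key)        # opad
    inner = hashlib.new(hash_name, inner_key + message).digest()
    return hashlib.new(hash_name, outer_key + inner).digest()
```

You can check the result against Python's own hmac module before swapping in your from-scratch hash.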
P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!
Share Your Solutions!
If you think your solution is an example other developers can learn from, please share it: put it on GitHub, GitLab or elsewhere. Then let me know via Bluesky or LinkedIn, or just post about it there and tag me. Alternatively, please add a link to it in the Coding Challenges Shared Solutions GitHub repo.
Request for Feedback
I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback is greatly appreciated.
You can reach me on Bluesky, LinkedIn or through Substack.
Thanks and happy coding!
John

