Coding Challenge #116 - Awk

This challenge is to build your own awk.

John Crickett

Apr 18, 2026

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email. 📧

This challenge is to build your own version of awk, the classic text processing language.

Awk was created in 1977 by Alfred Aho, Peter Weinberger, and Brian Kernighan (the name comes from their initials). It’s a small but remarkably powerful language designed for processing structured text data.

Awk reads input line by line, splits each line into fields, and applies pattern-action rules to produce output. It sits in a sweet spot between sed (which is great for simple substitutions) and a full programming language like Perl or Python. Despite being nearly 50 years old, awk remains one of the most useful tools in a developer’s toolkit. You’ll find it in shell scripts, data pipelines, and one-liners on virtually every Unix-like system. Building your own awk will teach you about lexing, parsing, interpreters, and the design of small domain-specific languages.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Your Own Awk

In this challenge you’re going to build your own version of the awk text processing tool. Your tool will read input line by line, split each line into fields, match lines against patterns, and execute actions -- producing output that is compatible with the standard POSIX awk utility.

Awk programs are built from pattern-action rules that look like this: condition { action }. For each line of input, awk checks every rule. If the condition matches, the action is executed. It’s a simple model that turns out to be surprisingly expressive.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

Choose your target platform and programming language. I’d encourage you to pick a language you’re comfortable with for building interpreters. You’ll be writing a lexer, a parser, and a tree-walking interpreter, so a language with good string handling and data structures will make your life easier.

Before you start coding, have a read through the POSIX awk specification to get a feel for the language. Don’t worry about understanding every detail -- we’ll work through the features step by step. It’s also worth playing with the system awk on your machine to get a sense of how it behaves.

Create a test file to use throughout the challenge. Here’s one you can use:

echo "John 25 London
Jane 30 New York
Bob 22 Paris
Alice 35 Tokyo
Charlie 28 Berlin" > test.txt

Step 1

In this step your goal is to support the basic print action with field splitting.

Your tool should read input from a file (or stdin if no file is given), split each line into fields on whitespace, and support the print statement. The special variable $0 refers to the whole line, $1 to the first field, $2 to the second, and so on.

At this point, you only need to handle a bare action block with no pattern -- meaning the action runs for every line of input. Focus on getting the core loop right: read a line, split it into fields, execute the action, move to the next line.
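To make the core loop concrete, here is a minimal Python sketch of it. It is only an illustration under some assumptions: the function names (split_fields, field, run_print) are made up for this example, the print action is hard-coded rather than parsed, and Python's whitespace splitting is only an approximation of awk's default behaviour.

```python
def split_fields(record):
    # str.split() with no arguments collapses runs of whitespace and ignores
    # leading/trailing blanks, which is close to awk's default splitting.
    return record.split()


def field(record, fields, index):
    # $0 is the whole record; $N past the last field is the empty string.
    if index == 0:
        return record
    return fields[index - 1] if index <= len(fields) else ""


def run_print(indexes, lines):
    """Emulate '{ print $i, $j, ... }'; an empty index list means a bare print."""
    output = []
    for raw in lines:
        record = raw.rstrip("\n")          # read a line
        fields = split_fields(record)      # split it into fields
        values = [field(record, fields, i) for i in indexes] or [record]
        output.append(" ".join(values))    # a single space is the default OFS
    return output
```

Calling run_print([1, 3], lines) corresponds to '{ print $1, $3 }', and run_print([], lines) to '{ print }'. A real implementation would replace the hard-coded action with whatever your parser produces.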

Testing: Run these commands and compare against the system awk:

ccawk '{ print }' test.txt
ccawk '{ print $0 }' test.txt
ccawk '{ print $1 }' test.txt
ccawk '{ print $1, $3 }' test.txt
echo -e "hello\nworld" | ccawk '{ print $0 }'

The first two should print every line. The third should print just the first name from each line. The fourth should print the name and the third field, separated by a space (the default output field separator). The third field is usually the city, but on Jane’s line it is just “New”, because “New York” splits into two fields. The fifth should read from standard input and print each line. Your output should match awk exactly.

Step 2

In this step your goal is to support the -F flag for custom field separators and the built-in variables NR, NF, and FS.

The -F flag lets the user specify a custom field separator. For example, -F: splits on colons, which is useful for parsing files like /etc/passwd.

Implement the built-in variables NR (the current record number, starting at 1), NF (the number of fields in the current record), and FS (the field separator).
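Here is one way the splitting rules and the NR and NF counters might fit together, sketched in Python. The names split_record and records are hypothetical, and the regular-expression branch uses Python's re module, which is not identical to the POSIX extended regular expressions awk expects, so treat that part as an approximation.

```python
import re


def split_record(record, fs=" "):
    """Split one record into fields according to the value of FS."""
    if fs == " ":
        # Default: runs of blanks separate fields, edges are trimmed.
        return record.split()
    if record == "":
        # An empty record has no fields (NF is 0).
        return []
    if len(fs) == 1:
        # Any other single character is used literally, so "a::b"
        # yields an empty middle field.
        return record.split(fs)
    # Longer values are treated as a regular expression (approximated
    # here with Python's re syntax).
    return re.split(fs, record)


def records(lines, fs=" "):
    """Yield (NR, NF, fields) for each input line."""
    for nr, raw in enumerate(lines, start=1):
        fields = split_record(raw.rstrip("\n"), fs)
        yield nr, len(fields), fields
```

With this shape, -F: simply becomes a call with fs=":" and print NR, $1 reads the first and third elements of each tuple.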

Testing: Create a colon-separated test file and test with custom separators:

echo "john:25:london
jane:30:new york
bob:22:paris" > test2.txt

ccawk -F: '{ print $1 }' test2.txt
ccawk '{ print NR, $1 }' test.txt
ccawk '{ print NF }' test.txt

The first command should print just the names from the colon-separated file. The second should print line numbers alongside names. The third should print the number of fields on each line. Compare all output against the system awk.

Step 3

In this step your goal is to support patterns, comparison operators, and regular expression matching.

Awk’s power comes from its pattern-action model. A pattern can be a comparison expression (like $2 > 25), a regular expression (like /London/), or the special patterns BEGIN and END. If a line matches the pattern, the action is executed. If there’s no action, the default is { print }.

Implement comparison operators (==, !=, <, >, <=, >=), regular expression matching with /regex/ patterns and the ~ and !~ operators, logical operators (&&, ||, !), and the BEGIN and END special patterns. BEGIN runs before any input is read, and END runs after all input has been processed. With BEGIN available, you should also support setting FS within the program (e.g. BEGIN { FS = ":" }) as an alternative to the -F flag from Step 2.

Your program should support multiple pattern-action rules. Awk checks every rule for every line, so a single line can trigger multiple actions.
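The rule-checking loop could be sketched like this in Python. It assumes the parser has already turned each rule into a pair of callables, which is a simplification for illustration; run_program and EXAMPLE_RULES are names invented for this example.

```python
import re


def run_program(begin, rules, end, lines):
    """Run BEGIN actions, then every rule against every record, then END actions."""
    out = []
    for action in begin:
        action(out)
    for raw in lines:
        record = raw.rstrip("\n")
        fields = record.split()
        for pattern, action in rules:
            # A missing pattern matches every record.
            if pattern is None or pattern(record, fields):
                if action is None:
                    out.append(record)      # a missing action means { print }
                else:
                    action(record, fields, out)
    for action in end:
        action(out)
    return out


# Hand-built equivalent of:
#   /London/ { print "City:", $3 } /^J/ { print "J-name:", $1 }
EXAMPLE_RULES = [
    (lambda rec, f: re.search(r"London", rec) is not None,
     lambda rec, f, out: out.append("City: " + f[2])),
    (lambda rec, f: re.search(r"^J", rec) is not None,
     lambda rec, f, out: out.append("J-name: " + f[0])),
]
```

Because the inner loop never stops at the first match, John's line triggers both rules, which is the behaviour the last test below checks.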

Testing:

ccawk '$2 > 25 { print $1 }' test.txt
ccawk '/London/ { print $1 }' test.txt
ccawk '$1 ~ /^[AJ]/ { print }' test.txt
ccawk 'BEGIN { print "Name Age" } { print $1, $2 } END { print "Done" }' test.txt
ccawk 'BEGIN { FS = ":" } { print $1 }' test2.txt
ccawk '$2 > 25 && $2 < 35 { print $1, "mid-range" }' test.txt
ccawk '/London/ { print "City:", $3 } /^J/ { print "J-name:", $1 }' test.txt

The first should print names of people older than 25. The second should print “John”. The third should print lines where the first field starts with A or J. The fourth should print a header, all names with ages, then “Done”. The fifth should set the field separator to colon inside BEGIN and print names from the colon-separated file. The sixth should print people whose age is between 25 and 35 exclusive. The seventh demonstrates multiple rules -- John’s line matches both patterns. Compare against awk.

Step 4

In this step your goal is to support variables, arithmetic operators, and assignment operators.

Awk variables are dynamically typed -- they can hold strings or numbers and convert between the two as needed. Uninitialised variables default to 0 when used as numbers and "" when used as strings.
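The conversions between strings and numbers are where most of the subtlety lives, so here is a rough Python sketch of them. The helper names to_number and to_string are hypothetical, the number-to-string rule hard-codes the default "%.6g" format rather than consulting CONVFMT, and edge cases such as hexadecimal input are ignored.

```python
import math
import re

# Optional leading blanks, optional sign, then a decimal number with an
# optional exponent. Only this leading portion of a string counts.
_NUMERIC_PREFIX = re.compile(r"\s*[-+]?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?")


def to_number(value):
    """Convert an awk value to a number: strings use their numeric prefix, else 0."""
    if isinstance(value, (int, float)):
        return float(value)
    match = _NUMERIC_PREFIX.match(value)
    return float(match.group(0)) if match else 0.0


def to_string(value):
    """Convert an awk value to a string: integral numbers print without a decimal point."""
    if isinstance(value, str):
        return value
    if math.isfinite(value) and abs(value) < 1e16 and value == int(value):
        return str(int(value))
    return "%.6g" % value   # assumption: the default conversion format


def concat(left, right):
    """String concatenation, as produced by placing two values side by side."""
    return to_string(left) + to_string(right)
```

An uninitialised variable can then be stored as the empty string: to_number("") gives 0 and to_string("") gives "", matching the rule above.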

Implement arithmetic operators (+, -, *, /, %, ^), assignment operators (=, +=, -=, *=, /=, %=, ^=), the increment and decrement operators (++ and --, which the tests in later steps rely on), and string concatenation (which in awk is done by placing values next to each other with no operator).

Testing:

ccawk '{ total += $2 } END { print "Total age:", total }' test.txt
ccawk '{ print $1, $2 * 2 }' test.txt
ccawk '{ name = $1 " from " $3; print name }' test.txt
ccawk 'BEGIN { x = 2; print x ^ 10 }'

The first should print the sum of all ages. The second should print names with doubled ages. The third should concatenate fields with text. The fourth should print 1024. Compare against awk.

Step 5

In this step your goal is to support control flow: if/else, while, do-while, and C-style for loops.

Also implement break and continue for loops, next to skip to the next input record, exit to stop processing entirely, and the ternary conditional operator (condition ? value_if_true : value_if_false).
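In a tree-walking interpreter, one common way to implement next and exit is to raise an exception that unwinds back to the main loop (break and continue can be handled the same way, or mapped onto the host language's own loop control). The Python sketch below shows the idea; the class and function names are invented for illustration, and the rules are hand-written callables standing in for parsed awk code.

```python
class NextRecord(Exception):
    """Raised by 'next': abandon the remaining rules for this record."""


class ExitProgram(Exception):
    """Raised by 'exit': stop reading input (END actions still run)."""


def run(rules, end_actions, lines):
    output = []
    try:
        for nr, raw in enumerate(lines, start=1):
            record = raw.rstrip("\n")
            fields = record.split()
            try:
                for rule in rules:
                    rule(nr, record, fields, output)
            except NextRecord:
                continue            # move straight on to the next record
    except ExitProgram:
        pass                        # fall through to the END actions
    for action in end_actions:
        action(output)
    return output


def skip_bob(nr, record, fields, output):           # $1 == "Bob" { next }
    if fields and fields[0] == "Bob":
        raise NextRecord


def print_record(nr, record, fields, output):       # { print }
    output.append(record)


def print_first_three(nr, record, fields, output):  # { print; if (NR == 3) exit }
    output.append(record)
    if nr == 3:
        raise ExitProgram
```

Note that this sketch only covers exit inside a main rule. An exit inside an END action should stop the program without running further END actions, which would need one more try/except around that final loop.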

Testing:

ccawk '{ if ($2 > 25) print $1, "senior"; else print $1, "junior" }' test.txt
ccawk '{ for (i = 1; i <= NF; i++) print $i }' test.txt
ccawk '$1 == "Bob" { next } { print }' test.txt
ccawk '{ print; if (NR == 3) exit }' test.txt
ccawk '{ print ($2 > 25) ? $1 " is senior" : $1 " is junior" }' test.txt

The first should label people as senior or junior based on age. The second should print every field on its own line. The third should skip Bob’s line and print everything else. The fourth should print only the first three lines. The fifth uses the ternary operator to produce the same senior/junior labelling in a different style. Compare against awk.

Step 6

In this step your goal is to support associative arrays and the for (key in array) construct.

Associative arrays are one of awk’s most powerful features. They’re indexed by strings (not just integers) and can be used to count, group, and aggregate data. Implement the in operator for testing membership, for (key in array) for iterating over keys, and the delete statement for removing elements.
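A thin wrapper around a dictionary is probably enough to model this. The Python sketch below uses a hypothetical AwkArray class; the two behaviours worth noticing are that subscripts are always stored as strings, and that merely reading a missing element creates it while the in test does not. Non-integral numeric subscripts are converted with Python's str here, where awk would use CONVFMT.

```python
class AwkArray:
    """A sketch of an awk associative array backed by a dict."""

    def __init__(self):
        self._items = {}

    @staticmethod
    def _key(subscript):
        # Subscripts are strings, so a[1] and a["1"] name the same element.
        if isinstance(subscript, float) and subscript.is_integer():
            return str(int(subscript))
        return str(subscript)

    def get(self, subscript):
        # Referencing a missing element creates it as an uninitialised value.
        return self._items.setdefault(self._key(subscript), "")

    def set(self, subscript, value):
        self._items[self._key(subscript)] = value

    def contains(self, subscript):
        # The 'in' operator tests membership without creating the element.
        return self._key(subscript) in self._items

    def delete(self, subscript):
        self._items.pop(self._key(subscript), None)

    def keys(self):
        # Copy the keys so the loop body may delete elements safely.
        return list(self._items)


def increment(array, subscript):
    """Emulate array[subscript]++ where an uninitialised value counts as 0."""
    array.set(subscript, float(array.get(subscript) or 0) + 1)
```

A Python dict happens to iterate in insertion order, which is a perfectly valid choice since awk leaves the order unspecified.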

Testing:

ccawk '{ count[$3]++ } END { for (city in count) print city, count[city] }' test.txt
ccawk '{ ages[$1] = $2 } END { if ("Bob" in ages) print "Bob is", ages["Bob"] }' test.txt
ccawk '{ a[$1] = $2 } END { delete a["Bob"]; for (k in a) print k, a[k] }' test.txt

The first should count how many people live in each city. The second should check if Bob exists and print his age. The third should delete Bob and print the rest. Note that for (key in array) iterates in an unspecified order, so don’t worry about the ordering of output lines -- just make sure the content matches.

Step 7

In this step your goal is to support the printf statement and built-in string functions.

Implement printf with C-style format strings (supporting at least %d, %f, %s, %c, and %x with width and precision specifiers).

Implement these built-in string functions: length, substr, index, split, sub, gsub, match, sprintf, tolower, and toupper.
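Two of these functions have behaviour that is easy to get subtly wrong: substr uses 1-based positions, and in sub and gsub an unescaped & in the replacement stands for the matched text. Here is a hedged Python sketch of both. The names awk_substr and awk_gsub are made up, only integer positions are handled, and Python's re module stands in for POSIX extended regular expressions, so results may differ for unusual patterns or empty matches.

```python
import re


def awk_substr(s, start, length=None):
    """Return the characters of s at 1-based positions [start, start + length)."""
    begin = max(start, 1)                       # positions before 1 do not exist
    if length is None:
        end = len(s) + 1
    else:
        end = min(start + length, len(s) + 1)   # clip to the end of the string
    return s[begin - 1:end - 1] if end > begin else ""


def awk_gsub(pattern, replacement, target):
    """Replace every match; return (new_string, number_of_replacements)."""
    def expand(match):
        out, i = [], 0
        while i < len(replacement):
            ch = replacement[i]
            if ch == "\\" and i + 1 < len(replacement) and replacement[i + 1] == "&":
                out.append("&")                 # \& is a literal ampersand
                i += 2
            elif ch == "&":
                out.append(match.group(0))      # & is the matched text
                i += 1
            else:
                out.append(ch)
                i += 1
        return "".join(out)

    return re.subn(pattern, expand, target)
```

In real awk, gsub modifies its target in place and returns only the count, so an interpreter would assign the first element of the result back to the target variable or field.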

Testing:

ccawk '{ printf "%-10s %3d %s\n", $1, $2, $3 }' test.txt
ccawk '{ print length($1) }' test.txt
ccawk '{ print substr($1, 1, 3) }' test.txt
ccawk '{ gsub(/o/, "0", $1); print }' test.txt
ccawk '{ print toupper($1) }' test.txt

The first should print a neatly formatted table. The second should print the length of each name. The third should print the first three characters of each name. The fourth should replace all “o” characters with “0” in the first field. The fifth should print names in uppercase. Compare against awk.

Step 8

In this step your goal is to support user-defined functions and built-in arithmetic functions.

Implement user-defined functions with the syntax function name(params) { body }. Functions should support local variables (declared as extra parameters in the function signature, which is how awk handles local scope) and return values with return.
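One possible way to model function calls is to give each call its own frame of local variables, seeded from the parameter list, and to implement return as an exception. The Python sketch below is only an illustration: Function, ReturnValue, and call are hypothetical names, function bodies are hand-written callables rather than parsed awk, and array parameters (which awk passes by reference) are not covered.

```python
class ReturnValue(Exception):
    """Raised by 'return' to unwind out of the function body."""

    def __init__(self, value):
        super().__init__()
        self.value = value


class Function:
    def __init__(self, params, body):
        self.params = params    # includes the "extra" parameters used as locals
        self.body = body        # callable(local_frame, global_vars)


def call(func, args, global_vars):
    # Every parameter starts uninitialised; supplied arguments fill the first few.
    frame = {name: "" for name in func.params}
    frame.update(zip(func.params, args))
    try:
        func.body(frame, global_vars)
    except ReturnValue as result:
        return result.value
    return ""                   # falling off the end yields an uninitialised value


# function max(a, b) { return a > b ? a : b }
def _max_body(local, glob):
    raise ReturnValue(local["a"] if local["a"] > local["b"] else local["b"])


# function sum_to(n,    i, total) { for (i = 1; i <= n; i++) total += i; calls++; return total }
def _sum_to_body(local, glob):
    local["total"] = 0
    local["i"] = 1
    while local["i"] <= local["n"]:
        local["total"] += local["i"]
        local["i"] += 1
    glob["calls"] = glob.get("calls", 0) + 1    # not a parameter, so it is global
    raise ReturnValue(local["total"])


MAX = Function(["a", "b"], _max_body)
SUM_TO = Function(["n", "i", "total"], _sum_to_body)
```

The second function shows the local-variable convention: i and total live only in the call's frame, while calls, which is not in the parameter list, lands in the global table.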

Implement the built-in arithmetic functions: int, sqrt, sin, cos, atan2, exp, log, rand, and srand.

Testing:

ccawk 'function max(a, b) { return a > b ? a : b } { print $1, max($2, 30) }' test.txt
ccawk 'BEGIN { srand(42); for (i = 0; i < 5; i++) printf "%.4f\n", rand() }'
ccawk '{ print $1, int(sqrt($2)) }' test.txt

The first should print each name alongside the greater of their age or 30. The second should print 5 random numbers. The third should print names with the integer square root of their age. Compare against awk (except for random numbers where the seed behaviour may differ).

Step 9

In this step your goal is to support the remaining output and input features.

Implement the built-in separator variables: OFS (output field separator), ORS (output record separator), and RS (input record separator). When print is given multiple expressions separated by commas, it writes OFS between them. Each print statement ends its output with ORS.
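As a rough illustration of how these separators interact, here is a small Python sketch. The helper names are invented, read_records only handles a simple literal separator (not the paragraph mode that an empty RS selects), and rebuild_record shows the related rule that assigning to a field rebuilds $0 using OFS.

```python
def awk_print(values, ofs=" ", ors="\n"):
    """Text produced by 'print a, b, c': values joined by OFS, ended by ORS."""
    return ofs.join(values) + ors


def rebuild_record(fields, ofs=" "):
    """After a field is assigned, $0 is reconstructed by joining fields with OFS."""
    return ofs.join(fields)


def read_records(text, rs="\n"):
    """Split input into records on a literal separator (a simplification of RS)."""
    records = text.split(rs)
    if records and records[-1] == "":
        records.pop()   # a trailing separator does not start an extra record
    return records
```

The rebuild step is what makes the earlier gsub test on $1 followed by a bare print show the modified field.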

Implement the -f flag to read the awk program from a file instead of the command line, -v var=value for setting variables before execution, and support for reading from multiple input files. Implement the FILENAME, ARGC, and ARGV built-in variables.

Testing:

ccawk 'BEGIN { OFS="-" } { print $1, $2, $3 }' test.txt
echo 'BEGIN { print "Start" } { print FILENAME, $0 }' > prog.awk
ccawk -f prog.awk test.txt
ccawk -v threshold=25 '$2 > threshold { print $1 }' test.txt
ccawk '{ print FILENAME, $0 }' test.txt test2.txt

The first should print fields separated by dashes. The second should read the program from a file and print each line with its filename. The third should use the command-line variable. The fourth should process both files and show which file each line came from. Compare against awk.

Step 10

In this step your goal is to support piping output to shell commands and the getline function.

Implement the pipe operator for print, which lets you send output to an external command: print "hello" | "sort". Awk keeps the pipe open across multiple print statements to the same command, so all the output goes to a single invocation of the command. Implement the close() function to close a pipe or file, which is needed when you want to reopen a pipe or when the external command needs to receive EOF to produce output.
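The "keep the pipe open" behaviour suggests a table of running commands keyed by the command string. Below is a minimal Python sketch of that idea using the standard subprocess module. OutputPipes is a hypothetical class name, the command is run through the shell (assuming a Unix-like system), and error handling is left out.

```python
import subprocess


class OutputPipes:
    """Track one running process per command string used with print | "cmd"."""

    def __init__(self):
        self._open = {}

    def write(self, command, text):
        proc = self._open.get(command)
        if proc is None:
            # First use of this command: start it once and remember it.
            proc = subprocess.Popen(command, shell=True,
                                    stdin=subprocess.PIPE, text=True)
            self._open[command] = proc
        proc.stdin.write(text)

    def close(self, command):
        proc = self._open.pop(command, None)
        if proc is None:
            return -1               # nothing open under that name
        proc.stdin.close()          # the command sees EOF and can finish
        return proc.wait()          # exit status of the command

    def close_all(self):
        # Call this when the program ends so commands like sort flush their output.
        for command in list(self._open):
            self.close(command)
```

For '{ print $1 | "sort" }' the interpreter would call write("sort", name + ORS) for each record and close_all() at exit, so sort receives every name in a single invocation.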

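Implement getline in its various forms: getline to read the next record from the current input, getline var to read it into a specific variable, getline < "file" and getline var < "file" to read from a file, and "command" | getline and "command" | getline var to read from a command. getline returns 1 when it reads a record, 0 at end of input, and -1 on error, which is why the tests below loop while the result is greater than 0.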

Testing:

ccawk '{ print $1 | "sort" }' test.txt
ccawk 'BEGIN { while (("date" | getline line) > 0) print line; close("date") }'
ccawk 'BEGIN { while ((getline line < "test.txt") > 0) print line }'
ls -l | ccawk 'NR > 1 { total += $5 } END { print "Total bytes:", total }'

The first should print names in sorted order. The second should print the current date; it runs inside BEGIN so that it executes once without waiting for input. The third should read and print the test file from within a BEGIN block. The fourth is a classic awk one-liner that sums file sizes from ls output. Compare against awk.

Going Further

Here are some ideas to take your awk implementation further:

Add support for multi-character record separators (which gawk supports but POSIX awk does not)
Implement the OFMT and CONVFMT variables for controlling numeric-to-string conversion
Add support for range patterns (/start/,/stop/) which match all lines between two patterns
Implement coprocess communication with |& (a gawk extension)
Add support for @include to include other awk source files (a gawk extension)
Build a bytecode compiler and virtual machine instead of a tree-walking interpreter for better performance
Add support for the ENVIRON array for accessing environment variables
Implement nextfile to skip to the next input file
Try running your awk against real-world awk scripts (there are many collected online) and see how compatible your implementation is

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

If you think your solution is an example other developers can learn from, please share it: put it on GitHub, GitLab or elsewhere. Then let me know via Bluesky or LinkedIn, or just post about it there and tag me. Alternatively, please add a link to it in the Coding Challenges Shared Solutions GitHub repo.

Request for Feedback

I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback is greatly appreciated.

You can reach me on Bluesky, LinkedIn or through Substack.

Thanks and happy coding!

John
