Coding Challenges - Does AI Write Good Code? Let's Find Out.
Evaluating AI-Generated Code with the SonarQube MCP Server.
Hi, this is John with this week’s Coding Challenge and it’s going to be a bit different.
🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉
If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧
Coding Challenge - Does AI Write Good Code?
AI is changing software engineering. AI can write code faster than you or I can. That's exciting, but it creates a new problem: just because code works doesn't mean it's good. How do you know if what your LLM generated is secure, maintainable, and ready for production?
There are two things you can do. Firstly, follow industry research, like Sonar’s LLM Leaderboard, which looks at the quality, security, complexity, and maintainability of the code created using the leading LLMs. It’s well worth a read to understand the strengths and weaknesses of the models. I found it particularly eye-opening to see that GPT-5.2 High generates around 50% more code than Opus 4.5 for the same tasks, and Opus 4.5 still generates more than twice as much code as Gemini 3 Pro! I know which codebase I’d rather be responsible for!
Secondly, there are many tools we can leverage to evaluate aspects of code quality, maintainability and security. They include compilers, type checkers, linters, and automated code review tools like SonarQube. In today’s Coding Challenge we’re going to look at how we can leverage them to guide and evaluate AI when building software.
Step Zero
In this step your goal is to pick a Coding Challenge, technology stack and AI coding agent of your choice. If you primarily use Copilot at work, consider trying Amp Code; if you mainly use Claude, try Copilot. In short, try a different coding agent and learn something new.
Step 1
In this step your goal is to build a solution to one of the Coding Challenges using your favourite agent / LLM. I’ll go into more detail on how to leverage AI agents in a future newsletter, but for now I suggest prompting the agent to tackle one step of the Coding Challenge at a time. Between steps, or whenever the context window starts to fill up or the agent starts hallucinating, clear the context window.
Once your solution is complete, head to step 2 to start leveraging tools to assess the quality and security of the code produced by your AI.
Step 2
In this step your goal is to prompt your agent to review the code quality using the compiler, code formatter and linter appropriate to your programming language and stack.
For example, if you’re using Python, run checks with ruff, ty, pyrefly or pyright. If you’re using JavaScript, switch to TypeScript 😇. If you’re using Rust, use clippy; for Go, check out golangci-lint. You get the idea.
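As a rough illustration, for a Go project the checks might look something like this (swap in the equivalents for your own stack):
go vet ./...
gofmt -l .
golangci-lint run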
Step 3
In this step your goal is to install the SonarQube MCP Server and use it to evaluate the code created by your agent.
SonarQube is an automated code review solution that performs deep code analysis to detect bugs, security vulnerabilities, and quality issues that simple linters miss. It can also be used to enforce rules for maintainability, reliability, and security before merging. It’s been widely used in the organisations I’ve worked for, and I’ve often been grateful for the feedback it provided my teams. So for this step you’re going to use it, via the MCP server, to analyse the code your agent has produced.
To get started, see Sonar’s instructions on how to install their MCP server. I found the quickstart guide particularly useful and chose to use the container running in Docker. I then configured my agent to use SonarQube via the MCP server.
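For reference, most agents use a similar JSON shape for MCP server entries. The snippet below is purely illustrative; take the exact image name, token variable and config file location from Sonar’s quickstart guide and your agent’s documentation:
{
  "mcpServers": {
    "sonarqube": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "SONARQUBE_TOKEN=<your-token>", "<sonarqube-mcp-image>"]
    }
  }
}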
Once you have it installed, check with your agent to verify the MCP server is configured. For example, with Copilot run:
/mcp show
You should then see some output that includes something like this:
Configured MCP servers: sonarqube
MCP Server Configuration:
sonarqube (local): Command: docker
Once you have that done, ask your agent to review the project with SonarQube, for example:
Review this code with SonarQube
I tried it on the Monkey compiler project I did last year and got the following output:
Total Issues Found: 40
Issue Breakdown by Severity:
- CRITICAL: 37 issues
- MAJOR: 1 issue
- MINOR: 2 issues
Key Issues Identified:
1. High Cognitive Complexity (6 issues):
- compiler/compiler.go:72 - Complexity: 96 (allowed: 15)
- vm/vm.go:72 - Complexity: 69 (allowed: 15)
- vm/vm_test.go:66 - Complexity: 34 (allowed: 15)
- evaluator/evaluator.go:15 - Complexity: 25 (allowed: 15)
- compiler/symbol_table_test.go:166 - Complexity: 16 (allowed: 15)
- compiler/compiler_test.go:649 - Complexity: 16 (allowed: 15)
2. Duplicated String Literals (14 issues): Various test files contain repeated error message strings that should be defined as constants.
3. Empty Functions (13 issues): Multiple empty functions in ast/ast.go lack explanatory comments about why they're intentionally empty.
4. Other Issues:
- Empty code block in compiler/compiler_test.go:661
- Interface naming convention in object/object.go:72
- Parameter grouping in compiler/compiler.go:353
The most critical issues are the high cognitive complexity in the compiler and VM modules, which could impact maintainability.
You can then use the agent to explore the issues in more depth.
Step 4
In this step your goal is to pick out three of the points that SonarQube highlights and investigate them. Review the code yourself and either fix it, or prompt your agent to fix the code, including the output from SonarQube in your prompt.
I did this by prompting the agent for the highest priority issues:
Show me the three highest priority issues
Which gave me:
Top 3 Highest Priority Issues
1. compiler/compiler.go:72 - Cognitive Complexity: 96 ⚠️ CRITICAL
Function: Compile(node ast.Node)
Problem: This massive switch statement handles all AST node compilation with deeply nested conditionals and error handling. The complexity is 6.4x over the allowed limit (96 vs 15).
Why it matters: This is the core compiler function - extremely difficult to understand, test, and modify. Each new language feature adds more nesting.
Recommendation: Refactor using the strategy pattern - extract each case into separate compilation methods:
func (c *Compiler) compileProgram(node *ast.Program) error { ... }
func (c *Compiler) compileInfixExpression(node *ast.InfixExpression) error { ... }
func (c *Compiler) compileIfExpression(node *ast.IfExpression) error { ... }
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2. vm/vm.go:72 - Cognitive Complexity: 69 ⚠️ CRITICAL
Function: Run()
Problem: Main VM execution loop with a giant switch statement for all opcodes. Complexity is 4.6x over the limit (69 vs 15).
Why it matters: This is the VM's heart - any bug here affects runtime execution. The nested switch cases, error checks, and instruction pointer manipulation make it hard to reason about.
Recommendation: Extract opcode handlers into separate methods:
func (vm *VM) handleOpConstant(ins code.Instructions, ip int) error { ... }
func (vm *VM) handleOpJump(ins code.Instructions, ip int) error { ... }
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3. vm/vm_test.go:66 - Cognitive Complexity: 34 ⚠️ CRITICAL
Function: Test helper function
Problem: Test code with complexity 2.3x over the limit (34 vs 15), likely due to nested loops validating complex test cases.
Why it matters: Complex test code is harder to maintain and can hide bugs in the tests themselves.
Recommendation: Break into smaller helper functions for specific validation types (integers, booleans, arrays, etc.).
Impact: These three functions represent the core of your interpreter - the compiler, VM, and tests. High complexity makes maintenance risky and onboarding difficult.
Look at the issues you find and either fix them by hand or have the agent fix them. Then use the agent and SonarQube to review the fixes; it’s a great way to improve your code quality.
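To make the first recommendation concrete, here’s a minimal Go sketch of that kind of refactor, using simplified stand-in node types rather than the real Monkey ast package: the top-level Compile becomes a thin dispatcher and each case moves into its own small method, which is what brings the cognitive complexity down.
package main

import "fmt"

// Stand-in node types for illustration only; the real project uses the
// Monkey ast package.
type Node interface{}

type Program struct{ Statements []Node }

type InfixExpression struct {
	Left, Right Node
	Operator    string
}

type IntegerLiteral struct{ Value int64 }

type Compiler struct{ instructions []string }

// Compile becomes a thin dispatcher: each case is a single call, so no one
// function accumulates the nesting that drives up cognitive complexity.
func (c *Compiler) Compile(node Node) error {
	switch node := node.(type) {
	case *Program:
		return c.compileProgram(node)
	case *InfixExpression:
		return c.compileInfixExpression(node)
	case *IntegerLiteral:
		return c.compileIntegerLiteral(node)
	default:
		return fmt.Errorf("unknown node type %T", node)
	}
}

func (c *Compiler) compileProgram(node *Program) error {
	for _, s := range node.Statements {
		if err := c.Compile(s); err != nil {
			return err
		}
	}
	return nil
}

func (c *Compiler) compileInfixExpression(node *InfixExpression) error {
	if err := c.Compile(node.Left); err != nil {
		return err
	}
	if err := c.Compile(node.Right); err != nil {
		return err
	}
	c.instructions = append(c.instructions, node.Operator)
	return nil
}

func (c *Compiler) compileIntegerLiteral(node *IntegerLiteral) error {
	c.instructions = append(c.instructions, fmt.Sprintf("push %d", node.Value))
	return nil
}

func main() {
	c := &Compiler{}
	program := &Program{Statements: []Node{
		&InfixExpression{
			Left:     &IntegerLiteral{Value: 1},
			Operator: "+",
			Right:    &IntegerLiteral{Value: 2},
		},
	}}
	if err := c.Compile(program); err != nil {
		fmt.Println("compile error:", err)
		return
	}
	fmt.Println(c.instructions) // [push 1 push 2 +]
}
In the real codebase each method body would emit bytecode through the compiler’s existing helpers; the point here is the shape of the dispatch, not the stand-in types.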
Step 5
In this step your goal is to reflect on how to use LLMs and how tools like compilers, linters and SonarQube can help you improve the quality of the software you build.
I recommend adding them to your normal agentic coding process. For me, that means incorporating them into the prompt for each task, with something like this in my prompts:
After reading the specification:
1. Create a set of tests to verify the implementation behaves correctly.
2. Then create the code required to meet the specification.
3. Verify the functionality is correct using the tests.
4. Verify the code lints and passes quality checks with no warnings or errors.
My AGENTS.md usually defines how to run the linter and quality checks for the project.
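If you don’t have one yet, the relevant AGENTS.md section can be as simple as this sketch (illustrative, assuming a Go project using golangci-lint and the SonarQube MCP server):
## Quality checks
- Run the tests: go test ./...
- Run the linter: golangci-lint run (must pass with no warnings)
- Review new and changed code with SonarQube via the sonarqube MCP server and address any issues it reports.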
Going Further
Review the LLM Leaderboard that Sonar created to provide transparency into how models build code, not just what they build. By running thousands of AI-generated solutions through SonarQube, they evaluated the models on the metrics that matter to engineering leaders: security, reliability, maintainability, and complexity.
To generate the leaderboard, Sonar analysed code quality from leading AI models (GPT-5.2 High, GPT-5.1 High, Gemini 3 Pro, Opus 4.5 Thinking, and Claude Sonnet 4.5).
It was interesting to see that while these models pass functional benchmarks well, they have significant differences in code quality, security, and maintainability.
Higher-performing models tend to generate more verbose and complex code. For example:
Opus 4.5 Thinking leads with 83.62% pass rate but generates 639,465 lines of code (more than double the less verbose models).
Gemini 3 Pro achieves similar performance (81.72%) with much lower complexity and verbosity.
GPT-5.2 High hits 80.66% pass rate but produces the most code (974,379 lines) and shows worse maintainability than GPT-5.1.
I found it particularly interesting to see that Gemini produced only 289k lines. That’s a lot less code to review and maintain!
Many thanks to Sonar for sponsoring this issue of Coding Challenges.
P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!
Share Your Solutions!
If you think your solution is an example other developers can learn from, please share it, put it on GitHub, GitLab or elsewhere. Then let me know via Bluesky or LinkedIn, or just post about it there and tag me. Alternatively, please add a link to it in the Coding Challenges Shared Solutions GitHub repo.
Request for Feedback
I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback greatly appreciated.
You can reach me on Bluesky, LinkedIn or through Substack.
Thanks and happy coding!
John

