This is the final post in the Velocity vs Value series. Previously: The Velocity Trap, AI Greenfield vs Brownfield, and Measuring What Matters.


AI coding tools accelerate delivery — we’ve covered the data in this series. They’re genuinely valuable. But acceleration without visibility creates a specific risk: quality degradation that nobody notices until it compounds.

Here’s how it plays out: an AI tool generates a feature in 20 minutes. It passes all automated tests. CI is green. The PR is merged. Velocity: +1. Two weeks later, a security audit flags a pattern. A month later, a production issue traces back to a function with cyclomatic complexity of 47. Three months later, the module is hard to modify because the generated code lacks documentation and clear structure.

The velocity chart never noticed any of this. And that’s not a failure of velocity — it’s a gap in what we’re measuring. This post is about closing that gap.

The Data: AI Code vs. Human Code

Let’s be precise about what research tells us. These numbers come from peer-reviewed studies and large-scale analyses of real-world codebases:

Defect Density and Churn

  • AI-generated code shows 1.7x higher defect density than human-written code
  • 2x code churn — meaning AI code gets rewritten more often
  • An analysis of 470 GitHub pull requests found AI-generated PRs contained 1.7x more issues overall

Specific Quality Dimensions

  • Logic and correctness issues: 75% more common in AI code
  • Readability issues: 3x more common
  • Error handling gaps: nearly 2x more frequent
  • Security vulnerabilities: up to 2.74x higher
  • Performance inefficiencies: 8x more frequent

The “Passes Tests, Fails Review” Problem

AI-generated code frequently passes automated test suites but gets rejected by human reviewers. Automated benchmarks significantly overestimate production-readiness. The gap exists because AI code often fails on “soft” requirements:

  • Repository coding standards
  • Consistent architectural patterns
  • Domain-specific conventions
  • Edge cases that require business context

This is why velocity metrics need quality context. A feature that passes CI isn’t necessarily a feature that’s ready for production — and the team is usually the first to sense it, long before any chart does.

The Metrics That Matter

If you’re going to use AI coding tools — and you should, they’re genuinely useful — you need to track code quality at the source. Here are the metrics I monitor, split by code origin (AI-assisted vs. human-written):

1. Cyclomatic Complexity

What it measures: The number of independent paths through a function. Higher = more complex = harder to test and maintain.

AI reality: Studies show AI-generated Python code averages a cyclomatic complexity of 5.0, compared to 3.1 for human code — 61% higher. AI tends to generate more branching logic, extra null checks, and redundant condition handling.

Threshold: Flag any function above 10. Investigate anything above 15. AI-generated functions above 20 should be rewritten by hand.
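The threshold check above can be sketched as a rough AST-based counter. This is illustrative only — `cyclomatic_complexity` and `flag` are hypothetical helpers, and a real analyzer (radon, SonarQube) handles many more constructs:

```python
# Rough cyclomatic complexity estimate: 1 + number of decision points.
# A sketch, not a replacement for a real analyzer like radon or SonarQube.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity for a snippet of Python source."""
    tree = ast.parse(source)
    complexity = 1  # one path exists even with no branches
    for node in ast.walk(tree):
        if isinstance(node, ast.BoolOp):
            # each extra and/or operand adds a path
            complexity += len(node.values) - 1
        elif isinstance(node, BRANCH_NODES):
            complexity += 1
    return complexity

def flag(source: str, name: str = "<fn>") -> str:
    """Apply the thresholds from this section to one function."""
    cc = cyclomatic_complexity(source)
    if cc > 15:
        return f"{name}: CC={cc} - investigate"
    if cc > 10:
        return f"{name}: CC={cc} - flag"
    return f"{name}: CC={cc} - ok"
```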

2. Cognitive Complexity

What it measures: How hard code is for a human to understand. Unlike cyclomatic complexity (which counts paths), cognitive complexity penalizes deep nesting, multiple break conditions, and interleaved logic.

Why it matters more than cyclomatic: A function with cyclomatic complexity 8 might be a clean switch statement (easy to read) or a deeply nested if-else maze (impossible to read). Cognitive complexity distinguishes between them.

AI reality: AI-generated code often creates deeply nested structures and unclear control flow. High cognitive complexity slows down every developer who touches the code after it was generated.
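To make the distinction concrete, here are two hypothetical functions with identical behavior and identical cyclomatic complexity (5: four decisions plus one), but very different cognitive load. The score comments approximate SonarQube’s published rules, which add a growing penalty per nesting level:

```python
# Same logic, same cyclomatic complexity, different cognitive complexity.
# Score comments are approximate, following SonarQube's rules.

def shipping_flat(weight: float) -> str:
    # Flat elif chain: roughly cognitive complexity 4 (+1 per branch,
    # no nesting penalty). Easy to scan top to bottom.
    if weight <= 1:
        return "letter"
    elif weight <= 5:
        return "small parcel"
    elif weight <= 20:
        return "parcel"
    elif weight <= 50:
        return "freight"
    return "pallet"

def shipping_nested(weight: float) -> str:
    # Same four decisions, nested: roughly cognitive complexity 10
    # (+1, +2, +3, +4 as depth increases). Much harder to follow.
    if weight > 1:
        if weight > 5:
            if weight > 20:
                if weight > 50:
                    return "pallet"
                return "freight"
            return "parcel"
        return "small parcel"
    return "letter"
```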

3. Maintainability Index

What it measures: A composite score (0-100) reflecting how easy code is to understand and modify. Combines lines of code, cyclomatic complexity, and Halstead metrics.

AI reality: Conflicting research here. Some studies show AI code scores higher on maintainability (more comments, shorter functions). Others show lower scores (redundant code, unclear dependencies). The discrepancy likely reflects different AI models and tasks.

My recommendation: Don’t rely on this as a single metric. Use it alongside cyclomatic and cognitive complexity for a triangulated view.
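For reference, the classic composite can be sketched from its three inputs. This is the widely used Visual Studio variant of the formula, rescaled to 0–100; real tools compute the inputs themselves:

```python
# Maintainability Index (Visual Studio variant, rescaled to 0-100).
# Inputs come from static analysis; this just combines them.
import math

def maintainability_index(halstead_volume: float,
                          cyclomatic_complexity: float,
                          lines_of_code: int) -> float:
    raw = (171
           - 5.2 * math.log(halstead_volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(lines_of_code))
    return max(0.0, raw * 100 / 171)
```

Note how the logarithms mean size and volume dominate at small scales, which is one reason short, heavily commented AI output can score deceptively well.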

4. Duplication Percentage

What it measures: The amount of copy-pasted or near-identical code in the codebase.

AI reality: This is where AI tools consistently fail. Unlike human developers who recognize patterns and create abstractions, AI often treats each generation request as isolated. The result: the same logic duplicated across multiple files, sometimes with subtle variations that make refactoring dangerous.

Threshold: Overall codebase duplication above 5% is a warning sign. AI-heavy codebases frequently hit 10-15%.
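A naive detector shows the idea: index sliding windows of normalized lines and count lines that appear in any repeated window. This is a sketch; production tools (SonarQube, PMD CPD) use token-based matching that survives renames and reformatting:

```python
# Naive duplication estimate: lines covered by any repeated N-line window.
# Illustrative only; real tools use token-based clone detection.
from collections import defaultdict

def duplication_pct(files: dict[str, str], window: int = 4) -> float:
    """Percent of non-blank lines that sit inside a repeated window."""
    seen = defaultdict(list)           # window contents -> [(file, start)]
    normalized = {}
    for name, text in files.items():
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        normalized[name] = lines
        for i in range(len(lines) - window + 1):
            seen[tuple(lines[i:i + window])].append((name, i))
    dup_lines = set()                  # (file, line index) pairs duplicated
    for hits in seen.values():
        if len(hits) > 1:
            for name, start in hits:
                dup_lines.update((name, start + j) for j in range(window))
    total = sum(len(v) for v in normalized.values())
    return 100.0 * len(dup_lines) / total if total else 0.0
```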

5. Security Hotspots


What it measures: Code areas with elevated security risk — hardcoded secrets, unsanitized inputs, insecure cryptographic patterns, improper authentication handling.

AI reality: Approximately 12% of AI-generated code contains identifiable security vulnerabilities. Common issues: hardcoded API keys, missing input sanitization, insecure default configurations, cross-site scripting (XSS) patterns. AI models, trained on vast amounts of public code (including insecure examples), reproduce these patterns without understanding the risk.

Non-negotiable: Run security scanning on every PR. Split results by AI vs. human origin. If AI-generated code shows elevated security hotspots, that’s a process problem, not a speed problem.
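A minimal local pre-push check might look like the sketch below. The patterns are illustrative, not exhaustive — in CI you would rely on a dedicated scanner such as Semgrep, CodeQL, or gitleaks:

```python
# Minimal hotspot scan: flag likely hardcoded secrets and risky patterns.
# Patterns are illustrative; use a real scanner (Semgrep, CodeQL) in CI.
import re

HOTSPOT_PATTERNS = [
    (re.compile(r'(?i)(api[_-]?key|secret|password|token)\s*=\s*["\'][^"\']{8,}["\']'),
     "possible hardcoded credential"),
    (re.compile(r'(?i)md5|sha1\('), "weak hash algorithm"),
    (re.compile(r'verify\s*=\s*False'), "TLS verification disabled"),
]

def scan(source: str) -> list[tuple[int, str]]:
    """Return (line number, message) for each hotspot match."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, message in HOTSPOT_PATTERNS:
            if pattern.search(line):
                findings.append((lineno, message))
    return findings
```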

6. Test Coverage (with context)

What it measures: Percentage of code exercised by automated tests.

Caveat: Raw coverage numbers are misleading. AI tools can generate tests that achieve 90% coverage but test trivial paths and miss critical edge cases. Look at mutation testing scores or at least branch coverage rather than line coverage.
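A tiny worked example of why raw coverage misleads: both tests below fully execute the function, but only the stronger one “kills” a mutated operator. The mutant is hand-rolled here for illustration; mutation testing tools such as mutmut automate this:

```python
# Coverage vs. mutation: a weak test covers every line yet lets a
# mutated operator survive. Hand-rolled illustration of mutation testing.

def apply_discount(price: float, pct: float) -> float:
    return price * (1 - pct / 100)

def mutant(price: float, pct: float) -> float:
    # Mutated operator: '-' became '+'. Same lines run => same coverage.
    return price * (1 + pct / 100)

def weak_test(fn) -> bool:
    # 100% line coverage, but only the zero-discount case is checked,
    # so the mutant passes too ("survives").
    return fn(100.0, 0.0) == 100.0

def strong_test(fn) -> bool:
    # Checks a real discount: kills the mutant.
    return abs(fn(100.0, 10.0) - 90.0) < 1e-9
```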

7. Tech Debt (hours)

What it measures: Estimated remediation time for all code issues flagged by static analysis.

Track the trend: If tech debt hours per sprint are increasing while velocity is also increasing, your velocity is borrowing from the future. That’s not productivity — it’s a credit card.
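That trend check is easy to automate. A minimal sketch, assuming you already collect debt hours and velocity per sprint:

```python
# "Borrowing from the future" check: tech debt rising alongside velocity.
def sustainability_warning(debt_hours: list[float],
                           velocity: list[float]) -> bool:
    """True if the last three sprints show both series strictly rising."""
    recent_debt, recent_vel = debt_hours[-3:], velocity[-3:]
    debt_rising = all(b > a for a, b in zip(recent_debt, recent_debt[1:]))
    vel_rising = all(b > a for a, b in zip(recent_vel, recent_vel[1:]))
    return debt_rising and vel_rising
```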

Building the Technical Guardrails Dashboard

Important framing: This dashboard isn’t about challenging velocity or questioning AI adoption. Managers need velocity visibility — that’s legitimate. This dashboard exists for the engineering team itself. It gives the team objective data on whether the current pace and methodology are sustainable in the long run. It’s a steering tool, not a brake.

The Layout

Code Quality by Origin (AI vs. Manual)

| Metric | AI-Assisted | Human-Written | Delta | Trend |
| --- | --- | --- | --- | --- |
| Avg. Cyclomatic Complexity | 6.2 | 3.8 | +63% | |
| Avg. Cognitive Complexity | 8.4 | 4.1 | +105% | |
| Maintainability Index | 62 | 78 | -21% | |
| Duplication % | 11.2% | 3.8% | +195% | |
| Security Hotspots / kLOC | 2.1 | 0.8 | +163% | |
| Bug Density / kLOC | 4.3 | 2.5 | +72% | |
| Test Coverage (branch) | 71% | 84% | -15% | |
| Tech Debt (hours/sprint) | 23 | 12 | +92% | |

(Illustrative numbers based on research ranges, not production data.)

Charts:

  • Trend lines for each metric over time — improving or degrading?
  • Correlation scatter plot: Velocity vs. Rework Rate, colored by AI ratio
  • Quality gate pass rate: What percentage of PRs pass static analysis on first try? Split by origin.
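The origin split can be computed from PR records. A sketch, assuming each PR already carries an origin label (from a PR label or commit trailer) and a flag for whether static analysis passed on the first run:

```python
# Quality-gate pass rate split by code origin (AI-assisted vs. human).
# Assumes PRs are tagged with an "origin" label at review time.
from collections import Counter

def pass_rate_by_origin(prs: list[dict]) -> dict[str, float]:
    totals, passes = Counter(), Counter()
    for pr in prs:
        totals[pr["origin"]] += 1
        if pr["passed_first_try"]:
            passes[pr["origin"]] += 1
    return {origin: round(100.0 * passes[origin] / n, 1)
            for origin, n in totals.items()}
```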

How the Team Uses This

This dashboard empowers the team to self-correct before problems compound. When cyclomatic complexity trends up or duplication percentage rises, the team can proactively allocate refactoring time — not because a manager demanded it, but because the data shows it’s needed.

When a stakeholder asks about quality, the team has a clear answer:

“We’re shipping at a good pace. Here’s our quality picture: complexity is stable, security hotspots are trending down, and we’re proactively addressing duplication in AI-generated modules. We’re on top of it.”

That’s not defensive. That’s a team that owns its craft. And when the data shows a problem — rising complexity, increasing rework — the team can flag it early with a concrete proposal, not a vague “we need to slow down.”

Practical Recommendations

For Engineering Managers

  1. Add static analysis to CI/CD if you haven’t already. Tools like SonarQube, CodeQL, or Semgrep are mature and well-supported.
  2. Tag AI-assisted commits in your version control. Many teams use commit message conventions or PR labels. This is essential for split reporting.
  3. Set quality gates that AI-generated code must pass before merge. Don’t lower the bar because the code was generated faster.
  4. Review AI code more carefully, not less. The temptation is “AI wrote it, it’s probably fine.” The data says the opposite.

For Developers

  1. Treat AI output as a first draft, not a finished product. Refactor for clarity, remove duplication, add meaningful tests.
  2. Run security scans locally before pushing AI-generated code. Don’t rely on CI to catch everything.
  3. Question AI decisions. If you don’t understand why the AI structured code a certain way, rewrite it. Code you can’t explain is code you can’t maintain.


For Executives

  1. Celebrate velocity with quality context. A team shipping 50% faster with stable quality metrics is genuinely accelerating. A team shipping faster with rising defect density needs support, not pressure.
  2. Fund the guardrails. Static analysis, security scanning, and quality dashboards cost money. They’re cheaper than production incidents and security breaches — and they give your teams the tools to self-manage quality.
  3. Ask about the full picture. Velocity + rework rate + quality trends together tell you whether acceleration is sustainable. Support the team in maintaining all three.

The Bottom Line

AI coding tools are here to stay, and they make developers more productive. The opportunity is to pair that acceleration with visibility — so the team can steer, not just sprint.

Build the dashboards. Track the metrics. Split by origin. Let the team own the data.

Because in the age of AI, the teams that thrive won’t be the ones that ship the most code. They’ll be the ones that ship the most value — with code they can still maintain, extend, and be proud of a year from now. And they’ll have the dashboard to prove it.


This concludes the Velocity vs Value series. Start from the beginning: Velocity vs Value: How to Measure Success in the Age of AI →

Krzysztof Sajna is an IT engineering manager who builds internal platforms at scale. He writes about the messy intersection of technology, management, and reality at sajna.space.