Last quarter, three bugs made it to production in the same sprint. Not three minor typos — three real incidents: a broken authentication flow, a silent data loss in a form submission handler, and an API endpoint that returned 200 on failure. All three were in code that had been generated or heavily assisted by AI tools. All three passed our existing review process.
That was the wake-up call.
If you're using Copilot, Cursor, or Claude to write code daily, you've probably noticed the pattern: AI-generated code looks clean. It follows conventions, it's well-named, it sometimes even includes comments. The problem is that looking clean and being correct are two different things. Traditional PR review habits were built for human-written code. They don't account for the specific failure modes AI introduces.
Here's the checklist we now use on every PR that contains AI-generated code — and the reasoning behind each item.
Why AI Code Fails Differently Than Human Code
When a human writes a bug, it's usually because they didn't know something, forgot something, or were rushed. The bug tends to be local and traceable.
When an AI writes a bug, it's often because the model confidently filled in a plausible pattern that doesn't match your actual system. The code is coherent — it just assumes the wrong thing about how your database handles nulls, or it copies an authentication pattern that worked in a different context but silently skips a step in yours.
The three production bugs I mentioned followed this exact shape. The auth bug came from Copilot completing a JWT verification function that skipped the expiry check — it had seen enough examples where that step was handled upstream that it just... omitted it. The data loss came from a Cursor-generated form handler that called preventDefault() but never actually sent the payload. The code ran. Nothing errored. The data simply disappeared.
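To make the auth bug concrete, here's a minimal sketch of its shape using jsonwebtoken. The function names and the exact way the check gets dropped are illustrative, not our production code:

```typescript
import jwt from "jsonwebtoken";

// The generated shape (illustrative): the signature IS verified, so the code
// looks careful -- but the expiry check is disabled on the assumption that
// some upstream layer enforces it. In our system, nothing upstream did.
function verifyTokenGenerated(token: string, secret: string) {
  return jwt.verify(token, secret, { ignoreExpiration: true });
}

// The fix was the default behavior all along: jwt.verify() rejects
// expired tokens unless you explicitly tell it not to.
function verifyToken(token: string, secret: string) {
  return jwt.verify(token, secret);
}
```

Notice that nothing about the bad version looks sloppy. That's exactly what makes this class of bug hard to catch by skimming.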
Understanding this failure mode changes how you review.
The Core AI Code Review Checklist
Go through this on every PR where AI tooling contributed meaningfully to the code.
1. Verify the happy path actually works end-to-end. Don't just read the function — trace the data from input to output. AI models are good at writing the middle of functions but frequently mishandle boundaries: what happens when the input is empty, when a network call returns null, when the user is not logged in. There's a miniature example of this kind of boundary miss after the list.
2. Check every external dependency call. If the AI used an SDK method, verify the method signature against the current docs. AI training data has a cutoff, and APIs change. I've seen Copilot confidently use deprecated Stripe methods that still work but trigger deprecation warnings — until they don't.
3. Audit authentication and authorization gates. This is non-negotiable. Every route, every mutation, every data access: confirm the auth check is real and in the right place. AI often generates the right-looking middleware but places it after the data has already been read. A sketch of that placement bug follows the list.
4. Trace error handling explicitly. Find every try/catch, every .catch(), every error boundary. Ask: what does this actually do on failure? Returning a 200 with { success: false } buried in the body is not error handling. Swallowing an exception and continuing is not error handling. Both failure shapes are sketched after the list.
5. Search for hardcoded values and assumptions. AI tools will sometimes bake in assumptions — a hardcoded user ID for testing, a localhost URL that should be an environment variable, a timeout value that's wrong for your infrastructure. Grep for string literals in logic code.
6. Read the test file as carefully as the implementation. AI-generated tests have a specific antipattern: they test the implementation rather than the behavior. If the test mocks the function being tested, or only checks that a method was called without verifying what it returned, the test is theater. Delete it or rewrite it. A before-and-after example follows the list.
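On item 1, here is a boundary miss in miniature. The helper below is hypothetical; the middle of the function is fine, the edge is not:

```typescript
// Hypothetical example. The arithmetic is correct; the boundary is not.
function averageScore(scores: number[]): number {
  // reduce() with no initial value throws a TypeError on an empty array --
  // and an empty submission is exactly when this ends up getting called.
  return scores.reduce((a, b) => a + b) / scores.length;
}

// The boundary handled explicitly:
function averageScoreSafe(scores: number[]): number | null {
  if (scores.length === 0) return null;
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```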
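On item 3, the placement bug tends to look like this Express-style sketch. requireAuth and loadAccount are hypothetical stand-ins for your real middleware and data layer:

```typescript
import express, { type NextFunction, type Request, type Response } from "express";

const app = express();

// Hypothetical stand-ins.
function requireAuth(req: Request, res: Response, next: NextFunction) {
  if (!req.headers.authorization) {
    res.status(401).end();
    return;
  }
  next();
}
async function loadAccount(id: string) {
  return { id }; // placeholder for a real data-layer call
}

// The AI-generated shape: the middleware exists and looks right, but it
// runs inside the handler, after the account has already been read.
app.get("/bad/account/:id", async (req, res) => {
  const account = await loadAccount(req.params.id); // data read first
  requireAuth(req, res, () => res.json(account));   // gate applied too late
});

// Correct placement: the gate runs before the handler touches any data.
app.get("/account/:id", requireAuth, async (req, res) => {
  res.json(await loadAccount(req.params.id));
});
```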
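On item 4, here is the 200-on-failure shape next to its fix, with a hypothetical chargeCustomer call standing in for the real payment code:

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical payment call that can throw.
async function chargeCustomer(payload: unknown): Promise<void> {
  /* ... */
}

// The failure shape: a 200 with the failure buried in the body. Monitoring,
// retries, and the client's error paths all see "success".
app.post("/bad/charge", async (req, res) => {
  try {
    await chargeCustomer(req.body);
    res.json({ success: true });
  } catch {
    res.json({ success: false }); // still HTTP 200
  }
});

// Failure should be observable: a real status code and a logged cause.
app.post("/charge", async (req, res) => {
  try {
    await chargeCustomer(req.body);
    res.json({ success: true });
  } catch (err) {
    console.error("charge failed", err);
    res.status(502).json({ error: "charge_failed" });
  }
});
```

The empty catch block is the other variant of the same trap: delete the res.json({ success: false }) line above and you have an exception that vanishes entirely.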
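And on item 6, the test-theater antipattern next to its rewrite, in Jest style. pricing.applyDiscount is a made-up function under test:

```typescript
import { expect, it, jest } from "@jest/globals";

// Hypothetical module under test.
const pricing = {
  applyDiscount(total: number, rate: number): number {
    return total * (1 - rate);
  },
};

// Theater: the test mocks the very method it claims to test, then asserts
// the mock was called. It passes no matter what applyDiscount computes.
it("applies the discount (theater)", () => {
  const spy = jest.spyOn(pricing, "applyDiscount").mockReturnValue(90);
  pricing.applyDiscount(100, 0.1);
  expect(spy).toHaveBeenCalledWith(100, 0.1); // only checks the call happened
  spy.mockRestore();
});

// The rewrite: exercise the real implementation, assert on the behavior.
it("applies a 10% discount to the total", () => {
  expect(pricing.applyDiscount(100, 0.1)).toBe(90);
});
```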
What to Automate (So You Don't Have to Think About It)
Some of this checklist can and should be automated. Your brain is better spent on the judgment calls.
Set up a linter rule or pre-commit hook that flags: any console.log or debug print statements, any hardcoded IP addresses or credentials, any TODO comments in new code (AI loves generating these as placeholders it never fills).
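If your stack uses ESLint, built-in rules cover most of this. Here's a sketch of a flat config; the rule selection and the IP regex are our choices to adapt, not a canonical setup:

```javascript
// eslint.config.js (flat config) -- a sketch; adapt the rules to your stack.
export default [
  {
    files: ["src/**/*.{ts,tsx,js}"],
    rules: {
      // Stray debug output.
      "no-console": "error",
      // TODO/FIXME placeholders the assistant never filled in.
      "no-warning-comments": ["error", { terms: ["todo", "fixme"], location: "start" }],
      // Crude guard against hardcoded IPv4 string literals. Real credential
      // scanning belongs in a dedicated secret scanner, not a lint rule.
      "no-restricted-syntax": [
        "error",
        {
          selector: "Literal[value=/\\d+\\.\\d+\\.\\d+\\.\\d+/]",
          message: "Hardcoded IP address; read it from configuration instead.",
        },
      ],
    },
  },
];
```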
For security-critical paths, run a static analysis tool like Semgrep on every PR. You can write custom rules for your stack — for example, a rule that flags any route handler that doesn't reference your auth middleware by name. This is mechanical work that shouldn't require a human each time.
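Here's roughly what such a rule looks like. This sketch assumes Express-style routing and a middleware literally named requireAuth; treat it as a starting point rather than a tested production rule:

```yaml
# semgrep-rules/route-auth.yml -- illustrative, adapt to your framework
rules:
  - id: route-handler-missing-auth
    languages: [typescript]
    severity: ERROR
    message: Route handler does not reference the requireAuth middleware.
    patterns:
      - pattern: app.$METHOD($PATH, ...)
      - pattern-not: app.$METHOD($PATH, requireAuth, ...)
      - metavariable-regex:
          metavariable: $METHOD
          regex: ^(get|post|put|patch|delete)$
```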
In GitHub Actions or your CI pipeline, enforce that test coverage doesn't drop on new files. AI generates untested code more often than human developers do, and coverage gates catch it early.
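If Jest runs your tests, coverageThreshold gives you this gate with no extra tooling. A global floor is the blunt version; per-path thresholds get closer to "new code can't ship untested". The numbers and the directory below are placeholders:

```javascript
// jest.config.js -- thresholds are placeholders, not a recommendation.
module.exports = {
  collectCoverage: true,
  coverageThreshold: {
    // Blunt global floor: CI fails if overall coverage drops below this.
    global: { branches: 75, functions: 80, lines: 80, statements: 80 },
    // Tighter floor for a directory where AI-assisted code tends to land.
    "./src/api/": { lines: 90 },
  },
};
```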
Finally, for teams doing PR review with Copilot or Cursor integrated into the editor, establish a simple rule: AI-suggested code gets a different label in your PR description. Something as simple as an [AI-assisted] tag on the relevant sections tells reviewers where to spend extra time.
The Mini Case Study: What We Changed After the Incidents
After the three bugs, we did a post-mortem on our review process. The changes were small but specific.
We added a mandatory checklist comment to every PR template — not a long one, just six checkboxes that map to the items above. Reviewers have to check them off before approving. It adds maybe five minutes to a review. In the first month after rolling it out, we caught four issues that would have gone to staging: two of them were authentication gaps, one was a silent swallowed error in a payment webhook, and one was a Cursor-generated migration that ran without checking for existing data.
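For reference, our template looks something like this. The wording here is illustrative, mapped from the six items above; adapt the checkboxes to your own risk areas:

```markdown
<!-- .github/pull_request_template.md (excerpt) -->
### AI-assisted code checklist
- [ ] Happy path traced end-to-end, including empty/null/unauthenticated inputs
- [ ] Every external SDK call checked against current docs
- [ ] Auth checks present and placed before any data access
- [ ] Error paths traced: no swallowed exceptions, no 200-on-failure
- [ ] No hardcoded IDs, URLs, or timeouts in logic code
- [ ] Tests assert behavior, not that mocks were called
```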
We also started tagging AI-heavy PRs in our tooling so we could track the error rate over time. The data has been useful: AI-generated code in our codebase has a higher defect rate per line than human-written code, but it also gets written 3-4x faster. The math works out in our favor only if the review layer is solid.
Adapting the Checklist for Solo Developers
If you're working solo — as a freelancer, indie developer, or solopreneur — you don't have a second pair of eyes. That makes the checklist more important, not less.
My suggestion: when you finish a block of AI-assisted code, switch contexts before reviewing it. Close the file, work on something else for 20 minutes, then come back and read it as if someone else wrote it. Your brain will catch things it missed when you were in flow.
Also, fold the checklist into your commit messages. Before you commit, ask yourself: did I verify the auth? Did I trace the error handling? Writing the answers into the message forces a mental check even when you're alone.
For freelancers specifically: your reputation is on the line with every deploy. A client doesn't care whether the bug was written by you or by Copilot — it's your name on the contract. The checklist is professional insurance.
Building a Repeatable QA Process for AI-Generated Code
The goal isn't to distrust AI tools — it's to use them well. AI assistants make you faster. A solid review process makes you reliable. You need both.
If you want a more complete version of this system — including a full PR review template, automated CI/CD hooks, and checklists organized by risk level (auth, payments, data mutations, external APIs) — I put everything I use into a practical guide called AI PR Review Playbook: QA Checklists That Scale. It's built for exactly this audience: developers who ship fast with AI and need a repeatable process to keep quality high without slowing down.
The three production bugs cost us more time fixing them than a proper review process ever would have. Start with the checklist above — and build from there.