TL;DR
We built Codity.ai, an AI-powered code review tool, and through continuous iteration found major opportunities to improve its accuracy. Systematic architectural changes (a multi-agent review system, increased chunk overlap, file-level context boosting, dynamic agentic RAG retrieval, and a migration to an optimized vector storage system) reduced false positives by 90% and produced 10x better review quality.
The Root Causes: Deep Analysis
After analyzing feedback, we identified four critical areas for architectural improvement.
1. Insufficient Context Overlap
We chunked code with minimal overlap between segments. When a function spanned multiple chunks, variable definitions in one chunk were invisible to the code using them in another.
The issue:
- Chunk overlap was too small relative to chunk size.
- Result: variables defined at the end of chunk A were often missing from chunk B's context.
2. Static Retrieval Depth
Our RAG (Retrieval-Augmented Generation) system used a fixed number of code chunks for every review, regardless of complexity. For large files with many functions, the critical chunk containing a variable definition often wasn't retrieved.
Example:
- Query: "Review line 250 usage of `dedupe_key`."
- Vector similarity ranks chunks about `dedupe_key` usage highest.
- The definition at line 100 doesn't get retrieved.
- Result: the model sees the usage without the definition.
3. No File-Level Context Prioritization
When reviewing file X, chunks from other files could rank higher in vector similarity, excluding critical context from the file being reviewed.
Example:
- Reviewing `src/auth.py`, line 200.
- Top 10 chunks retrieved: 6 from `src/db.py`, 3 from `src/utils.py`, 1 from `src/auth.py`.
- Result: missing 90% of the context from the file under review.
4. Single-Pass Review Focus Limitation
When the reviewer identified one type of bug, it sometimes focused deeply on that issue and missed other potential problems in the same code block.
Example:
```python
def process_payment(user_id, amount):
    user = get_user(user_id)
    if user.balance >= amount:
        user.balance -= amount  # Race condition
        user.save()
        db.execute(f"INSERT INTO transactions VALUES ({user_id}, {amount})")  # SQL injection
        send_receipt(user.email)  # No error handling
        logger.info(f"Payment processed: {user.email}, card: {user.card_number}")  # Sensitive data logging
```
Initial review output:
Race condition detected: The balance check and deduction are not atomic. Multiple concurrent requests could cause negative balances.
Other issues such as SQL injection, missing error handling, and PII logging were ignored.
This happened because the review was single-pass and prioritized the first major issue.
The Solution: Four Architectural Pillars
Pillar 1: Intelligent Chunking Strategy
We redesigned our chunking approach with three key innovations.
A. Function-Aware Semantic Chunking
We implemented AST-based semantic chunking that respects code structure, keeping entire functions and classes together whenever possible.
Implementation highlights:
- AST parsing to identify function/class boundaries.
- Keep entire functions together.
- Only split functions that exceed token limits.
- Preserve semantic units (avoid splitting control-flow blocks).
Impact:
- Variable scope preserved naturally.
- Function context always complete.
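The production chunker budgets by tokens and handles many edge cases, but the core idea fits in a short sketch. Below is a minimal illustration using Python's built-in `ast` module; the line-based budget `MAX_CHUNK_LINES` is a simplifying assumption.

```python
import ast

MAX_CHUNK_LINES = 120  # illustrative budget; a real chunker counts tokens

def semantic_chunks(source: str) -> list[str]:
    """Split source on function/class boundaries instead of fixed-size windows."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, cursor = [], 0
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue  # loose top-level code is flushed with the next boundary
        # Include decorators, which sit above the def/class line.
        start = (node.decorator_list[0].lineno if node.decorator_list else node.lineno) - 1
        end = node.end_lineno  # inclusive; available since Python 3.8
        if start > cursor:
            chunks.append("\n".join(lines[cursor:start]))  # code between definitions
        unit = lines[start:end]
        if len(unit) <= MAX_CHUNK_LINES:
            chunks.append("\n".join(unit))  # keep the whole definition together
        else:
            # Only oversized definitions get split, in windowed pieces.
            for i in range(0, len(unit), MAX_CHUNK_LINES):
                chunks.append("\n".join(unit[i:i + MAX_CHUNK_LINES]))
        cursor = end
    if cursor < len(lines):
        chunks.append("\n".join(lines[cursor:]))  # trailing top-level code
    return [c for c in chunks if c.strip()]
```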
B. Increased Chunk Overlap
We significantly increased overlap between consecutive chunks. Variable definitions now appear in multiple consecutive chunks, ensuring definitions remain visible even when similarity retrieval focuses on usage.
Impact:
- Context continuity maintained.
- Reduced false positives due to missing definitions.
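The overlap mechanism itself is just a sliding window. A minimal sketch, assuming line-based chunks and illustrative window sizes:

```python
def overlapping_chunks(lines: list[str], size: int = 80, overlap: int = 20) -> list[list[str]]:
    """Fixed windows that share `overlap` lines with each neighbour, so a
    definition near a chunk boundary is visible in both adjacent chunks."""
    step = size - overlap
    return [lines[i:i + size] for i in range(0, max(len(lines) - overlap, 1), step)]
```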
C. Agentic Smart Chunk Expansion
An intelligent expansion agent analyzes retrieved chunks and decides whether adjacent chunks should be included to add valuable context.
Impact:
- Dynamic expansion ensures complete context.
- Near-zero false positives for variable-scope issues.
Trade-offs:
- Modest storage increase.
- Slightly more compute for parsing and decisions.
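In production this expansion decision is made by a model; the sketch below approximates it with a purely heuristic stand-in: if a chunk loads names it never binds, pull in its neighbours.

```python
import ast
import builtins

def _unresolved_names(code: str) -> set[str]:
    """Names loaded in this chunk but never bound in it (rough scope check)."""
    try:
        tree = ast.parse(code)
    except SyntaxError:  # chunk boundaries can split statements
        return set()
    loaded, bound = set(), set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            (loaded if isinstance(node.ctx, ast.Load) else bound).add(node.id)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)  # function parameters count as bound
        elif isinstance(node, ast.alias):
            bound.add((node.asname or node.name).split(".")[0])  # imports bind names
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
    return loaded - bound

def expand_chunk(chunks: list[str], i: int) -> str:
    """Include neighbouring chunks when chunk i uses names it doesn't define."""
    if _unresolved_names(chunks[i]):
        lo, hi = max(i - 1, 0), min(i + 1, len(chunks) - 1)
        return "\n".join(chunks[lo:hi + 1])
    return chunks[i]
```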
Pillar 2: Dynamic Agentic RAG Retrieval
Instead of a fixed retrieval depth, we use an agentic RAG system that dynamically adjusts context retrieval based on PR complexity and file criticality.
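As a rough illustration of the idea (the actual signals and thresholds are internal; every number below is an assumption):

```python
def retrieval_depth(files_changed: int, lines_changed: int, is_critical_path: bool) -> int:
    """Scale retrieval top-k with PR complexity instead of using a fixed depth."""
    k = 10                          # baseline for a small, simple PR
    k += 2 * files_changed          # wider PRs need broader context
    k += lines_changed // 100       # deeper context for large diffs
    if is_critical_path:            # e.g. auth or payments code
        k = int(k * 1.5)
    return min(k, 60)               # cap to keep latency and cost bounded
```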
Impact:
- Improved variable-definition capture.
- Better coverage of complex codebases.
- More efficient resource use.
Pillar 3: File-Level Context Boosting
We implemented file-path extraction and context boosting that prioritizes all chunks from the file being reviewed.
Process:
- Extract file paths from queries using pattern recognition.
- Retrieve all chunks from the target files.
- Boost file-specific chunk relevance during retrieval.
Impact:
- Complete function context for files under review.
- Near-zero false positives for in-file references.
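The boosting step is essentially a re-rank. A minimal sketch, assuming each retrieval result carries a `path`, a similarity `score`, and the chunk `text`, with an illustrative 1.5x multiplier:

```python
def boost_in_file_chunks(results: list[dict], file_under_review: str,
                         boost: float = 1.5) -> list[dict]:
    """Re-rank retrieval results so chunks from the reviewed file float to the top."""
    for r in results:
        if r["path"] == file_under_review:
            r["score"] *= boost  # prioritize in-file context over cross-file matches
    return sorted(results, key=lambda r: r["score"], reverse=True)
```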
Pillar 4: Multi-Agent Agentic Architecture
We replaced single-pass reviews with a team of specialized AI agents, each focused on a specific review dimension.
Architecture Overview
- Orchestrator Agent: Analyzes PR scope and decides which agents to deploy.
- Specialist Agents: Security, Performance, Correctness, and Style/Best Practices.
- Synthesis Agent: Merges, deduplicates, and prioritizes findings.
Specialist Agents
- Security Agent: Detects vulnerabilities (SQL injection, PII leaks, auth flaws).
- Performance Agent: Identifies inefficiencies (N+1 queries, algorithmic issues).
- Correctness Agent: Finds logic and concurrency errors.
- Style Agent: Ensures maintainability and conventions.
Orchestrator Logic
Deploys only necessary agents based on file types and PR content, reducing costs by ~30% and improving speed.
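A simplified sketch of this routing; the actual rules are richer, and the path patterns below are illustrative assumptions:

```python
from pathlib import Path

def select_agents(changed_files: list[str]) -> set[str]:
    """Deploy only the specialists a PR actually needs."""
    agents = {"correctness"}                           # always runs
    exts = {Path(f).suffix for f in changed_files}
    if exts & {".py", ".js", ".ts", ".go", ".java"}:
        agents.add("style")
    if any("auth" in f or "payment" in f for f in changed_files):
        agents.add("security")                         # sensitive surface area
    if any("db" in f or "query" in f for f in changed_files):
        agents.add("performance")                      # data-access hot spots
    return agents
```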
Synthesis Agent
Aggregates and prioritizes findings by severity and removes duplicates.
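In sketch form, assuming each finding is a dict with `file`, `line`, `category`, and `severity` keys:

```python
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def synthesize(findings: list[dict]) -> list[dict]:
    """Merge findings from all agents: drop duplicates that flag the same
    file/line/category, then sort by severity."""
    seen, merged = set(), []
    for f in findings:
        key = (f["file"], f["line"], f["category"])
        if key not in seen:
            seen.add(key)
            merged.append(f)
    return sorted(merged, key=lambda f: SEVERITY_ORDER[f["severity"]])
```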
Results (Before vs After):
| Aspect | Single Agent | Multi-Agent |
|---|---|---|
| Issues detected | 25% (1 of 4 in the example above) | 100% |
| False positives | High | Minimal |
| Review time | Longer | Faster |
| Developer satisfaction | 4.1/10 | 8.9/10 |
Why Multi-Agent Architecture Works
- Specialized focus areas.
- Independent analysis per dimension.
- Parallel execution (sketched below).
- Consensus-based synthesis improves accuracy.
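Because agents are independent, fan-out is straightforward. A minimal sketch with `asyncio`, assuming each agent is an async callable that returns its own findings list:

```python
import asyncio

async def run_review(diff: str, agents: dict) -> list[dict]:
    """Run every specialist concurrently and flatten their findings."""
    results = await asyncio.gather(*(agent(diff) for agent in agents.values()))
    return [finding for agent_findings in results for finding in agent_findings]
```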
Vector Storage Migration: Dramatic Storage Savings
Before: Per-Branch Storage
Each branch had its own full vector index, duplicating most of the data.
Problems:
- Massive storage overhead.
- Slow indexing.
- Race conditions with concurrent PRs.
After: Unified Collection with Hash Deduplication
A single repository-wide index stores code chunks once, tagging them by branch.
Benefits:
- 95% storage savings.
- Instant branch filtering via metadata.
- No race conditions.
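A toy in-memory version of the scheme (a real deployment would use a vector database's metadata filters, but the hash-deduplication logic is the same idea):

```python
import hashlib

class UnifiedIndex:
    """One repo-wide store: each unique chunk is kept once and tagged with
    every branch that contains it."""

    def __init__(self) -> None:
        self.chunks: dict[str, dict] = {}  # content hash -> {text, branches}

    def add(self, text: str, branch: str) -> None:
        h = hashlib.sha256(text.encode()).hexdigest()
        entry = self.chunks.setdefault(h, {"text": text, "branches": set()})
        entry["branches"].add(branch)      # identical chunks are stored once

    def for_branch(self, branch: str) -> list[str]:
        """Branch 'filtering' is a metadata lookup, not a separate index."""
        return [e["text"] for e in self.chunks.values() if branch in e["branches"]]
```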
The Results: 10x Better Reviews
| Metric | Before | After | Improvement |
|---|---|---|---|
| False positive rate | High | Low | 90% reduction |
| Critical issue detection | <50% | >95% | ~2x |
| Review quality score | 3.2/10 | 8.7/10 | 2.7x |
| Storage per repo | Large | Small | 95% reduction |
| Review speed | Baseline | Faster | Noticeable |
| Developer satisfaction | 4.1/10 | 8.9/10 | 2.2x |
Additional Improvements
1. Agentic Multi-Stage Retrieval (Shipped)
Multi-stage pipeline with intelligent scope determination, file-level reranking, semantic reranking, and expansion/deduplication.
2. Diff-Aware Context Retrieval (Shipped)
Parses git diffs to identify affected functions and dependencies for better context.
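For a flavor of the diff parsing, here is a minimal sketch that extracts the new-side line ranges from unified-diff hunk headers; these ranges can then be mapped to their enclosing functions:

```python
import re

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_ranges(unified_diff: str) -> list[tuple[int, int]]:
    """Return (start, end) line ranges touched on the new side of a git diff."""
    ranges = []
    for line in unified_diff.splitlines():
        m = HUNK_RE.match(line)
        if m:
            start = int(m.group(1))
            count = int(m.group(2) or 1)   # a missing count means one line
            ranges.append((start, start + max(count - 1, 0)))
    return ranges
```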
3. Cross-File Dependency Tracking (In Development)
Retrieves related context across files using import and function-call graphs.
4. Historical Bug Pattern Learning (Research)
Learns from past developer feedback to identify true vs false positives.
5. Progressive Review Depth (Shipped)
Dynamic review depth based on PR complexity and file sensitivity.
6. Test Coverage Analysis Agent (In Development)
Ensures changed code is adequately tested and suggests missing test cases.
7. Incremental Re-indexing (Shipped)
Re-indexes only changed files, not entire repositories.
8. Caching Layer for Common Patterns (Shipped)
Caches embeddings and results for frequent code snippets, reducing review time.
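Conceptually the cache is keyed by content hash. A minimal sketch, with `embed` standing in for whatever embedding call the pipeline uses:

```python
import hashlib

_EMBED_CACHE: dict[str, list[float]] = {}

def cached_embedding(code: str, embed) -> list[float]:
    """Embed a snippet once; repeats (boilerplate, common helpers) hit the cache."""
    key = hashlib.sha256(code.encode()).hexdigest()
    if key not in _EMBED_CACHE:
        _EMBED_CACHE[key] = embed(code)
    return _EMBED_CACHE[key]
```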
About Codity.ai
We’re building high-accuracy AI-powered code review tools with multi-agent architecture and comprehensive context analysis.
