TL;DR
We built Codity.ai, an AI-powered code review tool, and through continuous iteration found major opportunities to improve its accuracy. Systematic architectural changes (a multi-agent review system, increased chunk overlap, file-level context boosting, dynamic agentic RAG retrieval, and a migration to an optimized vector storage system) reduced false positives by 90% and produced 10x better review quality.
The Root Causes: Deep Analysis
After analyzing feedback, we identified four critical areas for architectural improvement.
1. Insufficient Context Overlap
We chunked code with minimal overlap between segments. When a function spanned multiple chunks, variable definitions in one chunk were invisible to the code using them in another.
The issue:
- Chunk overlap was too small relative to chunk size.
- Result: variables defined at the end of chunk A were often missing from chunk B's context.
2. Static Retrieval Depth
Our RAG (Retrieval-Augmented Generation) system used a fixed number of code chunks for every review, regardless of complexity. For large files with many functions, the critical chunk containing a variable definition often wasn't retrieved.
Example:
- Query: "Review line 250 usage of `dedupe_key`."
- Vector similarity ranks chunks about `dedupe_key` usage highest.
- The definition at line 100 doesn't get retrieved.
- Result: the model sees the usage without the definition.
3. No File-Level Context Prioritization
When reviewing file X, chunks from other files could rank higher in vector similarity, excluding critical context from the file being reviewed.
Example:
- Reviewing `src/auth.py`, line 200.
- Top 10 chunks retrieved: 6 from `src/db.py`, 3 from `src/utils.py`, 1 from `src/auth.py`.
- Result: missing 90% of the context from the file under review.
4. Single-Pass Review Focus Limitation
When the reviewer identified one type of bug, it sometimes focused deeply on that issue and missed other potential problems in the same code block.
Example:
```python
def process_payment(user_id, amount):
    user = get_user(user_id)
    if user.balance >= amount:
        user.balance -= amount  # Race condition
        user.save()
        db.execute(f"INSERT INTO transactions VALUES ({user_id}, {amount})")  # SQL injection
        send_receipt(user.email)  # No error handling
        logger.info(f"Payment processed: {user.email}, card: {user.card_number}")  # Sensitive data logging
```
Initial review output:
Race condition detected: The balance check and deduction are not atomic. Multiple concurrent requests could cause negative balances.
Other issues such as SQL injection, missing error handling, and PII logging were ignored.
This happened because the review was single-pass and prioritized the first major issue.
The Solution: Four Architectural Pillars
Pillar 1: Intelligent Chunking Strategy
We redesigned our chunking approach with three key innovations.
A. Function-Aware Semantic Chunking
We implemented AST-based semantic chunking that respects code structure, keeping entire functions and classes together whenever possible.
Implementation highlights:
- AST parsing to identify function/class boundaries.
- Keep entire functions together.
- Only split functions that exceed token limits.
- Preserve semantic units (avoid splitting control-flow blocks).
Impact:
- Variable scope preserved naturally.
- Function context always complete.
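The production chunker budgets by tokens and handles many edge cases, but the core idea fits in a short sketch. Below is a minimal illustration using Python's built-in `ast` module; the line-based budget `MAX_CHUNK_LINES` is a simplifying assumption.

```python
import ast

MAX_CHUNK_LINES = 120  # illustrative budget; a real chunker counts tokens

def semantic_chunks(source: str) -> list[str]:
    """Split source on function/class boundaries instead of fixed-size windows."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, cursor = [], 0
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue  # loose top-level code is flushed with the next boundary
        # Include decorators, which sit above the def/class line.
        start = (node.decorator_list[0].lineno if node.decorator_list else node.lineno) - 1
        end = node.end_lineno  # inclusive; available since Python 3.8
        if start > cursor:
            chunks.append("\n".join(lines[cursor:start]))  # code between definitions
        unit = lines[start:end]
        if len(unit) <= MAX_CHUNK_LINES:
            chunks.append("\n".join(unit))  # keep the whole definition together
        else:
            # Only oversized definitions get split, in windowed pieces.
            for i in range(0, len(unit), MAX_CHUNK_LINES):
                chunks.append("\n".join(unit[i:i + MAX_CHUNK_LINES]))
        cursor = end
    if cursor < len(lines):
        chunks.append("\n".join(lines[cursor:]))  # trailing top-level code
    return [c for c in chunks if c.strip()]
```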
B. Increased Chunk Overlap
We significantly increased overlap between consecutive chunks. Variable definitions now appear in multiple consecutive chunks, ensuring definitions remain visible even when similarity retrieval focuses on usage.
Impact:
- Context continuity maintained.
- Reduced false positives due to missing definitions.
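The overlap mechanism itself is just a sliding window. A minimal sketch, assuming line-based chunks and illustrative window sizes:

```python
def overlapping_chunks(lines: list[str], size: int = 80, overlap: int = 20) -> list[list[str]]:
    """Fixed windows that share `overlap` lines with each neighbour, so a
    definition near a chunk boundary is visible in both adjacent chunks."""
    step = size - overlap
    return [lines[i:i + size] for i in range(0, max(len(lines) - overlap, 1), step)]
```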
C. Agentic Smart Chunk Expansion
An intelligent expansion agent analyzes retrieved chunks and decides whether adjacent chunks should be included to add valuable context.
Impact:
- Dynamic expansion ensures complete context.
- Near-zero false positives for variable-scope issues.
Trade-offs:
- Modest storage increase.
- Slightly more compute for parsing and decisions.
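In production this expansion decision is made by a model; the sketch below approximates it with a purely heuristic stand-in: if a chunk loads names it never binds, pull in its neighbours.

```python
import ast
import builtins

def _unresolved_names(code: str) -> set[str]:
    """Names loaded in this chunk but never bound in it (rough scope check)."""
    try:
        tree = ast.parse(code)
    except SyntaxError:  # chunk boundaries can split statements
        return set()
    loaded, bound = set(), set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            (loaded if isinstance(node.ctx, ast.Load) else bound).add(node.id)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)  # function parameters count as bound
        elif isinstance(node, ast.alias):
            bound.add((node.asname or node.name).split(".")[0])  # imports bind names
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
    return loaded - bound

def expand_chunk(chunks: list[str], i: int) -> str:
    """Include neighbouring chunks when chunk i uses names it doesn't define."""
    if _unresolved_names(chunks[i]):
        lo, hi = max(i - 1, 0), min(i + 1, len(chunks) - 1)
        return "\n".join(chunks[lo:hi + 1])
    return chunks[i]
```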
Pillar 2: Dynamic Agentic RAG Retrieval
Instead of a fixed retrieval depth, we use an agentic RAG system that dynamically adjusts context retrieval based on PR complexity and file criticality.
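As a rough illustration of the idea (the actual signals and thresholds are internal; every number below is an assumption):

```python
def retrieval_depth(files_changed: int, lines_changed: int, is_critical_path: bool) -> int:
    """Scale retrieval top-k with PR complexity instead of using a fixed depth."""
    k = 10                          # baseline for a small, simple PR
    k += 2 * files_changed          # wider PRs need broader context
    k += lines_changed // 100       # deeper context for large diffs
    if is_critical_path:            # e.g. auth or payments code
        k = int(k * 1.5)
    return min(k, 60)               # cap to keep latency and cost bounded
```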
Impact:
- Improved variable-definition capture.
- Better coverage of complex codebases.
- More efficient resource use.
Pillar 3: File-Level Context Boosting
We implemented file-path extraction and context boosting that prioritizes all chunks from the file being reviewed.
Process:
- Extract file paths from queries using pattern recognition.
- Retrieve all chunks from the target files.
- Boost file-specific chunk relevance during retrieval.
Impact:
- Complete function context for files under review.
- Near-zero false positives for in-file references.
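The boosting step is essentially a re-rank. A minimal sketch, assuming each retrieval result carries a `path`, a similarity `score`, and the chunk `text`, with an illustrative 1.5x multiplier:

```python
def boost_in_file_chunks(results: list[dict], file_under_review: str,
                         boost: float = 1.5) -> list[dict]:
    """Re-rank retrieval results so chunks from the reviewed file float to the top."""
    for r in results:
        if r["path"] == file_under_review:
            r["score"] *= boost  # prioritize in-file context over cross-file matches
    return sorted(results, key=lambda r: r["score"], reverse=True)
```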
Pillar 4: Multi-Agent Agentic Architecture
We replaced single-pass reviews with a team of specialized AI agents, each focused on a specific review dimension.
Architecture Overview
- Orchestrator Agent: Analyzes PR scope and decides which agents to deploy.
- Specialist Agents: Security, Performance, Correctness, and Style/Best Practices.
- Synthesis Agent: Merges, deduplicates, and prioritizes findings.
Specialist Agents
- Security Agent: Detects vulnerabilities (SQL injection, PII leaks, auth flaws).
- Performance Agent: Identifies inefficiencies (N+1 queries, algorithmic issues).
- Correctness Agent: Finds logic and concurrency errors.
- Style Agent: Ensures maintainability and conventions.
Orchestrator Logic
Deploys only necessary agents based on file types and PR content, reducing costs by ~30% and improving speed.
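A simplified sketch of this routing; the actual rules are richer, and the path patterns below are illustrative assumptions:

```python
from pathlib import Path

def select_agents(changed_files: list[str]) -> set[str]:
    """Deploy only the specialists a PR actually needs."""
    agents = {"correctness"}                           # always runs
    exts = {Path(f).suffix for f in changed_files}
    if exts & {".py", ".js", ".ts", ".go", ".java"}:
        agents.add("style")
    if any("auth" in f or "payment" in f for f in changed_files):
        agents.add("security")                         # sensitive surface area
    if any("db" in f or "query" in f for f in changed_files):
        agents.add("performance")                      # data-access hot spots
    return agents
```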
Synthesis Agent
Aggregates and prioritizes findings by severity and removes duplicates.
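In sketch form, assuming each finding is a dict with `file`, `line`, `category`, and `severity` keys:

```python
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def synthesize(findings: list[dict]) -> list[dict]:
    """Merge findings from all agents: drop duplicates that flag the same
    file/line/category, then sort by severity."""
    seen, merged = set(), []
    for f in findings:
        key = (f["file"], f["line"], f["category"])
        if key not in seen:
            seen.add(key)
            merged.append(f)
    return sorted(merged, key=lambda f: SEVERITY_ORDER[f["severity"]])
```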
Results (Before vs After):
| Aspect | Single Agent | Multi-Agent |
|---|---|---|
| Issues detected | 25% (1 of 4 in the example above) | 100% |
| False positives | High | Minimal |
| Review time | Longer | Faster |
| Developer satisfaction | 4.1/10 | 8.9/10 |
Why Multi-Agent Architecture Works
- Specialized focus areas.
- Independent analysis per dimension.
- Parallel execution (sketched below).
- Consensus-based synthesis improves accuracy.
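Because agents are independent, fan-out is straightforward. A minimal sketch with `asyncio`, assuming each agent is an async callable that returns its own findings list:

```python
import asyncio

async def run_review(diff: str, agents: dict) -> list[dict]:
    """Run every specialist concurrently and flatten their findings."""
    results = await asyncio.gather(*(agent(diff) for agent in agents.values()))
    return [finding for agent_findings in results for finding in agent_findings]
```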
Vector Storage Migration: Dramatic Storage Savings
Before: Per-Branch Storage
Each branch had its own full vector index, duplicating most of the data.
Problems:
- Massive storage overhead.
- Slow indexing.
- Race conditions with concurrent PRs.
After: Unified Collection with Hash Deduplication
A single repository-wide index stores code chunks once, tagging them by branch.
Benefits:
- 95% storage savings.
- Instant branch filtering via metadata.
- No race conditions.
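A toy in-memory version of the scheme (a real deployment would use a vector database's metadata filters, but the hash-deduplication logic is the same idea):

```python
import hashlib

class UnifiedIndex:
    """One repo-wide store: each unique chunk is kept once and tagged with
    every branch that contains it."""

    def __init__(self) -> None:
        self.chunks: dict[str, dict] = {}  # content hash -> {text, branches}

    def add(self, text: str, branch: str) -> None:
        h = hashlib.sha256(text.encode()).hexdigest()
        entry = self.chunks.setdefault(h, {"text": text, "branches": set()})
        entry["branches"].add(branch)      # identical chunks are stored once

    def for_branch(self, branch: str) -> list[str]:
        """Branch 'filtering' is a metadata lookup, not a separate index."""
        return [e["text"] for e in self.chunks.values() if branch in e["branches"]]
```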
The Results: 10x Better Reviews
| Metric | Before | After | Improvement |
|---|---|---|---|
| False positive rate | High | Low | 90% reduction |
| Critical issue detection | <50% | >95% | ~2x |
| Review quality score | 3.2/10 | 8.7/10 | 2.7x |
| Storage per repo | Large | Small | 95% reduction |
| Review speed | Baseline | Faster | Noticeable |
| Developer satisfaction | 4.1/10 | 8.9/10 | 2.2x |
Additional Improvements
1. Agentic Multi-Stage Retrieval (Shipped)
Multi-stage pipeline with intelligent scope determination, file-level reranking, semantic reranking, and expansion/deduplication.
2. Diff-Aware Context Retrieval (Shipped)
Parses git diffs to identify affected functions and dependencies for better context.
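For a flavor of the diff parsing, here is a minimal sketch that extracts the new-side line ranges from unified-diff hunk headers; these ranges can then be mapped to their enclosing functions:

```python
import re

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_ranges(unified_diff: str) -> list[tuple[int, int]]:
    """Return (start, end) line ranges touched on the new side of a git diff."""
    ranges = []
    for line in unified_diff.splitlines():
        m = HUNK_RE.match(line)
        if m:
            start = int(m.group(1))
            count = int(m.group(2) or 1)   # a missing count means one line
            ranges.append((start, start + max(count - 1, 0)))
    return ranges
```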
3. Cross-File Dependency Tracking (In Development)
Retrieves related context across files using import and function-call graphs.
4. Historical Bug Pattern Learning (Research)
Learns from past developer feedback to identify true vs false positives.
5. Progressive Review Depth (Shipped)
Dynamic review depth based on PR complexity and file sensitivity.
6. Test Coverage Analysis Agent (In Development)
Ensures changed code is adequately tested and suggests missing test cases.
7. Incremental Re-indexing (Shipped)
Re-indexes only changed files, not entire repositories.
8. Caching Layer for Common Patterns (Shipped)
Caches embeddings and results for frequent code snippets, reducing review time.
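Conceptually the cache is keyed by content hash. A minimal sketch, with `embed` standing in for whatever embedding call the pipeline uses:

```python
import hashlib

_EMBED_CACHE: dict[str, list[float]] = {}

def cached_embedding(code: str, embed) -> list[float]:
    """Embed a snippet once; repeats (boilerplate, common helpers) hit the cache."""
    key = hashlib.sha256(code.encode()).hexdigest()
    if key not in _EMBED_CACHE:
        _EMBED_CACHE[key] = embed(code)
    return _EMBED_CACHE[key]
```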
About Codity.ai
We’re building high-accuracy AI-powered code review tools with multi-agent architecture and comprehensive context analysis.
