TL;DR
The Qwen3.6-35B-A3B mixture-of-experts model fits on a single workstation GPU and handles structured code review surprisingly well. This post walks through quantization tradeoffs, serving engine selection, constrained tool-call output, and wiring it all into a GitHub Actions self-hosted runner as a pre-merge gate. No API costs. The model ships under Apache 2.0, so commercial CI use is fine.
Why self-host a review model?
I’ve built enough production CI pipelines to know the real blocker to AI-assisted code review was never model quality. It was cost predictability and data sovereignty. Sending every diff to a cloud API at $3-15 per million tokens adds up fast when your team pushes 50+ PRs a day, and plenty of organizations flat-out cannot send proprietary code to third-party endpoints.
Qwen3.6-35B-A3B makes self-hosting realistic. As a mixture-of-experts architecture, it activates only ~3B of its 35B parameters per forward pass, so inference fits on hardware that would choke on a dense 35B model. The model was built for agentic coding workflows (tool calling, structured output, multi-step reasoning), which is exactly what a CI review gate needs. And it’s Apache 2.0, so your legal team won’t have concerns about commercial use.
Quantization: picking the right tradeoff
When self-hosting with llama.cpp via GGUF, your quantization choice directly determines VRAM usage and output quality. Here’s what I see teams get wrong: they default to Q4_K_M without benchmarking whether the quality drop actually matters for their use case. Worse, they forget that VRAM consumption isn’t just model weights. KV cache overhead adds 2-6 GB depending on your context length, and that will push you over the edge on boundary hardware.
The estimates below assume a 4K-token context window. If you plan to feed full PR diffs at 8K-16K tokens, add 3-6 GB to the VRAM figures.
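The KV-cache overhead is easy to sanity-check on the back of an envelope. The sketch below uses illustrative architecture numbers (48 layers, 32 KV heads of dimension 128, FP16 cache), not the model's published config, so substitute real values before trusting the result:

```shell
# Rough KV-cache size per sequence. Layer/head/dim values are illustrative
# placeholders, NOT the model's actual architecture -- plug in real numbers.
layers=48 kv_heads=32 head_dim=128 ctx=8192
bytes_per_elem=2                                                       # FP16 cache
kv_bytes=$((2 * layers * kv_heads * head_dim * ctx * bytes_per_elem))  # 2x for K and V
echo "$((kv_bytes / 1024 / 1024)) MiB per sequence"
```

With these placeholder values at 8K context the estimate lands around 6 GiB, the top of the range above; grouped-query attention shrinks the figure proportionally to the KV-head count, which is why real-world overhead varies so widely.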
GGUF quantization comparison for Qwen3.6-35B-A3B
| Quantization | Model Size | VRAM (weights + KV @ 4K ctx) | Quality impact | Best for |
|---|---|---|---|---|
| Q5_K_S | ~24 GB | ~28-30 GB | Minimal degradation | Code review where precision matters |
| Q4_K_M | ~20 GB | ~24-26 GB | Slight degradation on nuanced reasoning | General refactoring suggestions, linting |
| Q3_K_M | ~16 GB | ~20-22 GB | Noticeable quality loss | Rough triage, classification only |
The numbers tell a clear story. A 24 GB card (RTX 4090, A5000) is tight for Q5_K_S once KV cache is factored in. You’ll likely need to cap context length or drop to Q4_K_M. With 32 GB (A6000 Ada), Q5_K_S at 8K context is comfortable. On a 16 GB card, Q4_K_M only works at short context windows.
A practical note on context budget: truncate or chunk large diffs to stay within your VRAM budget. A 500-line diff runs roughly 4K-6K tokens. For larger PRs, split the diff by file and review in batches. The model handles focused, single-file context better anyway.
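Per-file chunking is a few lines of shell. This sketch assumes the same `origin/main` base ref used later in the workflow; the `/tmp/chunks` output directory is an arbitrary choice:

```shell
# Split the PR diff into one chunk per changed file for batched review.
# BASE and the output directory are assumptions -- adjust to your pipeline.
BASE=${BASE:-origin/main}
mkdir -p /tmp/chunks
git diff --name-only "$BASE"...HEAD | while IFS= read -r f; do
  # Flatten the path so each chunk gets a unique, filesystem-safe name.
  git diff "$BASE"...HEAD -- "$f" > "/tmp/chunks/$(printf '%s' "$f" | tr '/' '_').diff"
done
```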
Serving engine: vLLM vs llama.cpp
This decision comes down to concurrency.
| Factor | vLLM | llama.cpp (llama-server) |
|---|---|---|
| Throughput (concurrent) | High, continuous batching, PagedAttention | Lower, single-sequence optimized |
| Setup complexity | Requires Python env, CUDA toolkit | Single binary, minimal dependencies |
| Quantization support | GPTQ, AWQ, FP8 | GGUF (Q2-Q8, imatrix) |
| Structured output | Via outlines / guided decoding | Via GBNF grammars |
| Ideal for | Shared team server, multiple PRs queued | Single-runner, sequential review |
For a self-hosted GitHub Actions runner processing one PR at a time, llama.cpp’s simplicity wins. If you’re building a centralized review service behind an API that multiple repos hit, vLLM’s batching justifies the extra setup.
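For the single-runner path, serving is one command. The flags below are standard `llama-server` options, but the model path, context size, and port are placeholder assumptions; match them to your download and VRAM budget:

```shell
# Launch llama.cpp's OpenAI-compatible server on the runner.
# -m: path to your GGUF (placeholder shown); -c: context window, sized per the
# VRAM discussion above; -ngl 99: offload all layers to the GPU.
llama-server -m /models/qwen3.6-35b-a3b-Q4_K_M.gguf -c 8192 -ngl 99 --port 8080
```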
Constrained decoding for tool-call output
The piece that makes this actually work in CI is getting the model to emit structured, parseable output instead of freeform prose. You need JSON conforming to a schema so your CI script can programmatically extract verdicts, file paths, and suggested diffs.
With llama.cpp, you do this via GBNF grammars. Here’s a minimal schema for a review verdict:
```json
{
  "verdict": "approve | request_changes | comment",
  "findings": [
    {
      "file": "src/queue.js",
      "line": 42,
      "severity": "warning",
      "message": "Unbounded queue growth — consider a max-size with backpressure."
    }
  ]
}
```
Pass the corresponding GBNF grammar to the server’s --grammar flag or per-request via the grammar field in the completions API. This guarantees every response is valid JSON matching your schema. No regex post-processing, no retry loops.
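For reference, a hand-written grammar for the schema above might look like the following. This is a sketch of GBNF syntax, not a tested grammar file; the `severity` values are my assumption, and you should validate the whole thing against your llama.cpp build before relying on it:

```gbnf
root     ::= "{" ws "\"verdict\":" ws verdict "," ws "\"findings\":" ws findings ws "}"
verdict  ::= "\"approve\"" | "\"request_changes\"" | "\"comment\""
findings ::= "[" ws (finding (ws "," ws finding)*)? ws "]"
finding  ::= "{" ws "\"file\":" ws string "," ws "\"line\":" ws number "," ws "\"severity\":" ws severity "," ws "\"message\":" ws string ws "}"
severity ::= "\"info\"" | "\"warning\"" | "\"error\""
string   ::= "\"" ([^"\\] | "\\" ["\\bfnrt])* "\""
number   ::= [0-9]+
ws       ::= [ \t\n]*
```

llama.cpp also ships a JSON-schema-to-GBNF converter script (`json_schema_to_grammar.py`), which is usually a safer route than hand-writing the grammar.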
Wiring into GitHub Actions
The integration is straightforward once serving is running on your self-hosted runner. But pay close attention to how you pass the diff into the JSON payload. Shell-interpolating raw diff content into a JSON heredoc will break on quotes, backslashes, and newlines, and it’s a command-injection vector. Use jq to safely encode the diff as a JSON string. Don’t skip this.
```yaml
# .github/workflows/ai-review.yml
on: pull_request

jobs:
  code-review:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Generate diff
        run: git diff origin/main...HEAD > /tmp/pr.diff

      - name: Run AI review
        run: |
          jq -n \
            --arg diff "$(cat /tmp/pr.diff)" \
            --arg grammar "$(cat review-schema.gbnf)" \
            '{
              model: "qwen3.6-35b-a3b",
              messages: [
                {role: "system", content: "You are a code reviewer. Output JSON only."},
                {role: "user", content: $diff}
              ],
              grammar: $grammar
            }' |
          curl -s http://localhost:8080/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d @- |
          jq '.choices[0].message.content | fromjson' > review.json

      - name: Gate on verdict
        run: |
          verdict=$(jq -r '.verdict' review.json)
          if [ "$verdict" = "request_changes" ]; then exit 1; fi
```
By using jq -n --arg, the diff content is properly escaped into valid JSON regardless of what characters appear in the source code. This runs entirely on your hardware. Zero tokens billed. Full control over the model, the prompt, and the review criteria.
What I’d actually recommend
Match quantization to your real VRAM budget, KV cache included. Q4_K_M on a 24 GB card is the practical sweet spot for most teams. Only go Q5_K_S if you have 32+ GB or can keep context under 4K tokens. Benchmark with representative diffs before committing to a quant level, because synthetic benchmarks won’t tell you how it handles your codebase’s idioms.
Enforce structured output from day one. Use GBNF grammars or guided decoding to constrain the model to your review schema. Freeform text output in CI is a reliability problem. One malformed response breaks your gate, and you will not notice until a PR is blocked at 2 AM.
Start with the reviewer as advisory, not authoritative. Wire it as a non-blocking check (continue-on-error: true), watch its findings for a few weeks, then tighten to a blocking gate once you’ve calibrated the prompt and thresholds against your actual code. I’ve seen teams skip this step and burn trust with developers by shipping a gate that flags nonsense on day one.
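In workflow terms, the advisory phase is a single attribute on the gating step. `continue-on-error` is standard GitHub Actions syntax; the step shown is the gate from the workflow above:

```yaml
# Advisory phase: record the verdict but never block the merge.
- name: Gate on verdict
  continue-on-error: true   # remove this line once the reviewer has earned trust
  run: |
    verdict=$(jq -r '.verdict' review.json)
    if [ "$verdict" = "request_changes" ]; then exit 1; fi
```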