TL;DR
The Qwen3.6-35B-A3B mixture-of-experts model fits on a single workstation GPU and handles structured code review surprisingly well. This post walks through quantization tradeoffs, serving engine selection, constrained tool-call output, and wiring it all into a GitHub Actions self-hosted runner as a pre-merge gate. No API costs. The model ships under Apache 2.0, so commercial CI use is fine.
Why self-host a review model?
I’ve built enough production CI pipelines to know the real blocker to AI-assisted code review was never model quality. It was cost predictability and data sovereignty. Sending every diff to a cloud API at $3-15 per million tokens adds up fast when your team pushes 50+ PRs a day, and plenty of organizations flat-out cannot send proprietary code to third-party endpoints.
Qwen3.6-35B-A3B makes self-hosting realistic. As a mixture-of-experts architecture, it activates only ~3B of its 35B parameters per forward pass, so inference fits on hardware that would choke on a dense 35B model. The model was built for agentic coding workflows (tool calling, structured output, multi-step reasoning), which is exactly what a CI review gate needs. And it’s Apache 2.0, so your legal team won’t have concerns about commercial use.
Quantization: picking the right tradeoff
When self-hosting with llama.cpp via GGUF, your quantization choice directly determines VRAM usage and output quality. Here’s what I see teams get wrong: they default to Q4_K_M without benchmarking whether the quality drop actually matters for their use case. Worse, they forget that VRAM consumption isn’t just model weights. KV cache overhead adds 2-6 GB depending on your context length, and that will push you over the edge on boundary hardware.
The estimates below assume a 4K-token context window. If you plan to feed full PR diffs at 8K-16K tokens, add 3-6 GB to the VRAM figures.
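The KV-cache overhead is easy to sanity-check on the back of an envelope. The sketch below uses illustrative architecture numbers (48 layers, 32 KV heads of dimension 128, FP16 cache), not the model's published config, so substitute real values before trusting the result:

```shell
# Rough KV-cache size per sequence. Layer/head/dim values are illustrative
# placeholders, NOT the model's actual architecture -- plug in real numbers.
layers=48 kv_heads=32 head_dim=128 ctx=8192
bytes_per_elem=2                                                       # FP16 cache
kv_bytes=$((2 * layers * kv_heads * head_dim * ctx * bytes_per_elem))  # 2x for K and V
echo "$((kv_bytes / 1024 / 1024)) MiB per sequence"
```

With these placeholder values at 8K context the estimate lands around 6 GiB, the top of the range above; grouped-query attention shrinks the figure proportionally to the KV-head count, which is why real-world overhead varies so widely.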
GGUF quantization comparison for Qwen3.6-35B-A3B
| Quantization | Model Size | VRAM (weights + KV @ 4K ctx) | Quality impact | Best for |
|---|---|---|---|---|
| Q5_K_S | ~24 GB | ~28-30 GB | Minimal degradation | Code review where precision matters |
| Q4_K_M | ~20 GB | ~24-26 GB | Slight degradation on nuanced reasoning | General refactoring suggestions, linting |
| Q3_K_M | ~16 GB | ~20-22 GB | Noticeable quality loss | Rough triage, classification only |
The numbers tell a clear story. A 24 GB card (RTX 4090, A5000) is tight for Q5_K_S once KV cache is factored in. You’ll likely need to cap context length or drop to Q4_K_M. With 32 GB (A6000 Ada), Q5_K_S at 8K context is comfortable. On a 16 GB card, Q4_K_M only works at short context windows.
A practical note on context budget: truncate or chunk large diffs to stay within your VRAM budget. A 500-line diff runs roughly 4K-6K tokens. For larger PRs, split the diff by file and review in batches. The model handles focused, single-file context better anyway.
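Per-file chunking is a few lines of shell. This sketch assumes the same `origin/main` base ref used later in the workflow; the `/tmp/chunks` output directory is an arbitrary choice:

```shell
# Split the PR diff into one chunk per changed file for batched review.
# BASE and the output directory are assumptions -- adjust to your pipeline.
BASE=${BASE:-origin/main}
mkdir -p /tmp/chunks
git diff --name-only "$BASE"...HEAD | while IFS= read -r f; do
  # Flatten the path so each chunk gets a unique, filesystem-safe name.
  git diff "$BASE"...HEAD -- "$f" > "/tmp/chunks/$(printf '%s' "$f" | tr '/' '_').diff"
done
```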
Serving engine: vLLM vs llama.cpp
This decision comes down to concurrency.
| Factor | vLLM | llama.cpp (llama-server) |
|---|---|---|
| Throughput (concurrent) | High, continuous batching, PagedAttention | Lower, single-sequence optimized |
| Setup complexity | Requires Python env, CUDA toolkit | Single binary, minimal dependencies |
| Quantization support | GPTQ, AWQ, FP8 | GGUF (Q2-Q8, imatrix) |
| Structured output | Via outlines / guided decoding | Via GBNF grammars |
| Ideal for | Shared team server, multiple PRs queued | Single-runner, sequential review |
For a self-hosted GitHub Actions runner processing one PR at a time, llama.cpp’s simplicity wins. If you’re building a centralized review service behind an API that multiple repos hit, vLLM’s batching justifies the extra setup.
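For the single-runner path, serving is one command. The flags below are standard `llama-server` options, but the model path, context size, and port are placeholder assumptions; match them to your download and VRAM budget:

```shell
# Launch llama.cpp's OpenAI-compatible server on the runner.
# -m: path to your GGUF (placeholder shown); -c: context window, sized per the
# VRAM discussion above; -ngl 99: offload all layers to the GPU.
llama-server -m /models/qwen3.6-35b-a3b-Q4_K_M.gguf -c 8192 -ngl 99 --port 8080
```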
Constrained decoding for tool-call output
The piece that makes this actually work in CI is getting the model to emit structured, parseable output instead of freeform prose. You need JSON conforming to a schema so your CI script can programmatically extract verdicts, file paths, and suggested diffs.
With llama.cpp, you do this via GBNF grammars. Here’s a minimal schema for a review verdict:
```json
{
  "verdict": "approve | request_changes | comment",
  "findings": [
    {
      "file": "src/queue.js",
      "line": 42,
      "severity": "warning",
      "message": "Unbounded queue growth — consider a max-size with backpressure."
    }
  ]
}
```
Pass the corresponding GBNF grammar to the server’s --grammar flag or per-request via the grammar field in the completions API. This guarantees every response is valid JSON matching your schema. No regex post-processing, no retry loops.
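For reference, a hand-written grammar for the schema above might look like the following. This is a sketch of GBNF syntax, not a tested grammar file; the `severity` values are my assumption, and you should validate the whole thing against your llama.cpp build before relying on it:

```gbnf
root     ::= "{" ws "\"verdict\":" ws verdict "," ws "\"findings\":" ws findings ws "}"
verdict  ::= "\"approve\"" | "\"request_changes\"" | "\"comment\""
findings ::= "[" ws (finding (ws "," ws finding)*)? ws "]"
finding  ::= "{" ws "\"file\":" ws string "," ws "\"line\":" ws number "," ws "\"severity\":" ws severity "," ws "\"message\":" ws string ws "}"
severity ::= "\"info\"" | "\"warning\"" | "\"error\""
string   ::= "\"" ([^"\\] | "\\" ["\\bfnrt])* "\""
number   ::= [0-9]+
ws       ::= [ \t\n]*
```

llama.cpp also ships a JSON-schema-to-GBNF converter script (`json_schema_to_grammar.py`), which is usually a safer route than hand-writing the grammar.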
Wiring into GitHub Actions
The integration is straightforward once serving is running on your self-hosted runner. But pay close attention to how you pass the diff into the JSON payload. Shell-interpolating raw diff content into a JSON heredoc will break on quotes, backslashes, and newlines, and it’s a command-injection vector. Use jq to safely encode the diff as a JSON string. Don’t skip this.
```yaml
# .github/workflows/ai-review.yml
on: pull_request

jobs:
  code-review:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Generate diff
        run: git diff origin/main...HEAD > /tmp/pr.diff

      - name: Run AI review
        run: |
          jq -n \
            --arg diff "$(cat /tmp/pr.diff)" \
            --arg grammar "$(cat review-schema.gbnf)" \
            '{
              model: "qwen3.6-35b-a3b",
              messages: [
                {role: "system", content: "You are a code reviewer. Output JSON only."},
                {role: "user", content: $diff}
              ],
              grammar: $grammar
            }' |
          curl -s http://localhost:8080/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d @- |
          jq '.choices[0].message.content | fromjson' > review.json

      - name: Gate on verdict
        run: |
          verdict=$(jq -r '.verdict' review.json)
          if [ "$verdict" = "request_changes" ]; then exit 1; fi
```
By using jq -n --arg, the diff content is properly escaped into valid JSON regardless of what characters appear in the source code. This runs entirely on your hardware. Zero tokens billed. Full control over the model, the prompt, and the review criteria.
What I’d actually recommend
Match quantization to your real VRAM budget, KV cache included. Q4_K_M on a 24 GB card is the practical sweet spot for most teams. Only go Q5_K_S if you have 32+ GB or can keep context under 4K tokens. Benchmark with representative diffs before committing to a quant level, because synthetic benchmarks won’t tell you how it handles your codebase’s idioms.
Enforce structured output from day one. Use GBNF grammars or guided decoding to constrain the model to your review schema. Freeform text output in CI is a reliability problem. One malformed response breaks your gate, and you will not notice until a PR is blocked at 2 AM.
Start with the reviewer as advisory, not authoritative. Wire it as a non-blocking check (continue-on-error: true), watch its findings for a few weeks, then tighten to a blocking gate once you’ve calibrated the prompt and thresholds against your actual code. I’ve seen teams skip this step and burn trust with developers by shipping a gate that flags nonsense on day one.
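In workflow terms, the advisory phase is a single attribute on the gating step. `continue-on-error` is standard GitHub Actions syntax; the step shown is the gate from the workflow above:

```yaml
# Advisory phase: record the verdict but never block the merge.
- name: Gate on verdict
  continue-on-error: true   # remove this line once the reviewer has earned trust
  run: |
    verdict=$(jq -r '.verdict' review.json)
    if [ "$verdict" = "request_changes" ]; then exit 1; fi
```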