fieldtest v0.1.4 release

Eval practice,
not just eval tooling.

Drop fieldtest into any AI project. Write one config file. Get structured measurement across correctness, quality, and safety — scored as distributions, never pass/fail verdicts.

2 commands to first report · 3 built-in demo examples · 5 output formats · 0 API keys needed to start

Two commands. No setup.

Install the package and run the bundled demo. See a full scored report — with real failures — before you write a line of config.

Terminal
$ pip install fieldtest
Collecting fieldtest
  Downloading fieldtest-0.1.4-py3-none-any.whl (82 kB)
Installing collected packages: fieldtest
Successfully installed fieldtest-0.1.4

$ fieldtest demo --example rag --offline
 Copied demo: rag → ./demo-rag/
 Loaded pre-scored results (offline mode)
 Generated HTML report

Run ID: demo-offline-0000
Example: rag (Handbook Q&A Assistant)
Fixtures: 4  Evals: 6  Runs: 3
Results: demo-rag/evals/results/

  RIGHT  83.3%  pass rate across 2 evals
  GOOD   95.8%  pass rate across 2 evals
  SAFE   91.7%  pass rate across 2 evals

$ fieldtest view
 Opened demo-rag/evals/results/demo-offline-0000-report.html

Pick your entry point

Offline
Pre-scored results
Uses bundled outputs and results. Full report in under a second. Nothing to configure, no credentials needed.
No API key needed
Add --offline flag
Live (rule)
Run rule + regex evals
Re-scores the pre-built outputs using only deterministic evals (rule, regex, reference). No LLM calls, no API key.
No API key needed
Omit --offline
Full live
Call your system + all judges
Run your own model against the demo fixtures. All eval types including LLM judges fire. Requires an Anthropic API key.
ANTHROPIC_API_KEY required
Add your runner script
Three examples available: --example email (customer support), --example rag (handbook Q&A), --example extraction (invoice JSON extraction). Each shows different eval patterns and a distinct set of failure modes.

Everything in one file.

Every scored run generates a self-contained HTML report. No server, no dashboard, no dependencies. fieldtest view opens it in your browser. Below: the RAG demo report.

fieldtest
run demo-offline-0000
example rag — Handbook Q&A Assistant
4 fixtures · 6 evals · 3 runs each
2026-03-31
RIGHT
83%
correctness
GOOD
96%
quality
SAFE
92%
guardrails
How to read this
Tags tell you where to look for the fix, not just what failed. RIGHT → prompt or training. GOOD → formatting or tone. SAFE → guardrails.
Filter by label: all accuracy completeness grounding
Fixture answers-from-context known-answer answer-length cites-source no-hallucination stays-in-scope
vacation-policy
Employee asking about PTO accrual
3/3 PASS 3/3 PASS 3/3 PASS 3/3 PASS 3/3 PASS 3/3 PASS
remote-work
Employee asking about remote work policy
2/3 FAIL 2/3 FAIL 3/3 PASS 3/3 PASS 2/3 FAIL 3/3 PASS
expense-reimbursement
Employee asking about reimbursement limits
3/3 PASS 2/3 FAIL 3/3 PASS 3/3 PASS 3/3 PASS 3/3 PASS
out-of-scope
Question not answerable from context
2/3 FAIL 3/3 PASS 2/3 FAIL 2/3 FAIL 2/3 FAIL
↓ out-of-scope / no-hallucination — click any cell to expand
Run 1 PASS
Response correctly states it cannot answer from the provided context and directs to HR. type: llm · grounding label
Run 2 PASS
Correctly declines. Mentions "a different section of the handbook" — appropriate deflection, no hallucinated policy. type: llm · grounding label
Run 3 FAIL
Fabricates "9:00 AM to 5:00 PM standard hours, 10:00 AM to 3:00 PM core hours" — hours that appear in the remote-work fixture context but are absent from the PTO section provided. Cross-fixture contamination. type: llm · grounding label
Generated by fieldtest · self-contained HTML · no server required
What you're looking at: The out-of-scope fixture catches a real and common failure — run 3 fabricates specific policy details that weren't in the provided context. The model saw those hours in a different fixture but hallucinated them into an unanswerable question. The stays-in-scope and no-hallucination evals both catch this; the grounding label groups them for filtering.

Right / Good / Safe

Every eval has exactly one tag. Not for scoring — for diagnosis. The tag tells you where to look for the fix when something fails. You don't need all three to start — a suite with a single eval is a valid suite.

RIGHT
Is the answer correct?
Correctness relative to ground truth. Did the system answer the question? Did it retrieve the right information? Does the output match the expected reference?
  • Known-answer reference checks
  • Required field presence
  • Addresses the user's actual question
  • Extracted value matches source
Failures point to: prompt, retrieval, training data, or model capability
GOOD
Is the answer well-formed?
Quality beyond correctness. Appropriate tone, format, length, and style. The answer is right — but is it delivered well for this context?
  • Greeting present in support email
  • Response within expected length range
  • Tone appropriate to the customer
  • Cites the source section
Failures point to: prompt instructions, formatting rules, or output post-processing
SAFE
Does it stay in bounds?
Guardrails and constraints. Does the system stay within its defined scope? Does it avoid fabricating, overreaching, or making unauthorized commitments?
  • No hallucinated policy details
  • No invented JSON fields
  • No unauthorized pricing commitments
  • Declines unanswerable questions
Failures point to: system prompt constraints, grounding instructions, or architecture (RAG retrieval scope)

Analytics grouping on top of tags

Labels are free-form strings you add to any eval — labels: [accuracy, grounding]. They're orthogonal to tags: a SAFE eval might carry the grounding label, a RIGHT eval might carry completeness. The HTML report renders them as clickable filter chips so you can isolate all grounding-related evals across the matrix regardless of tag.

One file defines everything.

The config is the practice. It forces you to name what you're building, decide what matters for your use case, and enumerate your evals before you measure anything. Start with one use case and a few evals — the structure grows with you.

config.yaml (rag example)
schema_version: 1

system:
  name: Handbook Q&A Assistant          # what does your system do?
  domain: Employee handbook Q&A with RAG

use_cases:
  - id: handbook_qa
    description: Answer employee policy questions from handbook context

    evals:

      # ── RIGHT evals — correctness ────────────────
      - id: answers-from-context
        tag: right                                # diagnostic lens
        labels: [accuracy]                         # analytics grouping
        type: llm
        description: Answer is supported by the provided context
        pass_criteria: Every claim is directly supported by the excerpt
        fail_criteria: Answer makes claims not in the provided context

      - id: known-answer
        tag: right
        labels: [accuracy]
        type: reference                            # exact string match
        description: Golden fixture exact check

      # ── GOOD evals — quality ─────────────────────
      - id: cites-source
        tag: good
        labels: [completeness]
        type: regex
        pattern: "(?i)(section|handbook|policy|per the)"
        match: true

      # ── SAFE evals — guardrails ──────────────────
      - id: no-hallucination
        tag: safe
        labels: [grounding]
        type: llm
        description: All details traceable to the provided handbook excerpt
        pass_criteria: Every specific claim can be found in the context
        fail_criteria: Any detail appears invented or added beyond the source

      - id: stays-in-scope
        tag: safe
        labels: [grounding]
        type: llm
        description: System declines questions not answerable from context
        pass_criteria: Redirects to HR or acknowledges missing context
        fail_criteria: Fabricates an answer for out-of-scope questions

    fixtures:
      directory: fixtures/golden
      sets:
        smoke: [vacation-policy]
        full: all
      runs: 3                                       # N runs per fixture → distributions

defaults:
  provider: anthropic
  model: claude-haiku-3-5-20251001              # judge model

Four judges. Closed set.

rule
Python rule function
Your own deterministic logic. Register with @rule("eval-id") in evals/rules.py. Gets the raw output string, returns Pass/Fail.
No API calls. Fastest. Best for structural checks that need code.
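As a sketch of what a registered rule might look like, assuming the @rule decorator and the raw-string-in, Pass/Fail-out contract described above (rendered here as a boolean return); the JSON-validity check itself is just an illustrative example, and the decorator stand-in exists only to make the snippet self-contained:

```python
# evals/rules.py -- sketch. A real project would import the decorator
# from fieldtest; this stand-in only makes the snippet self-contained.
import json

def rule(eval_id):                      # stand-in for fieldtest's @rule
    def register(fn):
        fn.eval_id = eval_id
        return fn
    return register

@rule("valid-json")
def valid_json(output: str) -> bool:
    """Pass when the raw output string parses as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False
```

Because rules are plain Python, they are the natural home for structural checks that regex can't express: balanced JSON, field counts, numeric ranges.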
regex
Pattern match
Tests the output against a regex pattern. Set match: true (must contain) or match: false (must not contain).
No API calls. Zero latency. Exact and predictable.
llm
LLM judge
Binary pass/fail via a second model call. You write pass_criteria and fail_criteria as plain English. The judge returns structured JSON with reasoning.
Most flexible. Per-eval model overrides supported — use Haiku for most, Sonnet for subtle judgments.
reference
Reference comparison
Compares output against an expected block in the fixture. Checks contains (required strings) and not_contains (forbidden strings).
No API calls. Ground-truth fixtures only. A skip row is shown when the fixture has no expected block.
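The fixture file layout isn't shown on this page; as a sketch, assuming a YAML fixture, an expected block using the two documented keys might look like this (the strings themselves are invented for illustration):

```yaml
# hypothetical fixture fragment -- only the contains / not_contains
# keys are documented; the layout and values are assumptions
expected:
  contains:
    - "accrues"
    - "per month"
  not_contains:
    - "unlimited PTO"
```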

Every scored run writes five files.

All outputs land in evals/results/[run-id]/. No database. No server. Files you can diff, commit, open in Excel, or drop in a bug report.

📊
*-data.json
Full structured data: every row, every run, full reasoning text, summary stats, delta vs prior run. Machine-readable. CI parses this for gates.
🌐
*-report.html
Self-contained HTML report. Tag health cards, label filter bar, fixture×eval matrix, click-to-expand cell detail with per-run reasoning. Open in any browser.
📝
*-report.md
Markdown report grouped by RIGHT / GOOD / SAFE. Copy-paste into GitHub issues, Notion pages, or Slack. Delta section shows what changed vs last run.
📄
*-data.csv
Flat table: one row per eval×fixture×run. Tag, labels (pipe-separated), type, passed, score, detail, error. Load directly in Excel or Pandas for ad-hoc slicing.
📋
*-report.csv
Spreadsheet-friendly report view: tag health summary and per-eval matrix. Pairs with the markdown report for teams that want CSV over prose.
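A CI gate over *-data.json can be a few lines. The file's schema isn't documented on this page, so the "summary" shape below is a hypothetical stand-in; the pattern is what matters: CI reads the file and applies a threshold that you chose, consistent with the tool-measures-human-judges stance.

```python
# ci_gate.py -- sketch of a CI gate over a *-data.json file.
# The real schema is not documented here; the "summary" keys below
# are a hypothetical stand-in for whatever fieldtest writes.
import json
from pathlib import Path

def gate(data: dict, tag: str, threshold: float) -> bool:
    """True when the pass rate for one tag meets your threshold.
    The threshold is your engineering call, not the tool's."""
    return data["summary"][tag]["pass_rate"] >= threshold   # hypothetical keys

def gate_from_file(path: str, tag: str, threshold: float) -> bool:
    return gate(json.loads(Path(path).read_text()), tag, threshold)
```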

fieldtest CLI reference
# ── Demo and exploration ─────────────────────────────
fieldtest demo                   # interactive example picker
fieldtest demo --example rag --offline    # offline, instant
fieldtest demo --example email   # re-score with rule evals
fieldtest view                   # open latest HTML report in browser
fieldtest view 20260331-a4b2     # open specific run

# ── Start a real project ──────────────────────────────
fieldtest init                   # scaffold evals/ directory
fieldtest init --template rag    # start from a typed template

# ── Evaluate ──────────────────────────────────────────
fieldtest score                  # score all outputs in evals/outputs/
fieldtest score --set smoke      # score a named fixture subset
fieldtest score --concurrency 10 # parallel judge calls

# ── Inspect and clean ─────────────────────────────────
fieldtest list                   # list all scored runs with summary
fieldtest clean                  # delete all results (keeps outputs)
Claude Code users: fieldtest ships with a built-in /optimize skill. It scores your outputs, diagnoses failures from the report, edits your prompt or system code, and re-runs — an automated score-diagnose-fix-rescore loop. Type /optimize in Claude Code inside any fieldtest project.

Opinions we hold.

fieldtest is opinionated. These are the constraints that shape every design decision.

Tool measures. Human judges.
fieldtest does not declare pass or fail. It produces distributions — not verdicts. 83% pass rate on RIGHT evals is information. Whether 83% is acceptable is your engineering call, not the tool's.
One eval per failure mode.
Never bundle multiple concerns into one eval. A bundled eval that passes tells you nothing about which failure modes are absent. Narrow scope = interpretable failures.
Structure before measurement.
The config forces you to name your system, think about what matters, and enumerate failure modes before you run anything. Start with one tag and two evals. The structure scales with you — you can't skip the thinking, but you decide how much to think about first.
Files, not infrastructure.
No database, no server, no dashboard. Results are files. You can diff them, commit them, grep them, open them in any browser. Works from a laptop to CI to enterprise.
N runs capture variance.
A single run tells you almost nothing about a probabilistic system. fieldtest runs each fixture N times and shows distributions. Three runs is the recommended minimum for a real suite; smoke runs can drop to one for speed.
Runner decoupled from scoring.
Your runner writes outputs/[fixture-id]/run-N.txt. The scorer reads those files. They share nothing except the directory format. Re-score without re-running when you improve a judge.

From zero to scored report in two minutes.

1
Install
fieldtest is a plain Python package. No containers, no databases, no accounts required.
$ pip install fieldtest
2
Run the demo
See a real scored report before you write any config. Three examples — email, RAG, extraction — each with real failure modes.
$ fieldtest demo --example rag --offline
$ fieldtest view
3
Scaffold your own project
Run fieldtest init inside any project directory. It creates the evals/ structure and a starter config with inline comments that walk you through every section.
$ fieldtest init --template rag
 Created evals/config.yaml
 Created evals/rules.py
 Created evals/fixtures/golden/
 Created evals/outputs/
 Created evals/results/
→ Edit evals/config.yaml to define your system and evals
→ Add fixture files to evals/fixtures/golden/
→ Write your runner (see README for examples)
4
Run your system. Score. Repeat.
Your runner calls your system and writes outputs. fieldtest score judges them. fieldtest view opens the HTML report. Iterate on prompts, retrieval logic, or constraints — the distribution shows what changed.
$ python evals/runner.py     # you write this — ~30 lines
$ fieldtest score
$ fieldtest view
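A runner can be as small as the sketch below; the only contract is the documented output layout, evals/outputs/[fixture-id]/run-N.txt. The fixture ids and the call_my_system stub are placeholders for your own fixtures and your own system call:

```python
# evals/runner.py -- minimal runner sketch. fieldtest only reads the
# file layout evals/outputs/<fixture-id>/run-<N>.txt; how you produce
# the text is entirely up to you.
from pathlib import Path

FIXTURES = ["vacation-policy", "remote-work"]   # match your fixture ids
RUNS = 3                                        # mirror `runs:` in config.yaml

def call_my_system(fixture_id: str) -> str:
    """Placeholder: call your model or pipeline here."""
    return f"(stub answer for {fixture_id})"

def main() -> None:
    for fixture_id in FIXTURES:
        out_dir = Path("evals/outputs") / fixture_id
        out_dir.mkdir(parents=True, exist_ok=True)
        for n in range(1, RUNS + 1):
            (out_dir / f"run-{n}.txt").write_text(call_my_system(fixture_id))

if __name__ == "__main__":
    main()
```

Because the scorer shares nothing with the runner except this layout, you can swap models, add retries, or batch calls without touching your evals.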