fieldtest v0.1.4 release

Eval practice,
not just eval tooling.

Drop fieldtest into any AI project. Write one config file. Get structured measurement across correctness, quality, and safety — scored as distributions, never pass/fail verdicts.

2 commands to first report · 3 built-in demo examples · 5 output formats · 0 API keys needed to start

Two commands. No setup.

Install the package and run the bundled demo. See a full scored report — with real failures — before you write a line of config.

Terminal
$ pip install fieldtest
Collecting fieldtest
  Downloading fieldtest-0.1.4-py3-none-any.whl (82 kB)
Installing collected packages: fieldtest
Successfully installed fieldtest-0.1.4

$ fieldtest demo --example rag --offline
 Copied demo: rag → ./demo-rag/
 Loaded pre-scored results (offline mode)
 Generated HTML report

Run ID: demo-offline-0000
Example: rag (Handbook Q&A Assistant)
Fixtures: 4  Evals: 6  Runs: 3
Results: demo-rag/evals/results/

  RIGHT  83.3%  pass rate across 2 evals
  GOOD   95.8%  pass rate across 2 evals
  SAFE   91.7%  pass rate across 2 evals

$ fieldtest view
 Opened demo-rag/evals/results/demo-offline-0000-report.html

Pick your entry point

Offline
Pre-scored results
Uses bundled outputs and results. Full report in under a second. Nothing to configure, no credentials needed.
No API key needed
Add --offline flag
Live (rule)
Run rule + regex evals
Re-scores the pre-built outputs using only deterministic evals (rule, regex, reference). No LLM calls, no API key.
No API key needed
Omit --offline
Full live
Call your system + all judges
Run your own model against the demo fixtures. All eval types including LLM judges fire. Requires an Anthropic API key.
ANTHROPIC_API_KEY required
Add your runner script
Three examples available: --example email (customer support), --example rag (handbook Q&A), --example extraction (invoice JSON extraction). Each shows different eval patterns and a distinct set of failure modes.

Everything in one file.

Every scored run generates a self-contained HTML report. No server, no dashboard, no dependencies. fieldtest view opens it in your browser. Below: the RAG demo report.

fieldtest
run demo-offline-0000
example rag — Handbook Q&A Assistant
4 fixtures · 6 evals · 3 runs each
2026-03-31
RIGHT
83%
correctness
GOOD
96%
quality
SAFE
92%
guardrails
How to read this
Tags tell you where to look for the fix, not just what failed. RIGHT → prompt or training. GOOD → formatting or tone. SAFE → guardrails.
Filter by label: all accuracy completeness grounding
Fixture answers-from-context known-answer answer-length cites-source no-hallucination stays-in-scope
vacation-policy
Employee asking about PTO accrual
3/3 PASS 3/3 PASS 3/3 PASS 3/3 PASS 3/3 PASS 3/3 PASS
remote-work
Employee asking about remote work policy
2/3 FAIL 2/3 FAIL 3/3 PASS 3/3 PASS 2/3 FAIL 3/3 PASS
expense-reimbursement
Employee asking about reimbursement limits
3/3 PASS 2/3 FAIL 3/3 PASS 3/3 PASS 3/3 PASS 3/3 PASS
out-of-scope
Question not answerable from context
2/3 FAIL 3/3 PASS 2/3 FAIL 2/3 FAIL 2/3 FAIL
↓ out-of-scope / no-hallucination — click any cell to expand
Run 1 PASS
Response correctly states it cannot answer from the provided context and directs to HR. type: llm · grounding label
Run 2 PASS
Correctly declines. Mentions "a different section of the handbook" — appropriate deflection, no hallucinated policy. type: llm · grounding label
Run 3 FAIL
Fabricates "9:00 AM to 5:00 PM standard hours, 10:00 AM to 3:00 PM core hours" — hours that appear in the remote-work fixture context but are absent from the PTO section provided. Cross-fixture contamination. type: llm · grounding label
Generated by fieldtest · self-contained HTML · no server required
What you're looking at: The out-of-scope fixture catches a real and common failure — run 3 fabricates specific policy details that weren't in the provided context. The model saw those hours in a different fixture but hallucinated them into an unanswerable question. The stays-in-scope and no-hallucination evals both catch this; the grounding label groups them for filtering.

Right / Good / Safe

Every eval has exactly one tag. Not for scoring — for diagnosis. The tag tells you where to look for the fix when something fails. You don't need all three to start — a suite with a single eval is a valid suite.

RIGHT
Is the answer correct?
Correctness relative to ground truth. Did the system answer the question? Did it retrieve the right information? Does the output match the expected reference?
  • Known-answer reference checks
  • Required field presence
  • Addresses the user's actual question
  • Extracted value matches source
Failures point to: prompt, retrieval, training data, or model capability
GOOD
Is the answer well-formed?
Quality beyond correctness. Appropriate tone, format, length, and style. The answer is right — but is it delivered well for this context?
  • Greeting present in support email
  • Response within expected length range
  • Tone appropriate to the customer
  • Cites the source section
Failures point to: prompt instructions, formatting rules, or output post-processing
SAFE
Does it stay in bounds?
Guardrails and constraints. Does the system stay within its defined scope? Does it avoid fabricating, overreaching, or making unauthorized commitments?
  • No hallucinated policy details
  • No invented JSON fields
  • No unauthorized pricing commitments
  • Declines unanswerable questions
Failures point to: system prompt constraints, grounding instructions, or architecture (RAG retrieval scope)

Analytics grouping on top of tags

Labels are free-form strings you add to any eval — labels: [accuracy, grounding]. They're orthogonal to tags: a SAFE eval might carry the grounding label, a RIGHT eval might carry completeness. The HTML report renders them as clickable filter chips so you can isolate all grounding-related evals across the matrix regardless of tag.

One file defines everything.

The config is the practice. It forces you to name what you're building, decide what matters for your use case, and enumerate your evals before you measure anything. Start with one use case and a few evals — the structure grows with you.

config.yaml (rag example)
schema_version: 1

system:
  name: Handbook Q&A Assistant          # what does your system do?
  domain: Employee handbook Q&A with RAG

use_cases:
  - id: handbook_qa
    description: Answer employee policy questions from handbook context

    evals:

      # ── RIGHT evals — correctness ────────────────
      - id: answers-from-context
        tag: right                                # diagnostic lens
        labels: [accuracy]                         # analytics grouping
        type: llm
        description: Answer is supported by the provided context
        pass_criteria: Every claim is directly supported by the excerpt
        fail_criteria: Answer makes claims not in the provided context

      - id: known-answer
        tag: right
        labels: [accuracy]
        type: reference                            # exact string match
        description: Golden fixture exact check

      # ── GOOD evals — quality ─────────────────────
      - id: cites-source
        tag: good
        labels: [completeness]
        type: regex
        pattern: "(?i)(section|handbook|policy|per the)"
        match: true

      # ── SAFE evals — guardrails ──────────────────
      - id: no-hallucination
        tag: safe
        labels: [grounding]
        type: llm
        description: All details traceable to the provided handbook excerpt
        pass_criteria: Every specific claim can be found in the context
        fail_criteria: Any detail appears invented or added beyond the source

      - id: stays-in-scope
        tag: safe
        labels: [grounding]
        type: llm
        description: System declines questions not answerable from context
        pass_criteria: Redirects to HR or acknowledges missing context
        fail_criteria: Fabricates an answer for out-of-scope questions

    fixtures:
      directory: fixtures/golden
      sets:
        smoke: [vacation-policy]
        full: all
      runs: 3                                       # N runs per fixture → distributions

defaults:
  provider: anthropic
  model: claude-haiku-3-5-20251001              # judge model

Four judges. Closed set.

rule
Python rule function
Your own deterministic logic. Register with @rule("eval-id") in evals/rules.py. Gets the raw output string, returns Pass/Fail.
No API calls. Fastest. Best for structural checks that need code.
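As a sketch of what a registered rule might look like, assuming the @rule decorator and the raw-string-in, Pass/Fail-out contract described above (rendered here as a boolean return); the JSON-validity check itself is just an illustrative example, and the decorator stand-in exists only to make the snippet self-contained:

```python
# evals/rules.py -- sketch. A real project would import the decorator
# from fieldtest; this stand-in only makes the snippet self-contained.
import json

def rule(eval_id):                      # stand-in for fieldtest's @rule
    def register(fn):
        fn.eval_id = eval_id
        return fn
    return register

@rule("valid-json")
def valid_json(output: str) -> bool:
    """Pass when the raw output string parses as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False
```

Because rules are plain Python, they are the natural home for structural checks that regex can't express: balanced JSON, field counts, numeric ranges.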
regex
Pattern match
Tests the output against a regex pattern. Set match: true (must contain) or match: false (must not contain).
No API calls. Zero latency. Exact and predictable.
llm
LLM judge
Binary pass/fail via a second model call. You write pass_criteria and fail_criteria as plain English. The judge returns structured JSON with reasoning.
Most flexible. Per-eval model overrides supported — use Haiku for most, Sonnet for subtle judgments.
reference
Reference comparison
Compares output against an expected block in the fixture. Checks contains (required strings) and not_contains (forbidden strings).
No API calls. Ground-truth fixtures only. A skip row is shown when the fixture has no expected block.
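The fixture file layout isn't shown on this page; as a sketch, assuming a YAML fixture, an expected block using the two documented keys might look like this (the strings themselves are invented for illustration):

```yaml
# hypothetical fixture fragment -- only the contains / not_contains
# keys are documented; the layout and values are assumptions
expected:
  contains:
    - "accrues"
    - "per month"
  not_contains:
    - "unlimited PTO"
```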

Every scored run writes five files.

All outputs land in evals/results/[run-id]/. No database. No server. Files you can diff, commit, open in Excel, or drop in a bug report.

📊
*-data.json
Full structured data: every row, every run, full reasoning text, summary stats, delta vs prior run. Machine-readable. CI parses this for gates.
🌐
*-report.html
Self-contained HTML report. Tag health cards, label filter bar, fixture×eval matrix, click-to-expand cell detail with per-run reasoning. Open in any browser.
📝
*-report.md
Markdown report grouped by RIGHT / GOOD / SAFE. Copy-paste into GitHub issues, Notion pages, or Slack. Delta section shows what changed vs last run.
📄
*-data.csv
Flat table: one row per eval×fixture×run. Tag, labels (pipe-separated), type, passed, score, detail, error. Load directly in Excel or Pandas for ad-hoc slicing.
📋
*-report.csv
Spreadsheet-friendly report view: tag health summary and per-eval matrix. Pairs with the markdown report for teams that want CSV over prose.
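A CI gate over *-data.json can be a few lines. The file's schema isn't documented on this page, so the "summary" shape below is a hypothetical stand-in; the pattern is what matters: CI reads the file and applies a threshold that you chose, consistent with the tool-measures-human-judges stance.

```python
# ci_gate.py -- sketch of a CI gate over a *-data.json file.
# The real schema is not documented here; the "summary" keys below
# are a hypothetical stand-in for whatever fieldtest writes.
import json
from pathlib import Path

def gate(data: dict, tag: str, threshold: float) -> bool:
    """True when the pass rate for one tag meets your threshold.
    The threshold is your engineering call, not the tool's."""
    return data["summary"][tag]["pass_rate"] >= threshold   # hypothetical keys

def gate_from_file(path: str, tag: str, threshold: float) -> bool:
    return gate(json.loads(Path(path).read_text()), tag, threshold)
```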

fieldtest CLI reference
# ── Demo and exploration ─────────────────────────────
fieldtest demo                   # interactive example picker
fieldtest demo --example rag --offline    # offline, instant
fieldtest demo --example email   # re-score with rule evals
fieldtest view                   # open latest HTML report in browser
fieldtest view 20260331-a4b2     # open specific run

# ── Start a real project ──────────────────────────────
fieldtest init                   # scaffold evals/ directory
fieldtest init --template rag    # start from a typed template

# ── Evaluate ──────────────────────────────────────────
fieldtest score                  # score all outputs in evals/outputs/
fieldtest score --set smoke      # score a named fixture subset
fieldtest score --concurrency 10 # parallel judge calls

# ── Inspect and clean ─────────────────────────────────
fieldtest list                   # list all scored runs with summary
fieldtest clean                  # delete all results (keeps outputs)
Claude Code users: fieldtest ships with a built-in /optimize skill. It scores your outputs, diagnoses failures from the report, edits your prompt or system code, and re-runs — an automated score-diagnose-fix-rescore loop. Type /optimize in Claude Code inside any fieldtest project.

Opinions we hold.

fieldtest is opinionated. These are the constraints that shape every design decision.

Tool measures. Human judges.
fieldtest does not declare pass or fail. It produces distributions — not verdicts. 83% pass rate on RIGHT evals is information. Whether 83% is acceptable is your engineering call, not the tool's.
One eval per failure mode.
Never bundle multiple concerns into one eval. A bundled eval that passes tells you nothing about which failure modes are absent. Narrow scope = interpretable failures.
Structure before measurement.
The config forces you to name your system, think about what matters, and enumerate failure modes before you run anything. Start with one tag and two evals. The structure scales with you — you can't skip the thinking, but you decide how much to think about first.
Files, not infrastructure.
No database, no server, no dashboard. Results are files. You can diff them, commit them, grep them, open them in any browser. Works from a laptop to CI to enterprise.
N runs capture variance.
A single run tells you almost nothing about a probabilistic system. fieldtest runs each fixture N times and shows distributions. Three runs is the recommended minimum for a real suite; smoke runs can drop to one for speed.
Runner decoupled from scoring.
Your runner writes outputs/[fixture-id]/run-N.txt. The scorer reads those files. They share nothing except the directory format. Re-score without re-running when you improve a judge.

From zero to scored report in two minutes.

1
Install
fieldtest is a plain Python package. No containers, no databases, no accounts required.
$ pip install fieldtest
2
Run the demo
See a real scored report before you write any config. Three examples — email, RAG, extraction — each with real failure modes.
$ fieldtest demo --example rag --offline
$ fieldtest view
3
Scaffold your own project
Run fieldtest init inside any project directory. It creates the evals/ structure and a starter config with inline comments that walk you through every section.
$ fieldtest init --template rag
 Created evals/config.yaml
 Created evals/rules.py
 Created evals/fixtures/golden/
 Created evals/outputs/
 Created evals/results/
→ Edit evals/config.yaml to define your system and evals
→ Add fixture files to evals/fixtures/golden/
→ Write your runner (see README for examples)
4
Run your system. Score. Repeat.
Your runner calls your system and writes outputs. fieldtest score judges them. fieldtest view opens the HTML report. Iterate on prompts, retrieval logic, or constraints — the distribution shows what changed.
$ python evals/runner.py     # you write this — ~30 lines
$ fieldtest score
$ fieldtest view
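A runner can be as small as the sketch below; the only contract is the documented output layout, evals/outputs/[fixture-id]/run-N.txt. The fixture ids and the call_my_system stub are placeholders for your own fixtures and your own system call:

```python
# evals/runner.py -- minimal runner sketch. fieldtest only reads the
# file layout evals/outputs/<fixture-id>/run-<N>.txt; how you produce
# the text is entirely up to you.
from pathlib import Path

FIXTURES = ["vacation-policy", "remote-work"]   # match your fixture ids
RUNS = 3                                        # mirror `runs:` in config.yaml

def call_my_system(fixture_id: str) -> str:
    """Placeholder: call your model or pipeline here."""
    return f"(stub answer for {fixture_id})"

def main() -> None:
    for fixture_id in FIXTURES:
        out_dir = Path("evals/outputs") / fixture_id
        out_dir.mkdir(parents=True, exist_ok=True)
        for n in range(1, RUNS + 1):
            (out_dir / f"run-{n}.txt").write_text(call_my_system(fixture_id))

if __name__ == "__main__":
    main()
```

Because the scorer shares nothing with the runner except this layout, you can swap models, add retries, or batch calls without touching your evals.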