OpenTest.ai — Open Testing for the AI Era

Open Skills · jank-oss

Six local-first quality checks.
Works in 8 AI coding tools.

Drop-in playbooks for Claude Code, Cursor, Antigravity, Codex CLI, GitHub Copilot Chat, Aider, Continue.dev, and the universal AGENTS.md. Pure-local — no API key, no cloud call, no account.

Created & open-sourced by jank.ai — MIT, free forever.

jank auto-fix

Find bugs in the latest changes, auto-fix the obvious/dangerous ones, then offer to fix the rest. Works on code, docs, images, and Office files.

View playbook →

jank_deep 12 dimensions

Full quality scan of ALL recent changes with a lime-green ASCII bar graph per attribute — correctness, security, reliability, performance, maintainability, testing, and more.

View playbook →

jank_risk 4-dim risk

Trace recent changes across four risk dimensions — code blast-radius · GenAI smells (leaked secrets, dupes) · user-frustration · business. Includes "is reverting safer?".

View playbook →

jank_confidence statistical

Will-it-survive-production score. Statistical test-coverage analysis (t-tests, p-values, sampling). Flags where you need more runs and estimates the cost & time to close gaps.

View playbook →

jank_diff smart diff

Intelligent AI diff of an API, document, or local file. Ignores noise (timestamps, dynamic IDs, generated metadata) and surfaces only the changes that matter.

View playbook →
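To make the idea of ignoring noise concrete, here is a minimal Python sketch of a noise-aware comparison of two JSON-like objects. It is an illustration only, not the jank_diff playbook itself: the key names in NOISE_KEYS and the function names are hypothetical, and a real run would let the model decide what counts as noise.

```python
# Hypothetical list of keys treated as noise; not taken from the playbook.
NOISE_KEYS = {"timestamp", "updated_at", "request_id", "etag"}


def strip_noise(value, noise_keys=NOISE_KEYS):
    """Recursively drop noisy keys from dicts, including dicts nested in lists."""
    if isinstance(value, dict):
        return {k: strip_noise(v, noise_keys)
                for k, v in value.items() if k not in noise_keys}
    if isinstance(value, list):
        return [strip_noise(v, noise_keys) for v in value]
    return value


def _walk(old, new, path):
    """Collect (path, old, new) tuples for values that differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        changes = []
        for key in sorted(old.keys() | new.keys()):
            # Simplification: a missing key and an explicit null compare equal.
            changes += _walk(old.get(key), new.get(key), f"{path}/{key}")
        return changes
    return [] if old == new else [(path, old, new)]


def meaningful_changes(old, new):
    """Diff two JSON-like objects after removing noisy keys from both."""
    return _walk(strip_noise(old), strip_noise(new), "")


before = {"price": 10, "timestamp": "09:00", "meta": {"request_id": "a1", "name": "cart"}}
after = {"price": 12, "timestamp": "09:05", "meta": {"request_id": "b2", "name": "cart"}}
changes = meaningful_changes(before, after)
```

In this example only the price differs once the timestamp and request ID are dropped, which is the kind of result a smart diff is meant to surface.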

jank_stats cumulative

Cumulative metrics — tokens, cost, fixes by category, plus a high-confidence estimate of developer time + money saved. Per-session and lifetime.

View playbook →

Setup · pick your tool

Eight clients. One source of truth.

Get the bundle once, then drop it into whichever tool you use.

Get the bundle — once

Pick whichever you prefer. Every per-client setup below assumes the bundle is at /tmp/opentestai.

Option A · Clone

git clone https://github.com/jarbon/opentestai /tmp/opentestai

Option B · Download zip

Download .zip · unzip to /tmp/opentestai

Claude Code /jank

First-class slash commands — runs as part of Claude's normal conversation loop.

1. Copy the bundle into Claude's plugin dir

cp -R /tmp/opentestai/jank-oss ~/.claude/plugins/jank-oss

2. Restart Claude Code, then invoke

/jank

Cursor @jank

Rules inject the playbook into your Composer prompt — runs inside Cursor's agent loop.

1. Copy the Cursor rules into your repo

cp -R /tmp/opentestai/jank-oss/dist/cursor/.cursor ./your-repo/

2. Mention in Cursor chat

@jank find bugs in my recent changes

Antigravity /jank

SKILL.md packages are first-class — they show up in the command palette and chain with other agents.

1. Drop skill folders into Antigravity

cp -R /tmp/opentestai/jank-oss/dist/antigravity/skills/* ~/.antigravity/skills/

2. Open the command palette & invoke

/jank

Codex CLI codex /jank

Terminal-native — pipe output into shell tools, drop it into a CI step, exit codes work like any Unix command.

1. Copy skill packages into Codex

cp -R /tmp/opentestai/jank-oss/dist/codex/skills/* ~/.codex/skills/

2. Run from any project

codex /jank

GitHub Copilot Chat natural lang.

One file in .github/ and your whole team gets the playbooks via natural-language requests.

1. Drop instructions in your repo's .github/ (create the directory first if it does not exist)

mkdir -p ./your-repo/.github
cp /tmp/opentestai/jank-oss/dist/copilot/.github/copilot-instructions.md ./your-repo/.github/

2. Ask Copilot in natural language

"do a jank scan on the recent changes"

Aider --prompt-file

Scope a session to one quality dimension via --prompt-file — perfect for batch fixes.

1. Copy the prompts into your repo

cp -R /tmp/opentestai/jank-oss/dist/aider/.aider ./your-repo/

2. Run a scoped session

aider --prompt-file .aider/prompts/jank.md

Continue.dev /jank

Open-source extension for VS Code + JetBrains — your whole team gets /jank without per-IDE setup.

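1. Copy the config into your repo's .continue/ (cp overwrites an existing config.json, so merge by hand if you already have one)

mkdir -p ./your-repo/.continue
cp /tmp/opentestai/jank-oss/dist/continue/.continue/config.json ./your-repo/.continue/

2. Invoke in the Continue chat

/jank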

Universal · AGENTS.md natural lang.

Cline, Cody, Roo, Sweep, and any agentic tool that reads AGENTS.md at the repo root — one file, infinite tools.

1. Drop AGENTS.md into your repo root

cp /tmp/opentestai/jank-oss/dist/agents-md/AGENTS.md ./your-repo/

2. Ask your tool in natural language

"do a jank scan"

The AGENTS.md routing teaches the model which playbook to follow.

One source of truth in commands/*.md — edit a playbook, run node scripts/build-all.mjs, and every client's package re-emits. View build script →

Created & open-sourced by jank.ai

The six skills above were built and open-sourced by jank.ai — MIT, no strings.

Beyond the OSS commands, jank.ai also spins up cloud testing: /jank_cloud runs your tests in a real cloud browser · /jank_watch sets up smart-diff URL monitoring with persona-judged better/worse verdicts + dashboards · /jank_loop autonomously hunts & fixes in a loop · /jank_eyes sends your change to real human eyes for review.

Visit jank.ai →

The complete specs

The two formats, in full.

Every field is self-explaining. Every payload is paste-ready. v1 is shipping — comments, critiques, and PRs are open.

The artifacts we test on are pre-AI

Testing's wire formats predate large language models. Gherkin and JUnit XML were written for parsers, not readers. TestRail and Xray store steps as styled HTML inside relational rows. None of these survive a copy-paste into a chat window. Bugs live in Jira tickets, free-text comments, and screenshots scattered across Slack. There is no portable JSON for "this is a bug" or "this is a test case" — so every agent reinvents one, and none of them talk to each other.

AI changes who reads and writes them

The producer is increasingly an LLM. The consumer is increasingly an LLM. The handoffs — "fix this bug", "run this test", "explain why it failed", "route to the right engineer" — happen between agents, not just humans. The format needs to be self-describing (no schema lookup), reasoning-aware (the model's thinking is part of the payload), and action-ready (the next step is embedded, not external).

A standard means tools interoperate

Once a bug or test case is a portable JSON object, every link in the chain — finder, runner, triage, fix — can be swapped. A bug found by one agent can be fixed by another. A test case generated by an LLM can be replayed by Playwright, Selenium, Cypress, or a human. Without a standard, every vendor is an island. With one, the testing layer becomes a composable substrate for AI.

Read this first

Most fields are optional. Less is more.

The two formats below show everything you could include. In practice, a good bug or test case should specify as few fields as possible — only what's genuinely informative. Only two fields are required per object: bug_title for a bug, title for a test case. Everything else is opt-in.

Why: in the AI era, over-specified payloads are more brittle, not less. They lie about precision the model doesn't have, they bloat tokens, and they invite the next model down the chain to fight the schema instead of solving the problem. Specify only what changes a decision. The model can always be asked to elaborate; it can rarely be asked to forget.

Minimal valid bug

{ "bug_title": "Login form submits on Enter even when password is empty" }

Minimal valid test case

{ "title": "User can complete checkout with a saved card" }

Add fields only when a downstream consumer (human, model, runner) genuinely benefits from them. Don't fill them in just because the schema mentions them.
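As a rough illustration of how little a consumer has to check, the sketch below accepts any JSON object that carries the one required field and ignores everything else. The function name and the REQUIRED_FIELD mapping are hypothetical, not part of the spec, and the declarative one-line checks shown later in the test case format are not covered here.

```python
import json

# The single required field per object kind, as stated in the text above.
REQUIRED_FIELD = {"bug": "bug_title", "testcase": "title"}


def is_minimally_valid(payload: str, kind: str) -> bool:
    """Return True if payload is a JSON object with a non-empty required field."""
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    value = obj.get(REQUIRED_FIELD[kind])
    return isinstance(value, str) and value.strip() != ""


bug_ok = is_minimally_valid(
    '{ "bug_title": "Login form submits on Enter even when password is empty" }', "bug")
case_ok = is_minimally_valid(
    '{ "title": "User can complete checkout with a saved card" }', "testcase")
```

Both minimal payloads from this section pass. Extra fields are neither required nor rejected, which matches the "everything else is opt-in" rule.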

otai/bug · v1

Bug format

A bug is a finding with a confidence, a reason, evidence, and a paste-ready fix prompt. One JSON object, ready to ship to Jira, GitHub, an AI coding assistant, or another agent.

{
  "summary": "",
  "score": 0-100,                          // 100 = pristine
  "issues": [
    {
      "bug_title": "...",                  // short title
      "bug_type": ["primary", "sub"],      // 1 of 8 categories
      "bug_confidence": 1-10,              // is it real?
      "bug_priority":   1-10,              // how urgent?
      "interestingness": 1-10,             // notable to readers?
      "bug_reasoning_why_a_bug": "...",    // the case for it
      "bug_reasoning_why_not_a_bug": "...", // counter-argument
      "suggested_fix": "...",
      "bug_why_fix": "...",                // user/business impact
      "what_type_of_engineer_to_route_issue_to": "...",
      "prompt_to_fix_this_issue": "...",   // paste-ready AI fix prompt
      "possibly_relevant_page_console_text": "...",
      "possibly_relevant_network_call": "...",
      "possibly_relevant_page_text": "...",
      "possibly_relevant_page_elements": "...",
      "tester": "...", "byline": "..."
    }
  ]
}

security · accessibility · performance · usability · visual · content · reliability · privacy

See real bug examples on GitHub

5 hand-crafted payloads · security · a11y · perf · content · full report

Why these fields — and why AI-first

Self-explaining field names. bug_title, suggested_fix, bug_why_fix: no codes, no abbreviations. A model reading any payload knows what each field means without a schema doc in its context window, which keeps the schema documentation out of the prompt and saves tokens.
Reasoning is part of the payload. bug_reasoning_why_a_bug + bug_reasoning_why_not_a_bug capture the model's thinking on both sides. Pre-AI bug trackers only stored the verdict; AI-first trackers store the case, because the verdict is sometimes wrong and a human (or another AI) needs to audit it.
Confidence and priority are 1–10 integers, not floats or labels. LLMs calibrate well on small integer scales. They struggle with probability floats (overconfident) and with arbitrary enums ("Sev-2"). 1–10 means downstream filters like "confidence ≥ 7" actually work across models.
A fix prompt ships with the bug. prompt_to_fix_this_issue is a paste-ready instruction for an AI coding assistant. The format admits that the next step after "bug found" is "AI fixes it" — and ships the prompt that closes the loop, instead of leaving the user to compose one.
Routing is a field, not a workflow. what_type_of_engineer_to_route_issue_to lets agents triage and dispatch without external rules. A swarm of fix-bots can self-organize on this field alone.
Evidence is loose strings, not structured logs. Console text, a network call URL, the relevant HTML — pasted in raw. Easier to produce, easier to read, no schema for the schema. The model that wrote the bug saw these as text; the model that consumes it sees them the same way.
Surface, don't silently skip. Any finding with confidence ≥ 4 is included. The schema's job is to make borderline cases legible (with reasoning + counter-argument), not to hide them. Suppression is downstream work.
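The filtering and routing ideas in the points above can be sketched in a few lines of Python. This is one possible consumer under stated assumptions, not a reference implementation: the triage function, the default threshold of 7, the "unrouted" bucket, and the sample issues are all hypothetical, and an issue with no bug_confidence is treated as 0 and dropped.

```python
from collections import defaultdict


def triage(report: dict, min_confidence: int = 7) -> dict:
    """Group issues at or above min_confidence by their routing field.

    Within each group, issues are sorted by bug_priority, highest first.
    """
    routed = defaultdict(list)
    for issue in report.get("issues", []):
        # Assumption: a missing confidence counts as 0, so the issue is skipped.
        if issue.get("bug_confidence", 0) >= min_confidence:
            who = issue.get("what_type_of_engineer_to_route_issue_to", "unrouted")
            routed[who].append(issue)
    for issues in routed.values():
        issues.sort(key=lambda i: i.get("bug_priority", 0), reverse=True)
    return dict(routed)


# Hypothetical report, for illustration only.
report = {
    "issues": [
        {"bug_title": "Button label clipped", "bug_confidence": 9, "bug_priority": 3,
         "what_type_of_engineer_to_route_issue_to": "frontend"},
        {"bug_title": "Form loses input on resize", "bug_confidence": 8, "bug_priority": 9,
         "what_type_of_engineer_to_route_issue_to": "frontend"},
        {"bug_title": "Token visible in URL", "bug_confidence": 5, "bug_priority": 10,
         "what_type_of_engineer_to_route_issue_to": "security"},
        {"bug_title": "Footer year is stale", "bug_confidence": 7},
    ]
}
queues = triage(report)
```

Because suppression is downstream work, the threshold lives in the consumer rather than in the payload: the same report can be re-triaged with min_confidence=4 to see every surfaced finding.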

otai/testcase · v1

Test case format

A test case is an intent ("test that signup works"), a sequence of steps, and an assertion. Steps describe what to do, why a tester would do it, and what should be true after — in language a human or an AI runner can execute.

{
  "title": "<short imperative title>",
  "goal":  "<what success looks like>",
  "steps": [
    {
      "description": "what the user does",
      "thinking":    "what a tester reasons first",
      "action":      "navigate | click | fill | select | press | wait | assert",
      "target":      "<visible text · aria · css>",   // prefer visible text
      "value":       "<for fill/select/press/navigate>",
      "expected":    "<what is true after this step>"
    }
  ]
}

// or a declarative check (no steps, no model needed at run time):
{ "type": "console-clean" }                       // no console errors
{ "type": "no-broken-link" }                      // no 4xx/5xx links
{ "type": "network-clean" }                       // no failed XHR
{ "type": "selector-exists", "selector": "..." }
{ "type": "selector-missing", "selector": "..." }
{ "type": "text-contains", "text": "..." }
{ "type": "text-missing",  "text": "..." }
{ "type": "alt-text" }                            // every img has alt
{ "type": "contrast-min", "min": 4.5 }            // WCAG AA

navigate · click · fill · select · press · wait · assert

See real test case examples on GitHub

6 hand-crafted files · 3 multi-step flows · 3 declarative one-liners
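A small lint pass shows how a tool might read the step fields before handing a test case to a runner. This Python sketch is an assumption-laden illustration, not part of the spec: check_steps and its messages are hypothetical, the verb list comes from the action field above, and the set of actions that take a value follows the comment on the value field. Since most fields are optional, the results are best treated as warnings rather than errors.

```python
# The seven verbs listed in the "action" field of the format.
ACTIONS = {"navigate", "click", "fill", "select", "press", "wait", "assert"}
# Actions that the "value" field is documented for.
NEEDS_VALUE = {"fill", "select", "press", "navigate"}


def check_steps(test_case: dict) -> list[str]:
    """Return human-readable warnings for the steps of a test case."""
    problems = []
    for n, step in enumerate(test_case.get("steps", []), start=1):
        action = step.get("action")
        if action is not None and action not in ACTIONS:
            problems.append(f"step {n}: unknown action {action!r}")
        if action in NEEDS_VALUE and not step.get("value"):
            problems.append(f"step {n}: {action} has no value")
        if not step.get("expected"):
            problems.append(f"step {n}: no expected")
    return problems


# Hypothetical test case, for illustration only.
test_case = {
    "title": "User can sign up",
    "steps": [
        {"action": "navigate", "value": "https://example.com/signup",
         "expected": "Sign-up form is visible"},
        {"action": "hover", "target": "Sign up"},
    ],
}
warnings = check_steps(test_case)
```

Here the second step is flagged twice: "hover" is outside the seven-verb vocabulary, and the step has no expected.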

Why these fields — and why AI-first

Intent lives next to action. thinking is a first-class field. Gherkin lost this — "Given/When/Then" tells you what happens but not why a tester would do that. When an LLM regenerates a step (because the UI changed), it needs the original intent, not the original click target.
Targets are visible text, not selectors, by default. target: "Sign up" beats target: "#btn-3 > span.cta" in every way that matters. LLMs see screenshots; humans read screens; CSS selectors break on every redesign. Visible text is the most stable, most portable, most legible address for an interactive element — and it's what the model already has.
One small action vocabulary. Seven verbs cover everything a browser test does: navigate · click · fill · select · press · wait · assert. Runners (Playwright, Selenium, a vision-LLM, a human) can each implement the same seven primitives in their own way, and a test case travels between them without translation.
Every step has an expected. Steps without assertions are debt. Although expected is optional like every field other than title, the format asks you to say what should be true after each action, so an LLM regenerating a test from intent can verify that the test still works, not just that it compiles.
Declarative checks are first-class. Some "tests" aren't flows — "no console errors", "every image has alt text", "contrast ≥ 4.5". Pre-AI tools forced you to write a script for these. Here they're a one-line JSON object that any runner can interpret. AI doesn't need to generate code to express a check.
Goal-first, not script-first. goal — "user successfully completes checkout" — survives UI changes. The steps don't. When the UI shifts, an AI agent can regenerate the steps from the goal. Pre-AI test suites couldn't do this; AI-first suites have to.
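To show why declarative checks need no generated code, here is a minimal Python sketch that interprets three of the check types against a static HTML string using only the standard library. It is a sketch under assumptions, not a reference runner: run_check and _PageScan are hypothetical names, an empty alt attribute counts as present, text inside script or style tags is not filtered out, and checks that need a live browser (console-clean, network-clean, contrast-min) are out of scope.

```python
from html.parser import HTMLParser


class _PageScan(HTMLParser):
    """Collect text and count <img> tags that lack an alt attribute."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.imgs_without_alt = 0

    def handle_starttag(self, tag, attrs):
        # Assumption: alt="" (decorative image) counts as having alt text.
        if tag == "img" and "alt" not in dict(attrs):
            self.imgs_without_alt += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def run_check(check: dict, html: str) -> bool:
    """Evaluate one declarative check against a page's HTML."""
    scan = _PageScan()
    scan.feed(html)
    scan.close()
    # Join text nodes and collapse runs of whitespace.
    text = " ".join(" ".join(scan.text_parts).split())
    kind = check["type"]
    if kind == "text-contains":
        return check["text"] in text
    if kind == "text-missing":
        return check["text"] not in text
    if kind == "alt-text":
        return scan.imgs_without_alt == 0
    raise ValueError(f"unsupported check type: {kind}")


page = '<h1>Sign up</h1><img src="logo.png" alt="Logo"><img src="hero.png">'
has_heading = run_check({"type": "text-contains", "text": "Sign up"}, page)
all_alt = run_check({"type": "alt-text"}, page)
```

On this sample page the heading check passes and the alt-text check fails because the second image has no alt attribute. A different runner could implement the same check types in its own way without changing the JSON.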

The point

Standards are how an industry stops re-inventing the same thing in private. Email had RFC 822. Browsers had HTML 2.0 (RFC 1866). Continuous integration had JUnit XML. The AI-testing era needs its own: a bug format and a test case format that LLMs natively read, write, and reason about, portable across runners, finders, fixers, and humans. We're shipping v1 of both. Comments, critiques, and PRs are open.

View on GitHub → Comment on the spec

Open skills + formats
for AI-era testing.

We published the Manifesto for
Confidence Engineering

Join the OpenTest.AI Community
