The Problem
LLM features are hard to test. Traditional unit tests check exact outputs, but LLMs are non-deterministic. You need statistical assertions — accuracy, precision, recall, F1, toxicity — to know if your AI feature is good enough to ship.
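To make "statistical assertion" concrete, here is a minimal, library-independent sketch of how metrics like accuracy, precision, recall, and F1 are computed from predicted vs. ground-truth labels. The function name and shape are illustrative, not EvalSense APIs:

```javascript
// Sketch: classification metrics over predicted vs. ground-truth labels.
// These are the quantities a statistical assertion compares to a threshold.
function classificationMetrics(predicted, truth, positive = "positive") {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (let i = 0; i < truth.length; i++) {
    const p = predicted[i] === positive;
    const t = truth[i] === positive;
    if (p && t) tp++;
    else if (p && !t) fp++;
    else if (!p && t) fn++;
    else tn++;
  }
  const accuracy = (tp + tn) / truth.length;
  const precision = tp / ((tp + fp) || 1); // guard against empty denominator
  const recall = tp / ((tp + fn) || 1);
  const f1 = (2 * precision * recall) / ((precision + recall) || 1);
  return { accuracy, precision, recall, f1 };
}

const truth     = ["positive", "positive", "negative", "negative"];
const predicted = ["positive", "negative", "negative", "positive"];
console.log(classificationMetrics(predicted, truth));
// → { accuracy: 0.5, precision: 0.5, recall: 0.5, f1: 0.5 }
```

A single exact-match assertion would fail randomly on non-deterministic output; thresholds over aggregates like these stay stable across runs.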
How It Works
- Create an eval file — Write a test using the familiar describe/test API. Define your quality bar with statistical assertions.
- Run it — `npx evalsense run quality.eval.js`
- Get pass/fail results — Each assertion either passes or fails, just like a unit test. Ship with confidence or iterate.
Installation
```shell
npm install --save-dev evalsense
```
Usage Example
```javascript
import { readFileSync } from "node:fs";
import { describe, evalTest, expectStats } from "evalsense";

describe("answer quality", () => {
  evalTest("toxicity detection", async () => {
    // generateAnswersDataset and toxicity are user-defined helpers
    const answers = await generateAnswersDataset(testQuestions);
    const toxicityScore = await toxicity(answers);

    // Quality bar: at least 50% of answers score below 0.5 toxicity
    expectStats(toxicityScore)
      .field("score")
      .percentageBelow(0.5)
      .toBeAtLeast(0.5);
  });

  evalTest("correctness score", async () => {
    const answers = await generateAnswersDataset(testQuestions);
    const groundTruth = JSON.parse(readFileSync("truth-dataset.json", "utf-8"));

    expectStats(answers, groundTruth)
      .field("label")
      .accuracy.toBeAtLeast(0.9)
      .precision("positive").toBeAtLeast(0.7)
      .recall("positive").toBeAtLeast(0.7)
      .displayConfusionMatrix();
  });
});
```
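The example above assumes two user-supplied helpers, `generateAnswersDataset` and `toxicity`, which are not part of EvalSense. As a hypothetical sketch, a dataset generator might look like the following; the model call is passed in explicitly here to keep the sketch self-contained, and `callModel` stands in for whatever LLM client you use:

```javascript
// Hypothetical sketch of a user-supplied generateAnswersDataset helper.
// `callModel` is an assumption (any provider's client wrapper), not an
// EvalSense API.
async function generateAnswersDataset(questions, callModel) {
  const answers = [];
  for (const q of questions) {
    const answer = await callModel(q); // e.g. one chat-completion request
    answers.push({ question: q, answer });
  }
  return answers;
}

// Usage with any provider: pass your own client wrapper, e.g.
// const answers = await generateAnswersDataset(testQuestions,
//   (q) => myClient.complete(q));
```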
For Agentic Coders
Building with Claude Code, Cursor, or Codex? Add the EvalSense skill to your AI assistant and it will automatically create evals for every LLM feature.
- Install the skill via `npx skills add mohitjoshi14/evalsense`, or paste it into your AI assistant's instructions
- Build an LLM-powered feature as usual
- Your AI automatically creates an eval file and runs it before shipping
- If assertions pass — ship. If not — iterate.
Key Features
- Jest-like describe/test API for LLM evaluations
- Statistical assertions: accuracy, precision, recall, F1, toxicity, and more
- Confusion matrix display
- Run locally or in CI pipelines
- Works with any LLM provider
- Open source — MIT license
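For intuition on the confusion matrix feature, here is a sketch of the underlying data structure: per-label counts of true label vs. predicted label, which is what a display like `displayConfusionMatrix()` would render. The function is illustrative, not EvalSense's implementation:

```javascript
// Sketch: a confusion matrix as nested counts,
// matrix[trueLabel][predictedLabel] -> number of examples.
function confusionMatrix(predicted, truth) {
  const matrix = {};
  for (let i = 0; i < truth.length; i++) {
    const t = truth[i], p = predicted[i];
    matrix[t] = matrix[t] || {};
    matrix[t][p] = (matrix[t][p] || 0) + 1;
  }
  return matrix;
}

console.log(confusionMatrix(
  ["positive", "negative", "positive"],
  ["positive", "positive", "negative"]
));
// → { positive: { positive: 1, negative: 1 }, negative: { positive: 1 } }
```

Off-diagonal counts show exactly which labels the model confuses, which is often more actionable than a single accuracy number.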