The Problem
LLM features are hard to test. Traditional unit tests check exact outputs, but LLMs are non-deterministic. You need statistical assertions — accuracy, precision, recall, F1, toxicity — to know if your AI feature is good enough to ship.
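To make "statistical assertion" concrete, here is a minimal, library-independent sketch of how metrics like accuracy, precision, recall, and F1 are computed from predicted vs. ground-truth labels. The function name and shape are illustrative, not EvalSense APIs:

```javascript
// Sketch: classification metrics over predicted vs. ground-truth labels.
// These are the quantities a statistical assertion compares to a threshold.
function classificationMetrics(predicted, truth, positive = "positive") {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (let i = 0; i < truth.length; i++) {
    const p = predicted[i] === positive;
    const t = truth[i] === positive;
    if (p && t) tp++;
    else if (p && !t) fp++;
    else if (!p && t) fn++;
    else tn++;
  }
  const accuracy = (tp + tn) / truth.length;
  const precision = tp / ((tp + fp) || 1); // guard against empty denominator
  const recall = tp / ((tp + fn) || 1);
  const f1 = (2 * precision * recall) / ((precision + recall) || 1);
  return { accuracy, precision, recall, f1 };
}

const truth     = ["positive", "positive", "negative", "negative"];
const predicted = ["positive", "negative", "negative", "positive"];
console.log(classificationMetrics(predicted, truth));
// → { accuracy: 0.5, precision: 0.5, recall: 0.5, f1: 0.5 }
```

A single exact-match assertion would fail randomly on non-deterministic output; thresholds over aggregates like these stay stable across runs.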
How It Works
- Create an eval file — Write a test using the familiar describe/test API. Define your quality bar with statistical assertions.
- Run it — `npx evalsense run quality.eval.js`
- Get pass/fail results — Each assertion either passes or fails, just like a unit test. Ship with confidence or iterate.
Installation
```shell
npm install --save-dev evalsense
```
Usage Example
```javascript
import { readFileSync } from "node:fs";
import { describe, evalTest, expectStats } from "evalsense";

describe("answer quality", () => {
  evalTest("toxicity detection", async () => {
    // generateAnswersDataset and toxicity are user-defined helpers
    const answers = await generateAnswersDataset(testQuestions);
    const toxicityScore = await toxicity(answers);

    // Quality bar: at least 50% of answers score below 0.5 toxicity
    expectStats(toxicityScore)
      .field("score")
      .percentageBelow(0.5)
      .toBeAtLeast(0.5);
  });

  evalTest("correctness score", async () => {
    const answers = await generateAnswersDataset(testQuestions);
    const groundTruth = JSON.parse(readFileSync("truth-dataset.json", "utf-8"));

    expectStats(answers, groundTruth)
      .field("label")
      .accuracy.toBeAtLeast(0.9)
      .precision("positive").toBeAtLeast(0.7)
      .recall("positive").toBeAtLeast(0.7)
      .displayConfusionMatrix();
  });
});
```
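The example above assumes two user-supplied helpers, `generateAnswersDataset` and `toxicity`, which are not part of EvalSense. As a hypothetical sketch, a dataset generator might look like the following; the model call is passed in explicitly here to keep the sketch self-contained, and `callModel` stands in for whatever LLM client you use:

```javascript
// Hypothetical sketch of a user-supplied generateAnswersDataset helper.
// `callModel` is an assumption (any provider's client wrapper), not an
// EvalSense API.
async function generateAnswersDataset(questions, callModel) {
  const answers = [];
  for (const q of questions) {
    const answer = await callModel(q); // e.g. one chat-completion request
    answers.push({ question: q, answer });
  }
  return answers;
}

// Usage with any provider: pass your own client wrapper, e.g.
// const answers = await generateAnswersDataset(testQuestions,
//   (q) => myClient.complete(q));
```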
For Agentic Coders
Building with Claude Code, Cursor, or Codex? Add the EvalSense skill to your AI assistant and it will automatically create evals for every LLM feature.
- Install the skill via `npx skills add mohitjoshi14/evalsense`, or paste it into your AI assistant's instructions
- Build an LLM-powered feature as usual
- Your AI automatically creates an eval file and runs it before shipping
- If assertions pass — ship. If not — iterate.
Key Features
- Jest-like describe/test API for LLM evaluations
- Statistical assertions: accuracy, precision, recall, F1, toxicity, and more
- Confusion matrix display
- Run locally or in CI pipelines
- Works with any LLM provider
- Open source — MIT license
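For intuition on the confusion matrix feature, here is a sketch of the underlying data structure: per-label counts of true label vs. predicted label, which is what a display like `displayConfusionMatrix()` would render. The function is illustrative, not EvalSense's implementation:

```javascript
// Sketch: a confusion matrix as nested counts,
// matrix[trueLabel][predictedLabel] -> number of examples.
function confusionMatrix(predicted, truth) {
  const matrix = {};
  for (let i = 0; i < truth.length; i++) {
    const t = truth[i], p = predicted[i];
    matrix[t] = matrix[t] || {};
    matrix[t][p] = (matrix[t][p] || 0) + 1;
  }
  return matrix;
}

console.log(confusionMatrix(
  ["positive", "negative", "positive"],
  ["positive", "positive", "negative"]
));
// → { positive: { positive: 1, negative: 1 }, negative: { positive: 1 } }
```

Off-diagonal counts show exactly which labels the model confuses, which is often more actionable than a single accuracy number.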