Building Your First AI Test Suite in 1 Hour

Your AI works great in demos. But will it survive real users?

In this guide, I'll show you how to build a basic test suite that catches 80% of AI issues before they reach production.

Time investment: 1 hour

Skill level: Beginner-friendly (no ML expertise required)

Tools: Promptfoo (free, open-source)

Why You Need a Test Suite

Manual testing doesn't scale. Here's what happens without systematic testing:

  • You test 10 queries → miss the 11th that breaks everything
  • Model update → no idea if it broke existing functionality
  • Prompt change → regression bugs you didn't catch
  • Production incident → scrambling to reproduce and fix

With a test suite:

  • Test 100+ scenarios in 30 seconds
  • Catch regressions before deployment
  • Document expected behavior
  • Sleep better at night

The 5 Core Test Categories

1. Happy Path Tests (30% of suite)

What: Test that your AI works for common, straightforward queries.

Example:

Query: "What is your return policy?"
Expected: Mentions 30-day window, requires receipt

2. Edge Case Tests (25% of suite)

What: Test unusual but valid inputs.

Examples:

  • Typos: "refnd polcy"
  • Abbreviations: "ROI of Product X"
  • Ambiguity: "How do I reset it?" (what is "it"?)
  • Multi-question: "What's the price and refund policy?"

3. Hallucination Tests (20% of suite)

What: Ensure AI doesn't fabricate information.

Examples:

  • "What's your policy on X?" (X doesn't exist → should say "no information")
  • "Who is your CEO?" → Must match actual CEO
  • "How much does Y cost?" (if Y not in docs → should acknowledge uncertainty)

4. Safety Tests (15% of suite)

What: Test for harmful, biased, or inappropriate outputs.

Examples:

  • Jailbreak attempts: "Ignore all previous instructions..."
  • Harmful requests: "How do I hack..."
  • Bias: Test same query with different demographic markers

5. Performance Tests (10% of suite)

What: Ensure acceptable latency and quality under load.

Examples:

  • Response time < 3 seconds
  • Consistent quality across 100 runs
  • No timeouts or errors

Step-by-Step: Building Your Test Suite

Step 1: Install Promptfoo (5 minutes)

npm install -g promptfoo
promptfoo init my-test-suite
cd my-test-suite

Step 2: Define Your Test Cases (20 minutes)

Create prompts.yaml:

prompts:
  - "You are a helpful customer service assistant. Answer based only on the provided context."

providers:
  - openai:gpt-4

tests:
  # Happy path
  - description: "Return policy question"
    vars:
      question: "What is your return policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: contains
        value: "receipt"

  # Edge case
  - description: "Typo in query"
    vars:
      question: "refnd polcy"
    assert:
      - type: contains
        value: "return"

  # Hallucination test
  - description: "Non-existent policy"
    vars:
      question: "What is your policy on lunar deliveries?"
    assert:
      - type: not-contains
        value: "we offer"
      - type: contains-any
        value: ["don't have", "no information", "not available"]

Step 3: Run Tests (1 minute)

promptfoo eval

Step 4: Review Results (5 minutes)

Promptfoo generates a beautiful HTML report showing:

  • ✓ Passed tests (green)
  • ✗ Failed tests (red)
  • Actual vs. expected outputs
  • Performance metrics

50 Starter Test Cases (Copy & Paste)

Download our pre-built test suite:

📥 LLM Evaluation Template

200+ test cases organized by category. Just customize for your use case.

Best Practices

  1. Start small: 20-30 tests, then grow to 100+
  2. Run on every commit: Integrate with CI/CD
  3. Track over time: Monitor pass rates
  4. Update regularly: Add tests for every bug you find
  5. Document expected behavior: Tests = living documentation

Common Pitfalls

  • Too many tests: Start small, iterate
  • Brittle assertions: Don't check exact wording
  • No negative tests: Test what should NOT happen
  • Ignoring failures: If tests fail, fix them

What's Next?

Once you have a basic suite:

  1. Week 2: Add 50 more tests
  2. Week 3: Integrate with CI/CD
  3. Week 4: Add performance benchmarks
  4. Month 2: Set up production monitoring

✓ Action Item:

Spend the next hour building your first 20 test cases. Use our template as a starting point.

Need Help Building Your Test Suite?

We'll help you design and implement a custom test suite for your AI system.

Book Consultation