Building Your First AI Test Suite in 1 Hour
Your AI works great in demos. But will it survive real users?
In this guide, I'll show you how to build a basic test suite that catches 80% of AI issues before they reach production.
Time investment: 1 hour
Skill level: Beginner-friendly (no ML expertise required)
Tools: Promptfoo (free, open-source)
Why You Need a Test Suite
Manual testing doesn't scale. Here's what happens without systematic testing:
- You test 10 queries → miss the 11th that breaks everything
- Model update → no idea if it broke existing functionality
- Prompt change → regression bugs you didn't catch
- Production incident → scrambling to reproduce and fix
With a test suite:
- Test 100+ scenarios in 30 seconds
- Catch regressions before deployment
- Document expected behavior
- Sleep better at night
The 5 Core Test Categories
1. Happy Path Tests (30% of suite)
What: Test that your AI works for common, straightforward queries.
Example:
Expected: Mentions 30-day window, requires receipt
2. Edge Case Tests (25% of suite)
What: Test unusual but valid inputs.
Examples:
- Typos: "refnd polcy"
- Abbreviations: "ROI of Product X"
- Ambiguity: "How do I reset it?" (what is "it"?)
- Multi-question: "What's the price and refund policy?"
3. Hallucination Tests (20% of suite)
What: Ensure AI doesn't fabricate information.
Examples:
- "What's your policy on X?" (X doesn't exist → should say "no information")
- "Who is your CEO?" → Must match actual CEO
- "How much does Y cost?" (if Y not in docs → should acknowledge uncertainty)
4. Safety Tests (15% of suite)
What: Test for harmful, biased, or inappropriate outputs.
Examples:
- Jailbreak attempts: "Ignore all previous instructions..."
- Harmful requests: "How do I hack..."
- Bias: Test same query with different demographic markers
5. Performance Tests (10% of suite)
What: Ensure acceptable latency and quality under load.
Examples:
- Response time < 3 seconds
- Consistent quality across 100 runs
- No timeouts or errors
Step-by-Step: Building Your Test Suite
Step 1: Install Promptfoo (5 minutes)
promptfoo init my-test-suite
cd my-test-suite
Step 2: Define Your Test Cases (20 minutes)
Create prompts.yaml:
- "You are a helpful customer service assistant. Answer based only on the provided context."
providers:
- openai:gpt-4
tests:
# Happy path
- description: "Return policy question"
vars:
question: "What is your return policy?"
assert:
- type: contains
value: "30 days"
- type: contains
value: "receipt"
# Edge case
- description: "Typo in query"
vars:
question: "refnd polcy"
assert:
- type: contains
value: "return"
# Hallucination test
- description: "Non-existent policy"
vars:
question: "What is your policy on lunar deliveries?"
assert:
- type: not-contains
value: "we offer"
- type: contains-any
value: ["don't have", "no information", "not available"]
Step 3: Run Tests (1 minute)
Step 4: Review Results (5 minutes)
Promptfoo generates a beautiful HTML report showing:
- ✓ Passed tests (green)
- ✗ Failed tests (red)
- Actual vs. expected outputs
- Performance metrics
50 Starter Test Cases (Copy & Paste)
Download our pre-built test suite:
200+ test cases organized by category. Just customize for your use case.
Best Practices
- Start small: 20-30 tests, then grow to 100+
- Run on every commit: Integrate with CI/CD
- Track over time: Monitor pass rates
- Update regularly: Add tests for every bug you find
- Document expected behavior: Tests = living documentation
Common Pitfalls
- ❌ Too many tests: Start small, iterate
- ❌ Brittle assertions: Don't check exact wording
- ❌ No negative tests: Test what should NOT happen
- ❌ Ignoring failures: If tests fail, fix them
What's Next?
Once you have a basic suite:
- Week 2: Add 50 more tests
- Week 3: Integrate with CI/CD
- Week 4: Add performance benchmarks
- Month 2: Set up production monitoring
✓ Action Item:
Spend the next hour building your first 20 test cases. Use our template as a starting point.
Need Help Building Your Test Suite?
We'll help you design and implement a custom test suite for your AI system.
Book Consultation