🔁 The Nondeterminism Simulator
Click Run Prompt repeatedly and watch the outputs change — even though the prompt never does.
Lower temperature → more similar outputs. Higher temperature → more varied, sometimes surprising responses.
Hit Run Prompt to see the AI respond…
⚖️ Traditional vs AI Testing
See why a hard assertion that works perfectly in traditional software simply can't work for AI.
📋 Traditional Assertion
# Works perfectly for deterministic software
def test_summarise():
output = summarise_benefits()
assert output == "Automated testing saves time by catching bugs early and reducing manual effort."
# Result when run against an AI model:
FAILED — AssertionError
# Even though the model gave a perfectly valid answer!
The model returned a correct, high-quality answer — but it used different words.
A hard assert == fails every time.
🧠 Hallucination Demo
A question with a definite correct answer — but the AI sometimes gets the details subtly wrong. This is why AI outputs need validation, not just format checks.
Hit Ask the AI to see responses…
📋 Test Strategy Cheat Sheet
A quick-reference summary of the strategies that make AI testing tractable. Click any row to expand it.
Use embedding models (e.g. Sentence-BERT, OpenAI text-embedding-3-small or text-embedding-3-large) or cosine similarity to compare the meaning of outputs rather than their exact text.
Threshold example: semantic_similarity(output, reference) > 0.85
Best for: open-ended generation, summarisation, Q&A.
Ask questions like: "Is this answer factually correct? Is it relevant? Is it harmful?"
Tools: GPT-4 as judge, Claude Constitutional AI, custom scoring prompts.
Best for: complex outputs where rule-based scoring is insufficient.
Test constraints rather than exact content: word count, JSON schema validity, required keyword presence, absence of banned phrases, sentiment bounds.
Example: assert len(output.split()) < 50 and "testing" in output.lower()
Best for: format compliance, safety guardrails, structured output.
Ground model outputs against a trusted knowledge base or retrieval system. Flag responses that include unverifiable or contradicted claims.
Approaches: RAG grounding checks, named entity verification, fact-checking APIs.
Best for: factual Q&A, customer-facing chatbots, medical/legal domains.
Run the same prompt many times and assert on distributions rather than single outputs. Track pass-rate trends across model versions to catch regressions.
Example: assert pass_rate(prompt, n=100) >= 0.95
Best for: model evaluation, A/B testing, production monitoring.
Create adversarial prompts designed to elicit harmful, biased, or incorrect outputs. Use prompt injection attacks, jailbreaks, and edge-case inputs to stress-test guardrails.
Tools: PyRIT, Garak, manual red-team sprints.
Best for: safety evaluation, compliance, high-stakes deployments.