
Overview

An automated testing framework for Agents and Flows that enables systematic evaluation of AI system quality, consistency, and performance.

Status: 🔮 Planned for Q2 2026

What are Evaluations?

Evaluations are a comprehensive testing and benchmarking system that runs your Agents and Flows against test cases, measures quality, and tracks improvements over time.

Key Features

Test Suites

Organize tests logically:
  • Group related test cases
  • Run entire suites or individual tests
  • Schedule regular runs
  • Compare results across runs
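
As a rough illustration of running a suite and comparing results across runs, the sketch below uses only the planned API surface shown later on this page (Evaluations.create, .run(), result.pass_rate). The feature is not released yet, so treat it as illustrative pseudocode rather than a working integration; the IDs are placeholders.

from triform import Evaluations

# Create a handle to one test suite for one Agent (IDs are illustrative).
suite = Evaluations.create(
    agent_id="agent_abc123",
    test_suite="customer_support_v1"
)

# Run the full suite twice, e.g. before and after a prompt change,
# and compare aggregate pass rates across the two runs.
baseline = suite.run()
candidate = suite.run()

delta = candidate.pass_rate - baseline.pass_rate
print(f"Pass rate changed by {delta:+.1%} between runs")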

Test Cases

Define expected behavior:
- name: "Customer greeting"
  input: "Hello, I need help"
  expected_output_contains: "How can I help you"
  expected_tone: "friendly"
  max_tokens: 100
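
To make the fields concrete, here is a hedged sketch of how a case like this could be checked locally with plain Python: load the YAML, invoke the Agent, and apply simple assertions. run_agent() is a hypothetical stand-in for however your Agent is actually called, and the tone check is omitted because it needs a model-based judge.

import yaml  # PyYAML

case = yaml.safe_load("""
name: "Customer greeting"
input: "Hello, I need help"
expected_output_contains: "How can I help you"
max_tokens: 100
""")

def run_agent(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real call to your Agent.
    return "Hello! How can I help you today?"

output = run_agent(case["input"])
assert case["expected_output_contains"].lower() in output.lower()
assert len(output.split()) <= case["max_tokens"]  # crude word-count proxy for tokens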

Evaluation Metrics

Measure quality:
  • Semantic similarity — How close is output to expected?
  • Factual accuracy — Are facts correct?
  • Tone matching — Does tone match guidelines?
  • Safety — No harmful content?
  • Latency — Response time acceptable?
  • Cost — Token usage within budget?
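
For a sense of how one of these metrics can be scored, the sketch below computes semantic similarity by embedding the expected and actual outputs and comparing their cosine similarity against a threshold. It uses the open-source sentence-transformers library; the model name and the 0.8 threshold (mirroring the example metrics later on this page) are illustrative choices, not the platform's built-in scoring.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_similarity(expected: str, actual: str) -> float:
    # Embed both strings and return their cosine similarity.
    embeddings = model.encode([expected, actual])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

score = semantic_similarity(
    "How can I help you today?",
    "Hi there! What can I help you with?"
)
threshold = 0.8  # mirrors the 80% minimum used in the example metrics below
print(f"Semantic similarity: {score:.2f} (pass: {score >= threshold})")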

A/B Testing

Compare variants:
  • Test different prompts
  • Compare models
  • Evaluate tool configurations
  • Measure impact of changes
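
A hedged sketch of what an A/B comparison could look like against the planned API (Evaluations.create, .run(), result.pass_rate, all shown later on this page): run the same test suite against two Agent variants and compare pass rates. The variant labels and both agent IDs are hypothetical.

from triform import Evaluations

# Two variants of the same Agent, e.g. differing only in system prompt.
VARIANTS = {
    "concise-prompt": "agent_abc123",
    "detailed-prompt": "agent_def456",
}

results = {}
for label, agent_id in VARIANTS.items():
    evaluation = Evaluations.create(
        agent_id=agent_id,
        test_suite="customer_support_v1"
    )
    results[label] = evaluation.run()

winner = max(results, key=lambda label: results[label].pass_rate)
print(f"Best variant: {winner} ({results[winner].pass_rate:.0%} pass rate)")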

Regression Detection

Catch quality drops:
  • Baseline established from passing tests
  • Alert when quality degrades
  • Track metrics over time
  • Identify breaking changes
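
The sketch below shows one minimal shape this can take: compare the latest run's pass rate to a stored baseline and fail loudly when it drops. The baseline file format and tolerance are assumptions; only Evaluations.create, .run(), and result.pass_rate come from the planned API shown later on this page.

import json
from pathlib import Path
from triform import Evaluations

BASELINE_FILE = Path("eval_baseline.json")  # hypothetical location
TOLERANCE = 0.02  # allow a 2-point pass-rate drop before flagging a regression

evaluation = Evaluations.create(
    agent_id="agent_abc123",
    test_suite="customer_support_v1"
)
result = evaluation.run()

if BASELINE_FILE.exists():
    baseline = json.loads(BASELINE_FILE.read_text())
    if result.pass_rate < baseline["pass_rate"] - TOLERANCE:
        raise SystemExit(
            f"Regression: pass rate {result.pass_rate:.0%} is below "
            f"baseline {baseline['pass_rate']:.0%}"
        )

# No regression: record the latest run as the new baseline.
BASELINE_FILE.write_text(json.dumps({"pass_rate": result.pass_rate}))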

Use Cases

Prompt Engineering
  • Test prompt variations
  • Measure impact of changes
  • Find optimal configuration

Model Comparison
  • GPT-4 vs Claude
  • Cost vs quality tradeoffs
  • Speed vs accuracy

Quality Assurance
  • Verify Agent behavior
  • Catch regressions
  • Ensure consistency

Continuous Improvement
  • Track progress over time
  • Benchmark against goals
  • Identify improvement areas

Example Evaluation

name: "Customer Support Bot Evaluation"
agent: "customer-support-agent"

test_cases:
  - name: "Polite greeting"
    input: "Hello"
    assertions:
      - contains: ["hello", "hi", "help"]
      - tone: "friendly"
      - max_tokens: 50
    
  - name: "Product question"
    input: "How much does Pro plan cost?"
    assertions:
      - contains: ["$49", "month", "Pro"]
      - factually_correct: true
      - tools_called: ["get_pricing"]
    
  - name: "Escalation"
    input: "I want to cancel my account immediately"
    assertions:
      - tools_called: ["escalate_to_human"]
      - tone: "empathetic"
      - contains: ["understand", "help"]

metrics:
  - semantic_similarity: 0.8  # 80% minimum
  - response_time: 3000  # 3 seconds max
  - cost_per_interaction: 0.05  # $0.05 max

Evaluation Types

Unit Tests

Single component:
  • Test one Agent or Action
  • Specific input/output
  • Fast feedback

Integration Tests

Multiple components:
  • Test full Flows
  • End-to-end scenarios
  • Realistic workflows

Regression Tests

Prevent quality drops:
  • Run after every change
  • Compare to baseline
  • Alert on degradation

Performance Tests

Measure speed and cost:
  • Batch execution
  • Latency under load
  • Cost analysis
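
A minimal illustration of a latency check over a small batch of inputs follows; run_agent() is a hypothetical stand-in for however the Agent is invoked, and the 3000 ms budget mirrors the example metrics earlier on this page.

import time

def run_agent(prompt: str) -> str:
    return "stub response"  # hypothetical: replace with a real Agent call

inputs = ["Hello", "How much does Pro plan cost?", "I want to cancel my account"]
latencies_ms = []
for prompt in inputs:
    start = time.perf_counter()
    run_agent(prompt)
    latencies_ms.append((time.perf_counter() - start) * 1000)

worst = max(latencies_ms)
print(f"Worst-case latency: {worst:.0f} ms (budget: 3000 ms)")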

Safety Tests

Ensure responsible AI:
  • No harmful outputs
  • No PII leakage
  • No bias or toxicity
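
As an illustration only, the snippet below shows the shape of a very simple safety assertion: a regex scan for obvious PII (email addresses, US-style phone numbers) in an Agent's output. Real safety evaluation needs far more than regexes (toxicity classifiers, red-team prompts, human review), and none of this is the platform's built-in safety check.

import re

# Illustrative patterns for two common PII types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def leaked_pii(output: str) -> list:
    # Return the PII types that appear in the output, if any.
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(output)]

assert leaked_pii("Sure, the customer's number is 555-867-5309.") == ["phone"]
assert leaked_pii("Your ticket has been escalated to a specialist.") == []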

Evaluation Dashboard

Visual results:
  • Pass/fail rates
  • Quality trends over time
  • Cost and latency charts
  • Failure analysis
  • Comparison views

Automated Runs

Schedule evaluations:
  • After every deployment
  • Daily/weekly cron
  • Before production push
  • On-demand via API
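
For the cron and "on-demand via API" cases, a small wrapper script like the sketch below can be triggered by any external scheduler; it exits non-zero on failure so the scheduler can alert. Only Evaluations.create, .run(), and result.pass_rate come from the planned API below; the 0.95 threshold and the script name are illustrative.

# Example (illustrative) crontab entry: 0 6 * * * python run_nightly_evals.py
import sys
from triform import Evaluations

def main() -> int:
    evaluation = Evaluations.create(
        agent_id="agent_abc123",
        test_suite="customer_support_v1"
    )
    result = evaluation.run()
    print(f"Nightly evaluation pass rate: {result.pass_rate:.0%}")
    return 0 if result.pass_rate >= 0.95 else 1

if __name__ == "__main__":
    sys.exit(main())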

CI/CD Integration

Block bad deployments:
# GitHub Actions (steps from a workflow job)
- name: Run Triform Evaluations
  run: triform eval run --suite customer-support --require-pass

- name: Deploy if evaluations pass
  if: success()
  run: triform deploy

Evaluation API

Programmatic access:
from triform import Evaluations

# Create an evaluation (named "evaluation" to avoid shadowing Python's built-in eval)
evaluation = Evaluations.create(
    agent_id="agent_abc123",
    test_suite="customer_support_v1"
)

# Run the evaluation
result = evaluation.run()

# Check results
if result.pass_rate >= 0.95:
    print("Quality threshold met!")
else:
    print(f"Failed: {result.failures}")

Pricing

Included in Pro and Enterprise plans.

Usage:
  • Evaluations run as regular executions
  • Count toward execution quota
  • No additional cost beyond execution fees

Timeline

Q2 2026: Beta release with basic test cases
Q3 2026: Advanced metrics and A/B testing
Q4 2026: CI/CD integrations and automation

Get Notified

Sign up: triform.ai/evaluations-beta

Questions?
