
Overview

An automated testing framework for Agents and Flows that enables systematic evaluation of AI system quality, consistency, and performance.

Status: 🔮 Planned for Q2 2026

What are Evaluations?

Evaluations are a comprehensive testing and benchmarking system that runs your Agents and Flows against test cases, measures quality, and tracks improvements over time.

Key Features

Test Suites

Organize tests logically:
  • Group related test cases
  • Run entire suites or individual tests
  • Schedule regular runs
  • Compare results across runs
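
As a rough illustration of running a suite and comparing results across runs, the sketch below uses only the planned API surface shown later on this page (Evaluations.create, .run(), result.pass_rate). The feature is not released yet, so treat it as illustrative pseudocode rather than a working integration; the IDs are placeholders.

from triform import Evaluations

# Create a handle to one test suite for one Agent (IDs are illustrative).
suite = Evaluations.create(
    agent_id="agent_abc123",
    test_suite="customer_support_v1"
)

# Run the full suite twice, e.g. before and after a prompt change,
# and compare aggregate pass rates across the two runs.
baseline = suite.run()
candidate = suite.run()

delta = candidate.pass_rate - baseline.pass_rate
print(f"Pass rate changed by {delta:+.1%} between runs")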

Test Cases

Define expected behavior:
- name: "Customer greeting"
  input: "Hello, I need help"
  expected_output_contains: "How can I help you"
  expected_tone: "friendly"
  max_tokens: 100
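
To make the fields concrete, here is a hedged sketch of how a case like this could be checked locally with plain Python: load the YAML, invoke the Agent, and apply simple assertions. run_agent() is a hypothetical stand-in for however your Agent is actually called, and the tone check is omitted because it needs a model-based judge.

import yaml  # PyYAML

case = yaml.safe_load("""
name: "Customer greeting"
input: "Hello, I need help"
expected_output_contains: "How can I help you"
max_tokens: 100
""")

def run_agent(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real call to your Agent.
    return "Hello! How can I help you today?"

output = run_agent(case["input"])
assert case["expected_output_contains"].lower() in output.lower()
assert len(output.split()) <= case["max_tokens"]  # crude word-count proxy for tokens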

Evaluation Metrics

Measure quality:
  • Semantic similarity — How close is output to expected?
  • Factual accuracy — Are facts correct?
  • Tone matching — Does tone match guidelines?
  • Safety — No harmful content?
  • Latency — Response time acceptable?
  • Cost — Token usage within budget?
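
For a sense of how one of these metrics can be scored, the sketch below computes semantic similarity by embedding the expected and actual outputs and comparing their cosine similarity against a threshold. It uses the open-source sentence-transformers library; the model name and the 0.8 threshold (mirroring the example metrics later on this page) are illustrative choices, not the platform's built-in scoring.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_similarity(expected: str, actual: str) -> float:
    # Embed both strings and return their cosine similarity.
    embeddings = model.encode([expected, actual])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

score = semantic_similarity(
    "How can I help you today?",
    "Hi there! What can I help you with?"
)
threshold = 0.8  # mirrors the 80% minimum used in the example metrics below
print(f"Semantic similarity: {score:.2f} (pass: {score >= threshold})")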

A/B Testing

Compare variants:
  • Test different prompts
  • Compare models
  • Evaluate tool configurations
  • Measure impact of changes
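
A hedged sketch of what an A/B comparison could look like against the planned API (Evaluations.create, .run(), result.pass_rate, all shown later on this page): run the same test suite against two Agent variants and compare pass rates. The variant labels and both agent IDs are hypothetical.

from triform import Evaluations

# Two variants of the same Agent, e.g. differing only in system prompt.
VARIANTS = {
    "concise-prompt": "agent_abc123",
    "detailed-prompt": "agent_def456",
}

results = {}
for label, agent_id in VARIANTS.items():
    evaluation = Evaluations.create(
        agent_id=agent_id,
        test_suite="customer_support_v1"
    )
    results[label] = evaluation.run()

winner = max(results, key=lambda label: results[label].pass_rate)
print(f"Best variant: {winner} ({results[winner].pass_rate:.0%} pass rate)")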

Regression Detection

Catch quality drops:
  • Baseline established from passing tests
  • Alert when quality degrades
  • Track metrics over time
  • Identify breaking changes
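
The sketch below shows one minimal shape this can take: compare the latest run's pass rate to a stored baseline and fail loudly when it drops. The baseline file format and tolerance are assumptions; only Evaluations.create, .run(), and result.pass_rate come from the planned API shown later on this page.

import json
from pathlib import Path
from triform import Evaluations

BASELINE_FILE = Path("eval_baseline.json")  # hypothetical location
TOLERANCE = 0.02  # allow a 2-point pass-rate drop before flagging a regression

evaluation = Evaluations.create(
    agent_id="agent_abc123",
    test_suite="customer_support_v1"
)
result = evaluation.run()

if BASELINE_FILE.exists():
    baseline = json.loads(BASELINE_FILE.read_text())
    if result.pass_rate < baseline["pass_rate"] - TOLERANCE:
        raise SystemExit(
            f"Regression: pass rate {result.pass_rate:.0%} is below "
            f"baseline {baseline['pass_rate']:.0%}"
        )

# No regression: record the latest run as the new baseline.
BASELINE_FILE.write_text(json.dumps({"pass_rate": result.pass_rate}))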

Use Cases

Prompt Engineering
  • Test prompt variations
  • Measure impact of changes
  • Find optimal configuration

Model Comparison
  • GPT-4 vs Claude
  • Cost vs quality tradeoffs
  • Speed vs accuracy

Quality Assurance
  • Verify Agent behavior
  • Catch regressions
  • Ensure consistency

Continuous Improvement
  • Track progress over time
  • Benchmark against goals
  • Identify improvement areas

Example Evaluation

name: "Customer Support Bot Evaluation"
agent: "customer-support-agent"

test_cases:
  - name: "Polite greeting"
    input: "Hello"
    assertions:
      - contains: ["hello", "hi", "help"]
      - tone: "friendly"
      - max_tokens: 50
    
  - name: "Product question"
    input: "How much does Pro plan cost?"
    assertions:
      - contains: ["$49", "month", "Pro"]
      - factually_correct: true
      - tools_called: ["get_pricing"]
    
  - name: "Escalation"
    input: "I want to cancel my account immediately"
    assertions:
      - tools_called: ["escalate_to_human"]
      - tone: "empathetic"
      - contains: ["understand", "help"]

metrics:
  - semantic_similarity: 0.8  # 80% minimum
  - response_time: 3000  # 3 seconds max
  - cost_per_interaction: 0.05  # $0.05 max

Evaluation Types

Unit Tests

Single component:
  • Test one Agent or Action
  • Specific input/output
  • Fast feedback

Integration Tests

Multiple components:
  • Test full Flows
  • End-to-end scenarios
  • Realistic workflows

Regression Tests

Prevent quality drops:
  • Run after every change
  • Compare to baseline
  • Alert on degradation

Performance Tests

Measure speed and cost:
  • Batch execution
  • Latency under load
  • Cost analysis
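
A minimal illustration of a latency check over a small batch of inputs follows; run_agent() is a hypothetical stand-in for however the Agent is invoked, and the 3000 ms budget mirrors the example metrics earlier on this page.

import time

def run_agent(prompt: str) -> str:
    return "stub response"  # hypothetical: replace with a real Agent call

inputs = ["Hello", "How much does Pro plan cost?", "I want to cancel my account"]
latencies_ms = []
for prompt in inputs:
    start = time.perf_counter()
    run_agent(prompt)
    latencies_ms.append((time.perf_counter() - start) * 1000)

worst = max(latencies_ms)
print(f"Worst-case latency: {worst:.0f} ms (budget: 3000 ms)")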

Safety Tests

Ensure responsible AI:
  • No harmful outputs
  • No PII leakage
  • No bias or toxicity
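
As an illustration only, the snippet below shows the shape of a very simple safety assertion: a regex scan for obvious PII (email addresses, US-style phone numbers) in an Agent's output. Real safety evaluation needs far more than regexes (toxicity classifiers, red-team prompts, human review), and none of this is the platform's built-in safety check.

import re

# Illustrative patterns for two common PII types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def leaked_pii(output: str) -> list:
    # Return the PII types that appear in the output, if any.
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(output)]

assert leaked_pii("Sure, the customer's number is 555-867-5309.") == ["phone"]
assert leaked_pii("Your ticket has been escalated to a specialist.") == []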

Evaluation Dashboard

Visual results:
  • Pass/fail rates
  • Quality trends over time
  • Cost and latency charts
  • Failure analysis
  • Comparison views

Automated Runs

Schedule evaluations:
  • After every deployment
  • Daily/weekly cron
  • Before production push
  • On-demand via API
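
For the cron and "on-demand via API" cases, a small wrapper script like the sketch below can be triggered by any external scheduler; it exits non-zero on failure so the scheduler can alert. Only Evaluations.create, .run(), and result.pass_rate come from the planned API below; the 0.95 threshold and the script name are illustrative.

# Example (illustrative) crontab entry: 0 6 * * * python run_nightly_evals.py
import sys
from triform import Evaluations

def main() -> int:
    evaluation = Evaluations.create(
        agent_id="agent_abc123",
        test_suite="customer_support_v1"
    )
    result = evaluation.run()
    print(f"Nightly evaluation pass rate: {result.pass_rate:.0%}")
    return 0 if result.pass_rate >= 0.95 else 1

if __name__ == "__main__":
    sys.exit(main())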

CI/CD Integration

Block bad deployments:
# GitHub Actions (steps from a workflow job)
- name: Run Triform Evaluations
  run: triform eval run --suite customer-support --require-pass

- name: Deploy if evaluations pass
  if: success()
  run: triform deploy

Evaluation API

Programmatic access:
from triform import Evaluations

# Create an evaluation (named "evaluation" to avoid shadowing Python's built-in eval)
evaluation = Evaluations.create(
    agent_id="agent_abc123",
    test_suite="customer_support_v1"
)

# Run the evaluation
result = evaluation.run()

# Check results
if result.pass_rate >= 0.95:
    print("Quality threshold met!")
else:
    print(f"Failed: {result.failures}")

Pricing

Included in Pro and Enterprise plans.

Usage:
  • Evaluations run as regular executions
  • Count toward execution quota
  • No additional cost beyond execution fees

Timeline

Q2 2026: Beta release with basic test cases
Q3 2026: Advanced metrics and A/B testing
Q4 2026: CI/CD integrations and automation

Get Notified

Sign up: triform.ai/evaluations-beta

Questions?
