Evaluations
Evaluations are a specialized form of Action Output in Triform, the Scorecard, designed to assess the outcome of an execution.
Evaluations can be embedded in an Action or run independently. Their purpose is to provide insight: pass/fail signals, quantitative scores, or semantic reasoning.
Concept
A standalone Evaluation takes the input and/or output of another Node and applies logic to measure how well it performed against expectations. It is written like any other Action, but with a focus on measurement rather than processing.
@triform.entrypoint
def evaluate(input):
    # Perform evaluation here...
    card = Scorecard("Response Quality Evaluation")
    card.add_criterion("kindness", 65, 0.3, "Response was generally polite but could be more empathetic.")
    card.add_criterion("correctness", 88, 0.5, "Information provided was accurate and well-supported.")
    card.add_criterion("helpfulness", 56, 0.2, "Response addressed the question but lacked specific examples.")
    card.set_summary("Overall meets basic quality standards but needs improvement in helpfulness.")
    card.set_pass(True)
    return card
The Evaluation returns a Scorecard as its output, which captures one or more dimensions of analysis, computes a weighted average, and records underlying details.
An embedded Evaluation is written as part of an existing Action, with the Scorecard included in the Action Output. This allows the next Action in line to use the Evaluation Scorecard during execution.
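A minimal sketch of the embedded pattern, using plain dicts in place of the Scorecard helper so the shape of the Action Output is visible (the processing step, the non_empty criterion, and the result/scorecard field names are illustrative, not a Triform API):

```python
def process_and_evaluate(input):
    # Hypothetical processing step: the "real" work of the Action.
    result = input.get("text", "").strip().capitalize()

    # Embedded evaluation: score the result inline and attach the
    # Scorecard (shown here as a plain dict mirroring scorecard/v1)
    # to the Action Output alongside the processing result.
    scorecard = {
        "resource": "scorecard/v1",
        "spec": {
            "criteria": {
                "non_empty": {
                    "score": 100 if result else 0,
                    "weight": 1.0,
                    "motivation": "Checks that processing produced output.",
                },
            },
            "pass": bool(result),
        },
    }

    # The next Action in line receives both the result and the Scorecard.
    return {"result": result, "scorecard": scorecard}

output = process_and_evaluate({"text": "  hello world  "})
```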
Scorecard
The Scorecard is a structured output schema returned by an Evaluation. It reflects multiple scoring dimensions, each defined by the user.
{
  "resource": "scorecard/v1",
  "meta": {
    "name": "Response Quality Evaluation",
    "run_id": "The id of the execution evaluated",
    "node_id": "The id of the node in the flow evaluated"
  },
  "spec": {
    "criteria": {
      "kindness": {
        "score": 65,
        "weight": 0.3,
        "motivation": "Response was generally polite but could be more empathetic."
      },
      "correctness": {
        "score": 88,
        "weight": 0.5,
        "motivation": "Information provided was accurate and well-supported."
      },
      "helpfulness": {
        "score": 56,
        "weight": 0.2,
        "motivation": "Response addressed the question but lacked specific examples."
      }
    },
    "summary": "Overall meets basic quality standards but needs improvement in helpfulness.",
    "score": 75,
    "pass": true
  }
}
Each dimension includes:
- A label (e.g. "factuality", "relevance", "readability")
- A score (typically numeric, e.g. 0–100)
- A weight (value from 0.0 to 1.0) indicating its importance in the overall score
Triform uses these to compute a weighted average as the headline score, while preserving the raw breakdown for full transparency.
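The headline score can be reproduced from the breakdown. A minimal sketch of the weighted-average computation in plain Python, independent of the Triform runtime, using the criteria from the example above:

```python
# Criteria from the example Scorecard: label -> score and weight.
criteria = {
    "kindness":    {"score": 65, "weight": 0.3},
    "correctness": {"score": 88, "weight": 0.5},
    "helpfulness": {"score": 56, "weight": 0.2},
}

def weighted_score(criteria):
    """Weighted average of criterion scores, rounded to an integer."""
    total_weight = sum(c["weight"] for c in criteria.values())
    weighted_sum = sum(c["score"] * c["weight"] for c in criteria.values())
    return round(weighted_sum / total_weight)

print(weighted_score(criteria))  # 65*0.3 + 88*0.5 + 56*0.2 = 74.7 -> 75
```

Dividing by the total weight keeps the result well-defined even when the weights do not sum exactly to 1.0.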
Features
- Users define the dimensions and their respective weights
- Triform automatically computes the weighted average
- All evaluations store the full breakdown—not just the average
- Scorecards can be displayed in detail during and after execution
Use Cases
- Model performance grading based on multiple quality metrics
- Automated quality gates for deployment pipelines
- Trend monitoring across executions, flows, or time windows
- Multi-dimensional feedback in human-in-the-loop systems
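As a sketch of the quality-gate use case, a downstream step might read a Scorecard and block deployment below a threshold (the passes_gate function and the 70-point cutoff are illustrative, not a Triform API):

```python
def passes_gate(scorecard, min_score=70):
    """Return True when the evaluation passed and met the score threshold."""
    spec = scorecard["spec"]
    return spec["pass"] and spec["score"] >= min_score

# A trimmed Scorecard with just the fields the gate inspects.
card = {"spec": {"score": 75, "pass": True}}

print(passes_gate(card))                # True: passed and 75 >= 70
print(passes_gate(card, min_score=80))  # False: 75 is below the cutoff
```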