Evaluations
Evaluations are a specialized form of Action Output in Triform, the Scorecard, designed to assess the outcome of an execution.
Evaluations can be embedded in an Action or run independently. Their purpose is to provide insight: pass/fail signals, quantitative scores, or semantic reasoning.
Concept
A standalone Evaluation takes the input and/or output of another Node and applies logic to measure how well it performed against expectations. It is written like any other Action, but with a focus on measurement rather than processing.
@triform.entrypoint
def evaluate(input):
    # Perform evaluation here...
    card = Scorecard("Response Quality Evaluation")
    card.add_criterion("kindness", 65, 0.3, "Response was generally polite but could be more empathetic.")
    card.add_criterion("correctness", 88, 0.5, "Information provided was accurate and well-supported.")
    card.add_criterion("helpfulness", 56, 0.2, "Response addressed the question but lacked specific examples.")
    card.set_summary("Overall meets basic quality standards but needs improvement in helpfulness.")
    card.set_pass(True)
    return card
The Evaluation returns a Scorecard as its output, which captures one or more dimensions of analysis, computes a weighted average, and records underlying details.
An embedded Evaluation is written as part of an existing Action, with the Scorecard included in the Action Output. This allows the next Action in line to use the Evaluation Scorecard during execution.
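A minimal sketch of the embedded pattern, using plain dicts in place of the Scorecard helper so the shape of the Action Output is visible (the processing step, the non_empty criterion, and the result/scorecard field names are illustrative, not a Triform API):

```python
def process_and_evaluate(input):
    # Hypothetical processing step: the "real" work of the Action.
    result = input.get("text", "").strip().capitalize()

    # Embedded evaluation: score the result inline and attach the
    # Scorecard (shown here as a plain dict mirroring scorecard/v1)
    # to the Action Output alongside the processing result.
    scorecard = {
        "resource": "scorecard/v1",
        "spec": {
            "criteria": {
                "non_empty": {
                    "score": 100 if result else 0,
                    "weight": 1.0,
                    "motivation": "Checks that processing produced output.",
                },
            },
            "pass": bool(result),
        },
    }

    # The next Action in line receives both the result and the Scorecard.
    return {"result": result, "scorecard": scorecard}

output = process_and_evaluate({"text": "  hello world  "})
```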
Scorecard
The Scorecard is a structured output schema returned by an Evaluation. It reflects multiple scoring dimensions, each defined by the user.
{
  "resource": "scorecard/v1",
  "meta": {
    "name": "Response Quality Evaluation",
    "run_id": "The id of the execution evaluated",
    "node_id": "The id of the node in the flow evaluated"
  },
  "spec": {
    "criteria": {
      "kindness": {
        "score": 65,
        "weight": 0.3,
        "motivation": "Response was generally polite but could be more empathetic."
      },
      "correctness": {
        "score": 88,
        "weight": 0.5,
        "motivation": "Information provided was accurate and well-supported."
      },
      "helpfulness": {
        "score": 56,
        "weight": 0.2,
        "motivation": "Response addressed the question but lacked specific examples."
      }
    },
    "summary": "Overall meets basic quality standards but needs improvement in helpfulness.",
    "score": 75,
    "pass": true
  }
}
Each dimension includes:
- A label (e.g. "factuality", "relevance", "readability")
- A score (typically numeric, e.g. 0–100)
- A weight (value from 0.0 to 1.0) indicating its importance in the overall score
Triform uses these to compute a weighted average as the headline score, while preserving the raw breakdown for full transparency.
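The headline score can be reproduced from the breakdown. A minimal sketch of the weighted-average computation in plain Python, independent of the Triform runtime, using the criteria from the example above:

```python
# Criteria from the example Scorecard: label -> score and weight.
criteria = {
    "kindness":    {"score": 65, "weight": 0.3},
    "correctness": {"score": 88, "weight": 0.5},
    "helpfulness": {"score": 56, "weight": 0.2},
}

def weighted_score(criteria):
    """Weighted average of criterion scores, rounded to an integer."""
    total_weight = sum(c["weight"] for c in criteria.values())
    weighted_sum = sum(c["score"] * c["weight"] for c in criteria.values())
    return round(weighted_sum / total_weight)

print(weighted_score(criteria))  # 65*0.3 + 88*0.5 + 56*0.2 = 74.7 -> 75
```

Dividing by the total weight keeps the result well-defined even when the weights do not sum exactly to 1.0.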
Features
- Users define the dimensions and their respective weights
- Triform automatically computes the weighted average
- All evaluations store the full breakdown—not just the average
- Scorecards can be displayed in detail during and after execution
Use Cases
- Model performance grading based on multiple quality metrics
- Automated quality gates for deployment pipelines
- Trend monitoring across executions, flows, or time windows
- Multi-dimensional feedback in human-in-the-loop systems
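As a sketch of the quality-gate use case, a downstream step might read a Scorecard and block deployment below a threshold (the passes_gate function and the 70-point cutoff are illustrative, not a Triform API):

```python
def passes_gate(scorecard, min_score=70):
    """Return True when the evaluation passed and met the score threshold."""
    spec = scorecard["spec"]
    return spec["pass"] and spec["score"] >= min_score

# A trimmed Scorecard with just the fields the gate inspects.
card = {"spec": {"score": 75, "pass": True}}

print(passes_gate(card))                # True: passed and 75 >= 70
print(passes_gate(card, min_score=80))  # False: 75 is below the cutoff
```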