Document Intelligence
BC Gov Document Processing Platform

Benchmarking System — Technical Reference

This page covers the internal architecture, data model, API surface, and development configuration of the benchmarking system. For a usage-focused guide, see the Benchmarking Guide.

Architecture Overview

The benchmarking system is built on three infrastructure components that integrate with the existing NestJS backend and Temporal workflow engine:

Object Storage (MinIO / Azure Blob)

Pluggable object storage for dataset files (inputs, ground truth, manifests) and benchmark run outputs. The system uses a BlobStorageInterface abstraction with two implementations, switchable via the BLOB_STORAGE_PROVIDER environment variable:

  • MinIO (default) — S3-compatible, used for local development. Ports: 19000 (API), 19001 (Console).
  • Azure Blob Storage — for cloud deployments using Azure infrastructure.
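The provider switch can be sketched as follows. Only the interface name BlobStorageInterface, the BLOB_STORAGE_PROVIDER variable, and the "minio"/"azure" values come from the system; the method signatures and class names below are illustrative assumptions.

```typescript
// Illustrative sketch only: BlobStorageInterface and the "minio"/"azure"
// provider values come from the system; method names and classes here are assumptions.
interface BlobStorageInterface {
  putObject(key: string, data: string): Promise<void>;
  getObject(key: string): Promise<string>;
}

class MinioStorage implements BlobStorageInterface {
  async putObject(_key: string, _data: string): Promise<void> {} // S3-compatible PUT
  async getObject(_key: string): Promise<string> { return ""; }  // S3-compatible GET
}

class AzureBlobStorage implements BlobStorageInterface {
  async putObject(_key: string, _data: string): Promise<void> {} // Azure SDK upload
  async getObject(_key: string): Promise<string> { return ""; }  // Azure SDK download
}

// In the real service the provider string comes from BLOB_STORAGE_PROVIDER.
function createBlobStorage(provider: string = "minio"): BlobStorageInterface {
  switch (provider) {
    case "minio": return new MinioStorage();
    case "azure": return new AzureBlobStorage();
    default: throw new Error(`Unknown BLOB_STORAGE_PROVIDER: ${provider}`);
  }
}
```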

PostgreSQL (Metadata Store)

Existing PostgreSQL instance extended with benchmark tables via Prisma: datasets, versions, splits, projects, definitions, runs, and audit logs.

Temporal (Workflow Orchestration)

Orchestrates the benchmark run lifecycle — dataset materialization, per-document fan-out execution, evaluation, aggregation, and baseline comparison. Uses a dedicated benchmark-processing task queue.

Task Queue Isolation
Benchmark workflows run on the benchmark-processing task queue by default, isolating them from production document processing. The useProductionQueue runtime setting can route child workflows to ocr-processing instead, for benchmarking under production conditions.

Data Model

All benchmark entities are defined in the Prisma schema at apps/shared/prisma/schema.prisma.

Dataset & DatasetVersion

The datasets table:

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key. |
| name | String (unique) | Human-readable dataset name. |
| description | String? | Optional description. |
| metadata | JSON | Arbitrary metadata (default: {}). |
| storagePath | String | Base storage path prefix. |
| createdBy | String | User who created the dataset. |
| versions | Relation | Has many DatasetVersion. |

The dataset_versions table:

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key. |
| datasetId | UUID (FK) | References parent dataset. |
| version | String | Version identifier (e.g. "1.0", "2.0"). |
| name | String? | Optional version label. |
| storagePrefix | String? | Object storage prefix for all files in this version. |
| manifestPath | String | Relative path to dataset-manifest.json within the storage prefix. |
| documentCount | Int | Number of samples in this version. |
| groundTruthSchema | JSON? | Optional JSON schema for ground truth validation. |
| frozen | Boolean | Whether the version is locked from modification (default: false). Automatically set to true when a benchmark run starts. |

Split

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key. |
| datasetVersionId | UUID (FK) | References parent dataset version. |
| name | String | Display name (e.g. "Test Split"). |
| type | SplitType | One of: train, val, test, golden. |
| sampleIds | JSON | Array of sample IDs belonging to this split. |
| stratificationRules | JSON? | Optional rules used to generate the split. |
| frozen | Boolean | Whether the split is locked from modification (default: false). |

BenchmarkProject

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key. |
| name | String (unique) | Project name. |
| description | String? | Optional description. |
| createdBy | String | User who created the project. |

BenchmarkDefinition

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key. |
| projectId | UUID (FK) | References parent project. |
| name | String | Definition name. |
| datasetVersionId | UUID (FK) | Pinned dataset version. |
| splitId | UUID? (FK) | Optional split to evaluate against. |
| workflowId | UUID (FK) | Workflow to execute. |
| workflowConfigHash | String | SHA-256 hash of the workflow configuration. |
| evaluatorType | String | Evaluator type identifier (e.g. schema-aware). |
| evaluatorConfig | JSON | Evaluator-specific configuration (field rules, thresholds). |
| runtimeSettings | JSON | Execution settings (concurrency, timeouts, queue routing). |
| immutable | Boolean | Whether the definition is locked (default: false). |
| revision | Int | Revision counter (default: 1). |
| scheduleEnabled | Boolean | Whether scheduled execution is enabled. |
| scheduleCron | String? | Cron expression for scheduled runs. |
| scheduleId | String? (unique) | Temporal schedule ID. |

BenchmarkRun

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key. |
| definitionId | UUID (FK) | References parent definition. |
| projectId | UUID (FK) | References parent project. |
| status | BenchmarkRunStatus | Run state: pending \| running \| completed \| failed \| cancelled. |
| temporalWorkflowId | String | Temporal workflow execution ID. |
| workerImageDigest | String? | Docker image digest of the benchmark worker. |
| workerGitSha | String | Git SHA of the codebase at execution time. |
| startedAt | DateTime? | When the workflow started executing. |
| completedAt | DateTime? | When the workflow finished. |
| metrics | JSON | Combined flat metrics + structured aggregate + per-sample results (see Metrics Storage Format). |
| params | JSON | Captured input parameters for reproducibility. |
| tags | JSON | Key-value tags (e.g. { regression: "true" }). |
| error | String? | Error message if the run failed. |
| isBaseline | Boolean | Whether this run is the current baseline for its definition. |
| baselineThresholds | JSON? | Array of MetricThreshold (set when promoted to baseline). |
| baselineComparison | JSON? | Comparison result against the baseline (set after run completion). |

BenchmarkAuditLog

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key. |
| timestamp | DateTime | When the event occurred. |
| userId | String | User who performed the action. |
| action | AuditAction | Event type. |
| details | JSON | Event-specific details. |

DatasetGroundTruthJob

Tracks the lifecycle of ground truth generation for individual samples. Each job links a dataset sample to a Document record that is processed through an OCR workflow and then reviewed via a dataset-scoped HITL queue.

| Field | Type | Description |
|---|---|---|
| id | UUID | Primary key. |
| datasetVersionId | UUID (FK) | References parent dataset version. |
| sampleId | String | Manifest sample ID. |
| documentId | UUID? (FK, unique) | References the Document created for OCR processing. Unique constraint ensures one-to-one mapping. |
| workflowConfigId | String | Workflow configuration used for OCR processing. |
| temporalWorkflowId | String? | Temporal workflow execution ID. |
| status | GroundTruthJobStatus | Job lifecycle state (see Enums below). |
| groundTruthPath | String? | Blob storage path of the generated ground truth JSON. |
| error | String? | Error message if the job failed. |

Indices: datasetVersionId, documentId, status. The unique constraint on documentId is used by the production HITL queue to exclude ground truth documents via groundTruthJob: { is: null }.

Enums

SplitType

  • train
  • val
  • test
  • golden

BenchmarkRunStatus

  • pending
  • running
  • completed
  • failed
  • cancelled

AuditAction

  • dataset_created
  • version_published
  • run_started
  • run_completed
  • baseline_promoted

GroundTruthJobStatus

  • pending — Job created, not yet started.
  • processing — OCR workflow is running.
  • awaiting_review — OCR complete, waiting for HITL review.
  • completed — Ground truth extracted and written.
  • failed — Job encountered an error.

API Endpoints

Dataset Management

Controller: apps/backend-services/src/benchmark/dataset.controller.ts

| Method | Path | Description |
|---|---|---|
| POST | /api/benchmark/datasets | Create a dataset. |
| GET | /api/benchmark/datasets | List datasets (paginated). |
| GET | /api/benchmark/datasets/:id | Get dataset details. |
| DELETE | /api/benchmark/datasets/:id | Delete a dataset. |
| POST | /api/benchmark/datasets/:id/versions | Create a dataset version. |
| POST | /api/benchmark/datasets/:id/versions/:versionId/upload | Upload dataset files (inputs + ground truth). Blocked if version is frozen. |
| DELETE | /api/benchmark/datasets/:id/versions/:versionId/samples/:sampleId | Delete a sample from a version. Blocked if version is frozen. |
| POST | /api/benchmark/datasets/:id/versions/:versionId/freeze | Freeze a dataset version (prevents uploads and sample deletions). |
| POST | /api/benchmark/datasets/:id/versions/:versionId/validate | Validate dataset files and ground truth. |

HITL Dataset Creation

Controller: apps/backend-services/src/benchmark/hitl-dataset.controller.ts

These endpoints allow creating datasets from documents that have been verified through the Human-In-The-Loop review interface. The corrected OCR data becomes ground truth in the standard flat key-value format.

| Method | Path | Description |
|---|---|---|
| GET | /api/benchmark/datasets/from-hitl/eligible-documents | List documents with approved HITL sessions (paginated, searchable). Returns filename, file type, approval date, reviewer, field count, and correction count. |
| POST | /api/benchmark/datasets/from-hitl | Create a new dataset and version from selected HITL-verified document IDs. |
| POST | /api/benchmark/datasets/:id/versions/from-hitl | Add a new version to an existing dataset from selected HITL-verified document IDs. |

Eligible Documents Query Parameters

| Parameter | Type | Description |
|---|---|---|
| page | number | Page number (default: 1). |
| limit | number | Results per page (default: 20). |
| search | string | Filter by filename (case-insensitive substring match). |

Ground Truth Construction

For each selected document, the service takes the most recent approved review session and applies its field corrections to the OCR results. Corrections are applied as follows: confirmed fields keep their original value, corrected fields use the reviewer's value, deleted fields are removed, and flagged fields are kept as-is. The output is a flat key-value JSON file (e.g., {"vendor_name": "Acme Corp", "total_amount": 1250.75}) using the same field value extraction logic as the benchmark workflow (extractFieldValue). Provenance metadata (source document ID, review session, reviewer) is stored in the manifest sample's metadata field.
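The four correction outcomes above can be sketched as follows. The outcomes (confirmed, corrected, deleted, flagged) come from the document; the type and function names here are hypothetical, not the actual service API.

```typescript
// Illustrative sketch of the correction-application rules; FieldCorrection and
// applyCorrections are hypothetical names, the four outcomes come from the service.
type CorrectionAction = "confirmed" | "corrected" | "deleted" | "flagged";

interface FieldCorrection {
  fieldName: string;
  action: CorrectionAction;
  correctedValue?: unknown;
}

function applyCorrections(
  ocrFields: Record<string, unknown>,
  corrections: FieldCorrection[],
): Record<string, unknown> {
  const groundTruth: Record<string, unknown> = { ...ocrFields };
  for (const c of corrections) {
    switch (c.action) {
      case "corrected":
        groundTruth[c.fieldName] = c.correctedValue; // reviewer's value wins
        break;
      case "deleted":
        delete groundTruth[c.fieldName]; // removed from ground truth
        break;
      case "confirmed":
      case "flagged":
        break; // original OCR value kept as-is
    }
  }
  return groundTruth;
}
```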

Ground Truth Generation

Controller: apps/backend-services/src/benchmark/ground-truth-generation.controller.ts

These endpoints manage the ground truth generation workflow — running samples without ground truth through an OCR workflow and providing a dataset-scoped HITL review queue.

| Method | Path | Description |
|---|---|---|
| POST | /api/benchmark/datasets/:id/versions/:versionId/ground-truth-generation | Start ground truth generation. Creates jobs for samples without GT and starts OCR workflows. Body: { workflowConfigId: string }. |
| GET | /api/benchmark/datasets/:id/versions/:versionId/ground-truth-generation/jobs | List ground truth jobs (paginated). Lazily syncs job statuses from document status. |
| GET | /api/benchmark/datasets/:id/versions/:versionId/ground-truth-generation/review/queue | Dataset-scoped HITL review queue. Returns documents awaiting review with OCR results and last session info. |
| GET | /api/benchmark/datasets/:id/versions/:versionId/ground-truth-generation/review/stats | Review queue statistics: total jobs, awaiting review, completed, failed. |

Jobs Query Parameters

| Parameter | Type | Description |
|---|---|---|
| page | number | Page number (default: 1). |
| limit | number | Results per page (default: 50, max: 100). |

Review Queue Query Parameters

| Parameter | Type | Description |
|---|---|---|
| limit | number | Results per page (default: 50, max: 100). |
| offset | number | Number of results to skip (default: 0). |
| reviewStatus | pending \| reviewed \| all | Filter by review status (default: pending). |

Queue Separation

Production and dataset HITL queues are separated at the database query level. The production queue (GET /api/hitl/queue) filters documents with groundTruthJob: { is: null }, excluding any document linked to a ground truth generation job. The dataset queue queries DatasetGroundTruthJob records directly. Both queues use the same ReviewSession and FieldCorrection models for the actual review process.

Post-Approval Hook

When a review session is approved via POST /api/hitl/sessions/:id/submit, the HITL service checks if the document is linked to a DatasetGroundTruthJob. If so, it triggers ground truth extraction: the OCR results are combined with the reviewer's corrections using buildGroundTruth(), the resulting JSON is written to dataset storage, and the manifest is updated. This hook is non-blocking — if it fails, the session approval still succeeds.
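The non-blocking behavior of the hook can be sketched as a simple try/catch around the extraction step. The function and parameter names below are hypothetical stand-ins; only the pattern (approval succeeds even when extraction fails, with the failure logged) comes from the document.

```typescript
// Sketch of the non-blocking post-approval hook: the session approval must
// succeed, while ground truth extraction is best-effort. Names are hypothetical.
async function submitApprovedSession(
  approveSession: () => Promise<void>,
  extractGroundTruth: () => Promise<void>,
  log: (msg: string) => void,
): Promise<void> {
  await approveSession(); // the review approval itself must succeed or throw

  try {
    // Best-effort: build ground truth, write it to dataset storage, update manifest.
    await extractGroundTruth();
  } catch (err) {
    // Non-blocking: a failed extraction is logged, not propagated to the caller.
    log(`ground truth extraction failed: ${(err as Error).message}`);
  }
}
```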

Projects

Controller: apps/backend-services/src/benchmark/benchmark-project.controller.ts

| Method | Path | Description |
|---|---|---|
| POST | /api/benchmark/projects | Create a project. |
| GET | /api/benchmark/projects | List all projects. |
| GET | /api/benchmark/projects/:id | Get project details. |
| DELETE | /api/benchmark/projects/:id | Delete a project. |

Definitions

Controller: apps/backend-services/src/benchmark/benchmark-definition.controller.ts

| Method | Path | Description |
|---|---|---|
| POST | /api/benchmark/projects/:projectId/definitions | Create a definition. |
| GET | /api/benchmark/projects/:projectId/definitions | List definitions in a project. |
| GET | /api/benchmark/projects/:projectId/definitions/:id | Get definition details. |
| PUT | /api/benchmark/projects/:projectId/definitions/:id | Update a definition. |
| DELETE | /api/benchmark/projects/:projectId/definitions/:id | Delete a definition. |
| POST | /api/benchmark/projects/:projectId/definitions/:id/schedule | Configure scheduled execution. |
| GET | /api/benchmark/projects/:projectId/definitions/:id/schedule | Get schedule info. |
| GET | /api/benchmark/projects/:projectId/definitions/:id/baseline-history | Get baseline promotion history. |

Runs

Controller: apps/backend-services/src/benchmark/benchmark-run.controller.ts

| Method | Path | Description |
|---|---|---|
| POST | /api/benchmark/projects/:projectId/definitions/:defId/runs | Start a benchmark run. |
| GET | /api/benchmark/projects/:projectId/runs | List runs in a project. |
| GET | /api/benchmark/projects/:projectId/runs/:runId | Get run details with headline metrics. |
| POST | /api/benchmark/projects/:projectId/runs/:runId/cancel | Cancel a running benchmark. |
| GET | /api/benchmark/projects/:projectId/runs/:runId/drill-down | Get aggregated metrics, worst samples, and per-field errors. |
| GET | /api/benchmark/projects/:projectId/runs/:runId/samples | Get per-sample results (paginated, filterable). |
| POST | /api/benchmark/projects/:projectId/runs/:runId/baseline | Promote run to baseline. |
| DELETE | /api/benchmark/projects/:projectId/runs/:runId | Delete a run. |

Per-Sample Results Query Parameters

| Parameter | Type | Description |
|---|---|---|
| page | number | Page number (default: 1). |
| limit | number | Results per page (default: 20). |
| passFilter | pass \| fail | Filter by pass/fail status. |
| dimension | string | Metadata dimension to filter by (e.g. docType). |
| dimensionValue | string | Value of the dimension to filter on (e.g. invoice). |

Temporal Workflow Engine

Orchestrator Workflow

The benchmarkRunWorkflow function in apps/temporal/src/benchmark-workflow.ts orchestrates the full run lifecycle:

  1. Materialize dataset — downloads files from object storage to local cache.
  2. Load manifest — parses dataset-manifest.json and filters by split.
  3. Execute & evaluate (fan-out) — processes samples in batches:
    • Executes the graph workflow as a child Temporal workflow per sample.
    • Extracts predictions from the workflow context.
    • Writes predictions to disk and runs the evaluator.
  4. Aggregate — computes run-level statistics from all evaluation results.
  5. Update run status — writes metrics and status to the database.
  6. Compare against baseline — if a baseline exists, runs threshold checks.
  7. Cleanup — removes temporary outputs (cached datasets are preserved).

The workflow can be cancelled mid-run and answers a progress query with the following response shape.

Progress Query Response

| Field | Type | Description |
|---|---|---|
| phase | string | Current phase: materializing, executing, evaluating, aggregating, cleanup, or completed. |
| totalSamples | number | Total samples to process. |
| completedSamples | number | Samples finished so far. |
| failedSamples | number | Samples that failed execution or evaluation. |
| percentComplete | number | Integer 0–100. |

Activities

| Activity | File | Description |
|---|---|---|
| benchmark.materializeDataset | activities/benchmark-materialize.ts | Downloads dataset version from object storage to a local cache directory. Uses cache key {datasetId}-{datasetVersionId}. |
| benchmark.loadDatasetManifest | activities/benchmark-materialize.ts | Reads and parses the dataset manifest from the materialized directory. |
| benchmark.evaluate | activities/benchmark-evaluate.ts | Evaluates a single sample using the configured evaluator. Returns EvaluationResult with metrics, diagnostics, and pass/fail. |
| benchmark.aggregate | activities/benchmark-evaluate.ts | Aggregates all evaluation results into BenchmarkAggregationResult with overall stats, sliced metrics, and failure analysis. |
| benchmark.writePrediction | activities/benchmark-write-prediction.ts | Writes a prediction JSON to disk so the evaluator can read it. |
| benchmark.updateRunStatus | activities/benchmark-update-run.ts | Updates the BenchmarkRun record with status, metrics, and timestamps. |
| benchmark.compareAgainstBaseline | activities/benchmark-baseline-comparison.ts | Compares a run's flat metrics against the baseline's thresholds. Stores the comparison and tags regressions. |
| benchmark.cleanup | activities/benchmark-logging.ts | Removes temporary output files. Optionally preserves cached datasets. |

The child workflow benchmarkExecuteWorkflow (in activities/benchmark-execute.ts) wraps the existing graphWorkflow and executes it for a single sample. It extracts predictions from the workflow context using a field extraction function that handles Azure Document Intelligence field types (typed values like valueNumber, valueCurrency, valueDate, etc.).

Configuration & Timeouts

| Setting | Default | Description |
|---|---|---|
| startToCloseTimeout | 30 minutes | Maximum activity execution time. |
| retry.initialInterval | 1 second | Initial retry delay. |
| retry.maximumInterval | 30 seconds | Maximum retry delay. |
| retry.maximumAttempts | 3 | Number of retry attempts. |
| maxParallelDocuments | 10 | Number of samples processed concurrently per batch. |
| timeoutPerDocumentMs | 300,000 (5 min) | Timeout for each child workflow execution. |

All defaults can be overridden per definition via runtimeSettings.
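A per-definition override is plausibly a shallow merge of runtimeSettings over the defaults. The field names below mirror the configuration table and the useProductionQueue setting described earlier; the merge helper and interface shape are illustrative assumptions, not the actual service code.

```typescript
// Hedged sketch: runtimeSettings values from a definition replace the defaults.
// Field names mirror the docs; the RuntimeSettings shape is an assumption.
interface RuntimeSettings {
  maxParallelDocuments: number;
  timeoutPerDocumentMs: number;
  useProductionQueue: boolean;
}

const DEFAULT_RUNTIME_SETTINGS: RuntimeSettings = {
  maxParallelDocuments: 10,
  timeoutPerDocumentMs: 300_000, // 5 minutes per child workflow
  useProductionQueue: false,     // stay on benchmark-processing by default
};

function resolveRuntimeSettings(overrides: Partial<RuntimeSettings> = {}): RuntimeSettings {
  return { ...DEFAULT_RUNTIME_SETTINGS, ...overrides };
}
```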

Evaluator System

Evaluator Interface

Defined in apps/temporal/src/benchmark-types.ts:

```typescript
interface BenchmarkEvaluator {
  type: string;
  evaluate(input: EvaluationInput): Promise<EvaluationResult>;
}

interface EvaluationInput {
  sampleId: string;
  inputPaths: string[];
  predictionPaths: string[];
  groundTruthPaths: string[];
  metadata: Record<string, unknown>;
  evaluatorConfig: Record<string, unknown>;
}

interface EvaluationResult {
  sampleId: string;
  metrics: Record<string, number>;
  diagnostics: Record<string, unknown>;
  pass: boolean;
  artifacts?: EvaluationArtifact[];
  groundTruth?: unknown;
  prediction?: unknown;
  evaluationDetails?: unknown;
}
```

Schema-Aware Evaluator

File: apps/temporal/src/evaluators/schema-aware-evaluator.ts

Performs a field-by-field comparison between predicted and ground truth JSON objects, applying a configurable matching rule to each field.

Produces metrics: precision, recall, f1, truePositives, falsePositives, falseNegatives, totalGroundTruthFields, matchedFields, checkboxAccuracy.

Pass condition: f1 ≥ passThreshold (default: 1.0).
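The listed metrics follow from field-level match counts under the standard precision/recall/F1 definitions. This is a sketch of how they can be derived, assuming those definitions; the evaluator's exact internals may differ.

```typescript
// Derive precision/recall/f1 and the pass flag from field-level match counts,
// using standard definitions. Illustrative, not the evaluator's actual code.
interface FieldCounts {
  truePositives: number;  // fields present in both and matching
  falsePositives: number; // predicted fields that are wrong or extra
  falseNegatives: number; // ground truth fields missed or mismatched
}

function scoreFields(c: FieldCounts, passThreshold = 1.0) {
  const precision = c.truePositives / Math.max(1, c.truePositives + c.falsePositives);
  const recall = c.truePositives / Math.max(1, c.truePositives + c.falseNegatives);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1, pass: f1 >= passThreshold };
}
```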

Black-Box Evaluator

File: apps/temporal/src/evaluators/black-box-evaluator.ts

Treats outputs as opaque and supports two comparison modes.

Pass condition: exact match only.

Evaluator Registry

File: apps/temporal/src/evaluator-registry.ts

A simple Map-based registry that stores evaluator instances keyed by their type string. Both built-in evaluators are registered at module load time. Custom evaluators can be added via the registerEvaluator() function.

Metrics Pipeline

Per-Sample Metrics

Each evaluator returns a Record<string, number> of named metrics for each sample. The specific metrics depend on the evaluator (see Per-Sample Metrics in the user guide).

Aggregation

File: apps/temporal/src/benchmark-aggregation.ts

The aggregateResults() function collects all metric names across samples, then computes the following statistics for each metric:

```typescript
interface MetricStatistics {
  name: string;
  mean: number;     // Arithmetic mean
  median: number;   // 50th percentile
  stdDev: number;   // Population standard deviation
  p5: number;       // 5th percentile
  p25: number;      // 25th percentile (Q1)
  p75: number;      // 75th percentile (Q3)
  p95: number;      // 95th percentile
  min: number;      // Minimum value
  max: number;      // Maximum value
}
```

Percentiles are calculated using linear interpolation: for percentile P across N sorted values, the index is P/100 × (N-1). If the index falls between two values, a weighted average is used.
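The interpolation rule above can be written out directly; this is an illustrative reimplementation, not the aggregation module's actual code.

```typescript
// Linear-interpolation percentile as described: index = p/100 * (n-1),
// with a weighted average when the index is fractional. Expects sorted input.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) throw new Error("empty input");
  const idx = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  if (lo === hi) return sorted[lo]; // index landed exactly on a value
  const weight = idx - lo;
  return sorted[lo] * (1 - weight) + sorted[hi] * weight; // weighted average
}
```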

The aggregation also computes overall counts: totalSamples, passingSamples, failingSamples, and passRate.

Storage Format

The BenchmarkRun.metrics JSON field stores a combined structure:

```jsonc
{
  // Flat metrics (for baseline comparison and quick access)
  "total_samples": 100,
  "passing_samples": 95,
  "failing_samples": 5,
  "pass_rate": 0.95,
  "f1.mean": 0.92,
  "f1.median": 0.93,
  "f1.stdDev": 0.05,
  "f1.p5": 0.81,
  "f1.p25": 0.88,
  "f1.p75": 0.96,
  "f1.p95": 0.99,
  "f1.min": 0.65,
  "f1.max": 0.99,

  // Structured aggregate (for drill-down)
  "_aggregate": {
    "overall": { ... },
    "sliced": [ ... ],
    "failureAnalysis": { ... }
  },

  // Per-sample results (for sample browsing)
  "perSampleResults": [
    {
      "sampleId": "sample-1",
      "metrics": { "f1": 0.92, "precision": 0.95, "recall": 0.89 },
      "diagnostics": { ... },
      "pass": true,
      "groundTruth": { ... },
      "prediction": { ... }
    }
  ]
}
```

The flat keys (like f1.mean) are produced by the flattenMetrics() helper in the workflow, which converts AggregatedMetrics into a flat Record<string, number>. These are the keys used by baseline comparison.
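The flattening step can be sketched like this: each statistic of each metric becomes a "name.stat" key. The MetricStatistics shape comes from the Aggregation section above; the helper body here is an assumption about how flattenMetrics() behaves, not its actual source.

```typescript
// Illustrative flattening in the spirit of flattenMetrics(): turns per-metric
// statistics into a flat Record<string, number> keyed like "f1.mean".
interface MetricStatistics {
  name: string;
  mean: number;
  median: number;
  stdDev: number;
  p5: number; p25: number; p75: number; p95: number;
  min: number; max: number;
}

function flattenMetrics(stats: MetricStatistics[]): Record<string, number> {
  const flat: Record<string, number> = {};
  for (const s of stats) {
    const { name, ...values } = s;
    for (const [stat, value] of Object.entries(values)) {
      flat[`${name}.${stat}`] = value as number; // e.g. "f1.mean", "f1.p95"
    }
  }
  return flat;
}
```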

Sliced Metrics

When sliceDimensions is configured, the aggregation groups samples by metadata dimension values and computes separate AggregatedMetrics for each group:

```typescript
interface SlicedMetrics {
  dimension: string;                          // e.g. "docType"
  slices: Record<string, AggregatedMetrics>;  // e.g. { "invoice": {...}, "receipt": {...} }
}
```

Metadata is read from the sample's diagnostics.metadata field. Samples with missing dimension values are grouped under "unknown".
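The grouping step before per-slice aggregation can be sketched as follows. The "unknown" bucket for missing dimension values comes from the document; the types and function name are illustrative.

```typescript
// Group per-sample results by a metadata dimension before aggregation.
// Samples missing the dimension fall into "unknown", as described above.
interface SampleResult {
  sampleId: string;
  metrics: Record<string, number>;
  metadata: Record<string, string | undefined>;
}

function sliceByDimension(
  samples: SampleResult[],
  dimension: string,
): Record<string, SampleResult[]> {
  const slices: Record<string, SampleResult[]> = {};
  for (const s of samples) {
    const value = s.metadata[dimension] ?? "unknown";
    (slices[value] ??= []).push(s);
  }
  return slices;
}
```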

Failure Analysis

The performFailureAnalysis() function produces two outputs: the worst-performing samples and a per-field error breakdown, both surfaced by the run drill-down endpoint.

Baseline Comparison

File: apps/temporal/src/activities/benchmark-baseline-comparison.ts

The comparison activity runs after every completed run. It:

  1. Looks up the baseline run for the same definition (the run with isBaseline: true).
  2. Iterates over all numeric metrics in both runs.
  3. For each metric, computes the absolute delta and percentage change.
  4. Checks the metric against any matching threshold:
    • Absolute: currentValue ≥ threshold.value
    • Relative: currentValue ≥ baselineValue × threshold.value
  5. Stores the BaselineComparison result on the run record.
  6. Tags the run with regression: "true" if any metric failed its threshold.

```typescript
interface BaselineComparison {
  baselineRunId: string;
  overallPassed: boolean;
  metricComparisons: MetricComparison[];
  regressedMetrics: string[];
}

interface MetricComparison {
  metricName: string;
  currentValue: number;
  baselineValue: number;
  delta: number;
  deltaPercent: number;
  passed: boolean;
  threshold?: MetricThreshold;
}

interface MetricThreshold {
  metricName: string;
  type: "absolute" | "relative";
  value: number;
}
```

Threshold Types

| Type | Rule | Example |
|---|---|---|
| absolute | Current value must be ≥ threshold value. | f1.mean ≥ 0.90 — F1 must be at least 0.90 regardless of baseline. |
| relative | Current value must be ≥ baseline value × threshold value. | pass_rate ≥ baseline × 0.95 — pass rate can drop by at most 5% from the baseline. |
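The two rules can be expressed as one check. MetricThreshold matches the interface shown earlier; the check function itself is an illustrative sketch of the comparison activity's logic.

```typescript
// Sketch of the two threshold rules: absolute compares against a fixed floor,
// relative compares against a fraction of the baseline value.
interface MetricThreshold {
  metricName: string;
  type: "absolute" | "relative";
  value: number;
}

function thresholdPassed(current: number, baseline: number, t: MetricThreshold): boolean {
  return t.type === "absolute"
    ? current >= t.value              // e.g. f1.mean >= 0.90
    : current >= baseline * t.value;  // e.g. pass_rate >= baseline * 0.95
}
```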

Object Storage

The benchmarking system accesses object storage through a BlobStorageInterface abstraction, supporting both MinIO (S3-compatible, default for local development) and Azure Blob Storage (for cloud deployments). The active provider is selected via the BLOB_STORAGE_PROVIDER environment variable ("minio" or "azure").

Datasets are stored with the following structure:

```
datasets/{datasetId}/{version}/
  ├── dataset-manifest.json
  ├── inputs/
  │   ├── sample-1.pdf
  │   └── sample-2.pdf
  └── ground_truth/
      ├── sample-1.json
      └── sample-2.json
```

Dataset Materialization

The materialization activity downloads all files under the storagePrefix into a local cache directory (rooted at BENCHMARK_CACHE_DIR), keyed by {datasetId}-{datasetVersionId} so that repeat runs against the same version reuse the cache.

Environment Variables

| Variable | Default | Description |
|---|---|---|
| TEMPORAL_ADDRESS | localhost:7233 | Temporal server address. |
| TEMPORAL_NAMESPACE | default | Temporal namespace. |
| BENCHMARK_TASK_QUEUE | benchmark-processing | Task queue for benchmark workflows. |
| BENCHMARK_CACHE_DIR | /tmp/benchmark-cache | Local directory for materialized dataset cache. |
| WORKER_IMAGE_DIGEST | (none) | Docker image digest of the benchmark worker (optional, for reproducibility tracking). |
| DATABASE_URL | (required) | PostgreSQL connection string. |

Object Storage

| Variable | Default | Description |
|---|---|---|
| BLOB_STORAGE_PROVIDER | minio | Storage backend: "minio" or "azure". |
| MINIO_ENDPOINT | (required for MinIO) | MinIO server endpoint. |
| MINIO_ACCESS_KEY | (required for MinIO) | MinIO access key. |
| MINIO_SECRET_KEY | (required for MinIO) | MinIO secret key. |
| MINIO_DOCUMENT_BUCKET | (required for MinIO) | MinIO bucket name for document storage. |
| AZURE_STORAGE_CONNECTION_STRING | (required for Azure) | Azure Blob Storage connection string. |
| AZURE_STORAGE_CONTAINER_NAME | (required for Azure) | Azure Blob Storage container name. |

Key Files Reference

Backend Services

| File | Purpose |
|---|---|
| apps/backend-services/src/benchmark/benchmark-run.service.ts | Run lifecycle, metrics retrieval, drill-down, per-sample results filtering. |
| apps/backend-services/src/benchmark/benchmark-run.controller.ts | Run REST endpoints. |
| apps/backend-services/src/benchmark/benchmark-definition.service.ts | Definition CRUD, scheduling. |
| apps/backend-services/src/benchmark/benchmark-definition.controller.ts | Definition REST endpoints. |
| apps/backend-services/src/benchmark/benchmark-project.service.ts | Project CRUD. |
| apps/backend-services/src/benchmark/benchmark-project.controller.ts | Project REST endpoints. |
| apps/backend-services/src/benchmark/dataset.service.ts | Dataset versioning, upload, validation. |
| apps/backend-services/src/benchmark/dataset.controller.ts | Dataset REST endpoints. |
| apps/backend-services/src/benchmark/hitl-dataset.service.ts | HITL-to-dataset bridge: eligible document queries, ground truth construction, dataset packaging. |
| apps/backend-services/src/benchmark/hitl-dataset.controller.ts | HITL dataset creation REST endpoints. |
| apps/backend-services/src/benchmark/ground-truth-generation.service.ts | Ground truth generation: job creation, OCR workflow triggering, review queue, GT extraction on approval. |
| apps/backend-services/src/benchmark/ground-truth-generation.controller.ts | Ground truth generation REST endpoints. |
| apps/backend-services/src/benchmark/benchmark-temporal.service.ts | Temporal client wrapper for starting runs. |
| apps/backend-services/src/benchmark/evaluator-registry.service.ts | Backend-side evaluator registry. |

Temporal Workflows & Activities

| File | Purpose |
|---|---|
| apps/temporal/src/benchmark-workflow.ts | Main orchestrator workflow. |
| apps/temporal/src/benchmark-aggregation.ts | Metrics aggregation and failure analysis. |
| apps/temporal/src/benchmark-types.ts | Core type definitions (evaluator interface, result types). |
| apps/temporal/src/evaluator-registry.ts | Evaluator plugin registry. |
| apps/temporal/src/evaluators/schema-aware-evaluator.ts | Schema-aware evaluator implementation. |
| apps/temporal/src/evaluators/black-box-evaluator.ts | Black-box evaluator implementation. |
| apps/temporal/src/activities/benchmark-execute.ts | Child workflow execution per sample. |
| apps/temporal/src/activities/benchmark-materialize.ts | Dataset materialization and manifest loading. |
| apps/temporal/src/activities/benchmark-evaluate.ts | Evaluation and aggregation activities. |
| apps/temporal/src/activities/benchmark-baseline-comparison.ts | Baseline comparison logic. |
| apps/temporal/src/activities/benchmark-write-prediction.ts | Write prediction to disk. |
| apps/temporal/src/activities/benchmark-update-run.ts | Database status updates. |

Data Model

| File | Purpose |
|---|---|
| apps/shared/prisma/schema.prisma | Prisma schema (benchmark models at lines ~380–540). |