Document Intelligence
BC Gov Document Processing Platform
Open Source

Document Intelligence Platform

Enterprise-grade document processing platform with OCR, graph-based workflow orchestration, document classification, custom model training, and human-in-the-loop review capabilities.

Platform Overview

The Document Intelligence Platform combines OCR services, graph-based workflow orchestration, supervised model training, and collaborative review tools in a single platform for document processing workflows.

🚀 Core Features

  • Document Processing - Multi-format document upload and OCR with Azure Document Intelligence
  • Document Classification - Train and deploy Azure Document Intelligence classifiers for automated document type routing
  • Graph Workflows - DAG-based workflow engine with 30+ extensible activities powered by Temporal
  • Model Training - Supervised learning with custom model training and versioned deployments
  • HITL Review - Collaborative human-in-the-loop workflows with confidence-based routing

🏗️ Architecture

  • Backend - NestJS with Express, Prisma ORM, PostgreSQL
  • Frontend - React with TypeScript, Mantine UI
  • Orchestration - Temporal workflow engine with durable execution
  • Storage - Pluggable blob storage (Azure, S3, filesystem)
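
As an illustration of what "pluggable" blob storage can mean, here is a minimal sketch of a storage interface with an in-memory implementation. The names `BlobStorage` and `InMemoryBlobStorage` are hypothetical and are not taken from the platform's source. The real Azure, S3, and filesystem backends may expose a different contract.

```typescript
// Hypothetical storage contract; the platform's actual interface may differ.
interface BlobStorage {
  put(key: string, data: Uint8Array): Promise<void>;
  get(key: string): Promise<Uint8Array | undefined>;
  delete(key: string): Promise<void>;
}

// In-memory backend, useful for tests. An Azure, S3, or filesystem backend
// would implement the same three methods against its own client.
class InMemoryBlobStorage implements BlobStorage {
  private readonly blobs = new Map<string, Uint8Array>();

  async put(key: string, data: Uint8Array): Promise<void> {
    // Store a copy so later mutation of the caller's buffer has no effect.
    this.blobs.set(key, data.slice());
  }

  async get(key: string): Promise<Uint8Array | undefined> {
    const blob = this.blobs.get(key);
    return blob ? blob.slice() : undefined;
  }

  async delete(key: string): Promise<void> {
    this.blobs.delete(key);
  }
}
```

Code that depends only on the interface can switch backends through configuration without changing call sites.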

Key Capabilities

Graph-Based Workflow Engine

The platform includes a DAG (Directed Acyclic Graph) workflow engine for building multi-step document processing pipelines. Workflows are defined as JSON graphs whose nodes represent activities such as OCR, validation, training, and review.

30+ Built-in Activities: OCR processing, field extraction, validation, confidence filtering, model training, HITL routing, parallel execution, conditional branching, and more.
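
To make the graph model concrete, the sketch below shows one possible shape for a workflow definition and derives a valid execution order with Kahn's topological sort. The `WorkflowGraph` schema, the node ids, and the activity names are all assumptions made for illustration. The platform's actual JSON format and scheduling logic are described in docs-md/graph-workflows/ and may differ.

```typescript
// Hypothetical graph shape; the platform's real JSON schema may differ.
interface WorkflowNode {
  id: string;
  activity: string; // illustrative activity name, e.g. "ocr"
}

interface WorkflowGraph {
  nodes: WorkflowNode[];
  edges: Array<{ from: string; to: string }>;
}

// Kahn's algorithm: returns node ids in an order that respects every edge,
// and throws if the graph contains a cycle (so it is not a DAG).
function topologicalOrder(graph: WorkflowGraph): string[] {
  const inDegree = new Map<string, number>();
  const successors = new Map<string, string[]>();
  for (const node of graph.nodes) {
    inDegree.set(node.id, 0);
    successors.set(node.id, []);
  }
  for (const { from, to } of graph.edges) {
    if (!inDegree.has(from) || !inDegree.has(to)) {
      throw new Error(`Edge references unknown node: ${from} -> ${to}`);
    }
    successors.get(from)!.push(to);
    inDegree.set(to, inDegree.get(to)! + 1);
  }

  // Start with nodes that have no unmet dependencies.
  const ready = graph.nodes
    .filter((node) => inDegree.get(node.id) === 0)
    .map((node) => node.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of successors.get(id)!) {
      const remaining = inDegree.get(next)! - 1;
      inDegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }
  if (order.length !== graph.nodes.length) {
    throw new Error("Workflow graph contains a cycle");
  }
  return order;
}

const exampleGraph: WorkflowGraph = {
  nodes: [
    { id: "ocr", activity: "ocr" },
    { id: "extract", activity: "field-extraction" },
    { id: "validate", activity: "validation" },
    { id: "review", activity: "hitl-review" },
  ],
  edges: [
    { from: "ocr", to: "extract" },
    { from: "extract", to: "validate" },
    { from: "validate", to: "review" },
  ],
};

const executionOrder = topologicalOrder(exampleGraph);
```

Rejecting cycles at definition time is what makes the "acyclic" guarantee enforceable before any activity runs.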

📄 Document Management

Full-lifecycle document management with metadata tagging, versioning, and relationship tracking. Supports PDFs, images, and multi-page documents.

🏷️ Labeling & Annotation

Project-based labeling system for ground truth creation. Supports bounding box annotations, field labeling, and batch operations for training data preparation.

🧠 Model Training

End-to-end custom model training pipeline. Train Azure Document Intelligence custom models from labeled data with versioning and deployment management.

👥 Human Review Queue

Collaborative HITL system with assignment management, bulk operations, and confidence-based routing for quality assurance workflows.
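
Confidence-based routing can be pictured as a threshold check over extracted fields. The following is a minimal sketch under stated assumptions: the `ExtractedField` shape, the route names, and the default threshold of 0.9 are hypothetical and are not taken from the platform.

```typescript
// Hypothetical field shape; real OCR results carry more detail.
interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // assumed to be in the range 0..1
}

type Route = "auto-approve" | "human-review";

// Send a document to human review if any field falls below the threshold.
// A document with no extracted fields is also routed to review, since
// there is nothing to approve automatically.
function routeDocument(fields: ExtractedField[], threshold = 0.9): Route {
  const needsReview =
    fields.length === 0 || fields.some((field) => field.confidence < threshold);
  return needsReview ? "human-review" : "auto-approve";
}
```

For example, a document whose fields all score 0.95 or higher would be auto-approved at the default threshold, while a single field at 0.5 would send the whole document to the review queue.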

🔐 Authentication & Security

Keycloak SSO integration with JWT bearer tokens and API key management. Role-based access control for secure multi-tenant operations.

📊 Extensible Activities

Developer-friendly activity registry system. Create custom activities by extending base classes and registering them in the workflow engine.
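
The extend-and-register pattern described above might look roughly like this. The class names `BaseActivity`, `ActivityRegistry`, and `UppercaseTextActivity`, along with the method signatures, are hypothetical stand-ins. The platform's real base classes and registration API are documented in the activity development guides and may differ.

```typescript
type ActivityPayload = Record<string, unknown>;

// Hypothetical base class; the platform's own base class may differ.
abstract class BaseActivity {
  abstract readonly name: string;
  abstract execute(input: ActivityPayload): Promise<ActivityPayload>;
}

// Hypothetical registry that maps activity names to implementations.
class ActivityRegistry {
  private readonly activities = new Map<string, BaseActivity>();

  register(activity: BaseActivity): void {
    if (this.activities.has(activity.name)) {
      throw new Error(`Activity already registered: ${activity.name}`);
    }
    this.activities.set(activity.name, activity);
  }

  resolve(name: string): BaseActivity {
    const activity = this.activities.get(name);
    if (!activity) {
      throw new Error(`Unknown activity: ${name}`);
    }
    return activity;
  }
}

// Illustrative custom activity: upper-cases the "text" property of its input.
class UppercaseTextActivity extends BaseActivity {
  readonly name = "text.uppercase";

  async execute(input: ActivityPayload): Promise<ActivityPayload> {
    const text = typeof input["text"] === "string" ? input["text"] : "";
    return { ...input, text: text.toUpperCase() };
  }
}

const registry = new ActivityRegistry();
registry.register(new UppercaseTextActivity());
```

A workflow node would then refer to the activity by its registered name, and the engine would look it up at execution time.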

🗂️ Document Classification

Train Azure Document Intelligence classifiers to automatically identify document types. Manage the full lifecycle from training data upload through classifier deployment.

Quick Start

Prerequisites
Node.js 24+, PostgreSQL 14+, Temporal Server, Azure Document Intelligence resource (for OCR)

1. Install Dependencies

npm install

2. Configure Environment

Set up environment variables for database, Temporal, Azure services, and authentication:

DATABASE_URL=postgresql://user:pass@localhost:5432/docdb
TEMPORAL_ADDRESS=localhost:7233
AZURE_DI_ENDPOINT=https://your-instance.cognitiveservices.azure.com/
AZURE_DI_KEY=your-key
SSO_AUTH_SERVER_URL=https://your-keycloak.com/realms/your-realm
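
A fail-fast check at startup can catch missing configuration early. The sketch below validates the five variables listed above. Treating all five as mandatory is an assumption, and `loadConfig` is a hypothetical helper, not part of the platform.

```typescript
// Variable names come from the example above; whether each one is strictly
// required by the backend is an assumption of this sketch.
const REQUIRED_VARS = [
  "DATABASE_URL",
  "TEMPORAL_ADDRESS",
  "AZURE_DI_ENDPOINT",
  "AZURE_DI_KEY",
  "SSO_AUTH_SERVER_URL",
] as const;

type RequiredVar = (typeof REQUIRED_VARS)[number];
type AppConfig = Record<RequiredVar, string>;

// Hypothetical helper: returns a typed config object, or throws an error
// that lists every missing or blank variable.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing = REQUIRED_VARS.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  const config = {} as AppConfig;
  for (const key of REQUIRED_VARS) {
    config[key] = env[key] as string;
  }
  return config;
}

// In a Node.js entry point this would typically be called as:
//   const config = loadConfig(process.env);
```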

3. Initialize Database

cd apps/backend-services
npm run db:migrate
npm run db:generate

4. Start Services

# Terminal 1: Backend API
cd apps/backend-services && npm run start:dev

# Terminal 2: Temporal Worker
cd apps/temporal && npm run worker

# Terminal 3: Frontend
cd apps/frontend && npm run dev

5. Access the Platform

API Documentation

Comprehensive REST API documentation is available on the API Reference page. The API supports both Bearer token authentication (Keycloak SSO) and API key authentication.
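
The two authentication modes can be sketched as a small header builder. The `Authorization: Bearer` scheme is the standard form for JWT bearer tokens. The API key header name used here (`x-api-key`) is an assumption, so check the API Reference for the header the platform actually expects. `buildAuthHeaders` and `Credentials` are hypothetical names.

```typescript
type Credentials =
  | { kind: "bearer"; token: string } // JWT issued by Keycloak SSO
  | { kind: "apiKey"; key: string };

// Hypothetical helper. The default API key header name is an assumption;
// pass the real header name if the platform uses a different one.
function buildAuthHeaders(
  credentials: Credentials,
  apiKeyHeader = "x-api-key",
): Record<string, string> {
  if (credentials.kind === "bearer") {
    return { Authorization: `Bearer ${credentials.token}` };
  }
  return { [apiKeyHeader]: credentials.key };
}
```

The returned object can be passed as the `headers` option of an HTTP client such as Axios, which the frontend already uses.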

📚 API Reference

Interactive API documentation with request/response schemas, authentication details, and example requests for all 56 endpoints.

💻 GitHub Repository

View source code, report issues, contribute to the project, and access additional documentation.

Technology Stack

Core Technologies

Backend Services

  • NestJS 11 + Express
  • Prisma ORM
  • PostgreSQL 14+
  • Temporal Workflow Engine
  • Azure Document Intelligence SDK

Frontend Application

  • React 19 + TypeScript
  • Vite
  • Mantine UI Components
  • React Flow (workflow visualization)
  • Axios + React Query

Project Structure

apps/
├── backend-services/    # NestJS API server (14 modules)
├── frontend/            # React application
├── temporal/            # Temporal workflow workers
├── image-service/       # Python image preprocessing tools
└── shared/              # Shared Prisma schema

docs-md/                 # Technical documentation
docs/                    # This documentation site
deployments/             # OpenShift/Kubernetes manifests

Learn More

Additional Documentation
Detailed technical documentation is available in the docs-md/ directory, including:
  • HITL_ARCHITECTURE.md: Human-in-the-loop system design
  • TEMPLATE_TRAINING.md: Model training workflows
  • docs-md/graph-workflows/: Workflow engine architecture and activity development guides