AI Services Hub
Azure Landing Zone Infrastructure

Architecture Decision Records

This page documents the key architecture decisions behind the AI Services Hub. Each architecture decision record explains the problem being solved, the choice that was made, and the trade-offs that came with that choice. If you are new to the project, this page is here to explain not just what was built, but why it was built that way.

BC Government Policy Context
Many of the choices documented here were not optional design preferences. They were driven by British Columbia Government security and platform rules. Important constraints include:
  • No public endpoints - All Azure services must use private endpoints only
  • Private networking required - Resources must be isolated inside virtual networks
  • No long-lived secrets - Identity and short-lived token patterns are preferred over stored passwords and keys
  • Bastion-only access - Direct SSH/RDP from the internet is prohibited

Start Here: Architecture Decision Record 001

Architecture Decision Record 001 is the foundation that explains why the rest of the infrastructure exists. Read it first if you want the big-picture explanation before reviewing the more specific decisions.

About Architecture Decision Records
Architecture decision records capture a decision together with its background and consequences. They help future maintainers avoid reopening the same debate without context, and they give new team members a faster way to understand the platform.

Decision Index

| ID | Title | Driver | Status |
|---------|----------------------------------------------------------|------------|---------------------|
| ADR-001 | Shared AI Landing Zone | Policy | Accepted |
| ADR-002 | Use OIDC instead of Service Principal Secrets | Policy | Accepted |
| ADR-003 | Optional Use of Azure Bastion for VM Access | Policy | Accepted |
| ADR-004 | Private Endpoints for All Azure Services | Policy | Accepted |
| ADR-005 | Zero-Dependency Documentation System | Choice | Accepted |
| ADR-006 | Terraform as Infrastructure as Code (IaC) | Choice | Accepted |
| ADR-007 | Client Connectivity via App Gateway + APIM | Policy | Accepted |
| ADR-008 | No Azure Portal or Foundry Studio UI Access | Policy | Pending Platform/MS |
| ADR-009 | Why AI Landing Zone vs Custom Solution | Choice | Accepted |
| ADR-010 | Multi-Tenant Isolation Model | Policy | Accepted |
| ADR-011 | Control Plane vs Data Plane Access & Chisel Tunnel | Policy | Accepted |
| ADR-012 | Usage Monitoring, Cost Allocation and Chargeback Metrics | Operations | Proposed |
| ADR-013 | Scaled Stack Architecture with Isolated State Files | Choice | Accepted |
| ADR-014 | APIM Subscription Key Rotation | Choice | Proposed |
| ADR-015 | Tenant Isolation: Resource Group vs Subscription | Policy | Pending Platform/MS |
| ADR-016 | Backend Circuit Breaker Pattern | Resilience | Accepted |
| ADR-017 | Custom Tenant Onboarding Portal Inside AI Hub | Choice | Accepted |
| ADR-018 | External PII Redaction Service | Choice | Accepted |
ADR-001: Shared AI Landing Zone Accepted
Status: Accepted
Date: 2025-01
Driver: Policy Required
Category: Architecture
This is the foundational ADR. It explains why we need VNets, Bastion, Jumpbox, and the Chisel tunnel CI/CD approach. All other ADRs build on these concepts.

Context

Policy Driver: BC Gov requires all Azure services to use private endpoints only for data plane access. GitHub Actions runners on the public internet cannot reach private endpoints. This creates a fundamental problem: how do we run Terraform if GitHub can't talk to Azure resources?

The Problem: Why Can't GitHub Just Run Terraform Directly?

The Network Barrier

When you run terraform apply from GitHub Actions, here's what happens:

  1. GitHub spins up a runner (a VM on Microsoft's public cloud)
  2. Terraform tries to reach the data plane of Azure PaaS services (e.g., Key Vault secrets)
  3. BLOCKED - Key Vault has no public endpoint
  4. Terraform tries the remaining services (storage, databases, etc.)
  5. BLOCKED - data-plane access to every service is private-only, inside the VNet

Result: From the public internet, Terraform is limited to the Azure control plane (and the state storage account); it cannot reach the data plane of any private service.

The Solution: Landing Zone Architecture

We need infrastructure inside the private network that can:

  • Receive commands from GitHub (via OIDC tokens)
  • Run Terraform against private endpoints
  • Allow humans to access resources for debugging

VNet (Virtual Network)

What: Private network in Azure

Why: All resources live here, isolated from public internet

Analogy: The building's internal network

Jumpbox VM

What: Linux VM inside the VNet

Why: Runs Terraform, can reach all private endpoints

Analogy: A workstation inside the secure office

Azure Bastion

What: Managed gateway service

Why: Secure way to access Jumpbox (no public SSH)

Analogy: The secure lobby with ID verification

How Terraform Actually Runs

┌─────────────────────────────────────────────────────────────────────────┐
│                         DEPLOYMENT FLOW                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   GitHub Actions          │ Azure Landing Zone (Private Network)        │
│   (Public Internet)       │                                             │
│                           │                                             │
│   ┌──────────────┐  OIDC  │    ┌──────────────┐    ┌─────────────────┐  │
│   │ GitHub-Hosted│───────▶│    │ Azure Proxy  │───▶│ Private         │  │
│   │ Runner       │        │    │ (Chisel App  │    │ Endpoints       │  │
│   │ ubuntu-24.04 │◀SOCKS5─┘    │  Service)    │    │ (Storage, KV)   │  │
│   └──────────────┘        │    └──────────────┘    └─────────────────┘  │
│          │                │                                             │
│          ▼                │                                             │
│   ┌──────────────┐        │                                             │
│   │ terraform    │        │                                             │
│   │ plan/apply   │        │                                             │
│   └──────────────┘        │                                             │
│   (via SOCKS tunnel)      │                                             │
└─────────────────────────────────────────────────────────────────────────┘

Two deployment options:

Option A: Chisel Tunnel with GitHub-Hosted Runners

Standard GitHub-hosted runners (ubuntu-24.04) combined with a Chisel SOCKS5 proxy deployed as an Azure App Service inside the VNet. This eliminates the need for persistent self-hosted runner infrastructure.

  • GitHub-hosted runner starts (ephemeral, no maintenance)
  • Workflow deploys/starts the Chisel proxy App Service
  • Runner connects through SOCKS5 tunnel into VNet
  • Terraform traffic routed via tunnel to private endpoints
  • No persistent runner pool needed — pay only for workflow minutes
💡
Implementation: The azure-proxy/chisel container runs on an App Service Plan inside the tools VNet subnet. The .deployer.yml reusable workflow deploys it first, then subsequent steps use it as a SOCKS proxy via ALL_PROXY environment variable.

Used by this platform for all CI/CD
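As a rough Terraform sketch of what the proxy deployment could look like (resource names, the container image, and the auth variable below are illustrative assumptions, not the repo's actual code):

```hcl
# Hypothetical sketch: Chisel SOCKS5 proxy as a VNet-integrated App Service.
# Names, image, and settings are illustrative; see azure-proxy/chisel for the real setup.
resource "azurerm_service_plan" "proxy" {
  name                = "asp-chisel-proxy"
  resource_group_name = azurerm_resource_group.tools.name
  location            = azurerm_resource_group.tools.location
  os_type             = "Linux"
  sku_name            = "B1"
}

resource "azurerm_linux_web_app" "proxy" {
  name                      = "app-chisel-proxy"
  resource_group_name       = azurerm_resource_group.tools.name
  location                  = azurerm_resource_group.tools.location
  service_plan_id           = azurerm_service_plan.proxy.id
  virtual_network_subnet_id = azurerm_subnet.tools.id # VNet integration: outbound via the tools subnet

  site_config {
    application_stack {
      docker_image_name   = "jpillora/chisel:latest"
      docker_registry_url = "https://index.docker.io"
    }
    # Run chisel in server mode with SOCKS5 support and credential auth
    app_command_line = "server --socks5 --auth ${var.chisel_auth}"
  }
}
```

In the workflow, the runner would then start a chisel client against this app's hostname and export `ALL_PROXY=socks5://127.0.0.1:<port>` before running Terraform, as described in the implementation note above.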

Option B: Jumpbox + Bastion

Manual access via Bastion for debugging, testing, and emergency fixes.

  • Human connects via Azure Portal
  • Bastion brokers SSH connection
  • Land on Jumpbox inside VNet
  • Run commands, debug issues

Required for human access

What About Other Repositories?

Good News: Other repos do NOT need their own Bastion/Jumpbox/VNet!
This is shared infrastructure - the Landing Zone is set up once and used by all projects.

What This Repo Provides (Set Up Once)

| Component | Purpose | Shared? |
|-----------------------------|--------------------------------------------------|------------------------------|
| VNet + Subnets | Private network for all resources | Yes - all projects use this |
| Azure Bastion | Secure access gateway | Yes - one Bastion for all |
| Jumpbox VM | Admin access point | Yes - shared by admins |
| Azure Proxy (Chisel Server) | SOCKS5 tunnel for CI/CD private-endpoint access | Yes - shared by all stacks |
| Private DNS Zones | Name resolution for private endpoints | Managed by Platform Services |
🔒
DNS & ExpressRoute Note: Private DNS zones are managed centrally by Platform Services, not by this Landing Zone. ExpressRoute connectivity exists for on-premises access, but AI Hub clients will connect through App Gateway + APIM endpoints rather than directly via ExpressRoute.

How Access Actually Works (Public vs Private)

🔑
Key Concept: "No public endpoints" doesn't mean "no access". It means access flows through controlled private channels, not the open internet.
| Who | Access Path | Public IPs? | How It Works |
|-----------------------|--------------------------------------------------------------|--------------|--------------|
| Platform Team (Admin) | Internet → Azure Portal → Bastion → VMs | Bastion only | Bastion is Azure-managed PaaS with a public IP. VMs have private IPs only. This is the ONE public-to-private bridge. |
| Ministry Apps/Users | Gov Network → ExpressRoute → App Gateway → APIM → Services | None | ExpressRoute is a private dedicated circuit from BC Gov data centers to the Azure backbone. Traffic never touches the public internet. |
| GitHub Actions (CI/CD) | GitHub-hosted runner + Chisel SOCKS tunnel → Private Endpoints | None | GitHub-hosted runners (ubuntu-24.04) route Terraform traffic through a Chisel SOCKS5 proxy (App Service inside the VNet). The runner itself is on the public internet, but all data-plane calls are tunnelled privately. |

What is ExpressRoute?

ExpressRoute is NOT a public endpoint. It's a dedicated fiber connection from BC Gov's data centers directly into Azure's network backbone.

  • Traffic stays on private circuits (not internet)
  • Managed by BC Gov Platform Services
  • Already exists - we just use it
  • Like a private tunnel to Azure

Why APIM is Internal

APIM is deployed in internal mode for security: it has no public exposure and is reached through App Gateway:

  • ExpressRoute connects the Gov Network → Azure VNet
  • App Gateway acts as a layer 7 (HTTP) load balancer and reverse proxy in front of APIM, with strong WAF protection
  • Gov users can also reach internal IPs directly via ExpressRoute if needed, with proper firewall rules in place

Summary: Public IPs Limited to App Gateway and Bastion

┌─────────────────────────────────────────────────────────────────────────────┐
│                           BC Gov Network                                    │
│  ┌──────────────┐                                                           │
│  │ Ministry     │                                                           │
│  │ Applications │──┐                                                        │
│  └──────────────┘  │                                                        │
│                    │                                                        │
│  ┌──────────────┐  │    ┌──────────────────────────────────────────────────┐│
│  │ Ministry     │──┼────│         ExpressRoute (Private Circuit)           ││
│  │ Users        │  │    │         NOT public internet!                     ││
│  └──────────────┘  │    └──────────────────────────────────────────────────┘│
└────────────────────┼────────────────────────────────────────────────────────┘
                     │ Firewall Rules (Allowed Traffic)
                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           Azure (Private VNet)                              │
│                                                                             │
│  ┌────────────────┐    ┌────────────────┐    ┌────────────────────────────┐ │
│  │  App Gateway   │───▶│     APIM       │───▶│   Private Endpoints        │ │
│  │  (Public IP)   │    │  (Internal IP) │    │   (Storage, OpenAI, etc.)  │ │
│  └────────────────┘    └────────────────┘    └────────────────────────────┘ │
│         ▲                                                                   │
│         │                                                                   │
│         │ All private IPs - reachable via ExpressRoute                      │
│                                                                             │
│  ════════════════════════════════════════════════════════════════════════   │
│                                                                             │
│  ┌────────────────┐    ┌────────────────┐                                   │
│  │    BASTION     │─▶ │   Jumpbox VM   │    ◀── Only Bastion has public IP  │
│  │  (PUBLIC IP)   │    │  (Private IP)  │        (for admin access)         │
│  └────────────────┘    └────────────────┘                                   │
│         ▲                                                                   │
└─────────┼───────────────────────────────────────────────────────────────────┘
          │
          │ Admin accesses via Azure Portal (browser)
          │
┌─────────┴────────┐
│    INTERNET      │
│  (Platform Team) │
└──────────────────┘
    

Why Not Just Open Public Endpoints Temporarily?

Policy prohibits this. Even temporary public access:

  • Creates audit findings
  • Requires security exemption paperwork
  • Introduces attack window
  • Must be reverted manually (often forgotten)

The Landing Zone approach is always private - no exemptions needed.

Consequences

Positive

  • Fully policy compliant - No public endpoints ever
  • Shared infrastructure - One-time setup, many projects benefit
  • Consistent security - All projects inherit the same secure baseline
  • Cost efficient - Single Bastion (~$140/mo) serves all projects

Negative

  • Initial complexity - Landing Zone must be built first
  • Proxy dependency - Chisel App Service must be healthy before Terraform workflows run
  • VNet planning - Must allocate IP ranges carefully

ADR-002: Use OIDC instead of Service Principal Secrets Accepted
Status: Accepted
Date: 2025-01
Driver: Policy Required
Category: Security

Context

🏛️
Policy Driver: BC Gov security policy requires minimizing long-lived credentials and implementing credential rotation. The platform team provides rotating keys, but OIDC eliminates the need for a cron job to fetch and update credentials.

GitHub Actions workflows need to authenticate with Azure to deploy infrastructure via Terraform. We evaluated three options for credential management:

Option A: Static Secrets

  • Create Azure AD App Registration
  • Generate Client Secret
  • Store in GitHub Secrets
  • Rotate manually every 1-2 years

Not policy compliant - Long-lived credentials prohibited

Option B: Platform Rotating Keys

  • Platform team rotates keys every 2 days
  • Keys expire after 4 days
  • Requires cron job to fetch/update
  • Must sync to GitHub Secrets

Policy compliant - But adds operational overhead

Option C: OIDC Federation

  • Create Managed Identity
  • Configure GitHub OIDC trust
  • Token fetched in pipeline
  • No cron job needed

Policy compliant - Zero operational overhead

Decision

We chose Option C: OIDC Federated Credentials.

While the platform team's rotating key solution (Option B) is policy compliant, it requires maintaining a cron job to continuously fetch and update credentials. OIDC eliminates this operational burden - the bearer token is obtained directly within the GitHub Actions workflow at runtime, with no external synchronization required.
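A minimal Terraform sketch of the trust configuration (identity, resource group, and repo names are placeholders, not the platform's actual values):

```hcl
# Hypothetical sketch: user-assigned managed identity federated with GitHub Actions OIDC.
resource "azurerm_user_assigned_identity" "ci" {
  name                = "id-github-deployer"
  resource_group_name = azurerm_resource_group.tools.name
  location            = azurerm_resource_group.tools.location
}

resource "azurerm_federated_identity_credential" "github" {
  name                = "github-main-branch"
  resource_group_name = azurerm_resource_group.tools.name
  parent_id           = azurerm_user_assigned_identity.ci.id
  issuer              = "https://token.actions.githubusercontent.com"
  audience            = ["api://AzureADTokenExchange"]
  # Only workflows on this repo and branch can exchange a token for this identity
  subject             = "repo:my-org/ai-services-hub:ref:refs/heads/main"
}
```

The workflow then authenticates with only the client ID, tenant ID, and subscription ID; the short-lived bearer token is exchanged at runtime, so no secret is ever stored.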

Rationale

| Criteria | Static Secrets | Platform Rotating | OIDC |
|----------------------|------------------------------|------------------------------|----------------------------|
| Policy Compliant | No | Yes | Yes |
| Secret Management | Manual rotation | Cron job required | No secrets needed |
| Security Risk | Long-lived credential leak | 4-day window if compromised | ~10 min window |
| Operational Overhead | Annual rotation | Cron job maintenance | Set and forget |
| Token Lifetime | 1-2 years | 4 days max | ~10 minutes |
| Failure Mode | Expired secret breaks deploy | Cron failure breaks deploy | Self-contained in pipeline |
| Scope Control | Per application | Per application | Per repo/branch/env |

Consequences

Positive

  • Zero secrets to rotate - Eliminates credential management overhead
  • Reduced blast radius - Tokens valid for minutes, not years
  • Fine-grained access - Can restrict to specific branches/environments
  • Better audit trail - Every token exchange is logged with JWT claims
  • No secret sprawl - Secrets don't end up in logs, config files, or developer machines

Negative

  • More complex initial setup - Federated credential configuration is more involved
  • Newer technology - Less documentation and community examples available
  • GitHub dependency - Tightly coupled to GitHub's OIDC provider

Neutral

  • Requires understanding of JWT claims and subject matching
  • Debugging auth failures requires knowledge of OIDC flow

ADR-003: Optional Use of Azure Bastion for VM Access Accepted
Status: Accepted
Date: 2025-01
Driver: Policy Required
Category: Networking

Context

🏛️
Policy Driver: BC Gov policy prohibits public IP addresses on virtual machines. VMs must only be accessible through private networks. Azure Bastion provides compliant access without exposing VMs to the internet.
🌉
Bastion = The Public-to-Private Bridge
Azure Bastion is the one approved exception to the "no public endpoints" rule. It's an Azure-managed PaaS service with a public IP that provides secure browser-based access to private VMs. The key point: Bastion has the public IP, not the VMs themselves.

Public Endpoints in the Architecture

| Service | Has Public IP? | Why |
|--------------------------|-----------------|-----|
| Azure Bastion | Yes (exception) | Required for browser-based VM access. Azure-managed, AAD-authenticated. This is the public-to-private bridge for admin access. |
| App Gateway | Yes | Receives traffic from the public internet with WAF (Web Application Firewall). |
| APIM | No (internal) | Deployed in internal mode, sits behind App Gateway. No public exposure. |
| VMs (Jumpbox) | No | Private IPs only. Access via Bastion. |
| Storage, Key Vault, etc. | No | Private endpoints only. Public access disabled. |

Operators need secure access to jumpbox VMs for debugging and administration. The VMs cannot have public IPs per policy. We evaluated:

Option A: VPN Gateway

Point-to-site VPN for developer access

Option B: Public IP + NSG

Expose SSH/RDP with IP allowlisting

Option C: Azure Bastion

Browser-based RDP/SSH via Azure Portal

Decision

We chose Option C: Azure Bastion.

Rationale

| Criteria | VPN Gateway | Public IP | Bastion |
|---------------------|------------------------|----------------------|------------------------|
| Setup Complexity | High (certs, clients) | Low | Medium |
| Client Requirements | VPN client software | SSH/RDP client | Browser only |
| Attack Surface | VPN endpoint | High (exposed ports) | Minimal |
| Cost | ~$140/month | ~$5/month | ~$140/month (when on) |
| On-demand | No (always on) | Yes | Yes (can destroy) |

Consequences

Positive

  • No public IPs on VMs - VMs only have private IPs
  • No client software - Works from any browser
  • Azure AD integration - Uses existing identity
  • Session recording - Can enable for audit
  • Cost control - Can deploy/destroy on demand via workflow
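The deploy/destroy-on-demand pattern can be sketched in Terraform with a simple toggle variable (names below are illustrative, not the repo's actual code):

```hcl
# Hypothetical sketch: on-demand Bastion controlled by a boolean variable,
# so a workflow can create or destroy it to avoid the ~$140/month standing cost.
variable "deploy_bastion" {
  type    = bool
  default = false
}

resource "azurerm_public_ip" "bastion" {
  count               = var.deploy_bastion ? 1 : 0
  name                = "pip-bastion"
  resource_group_name = azurerm_resource_group.tools.name
  location            = azurerm_resource_group.tools.location
  allocation_method   = "Static"
  sku                 = "Standard" # Bastion requires a Standard SKU public IP
}

resource "azurerm_bastion_host" "main" {
  count               = var.deploy_bastion ? 1 : 0
  name                = "bastion-ai-hub"
  resource_group_name = azurerm_resource_group.tools.name
  location            = azurerm_resource_group.tools.location

  ip_configuration {
    name                 = "config"
    subnet_id            = azurerm_subnet.bastion.id # subnet must be named AzureBastionSubnet
    public_ip_address_id = azurerm_public_ip.bastion[0].id
  }
}
```

A workflow would flip `deploy_bastion` to `true` and apply when admin access is needed, then flip it back to destroy the host.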

Negative

  • Azure Portal dependency - Must use Azure UI or CLI
  • File transfer limitations - No native SCP/SFTP
  • Latency - Browser-based adds some lag

ADR-004: Private Endpoints for All Azure Services Accepted
Status: Accepted
Date: 2025-01
Driver: Policy Required
Category: Networking

Context

🏛️
Policy Driver: BC Gov security policy mandates that all Azure PaaS services must be accessed via Private Endpoints only. Public endpoints must be disabled. This ensures all traffic stays within the Azure backbone and private networks.

Azure PaaS services (Storage Accounts, Key Vaults, Databases, etc.) by default have public endpoints accessible from the internet. BC Gov policy requires these to be locked down.

Policy Requirements

What Policy Prohibits

  • Public endpoints on any Azure service
  • Key Vaults with public network access
  • Databases with public connectivity
  • Any service reachable without VNet integration

What Policy Requires

  • Private Endpoints for all PaaS services
  • Private DNS zones for name resolution (managed by Platform Services)
  • VNet integration for all access
  • Network Security Groups controlling traffic
  • "Deny public access" enabled on all services

Implementation

🔒
DNS Management: All private DNS zones are managed centrally by Platform Services. The AI Hub team does not manage DNS - we only create private endpoints that link to the existing DNS zones.
| Service | Private Endpoint | DNS Zone (Platform Services) |
|-----------------------------|-----------------------|---------------------------------|
| Key Vault (if used) | privateEndpoint-vault | privatelink.vaultcore.azure.net |
| Container Registry (if used) | privateEndpoint-acr | privatelink.azurecr.io |
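A Terraform sketch of this pattern, assuming a Key Vault resource already exists and the Platform Services DNS zone is referenced read-only via a data source (resource group and resource names are illustrative):

```hcl
# Hypothetical sketch: Key Vault behind a private endpoint, linked to the
# centrally managed privatelink.vaultcore.azure.net zone.
data "azurerm_private_dns_zone" "vault" {
  name                = "privatelink.vaultcore.azure.net"
  resource_group_name = "platform-services-dns-rg" # assumed; owned by Platform Services
}

resource "azurerm_private_endpoint" "vault" {
  name                = "privateEndpoint-vault"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "vault-connection"
    private_connection_resource_id = azurerm_key_vault.main.id
    subresource_names              = ["vault"]
    is_manual_connection           = false
  }

  # Link the endpoint to the existing zone so the Key Vault hostname
  # resolves to the private IP inside the VNet
  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [data.azurerm_private_dns_zone.vault.id]
  }
}
```

The Key Vault itself would also set `public_network_access_enabled = false` to satisfy the "deny public access" requirement above.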

Client Connectivity Model

ExpressRoute + App Gateway + APIM

BC Gov has ExpressRoute connectivity to Azure, but AI Hub clients will not access services directly via ExpressRoute. Instead:

  • App Gateway: Provides ingress, WAF protection, and SSL termination
  • APIM: API management layer for authentication, rate limiting, and routing
  • Private Endpoints: Backend services (AI models, storage) remain fully private

Client flow: Client → App Gateway → APIM → Private Endpoint → Azure Service

Consequences

Challenges

  • GitHub Actions cannot reach private endpoints directly - Requires the Chisel SOCKS tunnel or VNet-internal access (see ADR-001)
  • Local development complexity - Developers cannot access resources without VPN/Bastion
  • DNS resolution - Must configure private DNS zones correctly
  • Debugging difficulty - Cannot easily test from outside the network

Workarounds

  • Terraform State: Use the Chisel SOCKS tunnel (GitHub-hosted runner + Azure Proxy) to access the private storage endpoint. The use_azuread_auth = true setting enables Azure AD authentication for state access.
  • Development: Use Bastion + Jumpbox for all Azure resource access (see ADR-003)
  • CI/CD: Use the Chisel SOCKS tunnel for full private-endpoint access during Terraform runs (see ADR-001)
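The state backend configuration described in the first workaround looks roughly like this (resource group, storage account, and container names are placeholders):

```hcl
# Hypothetical sketch: remote state in a private storage account, authenticated
# via Azure AD (the OIDC token) rather than a storage account key.
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "staihubtfstate"
    container_name       = "tfstate"
    key                  = "landing-zone.tfstate"
    use_azuread_auth     = true # data-plane auth with Azure AD; no storage key needed
  }
}
```

With `ALL_PROXY` pointing at the Chisel tunnel, these state reads and writes traverse the storage account's private endpoint rather than the public internet.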

ADR-005: Zero-Dependency Documentation System Accepted
Status: Accepted
Date: 2025-12
Driver: Choice
Category: Documentation

Context

We needed a documentation site for the project. Options considered:

Static Site Generators

  • Jekyll (Ruby)
  • Hugo (Go)
  • Docusaurus (Node.js)
  • MkDocs (Python)

Custom Bash Build

  • Header/footer partials
  • Variable substitution
  • ~60 lines of shell script
  • Zero external dependencies

Decision

We built a custom Bash-based static site generator.

Rationale

  • Portability: Runs anywhere with Bash (Linux, Mac, WSL, CI)
  • No dependency management: No npm, gem, pip, go install required
  • Simplicity: Anyone can understand the 60-line build script
  • Speed: Builds in milliseconds
  • GitHub Pages native: No special plugins or build configurations
  • AI-friendly: HTML generation is trivial for AI assistants

Consequences

Positive

  • Zero build dependencies to maintain or update
  • Works in any environment without setup
  • Easy to understand and modify
  • No security vulnerabilities from npm packages

Negative

  • No built-in Markdown support (write HTML directly)
  • No automatic table of contents generation
  • No built-in search (added custom client-side solution)

Mitigations

  • Created template page with all components for easy copy-paste
  • AI assistants generate HTML as easily as Markdown
  • Added custom SVG viewer for diagrams
ADR-006: Terraform as Infrastructure as Code (IaC) Accepted
Status: Accepted
Date: 2025-12
Driver: Choice
Category: Infrastructure

Context

This repo needs a repeatable, auditable way to provision and update Azure infrastructure (networking, private endpoints, RBAC, diagnostics, and PaaS services) under BC Government policy constraints.

Given the Landing Zone design (private endpoints, limited portal use, and CI/CD execution from within the VNet), we need Infrastructure as Code that supports:

  • Idempotent, reviewable changes (pull requests as the change record)
  • Policy-driven patterns (private endpoints, NSGs, diagnostics settings)
  • Composable modules (prefer Azure Verified Modules where possible)
  • Automation via GitHub Actions using OIDC and Chisel SOCKS tunnel

Options Considered

Terraform (selected)

  • Large ecosystem and Azure provider support
  • Strong module approach (including AVM for Terraform)
  • Works well in CI/CD and supports multi-environment workflows

Alternatives

  • Bicep / ARM templates
  • Pulumi
  • Portal-based configuration (click-ops)

Decision

We use Terraform as the primary Infrastructure as Code (IaC) tool for this repo.

Rationale

  • Fits Landing Zone operations: Terraform runs cleanly on GitHub-hosted runners via the Chisel SOCKS tunnel, providing data-plane access to private endpoints when required.
  • Standardization via modules: We can prefer Azure Verified Modules (AVM) for consistent, policy-aligned deployments.
  • Auditable change control: Plans and applies can be gated by pull request review and CI checks.
  • Multi-environment support: Variables and modules make it straightforward to deploy dev/test/prod consistently.
  • Sustainability: Terraform's widespread adoption ensures long-term community and vendor support. Team members working across AWS, Azure, and OpenShift can use one IaC tool consistently across the stack.

Consequences

Positive

  • Repeatable deployments - Same inputs produce the same infrastructure
  • Versioned infrastructure - Git history becomes the change log
  • Policy-aligned defaults - Modules can encode private endpoint and logging patterns

Negative

  • Learning curve - Contributors must understand Terraform workflows
  • State management - Requires careful backend configuration and access controls
  • Upgrades - Provider/module version bumps require ongoing maintenance

Mitigations

  • Use pinned module versions and keep provider versions explicit
  • Use CI to run terraform fmt, terraform validate, and plans
  • Prefer AVM modules over custom resources where feasible
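The pinning practice in the first mitigation might look like the following (the module source follows the AVM naming convention, and the specific version numbers are illustrative):

```hcl
# Hypothetical sketch: explicit provider and module version pins.
terraform {
  required_version = ">= 1.9.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0" # allow patch/minor updates within 4.x only
    }
  }
}

module "key_vault" {
  # Azure Verified Module, pinned to an exact, tested release
  source  = "Azure/avm-res-keyvault-vault/azurerm"
  version = "0.9.1"
  # ...
}
```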

ADR-007: Client Connectivity via App Gateway + APIM Accepted
Status: Accepted
Date: 2025-12
Driver: Policy Required
Category: Networking

Context

🏛️
Platform Context: BC Gov Platform Services provides ExpressRoute connectivity to Azure for private, high-bandwidth access. ExpressRoute is available; it is not prohibited. However, for this multi-tenant AI Hub platform, we recommend all clients connect through App Gateway + APIM to ensure consistent security controls across all tenants.

BC Government Platform Services has established ExpressRoute connectivity between on-premises networks and Azure. For this AI Hub Landing Zone, the question arose: should ministry applications connect directly to AI services via ExpressRoute, or through a managed ingress layer?

Options Considered

Option A: Direct ExpressRoute Access

Clients connect directly to private endpoints via ExpressRoute.

  • Lowest latency (no middlemen)
  • Simpler architecture for single-tenant
  • ExpressRoute is provided by Platform Services

Case-by-Case: Available upon request with justification. Requires separate security review as it bypasses WAF, rate limiting, and centralized audit logging.

Option B: App Gateway + APIM (Recommended)

All traffic flows through App Gateway and APIM before reaching backends.

  • WAF protection at ingress
  • Centralized authentication
  • Rate limiting and quotas per tenant
  • Complete audit trail
  • Consistent multi-tenant isolation

Recommended: Standard path for all AI Hub clients

Decision

To keep the multi-tenant platform consistent and secure, we recommend that all clients connect through App Gateway → APIM → Private Endpoints.

ExpressRoute connectivity is provided by Platform Services and is available. However, to support consistent security controls, audit logging, and fair resource allocation across multiple ministries, we recommend the App Gateway + APIM path as the standard connectivity model.

💡
Direct ExpressRoute Access: If a client has a specific requirement for direct ExpressRoute access to backend services (e.g., extremely latency-sensitive workloads), this can be analyzed on a case-by-case basis. Such requests require justification and a separate security review.

Traffic Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                        CLIENT CONNECTIVITY MODEL                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   On-Premises                    │         Azure Landing Zone               │
│   (Ministry Apps)                │                                          │
│                                  │                                          │
│   ┌──────────────┐              │    ┌─────────────────────────────────┐   │
│   │ Ministry App │              │    │         App Gateway              │   │
│   │ (Health)     │──────────────────▶│  • SSL Termination              │   │
│   └──────────────┘   Express    │    │  • WAF Protection               │   │
│                      Route      │    │  • DDoS Mitigation              │   │
│   ┌──────────────┐              │    └──────────────┬──────────────────┘   │
│   │ Ministry App │              │                   │                       │
│   │ (SDPR)       │──────────────────▶               ▼                       │
│   └──────────────┘              │    ┌─────────────────────────────────┐   │
│                                 │    │            APIM                  │   │
│   ┌──────────────┐              │    │  • Authentication (API Keys)    │   │
│   │ Ministry App │              │    │  • Rate Limiting                │   │
│   │ (Justice)    │──────────────────▶│  • Request Validation           │   │
│   └──────────────┘              │    │  • Ministry Routing             │   │
│                                 │    │  • Audit Logging                │   │
│                                 │    └──────────────┬──────────────────┘   │
│                                 │                   │                       │
│                                 │                   ▼                       │
│                                 │    ┌─────────────────────────────────┐   │
│                                 │    │      Private Endpoints          │   │
│                                 │    │  • AI Foundry                   │   │
│                                 │    │  • Azure OpenAI                 │   │
│                                 │    │  • AI Search                    │   │
│                                 │    │  • Storage (RAG docs)           │   │
│                                 │    └─────────────────────────────────┘   │
│                                 │                                          │
└─────────────────────────────────────────────────────────────────────────────┘

Security Controls at Each Layer

| Layer | Security Control | Purpose |
|-------------------|---------------------------------|---------------------------------------------|
| App Gateway | Web Application Firewall (WAF) | Block OWASP Top 10, SQL injection, XSS |
| App Gateway | SSL/TLS Termination | Enforce HTTPS, manage certificates |
| App Gateway | DDoS Protection | Mitigate volumetric attacks |
| APIM | Subscription Keys | Identify and authenticate the ministry |
| APIM | Rate Limiting | Prevent abuse, ensure fair usage |
| APIM | Request Validation | Validate payload structure |
| APIM | Audit Logging | Track all requests with ministry context |
| Private Endpoints | Network Isolation | Backend services unreachable from internet |

Consequences

Positive

  • Defense in depth - Multiple security layers before reaching backends
  • Centralized policy - All clients subject to same rules
  • Audit compliance - Every request logged with full context
  • Flexibility - Can update policies without changing backends
  • Cost attribution - Can track usage per ministry via APIM metrics

Negative

  • Added latency - Two extra hops (App Gateway + APIM)
  • Cost - App Gateway and APIM have significant monthly costs
  • Complexity - More components to configure and maintain

Mitigations

  • Latency is typically <10ms additional per hop
  • Costs are shared across all ministries (per ADR-010)
  • Infrastructure-as-code manages complexity

ADR-008: No Azure Portal or Foundry Studio UI Access Pending Platform/MS
Status: Pending Platform/MS
Date: 2025-12
Driver: Policy Required
Category: Operations

Context

🏛️
Policy Driver: BC Gov requires all Azure services to use private endpoints only, with no public network access. Azure Portal, AI Foundry Studio, and other browser-based management tools require public endpoints to function. This creates a fundamental incompatibility.

Microsoft's standard approach to AI Landing Zones assumes users will manage resources through:

  • Azure Portal (portal.azure.com)
  • AI Foundry Studio (ai.azure.com)
  • Azure Machine Learning Studio

All of these require public endpoint access to the Azure services being managed.

The Problem: UI Requires Public Endpoints

User Browser (Public Internet)
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│  ai.azure.com / portal.azure.com (Public Endpoint)              │
│                                                                  │
│  "To manage your AI Foundry project, we need to reach           │
│   your Azure resources over the public internet"                │
│                                                                  │
│        │                                                        │
│        ▼                                                        │
│  ┌─────────────────────────────────────────────────┐           │
│  │  Your AI Foundry / Storage / Search             │           │
│  │                                                  │           │
│  │  ❌ PUBLIC ENDPOINT REQUIRED                    │           │
│  │     BC Gov Policy: DENIED                       │           │
│  └─────────────────────────────────────────────────┘           │
└─────────────────────────────────────────────────────────────────┘

Decision

No browser-based UI access is supported for tenant management.

All resource provisioning and management must occur through:

  • Terraform/AVM modules - Infrastructure as Code
  • Azure CLI - Via Chisel tunnel (CI/CD) or Jumpbox (admin only)
  • REST APIs - Via APIM with subscription keys

What This Means for Tenants

👁️
Important Clarification: Tenants CAN navigate to Azure Portal and AI Foundry Studio. They can see their resources, browse the UI, and view configurations. However, operations that require the UI to communicate with Azure services will fail because those services have no public endpoints. The UI is read-only at best, non-functional at worst.

Can Do (View Only)

  • Browse to portal.azure.com
  • See resource groups and resources
  • View configurations (read-only)
  • Navigate AI Foundry Studio UI
  • See project structure

Cannot Do (UI Blocked)

  • Create/modify resources via Portal
  • Upload documents in Foundry Studio
  • Test models in Foundry playground
  • Configure settings via web forms
  • Any operation requiring service connection

Supported Methods

  • Request resources via Terraform PR
  • Upload documents via API (APIM)
  • Query AI models via API (APIM)
  • Automate via CLI in pipelines
  • Use SDK from within VNet

Why Does This Happen?

When you click "Upload Document" in Foundry Studio, the browser (on public internet) tries to connect directly to your Storage Account. But your Storage Account has no public endpoint - it only accepts connections from within the VNet via Private Endpoint.

Browser (Public) → Storage Account (Private Only) = ❌ Connection Refused
Pipeline (VNet)  → Storage Account (Private EP)   = ✅ Works
    

The UI shows the resource exists, but can't actually interact with it.

Why Not Provide Bastion Access to Everyone?

This was considered and rejected:

Approach Problem
Bastion + VM per tenant Not scalable (20 ministries = 20 VMs = $$$), security nightmare
Shared Jumpbox for all Multi-tenant isolation violated, credential management chaos
VPN per tenant Massive operational overhead, not self-service

Result: Bastion/Jumpbox is for platform team administration only. Tenants interact via API.

AVM Module Requirement

Because all management must be IaC-based, only services with Azure Verified Modules (AVM) are supported.

Supported AVM Modules

AI Services:
  • cognitiveservices-account
  • machinelearningservices-workspace
  • search-searchservice
Data:
  • storage-storageaccount
  • documentdb-databaseaccount
  • keyvault-vault
Compute:
  • containerservice-managedcluster
  • containerregistry-registry
  • apimanagement-service

Full catalog: azure.github.io/Azure-Verified-Modules

Path Forward: Secure UI Access

⚠️
Open Question: A secure and scalable way to access browser-based tooling — including Document Intelligence Studio, AI Foundry Portal, and other Azure service UIs — needs to be investigated in collaboration with the BC Gov Azure Platform Services team. Current constraints leave tenants without an interactive management interface. A solution is needed at the platform level before this ADR can be fully closed.

Consequences

Positive

  • Policy compliant - No public endpoints ever exposed
  • Reproducible - All infrastructure is code, auditable, version controlled
  • Scalable - Onboarding 100 tenants uses the same process as onboarding one
  • Secure - No browser-based attack surface

Negative

  • Steeper learning curve - Tenants must use IaC, not click-ops
  • No visual management - Can't "see" resources in Portal
  • Debugging harder - Must use CLI/API, not browser dev tools
  • Microsoft disconnect - Their guidance assumes UI access

References

ADR-009: Why AI Landing Zone vs Custom Solution Accepted
Status: Accepted
Date: 2025-12
Driver: Choice
Category: Architecture

Context

A valid question arises: If we're customizing everything for BC Gov requirements anyway, why use Microsoft's AI Landing Zone at all? Why not just build our own custom solution from scratch?

This ADR explains what value the AI Landing Zone and Azure Verified Modules (AVM) actually provide, even when we can't use Microsoft's default assumptions.

What AI Landing Zone Actually Provides

💡
Key Insight: We're not using Microsoft's AI Landing Zone as a turnkey solution. We're using it as a reference architecture and leveraging the Azure Verified Modules (AVM) it's built on. The value is in the tested, maintained, modular components - not the out-of-box configuration.

The Real Value: AVM Modules

Building From Scratch

If we wrote our own Terraform modules:

  • Write 1000+ lines of Terraform per service
  • Handle every Azure API change ourselves
  • Debug edge cases Microsoft already solved
  • Maintain security patches ourselves
  • No community validation or review
  • Reinvent private endpoint patterns
  • Figure out RBAC configurations
  • Test against every Azure region

Using AVM Modules wherever possible

With Azure Verified Modules:

  • ~50 lines of Terraform to deploy a service
  • Microsoft maintains API compatibility
  • Edge cases handled by module maintainers
  • Security updates pushed automatically
  • Community tested, Microsoft validated
  • Private endpoint patterns built-in
  • RBAC best practices included
  • Tested across all Azure regions

What We Get From AI Landing Zone Reference

Component What Microsoft Provides What We Customize
Network Topology Hub-spoke pattern, subnet sizing guidance, NSG rule templates IP ranges, Canada regions, Platform Services DNS integration
AI Foundry Setup AVM module for workspace, project structure, compute patterns Disable public access, multi-tenant project isolation
Private Endpoints Patterns for connecting services privately, DNS zone integration Link to Platform Services DNS, IP allocation per tenant
APIM Integration AVM module, backend pool patterns, policy templates Subscription per tenant, OpenAPI routing, rate limits
Security Baseline RBAC templates, managed identity patterns, Key Vault integration BC Gov RBAC requirements, ministry-level isolation

AVM Module Maturity - Honest Assessment

⚠️
Important: Not all AVM modules are equally mature. AI Foundry modules are newer and still evolving. We must be realistic about what's production-ready vs what's in development.
Module Status Maturity Notes
avm-res-storage-storageaccount Released Production Ready Mature, well-tested
avm-res-keyvault-vault Released Production Ready Mature, well-tested
avm-res-network-virtualnetwork Released Production Ready Mature, well-tested
avm-res-network-applicationgateway Released Production Ready Mature, well-tested
avm-res-network-bastionhost Released Production Ready Mature, well-tested
avm-res-documentdb-databaseaccount Released Production Ready Cosmos DB - mature
avm-res-apimanagement-service Pending Early Release v0.0.5 - may need custom work
avm-res-cognitiveservices-account Pending Maturing OpenAI, Doc Intel - verify features
avm-res-search-searchservice Pending Maturing AI Search - verify private EP support
avm-res-machinelearningservices-workspace Pending Maturing Core resource for Foundry Hub/Project
avm-ptn-aiml-ai-foundry In Development Not Production Ready Pattern module - active development
avm-ptn-ai-foundry-enterprise Archived Abandoned Was archived July 2025 - do not use
avm-res-containerservice-managedcluster Pending Pre-release AKS - v0.4.0-pre2

What This Means

Safe to Use AVM

  • Virtual Networks, Subnets, NSGs
  • Storage Accounts
  • Key Vault
  • Bastion Host
  • App Gateway
  • Cosmos DB

Evaluate / May Need Custom

  • AI Foundry (use resource module, not pattern)
  • APIM (early version, test thoroughly)
  • Cognitive Services (verify private EP)
  • AI Search (verify features needed)
  • AKS (pre-release)
💡
Our Strategy: Use mature AVM modules for infrastructure (networking, storage, security). For AI Foundry, start with the core machinelearningservices-workspace resource module rather than the pattern modules. Be prepared to write custom Terraform for AI-specific configurations that AVM doesn't yet support.
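
As a sketch of that strategy, a hub workspace deployed via the core resource module might look like the following. The input names and version pin are assumptions - verify them against the module's published variables, since this module is still maturing (see the table above).

```hcl
# Sketch only: verify inputs against the module's documentation before use.
module "ai_foundry_hub" {
  source  = "Azure/avm-res-machinelearningservices-workspace/azurerm"
  version = "~> 0.1" # pin a verified release; module is still maturing

  name                = "aihub-workspace" # illustrative name
  location            = "canadacentral"
  resource_group_name = azurerm_resource_group.hub.name

  # BC Gov policy: no public endpoints (input name is an assumption)
  public_network_access_enabled = false

  private_endpoints = {
    workspace = { subnet_id = module.network.subnet_ids["private-endpoints"] }
  }
}
```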

Concrete Example: Storage Account

Without AVM (Custom from scratch)

# ~200 lines of Terraform to handle:
resource "azurerm_storage_account" "main" { ... }
resource "azurerm_storage_account_network_rules" "main" { ... }
resource "azurerm_private_endpoint" "blob" { ... }
resource "azurerm_private_endpoint" "file" { ... }
resource "azurerm_private_endpoint" "queue" { ... }
resource "azurerm_private_dns_zone_virtual_network_link" "blob" { ... }
resource "azurerm_role_assignment" "contributor" { ... }
resource "azurerm_role_assignment" "reader" { ... }
resource "azurerm_monitor_diagnostic_setting" "main" { ... }
# Plus encryption, lifecycle policies, soft delete, versioning...
# Plus handling for every Azure API version change...
    

With AVM Module

module "storage" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.6.7"

  name                = "ministryhealthstorage"
  resource_group_name = azurerm_resource_group.ministry.name
  location            = "canadacentral"

  # Private endpoints - one line enables the pattern
  private_endpoints = {
    blob = { subnet_id = module.network.subnet_ids["private-endpoints"] }
  }

  # All the security, RBAC, monitoring handled by module
  tags = var.common_tags
}
    

Result: Same outcome, 10x less code, maintained by Microsoft.

Why Not 100% Custom?

Custom = Maintenance Burden

  • Azure releases ~100 API changes/month
  • Each change could break custom modules
  • Security vulnerabilities need patching
  • New features require manual implementation
  • Team turnover = knowledge loss
  • Who maintains this in 3 years?

AVM = Shared Maintenance

  • Microsoft + community maintain modules
  • API changes handled upstream
  • Security patches published as versions
  • New features arrive with module version updates
  • Documentation maintained centrally
  • Sustainable long-term

What We're Actually Doing

Our Approach: AVM + BC Gov Customization Layer

┌─────────────────────────────────────────────────────────────────────┐
│                    BC Gov AI Hub Architecture                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │  BC Gov Customization Layer (Our Code)                      │   │
│   │  • Multi-tenant resource group structure                    │   │
│   │  • APIM subscription per ministry                           │   │
│   │  • IP allocation policies                                   │   │
│   │  • Platform Services DNS integration                        │   │
│   │  • Canada data residency enforcement                        │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │  Azure Verified Modules (Microsoft Maintained)              │   │
│   │  • avm-res-cognitiveservices-account                        │   │
│   │  • avm-res-machinelearningservices-workspace                │   │
│   │  • avm-res-storage-storageaccount                           │   │
│   │  • avm-res-network-virtualnetwork                           │   │
│   │  • avm-res-apimanagement-service                            │   │
│   │  • ... 100+ more modules                                    │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │  Azure Resource Manager APIs                                │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
    

We write the thin customization layer. Microsoft maintains the heavy lifting.

Decision

We use AI Landing Zone reference architecture and AVM modules as our foundation, with a BC Gov customization layer on top.

We do NOT use Microsoft's default configuration. We use their:

  • Reference patterns - How to wire services together
  • AVM modules - Tested, maintained infrastructure components
  • Best practices - Security, networking, RBAC patterns

Consequences

Positive

  • Reduced maintenance - Microsoft and the AVM community maintain the bulk of the module code
  • Faster development - Use proven patterns instead of inventing
  • Security updates - Get patches via module version bumps
  • Community support - Issues/bugs reported and fixed by others
  • Audit trail - Using "official" modules helps with compliance

Negative

  • Module constraints - Can only do what AVM modules support
  • Version management - Must track and update module versions
  • Abstraction leakage - Sometimes need to understand module internals

References

ADR-010: Multi-Tenant Isolation Model Accepted
Status: Accepted
Date: 2025-12
Driver: Policy Required
Category: Architecture

Context

🏛️
Policy Driver: BC Gov requires strict data isolation between ministries. Each ministry's data must be logically separated, with access controls preventing cross-ministry data exposure. The AI Hub serves multiple BC Government tenants (WLRS, SDPR, NR-DAP, etc.) from shared infrastructure.

The AI Hub Landing Zone is designed to serve multiple BC Government ministries from a single shared infrastructure deployment. This creates a multi-tenancy challenge: how do we provide cost-efficient shared services while ensuring strict data isolation between ministries?

The Problem: Shared Infrastructure, Isolated Data

What We Cannot Do

  • Allow Ministry A to access Ministry B's documents
  • Share AI search indexes across ministries
  • Use single storage accounts for all ministry data
  • Allow cross-ministry API access without authorization

What We Can Share

  • Network infrastructure (VNets, Bastion, NSGs)
  • Compute resources (AI Foundry Hub)
  • Ingress layer (App Gateway, APIM)
  • Monitoring and logging infrastructure

Decision

We implement a four-layer isolation model:

1. Storage Isolation

Separate storage accounts per ministry

  • Each ministry gets dedicated blob containers
  • Azure RBAC restricts access to ministry principals
  • Private endpoints per storage account
  • Encryption keys can be ministry-specific (CMK)
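
Following the AVM storage example used elsewhere in these ADRs, per-ministry storage isolation might be sketched like this. The ministry list, principal IDs, and resource group layout are illustrative assumptions.

```hcl
# Sketch of per-ministry storage isolation; names are illustrative.
variable "ministries" {
  type    = set(string)
  default = ["health", "sdpr", "justice"]
}

module "ministry_storage" {
  source   = "Azure/avm-res-storage-storageaccount/azurerm"
  version  = "0.6.7"
  for_each = var.ministries

  name                = "ministry${each.key}storage"
  resource_group_name = azurerm_resource_group.ministry[each.key].name
  location            = "canadacentral"

  # Dedicated private endpoint per ministry account
  private_endpoints = {
    blob = { subnet_id = module.network.subnet_ids["private-endpoints"] }
  }
}

# RBAC scoped to the ministry's own account only - no cross-ministry access.
resource "azurerm_role_assignment" "ministry_blob" {
  for_each             = var.ministries
  scope                = module.ministry_storage[each.key].resource_id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.ministry_principal_ids[each.key] # assumed input
}
```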

2. AI Search Index Isolation

Separate search indexes per ministry

  • Each ministry's documents indexed separately
  • RAG queries scoped to ministry index only
  • No cross-index queries permitted
  • Index-level access control via Azure RBAC

3. API Isolation (APIM)

APIM subscriptions per ministry

  • Unique subscription keys per ministry
  • Rate limiting scoped to subscription
  • API policies enforce ministry context
  • Audit logging includes ministry identifier
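
A minimal sketch of the per-ministry subscription pattern, assuming one shared APIM instance (the ministry list and resource names are illustrative):

```hcl
# Sketch: one APIM subscription (and key pair) per ministry.
resource "azurerm_api_management_subscription" "ministry" {
  for_each            = toset(["health", "sdpr", "justice"]) # illustrative
  api_management_name = azurerm_api_management.hub.name
  resource_group_name = azurerm_resource_group.hub.name
  display_name        = "ministry-${each.key}"
  state               = "active"
  allow_tracing       = false # avoid exposing request traces across tenants
}
```

Each subscription carries its own keys, so rate limits, policies, and audit logs can all be attributed to the owning ministry.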

4. Network Isolation (NSGs)

Network policies enforce boundaries

  • NSG rules restrict subnet-to-subnet traffic
  • Private endpoints isolated by ministry where needed
  • No direct cross-ministry network paths
  • All traffic routed through controlled ingress
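
One of the NSG boundary rules might be sketched as below; the subnet CIDRs, priority, and names are illustrative assumptions, not the deployed rule set.

```hcl
# Sketch: deny direct traffic between ministry private-endpoint subnets.
resource "azurerm_network_security_rule" "deny_cross_ministry" {
  name                        = "deny-cross-ministry"
  priority                    = 200
  direction                   = "Inbound"
  access                      = "Deny"
  protocol                    = "*"
  source_port_range           = "*"
  destination_port_range      = "*"
  source_address_prefix       = "10.0.20.0/24" # other ministry subnet (assumed)
  destination_address_prefix  = "10.0.10.0/24" # this ministry subnet (assumed)
  resource_group_name         = azurerm_resource_group.network.name
  network_security_group_name = azurerm_network_security_group.ministry.name
}
```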

Implementation Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                      MULTI-TENANT ISOLATION MODEL                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Ministry Health        Ministry SDPR         Ministry Justice             │
│   ┌─────────────┐       ┌─────────────┐       ┌─────────────┐              │
│   │ APIM Sub: H │       │ APIM Sub: S │       │ APIM Sub: J │              │
│   └──────┬──────┘       └──────┬──────┘       └──────┬──────┘              │
│          │                     │                     │                      │
│          ▼                     ▼                     ▼                      │
│   ┌─────────────────────────────────────────────────────────────┐          │
│   │              Shared APIM (Policy Enforcement)                │          │
│   │         (validates subscription → routes to ministry)        │          │
│   └─────────────────────────────────────────────────────────────┘          │
│          │                     │                     │                      │
│          ▼                     ▼                     ▼                      │
│   ┌─────────────┐       ┌─────────────┐       ┌─────────────┐              │
│   │ Storage: H  │       │ Storage: S  │       │ Storage: J  │              │
│   │ Index: H    │       │ Index: S    │       │ Index: J    │              │
│   └─────────────┘       └─────────────┘       └─────────────┘              │
│                                                                              │
│   ════════════════════════════════════════════════════════════════          │
│                    Shared Infrastructure Layer                               │
│        (VNet, Bastion, AI Foundry Compute, Monitoring)                      │
│   ════════════════════════════════════════════════════════════════          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Consequences

Positive

  • Strong data isolation - Ministry data never co-mingled
  • Cost efficiency - Shared compute and network infrastructure
  • Audit compliance - Clear ministry attribution in all logs
  • Scalable onboarding - New ministries get isolated resources automatically
  • Flexible isolation levels - Can increase isolation (dedicated compute) if needed

Negative

  • Resource multiplication - Each ministry needs separate storage/indexes
  • Complexity - More resources to manage and monitor
  • Cost per ministry - Base cost increases with each ministry onboarded

Cost Tracking

Multi-tenant isolation also enables per-tenant cost tracking and chargeback. See the detailed cost allocation strategy:

Multi-Tenant Cost Tracking Architecture


References

ADR-011: Control Plane vs Data Plane Access & Chisel Tunnel Pending
Status: Pending
Date: 2026-01
Driver: Policy Required
Category: Architecture

This ADR explains a fundamental Azure concept that drives many architecture decisions: the difference between Control Plane and Data Plane access, why OIDC works for one but not the other, and how Chisel provides data plane access for platform maintainers.

Context

Azure services have two distinct access paths that are often confused:

Control Plane (ARM APIs)

What: Managing Azure resources - create, update, delete, configure

Endpoint: management.azure.com (always public)

Authentication: OIDC tokens, Service Principals, Managed Identity

Examples:

  • Creating a Key Vault
  • Configuring a Storage Account
  • Setting up Private Endpoints
  • RBAC role assignments
  • Azure Portal UI (viewing resources)

Data Plane (Service-specific APIs)

What: Accessing data inside resources

Endpoint: *.vault.azure.net, *.blob.core.windows.net, etc.

Authentication: The same Azure AD tokens as the control plane, but the request must also reach the service over the network

Examples:

  • Reading/writing Key Vault secrets
  • Reading/writing Storage blobs
  • Querying databases (PostgreSQL, CosmosDB)
  • Calling Azure OpenAI APIs
  • Terraform state file read/write

The Problem: Private Endpoints Block Data Plane

🔒
BC Gov Policy: All Azure services must use private endpoints. This blocks data plane access from the public internet - even with valid OIDC credentials!
┌─────────────────────────────────────────────────────────────────────────────┐
│                    WHAT WORKS vs WHAT'S BLOCKED                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   From Public Internet (GitHub Actions, Azure Portal, Your Laptop):          │
│                                                                              │
│   ✅ CONTROL PLANE (management.azure.com)                                    │
│      • Create Key Vault                    → ARM API → Works                 │
│      • Create Storage Account              → ARM API → Works                 │
│      • View resource properties in Portal  → ARM API → Works                 │
│      • OIDC authentication                 → Azure AD → Works                │
│                                                                              │
│   ❌ DATA PLANE (*.vault.azure.net, *.blob.core.windows.net)                 │
│      • Read Key Vault secret               → Private Endpoint → BLOCKED      │
│      • Write Storage blob                  → Private Endpoint → BLOCKED      │
│      • "View Secret Value" in Portal       → Private Endpoint → BLOCKED      │
│      • Terraform state read/write          → Private Endpoint → BLOCKED      │
│                                                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   From Inside VNet (Chisel Tunnel, Jumpbox, or Optional Self-hosted Runners):│
│                                                                              │
│   ✅ CONTROL PLANE → Still works (ARM is public)                             │
│   ✅ DATA PLANE    → Works via Private Endpoints (network path exists)       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Why Azure Portal Shows Resources But Not Data

The Azure Portal is a Control Plane UI. When you browse to a Key Vault in the portal:

  • ✅ You can see the vault exists (control plane: list resources)
  • ✅ You can see its configuration (control plane: read properties)
  • ❌ You CANNOT click "Show Secret Value" (data plane: blocked by private endpoint)

This is why ADR-008 states "No Azure Portal or Foundry Studio UI Access" for data operations - the portal physically cannot reach private data plane endpoints from your browser.

Why OIDC Is Used for Control Plane

OIDC (OpenID Connect) federation provides passwordless authentication from GitHub Actions to Azure:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        OIDC AUTHENTICATION FLOW                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   1. GitHub Actions workflow starts                                          │
│           ↓                                                                  │
│   2. GitHub OIDC Provider issues JWT token                                   │
│      (claims: repo, branch, environment, workflow)                           │
│           ↓                                                                  │
│   3. Token sent to Azure AD                                                  │
│           ↓                                                                  │
│   4. Azure AD validates against Federated Credential                         │
│      (checks: issuer=GitHub, subject=repo:bcgov/ai-hub-tracking:...)        │
│           ↓                                                                  │
│   5. Azure AD issues Access Token (valid ~1 hour)                            │
│           ↓                                                                  │
│   6. Access Token used for ARM API calls (Control Plane)                     │
│           ↓                                                                  │
│   ✅ terraform plan/apply (resource management)                              │
│   ✅ az cli commands (control plane operations)                              │
│   ✅ No secrets stored in GitHub!                                            │
│                                                                              │
│   BUT: OIDC gives credentials, NOT network access.                           │
│        Data plane still blocked without VNet connectivity.                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Decision: Multiple Access Methods for Different Needs

We provide four toggleable access mechanisms via Terraform, each serving a different use case:

Access Method Terraform Toggle Who Uses It Use Case Plane Access
Self-Hosted Runners github_runners_aca_enabled Other tenant repos Optional: persistent VNet compute for CI/CD workloads that can't use Chisel Control + Data
Bastion + Jumpbox enable_bastion, enable_jumpbox Platform Maintainers Emergency debugging, manual admin tasks Control + Data
Chisel Tunnel enable_azure_proxy Platform Maintainers Local dev access to private databases/APIs Control + Data
Public GitHub Runners (no toggle - default) CI/CD (limited) Control-plane-only operations Control only
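
Using the toggle variables from the table above, a deployment's access posture is set in its tfvars. The values shown here are illustrative:

```hcl
# Example tfvars: enable only the access paths you need.
github_runners_aca_enabled = false # no tenant CI/CD runners yet
enable_bastion             = true  # platform admin emergency access
enable_jumpbox             = true
enable_azure_proxy         = true  # Chisel tunnel for local data plane work
```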

Chisel Tunnel: Data Plane Access for Platform Maintainers

🔧
Chisel is for Platform Maintainers ONLY - not for tenant/project developers. Tenants access AI services through APIM/App Gateway (the designed public entry point).

What is Chisel? A fast TCP/UDP tunnel over HTTP that creates a secure reverse proxy from your local machine into the Azure VNet.

┌─────────────────────────────────────────────────────────────────────────────┐
│                        CHISEL TUNNEL ARCHITECTURE                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Platform Maintainer's Laptop                                               │
│   ┌─────────────────────────────┐                                            │
│   │  Chisel Client (Docker)     │                                            │
│   │  Listens on localhost:5432  │                                            │
│   └──────────────┬──────────────┘                                            │
│                  │ HTTPS (encrypted)                                         │
│                  ▼                                                            │
│   ┌─────────────────────────────────────────────────────────────────────┐    │
│   │                    Azure App Service (Chisel Server)                │    │
│   │                    Inside VNet (app-service-subnet)                 │    │
│   │                    Random auth: tunnel:XXXXXXXX                     │    │
│   └──────────────┬──────────────────────────────────────────────────────┘    │
│                  │ VNet Integration (Private Network)                        │
│                  ▼                                                            │
│   ┌─────────────────────────────────────────────────────────────────────┐    │
│   │                    Private Endpoints                                │    │
│   │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │    │
│   │  │  PostgreSQL  │  │   Key Vault  │  │   CosmosDB   │              │    │
│   │  │  :5432       │  │   :443       │  │   :443       │              │    │
│   │  └──────────────┘  └──────────────┘  └──────────────┘              │    │
│   └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│   Result: psql -h localhost -p 5432 → tunnels to private PostgreSQL          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

What Terraform Operations Need Data Plane?

Most Terraform operations are control plane only. Data plane access is needed only for operations like these:

Resource/Operation Plane Works from Public? Example
azurerm_key_vault (create) Control ✅ Yes Creating the vault itself
azurerm_key_vault_secret (write) Data ❌ No Writing secrets INTO the vault
data "azurerm_key_vault_secret" Data ❌ No Reading secrets FROM the vault
Terraform state backend (Storage) Data ❌ No Reading/writing .tfstate blob
azurerm_storage_account (create) Control ✅ Yes Creating the account
azurerm_storage_blob (upload) Data ❌ No Uploading files to storage
azurerm_private_endpoint Control ✅ Yes Creating the private endpoint
RBAC role assignments Control ✅ Yes Granting permissions
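
The split in the table above can be seen in a single Terraform sketch (resource names are illustrative; `random_password.db` is an assumed helper resource):

```hcl
# Control plane: creates the vault via ARM (management.azure.com).
# Works from a public GitHub runner with OIDC credentials alone.
resource "azurerm_key_vault" "main" {
  name                          = "ai-hub-kv" # illustrative name
  location                      = "canadacentral"
  resource_group_name           = azurerm_resource_group.hub.name
  tenant_id                     = data.azurerm_client_config.current.tenant_id
  sku_name                      = "standard"
  public_network_access_enabled = false
}

# Data plane: writes INTO the vault via its *.vault.azure.net endpoint.
# Fails from the public internet because only the private endpoint answers;
# must run from inside the VNet (Chisel tunnel, jumpbox, self-hosted runner).
resource "azurerm_key_vault_secret" "db_password" {
  name         = "db-password"
  value        = random_password.db.result
  key_vault_id = azurerm_key_vault.main.id
}
```

The same apply can therefore half-succeed from a public runner: the vault is created, then the secret write times out against the private endpoint.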

Access Model Summary

┌─────────────────────────────────────────────────────────────────────────────┐
│                        COMPLETE ACCESS MODEL                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TENANT DEVELOPERS (Ministry Teams)                                          │
│  ┌─────────────┐      ┌──────────────┐      ┌─────────────────┐             │
│  │ Their Apps  │─────▶│ App Gateway  │─────▶│ APIM (rate      │─────▶ AI    │
│  │             │      │ + WAF        │      │ limited, metered)│      APIs   │
│  └─────────────┘      └──────────────┘      └─────────────────┘             │
│                              ▲                                               │
│                              │ Public endpoint (by design)                   │
│                              │ No direct private endpoint access             │
│                                                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PLATFORM MAINTAINERS (This Team)                                            │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │ Method              │ Toggle Variable              │ Use Case          │ │
│  ├────────────────────────────────────────────────────────────────────────┤ │
│  │ Self-hosted Runners │ github_runners_aca_enabled   │ Tenant CI/CD (opt)│ │
│  │ Bastion + Jumpbox   │ enable_bastion/jumpbox       │ Admin access      │ │
│  │ Chisel Tunnel       │ enable_azure_proxy           │ Local dev access  │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  All three provide: Control Plane + Data Plane access                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Consequences

Positive

  • Clear mental model - Understanding control vs data plane explains the "why" behind many decisions
  • Flexible access - Enable only what you need (cost optimization)
  • Policy compliant - No public data plane endpoints ever
  • Secure tenant isolation - Tenants use APIM, never touch private endpoints directly

Negative

  • Complexity - Must understand two planes, not just "Azure access"
  • Chisel proxy dependency - Terraform workflows require the Azure Proxy App Service to be healthy before running
  • Portal limitations - Can't "see" data in Portal even with permissions


ADR-012: Hybrid Cost Allocation & Usage Monitoring Strategy
Status: Proposed
Date: 2026-01
Deciders: Platform Team
Category: Observability

Context

The AI Services Hub operates as a multi-tenant platform serving multiple ministries. This shared infrastructure creates a financial governance challenge: we must accurately allocate costs to specific tenants to ensure accountability and cost recovery.

The architecture includes two types of resources:

  • Tenant-Dedicated: Resources used exclusively by one tenant (e.g., Azure OpenAI instances, Document Intelligence, Cosmos DB).
  • Shared Infrastructure: Resources serving all tenants simultaneously (e.g., APIM, App Gateway, Application Insights).

We need a standardized strategy to track consumption and generate accurate chargeback invoices that account for both resource types across our dual-region deployment (Canada Central & East).

Decision

We will adopt a Hybrid Cost Allocation Model that combines native Azure billing with custom usage tracking:

  1. Direct Attribution (90% of costs): We will isolate high-cost resources (AI Foundry Projects, Document Intelligence) into tenant-specific resource groups. These will be billed directly to tenants using Azure Cost Management tags (tenant-id), requiring no manual calculation.
  2. Proportional Allocation (10% of costs): We will split the cost of shared infrastructure (APIM, App Gateway) based on actual usage metrics.
    • APIM & Gateway costs split by API Request Volume.
    • Monitoring costs split by Log Ingestion Volume.
  3. Centralized Tracking via APIM: Azure API Management will serve as the single source of truth for usage metrics. All traffic must flow through APIM to ensure consistent tenant identification and logging.
  4. Custom Egress Calculation: We will implement a custom calculation pipeline using App Gateway logs to track and charge for cross-region network egress (between Canada Central and Canada East).
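
The proportional-allocation step reduces to a ratio of per-tenant usage to total usage. The sketch below is illustrative only; the function name and structure are hypothetical, not the platform's actual billing code:

```python
def allocate_shared_cost(shared_cost: float, usage_by_tenant: dict[str, int]) -> dict[str, float]:
    """Split a shared infrastructure cost (e.g., APIM, App Gateway) across
    tenants in proportion to a usage metric such as API request volume
    or log ingestion bytes."""
    total = sum(usage_by_tenant.values())
    if total == 0:
        # No recorded usage this period: split evenly rather than divide by zero.
        even = shared_cost / len(usage_by_tenant)
        return {tenant: even for tenant in usage_by_tenant}
    return {tenant: shared_cost * usage / total
            for tenant, usage in usage_by_tenant.items()}

# Example: a $700 shared APIM bill split by monthly request counts.
shares = allocate_shared_cost(700.0, {"ministry-a": 90_000, "ministry-b": 10_000})
# ministry-a is allocated $630.00, ministry-b $70.00
```

The same function applies unchanged to monitoring costs by substituting log ingestion volume for request volume.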

Rationale

  • Minimizes Operational Overhead: By using Direct Attribution for the most expensive resources (OpenAI tokens, Search), we rely on Azure's native billing engine for the vast majority of chargebacks. We only maintain custom logic for the shared platform components.
  • Fairness: A flat fee for shared infrastructure would be unfair to smaller tenants. Proportional allocation based on request volume ensures ministries only pay for the platform capacity they actually consume.
  • Granular Visibility: Using APIM as the central logging point allows us to capture operational metrics (latency, errors, token usage) alongside billing data without adding sidecar proxies to every service.

Consequences

Positive

  • Transparency: Tenants can verify their direct Azure costs in the portal using their tenant tag.
  • Scalability: New tenants can be onboarded simply by adding tags; the cost model adjusts automatically.
  • Cost Recovery: Ensures the Platform Team fully recovers infrastructure costs rather than absorbing the overhead of shared services.

Negative

  • Complexity of Egress: Cross-region data transfer is billed at the subscription level and is difficult to attribute. We accept the tradeoff of maintaining a custom Kusto query to calculate this specific cost.
  • Maintenance: The logic for splitting shared costs (Python functions/Kusto queries) is custom code that must be maintained and verified monthly against Azure invoices.
ADR-013: Scaled Stack Architecture with Isolated State Files
Status: Accepted
Date: 2026-02
Deciders: AI Services Hub Team
Category: Infrastructure

Context

The AI Services Hub infrastructure was originally deployed as a single Terraform root module with one monolithic state file containing ~174 resources. This created several operational problems:

  • Blast radius: Any Terraform error or state corruption could affect all 174 resources simultaneously.
  • Lock contention: Only one Terraform operation could run at a time across the entire infrastructure, even for independent resources.
  • Serial execution: Foundry projects required parallelism=1 due to Azure ETag conflicts, which forced the entire apply to run serially.
  • Apply duration: A full apply took 5m 44s because independent stacks (APIM, foundry, tenant-user-mgmt) had to wait for each other.
  • Phased targeting: The deployment script used -target flags to orchestrate a multi-phase apply (Phase 1: everything except foundry, Phase 2: foundry only, Phase 3: validation). This was fragile and hid dependency issues.

Decision

We split the monolithic root module into 5 isolated Terraform stacks, each with its own state file, backend configuration, and dependency management via terraform_remote_state data sources:

Stack            | State Key                      | Phase                | Purpose
shared           | shared.tfstate                 | 1 (serial)           | VNet, subnets, AI Foundry Hub, App Gateway, WAF, Key Vault, ACR, monitoring
tenant           | tenant-{key}.tfstate           | 2 (parallel fan-out) | Per-tenant resources: AI Search, CosmosDB, Document Intelligence, Storage, Key Vault
foundry          | foundry.tfstate                | 3 (parallel)         | AI Foundry projects per tenant (parallelism=1 to avoid ETag conflicts)
apim             | apim.tfstate                   | 3 (parallel)         | API Management gateway, policies, tenant subscriptions, role assignments
tenant-user-mgmt | tenant-user-management.tfstate | 3 (parallel)         | Entra ID user/group assignments (requires Graph API permissions)

A stack engine (deploy-scaled.sh) orchestrates execution in dependency order:

  • Phase 1: shared runs first (all other stacks depend on it).
  • Phase 2: tenant runs per-tenant in parallel (each tenant gets isolated TF_DATA_DIR and state).
  • Phase 3: foundry, apim, and tenant-user-mgmt run concurrently (no cross-dependencies between them).

For destroy, the order is reversed: Phase 3 stacks first (parallel), then tenants (parallel), then shared last.
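
The phase ordering above can be sketched as follows. This is an illustrative Python model of the orchestration, not the actual deploy-scaled.sh engine (which is Bash); `run_stack` stands in for a `terraform apply` against one stack's isolated state:

```python
from concurrent.futures import ThreadPoolExecutor

# Dependency-ordered phases from ADR-013; stacks inside a phase are independent.
PHASES = [
    ["shared"],                               # Phase 1: everything depends on it
    ["tenant"],                               # Phase 2: per-tenant fan-out
    ["foundry", "apim", "tenant-user-mgmt"],  # Phase 3: no cross-dependencies
]

def run_stack(stack: str) -> str:
    # Placeholder for running Terraform in the stack's own root module
    # with its own backend/state; here we just record the stack name.
    return stack

def deploy(destroy: bool = False) -> list[str]:
    """Run phases in order; stacks within a phase run in parallel.
    Destroy reverses the phase order so `shared` is torn down last."""
    phases = list(reversed(PHASES)) if destroy else PHASES
    completed: list[str] = []
    for phase in phases:
        with ThreadPoolExecutor(max_workers=len(phase)) as pool:
            completed.extend(pool.map(run_stack, phase))
    return completed
```

In this model, `deploy()` yields shared first and `deploy(destroy=True)` tears it down last, matching the ordering described above.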

Rationale

  • Reduced blast radius: A state corruption or failed apply in apim cannot affect shared or tenant resources. Each stack can be independently recovered.
  • Parallel execution: Phase 3 stacks have no dependencies on each other and can run concurrently, reducing total apply time by ~60 seconds.
  • Isolated parallelism: The foundry stack runs with parallelism=1 without forcing the entire infrastructure to be serial. APIM and tenant-user-mgmt run with full parallelism simultaneously.
  • Independent state locking: Multiple operators or CI pipelines can work on different stacks without lock contention (e.g., updating APIM policies while a tenant deployment is in progress).
  • Cleaner dependencies: Using terraform_remote_state data sources makes cross-stack dependencies explicit and typed, replacing implicit module-to-module references.

Consequences

Positive

  • 19% faster applies: Total apply time reduced from 5m 44s to 4m 39s (measured on test environment with 2 tenants).
  • Eliminated -target phasing: The old Phase 1/2/3 with -target flags is replaced by natural stack ordering. No more fragile target expressions.
  • Auto-recovery: The stack engine includes deposed object cleanup, import-on-conflict, and transient error retry — previously only available at the monolith level.
  • Live output streaming: Each stack streams Terraform output in real time via tee, improving debuggability in CI/CD logs.
  • Per-tenant state isolation: Each tenant has its own state file, making tenant onboarding/offboarding a state-level operation rather than a resource-level one.

Negative

  • More files: 5 stacks × 5 standard files (main.tf, variables.tf, outputs.tf, providers.tf, backend.tf) = 25 files vs. the original 6. Some variable declarations are duplicated across stacks.
  • State migration required: One-time migration from the monolith state to 5 stack states using terraform state mv. This was performed manually with verification scripts.
  • Cross-stack refactoring: Moving a resource between stacks requires a state move operation, not just a code move. This adds operational complexity for future refactors.
  • Remote state coupling: Stacks are coupled via terraform_remote_state data sources. Adding a new output in shared that apim needs requires deploying shared first.

References

  • Terraform Remote State Data Source
  • infra-ai-hub/stacks/ — Stack root modules
  • infra-ai-hub/scripts/deploy-scaled.sh — Stack engine
  • infra-ai-hub/scripts/deploy-terraform.sh — Public entrypoint
ADR-014: APIM Subscription Key Rotation
Status: Proposed
Date: 2026-02
Deciders: AI Services Hub Team
Category: Security
Pending STRA Approval
This decision has been made by the AI Services Hub team and all infrastructure is in place. Final approval from the Security Threat and Risk Assessment (STRA) process is pending. The rotation mechanism can be enabled per-environment via a single config flag.

Context

APIM subscription keys authenticate tenant API calls to the AI Services Hub gateway. Without rotation, these long-lived secrets present a growing risk:

  • Credential staleness: Keys provisioned at tenant onboarding remain valid indefinitely unless manually changed. A compromised key grants persistent access.
  • No expiry enforcement: Azure APIM subscription keys have no built-in TTL or auto-expiry. The platform must implement rotation externally.
  • BC Gov compliance: Government security policy expects secrets to be rotated periodically. The rotation interval and mechanism require STRA sign-off before production use.
  • Multi-tenant blast radius: Each tenant has isolated subscription keys, but without rotation a single leaked key provides indefinite access to that tenant’s API surface.

Decision

We implement an alternating primary/secondary key rotation pattern with zero downtime, driven by a Container App Job (scheduled) deployed as a custom container:

  1. Alternating slots: APIM subscriptions have two key slots (primary and secondary). Each rotation cycle regenerates one slot while the other remains valid and untouched.
  2. Centralized hub Key Vault: After regeneration, both keys are written to a single hub Key Vault ({app_name}-{env}-hkv) with 90-day expiry and rotation metadata. No per-tenant Key Vault is required.
  3. Self-service retrieval: Tenants fetch current keys via GET /{tenant}/internal/apim-keys — an APIM policy endpoint that reads from the hub Key Vault using APIM’s managed identity. No Azure SDK required.
  4. Configurable schedule: Rotation is controlled by two flags in params/{env}/shared.tfvars: rotation_enabled (master on/off) and rotation_interval_days (7 for dev/test, 30 for prod).
  5. Managed identity authentication: The Container App Job uses a system-assigned managed identity for APIM and Key Vault access. No stored secrets required.
Component            | Purpose                                                              | Location
Container App Job    | Scheduled Python job: discovers APIM + hub KV, rotates keys, stores in KV | jobs/apim-key-rotation/
Container build      | Builds custom container image to GHCR on PR/merge                    | .github/workflows/.builds.yml (matrix entry)
Terraform module     | Deploys Container App Job, Container App Environment, RBAC           | infra-ai-hub/modules/key-rotation-function/
Hub Key Vault        | Centralized storage for all tenant keys (scales to 1000+)            | stacks/shared/main.tf
APIM policy endpoint | /internal/apim-keys reads from hub KV                                | params/apim/api_policy.xml.tftpl
Terraform config     | Seeds initial KV secrets + RBAC for APIM MI → hub KV                 | stacks/apim/main.tf

Rationale

  • Zero downtime: The alternating slot pattern guarantees one key is always valid. Tenants are never locked out during rotation.
  • Container App Job over GHA workflow: A scheduled Container App Job runs reliably within Azure (no 60-day inactivity disable risk). The previous Bash script + GHA approach required periodic repo activity to avoid GitHub silently disabling scheduled workflows.
  • Centralized over distributed: A single hub Key Vault with tenant-prefixed secrets scales better than per-tenant Key Vaults. One RBAC assignment for APIM’s managed identity covers all tenants.
  • Self-service key retrieval: The /internal/apim-keys endpoint eliminates the need for tenants to have Azure portal access or Key Vault Reader roles. Any valid subscription key authenticates the request.
  • Idempotent and safe: The function checks elapsed time since last rotation before acting. Multiple invocations within the interval are no-ops. A --dry-run mode allows validation without changes.
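
The alternating-slot and idempotency behaviour can be sketched in a few lines. The real logic lives in the Container App Job under jobs/apim-key-rotation/; the function names below are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def next_slot(last_rotated_slot: str) -> str:
    """Alternate between the two APIM key slots so that the slot not being
    regenerated always holds a valid key (zero-downtime rotation)."""
    return "secondary" if last_rotated_slot == "primary" else "primary"

def rotation_due(last_rotated: datetime, interval_days: int, now: datetime) -> bool:
    """Idempotency guard: invocations within the interval are no-ops."""
    return now - last_rotated >= timedelta(days=interval_days)

now = datetime(2026, 2, 20, tzinfo=timezone.utc)
last = datetime(2026, 2, 1, tzinfo=timezone.utc)
assert rotation_due(last, 7, now)       # 19 days elapsed, dev/test 7-day interval
assert not rotation_due(last, 30, now)  # still within the prod 30-day interval
assert next_slot("primary") == "secondary"
```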

Consequences

Positive

  • Automated secret hygiene: Keys rotate on a known schedule with full audit trail in Key Vault versioning and GitHub Actions logs.
  • Minimal tenant burden: Tenants can use the APIM internal endpoint, daily cron polling, or simply contact the platform team. No Azure SDK or Key Vault access needed.
  • Emergency rotation: Both slots can be regenerated immediately via the documented runbook, invalidating all compromised keys.
  • Per-environment control: Rotation can be enabled independently per environment — currently active in dev/test, disabled in prod pending STRA approval.

Negative

  • Container infrastructure: The Container App Job requires a Container App Environment and Container Registry, adding infrastructure components compared to the previous GHA-only approach. These are managed via the key-rotation-function Terraform module.
  • STRA gate: Production rotation cannot be enabled until the STRA process completes. Until then, prod keys are static (same risk as baseline).
  • Tenant coordination: Tenants using hard-coded keys (rather than the self-service endpoint) must update their configuration within the rotation interval or face 401 errors.

References

  • APIM Key Rotation Guide — full operational documentation
  • jobs/apim-key-rotation/ — Container App Job source code
  • infra-ai-hub/modules/key-rotation-function/ — Terraform module
  • .github/workflows/.builds.yml — container build workflow (matrix entry)
  • params/{env}/shared.tfvars — per-environment rotation config
ADR-015: Tenant Isolation: Resource Group vs Subscription
Status: Pending Platform/MS
Date: 2026-02
Deciders: AI Services Hub Team, MS Team, Platform Services Team
Category: Architecture
Pending Discussion with Microsoft & Platform Services
This decision requires input from Microsoft’s team on Azure quota scaling options and from the Landing Zone Platform team on subscription provisioning constraints. The current RG-based design is the preferred approach for centralized governance. See #63 and #74.

Context

The AI Services Hub serves multiple BC Government ministries from a shared Azure subscription. Tenant isolation is currently implemented at the Resource Group level — each tenant gets its own RG with dedicated data-plane resources while sharing control-plane infrastructure (VNet, App Gateway, APIM, AI Foundry Hub) in a central RG.

As the platform scales beyond the initial 2 tenants, key Azure subscription-level quotas create hard scaling ceilings:

Resource                         | Per-Subscription Limit             | Per-Tenant Usage       | Ceiling (tenants)
Model deployments per AI account | 32 (default)                       | 5–7                    | ~4–5
AI Search services               | 16 (Basic/Standard)                | 0–1                    | 16
Cosmos DB accounts               | 50                                 | 0–1                    | 50
Cognitive Services accounts      | 200                                | 2 (Doc Intel + Speech) | ~100
APIM APIs per instance           | 400                                | 5–6                    | ~80
Private endpoints per subnet     | 1000                               | ~5                     | ~200
GlobalStandard TPM (per model)   | Varies (e.g., 2M for gpt-4.1-mini) | 7.5K–30K               | Depends on model

The most immediate bottleneck is the 32 model deployments per AI Foundry Hub account. With 2 tenants deploying 6–7 models each (13 total today), the limit is reached at roughly 4–5 tenants. Additionally, LLM quotas (PTU and GlobalStandard TPM) are tied to the subscription scope, so all tenants compete for the same throughput pool.
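
The ceiling arithmetic works out as follows (integer division of the 32-deployment default by the 6–7 models each tenant deploys today):

```python
LIMIT = 32                                 # model deployments per AI account (default)
ceilings = {models: LIMIT // models for models in (6, 7)}
# 6 models/tenant -> 5 tenants; 7 models/tenant -> 4 tenants
```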

Issue #74 raises specific questions about PTU scaling: turnaround time for quota increases, dynamic scaling between PTU and pay-as-you-go, and whether the single-subscription model can support production-scale workloads.

Options Evaluated

Option A: Resource Group Isolation (Current & Preferred)

All tenants share a single Azure subscription. Shared infrastructure (VNet, AppGW, APIM, AI Foundry Hub) lives in a central RG. Each tenant gets a dedicated RG with isolated data-plane resources. Scaling limits are mitigated at the application layer.

Option B: Subscription-Per-Tenant

Each tenant (or group of tenants) gets a dedicated Azure subscription. Shared infrastructure is replicated or connected via VNet peering. Each subscription has independent quota pools.

Option A: Resource Group Isolation — Pros & Cons

Pros

  • Centralized governance: Single subscription = single set of Azure Policies, RBAC, Defender for Cloud, cost management. One pane of glass for the platform team.
  • Simplified networking: All resources in one VNet with one PE subnet. No cross-subscription VNet peering, no transit routing, no DNS forwarding complexity.
  • Lower cost: Shared App Gateway, APIM, AI Foundry Hub, Log Analytics. No per-subscription overhead (reserved instances apply once, shared Defender plans).
  • Faster tenant onboarding: Adding a tenant = adding a Terraform tenant config block. No subscription provisioning, OIDC setup, or Landing Zone request.
  • Existing investment: Current architecture (5 stacks, phased deployment, APIM key rotation) is built and validated for this model.
  • Operational simplicity: One deployment pipeline, one set of state files per environment, one OIDC identity per environment.
  • Cost attribution: Per-tenant RGs enable native Azure Cost Management tag-based and RG-based cost reporting without subscription boundaries.

Cons

  • Quota ceilings: All tenants share subscription-scoped quotas. The 32-deployment AI account limit is the most immediate constraint (~4–5 tenants).
  • TPM/PTU contention: All model deployments on the shared Hub compete for the same GlobalStandard TPM pool. High-demand tenants crowd out others.
  • Blast radius: A subscription-level issue (quota exhaustion, policy misconfiguration, billing suspension) impacts all tenants simultaneously.
  • Quota increase friction: Requesting Azure quota increases is a per-subscription manual process. Lead times can be days to weeks for PTU.
  • AI Search hard limit: 16 search services per subscription is a fixed limit with no increase path — hard wall at 16 search-enabled tenants.
  • Foundry serialization: Model deployments run serially across all tenants to avoid ETag conflicts on the shared Hub. More tenants = slower deploys.

Option B: Subscription-Per-Tenant — Pros & Cons

Pros

  • Independent quota pools: Each subscription gets its own 32 model deployments, 200 Cognitive Services accounts, 16 AI Search services, etc. Eliminates quota-based scaling ceilings.
  • PTU isolation: Each tenant can request and manage its own PTU commitments. No cross-tenant throughput contention.
  • Blast radius reduction: Subscription-level issues are isolated to one tenant.
  • Independent scaling: Each subscription can scale resources (VM sizes, throughput, storage) without affecting others.
  • Compliance flexibility: Some future tenants may have regulatory requirements (FOIPPA, health data) that mandate subscription-level isolation.
  • Clean cost separation: Native Azure billing per subscription. No tag-based attribution needed.

Cons

  • Loss of central governance: Each subscription needs its own Azure Policies, RBAC assignments, Defender plans. Policy drift risk increases linearly.
  • Networking complexity: Requires cross-subscription VNet peering (or VWAN hub-and-spoke), cross-subscription private DNS zones, transit routing. Significant complexity increase.
  • Infrastructure duplication: App Gateway, APIM, Key Vault, and potentially AI Foundry Hub must be replicated per subscription — or a complex shared services model must be designed.
  • Higher cost: Duplicated infrastructure (App Gateway ~$300/mo, APIM StandardV2 ~$700/mo per subscription). Reserved instance savings are fragmented.
  • Subscription provisioning lead time: BC Gov Landing Zone subscription requests go through the Platform Services team. Lead time is days to weeks, blocking rapid tenant onboarding.
  • Pipeline complexity: Each subscription needs its own OIDC federation, deployment pipeline, state backend. The current single-pipeline model does not extend.
  • Operational overhead: N subscriptions = N sets of monitoring, alerting, secret rotation, certificate management. Platform team workload scales linearly.
  • APIM routing: A shared APIM fronting multiple subscription backends requires cross-subscription private endpoints or public exposure — both add complexity.

Decision

We continue with Resource Group isolation (Option A) as the preferred architecture, with specific mitigations for quota scaling constraints. This decision is pending confirmation from Microsoft on quota flexibility and from the Platform team on subscription provisioning options as a future fallback.

Mitigations for RG-Based Scaling Limits

Constraint | Limit | Mitigation Strategy | Status
Model deployments per AI account | 32 | Request quota increase via Azure Support. Deploy a second AI Foundry Hub account if increase denied. Consolidate shared models (e.g., single embedding model for all tenants). | Pending MS
GlobalStandard TPM contention | Per-model cap | Implement APIM rate limiting per tenant (already in place). Explore PTU for high-priority tenants. Use dynamic_throttling_enabled on AI account. Investigate PTU ↔ pay-as-you-go spillover. | Pending MS
AI Search services | 16 | Not all tenants need AI Search (1 of 2 currently enabled). For tenants with simple needs, use shared index with document-level permissions or skip Search entirely. | Mitigated
Foundry project serialization | Serial deploys | Already mitigated in ADR-013 (scaled stacks). Foundry stack runs serial but other phases are parallel. prevent_destroy on model deployments reduces redeploy churn. | In Place
APIM API count | 400 | Consolidate API definitions. Use a single versioned API with tenant routing via APIM policies rather than per-tenant API duplicates. | Future

Trigger for Revisiting This Decision

The following conditions would warrant moving specific tenants or tenant groups to dedicated subscriptions (hybrid model):

  • Tenant count exceeds 10–15 and quota increase requests are denied by Microsoft
  • A tenant requires dedicated PTU with guaranteed throughput SLAs that conflict with shared pool
  • Regulatory requirements (e.g., health data under FOIPPA) mandate subscription-level isolation explicitly
  • PTU turnaround time for quota increases exceeds acceptable lead times for tenant onboarding
  • Total GlobalStandard TPM across all tenants approaches the per-model subscription ceiling

Rationale

  • Right-sizing for now: With 2 active tenants and realistic growth to 5–10 in the near term, the RG model has headroom with mitigations. Subscription-level rearchitecture is premature.
  • Governance is the priority: BC Gov Landing Zone policies, RBAC, and audit requirements are significantly easier to enforce in a single subscription. The compliance benefit outweighs the quota risk.
  • Cost efficiency: Shared infrastructure saves ~$1,000+/mo per avoided subscription (AppGW + APIM alone). This is material in a government context.
  • Onboarding velocity: A Terraform config change vs. a multi-week subscription provisioning request. This directly impacts ministry adoption speed.
  • Hybrid escape hatch: The architecture supports a future hybrid model where high-demand or compliance-sensitive tenants move to dedicated subscriptions while most remain on the shared RG model. This is not an irreversible decision.

Open Questions for Microsoft

These questions are tracked in #74 and require input from the Microsoft CAF/FastTrack team:
  1. What is the turnaround time for PTU quota increases when current allocation is at full capacity?
  2. Does PTU support dynamic scaling (elastic range) or is it fixed at a provisioned point?
  3. Can you provide Terraform samples for auto-fallback between PTU and GlobalStandard (pay-as-you-go)?
  4. Can the 32 model deployment limit per Cognitive Services account be increased? To what ceiling?
  5. Is there a recommended pattern for multi-AI-Foundry-Hub accounts within a single subscription to distribute deployments?

Consequences

Positive

  • No immediate rearchitecture needed: The platform continues operating with the validated RG-based model while answers from Microsoft are pending.
  • Clear scaling triggers: The team knows exactly which quotas to monitor and at what tenant count to revisit the decision.
  • Documented escape path: If RG-based scaling hits limits, the migration path to subscription-per-tenant (or hybrid) is architecturally understood.

Negative

  • Near-term ceiling: The 32-deployment limit means maximum ~4–5 tenants without a quota increase or model consolidation. This is a known constraint.
  • MS dependency: Key mitigations (quota increases, PTU scaling guidance) depend on Microsoft response timelines.
  • Potential future migration: If the hybrid model is eventually needed, migrating a tenant from shared subscription to dedicated subscription requires re-creating resources, migrating data, and updating APIM routing — non-trivial effort.


ADR-016: Backend Circuit Breaker Pattern
Status: Accepted
Date: 2026-02
Deciders: Platform Team
Category: Resilience / API Gateway

Context

The APIM gateway proxies requests to multiple backend services: Azure OpenAI, Document Intelligence, AI Search, Speech Services, and Storage. When a backend experiences failures (overload, outages, throttling), continued request forwarding wastes resources and degrades the client experience with slow timeouts instead of fast failures.

Azure OpenAI specifically returns 429 Too Many Requests with large Retry-After values (up to 86,400 seconds / 1 day) when quota is exhausted. Without circuit breaking, APIM would continue sending requests to a backend that cannot serve them.

Decision

Implement the circuit breaker pattern on all APIM backend entities using the native circuit_breaker_rule in azurerm_api_management_backend.

Configuration per backend

Parameter               | Value (AI services) | Value (Storage)
Failure count threshold | 3                   | 5
Failure window          | 1 minute (PT1M)     | 1 minute (PT1M)
Trip duration           | 1 minute (PT1M)     | 1 minute (PT1M)
Accept Retry-After      | Yes                 | Yes
Trigger status codes    | 429, 500–599        | 429, 500–599

Backends covered

  • ai_foundry — AI Foundry Hub endpoint
  • openai — Azure OpenAI endpoint
  • docint — Document Intelligence
  • ai_search — Azure AI Search
  • speech_services_stt — Speech-to-Text
  • speech_services_tts — Text-to-Speech
  • storage — Blob Storage (higher threshold: 5 failures)

What happens when the circuit trips

  1. Backend accumulates failures matching the configured status codes within the failure window.
  2. When the failure count exceeds the threshold, the circuit opens (trips).
  3. APIM immediately returns HTTP 503 Service Unavailable to all subsequent requests targeting that backend — requests are not forwarded.
  4. The global policy <on-error> handler intercepts the 503 and returns a structured JSON error with a Retry-After header (default: 60 seconds).
  5. After the trip duration (or the backend’s Retry-After value if accept_retry_after_enabled = true), the circuit resets and traffic resumes.
Azure OpenAI caution: When Azure OpenAI returns 429, the Retry-After header can be very large (e.g., 86,400 seconds = 1 day). With accept_retry_after_enabled = true, the circuit stays open for that duration. This is intentional — the backend cannot serve requests during that period anyway.

Client-facing error response (503)

{
  "error": {
    "code": "503",
    "message": "Service temporarily unavailable — backend circuit breaker is open. The service is recovering from excessive failures. Retry after the indicated period.",
    "retryAfter": "60",
    "requestId": "abc-123-def-456"
  }
}
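
A client consuming this error can parse `retryAfter` from the structured body and back off accordingly. A minimal, hypothetical handler:

```python
import json

def retry_delay_seconds(body: str, default: float = 60.0) -> float:
    """Extract the retry delay from the structured 503 error response,
    falling back to the documented 60-second default on any parse failure."""
    try:
        error = json.loads(body).get("error", {})
        return float(error.get("retryAfter", default))
    except (ValueError, TypeError, AttributeError):
        return default

payload = '{"error": {"code": "503", "retryAfter": "60", "requestId": "abc-123"}}'
assert retry_delay_seconds(payload) == 60.0
```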

Rationale

  • Fast failure: Clients get an immediate 503 instead of waiting for backend timeouts (up to 300 seconds).
  • Backend protection: Prevents overwhelming a struggling backend with additional requests.
  • Native support: Uses azurerm_api_management_backend.circuit_breaker_rule — no custom policy logic required.
  • Retry-After propagation: Passes backend throttle signals directly to clients for proper backoff.
  • Event Grid integration: APIM emits Event Grid events on circuit trip/reset for monitoring and alerting.

Consequences

Positive

  • Reduced latency during backend outages (instant 503 vs. timeout).
  • Backend services get breathing room to recover.
  • Consistent error format with Retry-After enables client-side retry logic.
  • Zero policy-level code — purely infrastructure configuration.

Negative

  • Approximate tripping: APIM gateway instances do not synchronize circuit breaker state. Each instance tracks failures independently, so tripping is approximate in multi-instance deployments.
  • Single rule per backend: Only one circuit breaker rule per backend is currently supported by the Azure API.
  • 503 during recovery: Legitimate requests during the trip window will be rejected even if the backend has recovered before the trip duration expires.


ADR-017: Custom Tenant Onboarding Portal Inside AI Hub
Status: Accepted
Date: 2026-03
Driver: Choice
Category: Platform

Context

The AI Hub needed a tenant onboarding experience that does more than collect a request. The onboarding flow must support structured tenant metadata, governed admin review, future automation from submission through approval, and environment-aware downstream actions such as generating Terraform inputs and preparing non-prod and prod deployment workflows.

We considered using an existing platform instead of building a portal inside the Hub workspace. The main alternatives were the BC Government Platform Product Registry, CHEFS, and the Azure API Management Developer Portal. All three reduce initial build effort, but none matches the lifecycle and control-plane requirements of Hub onboarding.

Options Considered

Custom portal inside AI Hub (selected)

  • Owns the end-to-end onboarding workflow
  • Supports structured validation and Hub-specific data models
  • Can drive future automation after approval

BC Government Platform Product Registry

  • Designed to manage existing products on Private Cloud OpenShift and Public Cloud Landing Zones
  • Solves a different problem than Hub onboarding
  • Supports product metadata and resource change requests for those higher-level platforms, capabilities the AI Hub may not need at all

CHEFS

  • Strong for hosted form submission
  • Submission lifecycle is oriented around form intake, not long-running onboarding state
  • Not a fit for secure post-approval actions or controlled reveal flows

APIM Developer Portal

  • Can be deployed as part of the Hub infrastructure (APIM)
  • Offers self-service API subscription and developer onboarding UX out of the box
  • Customisable via delegation and custom HTML/JS widgets, but tightly coupled to APIM concepts (products, subscriptions)
  • Not designed for multi-step approval workflows, structured Terraform config generation, or environment-aware provisioning actions

Decision

We will use a custom tenant onboarding portal inside the AI Hub repository and deployment boundary instead of the BC Government Platform Product Registry, CHEFS, or the APIM Developer Portal.

Rationale

  • Governed approval workflow: Hub onboarding requires an admin review and approval process, not just a one-time request capture. The custom portal can model request states, reviewer actions, and follow-up steps directly.
  • Structured Hub-specific configuration: The workflow needs to collect and validate data that maps cleanly into tenant-specific Terraform inputs and other structured configuration artifacts. A generic registry or hosted form would require extra translation layers and still would not own the lifecycle.
  • Tight integration with Hub logic and storage: The onboarding flow must stay close to Hub-specific validation rules, tenant state, and downstream provisioning behavior. Keeping the portal in the same codebase reduces impedance between intake, approval, generated config, and deployment automation.
  • Authenticated post-submission lifecycle: The process does not end at submission. The platform needs room for authenticated follow-up actions, operator review, state transitions, and future self-service interactions that go beyond a write-once form.
  • CHEFS is too immutable for this lifecycle: CHEFS is well suited to hosted intake, but its submission model is intentionally form-centric and immutable. It supports status and notes, but it is not designed to mutate form data, drive secure post-approval reveal flows, or act as the system of record for ongoing onboarding operations.
  • The Platform Product Registry solves a different problem: The registry is built for teams that need to create or manage a product on Private Cloud OpenShift or the Public Cloud Landing Zone, which may not be a need for the AI Hub. AI Hub tenant onboarding is independent of the platform tenants run on, captures Hub-specific configuration, and needs a purpose-built approval and provisioning workflow rather than a generic platform registry product change process, just as Keycloak and the Kong App Gateway provide their own purpose-built portals.
  • Secure post-approval actions: The team needs room for actions after approval, including controlled credential-related workflows and other sensitive follow-up behavior. Those concerns should live in a dedicated application boundary rather than in generic form metadata or a public-facing registry experience.
  • Future automation path: The chosen design supports evolving from request intake to approval and then to automated promotion and deployment workflows across non-prod and prod environments without replacing the front door later.
  • APIM Developer Portal is the wrong abstraction: The APIM Developer Portal is built around API products and subscriptions, not tenant provisioning workflows. Customising it far enough to support multi-step approval, structured config generation, and environment-aware provisioning actions would require extensive delegation and external backend work — effectively rebuilding the same application on a more constrained foundation.
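
To make the "structured Hub-specific configuration" point concrete, approved intake data might be rendered into tenant Terraform inputs roughly as sketched below. This is a hedged illustration only: the field names, the `render_tfvars` helper, and the tfvars layout are hypothetical, not the Hub's actual schema.

```python
import json

# Hypothetical sketch: map a validated onboarding request to a tfvars JSON
# document that a tenant deployment workflow could consume. All field names
# here are illustrative, not the Hub's real schema.

def render_tfvars(request: dict) -> str:
    """Render approved intake data as Terraform-readable tfvars JSON."""
    tenant = request["tenant_name"].lower().replace(" ", "-")
    return json.dumps(
        {
            "tenant_name": tenant,
            "environment": request["environment"],   # e.g. "nonprod" or "prod"
            "contact_email": request["contact_email"],
            "fail_closed": request.get("fail_closed", True),
        },
        indent=2,
    )
```

Because the portal owns the data model end to end, this translation step can live next to the validation rules instead of behind an export from a generic form or registry.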

Consequences

Positive

  • Single ownership boundary: Intake, review, generated config, and deployment hooks can evolve together in the Hub codebase
  • Better security posture for follow-up actions: Sensitive post-approval behavior stays in a purpose-built application instead of being forced into form notes or registry constructs
  • Clearer evolution path: The portal can add richer workflow, audit, and automation capabilities without re-platforming
  • Operational consistency: The same repo, CI/CD patterns, and Azure deployment model can be reused for the portal and Hub-adjacent automation

Negative

  • Custom application to build and maintain: We own the frontend, backend, tests, deployment, and documentation
  • Higher initial delivery cost: Building a tailored workflow is slower than standing up a generic form or pointing users at an existing portal
  • More platform decisions to maintain: Auth, storage, review workflow, and automation semantics become our responsibility

Mitigations

  • Keep the portal thin and focused: Implement only onboarding workflow concerns that are specific to AI Hub
  • Automate validation and delivery: Maintain build, unit, E2E, and deployment workflows so sustainment cost stays controlled
  • Document the why: This ADR exists so future teams do not revisit CHEFS or the Platform Product Registry without understanding the lifecycle mismatch

References

↑ Back to Decision Index
ADR-018: External PII Redaction Service
Status: Accepted
Date: 2026-03
Driver: Choice
Category: Security / Data Processing

Context

The AI Hub uses Azure Language Service PII recognition to redact personally identifiable information from chat completion payloads before they reach upstream LLM backends. APIM delegates all PII processing to a dedicated external service rather than calling the Language API inline, because APIM policies have no loop construct and the Language API imposes strict per-call limits.

The Azure Language Service /language/:analyze-text endpoint accepts a maximum of 5 documents per call, and each document is limited to 5 120 characters. Real-world chat completion requests can contain many messages or very long messages that chunk into more than 5 documents. Covering the full payload requires batched calls with deadline enforcement and transient retry handling — logic that belongs in a real programming language, not APIM XML.

The core constraint: APIM policies cannot loop. A single send-request covers at most 5 documents. Payloads exceeding that limit need an external orchestrator that can issue batched calls, handle transient retry/backoff, and stay within the APIM timeout budget.
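
The batching the external orchestrator performs is simple to express in Python and impossible to express in policy XML. A minimal sketch (the 5-document cap is the Language API limit described above; `batch_documents` is an illustrative name, not the actual service code):

```python
# Illustrative sketch: group redaction documents into Language API batches.
# MAX_DOCS_PER_CALL mirrors the documented 5-document limit per
# /language/:analyze-text call.
MAX_DOCS_PER_CALL = 5

def batch_documents(documents: list[str]) -> list[list[str]]:
    """Split a flat list of documents into batches of at most 5."""
    return [
        documents[i : i + MAX_DOCS_PER_CALL]
        for i in range(0, len(documents), MAX_DOCS_PER_CALL)
    ]
```

A payload that chunks into 12 documents needs 3 Language API calls; issuing a variable number of calls is exactly the loop that APIM policy XML cannot express.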

Options Considered

External PII Redaction Container App (selected)

  • Dedicated FastAPI service deployed as a Container App on the shared internal CAE
  • APIM routes all PII-enabled requests to the external service
  • Service handles bounded batching, retry/backoff, deadline enforcement, and coverage verification
  • Single code path keeps the APIM policy simple

Azure Functions (Consumption or Flex)

  • Serverless compute that scales to zero
  • Cold start latency (measured in seconds) eats into the 90 s APIM timeout budget
  • Less control over the runtime, networking, and concurrency model
  • Does not align with the Container App Environment already in use

Expand APIM inline policy (no external service)

  • Keep everything in APIM XML policies
  • APIM policies have no loop construct; would require N hard-coded send-request blocks
  • Extremely fragile, hard to test, and impossible to maintain at scale
  • Policy execution time adds up and risks breaching the APIM timeout

Client-side redaction SDK

  • Push PII responsibility to each tenant application
  • Cannot be enforced centrally; tenants may skip or misconfigure
  • Defeats the purpose of transparent gateway-level PII protection

Decision

We will deploy a dedicated PII redaction microservice as a Container App on the shared internal Container App Environment, reachable only from APIM via VNet-internal ingress. APIM routes all PII-enabled requests to this service.

Rationale

  • Language API batching limit: The /language/:analyze-text endpoint accepts at most 5 documents per call (each up to 5 120 chars). Payloads with many or long messages exceed this and need orchestrated batch calls that APIM policies cannot express.
  • APIM cannot loop: Policy XML has no iteration construct. Hard-coding multiple send-request blocks is brittle, hard to test, and caps out quickly. A real programming language (Python + asyncio) handles batching, timeouts, retry/backoff, and error recovery naturally.
  • Single routing path: A single route from APIM to the external service keeps the policy fragment simple. The service handles all payloads with the same batching logic regardless of size.
  • Reuse existing infrastructure: The shared Container App Environment, Managed Identity RBAC, GHCR image builds, and Terraform module patterns are already in place. Adding one more Container App is incremental.
  • Timeout budget alignment: The service enforces an 85 s total processing deadline that fits within the 90 s APIM send-request timeout, with per-attempt timeouts (10 s) and bounded transient retries for 429 and 5xx responses.
  • Fail-open / fail-closed flexibility: The service returns a structured response with coverage status. APIM decides whether to block (503) or pass through based on the tenant's fail_closed configuration.
  • Testability: Python unit tests cover chunking, batching, reassembly, and API error handling — far easier to maintain than equivalent logic embedded in APIM XML.
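
The timeout-budget arithmetic above can be sketched as follows. This is a simplified synchronous illustration, not the service's actual code (which uses asyncio): `call_with_deadline` and `TransientError` are hypothetical names, and the values default to the documented 85 s budget and 10 s per-attempt timeout.

```python
import time

# Illustrative sketch of bounded retry within a total deadline. `call` is
# any function that raises TransientError on 429/5xx responses; for 429 the
# server's Retry-After value would be attached to the exception.

class TransientError(Exception):
    def __init__(self, retry_after: float = 0.0):
        self.retry_after = retry_after

def call_with_deadline(call, total_budget=85.0, attempt_timeout=10.0,
                       base_backoff=0.5):
    deadline = time.monotonic() + total_budget
    attempt = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("total processing budget exhausted")
        try:
            # Cap each attempt so one slow call cannot eat the whole budget.
            return call(timeout=min(attempt_timeout, remaining))
        except TransientError as exc:
            attempt += 1
            # Honor Retry-After if given (429), else exponential backoff (5xx).
            delay = exc.retry_after or base_backoff * (2 ** (attempt - 1))
            if time.monotonic() + delay >= deadline:
                raise TimeoutError("no budget left for another attempt")
            time.sleep(delay)
```

The key property is that retries, backoff, and per-attempt timeouts all draw from the same fixed budget, so the service can never overrun the APIM send-request window.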

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                 ALL-EXTERNAL PII REDACTION                           │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   Client → App Gateway → APIM                                        │
│                            │                                         │
│                  ┌─────────┴────────────────────────┐                │
│                  │  POST /redact                    │                │
│                  │  (PII Redaction Service)         │                │
│                  └─────────┬────────────────────────┘                │
│                            │                                         │
│                  ┌─────────┴────────────────────────┐                │
│                  │  Bounded concurrent batches      │                │
│                  │  (max 5 docs × 15 batches)       │                │
│                  │  Word-boundary chunking          │                │
│                  │  Retry-After / backoff handling  │                │
│                  │  Deadline enforcement (85 s)     │                │
│                  │  Full-coverage check             │                │
│                  └─────────┬────────────────────────┘                │
│                            │                                         │
│                  Redacted body → APIM → Backend                      │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Key Design Constraints

Constraint | Value | Reason
Max chars per document | 5 000 | Language API limit (5 120) with a safety margin
Max documents per Language API call | 5 | Language API batch limit
Max batches per request | 15 (→ 75 documents) | Caps total processing time; requests beyond this are rejected with 413
Per-attempt timeout | 10 s | Isolates slow Language API calls
Total processing timeout | 85 s | Fits within the APIM 90 s send-request timeout
Transient retry handling | 429 + 5xx | Honors Retry-After for 429 and uses exponential backoff for 5xx, all within the same 85 s budget
Chunking strategy | Word-boundary split | Avoids splitting mid-word, which degrades PII detection accuracy
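
The word-boundary chunking constraint can be illustrated with a short sketch. The 5 000-character cap matches the table above; `chunk_text` is a hypothetical helper name, and the hard-split fallback for pathological over-long words is our assumption, not documented service behaviour.

```python
MAX_CHARS_PER_DOC = 5_000  # Language API limit is 5 120; keep a safety margin

def chunk_text(text: str, limit: int = MAX_CHARS_PER_DOC) -> list[str]:
    """Split text into chunks of at most `limit` chars without cutting words.

    A single word longer than `limit` is split hard as a last resort.
    """
    chunks, current = [], ""
    for word in text.split(" "):
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Hard-split pathological words that exceed the limit on their own.
            while len(word) > limit:
                chunks.append(word[:limit])
                word = word[limit:]
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at spaces keeps names, emails, and identifiers intact across chunk boundaries, which is what protects PII detection accuracy.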

Consequences

Positive

  • Transparent payload handling: Tenants do not need to know about Language API limits; APIM routes all PII-enabled requests to the external service automatically
  • Single code path: All PII redaction flows through the Container App, keeping behaviour consistent across all payload sizes
  • Testable orchestration logic: Chunking, batching, reassembly, and error handling are covered by Python unit tests rather than being embedded in untestable APIM XML
  • Incremental infrastructure cost: Runs on the existing shared CAE with Managed Identity RBAC — no new networking or identity infrastructure
  • Structured observability: JSON-formatted logs with correlation IDs, batch counts, and elapsed-time diagnostics

Negative

  • Additional component to deploy and maintain: One more Container App, Dockerfile, Terraform module, and deployment phase
  • Extra network hop for all PII requests: Every PII-enabled request pays the cost of APIM → Container App → Language API instead of APIM → Language API directly

Mitigations

  • Reuse proven patterns: The Container App module, GHCR build workflow, and deploy ordering follow the same conventions as the key-rotation job
  • Integration tests: The APIM integration test suite covers PII redaction end-to-end through the external service
  • Conservative thresholds: The 75-document cap and 85 s deadline ensure the service stays within APIM timeout bounds even under adverse conditions, including transient retry/backoff

References

↑ Back to Decision Index

ADR Template

Use this template when adding new ADRs:

## ADR-XXX: [Title]

**Status:** [Proposed | Accepted | Deprecated | Superseded]
**Date:** YYYY-MM
**Deciders:** [Team/People]
**Category:** [Security | Networking | Infrastructure | Documentation | etc.]

### Context
[What is the issue? What forces are at play?]

### Decision
[What is the decision? Be specific.]

### Rationale
[Why this decision over alternatives?]

### Consequences
#### Positive
- [Good outcomes]

#### Negative
- [Tradeoffs accepted]

### References
- [Links to relevant docs, diagrams, discussions]

You have reached the end of the ADR list.

↑ Back to Decision Index