Architecture Decision Records
This page documents the key architecture decisions behind the AI Services Hub. Each architecture decision record explains the problem being solved, the choice that was made, and the trade-offs that came with that choice. If you are new to the project, this page is here to explain not just what was built, but why it was built that way.
Many of the choices documented here were not optional design preferences. They were driven by British Columbia Government security and platform rules. Important constraints include:
- No public endpoints - All Azure services must use private endpoints only
- Private networking required - Resources must be isolated inside virtual networks
- No long-lived secrets - Identity and short-lived token patterns are preferred over stored passwords and keys
- Bastion-only access - Direct SSH/RDP from internet is prohibited
Start Here: Architecture Decision Record 001
Architecture Decision Record 001 is the foundation that explains why the rest of the infrastructure exists. Read it first if you want the big-picture explanation before reviewing the more specific decisions.
Architecture decision records capture a decision together with its background and consequences. They help future maintainers avoid reopening the same debate without context, and they give new team members a faster way to understand the platform.
Decision Index
| ID | Title | Driver | Status |
|---|---|---|---|
| ADR-001 | Shared AI Landing Zone | Policy | Accepted |
| ADR-002 | Use OIDC instead of Service Principal Secrets | Policy | Accepted |
| ADR-003 | Optional Use of Azure Bastion for VM Access | Policy | Accepted |
| ADR-004 | Private Endpoints for All Azure Services | Policy | Accepted |
| ADR-005 | Zero-Dependency Documentation System | Choice | Accepted |
| ADR-006 | Terraform as Infrastructure as Code (IaC) | Choice | Accepted |
| ADR-007 | Client Connectivity via App Gateway + APIM | Policy | Accepted |
| ADR-008 | No Azure Portal or Foundry Studio UI Access | Policy | Pending Platform/MS |
| ADR-009 | Why AI Landing Zone vs Custom Solution | Choice | Accepted |
| ADR-010 | Multi-Tenant Isolation Model | Policy | Accepted |
| ADR-011 | Control Plane vs Data Plane Access & Chisel Tunnel | Policy | Accepted |
| ADR-012 | Usage Monitoring, Cost Allocation and Chargeback Metrics | Operations | Proposed |
| ADR-013 | Scaled Stack Architecture with Isolated State Files | Choice | Accepted |
| ADR-014 | APIM Subscription Key Rotation | Choice | Proposed |
| ADR-015 | Tenant Isolation: Resource Group vs Subscription | Policy | Pending Platform/MS |
| ADR-016 | Backend Circuit Breaker Pattern | Resilience | Accepted |
| ADR-017 | Custom Tenant Onboarding Portal Inside AI Hub | Choice | Accepted |
| ADR-018 | External PII Redaction Service | Choice | Accepted |
ADR-001: Shared AI Landing Zone Accepted
Context
The Problem: Why Can't GitHub Just Run Terraform Directly?
The Network Barrier
When you run terraform apply from GitHub Actions, here's what happens:
- GitHub spins up a runner (a VM on Microsoft's public cloud)
- Terraform tries to reach the data plane of Azure PaaS services (e.g., Key Vault secrets)
- BLOCKED - Key Vault has no public endpoint
- Terraform then tries to connect to storage containers, databases, etc.
- BLOCKED - data-plane access to all services is private-only, inside the VNet
Result: Terraform can run from the public internet, but it is limited to the state Storage Account and control-plane access to Azure resources.
The Solution: Landing Zone Architecture
We need infrastructure inside the private network that can:
- Receive commands from GitHub (via OIDC tokens)
- Run Terraform against private endpoints
- Allow humans to access resources for debugging
VNet (Virtual Network)
What: Private network in Azure
Why: All resources live here, isolated from public internet
Analogy: The building's internal network
Jumpbox VM
What: Linux VM inside the VNet
Why: Runs Terraform, can reach all private endpoints
Analogy: A workstation inside the secure office
Azure Bastion
What: Managed gateway service
Why: Secure way to access Jumpbox (no public SSH)
Analogy: The secure lobby with ID verification
How Terraform Actually Runs
┌─────────────────────────────────────────────────────────────────────────┐
│                            DEPLOYMENT FLOW                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  GitHub Actions (Public Internet)   Azure Landing Zone (Private)        │
│                                                                         │
│  ┌──────────────┐      OIDC         ┌──────────────┐  ┌──────────────┐  │
│  │ GitHub-Hosted│──────────────────▶│ Azure Proxy  │─▶│ Private      │  │
│  │ Runner       │                   │ (Chisel App  │  │ Endpoints    │  │
│  │ ubuntu-24.04 │◀───── SOCKS5 ─────│  Service)    │  │ (Storage, KV)│  │
│  └──────┬───────┘                   └──────────────┘  └──────────────┘  │
│         ▼                                                               │
│  ┌──────────────┐                                                       │
│  │ terraform    │                                                       │
│  │ plan/apply   │                                                       │
│  └──────────────┘                                                       │
│  (via SOCKS tunnel)                                                     │
└─────────────────────────────────────────────────────────────────────────┘
Two deployment options:
Option A: Chisel Tunnel with GitHub-Hosted Runners
Standard GitHub-hosted runners (ubuntu-24.04) combined with a Chisel SOCKS5 proxy deployed as an Azure App Service inside the VNet. This eliminates the need for persistent self-hosted runner infrastructure.
- GitHub-hosted runner starts (ephemeral, no maintenance)
- Workflow deploys/starts the Chisel proxy App Service
- Runner connects through SOCKS5 tunnel into VNet
- Terraform traffic routed via tunnel to private endpoints
- No persistent runner pool needed — pay only for workflow minutes
The azure-proxy Chisel container runs on an App Service Plan inside the tools subnet of the VNet. The .deployer.yml reusable workflow deploys it first; subsequent steps then use it as a SOCKS proxy via the ALL_PROXY environment variable.
Used by this platform for all CI/CD
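As a rough Terraform sketch of the proxy piece, assuming hypothetical names (asp-azure-proxy, rg-ai-hub, the tools subnet reference) and the upstream jpillora/chisel container image:

```hcl
# Hypothetical sketch: Chisel SOCKS5 proxy as a VNet-integrated App Service.
resource "azurerm_service_plan" "proxy" {
  name                = "asp-azure-proxy"   # hypothetical name
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"         # hypothetical resource group
  os_type             = "Linux"
  sku_name            = "B1"
}

resource "azurerm_linux_web_app" "proxy" {
  name                = "azure-proxy"       # hypothetical name
  location            = azurerm_service_plan.proxy.location
  resource_group_name = azurerm_service_plan.proxy.resource_group_name
  service_plan_id     = azurerm_service_plan.proxy.id

  # Outbound VNet integration is what lets the proxy reach private endpoints.
  virtual_network_subnet_id = azurerm_subnet.tools.id   # hypothetical subnet

  site_config {
    application_stack {
      docker_registry_url = "https://index.docker.io"
      docker_image_name   = "jpillora/chisel:latest"    # assumption: upstream image
    }
  }
}
```

The workflow would then open a chisel client connection to this app and export ALL_PROXY=socks5://127.0.0.1:1080 before running Terraform.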
Option B: Jumpbox + Bastion
Manual access via Bastion for debugging, testing, and emergency fixes.
- Human connects via Azure Portal
- Bastion brokers SSH connection
- Land on Jumpbox inside VNet
- Run commands, debug issues
Required for human access
What About Other Repositories?
This is shared infrastructure - the Landing Zone is set up once and used by all projects.
What This Repo Provides (Set Up Once)
| Component | Purpose | Shared? |
|---|---|---|
| VNet + Subnets | Private network for all resources | Yes - all projects use this |
| Azure Bastion | Secure access gateway | Yes - one Bastion for all |
| Jumpbox VM | Admin access point | Yes - shared by admins |
| Azure Proxy (Chisel Server) | SOCKS5 tunnel for CI/CD private-endpoint access | Yes - shared by all stacks |
| Private DNS Zones | Name resolution for private endpoints | Managed by Platform Services |
How Access Actually Works (Public vs Private)
| Who | Access Path | Public IPs? | How It Works |
|---|---|---|---|
| Platform Team (Admin) | Internet → Azure Portal → Bastion → VMs | Bastion only | Bastion is Azure-managed PaaS with public IP. VMs have private IPs only. This is the ONE public-to-private bridge. |
| Ministry Apps/Users | Gov Network → ExpressRoute → App Gateway → APIM → Services | None | ExpressRoute is a private dedicated circuit from BC Gov data centers to Azure backbone. Traffic never touches public internet. |
| GitHub Actions (CI/CD) | GitHub-hosted runner + Chisel SOCKS tunnel → Private Endpoints | None | GitHub-hosted runners (ubuntu-24.04) route Terraform traffic through a Chisel SOCKS5 proxy (App Service inside the VNet). The runner itself is on the public internet but all data-plane calls are tunnelled privately. |
What is ExpressRoute?
ExpressRoute is NOT a public endpoint. It's a dedicated fiber connection from BC Gov's data centers directly into Azure's network backbone.
- Traffic stays on private circuits (not internet)
- Managed by BC Gov Platform Services
- Already exists - we just use it
- Like a private tunnel to Azure
Why APIM is Internal
APIM is deployed in internal mode for security; it is reachable only through App Gateway:
- ExpressRoute connects Gov Network → Azure VNet
- App Gateway acts as a layer 7 (HTTP) load balancer and reverse proxy in front of APIM, with strong WAF protection
- Gov users can reach internal IPs directly via ExpressRoute if needed, with the proper firewall rules in place
Summary: Only ONE Public Endpoint
┌─────────────────────────────────────────────────────────────────────────────┐
│ BC Gov Network │
│ ┌──────────────┐ │
│ │ Ministry │ │
│ │ Applications │──┐ │
│ └──────────────┘ │ │
│ │ │
│ ┌──────────────┐ │ ┌──────────────────────────────────────────────────┐│
│ │ Ministry │──┼────│ ExpressRoute (Private Circuit) ││
│ │ Users │ │ │ NOT public internet! ││
│ └──────────────┘ │ └──────────────────────────────────────────────────┘│
└────────────────────┼────────────────────────────────────────────────────────┘
│ Firewall Rules (Allowed Traffic)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Azure (Private VNet) │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────────────────┐ │
│ │ App Gateway │───▶│ APIM │───▶│ Private Endpoints │ │
│ │ (Public IP) │ │ (Internal IP) │ │ (Storage, OpenAI, etc.) │ │
│ └────────────────┘ └────────────────┘ └────────────────────────────┘ │
│ ▲ │
│ │ │
│ │ All private IPs - reachable via ExpressRoute │
│ │
│ ════════════════════════════════════════════════════════════════════════ │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ BASTION │─▶ │ Jumpbox VM │ ◀── Only Bastion has public IP │
│ │ (PUBLIC IP) │ │ (Private IP) │ (for admin access) │
│ └────────────────┘ └────────────────┘ │
│ ▲ │
└─────────┼───────────────────────────────────────────────────────────────────┘
│
│ Admin accesses via Azure Portal (browser)
│
┌─────────┴────────┐
│ INTERNET │
│ (Platform Team) │
└──────────────────┘
Why Not Just Open Public Endpoints Temporarily?
Policy prohibits this. Even temporary public access:
- Creates audit findings
- Requires security exemption paperwork
- Introduces attack window
- Must be reverted manually (often forgotten)
The Landing Zone approach is always private - no exemptions needed.
Consequences
Positive
- Fully policy compliant - No public endpoints ever
- Shared infrastructure - One-time setup, many projects benefit
- Consistent security - All projects inherit the same secure baseline
- Cost efficient - Single Bastion (~$140/mo) serves all projects
Negative
- Initial complexity - Landing Zone must be built first
- Proxy dependency - Chisel App Service must be healthy before Terraform workflows run
- VNet planning - Must allocate IP ranges carefully
References
ADR-002: Use OIDC instead of Service Principal Secrets Accepted
Context
GitHub Actions workflows need to authenticate with Azure to deploy infrastructure via Terraform. We evaluated three options for credential management:
Option A: Static Secrets
- Create Azure AD App Registration
- Generate Client Secret
- Store in GitHub Secrets
- Rotate manually every 1-2 years
Not policy compliant - Long-lived credentials prohibited
Option B: Platform Rotating Keys
- Platform team rotates keys every 2 days
- Keys expire after 4 days
- Requires cron job to fetch/update
- Must sync to GitHub Secrets
Policy compliant - But adds operational overhead
Option C: OIDC Federation
- Create Managed Identity
- Configure GitHub OIDC trust
- Token fetched in pipeline
- No cron job needed
Policy compliant - Zero operational overhead
Decision
We chose Option C: OIDC Federated Credentials.
While the platform team's rotating key solution (Option B) is policy compliant, it requires maintaining a cron job to continuously fetch and update credentials. OIDC eliminates this operational burden - the bearer token is obtained directly within the GitHub Actions workflow at runtime, with no external synchronization required.
Rationale
| Criteria | Static Secrets | Platform Rotating | OIDC |
|---|---|---|---|
| Policy Compliant | No | Yes | Yes |
| Secret Management | Manual rotation | Cron job required | No secrets needed |
| Security Risk | Long-lived credential leak | 4-day window if compromised | ~10 min window |
| Operational Overhead | Annual rotation | Cron job maintenance | Set and forget |
| Token Lifetime | 1-2 years | 4 days max | ~10 minutes |
| Failure Mode | Expired secret breaks deploy | Cron failure breaks deploy | Self-contained in pipeline |
| Scope Control | Per application | Per application | Per repo/branch/env |
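The per-repo/branch scope control in the last row is expressed through the federated credential's subject claim. A minimal Terraform sketch, with hypothetical names (the bcgov/ai-hub repo slug is illustrative):

```hcl
# Hypothetical sketch: bind a managed identity to one GitHub repo and branch.
resource "azurerm_user_assigned_identity" "ci" {
  name                = "id-github-deployer"   # hypothetical name
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"            # hypothetical resource group
}

resource "azurerm_federated_identity_credential" "github" {
  name                = "github-main-branch"
  resource_group_name = azurerm_user_assigned_identity.ci.resource_group_name
  parent_id           = azurerm_user_assigned_identity.ci.id
  audience            = ["api://AzureADTokenExchange"]
  issuer              = "https://token.actions.githubusercontent.com"
  # Only workflows on this repo's main branch can exchange tokens.
  subject             = "repo:bcgov/ai-hub:ref:refs/heads/main"   # illustrative repo
}
```

Tightening the subject (e.g., to a specific environment) narrows which workflow runs can obtain a token at all.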
Consequences
Positive
- Zero secrets to rotate - Eliminates credential management overhead
- Reduced blast radius - Tokens valid for minutes, not years
- Fine-grained access - Can restrict to specific branches/environments
- Better audit trail - Every token exchange is logged with JWT claims
- No secret sprawl - Secrets don't end up in logs, config files, or developer machines
Negative
- More complex initial setup - Federated credential configuration is more involved
- Newer technology - Less documentation and community examples available
- GitHub dependency - Tightly coupled to GitHub's OIDC provider
Neutral
- Requires understanding of JWT claims and subject matching
- Debugging auth failures requires knowledge of OIDC flow
References
ADR-003: Optional Use of Azure Bastion for VM Access Accepted
Context
Azure Bastion is the one approved exception to the "no public endpoints" rule. It's an Azure-managed PaaS service with a public IP that provides secure browser-based access to private VMs. The key point: Bastion has the public IP, not the VMs themselves.
Public Endpoints in the Architecture
| Service | Has Public IP? | Why |
|---|---|---|
| Azure Bastion | Yes (exception) | Required for browser-based VM access. Azure-managed, AAD-authenticated. This is the public-to-private bridge for admin access. |
| App Gateway | Yes | Receives traffic from public internet with WAF (Web Application Firewall). |
| APIM | No (internal) | Deployed in internal mode, sits behind App Gateway. No public exposure. |
| VMs (Jumpbox) | No | Private IPs only. Access via Bastion. |
| Storage, Key Vault, etc. | No | Private endpoints only. Public access disabled. |
Operators need secure access to jumpbox VMs for debugging and administration. The VMs cannot have public IPs per policy. We evaluated:
Option A: VPN Gateway
Point-to-site VPN for developer access
Option B: Public IP + NSG
Expose SSH/RDP with IP allowlisting
Option C: Azure Bastion
Browser-based RDP/SSH via Azure Portal
Decision
We chose Option C: Azure Bastion.
Rationale
| Criteria | VPN Gateway | Public IP | Bastion |
|---|---|---|---|
| Setup Complexity | High (certs, clients) | Low | Medium |
| Client Requirements | VPN client software | SSH/RDP client | Browser only |
| Attack Surface | VPN endpoint | High (exposed ports) | Minimal |
| Cost | ~$140/month | ~$5/month | ~$140/month (when on) |
| On-demand | No (always on) | Yes | Yes (can destroy) |
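The on-demand deploy/destroy pattern above can be sketched in Terraform, with hypothetical names (note the subnet must literally be named AzureBastionSubnet):

```hcl
# Hypothetical sketch: Bastion deployed on demand to control the ~$140/mo cost.
resource "azurerm_public_ip" "bastion" {
  name                = "pip-bastion"        # hypothetical name
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"          # hypothetical resource group
  allocation_method   = "Static"
  sku                 = "Standard"           # Bastion requires a Standard SKU IP
}

resource "azurerm_bastion_host" "main" {
  name                = "bastion-ai-hub"     # hypothetical name
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"

  ip_configuration {
    name                 = "config"
    subnet_id            = azurerm_subnet.bastion.id   # must be "AzureBastionSubnet"
    public_ip_address_id = azurerm_public_ip.bastion.id
  }
}
```

A workflow can apply or destroy just these two resources to turn Bastion on and off.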
Consequences
Positive
- No public IPs on VMs - VMs only have private IPs
- No client software - Works from any browser
- Azure AD integration - Uses existing identity
- Session recording - Can enable for audit
- Cost control - Can deploy/destroy on demand via workflow
Negative
- Azure Portal dependency - Must use Azure UI or CLI
- File transfer limitations - No native SCP/SFTP
- Latency - Browser-based adds some lag
References
ADR-004: Private Endpoints for All Azure Services Accepted
Context
Azure PaaS services (Storage Accounts, Key Vaults, Databases, etc.) by default have public endpoints accessible from the internet. BC Gov policy requires these to be locked down.
Policy Requirements
What Policy Prohibits
- Public endpoints on any Azure service
- Key Vaults with public network access
- Databases with public connectivity
- Any service reachable without VNet integration
What Policy Requires
- Private Endpoints for all PaaS services
- Private DNS zones for name resolution (managed by Platform Services)
- VNet integration for all access
- Network Security Groups controlling traffic
- "Deny public access" enabled on all services
Implementation
| Service | Private Endpoint | DNS Zone (Platform Services) |
|---|---|---|
| Key Vault (if used) | privateEndpoint-vault | privatelink.vaultcore.azure.net |
| Container Registry (if used) | privateEndpoint-acr | privatelink.azurecr.io |
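As an illustration of the pattern, a Key Vault private endpoint wired to the Platform Services DNS zone might look like this (hypothetical names):

```hcl
# Hypothetical sketch: private endpoint for Key Vault with DNS zone group.
resource "azurerm_private_endpoint" "vault" {
  name                = "privateEndpoint-vault"
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"                  # hypothetical resource group
  subnet_id           = azurerm_subnet.endpoints.id  # hypothetical subnet

  private_service_connection {
    name                           = "vault-connection"
    private_connection_resource_id = azurerm_key_vault.main.id   # hypothetical vault
    subresource_names              = ["vault"]
    is_manual_connection           = false
  }

  # Registers the endpoint in privatelink.vaultcore.azure.net so the
  # vault's hostname resolves to its private IP inside the VNet.
  private_dns_zone_group {
    name                 = "vault-dns"
    private_dns_zone_ids = [data.azurerm_private_dns_zone.vault.id]
  }
}
```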
Client Connectivity Model
ExpressRoute + App Gateway + APIM
BC Gov has ExpressRoute connectivity to Azure, but AI Hub clients will not access services directly via ExpressRoute. Instead:
- App Gateway: Provides ingress, WAF protection, and SSL termination
- APIM: API management layer for authentication, rate limiting, and routing
- Private Endpoints: Backend services (AI models, storage) remain fully private
Client flow: Client → App Gateway → APIM → Private Endpoint → Azure Service
Consequences
Challenges
- GitHub Actions cannot reach private endpoints directly - Requires the Chisel SOCKS tunnel or VNet-internal access (see ADR-001)
- Local development complexity - Developers cannot access resources without VPN/Bastion
- DNS resolution - Must configure private DNS zones correctly
- Debugging difficulty - Cannot easily test from outside the network
Workarounds
- Terraform State: Use the Chisel SOCKS tunnel (GitHub-hosted runner + Azure Proxy) to access the private storage endpoint. The use_azuread_auth = true setting enables Azure AD authentication for state access.
- Development: Use Bastion + Jumpbox for all Azure resource access (see ADR-003)
- CI/CD: Use the Chisel SOCKS tunnel for full private-endpoint access during Terraform runs (see ADR-001)
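A sketch of the matching backend block, with hypothetical resource names (rg-tfstate, sttfstate); use_azuread_auth and use_oidc are standard azurerm backend arguments:

```hcl
# Hypothetical sketch: Azure AD auth for state access, OIDC token from CI.
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"    # hypothetical
    storage_account_name = "sttfstate"     # hypothetical
    container_name       = "tfstate"
    key                  = "aihub.tfstate"
    use_azuread_auth     = true            # no storage access keys needed
    use_oidc             = true            # federated credential from GitHub Actions (ADR-002)
  }
}
```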
References
ADR-005: Zero-Dependency Documentation System Accepted
Context
We needed a documentation site for the project. Options considered:
Static Site Generators
- Jekyll (Ruby)
- Hugo (Go)
- Docusaurus (Node.js)
- MkDocs (Python)
Custom Bash Build
- Header/footer partials
- Variable substitution
- ~60 lines of shell script
- Zero external dependencies
Decision
We built a custom Bash-based static site generator.
Rationale
- Portability: Runs anywhere with Bash (Linux, Mac, WSL, CI)
- No dependency management: No npm, gem, pip, go install required
- Simplicity: Anyone can understand the 60-line build script
- Speed: Builds in milliseconds
- GitHub Pages native: No special plugins or build configurations
- AI-friendly: HTML generation is trivial for AI assistants
Consequences
Positive
- Zero build dependencies to maintain or update
- Works in any environment without setup
- Easy to understand and modify
- No security vulnerabilities from npm packages
Negative
- No built-in Markdown support (write HTML directly)
- No automatic table of contents generation
- No built-in search (added custom client-side solution)
Mitigations
- Created template page with all components for easy copy-paste
- AI assistants generate HTML as easily as Markdown
- Added custom SVG viewer for diagrams
ADR-006: Terraform as Infrastructure as Code (IaC) Accepted
Context
This repo needs a repeatable, auditable way to provision and update Azure infrastructure (networking, private endpoints, RBAC, diagnostics, and PaaS services) under BC Government policy constraints.
Given the Landing Zone design (private endpoints, limited portal use, and CI/CD execution from within the VNet), we need Infrastructure as Code that supports:
- Idempotent, reviewable changes (pull requests as the change record)
- Policy-driven patterns (private endpoints, NSGs, diagnostics settings)
- Composable modules (prefer Azure Verified Modules where possible)
- Automation via GitHub Actions using OIDC and Chisel SOCKS tunnel
Options Considered
Terraform (selected)
- Large ecosystem and Azure provider support
- Strong module approach (including AVM for Terraform)
- Works well in CI/CD and supports multi-environment workflows
Alternatives
- Bicep / ARM templates
- Pulumi
- Portal-based configuration (click-ops)
Decision
We use Terraform as the primary Infrastructure as Code (IaC) tool for this repo.
Rationale
- Fits Landing Zone operations: Terraform runs cleanly on GitHub-hosted runners via the Chisel SOCKS tunnel, providing data-plane access to private endpoints when required.
- Standardization via modules: We can prefer Azure Verified Modules (AVM) for consistent, policy-aligned deployments.
- Auditable change control: Plans and applies can be gated by pull request review and CI checks.
- Multi-environment support: Variables and modules make it straightforward to deploy dev/test/prod consistently.
- Sustainability: Terraform's widespread adoption ensures long-term community and vendor support. Team members working across AWS, Azure, and OpenShift can use one IaC tool consistently across the stack.
Consequences
Positive
- Repeatable deployments - Same inputs produce the same infrastructure
- Versioned infrastructure - Git history becomes the change log
- Policy-aligned defaults - Modules can encode private endpoint and logging patterns
Negative
- Learning curve - Contributors must understand Terraform workflows
- State management - Requires careful backend configuration and access controls
- Upgrades - Provider/module version bumps require ongoing maintenance
Mitigations
- Use pinned module versions and keep provider versions explicit
- Use CI to run terraform fmt, terraform validate, and plans
- Prefer AVM modules over custom resources where feasible
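The version-pinning mitigation can be sketched as follows (the Terraform and provider versions are illustrative; the module version comes from the storage example later in this page):

```hcl
# Hypothetical sketch: explicit pins for Terraform, providers, and modules.
terraform {
  required_version = ">= 1.9.0"   # illustrative minimum
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"          # illustrative pin; bump deliberately, not implicitly
    }
  }
}

module "storage" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.6.7"               # exact module pin
  # ...module inputs elided...
}
```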
References
- Terraform Reference - Complete module, variable, and output docs
- Terraform Documentation
- Azure Verified Modules (Terraform index)
ADR-007: Client Connectivity via App Gateway + APIM Accepted
Context
BC Government Platform Services has established ExpressRoute connectivity between on-premises networks and Azure. For this AI Hub Landing Zone, the question arose: should ministry applications connect directly to AI services via ExpressRoute, or through a managed ingress layer?
Options Considered
Option A: Direct ExpressRoute Access
Clients connect directly to private endpoints via ExpressRoute.
- Lowest latency (no middlemen)
- Simpler architecture for single-tenant
- ExpressRoute is provided by Platform Services
Case-by-Case: Available upon request with justification. Requires separate security review as it bypasses WAF, rate limiting, and centralized audit logging.
Option B: App Gateway + APIM (Recommended)
All traffic flows through App Gateway and APIM before reaching backends.
- WAF protection at ingress
- Centralized authentication
- Rate limiting and quotas per tenant
- Complete audit trail
- Consistent multi-tenant isolation
Recommended: Standard path for all AI Hub clients
Decision
For a clear multi-tenant platform, we recommend all clients connect through App Gateway → APIM → Private Endpoints.
ExpressRoute connectivity is provided by Platform Services and is available. However, to support consistent security controls, audit logging, and fair resource allocation across multiple ministries, we recommend the App Gateway + APIM path as the standard connectivity model.
Traffic Flow
┌─────────────────────────────────────────────────────────────────────────┐
│                        CLIENT CONNECTIVITY MODEL                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  On-Premises (Ministry Apps)         Azure Landing Zone                 │
│                                                                         │
│  ┌──────────────┐                                                       │
│  │ Ministry App │──┐                 ┌─────────────────────────────┐    │
│  │ (Health)     │  │                 │ App Gateway                 │    │
│  └──────────────┘  │   ExpressRoute  │  • SSL Termination          │    │
│  ┌──────────────┐  ├────────────────▶│  • WAF Protection           │    │
│  │ Ministry App │  │                 │  • DDoS Mitigation          │    │
│  │ (SDPR)       │──┤                 └──────────────┬──────────────┘    │
│  └──────────────┘  │                                ▼                   │
│  ┌──────────────┐  │                 ┌─────────────────────────────┐    │
│  │ Ministry App │──┘                 │ APIM                        │    │
│  │ (Justice)    │                    │  • Authentication (API Keys)│    │
│  └──────────────┘                    │  • Rate Limiting            │    │
│                                      │  • Request Validation       │    │
│                                      │  • Ministry Routing         │    │
│                                      │  • Audit Logging            │    │
│                                      └──────────────┬──────────────┘    │
│                                                     ▼                   │
│                                      ┌─────────────────────────────┐    │
│                                      │ Private Endpoints           │    │
│                                      │  • AI Foundry               │    │
│                                      │  • Azure OpenAI             │    │
│                                      │  • AI Search                │    │
│                                      │  • Storage (RAG docs)       │    │
│                                      └─────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘
Security Controls at Each Layer
| Layer | Security Control | Purpose |
|---|---|---|
| App Gateway | Web Application Firewall (WAF) | Block OWASP top 10, SQL injection, XSS |
| App Gateway | SSL/TLS Termination | Enforce HTTPS, manage certificates |
| App Gateway | DDoS Protection | Mitigate volumetric attacks |
| APIM | Subscription Keys | Identify and authenticate ministry |
| APIM | Rate Limiting | Prevent abuse, ensure fair usage |
| APIM | Request Validation | Validate payload structure |
| APIM | Audit Logging | Track all requests with ministry context |
| Private Endpoints | Network Isolation | Backend services unreachable from internet |
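The per-ministry subscription keys in the APIM rows above could be provisioned in Terraform along these lines (hypothetical names):

```hcl
# Hypothetical sketch: one APIM subscription per ministry, so every request
# carries a key that identifies the tenant for auth, rate limits, and billing.
resource "azurerm_api_management_subscription" "health" {
  api_management_name = azurerm_api_management.main.name   # hypothetical APIM instance
  resource_group_name = "rg-ai-hub"                        # hypothetical resource group
  display_name        = "ministry-health"
  state               = "active"
}
```

Rate-limit and routing policies can then reference the subscription to enforce per-tenant quotas.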
Consequences
Positive
- Defense in depth - Multiple security layers before reaching backends
- Centralized policy - All clients subject to same rules
- Audit compliance - Every request logged with full context
- Flexibility - Can update policies without changing backends
- Cost attribution - Can track usage per ministry via APIM metrics
Negative
- Added latency - Two extra hops (App Gateway + APIM)
- Cost - App Gateway and APIM have significant monthly costs
- Complexity - More components to configure and maintain
Mitigations
- Latency is typically <10ms additional per hop
- Costs are shared across all ministries (per ADR-010)
- Infrastructure-as-code manages complexity
References
ADR-008: No Azure Portal or Foundry Studio UI Access Pending Platform/MS
Context
Microsoft's standard approach to AI Landing Zones assumes users will manage resources through:
- Azure Portal (portal.azure.com)
- AI Foundry Studio (ai.azure.com)
- Azure Machine Learning Studio
All of these require public endpoint access to the Azure services being managed.
The Problem: UI Requires Public Endpoints
User Browser (Public Internet)
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ ai.azure.com / portal.azure.com (Public Endpoint) │
│ │
│ "To manage your AI Foundry project, we need to reach │
│ your Azure resources over the public internet" │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Your AI Foundry / Storage / Search │ │
│ │ │ │
│ │ ❌ PUBLIC ENDPOINT REQUIRED │ │
│ │ BC Gov Policy: DENIED │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Decision
No browser-based UI access is supported for tenant management.
All resource provisioning and management must occur through:
- Terraform/AVM modules - Infrastructure as Code
- Azure CLI - Via Chisel tunnel (CI/CD) or Jumpbox (admin only)
- REST APIs - Via APIM with subscription keys
What This Means for Tenants
Can Do (View Only)
- Browse to portal.azure.com
- See resource groups and resources
- View configurations (read-only)
- Navigate AI Foundry Studio UI
- See project structure
Cannot Do (UI Blocked)
- Create/modify resources via Portal
- Upload documents in Foundry Studio
- Test models in Foundry playground
- Configure settings via web forms
- Any operation requiring service connection
Supported Methods
- Request resources via Terraform PR
- Upload documents via API (APIM)
- Query AI models via API (APIM)
- Automate via CLI in pipelines
- Use SDK from within VNet
Why Does This Happen?
When you click "Upload Document" in Foundry Studio, the browser (on public internet) tries to connect directly to your Storage Account. But your Storage Account has no public endpoint - it only accepts connections from within the VNet via Private Endpoint.
Browser (Public) → Storage Account (Private Only) = ❌ Connection Refused
Pipeline (VNet) → Storage Account (Private EP) = ✅ Works
The UI shows the resource exists, but can't actually interact with it.
Why Not Provide Bastion Access to Everyone?
This was considered and rejected:
| Approach | Problem |
|---|---|
| Bastion + VM per tenant | Not scalable (20 ministries = 20 VMs = $$$), security nightmare |
| Shared Jumpbox for all | Multi-tenant isolation violated, credential management chaos |
| VPN per tenant | Massive operational overhead, not self-service |
Result: Bastion/Jumpbox is for platform team administration only. Tenants interact via API.
AVM Module Requirement
Because all management must be IaC-based, only services with Azure Verified Modules (AVM) are supported.
Supported AVM Modules
- cognitiveservices-account
- machinelearningservices-workspace
- search-searchservice
- storage-storageaccount
- documentdb-databaseaccount
- keyvault-vault
- containerservice-managedcluster
- containerregistry-registry
- apimanagement-service
Full catalog: azure.github.io/Azure-Verified-Modules
Path Forward: Secure UI Access
Consequences
Positive
- Policy compliant - No public endpoints ever exposed
- Reproducible - All infrastructure is code, auditable, version controlled
- Scalable - Onboard 100 tenants with same process as 1
- Secure - No browser-based attack surface
Negative
- Steeper learning curve - Tenants must use IaC, not click-ops
- No visual management - Can't "see" resources in Portal
- Debugging harder - Must use CLI/API, not browser dev tools
- Microsoft disconnect - Their guidance assumes UI access
References
ADR-009: Why AI Landing Zone vs Custom Solution Accepted
Context
A valid question arises: If we're customizing everything for BC Gov requirements anyway, why use Microsoft's AI Landing Zone at all? Why not just build our own custom solution from scratch?
This ADR explains what value the AI Landing Zone and Azure Verified Modules (AVM) actually provide, even when we can't use Microsoft's default assumptions.
What AI Landing Zone Actually Provides
The Real Value: AVM Modules
Building From Scratch
If we wrote our own Terraform modules:
- Write 1000+ lines of Terraform per service
- Handle every Azure API change ourselves
- Debug edge cases Microsoft already solved
- Maintain security patches ourselves
- No community validation or review
- Reinvent private endpoint patterns
- Figure out RBAC configurations
- Test against every Azure region
Using AVM Modules wherever possible
With Azure Verified Modules:
- ~50 lines of Terraform to deploy a service
- Microsoft maintains API compatibility
- Edge cases handled by module maintainers
- Security updates pushed automatically
- Community tested, Microsoft validated
- Private endpoint patterns built-in
- RBAC best practices included
- Tested across all Azure regions
What We Get From AI Landing Zone Reference
| Component | What Microsoft Provides | What We Customize |
|---|---|---|
| Network Topology | Hub-spoke pattern, subnet sizing guidance, NSG rule templates | IP ranges, Canada regions, Platform Services DNS integration |
| AI Foundry Setup | AVM module for workspace, project structure, compute patterns | Disable public access, multi-tenant project isolation |
| Private Endpoints | Patterns for connecting services privately, DNS zone integration | Link to Platform Services DNS, IP allocation per tenant |
| APIM Integration | AVM module, backend pool patterns, policy templates | Subscription per tenant, OpenAPI routing, rate limits |
| Security Baseline | RBAC templates, managed identity patterns, Key Vault integration | BC Gov RBAC requirements, ministry-level isolation |
AVM Module Maturity - Honest Assessment
| Module | Status | Maturity | Notes |
|---|---|---|---|
| avm-res-storage-storageaccount | Released | Production Ready | Mature, well-tested |
| avm-res-keyvault-vault | Released | Production Ready | Mature, well-tested |
| avm-res-network-virtualnetwork | Released | Production Ready | Mature, well-tested |
| avm-res-network-applicationgateway | Released | Production Ready | Mature, well-tested |
| avm-res-network-bastionhost | Released | Production Ready | Mature, well-tested |
| avm-res-documentdb-databaseaccount | Released | Production Ready | Cosmos DB - mature |
| avm-res-apimanagement-service | Pending | Early Release | v0.0.5 - may need custom work |
| avm-res-cognitiveservices-account | Pending | Maturing | OpenAI, Doc Intel - verify features |
| avm-res-search-searchservice | Pending | Maturing | AI Search - verify private EP support |
| avm-res-machinelearningservices-workspace | Pending | Maturing | Core resource for Foundry Hub/Project |
| avm-ptn-aiml-ai-foundry | In Development | Not Production Ready | Pattern module - active development |
| avm-ptn-ai-foundry-enterprise | Archived | Abandoned | Archived July 2025 - do not use |
| avm-res-containerservice-managedcluster | Pending | Pre-release | AKS - v0.4.0-pre2 |
What This Means
Safe to Use AVM
- Virtual Networks, Subnets, NSGs
- Storage Accounts
- Key Vault
- Bastion Host
- App Gateway
- Cosmos DB
Evaluate / May Need Custom
- AI Foundry (use resource module, not pattern)
- APIM (early version, test thoroughly)
- Cognitive Services (verify private EP)
- AI Search (verify features needed)
- AKS (pre-release)
For AI Foundry, use the `avm-res-machinelearningservices-workspace` resource module rather than the pattern modules. Be prepared to write custom Terraform for AI-specific configurations that AVM doesn't yet support.
Concrete Example: Storage Account
Without AVM (Custom from scratch)
```hcl
# ~200 lines of Terraform to handle:
resource "azurerm_storage_account" "main" { ... }
resource "azurerm_storage_account_network_rules" "main" { ... }
resource "azurerm_private_endpoint" "blob" { ... }
resource "azurerm_private_endpoint" "file" { ... }
resource "azurerm_private_endpoint" "queue" { ... }
resource "azurerm_private_dns_zone_virtual_network_link" "blob" { ... }
resource "azurerm_role_assignment" "contributor" { ... }
resource "azurerm_role_assignment" "reader" { ... }
resource "azurerm_monitor_diagnostic_setting" "main" { ... }
# Plus encryption, lifecycle policies, soft delete, versioning...
# Plus handling for every Azure API version change...
```
With AVM Module
```hcl
module "storage" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.6.7"

  name                = "ministryhealthstorage"
  resource_group_name = azurerm_resource_group.ministry.name
  location            = "canadacentral"

  # Private endpoints - one line enables the pattern
  private_endpoints = {
    blob = { subnet_id = module.network.subnet_ids["private-endpoints"] }
  }

  # All the security, RBAC, monitoring handled by module
  tags = var.common_tags
}
```
Result: Same outcome, 10x less code, maintained by Microsoft.
Why Not 100% Custom?
Custom = Maintenance Burden
- Azure releases ~100 API changes/month
- Each change could break custom modules
- Security vulnerabilities need patching
- New features require manual implementation
- Team turnover = knowledge loss
- Who maintains this in 3 years?
AVM = Shared Maintenance
- Microsoft + community maintain modules
- API changes handled upstream
- Security patches published as versions
- New features added automatically
- Documentation maintained centrally
- Sustainable long-term
What We're Actually Doing
Our Approach: AVM + BC Gov Customization Layer
```
┌──────────────────────────────────────────────────────────┐
│                BC Gov AI Hub Architecture                │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │       BC Gov Customization Layer (Our Code)        │  │
│  │  • Multi-tenant resource group structure           │  │
│  │  • APIM subscription per ministry                  │  │
│  │  • IP allocation policies                          │  │
│  │  • Platform Services DNS integration               │  │
│  │  • Canada data residency enforcement               │  │
│  └─────────────────────────┬──────────────────────────┘  │
│                            ▼                             │
│  ┌────────────────────────────────────────────────────┐  │
│  │   Azure Verified Modules (Microsoft Maintained)    │  │
│  │  • avm-res-cognitiveservices-account               │  │
│  │  • avm-res-machinelearningservices-workspace       │  │
│  │  • avm-res-storage-storageaccount                  │  │
│  │  • avm-res-network-virtualnetwork                  │  │
│  │  • avm-res-apimanagement-service                   │  │
│  │  • ... 100+ more modules                           │  │
│  └─────────────────────────┬──────────────────────────┘  │
│                            ▼                             │
│  ┌────────────────────────────────────────────────────┐  │
│  │            Azure Resource Manager APIs             │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
We write the thin customization layer. Microsoft maintains the heavy lifting.
Decision
We use AI Landing Zone reference architecture and AVM modules as our foundation, with a BC Gov customization layer on top.
We do NOT use Microsoft's default configuration. We use their:
- Reference patterns - How to wire services together
- AVM modules - Tested, maintained infrastructure components
- Best practices - Security, networking, RBAC patterns
Consequences
Positive
- Reduced maintenance - Microsoft maintains 90% of the code
- Faster development - Use proven patterns instead of inventing
- Security updates - Get patches via module version bumps
- Community support - Issues/bugs reported and fixed by others
- Audit trail - Using "official" modules helps with compliance
Negative
- Module constraints - Can only do what AVM modules support
- Version management - Must track and update module versions
- Abstraction leakage - Sometimes need to understand module internals
References
ADR-010: Multi-Tenant Isolation Model Accepted
Context
The AI Hub Landing Zone is designed to serve multiple BC Government ministries from a single shared infrastructure deployment. This creates a multi-tenancy challenge: how do we provide cost-efficient shared services while ensuring strict data isolation between ministries?
The Problem: Shared Infrastructure, Isolated Data
What We Cannot Do
- Allow Ministry A to access Ministry B's documents
- Share AI search indexes across ministries
- Use single storage accounts for all ministry data
- Allow cross-ministry API access without authorization
What We Can Share
- Network infrastructure (VNets, Bastion, NSGs)
- Compute resources (AI Foundry Hub)
- Ingress layer (App Gateway, APIM)
- Monitoring and logging infrastructure
Decision
We implement a four-layer isolation model:
1. Storage Isolation
Separate storage accounts per ministry
- Each ministry gets dedicated blob containers
- Azure RBAC restricts access to ministry principals
- Private endpoints per storage account
- Encryption keys can be ministry-specific (CMK)
2. AI Search Index Isolation
Separate search indexes per ministry
- Each ministry's documents indexed separately
- RAG queries scoped to ministry index only
- No cross-index queries permitted
- Index-level access control via Azure RBAC
3. API Isolation (APIM)
APIM subscriptions per ministry
- Unique subscription keys per ministry
- Rate limiting scoped to subscription
- API policies enforce ministry context
- Audit logging includes ministry identifier
4. Network Isolation (NSGs)
Network policies enforce boundaries
- NSG rules restrict subnet-to-subnet traffic
- Private endpoints isolated by ministry where needed
- No direct cross-ministry network paths
- All traffic routed through controlled ingress
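The four layers above share one principle: the tenant context is resolved server-side and never taken from the request body. A minimal Python sketch of that rule for layer 2 (index isolation) follows; the ministry keys and index names are hypothetical, not the platform's actual configuration.

```python
# Hypothetical ministry → index mapping; real names would come from
# per-tenant Terraform outputs, not a hard-coded dict.
MINISTRY_INDEXES = {
    "health": "idx-health-docs",
    "sdpr": "idx-sdpr-docs",
    "justice": "idx-justice-docs",
}

def resolve_index(ministry_id: str) -> str:
    """Return the only index this ministry may query; deny anything else."""
    try:
        return MINISTRY_INDEXES[ministry_id]
    except KeyError:
        raise PermissionError(f"unknown ministry: {ministry_id}")

def build_rag_query(ministry_id: str, question: str) -> dict:
    # The index is derived from the authenticated ministry context, so a
    # request can never name another ministry's index (no cross-index queries).
    return {"index": resolve_index(ministry_id), "search": question}
```

Because the caller supplies only a question, a compromised client in one ministry cannot redirect a query at another ministry's index.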
Implementation Architecture
```
┌────────────────────────────────────────────────────────────┐
│                MULTI-TENANT ISOLATION MODEL                │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Ministry Health     Ministry SDPR     Ministry Justice    │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │ APIM Sub: H │    │ APIM Sub: S │    │ APIM Sub: J │     │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘     │
│         │                  │                  │            │
│         ▼                  ▼                  ▼            │
│  ┌──────────────────────────────────────────────────┐      │
│  │        Shared APIM (Policy Enforcement)          │      │
│  │   (validates subscription → routes to ministry)  │      │
│  └──────────────────────────────────────────────────┘      │
│         │                  │                  │            │
│         ▼                  ▼                  ▼            │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │ Storage: H  │    │ Storage: S  │    │ Storage: J  │     │
│  │ Index: H    │    │ Index: S    │    │ Index: J    │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
│                                                            │
│  ════════════════════════════════════════════════════      │
│              Shared Infrastructure Layer                   │
│    (VNet, Bastion, AI Foundry Compute, Monitoring)         │
│  ════════════════════════════════════════════════════      │
│                                                            │
└────────────────────────────────────────────────────────────┘
```
Consequences
Positive
- Strong data isolation - Ministry data never co-mingled
- Cost efficiency - Shared compute and network infrastructure
- Audit compliance - Clear ministry attribution in all logs
- Scalable onboarding - New ministries get isolated resources automatically
- Flexible isolation levels - Can increase isolation (dedicated compute) if needed
Negative
- Resource multiplication - Each ministry needs separate storage/indexes
- Complexity - More resources to manage and monitor
- Cost per ministry - Base cost increases with each ministry onboarded
Cost Tracking
Multi-tenant isolation also enables per-tenant cost tracking and chargeback. See the detailed cost allocation strategy:
References
- Cost Tracking Documentation - Full cost allocation strategy
- Multi-Tenant Isolation Diagram
- Azure Multi-Tenant Architecture Guide
- AI Request Data Flow Diagram
ADR-011: Control Plane vs Data Plane Access & Chisel Tunnel Pending
Context
Azure services have two distinct access paths that are often confused:
Control Plane (ARM APIs)
What: Managing Azure resources - create, update, delete, configure
Endpoint: management.azure.com (always public)
Authentication: OIDC tokens, Service Principals, Managed Identity
Examples:
- Creating a Key Vault
- Configuring a Storage Account
- Setting up Private Endpoints
- RBAC role assignments
- Azure Portal UI (viewing resources)
Data Plane (Service-specific APIs)
What: Accessing data inside resources
Endpoint: *.vault.azure.net, *.blob.core.windows.net, etc.
Authentication: Same tokens, BUT requires network access
Examples:
- Reading/writing Key Vault secrets
- Reading/writing Storage blobs
- Querying databases (PostgreSQL, CosmosDB)
- Calling Azure OpenAI APIs
- Terraform state file read/write
The Problem: Private Endpoints Block Data Plane
```
┌─────────────────────────────────────────────────────────────────┐
│                  WHAT WORKS vs WHAT'S BLOCKED                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ From Public Internet (GitHub Actions, Azure Portal, Your        │
│ Laptop):                                                        │
│                                                                 │
│ ✅ CONTROL PLANE (management.azure.com)                         │
│    • Create Key Vault              → ARM API → Works            │
│    • Create Storage Account        → ARM API → Works            │
│    • View resource properties in Portal → ARM API → Works       │
│    • OIDC authentication           → Azure AD → Works           │
│                                                                 │
│ ❌ DATA PLANE (*.vault.azure.net, *.blob.core.windows.net)      │
│    • Read Key Vault secret         → Private Endpoint → BLOCKED │
│    • Write Storage blob            → Private Endpoint → BLOCKED │
│    • "View Secret Value" in Portal → Private Endpoint → BLOCKED │
│    • Terraform state read/write    → Private Endpoint → BLOCKED │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ From Inside VNet (Chisel Tunnel, Jumpbox, or Optional           │
│ Self-hosted Runners):                                           │
│                                                                 │
│ ✅ CONTROL PLANE → Still works (ARM is public)                  │
│ ✅ DATA PLANE    → Works via Private Endpoints (network path    │
│                    exists)                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Why Azure Portal Shows Resources But Not Data
The Azure Portal is a Control Plane UI. When you browse to a Key Vault in the portal:
- ✅ You can see the vault exists (control plane: list resources)
- ✅ You can see its configuration (control plane: read properties)
- ❌ You CANNOT click "Show Secret Value" (data plane: blocked by private endpoint)
This is why ADR-008 states "No Azure Portal or Foundry Studio UI Access" for data operations - the portal physically cannot reach private data plane endpoints from your browser.
Why OIDC Is Used for Control Plane
OIDC (OpenID Connect) federation provides passwordless authentication from GitHub Actions to Azure:
```
┌───────────────────────────────────────────────────────────────────┐
│                     OIDC AUTHENTICATION FLOW                      │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  1. GitHub Actions workflow starts                                │
│     ↓                                                             │
│  2. GitHub OIDC Provider issues JWT token                         │
│     (claims: repo, branch, environment, workflow)                 │
│     ↓                                                             │
│  3. Token sent to Azure AD                                        │
│     ↓                                                             │
│  4. Azure AD validates against Federated Credential               │
│     (checks: issuer=GitHub, subject=repo:bcgov/ai-hub-tracking:…) │
│     ↓                                                             │
│  5. Azure AD issues Access Token (valid ~1 hour)                  │
│     ↓                                                             │
│  6. Access Token used for ARM API calls (Control Plane)           │
│     ↓                                                             │
│  ✅ terraform plan/apply (resource management)                    │
│  ✅ az cli commands (control plane operations)                    │
│  ✅ No secrets stored in GitHub!                                  │
│                                                                   │
│  BUT: OIDC gives credentials, NOT network access.                 │
│  Data plane still blocked without VNet connectivity.              │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```
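Step 4 of the flow is the part that makes this passwordless: Azure AD only exchanges the GitHub JWT for an access token when the token's claims match the registered federated credential. The sketch below models that matching rule; it is illustrative only (Azure AD's real validation also checks signatures, audience, and expiry), and the subject string is an example value.

```python
# Registered federated credential (example values, as in the flow above).
FEDERATED_CREDENTIAL = {
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:bcgov/ai-hub-tracking:ref:refs/heads/main",
}

def claims_match_credential(claims: dict) -> bool:
    """Accept the GitHub JWT only if issuer and subject match exactly."""
    return (
        claims.get("iss") == FEDERATED_CREDENTIAL["issuer"]
        and claims.get("sub") == FEDERATED_CREDENTIAL["subject"]
    )
```

A workflow running in a fork or on another branch presents a different `sub` claim and is refused, which is why no stored secret is needed.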
Decision: Multiple Access Methods for Different Needs
We provide four access methods, three of them toggleable via Terraform, each serving a different use case:
| Access Method | Terraform Toggle | Who Uses It | Use Case | Plane Access |
|---|---|---|---|---|
| Self-Hosted Runners | `github_runners_aca_enabled` | Other tenant repos | Optional: persistent VNet compute for CI/CD workloads that can't use Chisel | Control + Data |
| Bastion + Jumpbox | `enable_bastion`, `enable_jumpbox` | Platform Maintainers | Emergency debugging, manual admin tasks | Control + Data |
| Chisel Tunnel | `enable_azure_proxy` | Platform Maintainers | Local dev access to private databases/APIs | Control + Data |
| Public GitHub Runners | (default) | CI/CD (limited) | Control plane only operations | Control only |
Chisel Tunnel: Data Plane Access for Platform Maintainers
What is Chisel? A fast TCP/UDP tunnel over HTTP that creates a secure reverse proxy from your local machine into the Azure VNet.
```
┌────────────────────────────────────────────────────────────────┐
│                   CHISEL TUNNEL ARCHITECTURE                   │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Platform Maintainer's Laptop                                  │
│  ┌─────────────────────────────┐                               │
│  │  Chisel Client (Docker)     │                               │
│  │  Listens on localhost:5432  │                               │
│  └──────────────┬──────────────┘                               │
│                 │ HTTPS (encrypted)                            │
│                 ▼                                              │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Azure App Service (Chisel Server)                       │  │
│  │  Inside VNet (app-service-subnet)                        │  │
│  │  Random auth: tunnel:XXXXXXXX                            │  │
│  └──────────────┬───────────────────────────────────────────┘  │
│                 │ VNet Integration (Private Network)           │
│                 ▼                                              │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Private Endpoints                                       │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │  │
│  │  │  PostgreSQL  │  │  Key Vault   │  │  CosmosDB    │    │  │
│  │  │  :5432       │  │  :443        │  │  :443        │    │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘    │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
│  Result: psql -h localhost -p 5432 → tunnels to private        │
│  PostgreSQL                                                    │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
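In practice a maintainer runs the chisel client (typically in Docker) with a local-port-to-private-endpoint mapping. The helper below just assembles that invocation as a list of arguments; the server URL and host names are placeholders, not the deployed values.

```python
def chisel_client_args(server_url: str, auth: str, local_port: int,
                       remote_host: str, remote_port: int) -> list:
    """Build argv for a chisel client that forwards a local port into the VNet."""
    return [
        "chisel", "client",
        "--auth", auth,  # e.g. "tunnel:XXXXXXXX" (random per deployment)
        server_url,      # the Chisel server App Service endpoint
        f"{local_port}:{remote_host}:{remote_port}",  # local → private endpoint
    ]

# Example: expose a private PostgreSQL endpoint on localhost:5432
# (placeholder URL and host, not the real deployment).
args = chisel_client_args(
    "https://chisel-proxy.example.azurewebsites.net",
    "tunnel:XXXXXXXX", 5432, "postgres.example.private", 5432,
)
```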
What Terraform Operations Need Data Plane?
Most Terraform operations are control plane only. Data plane is only needed when:
| Resource/Operation | Plane | Works from Public? | Example |
|---|---|---|---|
| `azurerm_key_vault` (create) | Control | ✅ Yes | Creating the vault itself |
| `azurerm_key_vault_secret` (write) | Data | ❌ No | Writing secrets INTO the vault |
| `data "azurerm_key_vault_secret"` | Data | ❌ No | Reading secrets FROM the vault |
| Terraform state backend (Storage) | Data | ❌ No | Reading/writing `.tfstate` blob |
| `azurerm_storage_account` (create) | Control | ✅ Yes | Creating the account |
| `azurerm_storage_blob` (upload) | Data | ❌ No | Uploading files to storage |
| `azurerm_private_endpoint` | Control | ✅ Yes | Creating the private endpoint |
| RBAC role assignments | Control | ✅ Yes | Granting permissions |
Access Model Summary
```
┌─────────────────────────────────────────────────────────────────────┐
│                        COMPLETE ACCESS MODEL                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  TENANT DEVELOPERS (Ministry Teams)                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌───────────────────┐        │
│  │ Their Apps  │───▶│ App Gateway │───▶│ APIM (rate        │──▶ AI  │
│  │             │    │ + WAF       │    │ limited, metered) │   APIs │
│  └─────────────┘    └─────────────┘    └───────────────────┘        │
│                            ▲                                        │
│                            │ Public endpoint (by design)            │
│                            │ No direct private endpoint access      │
│                                                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  PLATFORM MAINTAINERS (This Team)                                   │
│                                                                     │
│  ┌──────────────────────┬────────────────────────────┬───────────┐  │
│  │ Method               │ Toggle Variable            │ Use Case  │  │
│  ├──────────────────────┼────────────────────────────┼───────────┤  │
│  │ Self-hosted Runners  │ github_runners_aca_enabled │ Optional  │  │
│  │                      │                            │ tenant    │  │
│  │                      │                            │ CI/CD     │  │
│  │ Bastion + Jumpbox    │ enable_bastion/jumpbox     │ Admin     │  │
│  │                      │                            │ access    │  │
│  │ Chisel Tunnel        │ enable_azure_proxy         │ Local dev │  │
│  │                      │                            │ access    │  │
│  └──────────────────────┴────────────────────────────┴───────────┘  │
│                                                                     │
│  All three provide: Control Plane + Data Plane access               │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
Consequences
Positive
- Clear mental model - Understanding control vs data plane explains "why" behind many decisions
- Flexible access - Enable only what you need (cost optimization)
- Policy compliant - No public data plane endpoints ever
- Secure tenant isolation - Tenants use APIM, never touch private endpoints directly
Negative
- Complexity - Must understand two planes, not just "Azure access"
- Chisel proxy dependency - Terraform workflows require the Azure Proxy App Service to be healthy before running
- Portal limitations - Can't "see" data in Portal even with permissions
References
ADR-012: Hybrid Cost Allocation & Usage Monitoring Strategy Proposed
Context
The AI Services Hub operates as a multi-tenant platform serving multiple ministries. This shared infrastructure creates a financial governance challenge: we must accurately allocate costs to specific tenants to ensure accountability and cost recovery.
The architecture includes two types of resources:
- Tenant-Dedicated: Resources used exclusively by one tenant (e.g., Azure OpenAI instances, Document Intelligence, Cosmos DB).
- Shared Infrastructure: Resources serving all tenants simultaneously (e.g., APIM, App Gateway, Application Insights).
We need a standardized strategy to track consumption and generate accurate chargeback invoices that account for both resource types across our dual-region deployment (Canada Central & East).
Decision
We will adopt a Hybrid Cost Allocation Model that combines native Azure billing with custom usage tracking:
- Direct Attribution (90% of costs): We will isolate high-cost resources (AI Foundry Projects, Document Intelligence) into tenant-specific resource groups. These will be billed directly to tenants using Azure Cost Management tags (`tenant-id`), requiring no manual calculation.
- Proportional Allocation (10% of costs): We will split the cost of shared infrastructure (APIM, App Gateway) based on actual usage metrics.
  - APIM & Gateway costs split by API Request Volume.
  - Monitoring costs split by Log Ingestion Volume.
- Centralized Tracking via APIM: Azure API Management will serve as the single source of truth for usage metrics. All traffic must flow through APIM to ensure consistent tenant identification and logging.
- Custom Egress Calculation: We will implement a custom calculation pipeline using App Gateway logs to track and charge for cross-region network egress (between Canada Central and Canada East).
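The proportional split is plain ratio arithmetic. The sketch below shows its shape with made-up request counts and a made-up monthly cost; the platform's actual chargeback logic lives in the custom Python/Kusto pipeline, not this snippet.

```python
def split_shared_cost(total_cost: float, requests_by_tenant: dict) -> dict:
    """Allocate a shared cost across tenants in proportion to request volume."""
    total_requests = sum(requests_by_tenant.values())
    if total_requests == 0:
        # No usage this billing period: fall back to an even split.
        even = total_cost / len(requests_by_tenant)
        return {tenant: round(even, 2) for tenant in requests_by_tenant}
    return {
        tenant: round(total_cost * count / total_requests, 2)
        for tenant, count in requests_by_tenant.items()
    }

# Example: $3,000 of shared APIM/App Gateway cost, split 60/30/10
# by request volume (figures invented for illustration).
shares = split_shared_cost(
    3000.0,
    {"health": 600_000, "sdpr": 300_000, "justice": 100_000},
)
```

A tenant that sends 60% of the month's requests pays 60% of the shared bill, which is the fairness argument made in the Rationale below.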
Rationale
- Minimizes Operational Overhead: By using Direct Attribution for the most expensive resources (OpenAI tokens, Search), we rely on Azure's native billing engine for the vast majority of chargebacks. We only maintain custom logic for the shared platform components.
- Fairness: A flat fee for shared infrastructure would be unfair to smaller tenants. Proportional allocation based on request volume ensures ministries only pay for the platform capacity they actually consume.
- Granular Visibility: Using APIM as the central logging point allows us to capture operational metrics (latency, errors, token usage) alongside billing data without adding sidecar proxies to every service.
Consequences
Positive
- Transparency: Tenants can verify their direct Azure costs in the portal using their tenant tag.
- Scalability: New tenants can be onboarded simply by adding tags; the cost model adjusts automatically.
- Cost Recovery: Ensures the Platform Team fully recovers infrastructure costs rather than absorbing the overhead of shared services.
Negative
- Complexity of Egress: Cross-region data transfer is billed at the subscription level and is difficult to attribute. We accept the tradeoff of maintaining a custom Kusto query to calculate this specific cost.
- Maintenance: The logic for splitting shared costs (Python functions/Kusto queries) is custom code that must be maintained and verified monthly against Azure invoices.
ADR-013: Scaled Stack Architecture with Isolated State Files Accepted
Context
The AI Services Hub infrastructure was originally deployed as a single Terraform root module with one monolithic state file containing ~174 resources. This created several operational problems:
- Blast radius: Any Terraform error or state corruption could affect all 174 resources simultaneously.
- Lock contention: Only one Terraform operation could run at a time across the entire infrastructure, even for independent resources.
- Serial execution: Foundry projects required `parallelism=1` due to Azure ETag conflicts, which forced the entire apply to run serially.
- Apply duration: A full apply took 5m 44s because independent stacks (APIM, foundry, tenant-user-mgmt) had to wait for each other.
- Phased targeting: The deployment script used `-target` flags to orchestrate a multi-phase apply (Phase 1: everything except foundry, Phase 2: foundry only, Phase 3: validation). This was fragile and hid dependency issues.
Decision
We split the monolithic root module into 5 isolated Terraform stacks, each with its own state file, backend configuration, and dependency management via `terraform_remote_state` data sources:
| Stack | State Key | Phase | Purpose |
|---|---|---|---|
| `shared` | `shared.tfstate` | 1 (serial) | VNet, subnets, AI Foundry Hub, App Gateway, WAF, Key Vault, ACR, monitoring |
| `tenant` | `tenant-{key}.tfstate` | 2 (parallel fan-out) | Per-tenant resources: AI Search, CosmosDB, Document Intelligence, Storage, Key Vault |
| `foundry` | `foundry.tfstate` | 3 (parallel) | AI Foundry projects per tenant (`parallelism=1` to avoid ETag conflicts) |
| `apim` | `apim.tfstate` | 3 (parallel) | API Management gateway, policies, tenant subscriptions, role assignments |
| `tenant-user-mgmt` | `tenant-user-management.tfstate` | 3 (parallel) | Entra ID user/group assignments (requires Graph API permissions) |
A stack engine (`deploy-scaled.sh`) orchestrates execution in dependency order:
- Phase 1: `shared` runs first (all other stacks depend on it).
- Phase 2: `tenant` runs per-tenant in parallel (each tenant gets isolated `TF_DATA_DIR` and state).
- Phase 3: `foundry`, `apim`, and `tenant-user-mgmt` run concurrently (no cross-dependencies between them).
For destroy, the order is reversed: Phase 3 stacks first (parallel), then tenants (parallel), then shared last.
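The apply/destroy ordering can be modeled in a few lines. This Python sketch only mirrors the phase logic of `deploy-scaled.sh` (which is a Bash script); the tenant keys are placeholders.

```python
# Phase groups in dependency order; stacks within a phase may run in parallel.
PHASES = [
    ["shared"],                               # Phase 1: serial foundation
    ["tenant:health", "tenant:sdpr"],         # Phase 2: per-tenant fan-out
    ["foundry", "apim", "tenant-user-mgmt"],  # Phase 3: concurrent, independent
]

def plan(destroy: bool = False) -> list:
    """Flatten stacks into execution order; destroy reverses the phases."""
    phases = list(reversed(PHASES)) if destroy else PHASES
    return [stack for phase in phases for stack in phase]
```

Reversing only the phase list (not the stacks inside a phase) preserves the rule that `shared` is created first and destroyed last.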
Rationale
- Reduced blast radius: A state corruption or failed apply in `apim` cannot affect `shared` or `tenant` resources. Each stack can be independently recovered.
- Parallel execution: Phase 3 stacks have no dependencies on each other and can run concurrently, reducing total apply time by ~60 seconds.
- Isolated parallelism: The `foundry` stack runs with `parallelism=1` without forcing the entire infrastructure to be serial. APIM and tenant-user-mgmt run with full parallelism simultaneously.
- Independent state locking: Multiple operators or CI pipelines can work on different stacks without lock contention (e.g., updating APIM policies while a tenant deployment is in progress).
- Cleaner dependencies: Using `terraform_remote_state` data sources makes cross-stack dependencies explicit and typed, replacing implicit module-to-module references.
Consequences
Positive
- 19% faster applies: Total apply time reduced from 5m 44s to 4m 39s (measured on test environment with 2 tenants).
- Eliminated `-target` phasing: The old Phase 1/2/3 with `-target` flags is replaced by natural stack ordering. No more fragile target expressions.
- Auto-recovery: The stack engine includes deposed object cleanup, import-on-conflict, and transient error retry — previously only available at the monolith level.
- Live output streaming: Each stack streams Terraform output in real time via `tee`, improving debuggability in CI/CD logs.
- Per-tenant state isolation: Each tenant has its own state file, making tenant onboarding/offboarding a state-level operation rather than a resource-level one.
Negative
- More files: 5 stacks × 5 standard files (main.tf, variables.tf, outputs.tf, providers.tf, backend.tf) = 25 files vs. the original 6. Some variable declarations are duplicated across stacks.
- State migration required: One-time migration from the monolith state to 5 stack states using `terraform state mv`. This was performed manually with verification scripts.
- Cross-stack refactoring: Moving a resource between stacks requires a state move operation, not just a code move. This adds operational complexity for future refactors.
- Remote state coupling: Stacks are coupled via `terraform_remote_state` data sources. Adding a new output in `shared` that `apim` needs requires deploying `shared` first.
References
- Terraform Remote State Data Source
- `infra-ai-hub/stacks/` — Stack root modules
- `infra-ai-hub/scripts/deploy-scaled.sh` — Stack engine
- `infra-ai-hub/scripts/deploy-terraform.sh` — Public entrypoint
ADR-014: APIM Subscription Key Rotation Proposed
This decision has been made by the AI Services Hub team and all infrastructure is in place. Final approval from the Security Threat and Risk Assessment (STRA) process is pending. The rotation mechanism can be enabled per-environment via a single config flag.
Context
APIM subscription keys authenticate tenant API calls to the AI Services Hub gateway. Without rotation, these long-lived secrets present a growing risk:
- Credential staleness: Keys provisioned at tenant onboarding remain valid indefinitely unless manually changed. A compromised key grants persistent access.
- No expiry enforcement: Azure APIM subscription keys have no built-in TTL or auto-expiry. The platform must implement rotation externally.
- BC Gov compliance: Government security policy expects secrets to be rotated periodically. The rotation interval and mechanism require STRA sign-off before production use.
- Multi-tenant blast radius: Each tenant has isolated subscription keys, but without rotation a single leaked key provides indefinite access to that tenant’s API surface.
Decision
We implement an alternating primary/secondary key rotation pattern with zero downtime, driven by a Container App Job (scheduled) deployed as a custom container:
- Alternating slots: APIM subscriptions have two key slots (primary and secondary). Each rotation cycle regenerates one slot while the other remains valid and untouched.
- Centralized hub Key Vault: After regeneration, both keys are written to a single hub Key Vault (`{app_name}-{env}-hkv`) with 90-day expiry and rotation metadata. No per-tenant Key Vault is required.
- Self-service retrieval: Tenants fetch current keys via `GET /{tenant}/internal/apim-keys` — an APIM policy endpoint that reads from the hub Key Vault using APIM's managed identity. No Azure SDK required.
- Configurable schedule: Rotation is controlled by two flags in `params/{env}/shared.tfvars`: `rotation_enabled` (master on/off) and `rotation_interval_days` (7 for dev/test, 30 for prod).
- Managed identity authentication: The Container App Job uses a system-assigned managed identity for APIM and Key Vault access. No stored secrets required.
| Component | Purpose | Location |
|---|---|---|
| Container App Job | Scheduled Python job: discovers APIM + hub KV, rotates keys, stores in KV | jobs/apim-key-rotation/ |
| Container build | Builds custom container image to GHCR on PR/merge | .github/workflows/.builds.yml (matrix entry) |
| Terraform module | Deploys Container App Job, Container App Environment, RBAC | infra-ai-hub/modules/key-rotation-function/ |
| Hub Key Vault | Centralized storage for all tenant keys (scales to 1000+) | stacks/shared/main.tf |
| APIM policy endpoint | /internal/apim-keys reads from hub KV | params/apim/api_policy.xml.tftpl |
| Terraform config | Seeds initial KV secrets + RBAC for APIM MI → hub KV | stacks/apim/main.tf |
Rationale
- Zero downtime: The alternating slot pattern guarantees one key is always valid. Tenants are never locked out during rotation.
- Container App Job over GHA workflow: A scheduled Container App Job runs reliably within Azure (no 60-day inactivity disable risk). The previous Bash script + GHA approach required periodic repo activity to avoid GitHub silently disabling scheduled workflows.
- Centralized over distributed: A single hub Key Vault with tenant-prefixed secrets scales better than per-tenant Key Vaults. One RBAC assignment for APIM’s managed identity covers all tenants.
- Self-service key retrieval: The `/internal/apim-keys` endpoint eliminates the need for tenants to have Azure portal access or Key Vault Reader roles. Any valid subscription key authenticates the request.
- Idempotent and safe: The function checks elapsed time since last rotation before acting. Multiple invocations within the interval are no-ops. A `--dry-run` mode allows validation without changes.
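The alternation and idempotency described above reduce to a small piece of state logic. The sketch below is an assumed model (field and function names are invented, and the real job also calls the APIM and Key Vault APIs):

```python
from datetime import datetime, timedelta, timezone

def rotate_if_due(state: dict, interval_days: int, now: datetime) -> dict:
    """Regenerate the stale slot only when the rotation interval has elapsed."""
    if now - state["last_rotated"] < timedelta(days=interval_days):
        return state  # no-op: invoked again within the interval (idempotent)
    # Alternate slots: the slot regenerated last time is left untouched this
    # cycle, so tenants always have at least one valid key (zero downtime).
    next_slot = "secondary" if state["last_slot"] == "primary" else "primary"
    return {"last_slot": next_slot, "last_rotated": now}

# Example: 30-day interval, last rotated 31 days ago.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
state = {"last_slot": "primary", "last_rotated": now - timedelta(days=31)}
state = rotate_if_due(state, 30, now)  # regenerates the secondary slot
state = rotate_if_due(state, 30, now)  # no-op: interval not yet elapsed
```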
Consequences
Positive
- Automated secret hygiene: Keys rotate on a known schedule with full audit trail in Key Vault versioning and GitHub Actions logs.
- Minimal tenant burden: Tenants can use the APIM internal endpoint, daily cron polling, or simply contact the platform team. No Azure SDK or Key Vault access needed.
- Emergency rotation: Both slots can be regenerated immediately via the documented runbook, invalidating all compromised keys.
- Per-environment control: Rotation can be enabled independently per environment — currently active in dev/test, disabled in prod pending STRA approval.
Negative
- Container infrastructure: The Container App Job requires a Container App Environment and Container Registry, adding infrastructure components compared to the previous GHA-only approach. These are managed via the `key-rotation-function` Terraform module.
- STRA gate: Production rotation cannot be enabled until the STRA process completes. Until then, prod keys are static (same risk as baseline).
- Tenant coordination: Tenants using hard-coded keys (rather than the self-service endpoint) must update their configuration within the rotation interval or face 401 errors.
References
- APIM Key Rotation Guide — full operational documentation
- `jobs/apim-key-rotation/` — Container App Job source code
- `infra-ai-hub/modules/key-rotation-function/` — Terraform module
- `.github/workflows/.builds.yml` — container build workflow (matrix entry)
- `params/{env}/shared.tfvars` — per-environment rotation config
ADR-015: Tenant Isolation: Resource Group vs Subscription Pending Platform/MS
This decision requires input from Microsoft’s team on Azure quota scaling options and from the Landing Zone Platform team on subscription provisioning constraints. The current RG-based design is the preferred approach for centralized governance. See #63 and #74.
Context
The AI Services Hub serves multiple BC Government ministries from a shared Azure subscription. Tenant isolation is currently implemented at the Resource Group level — each tenant gets its own RG with dedicated data-plane resources while sharing control-plane infrastructure (VNet, App Gateway, APIM, AI Foundry Hub) in a central RG.
As the platform scales beyond the initial 2 tenants, key Azure subscription-level quotas create hard scaling ceilings:
| Resource | Per-Subscription Limit | Per-Tenant Usage | Ceiling (tenants) |
|---|---|---|---|
| Model deployments per AI account | 32 (default) | 5–7 | ~4–5 |
| AI Search services | 16 (Basic/Standard) | 0–1 | 16 |
| Cosmos DB accounts | 50 | 0–1 | 50 |
| Cognitive Services accounts | 200 | 2 (Doc Intel + Speech) | ~100 |
| APIM APIs per instance | 400 | 5–6 | ~80 |
| Private endpoints per subnet | 1000 | ~5 | ~200 |
| GlobalStandard TPM (per model) | Varies (e.g., 2M for gpt-4.1-mini) | 7.5K–30K | Depends on model |
The most immediate bottleneck is the 32 model deployments per AI Foundry Hub account. With 2 tenants deploying 6–7 models each (13 total today), the limit is reached at roughly 4–5 tenants. Additionally, LLM quotas (PTU and GlobalStandard TPM) are tied to the subscription scope, so all tenants compete for the same throughput pool.
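For illustration, the ceiling arithmetic can be sketched as a short calculation using the table's default limits (a sketch only; actual quotas vary by subscription, region, and negotiated increases):

```python
# Illustrative tenant-ceiling calculation based on the quota table above.
# The limits are the defaults quoted in this ADR, not guaranteed values.

def tenant_ceiling(per_subscription_limit: int, per_tenant_usage: int,
                   already_used: int = 0) -> int:
    """How many more tenants fit before a subscription-scoped quota is hit."""
    remaining = per_subscription_limit - already_used
    return remaining // per_tenant_usage

# 32 model deployments per AI account, ~7 per tenant, 13 deployed today:
print(tenant_ceiling(32, 7, already_used=13))  # → 2 more tenants (~4-5 total)

# 16 AI Search services, at most 1 per tenant:
print(tenant_ceiling(16, 1))  # → 16
```

This is why the 32-deployment limit binds long before the AI Search or Cosmos DB ceilings do.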
Issue #74 raises specific questions about PTU scaling: turnaround time for quota increases, dynamic scaling between PTU and pay-as-you-go, and whether the single-subscription model can support production-scale workloads.
Options Evaluated
Option A: Resource Group Isolation (Current & Preferred)
All tenants share a single Azure subscription. Shared infrastructure (VNet, AppGW, APIM, AI Foundry Hub) lives in a central RG. Each tenant gets a dedicated RG with isolated data-plane resources. Scaling limits are mitigated at the application layer.
Option B: Subscription-Per-Tenant
Each tenant (or group of tenants) gets a dedicated Azure subscription. Shared infrastructure is replicated or connected via VNet peering. Each subscription has independent quota pools.
Option A: Resource Group Isolation — Pros & Cons
Pros
- Centralized governance: Single subscription = single set of Azure Policies, RBAC, Defender for Cloud, cost management. One pane of glass for the platform team.
- Simplified networking: All resources in one VNet with one PE subnet. No cross-subscription VNet peering, no transit routing, no DNS forwarding complexity.
- Lower cost: Shared App Gateway, APIM, AI Foundry Hub, Log Analytics. No per-subscription overhead (reserved instances apply once, shared Defender plans).
- Faster tenant onboarding: Adding a tenant = adding a Terraform tenant config block. No subscription provisioning, OIDC setup, or Landing Zone request.
- Existing investment: Current architecture (5 stacks, phased deployment, APIM key rotation) is built and validated for this model.
- Operational simplicity: One deployment pipeline, one set of state files per environment, one OIDC identity per environment.
- Cost attribution: Per-tenant RGs enable native Azure Cost Management tag-based and RG-based cost reporting without subscription boundaries.
Cons
- Quota ceilings: All tenants share subscription-scoped quotas. The 32-deployment AI account limit is the most immediate constraint (~4–5 tenants).
- TPM/PTU contention: All model deployments on the shared Hub compete for the same GlobalStandard TPM pool. High-demand tenants crowd out others.
- Blast radius: A subscription-level issue (quota exhaustion, policy misconfiguration, billing suspension) impacts all tenants simultaneously.
- Quota increase friction: Requesting Azure quota increases is a per-subscription manual process. Lead times can be days to weeks for PTU.
- AI Search hard limit: 16 search services per subscription is a fixed limit with no increase path — hard wall at 16 search-enabled tenants.
- Foundry serialization: Model deployments run serially across all tenants to avoid ETag conflicts on the shared Hub. More tenants = slower deploys.
Option B: Subscription-Per-Tenant — Pros & Cons
Pros
- Independent quota pools: Each subscription gets its own 32 model deployments, 200 Cognitive Services accounts, 16 AI Search services, etc. Eliminates quota-based scaling ceilings.
- PTU isolation: Each tenant can request and manage its own PTU commitments. No cross-tenant throughput contention.
- Blast radius reduction: Subscription-level issues are isolated to one tenant.
- Independent scaling: Each subscription can scale resources (VM sizes, throughput, storage) without affecting others.
- Compliance flexibility: Some future tenants may have regulatory requirements (FOIPPA, health data) that mandate subscription-level isolation.
- Clean cost separation: Native Azure billing per subscription. No tag-based attribution needed.
Cons
- Loss of central governance: Each subscription needs its own Azure Policies, RBAC assignments, Defender plans. Policy drift risk increases linearly.
- Networking complexity: Requires cross-subscription VNet peering (or VWAN hub-and-spoke), cross-subscription private DNS zones, transit routing. Significant complexity increase.
- Infrastructure duplication: App Gateway, APIM, Key Vault, and potentially AI Foundry Hub must be replicated per subscription — or a complex shared services model must be designed.
- Higher cost: Duplicated infrastructure (App Gateway ~$300/mo, APIM StandardV2 ~$700/mo per subscription). Reserved instance savings are fragmented.
- Subscription provisioning lead time: BC Gov Landing Zone subscription requests go through the Platform Services team. Lead time is days to weeks, blocking rapid tenant onboarding.
- Pipeline complexity: Each subscription needs its own OIDC federation, deployment pipeline, state backend. The current single-pipeline model does not extend.
- Operational overhead: N subscriptions = N sets of monitoring, alerting, secret rotation, certificate management. Platform team workload scales linearly.
- APIM routing: A shared APIM fronting multiple subscription backends requires cross-subscription private endpoints or public exposure — both add complexity.
Decision
We continue with Resource Group isolation (Option A) as the preferred architecture, with specific mitigations for quota scaling constraints. This decision is pending confirmation from Microsoft on quota flexibility and from the Platform team on subscription provisioning options as a future fallback.
Mitigations for RG-Based Scaling Limits
| Constraint | Limit | Mitigation Strategy | Status |
|---|---|---|---|
| Model deployments per AI account | 32 | Request quota increase via Azure Support. Deploy a second AI Foundry Hub account if increase denied. Consolidate shared models (e.g., single embedding model for all tenants). | Pending MS |
| GlobalStandard TPM contention | Per-model cap | Implement APIM rate limiting per tenant (already in place). Explore PTU for high-priority tenants. Use `dynamic_throttling_enabled` on AI account. Investigate PTU ↔ pay-as-you-go spillover. | Pending MS |
| AI Search services | 16 | Not all tenants need AI Search (1 of 2 currently enabled). For tenants with simple needs, use shared index with document-level permissions or skip Search entirely. | Mitigated |
| Foundry project serialization | Serial deploys | Already mitigated in ADR-013 (scaled stacks). Foundry stack runs serially but other phases are parallel. `prevent_destroy` on model deployments reduces redeploy churn. | In Place |
| APIM API count | 400 | Consolidate API definitions. Use a single versioned API with tenant routing via APIM policies rather than per-tenant API duplicates. | Future |
Trigger for Revisiting This Decision
The following conditions would warrant moving specific tenants or tenant groups to dedicated subscriptions (hybrid model):
- Tenant count exceeds 10–15 and quota increase requests are denied by Microsoft
- A tenant requires dedicated PTU with guaranteed throughput SLAs that conflict with shared pool
- Regulatory requirements (e.g., health data under FOIPPA) mandate subscription-level isolation explicitly
- PTU turnaround time for quota increases exceeds acceptable lead times for tenant onboarding
- Total GlobalStandard TPM across all tenants approaches the per-model subscription ceiling
Rationale
- Right-sizing for now: With 2 active tenants and realistic growth to 5–10 in the near term, the RG model has headroom with mitigations. Subscription-level rearchitecture is premature.
- Governance is the priority: BC Gov Landing Zone policies, RBAC, and audit requirements are significantly easier to enforce in a single subscription. The compliance benefit outweighs the quota risk.
- Cost efficiency: Shared infrastructure saves ~$1,000+/mo per avoided subscription (AppGW + APIM alone). This is material in a government context.
- Onboarding velocity: A Terraform config change vs. a multi-week subscription provisioning request. This directly impacts ministry adoption speed.
- Hybrid escape hatch: The architecture supports a future hybrid model where high-demand or compliance-sensitive tenants move to dedicated subscriptions while most remain on the shared RG model. This is not an irreversible decision.
Open Questions for Microsoft
- What is the turnaround time for PTU quota increases when current allocation is at full capacity?
- Does PTU support dynamic scaling (elastic range) or is it fixed at a provisioned point?
- Can you provide Terraform samples for auto-fallback between PTU and GlobalStandard (pay-as-you-go)?
- Can the 32 model deployment limit per Cognitive Services account be increased? To what ceiling?
- Is there a recommended pattern for multi-AI-Foundry-Hub accounts within a single subscription to distribute deployments?
Consequences
Positive
- No immediate rearchitecture needed: The platform continues operating with the validated RG-based model while answers from Microsoft are pending.
- Clear scaling triggers: The team knows exactly which quotas to monitor and at what tenant count to revisit the decision.
- Documented escape path: If RG-based scaling hits limits, the migration path to subscription-per-tenant (or hybrid) is architecturally understood.
Negative
- Near-term ceiling: The 32-deployment limit means maximum ~4–5 tenants without a quota increase or model consolidation. This is a known constraint.
- MS dependency: Key mitigations (quota increases, PTU scaling guidance) depend on Microsoft response timelines.
- Potential future migration: If the hybrid model is eventually needed, migrating a tenant from shared subscription to dedicated subscription requires re-creating resources, migrating data, and updating APIM routing — non-trivial effort.
References
- Issue #63: Decision | Tenant Isolation | RG Or Subscription
- Issue #74: Foundry Hub | Single Subscription | Scaling (parent issue)
- ADR-010: Multi-Tenant Isolation Model — four-layer isolation architecture
- ADR-013: Scaled Stack Architecture — phased deployment with isolated state
- Azure OpenAI Quotas and Limits
- Azure Subscription Service Limits
ADR-016: Backend Circuit Breaker Pattern Resilience Accepted
| Status | Accepted |
| Date | 2026-02 |
| Deciders | Platform Team |
| Category | Resilience / API Gateway |
Context
The APIM gateway proxies requests to multiple backend services: Azure OpenAI, Document Intelligence, AI Search, Speech Services, and Storage. When a backend experiences failures (overload, outages, throttling), continued request forwarding wastes resources and degrades the client experience with slow timeouts instead of fast failures.
Azure OpenAI specifically returns 429 Too Many Requests with large Retry-After values (up to 86,400 seconds / 1 day) when quota is exhausted. Without circuit breaking, APIM would continue sending requests to a backend that cannot serve them.
Decision
Implement the circuit breaker pattern on all APIM backend entities using the native circuit_breaker_rule in azurerm_api_management_backend.
Configuration per backend
| Parameter | Value (AI services) | Value (Storage) |
|---|---|---|
| Failure count threshold | 3 | 5 |
| Failure window | 1 minute (PT1M) | 1 minute (PT1M) |
| Trip duration | 1 minute (PT1M) | 1 minute (PT1M) |
| Accept Retry-After | Yes | Yes |
| Trigger status codes | 429, 500–599 | 429, 500–599 |
Backends covered
- `ai_foundry` — AI Foundry Hub endpoint
- `openai` — Azure OpenAI endpoint
- `docint` — Document Intelligence
- `ai_search` — Azure AI Search
- `speech_services_stt` — Speech-to-Text
- `speech_services_tts` — Text-to-Speech
- `storage` — Blob Storage (higher threshold: 5 failures)
What happens when the circuit trips
- Backend accumulates failures matching the configured status codes within the failure window.
- When the failure count exceeds the threshold, the circuit opens (trips).
- APIM immediately returns HTTP 503 Service Unavailable to all subsequent requests targeting that backend — requests are not forwarded.
- The global policy `<on-error>` handler intercepts the 503 and returns a structured JSON error with a `Retry-After` header (default: 60 seconds).
- After the trip duration (or the backend’s `Retry-After` value if `accept_retry_after_enabled = true`), the circuit resets and traffic resumes.

When Azure OpenAI returns 429, the `Retry-After` header can be very large (e.g., 86,400 seconds = 1 day). With `accept_retry_after_enabled = true`, the circuit stays open for that duration. This is intentional — the backend cannot serve requests during that period anyway.
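The trip-and-reset semantics above can be modelled with a small state machine. This is an illustrative sketch of the behaviour, not APIM's implementation; the `CircuitBreaker` class and its names are invented here:

```python
import time

# status codes that count toward tripping, per the configuration table above
TRIP_STATUSES = set(range(500, 600)) | {429}

class CircuitBreaker:
    """Illustrative model of the trip/reset semantics described above:
    `threshold` failures inside a rolling window open the circuit for
    `trip_s` seconds, or for the backend's Retry-After when larger."""

    def __init__(self, threshold=3, window_s=60.0, trip_s=60.0):
        self.threshold, self.window_s, self.trip_s = threshold, window_s, trip_s
        self.failures = []      # timestamps of recent qualifying failures
        self.open_until = 0.0   # requests are rejected until this time

    def allow(self, now=None):
        """False while the circuit is open (APIM would answer 503)."""
        now = time.time() if now is None else now
        return now >= self.open_until

    def record(self, status, retry_after=0.0, now=None):
        now = time.time() if now is None else now
        if status not in TRIP_STATUSES:
            return
        # keep only failures inside the rolling window, then add this one
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            # accept_retry_after_enabled: stay open for the larger duration
            self.open_until = now + max(self.trip_s, retry_after)
            self.failures.clear()
```

Three qualifying failures within a minute open the circuit; a tripping 429 that carries `Retry-After: 86400` would keep it open for the full day, matching the intentional behaviour noted above.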
Client-facing error response (503)
{
"error": {
"code": "503",
"message": "Service temporarily unavailable — backend circuit breaker is open. The service is recovering from excessive failures. Retry after the indicated period.",
"retryAfter": "60",
"requestId": "abc-123-def-456"
}
}
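A client consuming this error shape can choose its backoff from the structured payload. The sketch below is illustrative; `retry_delay` is an invented helper, not part of the platform:

```python
import json
import random

def retry_delay(response_body: str, attempt: int, cap: float = 300.0) -> float:
    """Pick a client-side backoff delay from the gateway's 503 payload.

    Honours the structured retryAfter field when present; otherwise falls
    back to capped exponential backoff. Jitter avoids a thundering herd."""
    try:
        delay = float(json.loads(response_body)["error"]["retryAfter"])
    except (ValueError, KeyError, json.JSONDecodeError):
        delay = min(cap, 2.0 ** attempt)
    return delay + random.uniform(0, 1)
```

For the example payload above, `retry_delay(body, attempt=0)` yields roughly 60 seconds, matching the `Retry-After` header the `<on-error>` handler sets.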
Rationale
- Fast failure: Clients get an immediate 503 instead of waiting for backend timeouts (up to 300 seconds).
- Backend protection: Prevents overwhelming a struggling backend with additional requests.
- Native support: Uses `azurerm_api_management_backend.circuit_breaker_rule` — no custom policy logic required.
- Retry-After propagation: Passes backend throttle signals directly to clients for proper backoff.
- Event Grid integration: APIM emits Event Grid events on circuit trip/reset for monitoring and alerting.
Consequences
Positive
- Reduced latency during backend outages (instant 503 vs. timeout).
- Backend services get breathing room to recover.
- Consistent error format with `Retry-After` enables client-side retry logic.
- Zero policy-level code — purely infrastructure configuration.
Negative
- Approximate tripping: APIM gateway instances do not synchronize circuit breaker state. Each instance tracks failures independently, so tripping is approximate in multi-instance deployments.
- Single rule per backend: Only one circuit breaker rule per backend is currently supported by the Azure API.
- 503 during recovery: Legitimate requests during the trip window will be rejected even if the backend has recovered before the trip duration expires.
References
- Azure APIM Backends — Circuit Breaker
- Circuit Breaker Pattern (Azure Architecture)
- APIM Event Grid Events — circuit breaker trip/reset events
- Issue #98: AI Gateway Gap Analysis
ADR-017: Custom Tenant Onboarding Portal Inside AI Hub Accepted
Context
The AI Hub needed a tenant onboarding experience that does more than collect a request. The onboarding flow must support structured tenant metadata, governed admin review, future automation from submission through approval, and environment-aware downstream actions such as generating Terraform inputs and preparing non-prod and prod deployment workflows.
We considered using an existing platform instead of building a portal inside the Hub workspace. The main alternatives were the BC Government Platform Product Registry, CHEFS, and the Azure API Management Developer Portal. All three reduce initial build effort, but none matches the lifecycle and control-plane requirements of Hub onboarding.
Options Considered
Custom portal inside AI Hub (selected)
- Owns the end-to-end onboarding workflow
- Supports structured validation and Hub-specific data models
- Can drive future automation after approval
BC Government Platform Product Registry
- Designed to manage existing products on Private Cloud OpenShift and Public Cloud Landing Zones
- Solves a different problem than Hub onboarding
- Supports product metadata and resource change requests for higher-level platforms, which the AI Hub may not need at all
CHEFS
- Strong for hosted form submission
- Submission lifecycle is oriented around form intake, not long-running onboarding state
- Not a fit for secure post-approval actions or controlled reveal flows
APIM Developer Portal
- Can be deployed as part of the Hub infrastructure (APIM)
- Offers self-service API subscription and developer onboarding UX out of the box
- Customisable via delegation and custom HTML/JS widgets, but tightly coupled to APIM concepts (products, subscriptions)
- Not designed for multi-step approval workflows, structured Terraform config generation, or environment-aware provisioning actions
Decision
We will use a custom tenant onboarding portal inside the AI Hub repository and deployment boundary instead of the BC Government Platform Product Registry, CHEFS, or the APIM Developer Portal.
Rationale
- Governed approval workflow: Hub onboarding requires an admin review and approval process, not just a one-time request capture. The custom portal can model request states, reviewer actions, and follow-up steps directly.
- Structured Hub-specific configuration: The workflow needs to collect and validate data that maps cleanly into tenant-specific Terraform inputs and other structured configuration artifacts. A generic registry or hosted form would require extra translation layers and still would not own the lifecycle.
- Tight integration with Hub logic and storage: The onboarding flow must stay close to Hub-specific validation rules, tenant state, and downstream provisioning behavior. Keeping the portal in the same codebase reduces impedance between intake, approval, generated config, and deployment automation.
- Authenticated post-submission lifecycle: The process does not end at submission. The platform needs room for authenticated follow-up actions, operator review, state transitions, and future self-service interactions that go beyond a write-once form.
- CHEFS is too immutable for this lifecycle: CHEFS is well suited to hosted intake, but its submission model is intentionally form-centric and immutable. It supports status and notes, but it is not designed to mutate form data, drive secure post-approval reveal flows, or act as the system of record for ongoing onboarding operations.
- The Platform Product Registry solves a different problem: The registry is built for teams that need to create or manage a product on Private Cloud OpenShift or the Public Cloud Landing Zone, which may not apply to the AI Hub at all. AI Hub tenant onboarding is independent of the platform a tenant runs on, captures Hub-specific configuration, and needs a purpose-built approval and provisioning workflow rather than a generic registry product-change process, in the same way that Keycloak and the Kong App Gateway ship their own dedicated portals.
- Secure post-approval actions: The team needs room for actions after approval, including controlled credential-related workflows and other sensitive follow-up behavior. Those concerns should live in a dedicated application boundary rather than in generic form metadata or a public-facing registry experience.
- Future automation path: The chosen design supports evolving from request intake to approval and then to automated promotion and deployment workflows across non-prod and prod environments without replacing the front door later.
- APIM Developer Portal is the wrong abstraction: The APIM Developer Portal is built around API products and subscriptions, not tenant provisioning workflows. Customising it far enough to support multi-step approval, structured config generation, and environment-aware provisioning actions would require extensive delegation and external backend work — effectively rebuilding the same application on a more constrained foundation.
Consequences
Positive
- Single ownership boundary: Intake, review, generated config, and deployment hooks can evolve together in the Hub codebase
- Better security posture for follow-up actions: Sensitive post-approval behavior stays in a purpose-built application instead of being forced into form notes or registry constructs
- Clearer evolution path: The portal can add richer workflow, audit, and automation capabilities without re-platforming
- Operational consistency: The same repo, CI/CD patterns, and Azure deployment model can be reused for the portal and Hub-adjacent automation
Negative
- Custom application to build and maintain: We own the frontend, backend, tests, deployment, and documentation
- Higher initial delivery cost: Building a tailored workflow is slower than standing up a generic form or pointing users at an existing portal
- More platform decisions to maintain: Auth, storage, review workflow, and automation semantics become our responsibility
Mitigations
- Keep the portal thin and focused: Implement only onboarding workflow concerns that are specific to AI Hub
- Automate validation and delivery: Maintain build, unit, E2E, and deployment workflows so sustainment cost stays controlled
- Document the why: This ADR exists so future teams do not revisit CHEFS or the Platform Product Registry without understanding the lifecycle mismatch
References
ADR-018: External PII Redaction Service Accepted
Context
The AI Hub uses Azure Language Service PII recognition to redact personally identifiable information from chat completion payloads before they reach upstream LLM backends. APIM delegates all PII processing to a dedicated external service rather than calling the Language API inline, because APIM policies have no loop construct and the Language API imposes strict per-call limits.
The Azure Language Service /language/:analyze-text endpoint accepts a maximum of 5 documents per call, and each document is limited to 5 120 characters. Real-world chat completion requests can contain many messages or very long messages that chunk into more than 5 documents. Covering the full payload requires batched calls with deadline enforcement and transient retry handling — logic that belongs in a real programming language, not APIM XML.
A single APIM `send-request` call covers at most 5 documents. Payloads exceeding that limit need an external orchestrator that can issue batched calls, handle transient retry/backoff, and stay within the APIM timeout budget.
Options Considered
External PII Redaction Container App (selected)
- Dedicated FastAPI service deployed as a Container App on the shared internal CAE
- APIM routes all PII-enabled requests to the external service
- Service handles bounded batching, retry/backoff, deadline enforcement, and coverage verification
- Single code path keeps the APIM policy simple
Azure Functions (Consumption or Flex)
- Serverless compute that scales to zero
- Cold start latency (seconds) conflicts with the 90 s APIM timeout budget
- Less control over runtime, networking, and concurrency model
- Does not align with existing Container App Environment already in use
Expand APIM inline policy (no external service)
- Keep everything in APIM XML policies
- APIM policies have no loop construct; would require N hard-coded `send-request` blocks
- Extremely fragile, hard to test, and impossible to maintain at scale
- Policy execution time adds up and risks breaching the APIM timeout
Client-side redaction SDK
- Push PII responsibility to each tenant application
- Cannot be enforced centrally; tenants may skip or misconfigure
- Defeats the purpose of transparent gateway-level PII protection
Decision
We will deploy a dedicated PII redaction microservice as a Container App on the shared internal Container App Environment, reachable only from APIM via VNet-internal ingress. APIM routes all PII-enabled requests to this service.
Rationale
- Language API batching limit: The `/language/:analyze-text` endpoint accepts at most 5 documents per call (each up to 5 120 chars). Payloads with many or long messages exceed this and need orchestrated batch calls that APIM policies cannot express.
- APIM cannot loop: Policy XML has no iteration construct. Hard-coding multiple `send-request` blocks is brittle, hard to test, and caps out quickly. A real programming language (Python + asyncio) handles batching, timeouts, retry/backoff, and error recovery naturally.
- Single routing path: A single route from APIM to the external service keeps the policy fragment simple. The service handles all payloads with the same batching logic regardless of size.
- Reuse existing infrastructure: The shared Container App Environment, Managed Identity RBAC, GHCR image builds, and Terraform module patterns are already in place. Adding one more Container App is incremental.
- Timeout budget alignment: The service enforces an 85 s total processing deadline that fits within the 90 s APIM `send-request` timeout, with per-attempt timeouts (10 s) and bounded transient retries for 429 and 5xx responses.
- Fail-open / fail-closed flexibility: The service returns a structured response with coverage status. APIM decides whether to block (503) or pass through based on the tenant's `fail_closed` configuration.
- Testability: Python unit tests cover chunking, batching, reassembly, and API error handling — far easier to maintain than equivalent logic embedded in APIM XML.
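The timeout-budget behaviour described above can be sketched as a deadline-aware retry loop. This is an illustrative model only; `call_with_budget` and its signature are invented here, not the service's actual code:

```python
import time

def call_with_budget(attempt_fn, total_budget_s=85.0, attempt_timeout_s=10.0,
                     base_backoff_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Retry transient Language API failures inside a hard deadline.

    attempt_fn(timeout_s) must return (status, result, retry_after_s)."""
    deadline = clock() + total_budget_s
    backoff = base_backoff_s
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("total processing deadline exhausted")
        # never give one attempt more time than the budget has left
        status, result, retry_after = attempt_fn(min(attempt_timeout_s, remaining))
        if status < 400:
            return result                  # success: hand the result back
        if status == 429 and retry_after is not None:
            wait = retry_after             # honour the backend's Retry-After
        elif status == 429 or status >= 500:
            wait = backoff                 # exponential backoff for 5xx
            backoff *= 2
        else:
            raise RuntimeError(f"non-retryable status {status}")
        if clock() + wait >= deadline:
            raise TimeoutError("next retry would exceed the deadline")
        sleep(wait)
```

With the defaults above, a transient 503 is retried with backoff, a 429's `Retry-After` is honoured, and any wait that would blow the 85 s budget aborts immediately instead of breaching the APIM timeout.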
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                      ALL-EXTERNAL PII REDACTION                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Client → App Gateway → APIM                                         │
│                      │                                               │
│            ┌─────────┴────────────────────────┐                      │
│            │ POST /redact                     │                      │
│            │ (PII Redaction Service)          │                      │
│            └─────────┬────────────────────────┘                      │
│                      │                                               │
│            ┌─────────┴────────────────────────┐                      │
│            │ Bounded concurrent batches       │                      │
│            │ (max 5 docs × 15 batches)        │                      │
│            │ Word-boundary chunking           │                      │
│            │ Retry-after / backoff handling   │                      │
│            │ Deadline enforcement (85s)       │                      │
│            │ Full-coverage check              │                      │
│            └─────────┬────────────────────────┘                      │
│                      │                                               │
│            Redacted body → APIM → Backend                            │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
Key Design Constraints
| Constraint | Value | Reason |
|---|---|---|
| Max chars per document | 5 000 | Language API limit (5 120) with safety margin |
| Max documents per Language API call | 5 | Language API batch limit |
| Max batches per request | 15 (→ 75 documents) | Caps total processing time; rejects with 413 if exceeded |
| Per-attempt timeout | 10 s | Isolate slow Language API calls |
| Total processing timeout | 85 s | Fits within APIM 90 s send-request timeout |
| Transient retry handling | 429 + 5xx | Honor Retry-After for 429 and use exponential backoff for 5xx, all within the same 85 s budget |
| Chunking strategy | Word-boundary split | Avoids splitting mid-word which degrades PII detection accuracy |
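The chunking and batching constraints in the table can be sketched as follows. This is an illustrative reimplementation under the stated limits, not the service's source code; the function names are invented:

```python
def chunk_on_word_boundaries(text, max_chars=5000):
    """Split text into chunks of at most max_chars without breaking words,
    so PII entities are not cut in half across documents."""
    chunks, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # degenerate case: a single token longer than max_chars is hard-split
        while len(word) > max_chars:
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        current = word
    if current:
        chunks.append(current)
    return chunks

def batch_documents(documents, per_call=5, max_batches=15):
    """Group documents into Language API calls of at most per_call docs.
    Payloads beyond per_call * max_batches are rejected (the service
    answers HTTP 413)."""
    batches = [documents[i:i + per_call]
               for i in range(0, len(documents), per_call)]
    if len(batches) > max_batches:
        raise ValueError("payload exceeds the 75-document cap (HTTP 413)")
    return batches
```

A 12-document payload becomes three Language API calls of 5, 5, and 2 documents; a 76-document payload is rejected up front rather than risking the 85 s deadline.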
Consequences
Positive
- Transparent payload handling: Tenants do not need to know about Language API limits; APIM routes all PII-enabled requests to the external service automatically
- Single code path: All PII redaction flows through the Container App, keeping behaviour consistent across all payload sizes
- Testable orchestration logic: Chunking, batching, reassembly, and error handling are covered by Python unit tests rather than being embedded in untestable APIM XML
- Incremental infrastructure cost: Runs on the existing shared CAE with Managed Identity RBAC — no new networking or identity infrastructure
- Structured observability: JSON-formatted logs with correlation IDs, batch counts, and elapsed-time diagnostics
Negative
- Additional component to deploy and maintain: One more Container App, Dockerfile, Terraform module, and deployment phase
- Extra network hop for all PII requests: Every PII-enabled request pays the cost of APIM → Container App → Language API instead of APIM → Language API directly
Mitigations
- Reuse proven patterns: The Container App module, GHCR build workflow, and deploy ordering follow the same conventions as the key-rotation job
- Integration tests: The APIM integration test suite covers PII redaction end-to-end through the external service
- Conservative thresholds: The 75-document cap and 85 s deadline ensure the service stays within APIM timeout bounds even under adverse conditions, including transient retry/backoff
References
- Language Service PII Redaction — operational documentation for PII redaction via the external service
- Services Overview — Container App deployment topology
- Azure Language Service — Analyze Text API
- Azure AI Language — PII Detection Overview
ADR Template
Use this template when adding new ADRs:
## ADR-XXX: [Title]

**Status:** [Proposed | Accepted | Deprecated | Superseded]
**Date:** YYYY-MM
**Deciders:** [Team/People]
**Category:** [Security | Networking | Infrastructure | Documentation | etc.]

### Context

[What is the issue? What forces are at play?]

### Decision

[What is the decision? Be specific.]

### Rationale

[Why this decision over alternatives?]

### Consequences

#### Positive

- [Good outcomes]

#### Negative

- [Tradeoffs accepted]

### References

- [Links to relevant docs, diagrams, discussions]
You have reached the end of the ADR list.
↑ Back to Decision Index