Architecture Decision Records
This page documents the key architecture decisions behind the AI Services Hub. Each architecture decision record explains the problem being solved, the choice that was made, and the trade-offs that came with that choice. If you are new to the project, this page is here to explain not just what was built, but why it was built that way.
Many of the choices documented here were not optional design preferences. They were driven by British Columbia Government security and platform rules. Important constraints include:
- No public endpoints - All Azure services must use private endpoints only
- Private networking required - Resources must be isolated inside virtual networks
- No long-lived secrets - Identity and short-lived token patterns are preferred over stored passwords and keys
- Bastion-only access - Direct SSH/RDP from internet is prohibited
Start Here: Architecture Decision Record 001
Architecture Decision Record 001 is the foundation that explains why the rest of the infrastructure exists. Read it first if you want the big-picture explanation before reviewing the more specific decisions.
Architecture decision records capture a decision together with its background and consequences. They help future maintainers avoid reopening the same debate without context, and they give new team members a faster way to understand the platform.
Decision Index
| ID | Title | Driver | Status |
|---|---|---|---|
| ADR-001 | Shared AI Landing Zone | Policy | Accepted |
| ADR-002 | Use OIDC instead of Service Principal Secrets | Policy | Accepted |
| ADR-003 | Optional Use of Azure Bastion for VM Access | Policy | Accepted |
| ADR-004 | Private Endpoints for All Azure Services | Policy | Accepted |
| ADR-005 | Zero-Dependency Documentation System | Choice | Accepted |
| ADR-006 | Terraform as Infrastructure as Code (IaC) | Choice | Accepted |
| ADR-007 | Client Connectivity via App Gateway + APIM | Policy | Accepted |
| ADR-008 | No Azure Portal or Foundry Studio UI Access | Policy | Pending Platform/MS |
| ADR-009 | Why AI Landing Zone vs Custom Solution | Choice | Accepted |
| ADR-010 | Multi-Tenant Isolation Model | Policy | Accepted |
| ADR-011 | Control Plane vs Data Plane Access & Chisel Tunnel | Policy | Accepted |
| ADR-012 | Usage Monitoring, Cost Allocation and Chargeback Metrics | Operations | Proposed |
| ADR-013 | Scaled Stack Architecture with Isolated State Files | Choice | Accepted |
| ADR-014 | APIM Subscription Key Rotation | Choice | Proposed |
| ADR-015 | Tenant Isolation: Resource Group vs Subscription | Policy | Pending Platform/MS |
| ADR-016 | Backend Circuit Breaker Pattern | Resilience | Accepted |
| ADR-017 | Custom Tenant Onboarding Portal Inside AI Hub | Choice | Accepted |
| ADR-018 | External PII Redaction Service | Choice | Accepted |
ADR-001: Shared AI Landing Zone Accepted
Context
The Problem: Why Can't GitHub Just Run Terraform Directly?
The Network Barrier
When you run terraform apply from GitHub Actions, here's what happens:
- GitHub spins up a runner (a VM on Microsoft's public cloud)
- Terraform tries to reach the data plane of Azure PaaS services (e.g., Key Vault secrets)
- BLOCKED - Key Vault has no public endpoint
- Terraform then tries to connect to storage containers, databases, etc.
- BLOCKED - data-plane access to all services is private-only, inside the VNet
Result: Terraform can run from the public internet, but it is limited to the state Storage Account and control-plane access to Azure resources.
The Solution: Landing Zone Architecture
We need infrastructure inside the private network that can:
- Receive commands from GitHub (via OIDC tokens)
- Run Terraform against private endpoints
- Allow humans to access resources for debugging
VNet (Virtual Network)
What: Private network in Azure
Why: All resources live here, isolated from public internet
Analogy: The building's internal network
Jumpbox VM
What: Linux VM inside the VNet
Why: Runs Terraform, can reach all private endpoints
Analogy: A workstation inside the secure office
Azure Bastion
What: Managed gateway service
Why: Secure way to access Jumpbox (no public SSH)
Analogy: The secure lobby with ID verification
How Terraform Actually Runs
┌─────────────────────────────────────────────────────────────────────────┐
│                            DEPLOYMENT FLOW                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  GitHub Actions (Public Internet)   Azure Landing Zone (Private)        │
│                                                                         │
│  ┌──────────────┐      OIDC         ┌──────────────┐  ┌──────────────┐  │
│  │ GitHub-Hosted│──────────────────▶│ Azure Proxy  │─▶│ Private      │  │
│  │ Runner       │                   │ (Chisel App  │  │ Endpoints    │  │
│  │ ubuntu-24.04 │◀───── SOCKS5 ─────│  Service)    │  │ (Storage, KV)│  │
│  └──────┬───────┘                   └──────────────┘  └──────────────┘  │
│         ▼                                                               │
│  ┌──────────────┐                                                       │
│  │ terraform    │                                                       │
│  │ plan/apply   │                                                       │
│  └──────────────┘                                                       │
│  (via SOCKS tunnel)                                                     │
└─────────────────────────────────────────────────────────────────────────┘
Two deployment options:
Option A: Chisel Tunnel with GitHub-Hosted Runners
Standard GitHub-hosted runners (ubuntu-24.04) combined with a Chisel SOCKS5 proxy deployed as an Azure App Service inside the VNet. This eliminates the need for persistent self-hosted runner infrastructure.
- GitHub-hosted runner starts (ephemeral, no maintenance)
- Workflow deploys/starts the Chisel proxy App Service
- Runner connects through SOCKS5 tunnel into VNet
- Terraform traffic routed via tunnel to private endpoints
- No persistent runner pool needed — pay only for workflow minutes
The azure-proxy Chisel container runs on an App Service Plan inside the tools subnet of the VNet. The .deployer.yml reusable workflow deploys it first; subsequent steps then use it as a SOCKS proxy via the ALL_PROXY environment variable.
Used by this platform for all CI/CD
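As a rough Terraform sketch of the proxy piece, assuming hypothetical names (asp-azure-proxy, rg-ai-hub, the tools subnet reference) and the upstream jpillora/chisel container image:

```hcl
# Hypothetical sketch: Chisel SOCKS5 proxy as a VNet-integrated App Service.
resource "azurerm_service_plan" "proxy" {
  name                = "asp-azure-proxy"   # hypothetical name
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"         # hypothetical resource group
  os_type             = "Linux"
  sku_name            = "B1"
}

resource "azurerm_linux_web_app" "proxy" {
  name                = "azure-proxy"       # hypothetical name
  location            = azurerm_service_plan.proxy.location
  resource_group_name = azurerm_service_plan.proxy.resource_group_name
  service_plan_id     = azurerm_service_plan.proxy.id

  # Outbound VNet integration is what lets the proxy reach private endpoints.
  virtual_network_subnet_id = azurerm_subnet.tools.id   # hypothetical subnet

  site_config {
    application_stack {
      docker_registry_url = "https://index.docker.io"
      docker_image_name   = "jpillora/chisel:latest"    # assumption: upstream image
    }
  }
}
```

The workflow would then open a chisel client connection to this app and export ALL_PROXY=socks5://127.0.0.1:1080 before running Terraform.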
Option B: Jumpbox + Bastion
Manual access via Bastion for debugging, testing, and emergency fixes.
- Human connects via Azure Portal
- Bastion brokers SSH connection
- Land on Jumpbox inside VNet
- Run commands, debug issues
Required for human access
What About Other Repositories?
This is shared infrastructure - the Landing Zone is set up once and used by all projects.
What This Repo Provides (Set Up Once)
| Component | Purpose | Shared? |
|---|---|---|
| VNet + Subnets | Private network for all resources | Yes - all projects use this |
| Azure Bastion | Secure access gateway | Yes - one Bastion for all |
| Jumpbox VM | Admin access point | Yes - shared by admins |
| Azure Proxy (Chisel Server) | SOCKS5 tunnel for CI/CD private-endpoint access | Yes - shared by all stacks |
| Private DNS Zones | Name resolution for private endpoints | Managed by Platform Services |
How Access Actually Works (Public vs Private)
| Who | Access Path | Public IPs? | How It Works |
|---|---|---|---|
| Platform Team (Admin) | Internet → Azure Portal → Bastion → VMs | Bastion only | Bastion is Azure-managed PaaS with public IP. VMs have private IPs only. This is the ONE public-to-private bridge. |
| Ministry Apps/Users | Gov Network → ExpressRoute → App Gateway → APIM → Services | None | ExpressRoute is a private dedicated circuit from BC Gov data centers to Azure backbone. Traffic never touches public internet. |
| GitHub Actions (CI/CD) | GitHub-hosted runner + Chisel SOCKS tunnel → Private Endpoints | None | GitHub-hosted runners (ubuntu-24.04) route Terraform traffic through a Chisel SOCKS5 proxy (App Service inside the VNet). The runner itself is on the public internet but all data-plane calls are tunnelled privately. |
What is ExpressRoute?
ExpressRoute is NOT a public endpoint. It's a dedicated fiber connection from BC Gov's data centers directly into Azure's network backbone.
- Traffic stays on private circuits (not internet)
- Managed by BC Gov Platform Services
- Already exists - we just use it
- Like a private tunnel to Azure
Why APIM is Internal
APIM is deployed in internal mode for security; it is reachable only through App Gateway:
- ExpressRoute connects Gov Network → Azure VNet
- App Gateway acts as a layer 7 (HTTP) load balancer and reverse proxy in front of APIM, with strong WAF protection
- Gov users can reach internal IPs directly via ExpressRoute if needed, with the proper firewall rules in place
Summary: Only ONE Public Endpoint
┌─────────────────────────────────────────────────────────────────────────────┐
│ BC Gov Network │
│ ┌──────────────┐ │
│ │ Ministry │ │
│ │ Applications │──┐ │
│ └──────────────┘ │ │
│ │ │
│ ┌──────────────┐ │ ┌──────────────────────────────────────────────────┐│
│ │ Ministry │──┼────│ ExpressRoute (Private Circuit) ││
│ │ Users │ │ │ NOT public internet! ││
│ └──────────────┘ │ └──────────────────────────────────────────────────┘│
└────────────────────┼────────────────────────────────────────────────────────┘
│ Firewall Rules (Allowed Traffic)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Azure (Private VNet) │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────────────────┐ │
│ │ App Gateway │───▶│ APIM │───▶│ Private Endpoints │ │
│ │ (Public IP) │ │ (Internal IP) │ │ (Storage, OpenAI, etc.) │ │
│ └────────────────┘ └────────────────┘ └────────────────────────────┘ │
│ ▲ │
│ │ │
│ │ All private IPs - reachable via ExpressRoute │
│ │
│ ════════════════════════════════════════════════════════════════════════ │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ BASTION │─▶ │ Jumpbox VM │ ◀── Only Bastion has public IP │
│ │ (PUBLIC IP) │ │ (Private IP) │ (for admin access) │
│ └────────────────┘ └────────────────┘ │
│ ▲ │
└─────────┼───────────────────────────────────────────────────────────────────┘
│
│ Admin accesses via Azure Portal (browser)
│
┌─────────┴────────┐
│ INTERNET │
│ (Platform Team) │
└──────────────────┘
Why Not Just Open Public Endpoints Temporarily?
Policy prohibits this. Even temporary public access:
- Creates audit findings
- Requires security exemption paperwork
- Introduces attack window
- Must be reverted manually (often forgotten)
The Landing Zone approach is always private - no exemptions needed.
Consequences
Positive
- Fully policy compliant - No public endpoints ever
- Shared infrastructure - One-time setup, many projects benefit
- Consistent security - All projects inherit the same secure baseline
- Cost efficient - Single Bastion (~$140/mo) serves all projects
Negative
- Initial complexity - Landing Zone must be built first
- Proxy dependency - Chisel App Service must be healthy before Terraform workflows run
- VNet planning - Must allocate IP ranges carefully
References
ADR-002: Use OIDC instead of Service Principal Secrets Accepted
Context
GitHub Actions workflows need to authenticate with Azure to deploy infrastructure via Terraform. We evaluated three options for credential management:
Option A: Static Secrets
- Create Azure AD App Registration
- Generate Client Secret
- Store in GitHub Secrets
- Rotate manually every 1-2 years
Not policy compliant - Long-lived credentials prohibited
Option B: Platform Rotating Keys
- Platform team rotates keys every 2 days
- Keys expire after 4 days
- Requires cron job to fetch/update
- Must sync to GitHub Secrets
Policy compliant - But adds operational overhead
Option C: OIDC Federation
- Create Managed Identity
- Configure GitHub OIDC trust
- Token fetched in pipeline
- No cron job needed
Policy compliant - Zero operational overhead
Decision
We chose Option C: OIDC Federated Credentials.
While the platform team's rotating key solution (Option B) is policy compliant, it requires maintaining a cron job to continuously fetch and update credentials. OIDC eliminates this operational burden - the bearer token is obtained directly within the GitHub Actions workflow at runtime, with no external synchronization required.
Rationale
| Criteria | Static Secrets | Platform Rotating | OIDC |
|---|---|---|---|
| Policy Compliant | No | Yes | Yes |
| Secret Management | Manual rotation | Cron job required | No secrets needed |
| Security Risk | Long-lived credential leak | 4-day window if compromised | ~10 min window |
| Operational Overhead | Annual rotation | Cron job maintenance | Set and forget |
| Token Lifetime | 1-2 years | 4 days max | ~10 minutes |
| Failure Mode | Expired secret breaks deploy | Cron failure breaks deploy | Self-contained in pipeline |
| Scope Control | Per application | Per application | Per repo/branch/env |
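The per-repo/branch scope control in the last row is expressed through the federated credential's subject claim. A minimal Terraform sketch, with hypothetical names (the bcgov/ai-hub repo slug is illustrative):

```hcl
# Hypothetical sketch: bind a managed identity to one GitHub repo and branch.
resource "azurerm_user_assigned_identity" "ci" {
  name                = "id-github-deployer"   # hypothetical name
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"            # hypothetical resource group
}

resource "azurerm_federated_identity_credential" "github" {
  name                = "github-main-branch"
  resource_group_name = azurerm_user_assigned_identity.ci.resource_group_name
  parent_id           = azurerm_user_assigned_identity.ci.id
  audience            = ["api://AzureADTokenExchange"]
  issuer              = "https://token.actions.githubusercontent.com"
  # Only workflows on this repo's main branch can exchange tokens.
  subject             = "repo:bcgov/ai-hub:ref:refs/heads/main"   # illustrative repo
}
```

Tightening the subject (e.g., to a specific environment) narrows which workflow runs can obtain a token at all.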
Consequences
Positive
- Zero secrets to rotate - Eliminates credential management overhead
- Reduced blast radius - Tokens valid for minutes, not years
- Fine-grained access - Can restrict to specific branches/environments
- Better audit trail - Every token exchange is logged with JWT claims
- No secret sprawl - Secrets don't end up in logs, config files, or developer machines
Negative
- More complex initial setup - Federated credential configuration is more involved
- Newer technology - Less documentation and community examples available
- GitHub dependency - Tightly coupled to GitHub's OIDC provider
Neutral
- Requires understanding of JWT claims and subject matching
- Debugging auth failures requires knowledge of OIDC flow
References
ADR-003: Optional Use of Azure Bastion for VM Access Accepted
Context
Azure Bastion is the one approved exception to the "no public endpoints" rule. It's an Azure-managed PaaS service with a public IP that provides secure browser-based access to private VMs. The key point: Bastion has the public IP, not the VMs themselves.
Public Endpoints in the Architecture
| Service | Has Public IP? | Why |
|---|---|---|
| Azure Bastion | Yes (exception) | Required for browser-based VM access. Azure-managed, AAD-authenticated. This is the public-to-private bridge for admin access. |
| App Gateway | Yes | Receives traffic from public internet with WAF (Web Application Firewall). |
| APIM | No (internal) | Deployed in internal mode, sits behind App Gateway. No public exposure. |
| VMs (Jumpbox) | No | Private IPs only. Access via Bastion. |
| Storage, Key Vault, etc. | No | Private endpoints only. Public access disabled. |
Operators need secure access to jumpbox VMs for debugging and administration. The VMs cannot have public IPs per policy. We evaluated:
Option A: VPN Gateway
Point-to-site VPN for developer access
Option B: Public IP + NSG
Expose SSH/RDP with IP allowlisting
Option C: Azure Bastion
Browser-based RDP/SSH via Azure Portal
Decision
We chose Option C: Azure Bastion.
Rationale
| Criteria | VPN Gateway | Public IP | Bastion |
|---|---|---|---|
| Setup Complexity | High (certs, clients) | Low | Medium |
| Client Requirements | VPN client software | SSH/RDP client | Browser only |
| Attack Surface | VPN endpoint | High (exposed ports) | Minimal |
| Cost | ~$140/month | ~$5/month | ~$140/month (when on) |
| On-demand | No (always on) | Yes | Yes (can destroy) |
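The on-demand deploy/destroy pattern above can be sketched in Terraform, with hypothetical names (note the subnet must literally be named AzureBastionSubnet):

```hcl
# Hypothetical sketch: Bastion deployed on demand to control the ~$140/mo cost.
resource "azurerm_public_ip" "bastion" {
  name                = "pip-bastion"        # hypothetical name
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"          # hypothetical resource group
  allocation_method   = "Static"
  sku                 = "Standard"           # Bastion requires a Standard SKU IP
}

resource "azurerm_bastion_host" "main" {
  name                = "bastion-ai-hub"     # hypothetical name
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"

  ip_configuration {
    name                 = "config"
    subnet_id            = azurerm_subnet.bastion.id   # must be "AzureBastionSubnet"
    public_ip_address_id = azurerm_public_ip.bastion.id
  }
}
```

A workflow can apply or destroy just these two resources to turn Bastion on and off.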
Consequences
Positive
- No public IPs on VMs - VMs only have private IPs
- No client software - Works from any browser
- Azure AD integration - Uses existing identity
- Session recording - Can enable for audit
- Cost control - Can deploy/destroy on demand via workflow
Negative
- Azure Portal dependency - Must use Azure UI or CLI
- File transfer limitations - No native SCP/SFTP
- Latency - Browser-based adds some lag
References
ADR-004: Private Endpoints for All Azure Services Accepted
Context
Azure PaaS services (Storage Accounts, Key Vaults, Databases, etc.) by default have public endpoints accessible from the internet. BC Gov policy requires these to be locked down.
Policy Requirements
What Policy Prohibits
- Public endpoints on any Azure service
- Key Vaults with public network access
- Databases with public connectivity
- Any service reachable without VNet integration
What Policy Requires
- Private Endpoints for all PaaS services
- Private DNS zones for name resolution (managed by Platform Services)
- VNet integration for all access
- Network Security Groups controlling traffic
- "Deny public access" enabled on all services
Implementation
| Service | Private Endpoint | DNS Zone (Platform Services) |
|---|---|---|
| Key Vault (if used) | privateEndpoint-vault | privatelink.vaultcore.azure.net |
| Container Registry (if used) | privateEndpoint-acr | privatelink.azurecr.io |
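As an illustration of the pattern, a Key Vault private endpoint wired to the Platform Services DNS zone might look like this (hypothetical names):

```hcl
# Hypothetical sketch: private endpoint for Key Vault with DNS zone group.
resource "azurerm_private_endpoint" "vault" {
  name                = "privateEndpoint-vault"
  location            = "canadacentral"
  resource_group_name = "rg-ai-hub"                  # hypothetical resource group
  subnet_id           = azurerm_subnet.endpoints.id  # hypothetical subnet

  private_service_connection {
    name                           = "vault-connection"
    private_connection_resource_id = azurerm_key_vault.main.id   # hypothetical vault
    subresource_names              = ["vault"]
    is_manual_connection           = false
  }

  # Registers the endpoint in privatelink.vaultcore.azure.net so the
  # vault's hostname resolves to its private IP inside the VNet.
  private_dns_zone_group {
    name                 = "vault-dns"
    private_dns_zone_ids = [data.azurerm_private_dns_zone.vault.id]
  }
}
```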
Client Connectivity Model
ExpressRoute + App Gateway + APIM
BC Gov has ExpressRoute connectivity to Azure, but AI Hub clients will not access services directly via ExpressRoute. Instead:
- App Gateway: Provides ingress, WAF protection, and SSL termination
- APIM: API management layer for authentication, rate limiting, and routing
- Private Endpoints: Backend services (AI models, storage) remain fully private
Client flow: Client → App Gateway → APIM → Private Endpoint → Azure Service
Consequences
Challenges
- GitHub Actions cannot reach private endpoints directly - Requires the Chisel SOCKS tunnel or VNet-internal access (see ADR-001)
- Local development complexity - Developers cannot access resources without VPN/Bastion
- DNS resolution - Must configure private DNS zones correctly
- Debugging difficulty - Cannot easily test from outside the network
Workarounds
- Terraform State: Use the Chisel SOCKS tunnel (GitHub-hosted runner + Azure Proxy) to access the private storage endpoint. The use_azuread_auth = true setting enables Azure AD authentication for state access.
- Development: Use Bastion + Jumpbox for all Azure resource access (see ADR-003)
- CI/CD: Use the Chisel SOCKS tunnel for full private-endpoint access during Terraform runs (see ADR-001)
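A sketch of the matching backend block, with hypothetical resource names (rg-tfstate, sttfstate); use_azuread_auth and use_oidc are standard azurerm backend arguments:

```hcl
# Hypothetical sketch: Azure AD auth for state access, OIDC token from CI.
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"    # hypothetical
    storage_account_name = "sttfstate"     # hypothetical
    container_name       = "tfstate"
    key                  = "aihub.tfstate"
    use_azuread_auth     = true            # no storage access keys needed
    use_oidc             = true            # federated credential from GitHub Actions (ADR-002)
  }
}
```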
References
ADR-005: Zero-Dependency Documentation System Accepted
Context
We needed a documentation site for the project. Options considered:
Static Site Generators
- Jekyll (Ruby)
- Hugo (Go)
- Docusaurus (Node.js)
- MkDocs (Python)
Custom Bash Build
- Header/footer partials
- Variable substitution
- ~60 lines of shell script
- Zero external dependencies
Decision
We built a custom Bash-based static site generator.
Rationale
- Portability: Runs anywhere with Bash (Linux, Mac, WSL, CI)
- No dependency management: No npm, gem, pip, go install required
- Simplicity: Anyone can understand the 60-line build script
- Speed: Builds in milliseconds
- GitHub Pages native: No special plugins or build configurations
- AI-friendly: HTML generation is trivial for AI assistants
Consequences
Positive
- Zero build dependencies to maintain or update
- Works in any environment without setup
- Easy to understand and modify
- No security vulnerabilities from npm packages
Negative
- No built-in Markdown support (write HTML directly)
- No automatic table of contents generation
- No built-in search (added custom client-side solution)
Mitigations
- Created template page with all components for easy copy-paste
- AI assistants generate HTML as easily as Markdown
- Added custom SVG viewer for diagrams
ADR-006: Terraform as Infrastructure as Code (IaC) Accepted
Context
This repo needs a repeatable, auditable way to provision and update Azure infrastructure (networking, private endpoints, RBAC, diagnostics, and PaaS services) under BC Government policy constraints.
Given the Landing Zone design (private endpoints, limited portal use, and CI/CD execution from within the VNet), we need Infrastructure as Code that supports:
- Idempotent, reviewable changes (pull requests as the change record)
- Policy-driven patterns (private endpoints, NSGs, diagnostics settings)
- Composable modules (prefer Azure Verified Modules where possible)
- Automation via GitHub Actions using OIDC and Chisel SOCKS tunnel
Options Considered
Terraform (selected)
- Large ecosystem and Azure provider support
- Strong module approach (including AVM for Terraform)
- Works well in CI/CD and supports multi-environment workflows
Alternatives
- Bicep / ARM templates
- Pulumi
- Portal-based configuration (click-ops)
Decision
We use Terraform as the primary Infrastructure as Code (IaC) tool for this repo.
Rationale
- Fits Landing Zone operations: Terraform runs cleanly on GitHub-hosted runners via the Chisel SOCKS tunnel, providing data-plane access to private endpoints when required.
- Standardization via modules: We can prefer Azure Verified Modules (AVM) for consistent, policy-aligned deployments.
- Auditable change control: Plans and applies can be gated by pull request review and CI checks.
- Multi-environment support: Variables and modules make it straightforward to deploy dev/test/prod consistently.
- Sustainability: Terraform's widespread adoption ensures long-term community and vendor support. Team members working across AWS, Azure, and OpenShift can use one IaC tool consistently across the stack.
Consequences
Positive
- Repeatable deployments - Same inputs produce the same infrastructure
- Versioned infrastructure - Git history becomes the change log
- Policy-aligned defaults - Modules can encode private endpoint and logging patterns
Negative
- Learning curve - Contributors must understand Terraform workflows
- State management - Requires careful backend configuration and access controls
- Upgrades - Provider/module version bumps require ongoing maintenance
Mitigations
- Use pinned module versions and keep provider versions explicit
- Use CI to run terraform fmt, terraform validate, and plans
- Prefer AVM modules over custom resources where feasible
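The version-pinning mitigation can be sketched as follows (the Terraform and provider versions are illustrative; the module version comes from the storage example later in this page):

```hcl
# Hypothetical sketch: explicit pins for Terraform, providers, and modules.
terraform {
  required_version = ">= 1.9.0"   # illustrative minimum
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"          # illustrative pin; bump deliberately, not implicitly
    }
  }
}

module "storage" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.6.7"               # exact module pin
  # ...module inputs elided...
}
```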
References
- Terraform Reference - Complete module, variable, and output docs
- Terraform Documentation
- Azure Verified Modules (Terraform index)
ADR-007: Client Connectivity via App Gateway + APIM Accepted
Context
BC Government Platform Services has established ExpressRoute connectivity between on-premises networks and Azure. For this AI Hub Landing Zone, the question arose: should ministry applications connect directly to AI services via ExpressRoute, or through a managed ingress layer?
Options Considered
Option A: Direct ExpressRoute Access
Clients connect directly to private endpoints via ExpressRoute.
- Lowest latency (no middlemen)
- Simpler architecture for single-tenant
- ExpressRoute is provided by Platform Services
Case-by-Case: Available upon request with justification. Requires separate security review as it bypasses WAF, rate limiting, and centralized audit logging.
Option B: App Gateway + APIM (Recommended)
All traffic flows through App Gateway and APIM before reaching backends.
- WAF protection at ingress
- Centralized authentication
- Rate limiting and quotas per tenant
- Complete audit trail
- Consistent multi-tenant isolation
Recommended: Standard path for all AI Hub clients
Decision
For a clear multi-tenant platform, we recommend all clients connect through App Gateway → APIM → Private Endpoints.
ExpressRoute connectivity is provided by Platform Services and is available. However, to support consistent security controls, audit logging, and fair resource allocation across multiple ministries, we recommend the App Gateway + APIM path as the standard connectivity model.
Traffic Flow
┌─────────────────────────────────────────────────────────────────────────┐
│                        CLIENT CONNECTIVITY MODEL                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  On-Premises (Ministry Apps)         Azure Landing Zone                 │
│                                                                         │
│  ┌──────────────┐                                                       │
│  │ Ministry App │──┐                 ┌─────────────────────────────┐    │
│  │ (Health)     │  │                 │ App Gateway                 │    │
│  └──────────────┘  │   ExpressRoute  │  • SSL Termination          │    │
│  ┌──────────────┐  ├────────────────▶│  • WAF Protection           │    │
│  │ Ministry App │  │                 │  • DDoS Mitigation          │    │
│  │ (SDPR)       │──┤                 └──────────────┬──────────────┘    │
│  └──────────────┘  │                                ▼                   │
│  ┌──────────────┐  │                 ┌─────────────────────────────┐    │
│  │ Ministry App │──┘                 │ APIM                        │    │
│  │ (Justice)    │                    │  • Authentication (API Keys)│    │
│  └──────────────┘                    │  • Rate Limiting            │    │
│                                      │  • Request Validation       │    │
│                                      │  • Ministry Routing         │    │
│                                      │  • Audit Logging            │    │
│                                      └──────────────┬──────────────┘    │
│                                                     ▼                   │
│                                      ┌─────────────────────────────┐    │
│                                      │ Private Endpoints           │    │
│                                      │  • AI Foundry               │    │
│                                      │  • Azure OpenAI             │    │
│                                      │  • AI Search                │    │
│                                      │  • Storage (RAG docs)       │    │
│                                      └─────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘
Security Controls at Each Layer
| Layer | Security Control | Purpose |
|---|---|---|
| App Gateway | Web Application Firewall (WAF) | Block OWASP top 10, SQL injection, XSS |
| App Gateway | SSL/TLS Termination | Enforce HTTPS, manage certificates |
| App Gateway | DDoS Protection | Mitigate volumetric attacks |
| APIM | Subscription Keys | Identify and authenticate ministry |
| APIM | Rate Limiting | Prevent abuse, ensure fair usage |
| APIM | Request Validation | Validate payload structure |
| APIM | Audit Logging | Track all requests with ministry context |
| Private Endpoints | Network Isolation | Backend services unreachable from internet |
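The per-ministry subscription keys in the APIM rows above could be provisioned in Terraform along these lines (hypothetical names):

```hcl
# Hypothetical sketch: one APIM subscription per ministry, so every request
# carries a key that identifies the tenant for auth, rate limits, and billing.
resource "azurerm_api_management_subscription" "health" {
  api_management_name = azurerm_api_management.main.name   # hypothetical APIM instance
  resource_group_name = "rg-ai-hub"                        # hypothetical resource group
  display_name        = "ministry-health"
  state               = "active"
}
```

Rate-limit and routing policies can then reference the subscription to enforce per-tenant quotas.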
Consequences
Positive
- Defense in depth - Multiple security layers before reaching backends
- Centralized policy - All clients subject to same rules
- Audit compliance - Every request logged with full context
- Flexibility - Can update policies without changing backends
- Cost attribution - Can track usage per ministry via APIM metrics
Negative
- Added latency - Two extra hops (App Gateway + APIM)
- Cost - App Gateway and APIM have significant monthly costs
- Complexity - More components to configure and maintain
Mitigations
- Latency is typically <10ms additional per hop
- Costs are shared across all ministries (per ADR-010)
- Infrastructure-as-code manages complexity
References
ADR-008: No Azure Portal or Foundry Studio UI Access Pending Platform/MS
Context
Microsoft's standard approach to AI Landing Zones assumes users will manage resources through:
- Azure Portal (portal.azure.com)
- AI Foundry Studio (ai.azure.com)
- Azure Machine Learning Studio
All of these require public endpoint access to the Azure services being managed.
The Problem: UI Requires Public Endpoints
User Browser (Public Internet)
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ ai.azure.com / portal.azure.com (Public Endpoint) │
│ │
│ "To manage your AI Foundry project, we need to reach │
│ your Azure resources over the public internet" │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Your AI Foundry / Storage / Search │ │
│ │ │ │
│ │ ❌ PUBLIC ENDPOINT REQUIRED │ │
│ │ BC Gov Policy: DENIED │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Decision
No browser-based UI access is supported for tenant management.
All resource provisioning and management must occur through:
- Terraform/AVM modules - Infrastructure as Code
- Azure CLI - Via Chisel tunnel (CI/CD) or Jumpbox (admin only)
- REST APIs - Via APIM with subscription keys
What This Means for Tenants
Can Do (View Only)
- Browse to portal.azure.com
- See resource groups and resources
- View configurations (read-only)
- Navigate AI Foundry Studio UI
- See project structure
Cannot Do (UI Blocked)
- Create/modify resources via Portal
- Upload documents in Foundry Studio
- Test models in Foundry playground
- Configure settings via web forms
- Any operation requiring service connection
Supported Methods
- Request resources via Terraform PR
- Upload documents via API (APIM)
- Query AI models via API (APIM)
- Automate via CLI in pipelines
- Use SDK from within VNet
Why Does This Happen?
When you click "Upload Document" in Foundry Studio, the browser (on public internet) tries to connect directly to your Storage Account. But your Storage Account has no public endpoint - it only accepts connections from within the VNet via Private Endpoint.
Browser (Public) → Storage Account (Private Only) = ❌ Connection Refused
Pipeline (VNet) → Storage Account (Private EP) = ✅ Works
The UI shows the resource exists, but can't actually interact with it.
Why Not Provide Bastion Access to Everyone?
This was considered and rejected:
| Approach | Problem |
|---|---|
| Bastion + VM per tenant | Not scalable (20 ministries = 20 VMs = $$$), security nightmare |
| Shared Jumpbox for all | Multi-tenant isolation violated, credential management chaos |
| VPN per tenant | Massive operational overhead, not self-service |
Result: Bastion/Jumpbox is for platform team administration only. Tenants interact via API.
AVM Module Requirement
Because all management must be IaC-based, only services with Azure Verified Modules (AVM) are supported.
Supported AVM Modules
- cognitiveservices-account
- machinelearningservices-workspace
- search-searchservice
- storage-storageaccount
- documentdb-databaseaccount
- keyvault-vault
- containerservice-managedcluster
- containerregistry-registry
- apimanagement-service
Full catalog: azure.github.io/Azure-Verified-Modules
Path Forward: Secure UI Access
Consequences
Positive
- Policy compliant - No public endpoints ever exposed
- Reproducible - All infrastructure is code, auditable, version controlled
- Scalable - Onboard 100 tenants with same process as 1
- Secure - No browser-based attack surface
Negative
- Steeper learning curve - Tenants must use IaC, not click-ops
- No visual management - Can't "see" resources in Portal
- Debugging harder - Must use CLI/API, not browser dev tools
- Microsoft disconnect - Their guidance assumes UI access
References
ADR-009: Why AI Landing Zone vs Custom Solution Accepted
Context
A valid question arises: If we're customizing everything for BC Gov requirements anyway, why use Microsoft's AI Landing Zone at all? Why not just build our own custom solution from scratch?
This ADR explains what value the AI Landing Zone and Azure Verified Modules (AVM) actually provide, even when we can't use Microsoft's default assumptions.
What AI Landing Zone Actually Provides
The Real Value: AVM Modules
Building From Scratch
If we wrote our own Terraform modules:
- Write 1000+ lines of Terraform per service
- Handle every Azure API change ourselves
- Debug edge cases Microsoft already solved
- Maintain security patches ourselves
- No community validation or review
- Reinvent private endpoint patterns
- Figure out RBAC configurations
- Test against every Azure region
Using AVM Modules wherever possible
With Azure Verified Modules:
- ~50 lines of Terraform to deploy a service
- Microsoft maintains API compatibility
- Edge cases handled by module maintainers
- Security updates pushed automatically
- Community tested, Microsoft validated
- Private endpoint patterns built-in
- RBAC best practices included
- Tested across all Azure regions
What We Get From AI Landing Zone Reference
| Component | What Microsoft Provides | What We Customize |
|---|---|---|
| Network Topology | Hub-spoke pattern, subnet sizing guidance, NSG rule templates | IP ranges, Canada regions, Platform Services DNS integration |
| AI Foundry Setup | AVM module for workspace, project structure, compute patterns | Disable public access, multi-tenant project isolation |
| Private Endpoints | Patterns for connecting services privately, DNS zone integration | Link to Platform Services DNS, IP allocation per tenant |
| APIM Integration | AVM module, backend pool patterns, policy templates | Subscription per tenant, OpenAPI routing, rate limits |
| Security Baseline | RBAC templates, managed identity patterns, Key Vault integration | BC Gov RBAC requirements, ministry-level isolation |
AVM Module Maturity - Honest Assessment
| Module | Status | Maturity | Notes |
|---|---|---|---|
| avm-res-storage-storageaccount | Released | Production Ready | Mature, well-tested |
| avm-res-keyvault-vault | Released | Production Ready | Mature, well-tested |
| avm-res-network-virtualnetwork | Released | Production Ready | Mature, well-tested |
| avm-res-network-applicationgateway | Released | Production Ready | Mature, well-tested |
| avm-res-network-bastionhost | Released | Production Ready | Mature, well-tested |
| avm-res-documentdb-databaseaccount | Released | Production Ready | Cosmos DB - mature |
| avm-res-apimanagement-service | Pending | Early Release | v0.0.5 - may need custom work |
| avm-res-cognitiveservices-account | Pending | Maturing | OpenAI, Doc Intel - verify features |
| avm-res-search-searchservice | Pending | Maturing | AI Search - verify private EP support |
| avm-res-machinelearningservices-workspace | Pending | Maturing | Core resource for Foundry Hub/Project |
| avm-ptn-aiml-ai-foundry | In Development | Not Production Ready | Pattern module - active development |
| avm-ptn-ai-foundry-enterprise | Archived | Abandoned | Archived July 2025 - do not use |
| avm-res-containerservice-managedcluster | Pending | Pre-release | AKS - v0.4.0-pre2 |
What This Means
Safe to Use AVM
- Virtual Networks, Subnets, NSGs
- Storage Accounts
- Key Vault
- Bastion Host
- App Gateway
- Cosmos DB
Evaluate / May Need Custom
- AI Foundry (use resource module, not pattern)
- APIM (early version, test thoroughly)
- Cognitive Services (verify private EP)
- AI Search (verify features needed)
- AKS (pre-release)
For AI Foundry, use the `avm-res-machinelearningservices-workspace` resource module rather than the pattern modules. Be prepared to write custom Terraform for AI-specific configurations that AVM doesn't yet support.
Concrete Example: Storage Account
Without AVM (Custom from scratch)
```hcl
# ~200 lines of Terraform to handle:
resource "azurerm_storage_account" "main" { ... }
resource "azurerm_storage_account_network_rules" "main" { ... }
resource "azurerm_private_endpoint" "blob" { ... }
resource "azurerm_private_endpoint" "file" { ... }
resource "azurerm_private_endpoint" "queue" { ... }
resource "azurerm_private_dns_zone_virtual_network_link" "blob" { ... }
resource "azurerm_role_assignment" "contributor" { ... }
resource "azurerm_role_assignment" "reader" { ... }
resource "azurerm_monitor_diagnostic_setting" "main" { ... }
# Plus encryption, lifecycle policies, soft delete, versioning...
# Plus handling for every Azure API version change...
```
With AVM Module
```hcl
module "storage" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.6.7"

  name                = "ministryhealthstorage"
  resource_group_name = azurerm_resource_group.ministry.name
  location            = "canadacentral"

  # Private endpoints - one line enables the pattern
  private_endpoints = {
    blob = { subnet_id = module.network.subnet_ids["private-endpoints"] }
  }

  # All the security, RBAC, monitoring handled by module
  tags = var.common_tags
}
```
Result: Same outcome, 10x less code, maintained by Microsoft.
Why Not 100% Custom?
Custom = Maintenance Burden
- Azure releases ~100 API changes/month
- Each change could break custom modules
- Security vulnerabilities need patching
- New features require manual implementation
- Team turnover = knowledge loss
- Who maintains this in 3 years?
AVM = Shared Maintenance
- Microsoft + community maintain modules
- API changes handled upstream
- Security patches published as versions
- New features added automatically
- Documentation maintained centrally
- Sustainable long-term
What We're Actually Doing
Our Approach: AVM + BC Gov Customization Layer
```
┌──────────────────────────────────────────────────────────┐
│                BC Gov AI Hub Architecture                │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │       BC Gov Customization Layer (Our Code)        │  │
│  │  • Multi-tenant resource group structure           │  │
│  │  • APIM subscription per ministry                  │  │
│  │  • IP allocation policies                          │  │
│  │  • Platform Services DNS integration               │  │
│  │  • Canada data residency enforcement               │  │
│  └─────────────────────────┬──────────────────────────┘  │
│                            ▼                             │
│  ┌────────────────────────────────────────────────────┐  │
│  │   Azure Verified Modules (Microsoft Maintained)    │  │
│  │  • avm-res-cognitiveservices-account               │  │
│  │  • avm-res-machinelearningservices-workspace       │  │
│  │  • avm-res-storage-storageaccount                  │  │
│  │  • avm-res-network-virtualnetwork                  │  │
│  │  • avm-res-apimanagement-service                   │  │
│  │  • ... 100+ more modules                           │  │
│  └─────────────────────────┬──────────────────────────┘  │
│                            ▼                             │
│  ┌────────────────────────────────────────────────────┐  │
│  │            Azure Resource Manager APIs             │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
We write the thin customization layer. Microsoft maintains the heavy lifting.
Decision
We use AI Landing Zone reference architecture and AVM modules as our foundation, with a BC Gov customization layer on top.
We do NOT use Microsoft's default configuration. We use their:
- Reference patterns - How to wire services together
- AVM modules - Tested, maintained infrastructure components
- Best practices - Security, networking, RBAC patterns
Consequences
Positive
- Reduced maintenance - Microsoft maintains 90% of the code
- Faster development - Use proven patterns instead of inventing
- Security updates - Get patches via module version bumps
- Community support - Issues/bugs reported and fixed by others
- Audit trail - Using "official" modules helps with compliance
Negative
- Module constraints - Can only do what AVM modules support
- Version management - Must track and update module versions
- Abstraction leakage - Sometimes need to understand module internals
References
ADR-010: Multi-Tenant Isolation Model Accepted
Context
The AI Hub Landing Zone is designed to serve multiple BC Government ministries from a single shared infrastructure deployment. This creates a multi-tenancy challenge: how do we provide cost-efficient shared services while ensuring strict data isolation between ministries?
The Problem: Shared Infrastructure, Isolated Data
What We Cannot Do
- Allow Ministry A to access Ministry B's documents
- Share AI search indexes across ministries
- Use single storage accounts for all ministry data
- Allow cross-ministry API access without authorization
What We Can Share
- Network infrastructure (VNets, Bastion, NSGs)
- Compute resources (AI Foundry Hub)
- Ingress layer (App Gateway, APIM)
- Monitoring and logging infrastructure
Decision
We implement a four-layer isolation model:
1. Storage Isolation
Separate storage accounts per ministry
- Each ministry gets dedicated blob containers
- Azure RBAC restricts access to ministry principals
- Private endpoints per storage account
- Encryption keys can be ministry-specific (CMK)
2. AI Search Index Isolation
Separate search indexes per ministry
- Each ministry's documents indexed separately
- RAG queries scoped to ministry index only
- No cross-index queries permitted
- Index-level access control via Azure RBAC
3. API Isolation (APIM)
APIM subscriptions per ministry
- Unique subscription keys per ministry
- Rate limiting scoped to subscription
- API policies enforce ministry context
- Audit logging includes ministry identifier
4. Network Isolation (NSGs)
Network policies enforce boundaries
- NSG rules restrict subnet-to-subnet traffic
- Private endpoints isolated by ministry where needed
- No direct cross-ministry network paths
- All traffic routed through controlled ingress
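The four layers above share one principle: the tenant context is resolved server-side and never taken from the request body. A minimal Python sketch of that rule for layer 2 (index isolation) follows; the ministry keys and index names are hypothetical, not the platform's actual configuration.

```python
# Hypothetical ministry → index mapping; real names would come from
# per-tenant Terraform outputs, not a hard-coded dict.
MINISTRY_INDEXES = {
    "health": "idx-health-docs",
    "sdpr": "idx-sdpr-docs",
    "justice": "idx-justice-docs",
}

def resolve_index(ministry_id: str) -> str:
    """Return the only index this ministry may query; deny anything else."""
    try:
        return MINISTRY_INDEXES[ministry_id]
    except KeyError:
        raise PermissionError(f"unknown ministry: {ministry_id}")

def build_rag_query(ministry_id: str, question: str) -> dict:
    # The index is derived from the authenticated ministry context, so a
    # request can never name another ministry's index (no cross-index queries).
    return {"index": resolve_index(ministry_id), "search": question}
```

Because the caller supplies only a question, a compromised client in one ministry cannot redirect a query at another ministry's index.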
Implementation Architecture
```
┌────────────────────────────────────────────────────────────┐
│                MULTI-TENANT ISOLATION MODEL                │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Ministry Health     Ministry SDPR     Ministry Justice    │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │ APIM Sub: H │    │ APIM Sub: S │    │ APIM Sub: J │     │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘     │
│         │                  │                  │            │
│         ▼                  ▼                  ▼            │
│  ┌──────────────────────────────────────────────────┐      │
│  │        Shared APIM (Policy Enforcement)          │      │
│  │   (validates subscription → routes to ministry)  │      │
│  └──────────────────────────────────────────────────┘      │
│         │                  │                  │            │
│         ▼                  ▼                  ▼            │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │ Storage: H  │    │ Storage: S  │    │ Storage: J  │     │
│  │ Index: H    │    │ Index: S    │    │ Index: J    │     │
│  └─────────────┘    └─────────────┘    └─────────────┘     │
│                                                            │
│  ════════════════════════════════════════════════════      │
│              Shared Infrastructure Layer                   │
│    (VNet, Bastion, AI Foundry Compute, Monitoring)         │
│  ════════════════════════════════════════════════════      │
│                                                            │
└────────────────────────────────────────────────────────────┘
```
Consequences
Positive
- Strong data isolation - Ministry data never co-mingled
- Cost efficiency - Shared compute and network infrastructure
- Audit compliance - Clear ministry attribution in all logs
- Scalable onboarding - New ministries get isolated resources automatically
- Flexible isolation levels - Can increase isolation (dedicated compute) if needed
Negative
- Resource multiplication - Each ministry needs separate storage/indexes
- Complexity - More resources to manage and monitor
- Cost per ministry - Base cost increases with each ministry onboarded
Cost Tracking
Multi-tenant isolation also enables per-tenant cost tracking and chargeback. See the detailed cost allocation strategy:
References
- Cost Tracking Documentation - Full cost allocation strategy
- Multi-Tenant Isolation Diagram
- Azure Multi-Tenant Architecture Guide
- AI Request Data Flow Diagram
ADR-011: Control Plane vs Data Plane Access & Chisel Tunnel Pending
Context
Azure services have two distinct access paths that are often confused:
Control Plane (ARM APIs)
What: Managing Azure resources - create, update, delete, configure
Endpoint: management.azure.com (always public)
Authentication: OIDC tokens, Service Principals, Managed Identity
Examples:
- Creating a Key Vault
- Configuring a Storage Account
- Setting up Private Endpoints
- RBAC role assignments
- Azure Portal UI (viewing resources)
Data Plane (Service-specific APIs)
What: Accessing data inside resources
Endpoint: *.vault.azure.net, *.blob.core.windows.net, etc.
Authentication: Same tokens, BUT requires network access
Examples:
- Reading/writing Key Vault secrets
- Reading/writing Storage blobs
- Querying databases (PostgreSQL, CosmosDB)
- Calling Azure OpenAI APIs
- Terraform state file read/write
The Problem: Private Endpoints Block Data Plane
```
┌─────────────────────────────────────────────────────────────────┐
│                  WHAT WORKS vs WHAT'S BLOCKED                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ From Public Internet (GitHub Actions, Azure Portal, Your        │
│ Laptop):                                                        │
│                                                                 │
│ ✅ CONTROL PLANE (management.azure.com)                         │
│    • Create Key Vault              → ARM API → Works            │
│    • Create Storage Account        → ARM API → Works            │
│    • View resource properties in Portal → ARM API → Works       │
│    • OIDC authentication           → Azure AD → Works           │
│                                                                 │
│ ❌ DATA PLANE (*.vault.azure.net, *.blob.core.windows.net)      │
│    • Read Key Vault secret         → Private Endpoint → BLOCKED │
│    • Write Storage blob            → Private Endpoint → BLOCKED │
│    • "View Secret Value" in Portal → Private Endpoint → BLOCKED │
│    • Terraform state read/write    → Private Endpoint → BLOCKED │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ From Inside VNet (Chisel Tunnel, Jumpbox, or Optional           │
│ Self-hosted Runners):                                           │
│                                                                 │
│ ✅ CONTROL PLANE → Still works (ARM is public)                  │
│ ✅ DATA PLANE    → Works via Private Endpoints (network path    │
│                    exists)                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Why Azure Portal Shows Resources But Not Data
The Azure Portal is a Control Plane UI. When you browse to a Key Vault in the portal:
- ✅ You can see the vault exists (control plane: list resources)
- ✅ You can see its configuration (control plane: read properties)
- ❌ You CANNOT click "Show Secret Value" (data plane: blocked by private endpoint)
This is why ADR-008 states "No Azure Portal or Foundry Studio UI Access" for data operations - the portal physically cannot reach private data plane endpoints from your browser.
Why OIDC Is Used for Control Plane
OIDC (OpenID Connect) federation provides passwordless authentication from GitHub Actions to Azure:
```
┌───────────────────────────────────────────────────────────────────┐
│                     OIDC AUTHENTICATION FLOW                      │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  1. GitHub Actions workflow starts                                │
│     ↓                                                             │
│  2. GitHub OIDC Provider issues JWT token                         │
│     (claims: repo, branch, environment, workflow)                 │
│     ↓                                                             │
│  3. Token sent to Azure AD                                        │
│     ↓                                                             │
│  4. Azure AD validates against Federated Credential               │
│     (checks: issuer=GitHub, subject=repo:bcgov/ai-hub-tracking:…) │
│     ↓                                                             │
│  5. Azure AD issues Access Token (valid ~1 hour)                  │
│     ↓                                                             │
│  6. Access Token used for ARM API calls (Control Plane)           │
│     ↓                                                             │
│  ✅ terraform plan/apply (resource management)                    │
│  ✅ az cli commands (control plane operations)                    │
│  ✅ No secrets stored in GitHub!                                  │
│                                                                   │
│  BUT: OIDC gives credentials, NOT network access.                 │
│  Data plane still blocked without VNet connectivity.              │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```
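Step 4 of the flow is the part that makes this passwordless: Azure AD only exchanges the GitHub JWT for an access token when the token's claims match the registered federated credential. The sketch below models that matching rule; it is illustrative only (Azure AD's real validation also checks signatures, audience, and expiry), and the subject string is an example value.

```python
# Registered federated credential (example values, as in the flow above).
FEDERATED_CREDENTIAL = {
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:bcgov/ai-hub-tracking:ref:refs/heads/main",
}

def claims_match_credential(claims: dict) -> bool:
    """Accept the GitHub JWT only if issuer and subject match exactly."""
    return (
        claims.get("iss") == FEDERATED_CREDENTIAL["issuer"]
        and claims.get("sub") == FEDERATED_CREDENTIAL["subject"]
    )
```

A workflow running in a fork or on another branch presents a different `sub` claim and is refused, which is why no stored secret is needed.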
Decision: Multiple Access Methods for Different Needs
We provide four access methods, three of them toggleable via Terraform, each serving a different use case:
| Access Method | Terraform Toggle | Who Uses It | Use Case | Plane Access |
|---|---|---|---|---|
| Self-Hosted Runners | `github_runners_aca_enabled` | Other tenant repos | Optional: persistent VNet compute for CI/CD workloads that can't use Chisel | Control + Data |
| Bastion + Jumpbox | `enable_bastion`, `enable_jumpbox` | Platform Maintainers | Emergency debugging, manual admin tasks | Control + Data |
| Chisel Tunnel | `enable_azure_proxy` | Platform Maintainers | Local dev access to private databases/APIs | Control + Data |
| Public GitHub Runners | (default) | CI/CD (limited) | Control plane only operations | Control only |
Chisel Tunnel: Data Plane Access for Platform Maintainers
What is Chisel? A fast TCP/UDP tunnel over HTTP that creates a secure reverse proxy from your local machine into the Azure VNet.
```
┌────────────────────────────────────────────────────────────────┐
│                   CHISEL TUNNEL ARCHITECTURE                   │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Platform Maintainer's Laptop                                  │
│  ┌─────────────────────────────┐                               │
│  │  Chisel Client (Docker)     │                               │
│  │  Listens on localhost:5432  │                               │
│  └──────────────┬──────────────┘                               │
│                 │ HTTPS (encrypted)                            │
│                 ▼                                              │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Azure App Service (Chisel Server)                       │  │
│  │  Inside VNet (app-service-subnet)                        │  │
│  │  Random auth: tunnel:XXXXXXXX                            │  │
│  └──────────────┬───────────────────────────────────────────┘  │
│                 │ VNet Integration (Private Network)           │
│                 ▼                                              │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Private Endpoints                                       │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │  │
│  │  │  PostgreSQL  │  │  Key Vault   │  │  CosmosDB    │    │  │
│  │  │  :5432       │  │  :443        │  │  :443        │    │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘    │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
│  Result: psql -h localhost -p 5432 → tunnels to private        │
│  PostgreSQL                                                    │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
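In practice a maintainer runs the chisel client (typically in Docker) with a local-port-to-private-endpoint mapping. The helper below just assembles that invocation as a list of arguments; the server URL and host names are placeholders, not the deployed values.

```python
def chisel_client_args(server_url: str, auth: str, local_port: int,
                       remote_host: str, remote_port: int) -> list:
    """Build argv for a chisel client that forwards a local port into the VNet."""
    return [
        "chisel", "client",
        "--auth", auth,  # e.g. "tunnel:XXXXXXXX" (random per deployment)
        server_url,      # the Chisel server App Service endpoint
        f"{local_port}:{remote_host}:{remote_port}",  # local → private endpoint
    ]

# Example: expose a private PostgreSQL endpoint on localhost:5432
# (placeholder URL and host, not the real deployment).
args = chisel_client_args(
    "https://chisel-proxy.example.azurewebsites.net",
    "tunnel:XXXXXXXX", 5432, "postgres.example.private", 5432,
)
```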
What Terraform Operations Need Data Plane?
Most Terraform operations are control plane only. Data plane is only needed when:
| Resource/Operation | Plane | Works from Public? | Example |
|---|---|---|---|
| `azurerm_key_vault` (create) | Control | ✅ Yes | Creating the vault itself |
| `azurerm_key_vault_secret` (write) | Data | ❌ No | Writing secrets INTO the vault |
| `data "azurerm_key_vault_secret"` | Data | ❌ No | Reading secrets FROM the vault |
| Terraform state backend (Storage) | Data | ❌ No | Reading/writing `.tfstate` blob |
| `azurerm_storage_account` (create) | Control | ✅ Yes | Creating the account |
| `azurerm_storage_blob` (upload) | Data | ❌ No | Uploading files to storage |
| `azurerm_private_endpoint` | Control | ✅ Yes | Creating the private endpoint |
| RBAC role assignments | Control | ✅ Yes | Granting permissions |
Access Model Summary
```
┌─────────────────────────────────────────────────────────────────────┐
│                        COMPLETE ACCESS MODEL                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  TENANT DEVELOPERS (Ministry Teams)                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌───────────────────┐        │
│  │ Their Apps  │───▶│ App Gateway │───▶│ APIM (rate        │──▶ AI  │
│  │             │    │ + WAF       │    │ limited, metered) │   APIs │
│  └─────────────┘    └─────────────┘    └───────────────────┘        │
│                            ▲                                        │
│                            │ Public endpoint (by design)            │
│                            │ No direct private endpoint access      │
│                                                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  PLATFORM MAINTAINERS (This Team)                                   │
│                                                                     │
│  ┌──────────────────────┬────────────────────────────┬───────────┐  │
│  │ Method               │ Toggle Variable            │ Use Case  │  │
│  ├──────────────────────┼────────────────────────────┼───────────┤  │
│  │ Self-hosted Runners  │ github_runners_aca_enabled │ Optional  │  │
│  │                      │                            │ tenant    │  │
│  │                      │                            │ CI/CD     │  │
│  │ Bastion + Jumpbox    │ enable_bastion/jumpbox     │ Admin     │  │
│  │                      │                            │ access    │  │
│  │ Chisel Tunnel        │ enable_azure_proxy         │ Local dev │  │
│  │                      │                            │ access    │  │
│  └──────────────────────┴────────────────────────────┴───────────┘  │
│                                                                     │
│  All three provide: Control Plane + Data Plane access               │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
Consequences
Positive
- Clear mental model - Understanding control vs data plane explains "why" behind many decisions
- Flexible access - Enable only what you need (cost optimization)
- Policy compliant - No public data plane endpoints ever
- Secure tenant isolation - Tenants use APIM, never touch private endpoints directly
Negative
- Complexity - Must understand two planes, not just "Azure access"
- Chisel proxy dependency - Terraform workflows require the Azure Proxy App Service to be healthy before running
- Portal limitations - Can't "see" data in Portal even with permissions
References
ADR-012: Hybrid Cost Allocation & Usage Monitoring Strategy Proposed
Context
The AI Services Hub operates as a multi-tenant platform serving multiple ministries. This shared infrastructure creates a financial governance challenge: we must accurately allocate costs to specific tenants to ensure accountability and cost recovery.
The architecture includes two types of resources:
- Tenant-Dedicated: Resources used exclusively by one tenant (e.g., Azure OpenAI instances, Document Intelligence, Cosmos DB).
- Shared Infrastructure: Resources serving all tenants simultaneously (e.g., APIM, App Gateway, Application Insights).
We need a standardized strategy to track consumption and generate accurate chargeback invoices that account for both resource types across our dual-region deployment (Canada Central & East).
Decision
We will adopt a Hybrid Cost Allocation Model that combines native Azure billing with custom usage tracking:
- Direct Attribution (90% of costs): We will isolate high-cost resources (AI Foundry Projects, Document Intelligence) into tenant-specific resource groups. These will be billed directly to tenants using Azure Cost Management tags (`tenant-id`), requiring no manual calculation.
- Proportional Allocation (10% of costs): We will split the cost of shared infrastructure (APIM, App Gateway) based on actual usage metrics.
  - APIM & Gateway costs split by API Request Volume.
  - Monitoring costs split by Log Ingestion Volume.
- Centralized Tracking via APIM: Azure API Management will serve as the single source of truth for usage metrics. All traffic must flow through APIM to ensure consistent tenant identification and logging.
- Custom Egress Calculation: We will implement a custom calculation pipeline using App Gateway logs to track and charge for cross-region network egress (between Canada Central and Canada East).
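The proportional split is plain ratio arithmetic. The sketch below shows its shape with made-up request counts and a made-up monthly cost; the platform's actual chargeback logic lives in the custom Python/Kusto pipeline, not this snippet.

```python
def split_shared_cost(total_cost: float, requests_by_tenant: dict) -> dict:
    """Allocate a shared cost across tenants in proportion to request volume."""
    total_requests = sum(requests_by_tenant.values())
    if total_requests == 0:
        # No usage this billing period: fall back to an even split.
        even = total_cost / len(requests_by_tenant)
        return {tenant: round(even, 2) for tenant in requests_by_tenant}
    return {
        tenant: round(total_cost * count / total_requests, 2)
        for tenant, count in requests_by_tenant.items()
    }

# Example: $3,000 of shared APIM/App Gateway cost, split 60/30/10
# by request volume (figures invented for illustration).
shares = split_shared_cost(
    3000.0,
    {"health": 600_000, "sdpr": 300_000, "justice": 100_000},
)
```

A tenant that sends 60% of the month's requests pays 60% of the shared bill, which is the fairness argument made in the Rationale below.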
Rationale
- Minimizes Operational Overhead: By using Direct Attribution for the most expensive resources (OpenAI tokens, Search), we rely on Azure's native billing engine for the vast majority of chargebacks. We only maintain custom logic for the shared platform components.
- Fairness: A flat fee for shared infrastructure would be unfair to smaller tenants. Proportional allocation based on request volume ensures ministries only pay for the platform capacity they actually consume.
- Granular Visibility: Using APIM as the central logging point allows us to capture operational metrics (latency, errors, token usage) alongside billing data without adding sidecar proxies to every service.
Consequences
Positive
- Transparency: Tenants can verify their direct Azure costs in the portal using their tenant tag.
- Scalability: New tenants can be onboarded simply by adding tags; the cost model adjusts automatically.
- Cost Recovery: Ensures the Platform Team fully recovers infrastructure costs rather than absorbing the overhead of shared services.
Negative
- Complexity of Egress: Cross-region data transfer is billed at the subscription level and is difficult to attribute. We accept the tradeoff of maintaining a custom Kusto query to calculate this specific cost.
- Maintenance: The logic for splitting shared costs (Python functions/Kusto queries) is custom code that must be maintained and verified monthly against Azure invoices.
ADR-013: Scaled Stack Architecture with Isolated State Files Accepted
Context
The AI Services Hub infrastructure was originally deployed as a single Terraform root module with one monolithic state file containing ~174 resources. This created several operational problems:
- Blast radius: Any Terraform error or state corruption could affect all 174 resources simultaneously.
- Lock contention: Only one Terraform operation could run at a time across the entire infrastructure, even for independent resources.
- Serial execution: Foundry projects required `parallelism=1` due to Azure ETag conflicts, which forced the entire apply to run serially.
- Apply duration: A full apply took 5m 44s because independent stacks (APIM, foundry, tenant-user-mgmt) had to wait for each other.
- Phased targeting: The deployment script used `-target` flags to orchestrate a multi-phase apply (Phase 1: everything except foundry, Phase 2: foundry only, Phase 3: validation). This was fragile and hid dependency issues.
Decision
We split the monolithic root module into 5 isolated Terraform stacks, each with its own state file, backend configuration, and dependency management via `terraform_remote_state` data sources:
| Stack | State Key | Phase | Purpose |
|---|---|---|---|
| `shared` | `shared.tfstate` | 1 (serial) | VNet, subnets, AI Foundry Hub, App Gateway, WAF, Key Vault, ACR, monitoring |
| `tenant` | `tenant-{key}.tfstate` | 2 (parallel fan-out) | Per-tenant resources: AI Search, CosmosDB, Document Intelligence, Storage, Key Vault |
| `foundry` | `foundry.tfstate` | 3 (parallel) | AI Foundry projects per tenant (`parallelism=1` to avoid ETag conflicts) |
| `apim` | `apim.tfstate` | 3 (parallel) | API Management gateway, policies, tenant subscriptions, role assignments |
| `tenant-user-mgmt` | `tenant-user-management.tfstate` | 3 (parallel) | Entra ID user/group assignments (requires Graph API permissions) |
A stack engine (`deploy-scaled.sh`) orchestrates execution in dependency order:
- Phase 1: `shared` runs first (all other stacks depend on it).
- Phase 2: `tenant` runs per-tenant in parallel (each tenant gets isolated `TF_DATA_DIR` and state).
- Phase 3: `foundry`, `apim`, and `tenant-user-mgmt` run concurrently (no cross-dependencies between them).
For destroy, the order is reversed: Phase 3 stacks first (parallel), then tenants (parallel), then shared last.
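The apply/destroy ordering can be modeled in a few lines. This Python sketch only mirrors the phase logic of `deploy-scaled.sh` (which is a Bash script); the tenant keys are placeholders.

```python
# Phase groups in dependency order; stacks within a phase may run in parallel.
PHASES = [
    ["shared"],                               # Phase 1: serial foundation
    ["tenant:health", "tenant:sdpr"],         # Phase 2: per-tenant fan-out
    ["foundry", "apim", "tenant-user-mgmt"],  # Phase 3: concurrent, independent
]

def plan(destroy: bool = False) -> list:
    """Flatten stacks into execution order; destroy reverses the phases."""
    phases = list(reversed(PHASES)) if destroy else PHASES
    return [stack for phase in phases for stack in phase]
```

Reversing only the phase list (not the stacks inside a phase) preserves the rule that `shared` is created first and destroyed last.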
Rationale
- Reduced blast radius: A state corruption or failed apply in `apim` cannot affect `shared` or `tenant` resources. Each stack can be independently recovered.
- Parallel execution: Phase 3 stacks have no dependencies on each other and can run concurrently, reducing total apply time by ~60 seconds.
- Isolated parallelism: The `foundry` stack runs with `parallelism=1` without forcing the entire infrastructure to be serial. APIM and tenant-user-mgmt run with full parallelism simultaneously.
- Independent state locking: Multiple operators or CI pipelines can work on different stacks without lock contention (e.g., updating APIM policies while a tenant deployment is in progress).
- Cleaner dependencies: Using `terraform_remote_state` data sources makes cross-stack dependencies explicit and typed, replacing implicit module-to-module references.
Consequences
Positive
- 19% faster applies: Total apply time reduced from 5m 44s to 4m 39s (measured on test environment with 2 tenants).
- Eliminated `-target` phasing: The old Phase 1/2/3 with `-target` flags is replaced by natural stack ordering. No more fragile target expressions.
- Auto-recovery: The stack engine includes deposed object cleanup, import-on-conflict, and transient error retry — previously only available at the monolith level.
- Live output streaming: Each stack streams Terraform output in real time via `tee`, improving debuggability in CI/CD logs.
- Per-tenant state isolation: Each tenant has its own state file, making tenant onboarding/offboarding a state-level operation rather than a resource-level one.
Negative
- More files: 5 stacks × 5 standard files (main.tf, variables.tf, outputs.tf, providers.tf, backend.tf) = 25 files vs. the original 6. Some variable declarations are duplicated across stacks.
- State migration required: One-time migration from the monolith state to 5 stack states using `terraform state mv`. This was performed manually with verification scripts.
- Cross-stack refactoring: Moving a resource between stacks requires a state move operation, not just a code move. This adds operational complexity for future refactors.
- Remote state coupling: Stacks are coupled via `terraform_remote_state` data sources. Adding a new output in `shared` that `apim` needs requires deploying `shared` first.
References
- Terraform Remote State Data Source
- `infra-ai-hub/stacks/` — Stack root modules
- `infra-ai-hub/scripts/deploy-scaled.sh` — Stack engine
- `infra-ai-hub/scripts/deploy-terraform.sh` — Public entrypoint
ADR-014: APIM Subscription Key Rotation Proposed
This decision has been made by the AI Services Hub team and all infrastructure is in place. Final approval from the Security Threat and Risk Assessment (STRA) process is pending. The rotation mechanism can be enabled per-environment via a single config flag.
Context
APIM subscription keys authenticate tenant API calls to the AI Services Hub gateway. Without rotation, these long-lived secrets present a growing risk:
- Credential staleness: Keys provisioned at tenant onboarding remain valid indefinitely unless manually changed. A compromised key grants persistent access.
- No expiry enforcement: Azure APIM subscription keys have no built-in TTL or auto-expiry. The platform must implement rotation externally.
- BC Gov compliance: Government security policy expects secrets to be rotated periodically. The rotation interval and mechanism require STRA sign-off before production use.
- Multi-tenant blast radius: Each tenant has isolated subscription keys, but without rotation a single leaked key provides indefinite access to that tenant’s API surface.
Decision
We implement an alternating primary/secondary key rotation pattern with zero downtime, driven by a Container App Job (scheduled) deployed as a custom container:
- Alternating slots: APIM subscriptions have two key slots (primary and secondary). Each rotation cycle regenerates one slot while the other remains valid and untouched.
- Centralized hub Key Vault: After regeneration, both keys are written to a single hub Key Vault (`{app_name}-{env}-hkv`) with 90-day expiry and rotation metadata. No per-tenant Key Vault is required.
- Self-service retrieval: Tenants fetch current keys via `GET /{tenant}/internal/apim-keys` — an APIM policy endpoint that reads from the hub Key Vault using APIM's managed identity. No Azure SDK required.
- Configurable schedule: Rotation is controlled by two flags in `params/{env}/shared.tfvars`: `rotation_enabled` (master on/off) and `rotation_interval_days` (7 for dev/test, 30 for prod).
- Managed identity authentication: The Container App Job uses a system-assigned managed identity for APIM and Key Vault access. No stored secrets required.
| Component | Purpose | Location |
|---|---|---|
| Container App Job | Scheduled Python job: discovers APIM + hub KV, rotates keys, stores in KV | jobs/apim-key-rotation/ |
| Container build | Builds custom container image to GHCR on PR/merge | .github/workflows/.builds.yml (matrix entry) |
| Terraform module | Deploys Container App Job, Container App Environment, RBAC | infra-ai-hub/modules/key-rotation-function/ |
| Hub Key Vault | Centralized storage for all tenant keys (scales to 1000+) | stacks/shared/main.tf |
| APIM policy endpoint | /internal/apim-keys reads from hub KV | params/apim/api_policy.xml.tftpl |
| Terraform config | Seeds initial KV secrets + RBAC for APIM MI → hub KV | stacks/apim/main.tf |
Rationale
- Zero downtime: The alternating slot pattern guarantees one key is always valid. Tenants are never locked out during rotation.
- Container App Job over GHA workflow: A scheduled Container App Job runs reliably within Azure (no 60-day inactivity disable risk). The previous Bash script + GHA approach required periodic repo activity to avoid GitHub silently disabling scheduled workflows.
- Centralized over distributed: A single hub Key Vault with tenant-prefixed secrets scales better than per-tenant Key Vaults. One RBAC assignment for APIM’s managed identity covers all tenants.
- Self-service key retrieval: The `/internal/apim-keys` endpoint eliminates the need for tenants to have Azure portal access or Key Vault Reader roles. Any valid subscription key authenticates the request.
- Idempotent and safe: The function checks elapsed time since last rotation before acting. Multiple invocations within the interval are no-ops. A `--dry-run` mode allows validation without changes.
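The alternation and idempotency described above reduce to a small piece of state logic. The sketch below is an assumed model (field and function names are invented, and the real job also calls the APIM and Key Vault APIs):

```python
from datetime import datetime, timedelta, timezone

def rotate_if_due(state: dict, interval_days: int, now: datetime) -> dict:
    """Regenerate the stale slot only when the rotation interval has elapsed."""
    if now - state["last_rotated"] < timedelta(days=interval_days):
        return state  # no-op: invoked again within the interval (idempotent)
    # Alternate slots: the slot regenerated last time is left untouched this
    # cycle, so tenants always have at least one valid key (zero downtime).
    next_slot = "secondary" if state["last_slot"] == "primary" else "primary"
    return {"last_slot": next_slot, "last_rotated": now}

# Example: 30-day interval, last rotated 31 days ago.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
state = {"last_slot": "primary", "last_rotated": now - timedelta(days=31)}
state = rotate_if_due(state, 30, now)  # regenerates the secondary slot
state = rotate_if_due(state, 30, now)  # no-op: interval not yet elapsed
```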
Consequences
Positive
- Automated secret hygiene: Keys rotate on a known schedule with full audit trail in Key Vault versioning and GitHub Actions logs.
- Minimal tenant burden: Tenants can use the APIM internal endpoint, daily cron polling, or simply contact the platform team. No Azure SDK or Key Vault access needed.
- Emergency rotation: Both slots can be regenerated immediately via the documented runbook, invalidating all compromised keys.
- Per-environment control: Rotation can be enabled independently per environment — currently active in dev/test, disabled in prod pending STRA approval.
Negative
- Container infrastructure: The Container App Job requires a Container App Environment and Container Registry, adding infrastructure components compared to the previous GHA-only approach. These are managed via the `key-rotation-function` Terraform module.
- STRA gate: Production rotation cannot be enabled until the STRA process completes. Until then, prod keys are static (same risk as baseline).
- Tenant coordination: Tenants using hard-coded keys (rather than the self-service endpoint) must update their configuration within the rotation interval or face 401 errors.
References
- APIM Key Rotation Guide — full operational documentation
- `jobs/apim-key-rotation/` — Container App Job source code
- `infra-ai-hub/modules/key-rotation-function/` — Terraform module
- `.github/workflows/.builds.yml` — container build workflow (matrix entry)
- `params/{env}/shared.tfvars` — per-environment rotation config
ADR-015: Tenant Isolation: Resource Group vs Subscription Pending Platform/MS
This decision requires input from Microsoft’s team on Azure quota scaling options and from the Landing Zone Platform team on subscription provisioning constraints. The current RG-based design is the preferred approach for centralized governance. See #63 and #74.
Context
The AI Services Hub serves multiple BC Government ministries from a shared Azure subscription. Tenant isolation is currently implemented at the Resource Group level — each tenant gets its own RG with dedicated data-plane resources while sharing control-plane infrastructure (VNet, App Gateway, APIM, AI Foundry Hub) in a central RG.
As the platform scales beyond the initial 2 tenants, key Azure subscription-level quotas create hard scaling ceilings:
| Resource | Per-Subscription Limit | Per-Tenant Usage | Ceiling (tenants) |
|---|---|---|---|
| Model deployments per AI account | 32 (default) | 5–7 | ~4–5 |
| AI Search services | 16 (Basic/Standard) | 0–1 | 16 |
| Cosmos DB accounts | 50 | 0–1 | 50 |
| Cognitive Services accounts | 200 | 2 (Doc Intel + Speech) | ~100 |
| APIM APIs per instance | 400 | 5–6 | ~80 |
| Private endpoints per subnet | 1000 | ~5 | ~200 |
| GlobalStandard TPM (per model) | Varies (e.g., 2M for gpt-4.1-mini) | 7.5K–30K | Depends on model |
The most immediate bottleneck is the 32 model deployments per AI Foundry Hub account. With 2 tenants deploying 6–7 models each (13 total today), the limit is reached at roughly 4–5 tenants. Additionally, LLM quotas (PTU and GlobalStandard TPM) are tied to the subscription scope, so all tenants compete for the same throughput pool.
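For illustration, the ceiling arithmetic can be sketched as a short calculation using the table's default limits (a sketch only; actual quotas vary by subscription, region, and negotiated increases):

```python
# Illustrative tenant-ceiling calculation based on the quota table above.
# The limits are the defaults quoted in this ADR, not guaranteed values.

def tenant_ceiling(per_subscription_limit: int, per_tenant_usage: int,
                   already_used: int = 0) -> int:
    """How many more tenants fit before a subscription-scoped quota is hit."""
    remaining = per_subscription_limit - already_used
    return remaining // per_tenant_usage

# 32 model deployments per AI account, ~7 per tenant, 13 deployed today:
print(tenant_ceiling(32, 7, already_used=13))  # → 2 more tenants (~4-5 total)

# 16 AI Search services, at most 1 per tenant:
print(tenant_ceiling(16, 1))  # → 16
```

This is why the 32-deployment limit binds long before the AI Search or Cosmos DB ceilings do.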
Issue #74 raises specific questions about PTU scaling: turnaround time for quota increases, dynamic scaling between PTU and pay-as-you-go, and whether the single-subscription model can support production-scale workloads.
Options Evaluated
Option A: Resource Group Isolation (Current & Preferred)
All tenants share a single Azure subscription. Shared infrastructure (VNet, AppGW, APIM, AI Foundry Hub) lives in a central RG. Each tenant gets a dedicated RG with isolated data-plane resources. Scaling limits are mitigated at the application layer.
Option B: Subscription-Per-Tenant
Each tenant (or group of tenants) gets a dedicated Azure subscription. Shared infrastructure is replicated or connected via VNet peering. Each subscription has independent quota pools.
Option A: Resource Group Isolation — Pros & Cons
Pros
- Centralized governance: Single subscription = single set of Azure Policies, RBAC, Defender for Cloud, cost management. One pane of glass for the platform team.
- Simplified networking: All resources in one VNet with one PE subnet. No cross-subscription VNet peering, no transit routing, no DNS forwarding complexity.
- Lower cost: Shared App Gateway, APIM, AI Foundry Hub, Log Analytics. No per-subscription overhead (reserved instances apply once, shared Defender plans).
- Faster tenant onboarding: Adding a tenant = adding a Terraform tenant config block. No subscription provisioning, OIDC setup, or Landing Zone request.
- Existing investment: Current architecture (5 stacks, phased deployment, APIM key rotation) is built and validated for this model.
- Operational simplicity: One deployment pipeline, one set of state files per environment, one OIDC identity per environment.
- Cost attribution: Per-tenant RGs enable native Azure Cost Management tag-based and RG-based cost reporting without subscription boundaries.
Cons
- Quota ceilings: All tenants share subscription-scoped quotas. The 32-deployment AI account limit is the most immediate constraint (~4–5 tenants).
- TPM/PTU contention: All model deployments on the shared Hub compete for the same GlobalStandard TPM pool. High-demand tenants crowd out others.
- Blast radius: A subscription-level issue (quota exhaustion, policy misconfiguration, billing suspension) impacts all tenants simultaneously.
- Quota increase friction: Requesting Azure quota increases is a per-subscription manual process. Lead times can be days to weeks for PTU.
- AI Search hard limit: 16 search services per subscription is a fixed limit with no increase path — hard wall at 16 search-enabled tenants.
- Foundry serialization: Model deployments run serially across all tenants to avoid ETag conflicts on the shared Hub. More tenants = slower deploys.
Option B: Subscription-Per-Tenant — Pros & Cons
Pros
- Independent quota pools: Each subscription gets its own 32 model deployments, 200 Cognitive Services accounts, 16 AI Search services, etc. Eliminates quota-based scaling ceilings.
- PTU isolation: Each tenant can request and manage its own PTU commitments. No cross-tenant throughput contention.
- Blast radius reduction: Subscription-level issues are isolated to one tenant.
- Independent scaling: Each subscription can scale resources (VM sizes, throughput, storage) without affecting others.
- Compliance flexibility: Some future tenants may have regulatory requirements (FOIPPA, health data) that mandate subscription-level isolation.
- Clean cost separation: Native Azure billing per subscription. No tag-based attribution needed.
Cons
- Loss of central governance: Each subscription needs its own Azure Policies, RBAC assignments, Defender plans. Policy drift risk increases linearly.
- Networking complexity: Requires cross-subscription VNet peering (or VWAN hub-and-spoke), cross-subscription private DNS zones, transit routing. Significant complexity increase.
- Infrastructure duplication: App Gateway, APIM, Key Vault, and potentially AI Foundry Hub must be replicated per subscription — or a complex shared services model must be designed.
- Higher cost: Duplicated infrastructure (App Gateway ~$300/mo, APIM StandardV2 ~$700/mo per subscription). Reserved instance savings are fragmented.
- Subscription provisioning lead time: BC Gov Landing Zone subscription requests go through the Platform Services team. Lead time is days to weeks, blocking rapid tenant onboarding.
- Pipeline complexity: Each subscription needs its own OIDC federation, deployment pipeline, state backend. The current single-pipeline model does not extend.
- Operational overhead: N subscriptions = N sets of monitoring, alerting, secret rotation, certificate management. Platform team workload scales linearly.
- APIM routing: A shared APIM fronting multiple subscription backends requires cross-subscription private endpoints or public exposure — both add complexity.
Decision
We continue with Resource Group isolation (Option A) as the preferred architecture, with specific mitigations for quota scaling constraints. This decision is pending confirmation from Microsoft on quota flexibility and from the Platform team on subscription provisioning options as a future fallback.
Mitigations for RG-Based Scaling Limits
| Constraint | Limit | Mitigation Strategy | Status |
|---|---|---|---|
| Model deployments per AI account | 32 | Request quota increase via Azure Support. Deploy a second AI Foundry Hub account if increase denied. Consolidate shared models (e.g., single embedding model for all tenants). | Pending MS |
| GlobalStandard TPM contention | Per-model cap | Implement APIM rate limiting per tenant (already in place). Explore PTU for high-priority tenants. Use `dynamic_throttling_enabled` on AI account. Investigate PTU ↔ pay-as-you-go spillover. | Pending MS |
| AI Search services | 16 | Not all tenants need AI Search (1 of 2 currently enabled). For tenants with simple needs, use shared index with document-level permissions or skip Search entirely. | Mitigated |
| Foundry project serialization | Serial deploys | Already mitigated in ADR-013 (scaled stacks). Foundry stack runs serially but other phases are parallel. `prevent_destroy` on model deployments reduces redeploy churn. | In Place |
| APIM API count | 400 | Consolidate API definitions. Use a single versioned API with tenant routing via APIM policies rather than per-tenant API duplicates. | Future |
Trigger for Revisiting This Decision
The following conditions would warrant moving specific tenants or tenant groups to dedicated subscriptions (hybrid model):
- Tenant count exceeds 10–15 and quota increase requests are denied by Microsoft
- A tenant requires dedicated PTU with guaranteed throughput SLAs that conflict with shared pool
- Regulatory requirements (e.g., health data under FOIPPA) mandate subscription-level isolation explicitly
- PTU turnaround time for quota increases exceeds acceptable lead times for tenant onboarding
- Total GlobalStandard TPM across all tenants approaches the per-model subscription ceiling
Rationale
- Right-sizing for now: With 2 active tenants and realistic growth to 5–10 in the near term, the RG model has headroom with mitigations. Subscription-level rearchitecture is premature.
- Governance is the priority: BC Gov Landing Zone policies, RBAC, and audit requirements are significantly easier to enforce in a single subscription. The compliance benefit outweighs the quota risk.
- Cost efficiency: Shared infrastructure saves ~$1,000+/mo per avoided subscription (AppGW + APIM alone). This is material in a government context.
- Onboarding velocity: A Terraform config change vs. a multi-week subscription provisioning request. This directly impacts ministry adoption speed.
- Hybrid escape hatch: The architecture supports a future hybrid model where high-demand or compliance-sensitive tenants move to dedicated subscriptions while most remain on the shared RG model. This is not an irreversible decision.
Open Questions for Microsoft
- What is the turnaround time for PTU quota increases when current allocation is at full capacity?
- Does PTU support dynamic scaling (elastic range) or is it fixed at a provisioned point?
- Can you provide Terraform samples for auto-fallback between PTU and GlobalStandard (pay-as-you-go)?
- Can the 32 model deployment limit per Cognitive Services account be increased? To what ceiling?
- Is there a recommended pattern for multi-AI-Foundry-Hub accounts within a single subscription to distribute deployments?
Consequences
Positive
- No immediate rearchitecture needed: The platform continues operating with the validated RG-based model while answers from Microsoft are pending.
- Clear scaling triggers: The team knows exactly which quotas to monitor and at what tenant count to revisit the decision.
- Documented escape path: If RG-based scaling hits limits, the migration path to subscription-per-tenant (or hybrid) is architecturally understood.
Negative
- Near-term ceiling: The 32-deployment limit means maximum ~4–5 tenants without a quota increase or model consolidation. This is a known constraint.
- MS dependency: Key mitigations (quota increases, PTU scaling guidance) depend on Microsoft response timelines.
- Potential future migration: If the hybrid model is eventually needed, migrating a tenant from shared subscription to dedicated subscription requires re-creating resources, migrating data, and updating APIM routing — non-trivial effort.
References
- Issue #63: Decision | Tenant Isolation | RG Or Subscription
- Issue #74: Foundry Hub | Single Subscription | Scaling (parent issue)
- ADR-010: Multi-Tenant Isolation Model — four-layer isolation architecture
- ADR-013: Scaled Stack Architecture — phased deployment with isolated state
- Azure OpenAI Quotas and Limits
- Azure Subscription Service Limits
ADR-016: Backend Circuit Breaker Pattern Resilience Accepted
| Status | Accepted |
| Date | 2026-02 |
| Deciders | Platform Team |
| Category | Resilience / API Gateway |
Context
The APIM gateway proxies requests to multiple backend services: Azure OpenAI, Document Intelligence, AI Search, Speech Services, and Storage. When a backend experiences failures (overload, outages, throttling), continued request forwarding wastes resources and degrades the client experience with slow timeouts instead of fast failures.
Azure OpenAI specifically returns 429 Too Many Requests with large Retry-After values (up to 86,400 seconds / 1 day) when quota is exhausted. Without circuit breaking, APIM would continue sending requests to a backend that cannot serve them.
Decision
Implement the circuit breaker pattern on all APIM backend entities using the native circuit_breaker_rule in azurerm_api_management_backend.
Configuration per backend
| Parameter | Value (AI services) | Value (Storage) |
|---|---|---|
| Failure count threshold | 3 | 5 |
| Failure window | 1 minute (PT1M) | 1 minute (PT1M) |
| Trip duration | 1 minute (PT1M) | 1 minute (PT1M) |
| Accept Retry-After | Yes | Yes |
| Trigger status codes | 429, 500–599 | 429, 500–599 |
Backends covered
- `ai_foundry` — AI Foundry Hub endpoint
- `openai` — Azure OpenAI endpoint
- `docint` — Document Intelligence
- `ai_search` — Azure AI Search
- `speech_services_stt` — Speech-to-Text
- `speech_services_tts` — Text-to-Speech
- `storage` — Blob Storage (higher threshold: 5 failures)
What happens when the circuit trips
- Backend accumulates failures matching the configured status codes within the failure window.
- When the failure count exceeds the threshold, the circuit opens (trips).
- APIM immediately returns HTTP 503 Service Unavailable to all subsequent requests targeting that backend — requests are not forwarded.
- The global policy `<on-error>` handler intercepts the 503 and returns a structured JSON error with a `Retry-After` header (default: 60 seconds).
- After the trip duration (or the backend’s `Retry-After` value if `accept_retry_after_enabled = true`), the circuit resets and traffic resumes.

When Azure OpenAI returns 429, the `Retry-After` header can be very large (e.g., 86,400 seconds = 1 day). With `accept_retry_after_enabled = true`, the circuit stays open for that duration. This is intentional — the backend cannot serve requests during that period anyway.
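The trip-and-reset semantics above can be modelled with a small state machine. This is an illustrative sketch of the behaviour, not APIM's implementation; the `CircuitBreaker` class and its names are invented here:

```python
import time

# status codes that count toward tripping, per the configuration table above
TRIP_STATUSES = set(range(500, 600)) | {429}

class CircuitBreaker:
    """Illustrative model of the trip/reset semantics described above:
    `threshold` failures inside a rolling window open the circuit for
    `trip_s` seconds, or for the backend's Retry-After when larger."""

    def __init__(self, threshold=3, window_s=60.0, trip_s=60.0):
        self.threshold, self.window_s, self.trip_s = threshold, window_s, trip_s
        self.failures = []      # timestamps of recent qualifying failures
        self.open_until = 0.0   # requests are rejected until this time

    def allow(self, now=None):
        """False while the circuit is open (APIM would answer 503)."""
        now = time.time() if now is None else now
        return now >= self.open_until

    def record(self, status, retry_after=0.0, now=None):
        now = time.time() if now is None else now
        if status not in TRIP_STATUSES:
            return
        # keep only failures inside the rolling window, then add this one
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            # accept_retry_after_enabled: stay open for the larger duration
            self.open_until = now + max(self.trip_s, retry_after)
            self.failures.clear()
```

Three qualifying failures within a minute open the circuit; a tripping 429 that carries `Retry-After: 86400` would keep it open for the full day, matching the intentional behaviour noted above.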
Client-facing error response (503)
{
"error": {
"code": "503",
"message": "Service temporarily unavailable — backend circuit breaker is open. The service is recovering from excessive failures. Retry after the indicated period.",
"retryAfter": "60",
"requestId": "abc-123-def-456"
}
}
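A client consuming this error shape can choose its backoff from the structured payload. The sketch below is illustrative; `retry_delay` is an invented helper, not part of the platform:

```python
import json
import random

def retry_delay(response_body: str, attempt: int, cap: float = 300.0) -> float:
    """Pick a client-side backoff delay from the gateway's 503 payload.

    Honours the structured retryAfter field when present; otherwise falls
    back to capped exponential backoff. Jitter avoids a thundering herd."""
    try:
        delay = float(json.loads(response_body)["error"]["retryAfter"])
    except (ValueError, KeyError, json.JSONDecodeError):
        delay = min(cap, 2.0 ** attempt)
    return delay + random.uniform(0, 1)
```

For the example payload above, `retry_delay(body, attempt=0)` yields roughly 60 seconds, matching the `Retry-After` header the `<on-error>` handler sets.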
Rationale
- Fast failure: Clients get an immediate 503 instead of waiting for backend timeouts (up to 300 seconds).
- Backend protection: Prevents overwhelming a struggling backend with additional requests.
- Native support: Uses `azurerm_api_management_backend.circuit_breaker_rule` — no custom policy logic required.
- Retry-After propagation: Passes backend throttle signals directly to clients for proper backoff.
- Event Grid integration: APIM emits Event Grid events on circuit trip/reset for monitoring and alerting.
Consequences
Positive
- Reduced latency during backend outages (instant 503 vs. timeout).
- Backend services get breathing room to recover.
- Consistent error format with `Retry-After` enables client-side retry logic.
- Zero policy-level code — purely infrastructure configuration.
Negative
- Approximate tripping: APIM gateway instances do not synchronize circuit breaker state. Each instance tracks failures independently, so tripping is approximate in multi-instance deployments.
- Single rule per backend: Only one circuit breaker rule per backend is currently supported by the Azure API.
- 503 during recovery: Legitimate requests during the trip window will be rejected even if the backend has recovered before the trip duration expires.
References
- Azure APIM Backends — Circuit Breaker
- Circuit Breaker Pattern (Azure Architecture)
- APIM Event Grid Events — circuit breaker trip/reset events
- Issue #98: AI Gateway Gap Analysis
ADR-017: Custom Tenant Onboarding Portal Inside AI Hub Accepted
Context
The AI Hub needed a tenant onboarding experience that does more than collect a request. The onboarding flow must support structured tenant metadata, governed admin review, future automation from submission through approval, and environment-aware downstream actions such as generating Terraform inputs and preparing non-prod and prod deployment workflows.
We considered using an existing platform instead of building a portal inside the Hub workspace. The main alternatives were the BC Government Platform Product Registry, CHEFS, and the Azure API Management Developer Portal. All three reduce initial build effort, but none matches the lifecycle and control-plane requirements of Hub onboarding.
Options Considered
Custom portal inside AI Hub (selected)
- Owns the end-to-end onboarding workflow
- Supports structured validation and Hub-specific data models
- Can drive future automation after approval
BC Government Platform Product Registry
- Designed to manage existing products on Private Cloud OpenShift and Public Cloud Landing Zones
- Solves a different problem than Hub onboarding
- Supports product metadata and resource change requests for higher-level platforms, which the AI Hub may not need at all
CHEFS
- Strong for hosted form submission
- Submission lifecycle is oriented around form intake, not long-running onboarding state
- Not a fit for secure post-approval actions or controlled reveal flows
APIM Developer Portal
- Can be deployed as part of the Hub infrastructure (APIM)
- Offers self-service API subscription and developer onboarding UX out of the box
- Customisable via delegation and custom HTML/JS widgets, but tightly coupled to APIM concepts (products, subscriptions)
- Not designed for multi-step approval workflows, structured Terraform config generation, or environment-aware provisioning actions
Decision
We will use a custom tenant onboarding portal inside the AI Hub repository and deployment boundary instead of the BC Government Platform Product Registry, CHEFS, or the APIM Developer Portal.
Rationale
- Governed approval workflow: Hub onboarding requires an admin review and approval process, not just a one-time request capture. The custom portal can model request states, reviewer actions, and follow-up steps directly.
- Structured Hub-specific configuration: The workflow needs to collect and validate data that maps cleanly into tenant-specific Terraform inputs and other structured configuration artifacts. A generic registry or hosted form would require extra translation layers and still would not own the lifecycle.
- Tight integration with Hub logic and storage: The onboarding flow must stay close to Hub-specific validation rules, tenant state, and downstream provisioning behavior. Keeping the portal in the same codebase reduces impedance between intake, approval, generated config, and deployment automation.
- Authenticated post-submission lifecycle: The process does not end at submission. The platform needs room for authenticated follow-up actions, operator review, state transitions, and future self-service interactions that go beyond a write-once form.
- CHEFS is too immutable for this lifecycle: CHEFS is well suited to hosted intake, but its submission model is intentionally form-centric and immutable. It supports status and notes, but it is not designed to mutate form data, drive secure post-approval reveal flows, or act as the system of record for ongoing onboarding operations.
- The Platform Product Registry solves a different problem: The registry is built for teams that need to create or manage a product on Private Cloud OpenShift or the Public Cloud Landing Zone, which may not apply to the AI Hub at all. AI Hub tenant onboarding is independent of the platform a tenant runs on, captures Hub-specific configuration, and needs a purpose-built approval and provisioning workflow rather than a generic registry product-change process, in the same way that Keycloak and the Kong App Gateway ship their own dedicated portals.
- Secure post-approval actions: The team needs room for actions after approval, including controlled credential-related workflows and other sensitive follow-up behavior. Those concerns should live in a dedicated application boundary rather than in generic form metadata or a public-facing registry experience.
- Future automation path: The chosen design supports evolving from request intake to approval and then to automated promotion and deployment workflows across non-prod and prod environments without replacing the front door later.
- APIM Developer Portal is the wrong abstraction: The APIM Developer Portal is built around API products and subscriptions, not tenant provisioning workflows. Customising it far enough to support multi-step approval, structured config generation, and environment-aware provisioning actions would require extensive delegation and external backend work — effectively rebuilding the same application on a more constrained foundation.
Consequences
Positive
- Single ownership boundary: Intake, review, generated config, and deployment hooks can evolve together in the Hub codebase
- Better security posture for follow-up actions: Sensitive post-approval behavior stays in a purpose-built application instead of being forced into form notes or registry constructs
- Clearer evolution path: The portal can add richer workflow, audit, and automation capabilities without re-platforming
- Operational consistency: The same repo, CI/CD patterns, and Azure deployment model can be reused for the portal and Hub-adjacent automation
Negative
- Custom application to build and maintain: We own the frontend, backend, tests, deployment, and documentation
- Higher initial delivery cost: Building a tailored workflow is slower than standing up a generic form or pointing users at an existing portal
- More platform decisions to maintain: Auth, storage, review workflow, and automation semantics become our responsibility
Mitigations
- Keep the portal thin and focused: Implement only onboarding workflow concerns that are specific to AI Hub
- Automate validation and delivery: Maintain build, unit, E2E, and deployment workflows so sustainment cost stays controlled
- Document the why: This ADR exists so future teams do not revisit CHEFS or the Platform Product Registry without understanding the lifecycle mismatch
References
ADR-018: External PII Redaction Service Accepted
Context
The AI Hub uses Azure Language Service PII recognition to redact personally identifiable information from chat completion payloads before they reach upstream LLM backends. APIM delegates all PII processing to a dedicated external service rather than calling the Language API inline, because APIM policies have no loop construct and the Language API imposes strict per-call limits.
The Azure Language Service /language/:analyze-text endpoint accepts a maximum of 5 documents per call, and each document is limited to 5 120 characters. Real-world chat completion requests can contain many messages or very long messages that chunk into more than 5 documents. Covering the full payload requires batched calls with deadline enforcement and transient retry handling — logic that belongs in a real programming language, not APIM XML.
A single APIM `send-request` call covers at most 5 documents. Payloads exceeding that limit need an external orchestrator that can issue batched calls, handle transient retry/backoff, and stay within the APIM timeout budget.
Options Considered
External PII Redaction Container App (selected)
- Dedicated FastAPI service deployed as a Container App on the shared internal CAE
- APIM routes all PII-enabled requests to the external service
- Service handles bounded batching, retry/backoff, deadline enforcement, and coverage verification
- Single code path keeps the APIM policy simple
Azure Functions (Consumption or Flex)
- Serverless compute that scales to zero
- Cold start latency (seconds) conflicts with the 90 s APIM timeout budget
- Less control over runtime, networking, and concurrency model
- Does not align with existing Container App Environment already in use
Expand APIM inline policy (no external service)
- Keep everything in APIM XML policies
- APIM policies have no loop construct; would require N hard-coded `send-request` blocks
- Extremely fragile, hard to test, and impossible to maintain at scale
- Policy execution time adds up and risks breaching the APIM timeout
Client-side redaction SDK
- Push PII responsibility to each tenant application
- Cannot be enforced centrally; tenants may skip or misconfigure
- Defeats the purpose of transparent gateway-level PII protection
Decision
We will deploy a dedicated PII redaction microservice as a Container App on the shared internal Container App Environment, reachable only from APIM via VNet-internal ingress. APIM routes all PII-enabled requests to this service.
Rationale
- Language API batching limit: The `/language/:analyze-text` endpoint accepts at most 5 documents per call (each up to 5 120 chars). Payloads with many or long messages exceed this and need orchestrated batch calls that APIM policies cannot express.
- APIM cannot loop: Policy XML has no iteration construct. Hard-coding multiple `send-request` blocks is brittle, hard to test, and caps out quickly. A real programming language (Python + asyncio) handles batching, timeouts, retry/backoff, and error recovery naturally.
- Single routing path: A single route from APIM to the external service keeps the policy fragment simple. The service handles all payloads with the same batching logic regardless of size.
- Reuse existing infrastructure: The shared Container App Environment, Managed Identity RBAC, GHCR image builds, and Terraform module patterns are already in place. Adding one more Container App is incremental.
- Timeout budget alignment: The service enforces an 85 s total processing deadline that fits within the 90 s APIM `send-request` timeout, with per-attempt timeouts (10 s) and bounded transient retries for 429 and 5xx responses.
- Fail-open / fail-closed flexibility: The service returns a structured response with coverage status. APIM decides whether to block (503) or pass through based on the tenant's `fail_closed` configuration.
- Testability: Python unit tests cover chunking, batching, reassembly, and API error handling — far easier to maintain than equivalent logic embedded in APIM XML.
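The timeout-budget behaviour described above can be sketched as a deadline-aware retry loop. This is an illustrative model only; `call_with_budget` and its signature are invented here, not the service's actual code:

```python
import time

def call_with_budget(attempt_fn, total_budget_s=85.0, attempt_timeout_s=10.0,
                     base_backoff_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Retry transient Language API failures inside a hard deadline.

    attempt_fn(timeout_s) must return (status, result, retry_after_s)."""
    deadline = clock() + total_budget_s
    backoff = base_backoff_s
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("total processing deadline exhausted")
        # never give one attempt more time than the budget has left
        status, result, retry_after = attempt_fn(min(attempt_timeout_s, remaining))
        if status < 400:
            return result                  # success: hand the result back
        if status == 429 and retry_after is not None:
            wait = retry_after             # honour the backend's Retry-After
        elif status == 429 or status >= 500:
            wait = backoff                 # exponential backoff for 5xx
            backoff *= 2
        else:
            raise RuntimeError(f"non-retryable status {status}")
        if clock() + wait >= deadline:
            raise TimeoutError("next retry would exceed the deadline")
        sleep(wait)
```

With the defaults above, a transient 503 is retried with backoff, a 429's `Retry-After` is honoured, and any wait that would blow the 85 s budget aborts immediately instead of breaching the APIM timeout.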
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                      ALL-EXTERNAL PII REDACTION                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Client → App Gateway → APIM                                         │
│                      │                                               │
│            ┌─────────┴────────────────────────┐                      │
│            │ POST /redact                     │                      │
│            │ (PII Redaction Service)          │                      │
│            └─────────┬────────────────────────┘                      │
│                      │                                               │
│            ┌─────────┴────────────────────────┐                      │
│            │ Bounded concurrent batches       │                      │
│            │ (max 5 docs × 15 batches)        │                      │
│            │ Word-boundary chunking           │                      │
│            │ Retry-after / backoff handling   │                      │
│            │ Deadline enforcement (85s)       │                      │
│            │ Full-coverage check              │                      │
│            └─────────┬────────────────────────┘                      │
│                      │                                               │
│            Redacted body → APIM → Backend                            │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
Key Design Constraints
| Constraint | Value | Reason |
|---|---|---|
| Max chars per document | 5 000 | Language API limit (5 120) with safety margin |
| Max documents per Language API call | 5 | Language API batch limit |
| Max batches per request | 15 (→ 75 documents) | Caps total processing time; rejects with 413 if exceeded |
| Per-attempt timeout | 10 s | Isolate slow Language API calls |
| Total processing timeout | 85 s | Fits within APIM 90 s send-request timeout |
| Transient retry handling | 429 + 5xx | Honor Retry-After for 429 and use exponential backoff for 5xx, all within the same 85 s budget |
| Chunking strategy | Word-boundary split | Avoids splitting mid-word which degrades PII detection accuracy |
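The chunking and batching constraints in the table can be sketched as follows. This is an illustrative reimplementation under the stated limits, not the service's source code; the function names are invented:

```python
def chunk_on_word_boundaries(text, max_chars=5000):
    """Split text into chunks of at most max_chars without breaking words,
    so PII entities are not cut in half across documents."""
    chunks, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # degenerate case: a single token longer than max_chars is hard-split
        while len(word) > max_chars:
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        current = word
    if current:
        chunks.append(current)
    return chunks

def batch_documents(documents, per_call=5, max_batches=15):
    """Group documents into Language API calls of at most per_call docs.
    Payloads beyond per_call * max_batches are rejected (the service
    answers HTTP 413)."""
    batches = [documents[i:i + per_call]
               for i in range(0, len(documents), per_call)]
    if len(batches) > max_batches:
        raise ValueError("payload exceeds the 75-document cap (HTTP 413)")
    return batches
```

A 12-document payload becomes three Language API calls of 5, 5, and 2 documents; a 76-document payload is rejected up front rather than risking the 85 s deadline.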
Consequences
Positive
- Transparent payload handling: Tenants do not need to know about Language API limits; APIM routes all PII-enabled requests to the external service automatically
- Single code path: All PII redaction flows through the Container App, keeping behaviour consistent across all payload sizes
- Testable orchestration logic: Chunking, batching, reassembly, and error handling are covered by Python unit tests rather than being embedded in untestable APIM XML
- Incremental infrastructure cost: Runs on the existing shared CAE with Managed Identity RBAC — no new networking or identity infrastructure
- Structured observability: JSON-formatted logs with correlation IDs, batch counts, and elapsed-time diagnostics
Negative
- Additional component to deploy and maintain: One more Container App, Dockerfile, Terraform module, and deployment phase
- Extra network hop for all PII requests: Every PII-enabled request pays the cost of APIM → Container App → Language API instead of APIM → Language API directly
Mitigations
- Reuse proven patterns: The Container App module, GHCR build workflow, and deploy ordering follow the same conventions as the key-rotation job
- Integration tests: The APIM integration test suite covers PII redaction end-to-end through the external service
- Conservative thresholds: The 75-document cap and 85 s deadline ensure the service stays within APIM timeout bounds even under adverse conditions, including transient retry/backoff
References
- Language Service PII Redaction — operational documentation for PII redaction via the external service
- Services Overview — Container App deployment topology
- Azure Language Service — Analyze Text API
- Azure AI Language — PII Detection Overview
ADR Template
Use this template when adding new ADRs:
## ADR-XXX: [Title]

**Status:** [Proposed | Accepted | Deprecated | Superseded]
**Date:** YYYY-MM
**Deciders:** [Team/People]
**Category:** [Security | Networking | Infrastructure | Documentation | etc.]

### Context

[What is the issue? What forces are at play?]

### Decision

[What is the decision? Be specific.]

### Rationale

[Why this decision over alternatives?]

### Consequences

#### Positive

- [Good outcomes]

#### Negative

- [Tradeoffs accepted]

### References

- [Links to relevant docs, diagrams, discussions]
You have reached the end of the ADR list.
↑ Back to Decision Index