AI Services Hub
Azure Landing Zone Infrastructure

Frequently Asked Questions

This page answers common questions about the AI Services Hub, how the private network is designed, how teams are onboarded, and how people are expected to use the platform in practice. The wording here is intentionally plain and explanatory so that readers do not need to already know Azure platform terminology.

Questions still being researched or awaiting a decision are marked with Pending.
Network & Subscriptions
Control vs Data Plane
Onboarding & Access
Technical Architecture
Questions for Microsoft
Pending Decisions

Network & Subscription Setup

What network sizes were allocated for the AI Hub?

Answered
Environment Virtual Network Size Usable IPs
da4cf6-prod 4x /24 (= /22) ~1,020
da4cf6-test 2x /24 (= /23) ~508
da4cf6-dev 1x /24 ~251
da4cf6-tools 1x /24 ~251

Confirmed by Platform Services Team on Dec 11, 2025

Why does the AI Hub need larger private networks than a standard /24?

Answered

The platform is not hosting just one application. It contains several managed Azure services, and many of those services require their own dedicated subnet space and reserve more IP addresses than people expect. Because of that, a small default network range is not enough for production.

The table below shows why the address space grows quickly:

Service Min Size Recommended
Azure gateway service for application programming interfaces /26 /24
App Gateway /26 /24
Azure AI Foundry /25 /24
Private endpoint subnet /27 /27
Azure Kubernetes Service Variable Variable

Total: Once you combine the gateway layer, private endpoints, container and platform subnets, and room for growth, the usual /24 network is too small. That is why production needs a much larger address range.
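The arithmetic can be checked with a short script. This is a sketch only: the address ranges are made-up examples, and it assumes Azure's usual rule of five reserved addresses per subnet.

```python
import ipaddress

# Hypothetical address plan using the recommended sizes from the table.
# Azure reserves 5 addresses in every subnet, so usable = total - 5.
plan = {
    "api-gateway": "10.0.0.0/24",
    "app-gateway": "10.0.1.0/24",
    "ai-foundry": "10.0.2.0/24",
    "private-endpoints": "10.0.3.0/27",
}

AZURE_RESERVED = 5

def usable(cidr: str) -> int:
    """Usable IPs in an Azure subnet (total addresses minus the 5 reserved)."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED

total = sum(usable(c) for c in plan.values())
print(total)  # 251 + 251 + 251 + 27 = 780, far more than one /24 can hold
```

Even before adding Azure Kubernetes Service or growth headroom, the recommended subnets alone need roughly three /24s' worth of usable space, while a single /24 offers only 251 usable addresses.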

Control Plane vs Data Plane

What's the difference between control plane and data plane?

Answered

Azure has two very different kinds of access, and understanding the difference explains many of the platform design choices in this repository.

Control plane access means managing the resource itself: creating it, updating settings, changing permissions, or deleting it. Data plane access means using what is inside that resource: reading a secret, writing a blob, or querying stored data.

Aspect Control Plane Data Plane
What Managing resources (create, configure, delete) Accessing data inside resources
Endpoint management.azure.com *.vault.azure.net, *.blob.core.windows.net
Works from the public internet with short-lived sign-in? Yes. This is why GitHub Actions can still deploy infrastructure changes. No. Private endpoints block this unless traffic first enters the private network.
Examples Create a Key Vault, change settings, assign access roles Read secrets, write blobs, query databases

If you are wondering why some tasks work from a public automation runner and others do not, this is the reason. See the architecture decision record on control plane versus data plane access and the matching diagram for a deeper explanation.
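As a rough illustration of the table above, a small helper can classify a request host by plane. The host lists here are only the examples from the table, not a complete inventory of Azure endpoints.

```python
# Illustrative only: classify a request host by plane, mirroring the
# endpoint row of the table. These host lists are examples, not complete.
CONTROL_PLANE_HOSTS = {"management.azure.com"}
DATA_PLANE_SUFFIXES = (".vault.azure.net", ".blob.core.windows.net")

def plane(host: str) -> str:
    if host in CONTROL_PLANE_HOSTS:
        return "control"  # reachable from public runners with a valid sign-in
    if host.endswith(DATA_PLANE_SUFFIXES):
        return "data"     # blocked by private endpoints unless tunnelled
    return "unknown"

print(plane("management.azure.com"))     # control
print(plane("myvault.vault.azure.net"))  # data
```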

Why can't I see Key Vault secrets in the Azure Portal?

Answered

The Azure Portal is primarily a management interface. It is very good at showing you that a resource exists and how it is configured, but it does not magically bypass a private endpoint.

When you open a Key Vault in the Azure Portal:

  • The overview and configuration pages load normally, because those calls go through the public management endpoint (the control plane).
  • The secret values themselves do not load, because that call goes to the vault's own data-plane endpoint, and the private endpoint makes that reachable only from inside the private network.

In other words, the screen that shows the vault is not the same thing as network access to the vault's secret data. To view secrets, use the Chisel tunnel playbook or connect through the jumpbox by way of Azure Bastion.

Do tenant developers get Chisel access?

Answered

No. Chisel is an administrative tool for the people who operate the platform itself. It is not part of the normal tenant experience.

Tenant developers are expected to use the public entry points that were built for them:

  • the platform gateway endpoint, accessed with their gateway subscription keys
  • the Tenant Onboarding Portal, for onboarding status and routine key retrieval

This separation is deliberate. It keeps tenants isolated from private infrastructure, preserves metering and security boundaries, and avoids giving every team broad internal network access that they do not need.

Which access method should I use?

Answered

Choose the access method based on what you are trying to do. The important distinction is whether you need to reach private data inside Azure resources, or whether you only need to deploy and configure resources.

If you need... Use this
Automated deployment that must read secrets or private state Chisel tunnel with the Azure proxy enabled (enable_azure_proxy)
Administrative access, debugging, or command-line work by platform maintainers Chisel tunnel with the Azure proxy enabled (enable_azure_proxy)
Local development access to private databases or private Azure services Chisel tunnel with the Azure proxy enabled (enable_azure_proxy)
Deployment work that only changes resource definitions and does not need private data access Standard public GitHub Actions runners

See Choosing Your Access Method and the access methods diagram if you want a more detailed decision guide.

Onboarding & Access

How do other teams onboard to the AI Hub?

Answered

Teams onboard through the Tenant Onboarding Portal. This is a custom application built specifically for this platform because the team needed more than a simple request form. The portal stores structured onboarding information, supports review and approval by platform administrators, and gives the platform a place to expose approved tenant information later.

The current operating model works like this:

  1. A team submits its onboarding information through the portal.
  2. Platform administrators review the request, verify that it fits the platform, and then approve or reject it.
  3. Once a tenant has been approved, authorised tenant administrators can return to the portal and view environment-specific information that the platform exposes for them.

This is important because onboarding here is not just a one-time form submission. It is an ongoing workflow that can later include approval history, generated configuration, operational details, and controlled access to credentials and service information.
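The three-step workflow above implies a small set of onboarding states. The sketch below is illustrative only; the real portal may model its states differently.

```python
# Illustrative onboarding states; the real portal may model these differently.
TRANSITIONS = {
    "submitted": {"approved", "rejected"},  # reviewed by platform administrators
    "approved": set(),                      # tenant admins can now view environment info
    "rejected": set(),
}

def advance(state: str, new_state: str) -> str:
    """Move a request to a new state, refusing transitions the workflow forbids."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move from {state!r} to {new_state!r}")
    return new_state

print(advance("submitted", "approved"))  # approved
```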

Still evolving:
  • What prerequisites must teams meet?
  • How much of the downstream provisioning flow should be automated after approval?
  • What's the cost model (chargeback)?

How do tenant administrators get their gateway subscription keys?

Answered

Approved tenant administrators can view their gateway subscription keys directly in the Tenant Onboarding Portal. They do not need to open a support request for routine key retrieval.

Behind the scenes, the portal backend reads the tenant's key material from the shared central Key Vault by using its own managed identity. That means no long-lived stored password is required for the portal to retrieve those values. See the key rotation guide for the full operational details.

What's the 3-6-9 month roadmap?

Pending

A planning session has been proposed, but the roadmap has not yet been finalised. The idea is to move from early pilot work into a clearer sequence of near-term, medium-term, and longer-term platform outcomes.

A working agenda for that session is being drafted.

Status: Planning session with Microsoft to be scheduled. Date TBD.

Technical Architecture

Why can't GitHub Actions run Terraform directly?

Answered

GitHub Actions can run Terraform, but it cannot directly reach private endpoints from a standard public runner. That is the key limitation.

The platform solves this by creating a secure tunnel from the public automation runner into the private Azure network. A Chisel server runs inside an Azure App Service that sits in the private network path, and the deployment workflow sends private data traffic through that tunnel when it needs to talk to things like the private Terraform state account or secrets store.

This means the workflow can still use normal public GitHub-hosted runners for most automation, without paying for a permanently running private build fleet. There is an optional module for self-hosted runners on Azure Container Apps for special cases, but the platform's own deployment path does not depend on it.

Cost note: The Chisel proxy runs as an App Service plan-based component, so cost is tied to that service plan rather than to how often the workflow runs. See Cost Tracking for details.

What Azure services will the AI Hub provide?

Answered

The platform provides shared artificial intelligence services for British Columbia Government teams through a multi-tenant setup. In practical terms, that means one platform supports multiple teams while still keeping each tenant's access, quotas, and configuration separate.

A set of shared artificial intelligence services is already live in the development and test environments. See the services catalogue for the full service and model list, per-tenant allocation details, and usage limits.

How does sensitive personal information redaction work in the current architecture?

Answered

Requests that need sensitive personal information redaction are no longer processed directly inside the gateway policy engine. Instead, the platform gateway forwards the request body and the tenant's redaction settings to a dedicated redaction service that runs as an internal container application.

This design is easier to reason about because all redaction work goes through the same external service, regardless of request size.

  1. The platform gateway reads the tenant-specific settings that control whether redaction is enabled, which categories to exclude, which language to assume, whether failure should block the request, and which message roles should be scanned.
  2. The gateway policy sends the request to the external redaction service over the internal private network path.
  3. The redaction service calls Azure AI Language service over a private endpoint, reconstructs the result into the original request shape, and returns the redacted body together with diagnostic information.
  4. The gateway forwards the request to the downstream backend only after the redaction service reports that it completed the full job successfully.
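The four steps above can be sketched in Python. The function names, settings keys, and the stubbed redaction call are all illustrative, not the platform's actual interfaces.

```python
# Illustrative sketch of steps 1-4; function names and settings keys are
# made up for this example, not the platform's actual interfaces.
def redact_via_service(body: str, settings: dict) -> dict:
    """Stand-in for the internal redaction container app (steps 2-3)."""
    redacted = body.replace("123-456-789", "[REDACTED]")  # placeholder logic
    return {"ok": True, "body": redacted}

def forward_to_backend(body: str, settings: dict) -> str:
    """Gateway-side flow: redact first, forward only on success (step 4)."""
    if not settings.get("enabled", False):
        return body  # redaction disabled for this tenant (step 1)
    result = redact_via_service(body, settings)
    if not result["ok"]:
        if settings.get("block_on_failure", True):
            raise RuntimeError("redaction failed; request blocked")
        return body
    return result["body"]  # only the redacted body reaches the backend

print(forward_to_backend("SIN 123-456-789", {"enabled": True}))  # SIN [REDACTED]
```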

See the Language service redaction guide and the architecture decision record for the external redaction service for the full flow and configuration details.

How does virtual network peering work between environments?

Answered

The platform uses virtual network peering so that the tools environment can reach the development, test, and production environments for operational and deployment work without collapsing everything into one flat network.

This is mainly there to support platform operations and automation. It is not a shortcut that gives tenant teams direct access to internal services in every environment.

Questions for Microsoft

Post kick-off questions for Microsoft regarding the Azure artificial intelligence landing zone (Dec 2025)

Access & Security

1. User access model for AI Foundry Pending

How will end users, such as data scientists and developers, access Azure AI Foundry? What sign-in and authorisation model should the platform expect?

Questions:

  • Should users work directly in the portal, or should access be API-only?
  • Integration with the provincial directory sign-in system?
  • Guest access for contractors?
2. Role-based access model Pending

What access roles are needed for the main user groups, such as platform administrators, team leads, developers, and data scientists?

Personas to define:

  • Platform Admin - full control
  • Team Lead - manage team resources
  • Developer - deploy models, run experiments
  • Data Scientist - read-only model access
  • Auditor - view logs and compliance
3. Governance automation Pending

How can we automate governance policies (data classification, model approval, prompt logging)?

Areas needing automation:

  • Data classification enforcement
  • Model deployment approval workflows
  • Prompt/response logging for audit
  • Detection and masking of sensitive personal information

Infrastructure & Cost

4. Cost-effective infrastructure-as-code deployment without Azure Bastion and a virtual machine Answered

Resolved: The platform uses standard GitHub-hosted runners for automated deployment, including Terraform operations that must touch private endpoints. It does not require a permanently running Azure Bastion session or jumpbox virtual machine for normal automation.

The mechanism is a Docker-based Chisel tunnel together with Privoxy, running directly on the GitHub-hosted runner:

  1. A Chisel server container is deployed into the tools virtual network through the deployment workflow.
  2. The Terraform deployment job starts a Chisel client container and Privoxy on the runner, creating an encrypted tunnel into the private network.
  3. Terraform sends data-plane traffic, such as access to the secrets store and private state, through that tunnel, while normal Azure management traffic bypasses it.
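The traffic split in step 3 is typically achieved with ordinary proxy environment variables. The values below are illustrative, not the platform's actual configuration: data-plane hosts go through the local proxy, while hosts on the bypass list go straight to their public endpoints.

```python
# Illustrative proxy split for step 3; the address and host lists are
# examples, not the platform's actual configuration.
env = {
    "HTTPS_PROXY": "http://127.0.0.1:8118",  # Privoxy in front of the Chisel client
    "NO_PROXY": "management.azure.com,login.microsoftonline.com",
}

def uses_tunnel(host: str) -> bool:
    """Data-plane hosts go through the tunnel; NO_PROXY hosts bypass it."""
    return host not in env["NO_PROXY"].split(",")

print(uses_tunnel("myvault.vault.azure.net"))  # True: through the private tunnel
print(uses_tunnel("management.azure.com"))     # False: direct to the public control plane
```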

Azure Bastion and the jumpbox remain available for interactive administrator sessions, such as portal access or command-line debugging, but they are not required for the normal automated deployment pipeline.

There is also an optional module for self-hosted runners on Azure Container Apps for other repositories that cannot use the Chisel proxy approach.

5. GitHub-hosted runners versus Azure DevOps Answered — GitHub (public runners + Chisel tunnel)

Decision: GitHub Actions was chosen over Azure DevOps for the platform's own automation. The deployment jobs run on standard public GitHub-hosted runners rather than on always-on self-hosted machines inside the private network.

Access to private endpoints is handled by the Chisel tunnel approach, which sends the necessary private data traffic through the tools virtual network only when needed. This is cheaper and operationally simpler than maintaining a permanent pool of self-hosted agents.

The optional module for self-hosted runners on Azure Container Apps still exists for workloads that genuinely need persistent private-network compute, but it is not the default operating model for this platform.

See the runner cost section for a more detailed cost comparison.

6. Shared vs dedicated resources per team Pending

Which resources should be shared across teams, and which ones should be dedicated to a single team for stronger isolation?

Proposed split:

Shared across teams: the Azure gateway service for application programming interfaces, model deployments, networking (including virtual networks and security rules), and monitoring infrastructure.

Dedicated per team: compute quotas, storage accounts, Key Vault, and resource groups.
7. Cost allocation and chargeback Pending

How do we track and allocate costs to individual teams and projects when they are using shared artificial intelligence services?

Mechanisms needed:

  • Tagging strategy for cost attribution
  • Subscription-based metering at the gateway layer
  • Monthly cost reports per team
  • Budget alerts
8. Telemetry and tagging strategy Pending

What tagging conventions and telemetry should be implemented for cost tracking and observability?

Required tags:

  • cost-center - ministry/team billing code
  • project - project identifier
  • environment - dev/test/prod
  • owner - responsible team
  • data-classification - public/protected/confidential
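A required-tag list like this is easy to enforce with a small check. The sketch below assumes the five tag names above are the required set; everything else is illustrative.

```python
# Illustrative check that every required tag from the list above is present.
REQUIRED_TAGS = {"cost-center", "project", "environment", "owner", "data-classification"}

def missing_tags(tags: dict) -> set:
    """Return the required tag names a resource is missing."""
    return REQUIRED_TAGS - tags.keys()

print(missing_tags({"project": "ai-hub", "environment": "dev"}))
```

A policy engine or deployment pipeline could reject any resource for which this set is non-empty.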
9. Noisy neighbour mitigation Pending

How do we prevent one team's workload from impacting others in a shared environment (rate limiting, quotas)?

Mitigation strategies:

  • Rate limiting per subscription at the gateway layer
  • Resource quotas per resource group
  • Kubernetes resource limits/requests
  • Model token-per-minute limits
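The first strategy, per-subscription rate limiting, is often implemented as a token bucket. The sketch below is a toy illustration; in practice the gateway enforces these limits, not application code.

```python
import time

# Toy per-subscription token bucket; in practice the gateway enforces limits.
class Bucket:
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict = {}  # one bucket per gateway subscription key

def allow(subscription_key: str) -> bool:
    bucket = buckets.setdefault(subscription_key, Bucket(rate=5, capacity=10))
    return bucket.allow()

print(allow("team-a"))  # True while team-a stays under its burst capacity
```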

Onboarding & Operations

10. Workload isolation model Pending

How should team workloads be isolated from each other? Should that isolation happen through resource groups, subscriptions, Kubernetes namespaces, or a combination of those approaches?

Options:

  • Resource Groups: Simple but limited isolation
  • Subscriptions: Strong isolation but management overhead
  • Kubernetes namespaces: Useful for container-based workloads
  • Hybrid: Shared subscription, separate resource groups per team, and Kubernetes namespaces where needed
11. Non-Foundry services governance Pending

How do we govern teams that want to deploy their own artificial intelligence services, such as Ollama or vLLM, outside of Azure AI Foundry?

Considerations:

  • Security review process for custom deployments
  • Approved container images list
  • Network isolation requirements
  • Logging/monitoring requirements
12. Onboarding journey Pending

What is the full end-to-end journey for a new team, starting with a request for access and ending with real use of platform services?

Proposed flow:

  1. Portal submission
  2. Security and privacy assessment
  3. Resource allocation approval
  4. Technical onboarding session
  5. Sandbox environment provisioning
  6. Production access (after validation)
13. Runbooks and playbooks Answered

The Playbooks page documents operational runbooks for the ongoing care and maintenance of the platform after initial deployment, including:

  • Chisel tunnel setup and proxy operations for private endpoint access
  • Azure Bastion and jumpbox access
  • Gateway subscription key rotation

Additional runbooks (incident response, disaster recovery, cost spike investigation) are in progress.

14. SDPR document text extraction scope Answered

Document Intelligence for SDPR is handled at the Application Gateway layer. SDPR applications call the gateway endpoint, which then routes those requests to the shared Document Intelligence service. No tenant-specific routing configuration is required for that path.

The integration is operational. SDPR is one of three active tenants (wlrs, sdpr, nr-dap) in the test environment.

Pending Decisions

Items requiring team discussion or decisions:
# Question Owner Status
1 Portal workflow completion and onboarding policy Product Management In Progress
2 3-6-9 month roadmap Microsoft In Progress
3 Cost model and chargeback for artificial intelligence services Executive Sponsor Not Started
4 Chisel tunnel proxy setup in the virtual network Platform Team Done
5 Artificial intelligence model governance policy TBD Not Started