AI Services Hub
Azure Landing Zone Infrastructure

Frequently Asked Questions

This page answers common questions about the AI Services Hub, how the private network is designed, how teams are onboarded, and how people are expected to use the platform in practice. The wording here is intentionally plain and explanatory so that readers do not need to already know Azure platform terminology.

Questions still being researched or awaiting a decision are marked with Pending.
Network & Subscriptions
Control vs Data Plane
Onboarding & Access
Technical Architecture
Questions for Microsoft
Pending Decisions

Network & Subscription Setup

What network sizes were allocated for the AI Hub?

Answered
Environment Virtual Network Size Usable IPs
da4cf6-prod 4x /24 (= /22) ~1,020
da4cf6-test 2x /24 (= /23) ~508
da4cf6-dev 1x /24 ~251
da4cf6-tools 1x /24 ~251

Confirmed by Platform Services Team on Dec 11, 2025

Why does the AI Hub need larger private networks than a standard /24?

Answered

The platform is not hosting just one application. It contains several managed Azure services, and many of those services require their own dedicated subnet space and reserve more IP addresses than people expect. Because of that, a small default network range is not enough for production.

The table below shows why the address space grows quickly:

Service Min Size Recommended
Azure gateway service for application programming interfaces /26 /24
App Gateway /26 /24
Azure AI Foundry /25 /24
Private endpoint subnet /27 /27
Azure Kubernetes Service Variable Variable

Total: Once you combine the gateway layer, private endpoints, container and platform subnets, and room for growth, the usual /24 network is too small. That is why production needs a much larger address range.
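The arithmetic can be checked with a short script. This is a sketch only: the address ranges are made-up examples, and it assumes Azure's usual rule of five reserved addresses per subnet.

```python
import ipaddress

# Hypothetical address plan using the recommended sizes from the table.
# Azure reserves 5 addresses in every subnet, so usable = total - 5.
plan = {
    "api-gateway": "10.0.0.0/24",
    "app-gateway": "10.0.1.0/24",
    "ai-foundry": "10.0.2.0/24",
    "private-endpoints": "10.0.3.0/27",
}

AZURE_RESERVED = 5

def usable(cidr: str) -> int:
    """Usable IPs in an Azure subnet (total addresses minus the 5 reserved)."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED

total = sum(usable(c) for c in plan.values())
print(total)  # 251 + 251 + 251 + 27 = 780, far more than one /24 can hold
```

Even before adding Azure Kubernetes Service or growth headroom, the recommended subnets alone need roughly three /24s' worth of usable space, while a single /24 offers only 251 usable addresses.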

Control Plane vs Data Plane

What's the difference between control plane and data plane?

Answered

Azure has two very different kinds of access, and understanding the difference explains many of the platform design choices in this repository.

Control plane access means managing the resource itself: creating it, updating settings, changing permissions, or deleting it. Data plane access means using what is inside that resource: reading a secret, writing a blob, or querying stored data.

Aspect Control Plane Data Plane
What Managing resources (create, configure, delete) Accessing data inside resources
Endpoint management.azure.com *.vault.azure.net, *.blob.core.windows.net
Works from the public internet with short-lived sign-in? Yes. This is why GitHub Actions can still deploy infrastructure changes. No. Private endpoints block this unless traffic first enters the private network.
Examples Create a Key Vault, change settings, assign access roles Read secrets, write blobs, query databases

If you are wondering why some tasks work from a public automation runner and others do not, this is the reason. See the architecture decision record on control plane versus data plane access and the matching diagram for a deeper explanation.
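As a rough illustration of the table above, a small helper can classify a request host by plane. The host lists here are only the examples from the table, not a complete inventory of Azure endpoints.

```python
# Illustrative only: classify a request host by plane, mirroring the
# endpoint row of the table. These host lists are examples, not complete.
CONTROL_PLANE_HOSTS = {"management.azure.com"}
DATA_PLANE_SUFFIXES = (".vault.azure.net", ".blob.core.windows.net")

def plane(host: str) -> str:
    if host in CONTROL_PLANE_HOSTS:
        return "control"  # reachable from public runners with a valid sign-in
    if host.endswith(DATA_PLANE_SUFFIXES):
        return "data"     # blocked by private endpoints unless tunnelled
    return "unknown"

print(plane("management.azure.com"))     # control
print(plane("myvault.vault.azure.net"))  # data
```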

Why can't I see Key Vault secrets in the Azure Portal?

Answered

The Azure Portal is primarily a management interface. It is very good at showing you that a resource exists and how it is configured, but it does not magically bypass a private endpoint.

When you open a Key Vault in the Azure Portal:

  • The overview and configuration pages load normally, because those calls go through the public management endpoint (the control plane).
  • The secret values themselves do not load, because that call goes to the vault's own data-plane endpoint, and the private endpoint makes that reachable only from inside the private network.

In other words, the screen that shows the vault is not the same thing as network access to the vault's secret data. To view secrets, use the Chisel tunnel playbook or connect through the jumpbox by way of Azure Bastion.

Do tenant developers get Chisel access?

Answered

No. Chisel is an administrative tool for the people who operate the platform itself. It is not part of the normal tenant experience.

Tenant developers are expected to use the public entry points that were built for them:

  • the platform gateway endpoint, accessed with their gateway subscription keys
  • the Tenant Onboarding Portal, for onboarding status and routine key retrieval

This separation is deliberate. It keeps tenants isolated from private infrastructure, preserves metering and security boundaries, and avoids giving every team broad internal network access that they do not need.

Which access method should I use?

Answered

Choose the access method based on what you are trying to do. The important distinction is whether you need to reach private data inside Azure resources, or whether you only need to deploy and configure resources.

If you need... Use this
Automated deployment that must read secrets or private state Chisel tunnel with the Azure proxy enabled (enable_azure_proxy)
Administrative access, debugging, or command-line work by platform maintainers Chisel tunnel with the Azure proxy enabled (enable_azure_proxy)
Local development access to private databases or private Azure services Chisel tunnel with the Azure proxy enabled (enable_azure_proxy)
Deployment work that only changes resource definitions and does not need private data access Standard public GitHub Actions runners

See Choosing Your Access Method and the access methods diagram if you want a more detailed decision guide.

Onboarding & Access

How do other teams onboard to the AI Hub?

Answered

Teams onboard through the Tenant Onboarding Portal. This is a custom application built specifically for this platform because the team needed more than a simple request form. The portal stores structured onboarding information, supports review and approval by platform administrators, and gives the platform a place to expose approved tenant information later.

The current operating model works like this:

  1. A team submits its onboarding information through the portal.
  2. Platform administrators review the request, verify that it fits the platform, and then approve or reject it.
  3. Once a tenant has been approved, authorised tenant administrators can return to the portal and view environment-specific information that the platform exposes for them.

This is important because onboarding here is not just a one-time form submission. It is an ongoing workflow that can later include approval history, generated configuration, operational details, and controlled access to credentials and service information.
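The three-step workflow above implies a small set of onboarding states. The sketch below is illustrative only; the real portal may model its states differently.

```python
# Illustrative onboarding states; the real portal may model these differently.
TRANSITIONS = {
    "submitted": {"approved", "rejected"},  # reviewed by platform administrators
    "approved": set(),                      # tenant admins can now view environment info
    "rejected": set(),
}

def advance(state: str, new_state: str) -> str:
    """Move a request to a new state, refusing transitions the workflow forbids."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move from {state!r} to {new_state!r}")
    return new_state

print(advance("submitted", "approved"))  # approved
```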

Still evolving:
  • What prerequisites must teams meet?
  • How much of the downstream provisioning flow should be automated after approval?
  • What's the cost model (chargeback)?

How do tenant administrators get their gateway subscription keys?

Answered

Approved tenant administrators can view their gateway subscription keys directly in the Tenant Onboarding Portal. They do not need to open a support request for routine key retrieval.

Behind the scenes, the portal backend reads the tenant's key material from the shared central Key Vault by using its own managed identity. That means no long-lived stored password is required for the portal to retrieve those values. See the key rotation guide for the full operational details.

What's the 3-6-9 month roadmap?

Pending

A planning session has been proposed, but the roadmap has not yet been finalised. The idea is to move from early pilot work into a clearer sequence of near-term, medium-term, and longer-term platform outcomes.

A working agenda for that session is being drafted.

Status: Planning session with Microsoft to be scheduled. Date TBD.

Technical Architecture

Why can't GitHub Actions run Terraform directly?

Answered

GitHub Actions can run Terraform, but it cannot directly reach private endpoints from a standard public runner. That is the key limitation.

The platform solves this by creating a secure tunnel from the public automation runner into the private Azure network. A Chisel server runs inside an Azure App Service that sits in the private network path, and the deployment workflow sends private data traffic through that tunnel when it needs to talk to things like the private Terraform state account or secrets store.

This means the workflow can still use normal public GitHub-hosted runners for most automation, without paying for a permanently running private build fleet. There is an optional module for self-hosted runners on Azure Container Apps for special cases, but the platform's own deployment path does not depend on it.

Cost note: The Chisel proxy runs as an App Service plan-based component, so cost is tied to that service plan rather than to how often the workflow runs. See Cost Tracking for details.

What Azure services will the AI Hub provide?

Answered

The platform provides shared artificial intelligence services for British Columbia Government teams through a multi-tenant setup. In practical terms, that means one platform supports multiple teams while still keeping each tenant's access, quotas, and configuration separate.

A set of shared artificial intelligence services is already live in the development and test environments. See the services catalogue for the full service and model list, per-tenant allocation details, and usage limits.

How does sensitive personal information redaction work in the current architecture?

Answered

Requests that need sensitive personal information redaction are no longer processed directly inside the gateway policy engine. Instead, the platform gateway forwards the request body and the tenant's redaction settings to a dedicated redaction service that runs as an internal container application.

This design is easier to reason about because all redaction work goes through the same external service, regardless of request size.

  1. The platform gateway reads the tenant-specific settings that control whether redaction is enabled, which categories to exclude, which language to assume, whether failure should block the request, and which message roles should be scanned.
  2. The gateway policy sends the request to the external redaction service over the internal private network path.
  3. The redaction service calls Azure AI Language service over a private endpoint, reconstructs the result into the original request shape, and returns the redacted body together with diagnostic information.
  4. The gateway forwards the request to the downstream backend only after the redaction service reports that it completed the full job successfully.
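The four steps above can be sketched in Python. The function names, settings keys, and the stubbed redaction call are all illustrative, not the platform's actual interfaces.

```python
# Illustrative sketch of steps 1-4; function names and settings keys are
# made up for this example, not the platform's actual interfaces.
def redact_via_service(body: str, settings: dict) -> dict:
    """Stand-in for the internal redaction container app (steps 2-3)."""
    redacted = body.replace("123-456-789", "[REDACTED]")  # placeholder logic
    return {"ok": True, "body": redacted}

def forward_to_backend(body: str, settings: dict) -> str:
    """Gateway-side flow: redact first, forward only on success (step 4)."""
    if not settings.get("enabled", False):
        return body  # redaction disabled for this tenant (step 1)
    result = redact_via_service(body, settings)
    if not result["ok"]:
        if settings.get("block_on_failure", True):
            raise RuntimeError("redaction failed; request blocked")
        return body
    return result["body"]  # only the redacted body reaches the backend

print(forward_to_backend("SIN 123-456-789", {"enabled": True}))  # SIN [REDACTED]
```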

See the Language service redaction guide and the architecture decision record for the external redaction service for the full flow and configuration details.

How does virtual network peering work between environments?

Answered

The platform uses virtual network peering so that the tools environment can reach the development, test, and production environments for operational and deployment work without collapsing everything into one flat network.

This is mainly there to support platform operations and automation. It is not a shortcut that gives tenant teams direct access to internal services in every environment.

Questions for Microsoft

Post kick-off questions for Microsoft regarding the Azure artificial intelligence landing zone (Dec 2025)

Access & Security

1. User access model for AI Foundry Pending

How will end users, such as data scientists and developers, access Azure AI Foundry? What sign-in and authorisation model should the platform expect?

Questions:

  • Should users work directly in the portal, or should access be API-only?
  • Integration with the provincial directory sign-in system?
  • Guest access for contractors?
2. Role-based access model Pending

What access roles are needed for the main user groups, such as platform administrators, team leads, developers, and data scientists?

Personas to define:

  • Platform Admin - full control
  • Team Lead - manage team resources
  • Developer - deploy models, run experiments
  • Data Scientist - read-only model access
  • Auditor - view logs and compliance
3. Governance automation Pending

How can we automate governance policies (data classification, model approval, prompt logging)?

Areas needing automation:

  • Data classification enforcement
  • Model deployment approval workflows
  • Prompt/response logging for audit
  • Detection and masking of sensitive personal information

Infrastructure & Cost

4. Cost-effective infrastructure-as-code deployment without Azure Bastion and a virtual machine Answered

Resolved: The platform uses standard GitHub-hosted runners for automated deployment, including Terraform operations that must touch private endpoints. It does not require a permanently running Azure Bastion session or jumpbox virtual machine for normal automation.

The mechanism is a Docker-based Chisel tunnel together with Privoxy, running directly on the GitHub-hosted runner:

  1. A Chisel server container is deployed into the tools virtual network through the deployment workflow.
  2. The Terraform deployment job starts a Chisel client container and Privoxy on the runner, creating an encrypted tunnel into the private network.
  3. Terraform sends data-plane traffic, such as access to the secrets store and private state, through that tunnel, while normal Azure management traffic bypasses it.
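The traffic split in step 3 is typically achieved with ordinary proxy environment variables. The values below are illustrative, not the platform's actual configuration: data-plane hosts go through the local proxy, while hosts on the bypass list go straight to their public endpoints.

```python
# Illustrative proxy split for step 3; the address and host lists are
# examples, not the platform's actual configuration.
env = {
    "HTTPS_PROXY": "http://127.0.0.1:8118",  # Privoxy in front of the Chisel client
    "NO_PROXY": "management.azure.com,login.microsoftonline.com",
}

def uses_tunnel(host: str) -> bool:
    """Data-plane hosts go through the tunnel; NO_PROXY hosts bypass it."""
    return host not in env["NO_PROXY"].split(",")

print(uses_tunnel("myvault.vault.azure.net"))  # True: through the private tunnel
print(uses_tunnel("management.azure.com"))     # False: direct to the public control plane
```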

Azure Bastion and the jumpbox remain available for interactive administrator sessions, such as portal access or command-line debugging, but they are not required for the normal automated deployment pipeline.

There is also an optional module for self-hosted runners on Azure Container Apps for other repositories that cannot use the Chisel proxy approach.

5. GitHub-hosted runners versus Azure DevOps Answered — GitHub (public runners + Chisel tunnel)

Decision: GitHub Actions was chosen over Azure DevOps for the platform's own automation. The deployment jobs run on standard public GitHub-hosted runners rather than on always-on self-hosted machines inside the private network.

Access to private endpoints is handled by the Chisel tunnel approach, which sends the necessary private data traffic through the tools virtual network only when needed. This is cheaper and operationally simpler than maintaining a permanent pool of self-hosted agents.

The optional module for self-hosted runners on Azure Container Apps still exists for workloads that genuinely need persistent private-network compute, but it is not the default operating model for this platform.

See the runner cost section for a more detailed cost comparison.

6. Shared vs dedicated resources per team Pending

Which resources should be shared across teams, and which ones should be dedicated to a single team for stronger isolation?

Proposed split:

Shared across teams: the Azure gateway service for application programming interfaces, model deployments, networking (including virtual networks and security rules), and monitoring infrastructure.

Dedicated per team: compute quotas, storage accounts, Key Vault, and resource groups.
7. Cost allocation and chargeback Pending

How do we track and allocate costs to individual teams and projects when they are using shared artificial intelligence services?

Mechanisms needed:

  • Tagging strategy for cost attribution
  • Subscription-based metering at the gateway layer
  • Monthly cost reports per team
  • Budget alerts
8. Telemetry and tagging strategy Pending

What tagging conventions and telemetry should be implemented for cost tracking and observability?

Required tags:

  • cost-center - ministry/team billing code
  • project - project identifier
  • environment - dev/test/prod
  • owner - responsible team
  • data-classification - public/protected/confidential
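A required-tag list like this is easy to enforce with a small check. The sketch below assumes the five tag names above are the required set; everything else is illustrative.

```python
# Illustrative check that every required tag from the list above is present.
REQUIRED_TAGS = {"cost-center", "project", "environment", "owner", "data-classification"}

def missing_tags(tags: dict) -> set:
    """Return the required tag names a resource is missing."""
    return REQUIRED_TAGS - tags.keys()

print(missing_tags({"project": "ai-hub", "environment": "dev"}))
```

A policy engine or deployment pipeline could reject any resource for which this set is non-empty.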
9. Noisy neighbour mitigation Pending

How do we prevent one team's workload from impacting others in a shared environment (rate limiting, quotas)?

Mitigation strategies:

  • Rate limiting per subscription at the gateway layer
  • Resource quotas per resource group
  • Kubernetes resource limits/requests
  • Model token-per-minute limits
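The first strategy, per-subscription rate limiting, is often implemented as a token bucket. The sketch below is a toy illustration; in practice the gateway enforces these limits, not application code.

```python
import time

# Toy per-subscription token bucket; in practice the gateway enforces limits.
class Bucket:
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict = {}  # one bucket per gateway subscription key

def allow(subscription_key: str) -> bool:
    bucket = buckets.setdefault(subscription_key, Bucket(rate=5, capacity=10))
    return bucket.allow()

print(allow("team-a"))  # True while team-a stays under its burst capacity
```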

Onboarding & Operations

10. Workload isolation model Pending

How should team workloads be isolated from each other? Should that isolation happen through resource groups, subscriptions, Kubernetes namespaces, or a combination of those approaches?

Options:

  • Resource Groups: Simple but limited isolation
  • Subscriptions: Strong isolation but management overhead
  • Kubernetes namespaces: Useful for container-based workloads
  • Hybrid: Shared subscription, separate resource groups per team, and Kubernetes namespaces where needed
11. Non-Foundry services governance Pending

How do we govern teams that want to deploy their own artificial intelligence services, such as Ollama or vLLM, outside of Azure AI Foundry?

Considerations:

  • Security review process for custom deployments
  • Approved container images list
  • Network isolation requirements
  • Logging/monitoring requirements
12. Onboarding journey Pending

What is the full end-to-end journey for a new team, starting with a request for access and ending with real use of platform services?

Proposed flow:

  1. Portal submission
  2. Security and privacy assessment
  3. Resource allocation approval
  4. Technical onboarding session
  5. Sandbox environment provisioning
  6. Production access (after validation)
13. Runbooks and playbooks Answered

The Playbooks page documents operational runbooks for the ongoing care and maintenance of the platform after initial deployment, including:

  • Chisel tunnel setup and proxy operations for private endpoint access
  • Azure Bastion and jumpbox access
  • Gateway subscription key rotation

Additional runbooks (incident response, disaster recovery, cost spike investigation) are in progress.

14. SDPR document text extraction scope Answered

Document Intelligence for SDPR is handled at the Application Gateway layer. SDPR applications call the gateway endpoint, which then routes those requests to the shared Document Intelligence service. No tenant-specific routing configuration is required for that path.

The integration is operational. SDPR is one of three active tenants (wlrs, sdpr, nr-dap) in the test environment.

Pending Decisions

Items requiring team discussion or decisions:
# Question Owner Status
1 Portal workflow completion and onboarding policy Product Management In Progress
2 3-6-9 month roadmap Microsoft In Progress
3 Cost model and chargeback for artificial intelligence services Executive Sponsor Not Started
4 Chisel tunnel proxy setup in the virtual network Platform Team Done
5 Artificial intelligence model governance policy TBD Not Started