Frequently Asked Questions
This page answers common questions about the AI Services Hub, how the private network is designed, how teams are onboarded, and how people are expected to use the platform in practice. The wording here is intentionally plain and explanatory so that readers do not need to already know Azure platform terminology.
Network & Subscription Setup
What network sizes were allocated for the AI Hub?
Answered

| Environment | Virtual Network Size | Usable IPs |
|---|---|---|
| da4cf6-prod | 4x /24 (= /22) | ~1,020 |
| da4cf6-test | 2x /24 (= /23) | ~508 |
| da4cf6-dev | 1x /24 | ~251 |
| da4cf6-tools | 1x /24 | ~251 |
Confirmed by Platform Services Team on Dec 11, 2025
Why does the AI Hub need larger private networks than a standard /24?
Answered

The platform is not hosting just one application. It contains several managed Azure services, and many of those services require their own dedicated subnet space and reserve more IP addresses than people expect. Because of that, a small default network range is not enough for production.
The table below shows why the address space grows quickly:
| Service | Min Size | Recommended |
|---|---|---|
| Azure gateway service for application programming interfaces | /26 | /24 |
| App Gateway | /26 | /24 |
| Azure AI Foundry | /25 | /24 |
| Private endpoint subnet | /27 | /27 |
| Azure Kubernetes Service | Variable | Variable |
Total: Once you combine the gateway layer, private endpoints, container and platform subnets, and room for growth, the usual /24 network is too small. That is why production needs a much larger address range.
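A quick way to sanity-check the sizing argument is with Python's standard ipaddress module. This sketch only counts addresses per recommended prefix length from the table above (Azure Kubernetes Service is excluded because its size varies); the base network used is arbitrary.

```python
import ipaddress

# Recommended subnet sizes from the table above (AKS excluded: it varies).
recommended = {
    "api-gateway": "/24",
    "app-gateway": "/24",
    "ai-foundry": "/24",
    "private-endpoints": "/27",
}

# Count the total addresses those prefixes consume. The 10.0.0.0 base is
# only there so ip_network can parse each prefix; we never allocate it.
total = sum(
    ipaddress.ip_network(f"10.0.0.0{prefix}").num_addresses
    for prefix in recommended.values()
)
single_slash_24 = ipaddress.ip_network("10.0.0.0/24").num_addresses  # 256

print(total)                      # 800 addresses before AKS or growth
print(total > single_slash_24)    # True: one /24 cannot hold this layout
```

Even before adding Kubernetes subnets or growth headroom, the recommended layout already needs roughly three /24 networks.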
Control Plane vs Data Plane
What's the difference between control plane and data plane?
Answered

Azure has two very different kinds of access, and understanding the difference explains many of the platform design choices in this repository.
Control plane access means managing the resource itself: creating it, updating settings, changing permissions, or deleting it. Data plane access means using what is inside that resource: reading a secret, writing a blob, or querying stored data.
| Aspect | Control Plane | Data Plane |
|---|---|---|
| What | Managing resources (create, configure, delete) | Accessing data inside resources |
| Endpoint | management.azure.com | *.vault.azure.net, *.blob.core.windows.net |
| Works from the public internet with short-lived sign-in? | Yes. This is why GitHub Actions can still deploy infrastructure changes. | No. Private endpoints block this unless traffic first enters the private network. |
| Examples | Create a Key Vault, change settings, assign access roles | Read secrets, write blobs, query databases |
If you are wondering why some tasks work from a public automation runner and others do not, this is the reason. See the architecture decision record on control plane versus data plane access and the matching diagram for a deeper explanation.
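The endpoint split in the table can be sketched as a small classifier. This is illustrative only: the function name is an assumption, and the hostname patterns are the two examples from the table, not an exhaustive list of Azure data-plane domains.

```python
# Illustrative helper: classify an Azure endpoint as control plane or data
# plane using the hostname patterns from the table above.
def plane_for(hostname: str) -> str:
    if hostname == "management.azure.com":
        return "control"  # manage the resource: create, configure, delete
    data_suffixes = (".vault.azure.net", ".blob.core.windows.net")
    if hostname.endswith(data_suffixes):
        return "data"  # reaches inside the resource; private endpoints block this
    return "unknown"

print(plane_for("management.azure.com"))     # control
print(plane_for("myvault.vault.azure.net"))  # data
```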
Why can't I see Key Vault secrets in the Azure Portal?
Answered

The Azure Portal is primarily a management interface. It is very good at showing you that a resource exists and how it is configured, but it does not magically bypass a private endpoint.
When you open a Key Vault in the Azure Portal:
- ✓ You can see that the vault exists, because listing resources is a management action.
- ✓ You can see the vault configuration, because reading settings is also a management action.
- ✗ You cannot reveal the secret value unless your traffic is actually coming from inside the allowed private network path.
In other words, the screen that shows the vault is not the same thing as network access to the vault's secret data. To view secrets, use the Chisel tunnel playbook or connect through the jumpbox by way of Azure Bastion.
Do tenant developers get Chisel access?
Answered

No. Chisel is an administrative tool for the people who operate the platform itself. It is not part of the normal tenant experience.
Tenant developers are expected to use the public entry points that were built for them:
- Application Gateway and the Azure gateway service for application programming interfaces for public service endpoints
- Their own applications, which then call those gateway endpoints
This separation is deliberate. It keeps tenants isolated from private infrastructure, preserves metering and security boundaries, and avoids giving every team broad internal network access that they do not need.
Which access method should I use?
Answered

Choose the access method based on what you are trying to do. The important distinction is whether you need to reach private data inside Azure resources, or whether you only need to deploy and configure resources.
| If you need... | Use this |
|---|---|
| Automated deployment that must read secrets or private state | Chisel tunnel with the Azure proxy enabled (enable_azure_proxy) |
| Administrative access, debugging, or command-line work by platform maintainers | Chisel tunnel with the Azure proxy enabled (enable_azure_proxy) |
| Local development access to private databases or private Azure services | Chisel tunnel with the Azure proxy enabled (enable_azure_proxy) |
| Deployment work that only changes resource definitions and does not need private data access | Standard public GitHub Actions runners |
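The table above collapses to a single question, which this tiny helper mirrors. The function and its return strings are illustrative, not a platform API.

```python
# Illustrative decision helper mirroring the access-method table above.
def access_method(needs_private_data: bool) -> str:
    if needs_private_data:
        # Secrets, private Terraform state, private databases or services.
        return "Chisel tunnel with enable_azure_proxy"
    # Only changing resource definitions (control plane work).
    return "standard public GitHub Actions runner"

print(access_method(True))   # Chisel tunnel with enable_azure_proxy
print(access_method(False))  # standard public GitHub Actions runner
```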
See Choosing Your Access Method and the access methods diagram if you want a more detailed decision guide.
Onboarding & Access
How do other teams onboard to the AI Hub?
Answered

Teams onboard through the Tenant Onboarding Portal. This is a custom application built specifically for this platform because the team needed more than a simple request form. The portal stores structured onboarding information, supports review and approval by platform administrators, and gives the platform a place to expose approved tenant information later.
The current operating model works like this:
- A team submits its onboarding information through the portal.
- Platform administrators review the request, verify that it fits the platform, and then approve or reject it.
- Once a tenant has been approved, authorised tenant administrators can return to the portal and view environment-specific information that the platform exposes for them.
This is important because onboarding here is not just a one-time form submission. It is an ongoing workflow that can later include approval history, generated configuration, operational details, and controlled access to credentials and service information.
Open questions still being worked through:

- What prerequisites must teams meet?
- How much of the downstream provisioning flow should be automated after approval?
- What is the cost model (chargeback)?
How do tenant administrators get their gateway subscription keys?
Answered

Approved tenant administrators can view their gateway subscription keys directly in the Tenant Onboarding Portal. They do not need to open a support request for routine key retrieval.
- Only users listed as tenant administrators can access the credential panel.
- The portal separates information by environment, so development, test, and production values are shown in different tabs.
- The primary and secondary keys are masked on screen and exposed through copy actions rather than printed openly in the page.
- When automatic key rotation is enabled, the portal also shows explanatory timing information such as the last rotation and the next scheduled rotation.
Behind the scenes, the portal backend reads the tenant's key material from the shared central Key Vault by using its own managed identity. That means no long-lived stored password is required for the portal to retrieve those values. See the key rotation guide for the full operational details.
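The on-screen masking mentioned above can be sketched as follows. The function name and output format are assumptions for illustration, not the portal's actual implementation, and the key shown is a made-up value.

```python
# Illustrative sketch of masking a gateway subscription key for display:
# hide everything except the last few characters.
def mask_key(key: str, visible: int = 4) -> str:
    if len(key) <= visible:
        return "*" * len(key)  # too short to partially reveal
    return "*" * (len(key) - visible) + key[-visible:]

print(mask_key("1a2b3c4d5e6f7a8b"))  # ************7a8b
```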
What's the 3-6-9 month roadmap?
Pending

A planning session has been proposed, but the roadmap has not yet been finalised. The idea is to move from early pilot work into a clearer sequence of near-term, medium-term, and longer-term platform outcomes.
The working agenda currently includes:
- Role clarity and decision ownership
- The first technical capabilities that should be delivered to show value quickly
- The platform improvements that make onboarding easier for new teams
- A baseline set of milestones for the next three, six, and nine months
- A work tracker and delivery rhythm
- Regular review meetings and planning cadence
Technical Architecture
Why can't GitHub Actions run Terraform directly?
Answered

GitHub Actions can run Terraform, but it cannot directly reach private endpoints from a standard public runner. That is the key limitation.
The platform solves this by creating a secure tunnel from the public automation runner into the private Azure network. A Chisel server runs inside an Azure App Service that sits in the private network path, and the deployment workflow sends private data traffic through that tunnel when it needs to talk to things like the private Terraform state account or secrets store.
This means the workflow can still use normal public GitHub-hosted runners for most automation, without paying for a permanently running private build fleet. There is an optional module for self-hosted runners on Azure Container Apps for special cases, but the platform's own deployment path does not depend on it.
Cost note: The Chisel proxy runs as an App Service plan-based component, so cost is tied to that service plan rather than to how often the workflow runs. See Cost Tracking for details.
What Azure services will the AI Hub provide?
Answered

The platform provides shared artificial intelligence services for British Columbia Government teams through a multi-tenant setup. In practical terms, that means one platform supports multiple teams while still keeping each tenant's access, quotas, and configuration separate.
The following services are currently live in the development and test environments:
- Azure AI Foundry for language models, reasoning models, and text embedding models
- The Azure gateway service for application programming interfaces for tenant-specific subscription keys, per-tenant rate limiting, and usage metering
- Azure Application Gateway with built-in web request protection for public entry, secure transport termination, and routing toward the right backend service
- The Azure language service for sensitive personal information detection, now routed through a dedicated external redaction service
- Azure AI Search, Speech, and Document Intelligence as shared cognitive services exposed through the gateway layer
See the services catalogue for the full model list, per-tenant allocation details, and usage limits.
How does sensitive personal information redaction work in the current architecture?
Answered

Requests that need sensitive personal information redaction are no longer processed directly inside the gateway policy engine. Instead, the platform gateway forwards the request body and the tenant's redaction settings to a dedicated redaction service that runs as an internal container application.
This design is easier to reason about because all redaction work goes through the same external service, regardless of request size.
- The platform gateway reads the tenant-specific settings that control whether redaction is enabled, which categories to exclude, which language to assume, whether failure should block the request, and which message roles should be scanned.
- The gateway policy sends the request to the external redaction service over the internal private network path.
- The redaction service calls Azure AI Language service over a private endpoint, reconstructs the result into the original request shape, and returns the redacted body together with diagnostic information.
- The gateway forwards the request to the downstream backend only after the redaction service reports that it completed the full job successfully.
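The flow above can be illustrated with a toy stand-in. The real service calls the Azure AI Language service over a private endpoint for entity detection; here a simple email-address regex stands in for that step so the rebuild-and-report shape of the flow can be shown. All names are hypothetical.

```python
import re

# Toy stand-in for the redaction flow described above. A regex for email
# addresses substitutes for real entity detection by Azure AI Language.
def redact(body: str, enabled: bool = True) -> tuple[str, dict]:
    if not enabled:
        return body, {"redacted": 0}  # redaction disabled for this tenant
    pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    redacted_body, count = pattern.subn("[REDACTED-EMAIL]", body)
    # Return the rebuilt body plus diagnostics, as the real service does.
    return redacted_body, {"redacted": count}

body, diag = redact("Contact alice@example.com about the ticket.")
print(body)  # Contact [REDACTED-EMAIL] about the ticket.
print(diag)  # {'redacted': 1}
```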
See the Language service redaction guide and the architecture decision record for the external redaction service for the full flow and configuration details.
How does virtual network peering work between environments?
Answered

The platform uses virtual network peering so that the tools environment can reach the development, test, and production environments for operational and deployment work without collapsing everything into one flat network.
- Tools to development, test, and production: outbound traffic from the tools environment is allowed toward the other environments so deployment and maintenance workflows can reach what they need.
This is mainly there to support platform operations and automation. It is not a shortcut that gives tenant teams direct access to internal services in every environment.
Questions for Microsoft
Access & Security
1. User access model for AI Foundry Pending
How will end users, such as data scientists and developers, access Azure AI Foundry? What sign-in and authorisation model should the platform expect?
Questions:
- Should users work directly in the portal, or should access stay application-programming-interface only?
- Integration with the provincial directory sign-in system?
- Guest access for contractors?
2. Role-based access model Pending
What access roles are needed for the main user groups, such as platform administrators, team leads, developers, and data scientists?
Personas to define:
- Platform Admin - full control
- Team Lead - manage team resources
- Developer - deploy models, run experiments
- Data Scientist - read-only model access
- Auditor - view logs and compliance
3. Governance automation Pending
How can we automate governance policies (data classification, model approval, prompt logging)?
Areas needing automation:
- Data classification enforcement
- Model deployment approval workflows
- Prompt/response logging for audit
- Detection and masking of sensitive personal information
Infrastructure & Cost
4. Cost-effective infrastructure-as-code deployment without Azure Bastion and a virtual machine Answered
Resolved: The platform uses standard GitHub-hosted runners for automated deployment, including Terraform operations that must touch private endpoints. It does not require a permanently running Azure Bastion session or jumpbox virtual machine for normal automation.
The mechanism is a Docker-based Chisel tunnel together with Privoxy, running directly on the GitHub-hosted runner:
- A Chisel server container is deployed into the tools virtual network through the deployment workflow.
- The Terraform deployment job starts a Chisel client container and Privoxy on the runner, creating an encrypted tunnel into the private network.
- Terraform sends data-plane traffic, such as access to the secrets store and private state, through that tunnel, while normal Azure management traffic bypasses it.
Azure Bastion and the jumpbox remain available for interactive administrator sessions, such as portal access or command-line debugging, but they are not required for the normal automated deployment pipeline.
There is also an optional module for self-hosted runners on Azure Container Apps for other repositories that cannot use the Chisel proxy approach.
5. GitHub-hosted runners versus Azure DevOps Answered — GitHub (public runners + Chisel tunnel)
Decision: GitHub Actions was chosen over Azure DevOps for the platform's own automation. The deployment jobs run on standard public GitHub-hosted runners rather than on always-on self-hosted machines inside the private network.
Access to private endpoints is handled by the Chisel tunnel approach, which sends the necessary private data traffic through the tools virtual network only when needed. This is cheaper and operationally simpler than maintaining a permanent pool of self-hosted agents.
The optional module for self-hosted runners on Azure Container Apps still exists for workloads that genuinely need persistent private-network compute, but it is not the default operating model for this platform.
See the runner cost section for a more detailed cost comparison.
6. Shared vs dedicated resources per team Pending
Which resources should be shared across teams, and which ones should be dedicated to a single team for stronger isolation?
Proposed split:
| Shared | Dedicated |
|---|---|
| Azure gateway service for application programming interfaces | Compute quotas |
| Model deployments | Storage accounts |
| Networking, including virtual networks and security rules | Key Vault |
| Monitoring infrastructure | Resource groups |
7. Cost allocation and chargeback Pending
How do we track and allocate costs to individual teams and projects when they are using shared artificial intelligence services?
Mechanisms needed:
- Tagging strategy for cost attribution
- Subscription-based metering at the gateway layer
- Monthly cost reports per team
- Budget alerts
8. Telemetry and tagging strategy Pending
What tagging conventions and telemetry should be implemented for cost tracking and observability?
Required tags:
- cost-center: ministry/team billing code
- project: project identifier
- environment: dev/test/prod
- owner: responsible team
- data-classification: public/protected/confidential
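A lightweight way to enforce the required tags is a presence check at deployment time. The tag names come from the list above; the function and sample values are illustrative, not an existing platform check.

```python
# Hypothetical validation that a resource carries the required tags above.
REQUIRED_TAGS = {
    "cost-center",
    "project",
    "environment",
    "owner",
    "data-classification",
}

def missing_tags(resource_tags: dict[str, str]) -> set[str]:
    # Report which required tags the resource does not carry.
    return REQUIRED_TAGS - resource_tags.keys()

tags = {"project": "ai-hub", "environment": "dev", "owner": "platform-team"}
print(sorted(missing_tags(tags)))  # ['cost-center', 'data-classification']
```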
9. Noisy neighbor mitigation Pending
How do we prevent one team's workload from impacting others in a shared environment (rate limiting, quotas)?
Mitigation strategies:
- Rate limiting per subscription at the gateway layer
- Resource quotas per resource group
- Kubernetes resource limits/requests
- Model token-per-minute limits
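Gateway-layer rate limiting, the first strategy above, is often implemented as a token bucket per subscription key. This is a toy sketch to show the mechanism; real enforcement happens in gateway policy, and the capacity and refill numbers are made up.

```python
# Toy token-bucket limiter illustrating per-subscription rate limiting.
class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)   # start full
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token for this request
            return True
        return False          # over the limit: reject or throttle

bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
print([bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)])
# [True, True, False, True]: burst of 2, then throttled until refill
```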
Onboarding & Operations
10. Workload isolation model Pending
How should team workloads be isolated from each other? Should that isolation happen through resource groups, subscriptions, Kubernetes namespaces, or a combination of those approaches?
Options:
- Resource Groups: Simple but limited isolation
- Subscriptions: Strong isolation but management overhead
- Kubernetes namespaces: Useful for container-based workloads
- Hybrid: Shared subscription, separate resource groups per team, and Kubernetes namespaces where needed
11. Non-Foundry services governance Pending
How do we govern teams that want to deploy their own artificial intelligence services, such as Ollama or vLLM, outside of Azure AI Foundry?
Considerations:
- Security review process for custom deployments
- Approved container images list
- Network isolation requirements
- Logging/monitoring requirements
12. Onboarding journey Pending
What is the full end-to-end journey for a new team, starting with a request for access and ending with real use of platform services?
Proposed flow:
- Portal submission
- Security and privacy assessment
- Resource allocation approval
- Technical onboarding session
- Sandbox environment provisioning
- Production access (after validation)
13. Runbooks and playbooks Answered
The Playbooks page documents operational runbooks for the ongoing care and maintenance of the platform after initial deployment, including:
- Chisel tunnel setup and proxy operations for private endpoint access
- Azure Bastion and jumpbox access
- Gateway subscription key rotation
Additional runbooks (incident response, disaster recovery, cost spike investigation) are in progress.
14. SDPR document text extraction scope Answered
Document Intelligence for SDPR is handled at the Application Gateway layer. SDPR applications call the gateway endpoint, which then routes those requests to the shared Document Intelligence service. No tenant-specific routing configuration is required for that path.
The integration is operational. SDPR is one of three active tenants (wlrs, sdpr, nr-dap) in the test environment.
Pending Decisions
| # | Question | Owner | Status |
|---|---|---|---|
| 1 | Portal workflow completion and onboarding policy | Product Management | In Progress |
| 2 | 3-6-9 month roadmap | Microsoft | In Progress |
| 3 | Cost model and chargeback for artificial intelligence services | Executive Sponsor | Not Started |
| 4 | Chisel tunnel proxy setup in the virtual network | Platform Team | Done |
| 5 | Artificial intelligence model governance policy | TBD | Not Started |