AI Services Hub
Azure Landing Zone Infrastructure

GitHub Actions Automation Workflows

This page explains the GitHub Actions workflows used by the repository. These workflows build containers, validate infrastructure changes, and deploy platform resources by using passwordless sign-in to Azure through OpenID Connect.

Workflow Overview

Workflow Trigger Purpose
.builds.yml Called by other workflows Reusable container build workflow for azure-proxy images
.deployer.yml Called by other workflows Reusable Terraform deployer for initial-setup/infra (tools bootstrap and module-level operations)
.deployer-using-secure-tunnel.yml Called by other workflows Reusable Terraform deployer for infra-ai-hub through Chisel + Privoxy secure tunnel
.lint.yml Called by other workflows Reusable validation: pre-commit (terraform fmt + tflint), conventional commits, fork check
add-or-remove-module.yml Manual (workflow_dispatch) Deploy or destroy selected tools modules (bastion, azure_proxy, jumpbox, github_runners_aca)
manual-dispatch.yml Manual (workflow_dispatch) Run plan/apply/destroy for dev/test/prod; prod apply requires a semver tag (e.g. v1.2.3) and creates a GitHub Release
merge-main.yml Push to main Automatic post-merge: semantic version tag via conventional commits, apply infrastructure to test
pr-open.yml Pull request events + manual trigger PR validation: lint, container builds, deploy proxy in tools, and plan against test
schedule.yml Cron (daily at 5 PM PST) Auto-destroy Bastion for cost savings
pages.yml Push to main (docs and Terraform roots) + manual trigger Generate docs and deploy GitHub Pages site

Tenant Onboarding Portal Workflows

The tenant onboarding portal keeps its Terraform in tenant-onboarding-portal/infra, while the application code lives under tenant-onboarding-portal/backend and tenant-onboarding-portal/frontend. GitHub Actions provisions App Service from that infra/ root, builds the frontend separately, copies the SPA into the backend deployment bundle, and deploys the backend package to App Service via zip deployment.
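The copy-then-zip step can be sketched as follows. The wwwroot/ target and every file name below are illustrative assumptions (the real bundle layout lives in the workflow), and a throwaway demo tree stands in for the actual checkout:

```shell
set -eu

# Demo tree standing in for tenant-onboarding-portal/ (file names here are
# illustrative, not the repo's actual layout)
SRC=$(mktemp -d)
mkdir -p "$SRC/backend" "$SRC/frontend/dist"
echo 'backend code'     > "$SRC/backend/app.py"
echo '<html>spa</html>' > "$SRC/frontend/dist/index.html"

# Copy backend sources, then drop the built SPA into the bundle so a
# single zip serves both the API and the frontend
STAGE=$(mktemp -d)
cp -r "$SRC/backend/." "$STAGE/"
mkdir -p "$STAGE/wwwroot"
cp -r "$SRC/frontend/dist/." "$STAGE/wwwroot/"

ls -R "$STAGE"
# CI then zips $STAGE and ships it with a zip deployment:
#   az webapp deploy --resource-group <rg> --name <app> \
#     --src-path portal.zip --type zip
```

`az webapp deploy --type zip` is the standard CLI form of the zip deployment the workflows perform.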

Workflow Portal-specific behavior
pr-open.yml Detects changes anywhere under tenant-onboarding-portal/, provisions a preview App Service in tools, builds frontend/ and backend/, then deploys the backend zip for PR validation.
pr-close.yml Destroys the preview portal environment created for the PR after the pull request is closed.
merge-main.yml On merges to main, detects portal changes, provisions the shared tools App Service if needed, builds the frontend and backend, deploys to the staging slot, health-checks it, then swaps staging into production.
portal-deploy.yml Manual tools redeploy for the same backend/frontend bundle, following the same provision, staging deploy, health check, and slot-swap path as the automated main deployment.

Developer SDLC Flow (Branch → PR → TEST → PROD)

The repository enforces a promote-through-environments workflow. Every change flows through a validated path before reaching production.

  1. Create feature branch from main (e.g. feat/<work-item>, fix/<issue>) and commit changes.
  2. Open PR to main, which triggers pr-open.yml:
    • Lint (pre-commit: terraform fmt + tflint, conventional commit title, fork check)
    • Container image builds
    • Terraform plan against test (via secure tunnel) — summary appended to PR description
  3. Merge to main once PR checks and code review pass.
  4. Auto-apply to test + semver tag: merge-main.yml starts two jobs concurrently — semantic versioning (conventional commit history → v1.2.3 tag + CHANGELOG.md update) and proxy bootstrap in tools. Container images are then re-tagged with the semver version. Once the proxy is up, it applies infrastructure to test through the Chisel tunnel. After a successful apply, integration tests run automatically (two phases — see below).
  5. Promote to prod: Use manual-dispatch.yml, select prod + apply, and provide the semver tag. The workflow verifies all container images exist with that tag, then deploys using the semver-tagged images (not latest). This is gated — requires approval from designated reviewers (configured in the GitHub prod environment protection rules).
  6. Release created: After successful prod apply, a GitHub Release is automatically created from the deployed tag with deployment metadata.
PROD is gated: The prod environment has required reviewers configured as a GitHub Environment protection rule. Any workflow job targeting prod will pause and wait for manual approval before executing. Only designated reviewers can approve.
Developer branch testing (occasional): Developers can run manual dispatch to dev from a feature/PR branch for targeted validation; use this sparingly and coordinate with other developers. The dev environment does not include App Gateway or DNS Zone, so full ingress-path validation is only available in test and prod.
Feature Branch
   │
   └── Pull Request ──► pr-open.yml
                        ├─ .lint.yml (fmt, tflint, conventional commits, fork check)
                        ├─ .builds.yml
                        ├─ .deployer.yml (tools/azure_proxy)
                        ├─ .deployer-using-secure-tunnel.yml (test plan)
                        └─ Update PR description with plan summary
                                  │
                                  ▼
                             Merge to main
                                  │
                                  ▼
                        merge-main.yml
                        ├─ Semantic version tag (v1.2.3) ─────────────────── concurrent
                        ├─ Tag container images (latest → v1.2.3) ──────── after semver
                        ├─ Deploy azure_proxy (tools) ───────────────────────────────┤
                        └─ Apply to test (via Chisel tunnel)
                             └─ Integration tests (post-apply)
                                  ├─ Direct: all tests except apim-key-rotation.bats
                                  └─ Via proxy: apim-key-rotation.bats (KV private endpoint)
                                  │
                                  ▼
                        manual-dispatch.yml (prod + apply + tag)
                        ├─ ⏸ Requires prod environment approval
                        ├─ Verify container images exist for semver tag
                        ├─ Apply to prod using semver-tagged images
                        └─ Create GitHub Release

Advanced Branching Patterns

The basic SDLC flow above covers the single-feature-per-PR path. In practice, teams often need to coordinate dependent work or bundle multiple features into a single release. Two patterns handle this: Stacked PRs and Release PRs.

Stacked PRs (Dependent Feature Chains)

Use stacked PRs when a feature depends on another feature that hasn't merged to main yet. Each PR in the stack targets the previous branch instead of main.

When to Use

How It Works

  1. Create feat/base-network from main → open PR #1 targeting main
  2. Create feat/add-apim-policy from feat/base-network → open PR #2 targeting feat/base-network
  3. Each PR triggers pr-open.yml independently for lint and plan
  4. Merge PR #1 first (bottom of the stack) — this triggers merge-main.yml
  5. Retarget PR #2 to main and rebase onto updated main
  6. Merge PR #2 — triggers another merge-main.yml run with its own semver tag
main ─────────────────────────┬───────────────────┬──────►
   \                          │ merge PR #1        │ merge PR #2
    └─ feat/base-network ─────┘   (v1.3.0)        │
         \                                         │
          └─ feat/add-apim-policy ─── rebase ──────┘
                                                 (v1.4.0)
Important: Always merge bottom-up. If you merge PR #2 before PR #1, the diff will include both sets of changes and the base branch won't exist in main yet. After merging the base PR, rebase the dependent branch onto main to pick up any squash-merge differences before merging.
CI Behaviour: Each PR in the stack runs pr-open.yml against test. The plan for PR #2 will show changes from both branches while PR #1 is open (because the diff includes the base). After PR #1 merges, re-running PR #2's checks shows only its own changes. Each merge to main produces its own semver tag.
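The bottom-up merge and rebase in steps 4 and 5 can be simulated locally with plain git. Branch names follow the example above; the local squash merge stands in for GitHub's squash-merge button, and the repo here is throwaway:

```shell
set -eu
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com
cd "$(mktemp -d)"
git init -q -b main
git commit -q --allow-empty -m "chore: init"

git checkout -q -b feat/base-network
echo vnet > network.tf && git add network.tf && git commit -q -m "feat: base network"

git checkout -q -b feat/add-apim-policy
echo policy > apim.tf && git add apim.tf && git commit -q -m "feat: apim policy"

# Merge PR #1 (bottom of the stack) as a squash commit on main
git checkout -q main
git merge --squash -q feat/base-network
git commit -q -m "feat: base network (#1)"

# Retarget PR #2: replay only its own commits onto the updated main
git rebase -q --onto main feat/base-network feat/add-apim-policy

# Only the apim-policy commit is now ahead of main
git log --oneline main..HEAD
```

The `--onto main feat/base-network` form is what makes the rebase drop the already-squash-merged base commits instead of replaying them.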

Release PRs (Bundled Multi-Feature Releases)

Use a Release PR when multiple developers are working on separate features that should ship together as a single coordinated release. All feature branches merge into a shared release branch, which then opens one PR to main.

When to Use

How It Works

  1. Create a release branch from main: release/sprint-42 (or release/apim-v2, etc.)
  2. Developers create feature branches from the release branch:
    • feat/new-tenant-config → PR targeting release/sprint-42
    • feat/apim-rate-limits → PR targeting release/sprint-42
    • fix/dns-zone-ttl → PR targeting release/sprint-42
  3. Feature PRs are reviewed and merged into the release branch (these merges do not trigger merge-main.yml since they don't target main)
  4. Optionally deploy the release branch to dev via manual-dispatch.yml to validate the combined changes
  5. When all features are complete, open one Release PR: release/sprint-42 → main
  6. The Release PR triggers pr-open.yml — the plan shows the aggregate of all bundled changes
  7. Merge the Release PR → merge-main.yml creates one semver tag and applies to test
main ─────────────────────────────────────────┬──────────►
   \                                           │ merge Release PR
    └─ release/sprint-42 ───┬──────┬──────────┘  (v2.0.0)
         \                  │      │
          ├─ feat/tenant ───┘      │
          │                        │
          └─ feat/rate-limits ─────┘
CI Behaviour: PRs targeting the release branch still trigger pr-open.yml (lint and plan), giving each feature its own review cycle. The final Release PR to main runs the full pipeline and shows a combined plan. Only the merge to main triggers merge-main.yml for semver tagging and test apply.
PR Title Convention: The Release PR title must follow Conventional Commits (e.g. feat: sprint 42 release — tenant config and rate limits) since it controls the semver bump. Use feat: for minor, fix: for patch, or include BREAKING CHANGE in the body for major.
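As a rough sketch of that bump rule (the repository delegates the real decision to conventional-changelog tooling; this function is only an approximation for illustration):

```shell
# Approximate mapping from a Conventional Commits PR title to the semver
# bump; not the changelog action's actual logic
bump_for_title() {
  case "$1" in
    *"BREAKING CHANGE"*|*'!: '*) echo major ;;
    feat:*|'feat('*)             echo minor ;;
    fix:*|'fix('*)               echo patch ;;
    *)                           echo none ;;   # chore:, docs:, etc.
  esac
}

bump_for_title "feat: sprint 42 release"   # minor
bump_for_title "fix: dns zone ttl"         # patch
```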

Choosing Between Stacked PRs and Release PRs

Stacked PRs Release PR
Best for Sequential/dependent changes by 1–2 developers Parallel independent features by multiple developers
Semver tags One tag per merged PR (e.g. v1.3.0, v1.4.0) One tag for the entire bundle (e.g. v2.0.0)
Review granularity Each PR is reviewed independently with a focused diff Feature PRs reviewed individually; Release PR shows combined diff
merge-main.yml runs Triggers once per PR merged to main Triggers once for the entire release
Risk Rebase conflicts after base merges Release branch can drift from main if long-lived
Recommendation Keep stacks shallow (2–3 deep max) Keep release branches short-lived; rebase from main regularly

When Do I Need Self-Hosted Runners?

How this platform solves the private-endpoint problem: All continuous integration and deployment work in this repository runs on standard GitHub-hosted ubuntu-24.04 runners. When a workflow needs to reach a private endpoint, the Terraform deployer starts a Chisel tunnel and Privoxy inside Docker on the runner; this temporary proxy routes data-plane traffic through the proxy service deployed in the tools virtual network. Because of that setup, this repository does not need self-hosted runners for its own pipelines.

The optional github_runners_aca module, deployed through add-or-remove-module.yml, can create self-hosted runners inside the virtual network. Use that option when other repositories or workloads need long-lived compute that already lives inside the private network, not for this repository's own automation.

The table below describes the general pattern. Data-plane work that can't use the Chisel tunnel approach — for example, other repos without the proxy setup — would still need self-hosted runners.

Understanding the Difference: See ADR-011: Control Plane vs Data Plane for the detailed explanation of why identity alone is enough for some operations, while private network access is also required for others.
Operation Plane Public Runner? Example
Create resources (VMs, VNets, Key Vault) Control ✓ Yes azurerm_key_vault
Configure settings, RBAC roles Control ✓ Yes azurerm_role_assignment
Deploy private endpoints Control ✓ Yes azurerm_private_endpoint
Read/write Key Vault secrets Data ✗ No azurerm_key_vault_secret
Read/write Storage blobs Data ✗ No azurerm_storage_blob
Terraform state (if private) Data ✗ No Backend storage account

Public Runners Work For

  • Deploying all infrastructure modules
  • Network, Bastion, Jumpbox, Proxy
  • Any resource that doesn't read secrets
  • Documentation builds (pages.yml)

Self-Hosted Required For

  • Terraform using azurerm_key_vault_secret
  • Private state backend (blocked by PE)
  • Any code that reads secrets at plan time
  • Database migrations, blob uploads
Cost Tip: If your Terraform doesn't need data plane access, stick with public runners. Self-hosted runners on Container Apps add cost. Only enable github_runners_aca_enabled if you actually need data plane access in CI/CD.

.deployer.yml (Reusable Workflow)

Reusable Terraform workflow for initial-setup/infra, primarily used to manage the tools environment and targeted foundational modules.

Inputs

Input Type Description
environment_name string Target environment name (commonly tools)
command string Terraform command (init, plan, apply, destroy)
target_module string Required module target (for example: azure_proxy, bastion, jumpbox)

Key Features

Required Permissions

permissions:
  id-token: write   # Required for OIDC token generation
  contents: read    # Required for repository checkout
Important: Without id-token: write permission, the workflow cannot generate the OIDC token needed for Azure authentication.
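For reference, a simplified sketch of the exchange that azure/login performs with that permission. The ACTIONS_ID_TOKEN_REQUEST_* variables are injected by GitHub only when the job declares id-token: write, so the guard makes this a no-op outside a workflow:

```shell
# Simplified view of the OIDC exchange; real workflows just use azure/login
if [ -n "${ACTIONS_ID_TOKEN_REQUEST_URL:-}" ]; then
  # 1. Ask GitHub for an OIDC token scoped to Azure's exchange audience
  OIDC_TOKEN=$(curl -sSf \
    -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
    "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=api://AzureADTokenExchange" \
    | jq -r '.value')

  # 2. Trade it for an Azure session; no stored client secret involved
  az login --service-principal \
    --username "$AZURE_CLIENT_ID" \
    --tenant "$AZURE_TENANT_ID" \
    --federated-token "$OIDC_TOKEN"
else
  echo "Not running inside a GitHub Actions job with id-token: write"
fi
```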

.deployer-using-secure-tunnel.yml (Reusable Workflow)

Reusable Terraform workflow for infra-ai-hub. It receives encrypted proxy outputs from .deployer.yml, starts Chisel + Privoxy, then runs stack deployment and (for apply in dev/test) integration tests.

Highlights


.lint.yml (Reusable Workflow)

Reusable validation workflow used by PR checks. Runs Terraform formatting and linting, enforces PR title conventions, and blocks forks.

What It Runs


.builds.yml (Container Build Workflow)

Reusable workflow for building and pushing container images to GitHub Container Registry (GHCR). Supports matrix builds for multiple packages including the azure-proxy services.

Built Packages

Image Tagging Strategy

Container images are tagged at different stages of the pipeline:

Stage Tags Applied Example
PR Build (pr-open.yml) PR number, run number, latest 42, 42-7, latest
Main Merge (merge-main.yml) Semver tag added to latest image v1.2.3
Prod Deploy (manual-dispatch.yml) Deploys using the semver-tagged image v1.2.3 (immutable)

The latest tag is used for dev and test environments. Production always uses the semver tag to ensure the exact tested image is what runs in prod.
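The promotion from :latest to the semver tag can be done without rebuilding by copying the already-pushed manifest. A dry-run sketch using docker buildx imagetools, with ghcr.io/<org> as a placeholder registry path (the repository's actual re-tag mechanism may differ):

```shell
VERSION=v1.2.3
REGISTRY="ghcr.io/<org>"   # <org> is a placeholder for the GitHub org
for image in azure-proxy/chisel azure-proxy/privoxy jobs/apim-key-rotation; do
  # Drop the leading `echo` to actually promote :latest to the semver tag
  echo docker buildx imagetools create \
    --tag "$REGISTRY/$image:$VERSION" \
    "$REGISTRY/$image:latest"
done
```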


add-or-remove-module.yml

Manual workflow for deploying or destroying infrastructure modules on demand. Evolved from the earlier bastion-only workflow to support all optional modules (bastion, jumpbox, azure_proxy, github_runners_aca).

When to Use

  • Deploy Module: When you need to provision a specific infrastructure module
  • Destroy Module: When done, to save costs or clean up resources

How to Run

  1. Go to Actions tab in GitHub
  2. Select "Deploy or Remove Bastion Host"
  3. Click "Run workflow"
  4. Choose tools, module, and command

Workflow Inputs

Input Options Description
environment_name tools Target environment
module bastion, jumpbox, azure_proxy, github_runners_aca Module to deploy or destroy
command apply, destroy Terraform command to execute
Cost Optimization: Azure Bastion has hourly charges (~$0.19/hour for Basic SKU). Deploy only when needed and destroy when done.

pr-open.yml (Pull Request Checks)

Runs on pull request events (and manual dispatch) to provide fast CI signal before merge.

Automated Checks

Note: The pr-open workflow must pass before a PR can be merged. The plan step appends an AI Hub Infra Changes section to the PR description with a readable plan summary; when there are no infra diffs the section says No Changes to AI Hub Infra in this PR.

schedule.yml (Scheduled Cleanup)

Automatically destroys Bastion every day at 5 PM PST to prevent unnecessary charges.

Schedule

# Runs daily at 5 PM PST (1 AM UTC next day)
on:
  schedule:
    - cron: "0 1 * * *"

Workflow Logic

  1. Check if Bastion exists: Uses Azure CLI to query the Bastion resource
  2. Conditional destroy: Only runs Terraform destroy if Bastion is found
  3. Status notification: Reports whether Bastion was destroyed or already removed

Jobs Flow

check-and-destroy-bastion
         │
         ├── bastion_exists=true ──► destroy-bastion ──► notification
         │
         └── bastion_exists=false ──────────────────────► notification
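A minimal sketch of the existence check, assuming placeholder resource names and the az CLI bastion extension; in the real workflow the flag is written to "$GITHUB_OUTPUT" so the downstream jobs can branch on it:

```shell
# Resource names below are placeholders, not the real ones
RG=tools-networking-rg   # placeholder resource group
NAME=bastion-tools       # placeholder Bastion name
if az network bastion show -g "$RG" -n "$NAME" -o none 2>/dev/null; then
  echo "bastion_exists=true"
else
  echo "bastion_exists=false"
fi
```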

manual-dispatch.yml (Manual Promotion Workflow)

Use this workflow to run plan, apply, or destroy for dev, test, and prod.

How It Works

  1. Validates inputs — prod apply requires a deploy_tag (a semver tag like v1.2.3 from a successful merge-main.yml run)
  2. Verifies all container images exist with the semver tag (prod apply only)
  3. Deploys/refreshes azure_proxy in tools via .deployer.yml
  4. Passes encrypted proxy outputs to .deployer-using-secure-tunnel.yml
  5. Executes selected Terraform command in chosen environment — prod uses the semver-tagged container images (e.g. :v1.2.3) while dev/test use :latest
  6. For prod apply only: Creates a GitHub Release from the deployed tag with deployment metadata
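Step 1's input gate can be approximated like this (illustrative only, not the workflow's exact code):

```shell
# Prod apply must name a vMAJOR.MINOR.PATCH tag; anything else is rejected
is_semver_tag() {
  printf '%s' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Defaults stand in for the workflow_dispatch inputs
ENVIRONMENT="${ENVIRONMENT:-prod}"
COMMAND="${COMMAND:-apply}"
DEPLOY_TAG="${DEPLOY_TAG:-v1.2.3}"

if [ "$ENVIRONMENT" = prod ] && [ "$COMMAND" = apply ] \
   && ! is_semver_tag "$DEPLOY_TAG"; then
  echo "::error::prod apply requires a deploy_tag like v1.2.3"
  exit 1
fi
echo "inputs ok: $ENVIRONMENT $COMMAND $DEPLOY_TAG"
```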
PROD is gated: The prod environment has required reviewers configured as a GitHub Environment protection rule. Any workflow job targeting prod will pause and wait for manual approval from designated reviewers before executing. Additionally, prod apply requires you to specify which semver git tag to deploy — ensuring only test-validated commits reach production.
Dev Environment Scope: Developers may occasionally deploy to dev from a PR branch using workflow dispatch to test specific features. The dev environment does not include App Gateway or DNS Zone; end-to-end ingress and DNS validation is only supported in test and prod.

merge-main.yml (Auto Apply + Semantic Version on Main)

This workflow runs on every push to main. It creates a semantic version tag using Conventional Commits and then applies infrastructure to test.

Execution Flow

  1. Concurrently on push to main:
    • Semantic version: TriPSs/conventional-changelog-action inspects commits since last tag (feat: → minor, fix: → patch, BREAKING CHANGE → major), creates a git tag (e.g. v1.2.3), and pushes an updated CHANGELOG.md
    • Proxy bootstrap: deploys/refreshes azure_proxy in tools via .deployer.yml
  2. Tag container images: After the semver tag is created, all container images (azure-proxy/chisel, azure-proxy/privoxy, jobs/apim-key-rotation) that have a :latest tag are re-tagged with the semver version (e.g. :v1.2.3). This runs in parallel via a matrix strategy.
  3. Once azure_proxy is live, its encrypted URL+auth outputs are passed to .deployer-using-secure-tunnel.yml
  4. Run apply against test (through Chisel tunnel)
  5. After successful apply, integration tests run automatically in two phases (direct + via proxy — see .deployer-using-secure-tunnel.yml section)
Why semver tags? Semantic version tags serve as the input for prod deployments. When promoting to prod via manual-dispatch.yml, you provide a version tag (e.g. v1.2.3) to ensure the exact commit that passed test is what gets applied to production. The version is derived from commit messages, so feat: and fix: prefixes directly control versioning.
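The tag arithmetic behind the bump is simple. A sketch (the pipeline delegates this to TriPSs/conventional-changelog-action, so this is purely illustrative):

```shell
next_version() {  # usage: next_version v1.2.3 major|minor|patch
  bump=$2
  oldIFS=$IFS; IFS=.
  set -- ${1#v}          # strip the leading v, split MAJOR MINOR PATCH
  IFS=$oldIFS
  case "$bump" in
    major) echo "v$(($1 + 1)).0.0" ;;
    minor) echo "v$1.$(($2 + 1)).0" ;;
    patch) echo "v$1.$2.$(($3 + 1))" ;;
  esac
}

next_version v1.2.3 minor   # feat: commits since v1.2.3 -> v1.3.0
next_version v1.3.0 patch   # fix: commits since v1.3.0  -> v1.3.1
```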

Release Process (TEST → PROD)

Production releases follow a tag-based promotion flow:

Steps to Release to Production

  1. Merge your PR to main — the test deployment and semantic versioning run automatically
  2. Verify the test deployment succeeded and a version tag was created (visible in Actions summary and repo tags, e.g. v1.2.3)
  3. Go to Actions → Deploy to Environments (Manual Dispatch)
  4. Select: environment=prod, command=apply, deploy_tag=v1.2.3
  5. Click Run workflow — the job will pause waiting for prod environment approval
  6. A designated reviewer approves the deployment in the Actions UI
  7. After successful apply, a GitHub Release is automatically created with deployment details

Release Contents

Each GitHub Release created for prod includes:


Concurrency Strategy

Several workflows and reusable jobs use concurrency to prevent race conditions and conflicting Terraform operations.

Workflow/Job Concurrency Group Why It Helps
.deployer.yml tools Serializes tools/bootstrap changes so multiple runs don't mutate shared proxy infra at the same time.
.deployer-using-secure-tunnel.yml ${environment_name} Ensures only one Terraform operation per environment runs at once (for example, only one test apply).
pr-open.yml builds job builds-${PR number} (cancel in-progress) Stops stale image builds when a newer commit arrives in the same PR.
manual-dispatch.yml manual-deploy-${run_id} Keeps each manually triggered deployment isolated and traceable to a single run.
merge-main.yml deploy-test-on-main Queues merges to main so test applies execute in order and avoid state lock contention.
Practical Effect: Concurrency improves deployment reliability by reducing Terraform state lock failures, avoiding overlapping applies, and ensuring each environment converges predictably.

APIM Key Rotation

APIM subscription key rotation is now handled by a Container App Job (scheduled) deployed as a custom container. The job source is at jobs/apim-key-rotation/ and the Terraform module at infra-ai-hub/modules/key-rotation-function/.

.builds.yml (Matrix Entry)

The key rotation container image is built by the shared .builds.yml reusable workflow as a matrix entry alongside other containers (chisel, privoxy).

  • Called by pr-open.yml on PR open/update
  • Triggers on changes to jobs/apim-key-rotation/ or the workflow itself
  • Uses bcgov/action-builder-ghcr for image build and push
  • Tagged with PR number, run number, and latest

Semver Image Tagging

On merge to main, merge-main.yml re-tags the :latest image with the semver version (e.g. :v1.2.3). This ensures prod deployments use an immutable, version-pinned image rather than the mutable :latest tag.

  • Dev/test use :latest (with a FORCE_IMAGE_PULL env var to trigger re-pull each Terraform apply)
  • Prod uses the semver tag (e.g. :v1.2.3) — immutable and traceable
  • Terraform container_image_tag_job_key_rotation variable controls which tag is deployed

pages.yml (Documentation)

Deploys this documentation site to GitHub Pages when changes are pushed to docs or Terraform roots (so generated references stay current).

Triggers

Deployment Steps

  1. Checkout repository
  2. Run docs/generate-tf-docs.sh to refresh Terraform reference content
  3. Run docs/build.sh to generate HTML from templates
  4. Configure GitHub Pages
  5. Upload docs/ folder as artifact
  6. Deploy to GitHub Pages

Environment Secrets

Each GitHub environment requires these secrets (created by initial-azure-setup.sh):

Secret Description Source
AZURE_CLIENT_ID Managed Identity client ID Created by setup script
AZURE_TENANT_ID Azure AD tenant ID Your Azure subscription
AZURE_SUBSCRIPTION_ID Target subscription ID Your Azure subscription
VNET_RESOURCE_GROUP_NAME Resource group containing VNet Your infrastructure
VNET_NAME Existing VNet name Your infrastructure
VNET_ADDRESS_SPACE VNet CIDR block Your infrastructure
SOURCE_VNET_ADDRESS_SPACE Source VNet (tools) CIDR for NSG rules Your infrastructure
SUBNET_ALLOCATION JSON object for subnet_allocation (map(map(string))) Azure Blob (network-info/subnet-allocation) then copied to GitHub secret
EXTERNAL_PEERED_PROJECTS Optional JSON object for external_peered_projects (map(object)) Azure Blob (network-info/subnet-allocation) then copied to GitHub secret

Subnet Allocation Process (Blob First)

For visibility and team collaboration, keep network JSON in Azure Blob first, then copy the same JSON into GitHub environment secrets.

  1. Update the environment file in Blob storage (for example, subnet-allocation-dev.json or subnet-allocation-test.json in network-info/subnet-allocation).
  2. Copy the JSON payload and paste it into the matching GitHub environment secret SUBNET_ALLOCATION.
  3. If direct APIM access from peered VNets is needed, copy the external_peered_projects JSON payload into optional secret EXTERNAL_PEERED_PROJECTS.
  4. Run plan in that environment to validate before apply.
Format rule: Paste raw JSON in GitHub secrets (single line is safest). Do not use HCL syntax (for example, subnet_allocation = { ... } or external_peered_projects = { ... }) and do not paste escaped JSON with backslashes.
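A quick local sanity check of the payload shape before pasting it into the secret (the CIDR values here are made up):

```shell
# The secret must be raw JSON shaped like map(map(string)):
# every top-level value an object, every leaf value a string
payload='{"dev":{"app":"10.0.1.0/24","privateendpoints":"10.0.2.0/26"}}'

echo "$payload" | jq -e '
  type == "object" and
  (to_entries | all(.value
    | type == "object" and (to_entries | all(.value | type == "string"))))
' >/dev/null && echo "valid map(map(string))"
```

If the check fails (jq exits non-zero), the payload is either not raw JSON or not the expected nested shape.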

How JSON Flows into Terraform

  1. GitHub Actions reads secrets.SUBNET_ALLOCATION.
  2. The reusable workflow maps it to TF_VAR_subnet_allocation.
  3. Terraform parses the JSON into var.subnet_allocation (map(map(string))).
  4. If secrets.EXTERNAL_PEERED_PROJECTS is present, the workflow exports it as TF_VAR_external_peered_projects and Terraform parses it into var.external_peered_projects.

Common Errors You Might See

Running Workflows Locally

For local development and testing, use the deployment scripts in each Terraform root:

Mandatory variable: Local runs must export TF_VAR_subnet_allocation before plan/apply. Source of truth is the tools storage account tftoolsaihubtracking, container tools, path network-info/subnet-allocation/.
# Load required subnet allocation JSON (example: prod)
az account set --subscription "da4cf6-tools - AI Services Hub"
tmpfile=$(mktemp)
az storage blob download \
    --account-name tftoolsaihubtracking \
    --container-name tools \
    --name network-info/subnet-allocation/subnet-allocation-prod.json \
    --auth-mode login \
    --file "$tmpfile"
export TF_VAR_subnet_allocation="$(jq -c . "$tmpfile")"
rm -f "$tmpfile"

# Switch back to your deployment subscription before running terraform
az account set --subscription "da4cf6-dev - AI Services Hub"
# Ensure you're logged in
az login

# Initial setup / tools
./initial-setup/infra/deploy-terraform.sh init
./initial-setup/infra/deploy-terraform.sh plan
./initial-setup/infra/deploy-terraform.sh apply
# AI Hub stacks
./infra-ai-hub/scripts/deploy-terraform.sh plan dev
./infra-ai-hub/scripts/deploy-terraform.sh apply test
Note: Local runs use your Azure CLI credentials instead of OIDC. Make sure you have the required permissions and that TF_VAR_subnet_allocation is exported in the shell.