AI Services Hub
Azure Landing Zone Infrastructure

Operational Playbooks

This page contains step-by-step procedures for common operational tasks, troubleshooting paths, and recovery work. Use it when something has gone wrong, when a routine maintenance action must be performed, or when you need a tested procedure instead of starting from scratch.

Before You Begin
Ensure you have the required permissions and have read the relevant documentation. When in doubt, escalate to the platform team.

Quick Reference

  • State Lock Stuck: Terraform state is locked and won't release
  • OIDC Auth Failing: Token exchange errors between GitHub and Azure
  • Rollback Deployment: Revert to a previous known-good state
  • Rotate Credentials: Update federated credentials if compromised
  • Bastion Access Issues: Cannot connect to VMs via Bastion
  • Pipeline Stuck: GitHub Actions workflow hanging or failing
  • Chisel Tunnel Setup: Local access to private Azure resources

Playbook: Terraform State Lock Stuck

Symptoms

Diagnosis

  1. Check if another operation is running

    Look at GitHub Actions for any in-progress Terraform jobs.

  2. Check the lock info
    az storage blob show \
        --account-name <storage_account> \
        --container-name tfstate \
        --name terraform.tfstate \
        --query "properties.lease"
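The lease fields returned by the query above can be turned into a suggested next step. The following is a minimal sketch, not part of the platform tooling: `classify_lease` is a hypothetical helper, and it assumes the lease state and status values that Azure Blob Storage reports (state: available, leased, expired, breaking, broken; status: locked, unlocked).

```shell
# Map blob lease fields to a suggested next step.
# Hypothetical helper; inputs are the "state" and "status" fields of
# properties.lease as reported by `az storage blob show`.
classify_lease() {
  state="$1"
  status="$2"
  if [ "$state" = "leased" ] && [ "$status" = "locked" ]; then
    echo "LOCKED: confirm no Terraform job is running before unlocking"
  elif [ "$state" = "breaking" ]; then
    echo "BREAKING: lease is being released, retry shortly"
  else
    echo "FREE: no active lease, retry the Terraform command"
  fi
}

# Usage (assumes an authenticated Azure CLI session):
#   state=$(az storage blob show --account-name <storage_account> \
#       --container-name tfstate --name terraform.tfstate \
#       --query "properties.lease.state" --output tsv)
#   status=$(az storage blob show --account-name <storage_account> \
#       --container-name tfstate --name terraform.tfstate \
#       --query "properties.lease.status" --output tsv)
#   classify_lease "$state" "$status"
```

A "LOCKED" result only says a lease exists; it does not tell you whether the holder is still alive, so step 1 remains necessary.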

Resolution

⚠️
Caution: Only force-unlock if you are CERTAIN no other operation is running. Breaking an active lock can corrupt state.

Option 1: Wait and Retry (Safest)

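If another run legitimately holds the lock, it is released when that run finishes. Wait for the run to complete, then retry. A lease left behind by a crashed or cancelled run may not expire on its own; in that case use Option 2 or Option 3.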

Option 2: Break the Blob Lease

# Show the current lease state and status
az storage blob show \
    --account-name <storage_account> \
    --container-name tfstate \
    --name terraform.tfstate \
    --query "properties.lease"

# Break the lease (requires Storage Blob Data Owner role)
az storage blob lease break \
    --account-name <storage_account> \
    --container-name tfstate \
    --blob-name terraform.tfstate

Option 3: Force Unlock via Terraform

# Get the lock ID from the error message, then:
terraform force-unlock <LOCK_ID>

# Example:
terraform force-unlock 12345678-1234-1234-1234-123456789012
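Copying the lock ID by hand from a long error message is error-prone. As a sketch, the ID can be pulled out of the "Lock Info" block that Terraform prints when it fails to acquire the lock. `extract_lock_id` is a hypothetical helper, and it assumes the error output contains a line of the form `ID: <uuid>`.

```shell
# Print the lock ID from Terraform's "Lock Info" block (read from stdin).
# Hypothetical helper, not part of Terraform. Matches the first line whose
# first field is "ID:" and prints the second field.
extract_lock_id() {
  awk '$1 == "ID:" { print $2; exit }'
}

# Usage (-no-color keeps escape codes out of the piped output):
#   LOCK_ID=$(terraform plan -no-color 2>&1 | extract_lock_id)
#   echo "Lock ID: $LOCK_ID"
#   terraform force-unlock "$LOCK_ID"
```

If no lock error occurred, the function prints nothing, so check that the variable is non-empty before passing it to `terraform force-unlock`.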

Prevention

Playbook: OIDC Authentication Failing

Symptoms

Diagnosis by Error Code

Error | Cause | Fix
AADSTS700024 | Token timing issue (clock skew) | Retry the workflow - usually transient
AADSTS70021 | Subject claim doesn't match federated credential | Check branch/environment matches credential config
AADSTS700016 | Client ID doesn't exist or wrong tenant | Verify AZURE_CLIENT_ID secret is correct

Resolution Steps

1. Verify GitHub Secrets

# These secrets must be set in GitHub repository settings:
AZURE_CLIENT_ID      # Managed Identity Client ID
AZURE_TENANT_ID      # Azure AD Tenant ID
AZURE_SUBSCRIPTION_ID # Target Subscription ID

2. Verify Federated Credential Configuration

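# Check federated credentials on the app registration
# (for a user-assigned managed identity, use instead:
#  az identity federated-credential list --identity-name <IDENTITY_NAME> --resource-group <RG_NAME>)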
az ad app federated-credential list \
    --id <APP_OBJECT_ID> \
    --query "[].{name:name, subject:subject, issuer:issuer}"

The subject must match exactly:
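For comparison against the `subject` field returned by the list command, the sketch below builds the subject string GitHub issues for the common trigger types, following GitHub's documented OIDC subject formats. `expected_subject` is a hypothetical helper; the repository and environment names are the ones used elsewhere on this page.

```shell
# Build the OIDC subject claim GitHub issues for a job.
# Hypothetical helper. Arguments: <org/repo> <kind> [value]
#   kind: environment | branch | pull_request
expected_subject() {
  repo="$1"
  kind="$2"
  value="${3:-}"
  case "$kind" in
    environment)  echo "repo:${repo}:environment:${value}" ;;
    branch)       echo "repo:${repo}:ref:refs/heads/${value}" ;;
    pull_request) echo "repo:${repo}:pull_request" ;;
    *) echo "unknown kind: $kind" >&2; return 1 ;;
  esac
}

expected_subject "bcgov/ai-hub-tracking" environment dev
# → repo:bcgov/ai-hub-tracking:environment:dev
```

A job that declares `environment: dev` presents the environment form even when it runs on `main`, so a credential configured for the branch form will not match it.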

3. Check Token Claims

Add this step to your workflow to debug the token:

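- name: Debug OIDC Token
  # Requires "permissions: id-token: write" on the job
  run: |
    TOKEN=$(curl -s -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=api://AzureADTokenExchange" | jq -r '.value')
    # JWT payloads are base64url-encoded without padding; convert before decoding
    PAYLOAD=$(echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+')
    while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
    echo "Token claims:"
    echo "$PAYLOAD" | base64 -d | jq .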

4. Recreate Federated Credential

# Delete and recreate if misconfigured
az ad app federated-credential delete \
    --id <APP_OBJECT_ID> \
    --federated-credential-id <CREDENTIAL_ID>

# Recreate with correct subject
./initial-setup/initial-azure-setup.sh \
    -g "your-rg" -n "your-identity" \
    -r "bcgov/ai-hub-tracking" -e "dev"

Playbook: Rollback Failed Deployment

When to Use

Option 1: Revert the Commit

# Find the last good commit
git log --oneline -10

# Revert the problematic commit(s)
git revert <BAD_COMMIT_SHA>

# Push to trigger pipeline with reverted code
git push origin main

Option 2: Restore from State Backup

Azure Blob Storage keeps versions of the state file:

# List state file versions
az storage blob list \
    --account-name <storage_account> \
    --container-name tfstate \
    --include v \
    --query "[?name=='terraform.tfstate'].{version:versionId, modified:properties.lastModified}"

# Download a previous version
az storage blob download \
    --account-name <storage_account> \
    --container-name tfstate \
    --name terraform.tfstate \
    --version-id <VERSION_ID> \
    --file terraform.tfstate.backup
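Choosing which version to download can be scripted. The sketch below assumes the newest version is the bad state and the one before it is the last good state, which does not hold if the failed run wrote state more than once; check the timestamps before relying on it. `previous_version` is a hypothetical helper that reads tab-separated `versionId` and `lastModified` pairs.

```shell
# Print the second-newest version ID from "versionId<TAB>lastModified" lines
# on stdin. Hypothetical helper; relies on ISO 8601 timestamps sorting
# lexically, so a reverse sort on column 2 puts the newest version first.
previous_version() {
  sort -t "$(printf '\t')" -k2,2r | sed -n '2p' | cut -f1
}

# Usage (assumes an authenticated Azure CLI session):
#   VERSION_ID=$(az storage blob list \
#       --account-name <storage_account> \
#       --container-name tfstate \
#       --include v \
#       --query "[?name=='terraform.tfstate'].[versionId, properties.lastModified]" \
#       --output tsv | previous_version)
#   echo "Candidate version: $VERSION_ID"
```

Inspect the downloaded file (for example, its `serial` field) before putting it back in place of the current state.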

Option 3: Targeted Destroy and Recreate

# Destroy specific problematic resources
terraform destroy -target=azurerm_virtual_machine.jumpbox

# Reapply to recreate with correct config
terraform apply
💡
Tip: Always run terraform plan before apply after a rollback to verify the expected changes.
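One way to make that check scriptable is `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The sketch below only translates the exit code into a message; `describe_plan_exit` is a hypothetical helper and the plan file name is illustrative.

```shell
# Translate the exit code of `terraform plan -detailed-exitcode`.
# Hypothetical helper. Exit codes: 0 = no changes, 1 = error, 2 = changes present.
describe_plan_exit() {
  case "$1" in
    0) echo "no changes: infrastructure matches configuration" ;;
    2) echo "changes pending: review the plan before applying" ;;
    *) echo "plan failed: fix errors before applying" ;;
  esac
}

# Usage:
#   terraform plan -detailed-exitcode -out=rollback.tfplan
#   describe_plan_exit $?
#   terraform apply rollback.tfplan   # only after reviewing the plan
```

Applying the saved plan file means Terraform applies exactly what was reviewed rather than re-planning.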

Playbook: Rotate Federated Credentials

When to Use

Good News: With OIDC, there are no long-lived secrets to rotate! The "credential" is the trust relationship, not a secret key.

To Update the Trust Relationship

1. Delete Existing Federated Credential

az ad app federated-credential list --id <APP_OBJECT_ID>
az ad app federated-credential delete \
    --id <APP_OBJECT_ID> \
    --federated-credential-id <CREDENTIAL_ID>

2. Create New Federated Credential

./initial-setup/initial-azure-setup.sh \
    -g "your-rg" \
    -n "your-identity" \
    -r "bcgov/new-repo-name" \
    -e "dev"

3. Update GitHub Secrets (if Client ID changed)

  1. Go to Repository Settings → Secrets and variables → Actions
  2. Update AZURE_CLIENT_ID with new Managed Identity Client ID

4. Verify New Configuration

# Trigger a test workflow run
gh workflow run deploy.yml

Playbook: Bastion Access Issues

Symptoms

Diagnosis

1. Check if Bastion is Deployed

az network bastion list \
    --resource-group <RG_NAME> \
    --query "[].{name:name, state:provisioningState}"

If the list is empty, Bastion may have been removed to save cost. Deploy it:

# Via GitHub Actions
gh workflow run add-or-remove-module.yml -f action=add

# Or via manual trigger in GitHub UI

2. Check NSG Rules

# Bastion subnet requires specific NSG rules
az network nsg rule list \
    --resource-group <RG_NAME> \
    --nsg-name <BASTION_NSG> \
    --query "[].{name:name, access:access, direction:direction, port:destinationPortRange}"

Required inbound rules:

3. Check VM Status

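# Select the power state by its code rather than by array position
az vm get-instance-view \
    --resource-group <RG_NAME> \
    --name <VM_NAME> \
    --query "instanceView.statuses[?starts_with(code, 'PowerState/')].displayStatus | [0]"

The VM must be in the "VM running" state.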

Playbook: GitHub Actions Pipeline Stuck

Symptoms

Resolution

1. Check for Pending Approvals

Production environments require approval. Check the workflow run for pending reviews.

2. Cancel and Retry

# Cancel via CLI
gh run cancel <RUN_ID>

# Re-run the same run
gh run rerun <RUN_ID>

# Or start a fresh run of the workflow
gh workflow run <WORKFLOW_NAME>

3. Check for State Lock

If Terraform is waiting on a state lock, see Playbook: Terraform State Lock Stuck.

4. Check Runner Health

# View recent workflow runs
gh run list --limit 10

# Check specific run logs
gh run view <RUN_ID> --log

5. GitHub Status

Check GitHub Status Page for platform issues.

Playbook: Chisel Tunnel for Local Development

🔧
Platform Maintainers Only: This playbook is for the platform team to access private Azure resources (databases, Key Vault, etc.) from their local machines. Tenant developers should use the APIM/App Gateway endpoints instead.
📖
Looking for the full deployment workflow? See the comprehensive Local Development Deployment Guide for step-by-step instructions on deploying infrastructure from your local machine, including environment switching and integration testing.

When to Use

Prerequisites

Step 1: Get Chisel Credentials

The Chisel server credentials are stored in Terraform outputs. Contact the platform team or retrieve them yourself:

# From a machine with Terraform access (e.g., Jumpbox)
cd initial-setup/infra
terraform output azure_proxy_chisel_auth

# Output format: tunnel:XXXXXXXX

You'll also need the App Service URL:

terraform output azure_proxy_url

# Output: https://<app-name>.azurewebsites.net

Step 2: Connect to a Private Database

Example: Connecting to a private PostgreSQL database on port 5432:

# Run Chisel client in Docker
docker run --rm -it \
  -p 5432:5432 \
  jpillora/chisel:latest client \
  --auth "tunnel:XXXXXXXX" \
  https://<app-name>.azurewebsites.net \
  0.0.0.0:5432:<postgres-server>.postgres.database.azure.com:5432

# Now connect locally
psql -h localhost -p 5432 -U <username> -d <database>
⚠️
Port Conflicts: If port 5432 is already in use locally, map to a different port:
-p 5462:5432
Then connect with psql -h localhost -p 5462 ...
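The tunnel command and the port remapping above can be combined in a small wrapper that prints the command for any target. This is a sketch only: `chisel_tunnel_cmd` is a hypothetical helper, and `CHISEL_AUTH` and `CHISEL_URL` are assumed environment variables holding the two Terraform outputs from Step 1.

```shell
# Print the docker command for a single-port Chisel tunnel.
# Hypothetical helper. Arguments: <local_port> <target_host> <target_port>
# Inside the container Chisel listens on the target port; Docker maps the
# chosen local port onto it, so only the -p flag changes on a port conflict.
chisel_tunnel_cmd() {
  local_port="$1"
  target_host="$2"
  target_port="$3"
  printf 'docker run --rm -it -p %s:%s jpillora/chisel:latest client --auth "%s" %s 0.0.0.0:%s:%s:%s\n' \
    "$local_port" "$target_port" \
    "${CHISEL_AUTH:?set CHISEL_AUTH}" "${CHISEL_URL:?set CHISEL_URL}" \
    "$target_port" "$target_host" "$target_port"
}

# Usage:
#   export CHISEL_AUTH="tunnel:XXXXXXXX"
#   export CHISEL_URL="https://<app-name>.azurewebsites.net"
#   chisel_tunnel_cmd 5462 <postgres-server>.postgres.database.azure.com 5432
```

Printing the command instead of running it lets you review it first and keeps the credentials out of the helper itself.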

Step 3: Connect to Multiple Services (SOCKS5 Proxy)

For accessing multiple private services or browsing private endpoints:

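# Run Chisel as SOCKS5 proxy
# Bind to 0.0.0.0 inside the container: a bare "socks" remote listens on
# 127.0.0.1:1080, which the Docker port mapping cannot reach
docker run --rm -it \
  -p 1080:1080 \
  jpillora/chisel:latest client \
  --auth "tunnel:XXXXXXXX" \
  https://<app-name>.azurewebsites.net \
  0.0.0.0:1080:socks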

Configure Firefox with SmartProxy Extension

  1. Install SmartProxy extension
  2. Add a proxy server:
    • Type: SOCKS5
    • Address: localhost
    • Port: 1080
  3. Add rules for Azure private endpoints:
    • *.vault.azure.net
    • *.postgres.database.azure.com
    • *.blob.core.windows.net

Step 4: Connect to Key Vault

# Tunnel to Key Vault (port 443)
docker run --rm -it \
  -p 8443:443 \
  jpillora/chisel:latest client \
  --auth "tunnel:XXXXXXXX" \
  https://<app-name>.azurewebsites.net \
  0.0.0.0:443:<keyvault-name>.vault.azure.net:443

# Alternative: route Azure CLI through the SOCKS5 proxy from Step 3 (the proxy must be running)
HTTPS_PROXY=socks5://localhost:1080 az keyvault secret list --vault-name <keyvault-name>

Troubleshooting

Issue | Cause | Solution
Connection refused | Chisel server not running or wrong URL | Verify App Service is running, check URL
Authentication failed | Wrong credentials | Get fresh credentials from Terraform output
Timeout connecting to target | Target hostname wrong or not in VNet | Verify private endpoint hostname, check VNet peering
Port already in use | Local service using the port | Use different local port: -p 5462:5432
DNS resolution failed | Private DNS not linked | Use IP address instead, or check Private DNS Zone

Security Notes

🔒
  • Credentials: Chisel auth is randomly generated and stored in App Service settings (encrypted by Azure)
  • Access logging: All connections are logged in Application Insights
  • Scope: Chisel can only reach resources within the VNet or peered VNets
  • Not for tenants: This is platform team tooling, not for ministry developers