Operational Playbooks
This page contains step-by-step procedures for common operational tasks, troubleshooting paths, and recovery work. Use it when something has gone wrong, when a routine maintenance action must be performed, or when you need a tested procedure instead of starting from scratch.
Ensure you have the required permissions and have read the relevant documentation. When in doubt, escalate to the platform team.
Quick Reference
| Playbook | Use when |
|---|---|
| State Lock Stuck | Terraform state is locked and won't release |
| OIDC Auth Failing | Token exchange errors between GitHub and Azure |
| Rollback Deployment | Revert to a previous known-good state |
| Rotate Credentials | Update federated credentials if compromised |
| Bastion Access Issues | Cannot connect to VMs via Bastion |
| Pipeline Stuck | GitHub Actions workflow hanging or failing |
| Chisel Tunnel Setup | Local access to private Azure resources |
Playbook: Terraform State Lock Stuck
Symptoms
- Terraform commands fail with: Error acquiring the state lock
- Message includes: Lock Info: ID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
- Previous pipeline run may have crashed or been cancelled
Diagnosis
- Check if another operation is running
Look at GitHub Actions for any in-progress Terraform jobs.
- Check the lock info
az storage blob show \
--account-name <storage_account> \
--container-name tfstate \
--name terraform.tfstate \
--query "properties.lease"
Resolution
Option 1: Wait and Retry (Safest)
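If another Terraform run is still in progress, it releases the lock when it finishes. Wait 15-60 minutes and retry. A lock left behind by a crashed or cancelled run may not clear on its own; if it persists, use Option 2 or Option 3.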
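The wait-and-retry approach can be scripted rather than done by hand. The following is a minimal sketch, not part of the existing tooling: the function name retry and its argument order are assumptions made for illustration.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or the
# attempt budget is exhausted. Returns 0 on success, 1 otherwise.
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example, retry 6 300 terraform plan would try every five minutes for up to six attempts. Terraform can also wait for the lock itself: terraform plan -lock-timeout=10m keeps retrying the lock for ten minutes before failing.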
Option 2: Break the Blob Lease
# Show the current lease status and state
az storage blob show \
--account-name <storage_account> \
--container-name tfstate \
--name terraform.tfstate \
--query "properties.lease"
# Break the lease (requires Storage Blob Data Owner role)
az storage blob lease break \
--account-name <storage_account> \
--container-name tfstate \
--blob-name terraform.tfstate
Option 3: Force Unlock via Terraform
# Get the lock ID from the error message, then:
terraform force-unlock <LOCK_ID>
# Example:
terraform force-unlock 12345678-1234-1234-1234-123456789012
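Copying the lock ID by hand is error-prone, so it can help to pull it out of the error text. This is a sketch only: extract_lock_id is a hypothetical helper, and it assumes the error output contains an "ID:" label followed by a UUID, as in the symptom shown earlier.

```shell
# extract_lock_id: reads Terraform error output on stdin and prints the
# first UUID that follows an "ID:" label (hypothetical helper).
extract_lock_id() {
  grep -Eo 'ID:[[:space:]]+[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}' \
    | head -n 1 \
    | awk '{print $2}'
}

# Usage (assumes terraform is on PATH and the backend is configured):
#   LOCK_ID=$(terraform plan 2>&1 | extract_lock_id)
#   [ -n "$LOCK_ID" ] && terraform force-unlock "$LOCK_ID"
```

The function prints nothing when no lock ID is present, so check for an empty result before calling force-unlock.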
Prevention
- Never cancel Terraform runs mid-operation
- Use workflow concurrency controls (already configured)
- Ensure proper timeout settings on pipeline jobs
Playbook: OIDC Authentication Failing
Symptoms
- GitHub Actions fails at the azure/login step
- Error: AADSTS700024: Client assertion is not within its valid time range
- Error: AADSTS70021: No matching federated identity record found
- Error: AADSTS700016: Application with identifier 'xxx' was not found
Diagnosis by Error Code
| Error | Cause | Fix |
|---|---|---|
| AADSTS700024 | Token timing issue (clock skew) | Retry the workflow - usually transient |
| AADSTS70021 | Subject claim doesn't match federated credential | Check branch/environment matches credential config |
| AADSTS700016 | Client ID doesn't exist or wrong tenant | Verify AZURE_CLIENT_ID secret is correct |
Resolution Steps
1. Verify GitHub Secrets
# These secrets must be set in GitHub repository settings:
AZURE_CLIENT_ID        # Managed Identity Client ID
AZURE_TENANT_ID        # Azure AD Tenant ID
AZURE_SUBSCRIPTION_ID  # Target Subscription ID
2. Verify Federated Credential Configuration
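# Check federated credentials on the app registration
# (for a user-assigned managed identity, use instead:
#  az identity federated-credential list --identity-name <IDENTITY_NAME> --resource-group <RG_NAME>)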
az ad app federated-credential list \
--id <APP_OBJECT_ID> \
--query "[].{name:name, subject:subject, issuer:issuer}"
The subject must match exactly:
- For environment: repo:bcgov/ai-hub-tracking:environment:dev
- For branch: repo:bcgov/ai-hub-tracking:ref:refs/heads/main
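Because the match is exact, it can be useful to generate the expected subject rather than type it. The sketch below covers only the two formats listed above; federated_subject is a hypothetical helper name.

```shell
# federated_subject REPO KIND VALUE
# Prints the subject string expected for a GitHub environment or branch.
# Hypothetical helper covering only the two formats shown in this playbook.
federated_subject() {
  case "$2" in
    environment) printf 'repo:%s:environment:%s\n' "$1" "$3" ;;
    branch)      printf 'repo:%s:ref:refs/heads/%s\n' "$1" "$3" ;;
    *)
      echo "unknown kind: $2 (expected environment or branch)" >&2
      return 1
      ;;
  esac
}

# Usage:
#   federated_subject bcgov/ai-hub-tracking environment dev
#   federated_subject bcgov/ai-hub-tracking branch main
```

Compare the printed value character for character with the subject field returned by the federated credential list command.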
3. Check Token Claims
Add this step to your workflow to debug the token:
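# The job needs "permissions: id-token: write" for the request variables to be set
- name: Debug OIDC Token
  run: |
    TOKEN=$(curl -s -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=api://AzureADTokenExchange" | jq -r '.value')
    # JWT payloads are base64url-encoded without padding; convert before decoding
    PAYLOAD=$(echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+')
    while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
    echo "Token claims:"
    echo "$PAYLOAD" | base64 -d | jq .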
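For AADSTS70021 the claim that matters is sub, since it must equal the federated credential subject. The following sketch shows one way to extract it from a token without jq. The function names jwt_payload and jwt_sub are hypothetical, and the sed pattern assumes the payload is compact single-line JSON.

```shell
# jwt_payload TOKEN: prints the decoded payload (second segment) of a JWT.
jwt_payload() {
  local segment
  # Convert base64url characters to the standard base64 alphabet
  segment=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops '=' padding; restore it so base64 -d accepts the input
  while [ $(( ${#segment} % 4 )) -ne 0 ]; do
    segment="${segment}="
  done
  printf '%s' "$segment" | base64 -d
}

# jwt_sub TOKEN: prints the value of the "sub" claim.
# Assumes compact JSON with a plain string value (no escaped quotes).
jwt_sub() {
  jwt_payload "$1" | sed -n 's/.*"sub"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

If jwt_sub prints a value that differs from the configured subject, even by a branch or environment name, the token exchange is rejected.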
4. Recreate Federated Credential
# Delete and recreate if misconfigured
az ad app federated-credential delete \
--id <APP_OBJECT_ID> \
--federated-credential-id <CREDENTIAL_ID>
# Recreate with correct subject
./initial-setup/initial-azure-setup.sh \
-g "your-rg" -n "your-identity" \
-r "bcgov/ai-hub-tracking" -e "dev"
Playbook: Rollback Failed Deployment
When to Use
- A deployment broke existing functionality
- Resources are in an inconsistent state
- Need to revert to a known-good configuration
Option 1: Git Revert (Recommended)
# Find the last good commit
git log --oneline -10
# Revert the problematic commit(s)
git revert <BAD_COMMIT_SHA>
# Push to trigger pipeline with reverted code
git push origin main
Option 2: Restore from State Backup
Azure Blob Storage keeps versions of the state file:
# List state file versions
az storage blob list \
--account-name <storage_account> \
--container-name tfstate \
--include v \
--query "[?name=='terraform.tfstate'].{version:versionId, modified:properties.lastModified}"
# Download a previous version
az storage blob download \
--account-name <storage_account> \
--container-name tfstate \
--name terraform.tfstate \
--version-id <VERSION_ID> \
--file terraform.tfstate.backup
Option 3: Targeted Destroy and Recreate
# Destroy specific problematic resources
terraform destroy -target=azurerm_virtual_machine.jumpbox
# Reapply to recreate with correct config
terraform apply
Always run terraform plan before apply after a rollback to verify the expected changes.
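When the plan runs in a script, its exit code can stand in for reading the output. With the -detailed-exitcode flag, terraform plan exits 0 when there are no changes, 1 on error, and 2 when changes are pending. The sketch below maps those codes to messages; describe_plan_exit is a hypothetical helper name.

```shell
# describe_plan_exit CODE
# Translates the exit code of "terraform plan -detailed-exitcode"
# into a short status message (hypothetical helper).
describe_plan_exit() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "plan failed" ;;
    2) echo "changes pending" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Usage (assumes terraform is on PATH and the directory is initialised):
#   terraform plan -detailed-exitcode -out=rollback.tfplan
#   describe_plan_exit $?
```

After a state restore, "changes pending" is the case to review carefully before running apply.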
Playbook: Rotate Federated Credentials
When to Use
- Suspected credential compromise
- Security audit requires rotation
- Changing repository or organization
To Update the Trust Relationship
1. Delete Existing Federated Credential
az ad app federated-credential list --id <APP_OBJECT_ID>
az ad app federated-credential delete \
--id <APP_OBJECT_ID> \
--federated-credential-id <CREDENTIAL_ID>
2. Create New Federated Credential
./initial-setup/initial-azure-setup.sh \
-g "your-rg" \
-n "your-identity" \
-r "bcgov/new-repo-name" \
-e "dev"
3. Update GitHub Secrets (if Client ID changed)
- Go to Repository Settings → Secrets and variables → Actions
- Update AZURE_CLIENT_ID with the new Managed Identity Client ID
4. Verify New Configuration
# Trigger a test workflow run
gh workflow run deploy.yml
Playbook: Bastion Access Issues
Symptoms
- Cannot connect to VM via Azure Portal Bastion
- Connection times out
- "Bastion host not found" error
Diagnosis
1. Check if Bastion is Deployed
az network bastion list \
--resource-group <RG_NAME> \
--query "[].{name:name, state:provisioningState}"
If empty, Bastion may be disabled (cost-saving). Deploy it:
# Via GitHub Actions
gh workflow run add-or-remove-module.yml -f action=add
# Or via manual trigger in GitHub UI
2. Check NSG Rules
# Bastion subnet requires specific NSG rules
az network nsg rule list \
--resource-group <RG_NAME> \
--nsg-name <BASTION_NSG> \
--query "[].{name:name, access:access, direction:direction, port:destinationPortRange}"
Required inbound rules:
- HTTPS (443) from Internet
- Gateway Manager from GatewayManager service tag
- Azure Load Balancer from AzureLoadBalancer service tag
3. Check VM Status
az vm get-instance-view \
--resource-group <RG_NAME> \
--name <VM_NAME> \
--query "instanceView.statuses[1].displayStatus"
VM must be in "VM running" state.
Playbook: GitHub Actions Pipeline Stuck
Symptoms
- Pipeline shows "In progress" for extended time
- Terraform plan/apply seems to hang
- No new log output appearing
Resolution
1. Check for Pending Approvals
Production environments require approval. Check the workflow run for pending reviews.
2. Cancel and Retry
# Cancel via CLI
gh run cancel <RUN_ID>
# Retry
gh workflow run <WORKFLOW_NAME>
3. Check for State Lock
If Terraform is waiting on state lock, see State Lock Playbook.
4. Check Runner Health
# View recent workflow runs
gh run list --limit 10
# Check specific run logs
gh run view <RUN_ID> --log
5. GitHub Status
Check GitHub Status Page for platform issues.
Playbook: Chisel Tunnel for Local Development
When to Use
- You need to connect to a private PostgreSQL/CosmosDB database from your laptop
- You need to debug private endpoints locally
- You need to test API calls to private Azure services
- Bastion/Jumpbox is overkill for quick data plane access
Prerequisites
- Docker installed on your local machine
- Chisel credentials from Terraform output (contact platform team)
- Azure Proxy deployed (enable_azure_proxy = true in Terraform)
Step 1: Get Chisel Credentials
The Chisel server credentials are stored in Terraform outputs. Contact the platform team or retrieve them yourself:
# From a machine with Terraform access (e.g., Jumpbox)
cd initial-setup/infra
terraform output azure_proxy_chisel_auth
# Output format: tunnel:XXXXXXXX
You'll also need the App Service URL:
terraform output azure_proxy_url
# Output: https://<app-name>.azurewebsites.net
Step 2: Connect to a Private Database
Example: Connecting to a private PostgreSQL database on port 5432:
# Run Chisel client in Docker
docker run --rm -it \
-p 5432:5432 \
jpillora/chisel:latest client \
--auth "tunnel:XXXXXXXX" \
https://<app-name>.azurewebsites.net \
0.0.0.0:5432:<postgres-server>.postgres.database.azure.com:5432
# Now connect locally
psql -h localhost -p 5432 -U <username> -d <database>
If port 5432 is already in use locally, map a different local port with -p 5462:5432, then connect with:
psql -h localhost -p 5462 ...
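The local port and the target port are easy to mix up when remapping. As an illustration, the sketch below prints the docker command for a given local port, target host, and target port. The function name chisel_tunnel_cmd and the environment variables CHISEL_AUTH and CHISEL_URL are assumptions for this example, not part of the platform tooling.

```shell
# chisel_tunnel_cmd LOCAL_PORT TARGET_HOST TARGET_PORT
# Prints (does not run) the docker command for a single-port tunnel.
# Reads CHISEL_AUTH and CHISEL_URL from the environment (hypothetical names).
chisel_tunnel_cmd() {
  local local_port="$1"
  local target_host="$2"
  local target_port="$3"
  # Only the host side of -p changes; the container keeps listening
  # on the target port, so the chisel remote spec stays the same.
  printf 'docker run --rm -it -p %s:%s jpillora/chisel:latest client --auth "%s" %s 0.0.0.0:%s:%s:%s\n' \
    "$local_port" "$target_port" "$CHISEL_AUTH" "$CHISEL_URL" \
    "$target_port" "$target_host" "$target_port"
}

# Usage:
#   export CHISEL_AUTH='tunnel:XXXXXXXX'
#   export CHISEL_URL='https://<app-name>.azurewebsites.net'
#   chisel_tunnel_cmd 5462 <postgres-server>.postgres.database.azure.com 5432
```

Printing the command instead of running it lets you review it before the credentials are used.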
Step 3: Connect to Multiple Services (SOCKS5 Proxy)
For accessing multiple private services or browsing private endpoints:
# Run Chisel as SOCKS5 proxy
docker run --rm -it \
-p 1080:1080 \
jpillora/chisel:latest client \
--auth "tunnel:XXXXXXXX" \
https://<app-name>.azurewebsites.net \
socks
Configure Firefox with SmartProxy Extension
- Install SmartProxy extension
- Add a proxy server:
- Type: SOCKS5
- Address: localhost
- Port: 1080
- Add rules for Azure private endpoints:
- *.vault.azure.net
- *.postgres.database.azure.com
- *.blob.core.windows.net
Step 4: Connect to Key Vault
# Tunnel to Key Vault (port 443)
docker run --rm -it \
-p 8443:443 \
jpillora/chisel:latest client \
--auth "tunnel:XXXXXXXX" \
https://<app-name>.azurewebsites.net \
0.0.0.0:443:<keyvault-name>.vault.azure.net:443
# Alternative method: use Azure CLI through the SOCKS5 proxy from Step 3
HTTPS_PROXY=socks5://localhost:1080 az keyvault secret list --vault-name <keyvault-name>
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Connection refused | Chisel server not running or wrong URL | Verify App Service is running, check URL |
| Authentication failed | Wrong credentials | Get fresh credentials from Terraform output |
| Timeout connecting to target | Target hostname wrong or not in VNet | Verify private endpoint hostname, check VNet peering |
| Port already in use | Local service using the port | Use different local port: -p 5462:5432 |
| DNS resolution failed | Private DNS not linked | Use IP address instead, or check Private DNS Zone |
Security Notes
- Credentials: Chisel auth is randomly generated and stored in App Service settings (encrypted by Azure)
- Access logging: All connections are logged in Application Insights
- Scope: Chisel can only reach resources within the VNet or peered VNets
- Not for tenants: This is platform team tooling, not for ministry developers