Service Uptime and Monitoring

Overview

The SSO service operates across a primary cluster and a disaster recovery (DR) cluster to ensure high availability. This page provides links to uptime monitoring dashboards and explains what each health check measures and what happens when the service enters disaster recovery mode.

For current service status and incident updates, visit status.loginproxy.gov.bc.ca.

Do you think SSO service is down?

Check status.loginproxy.gov.bc.ca
We will provide live updates as we learn more in our Microsoft Teams Keycloak How-to Channel
If you see the other alerts please know we are working to resolve this. We will post in Microsoft Teams Keycloak How-to Channel and update the history and incident section
you can always reach a human via our Microsoft Teams Keycloak How-to Channel or by emailing us at SSO Team

Uptime Monitoring Checks

The following checks measure different aspects of the SSO service health:

Check	Purpose	Link
PROD Service Uptime	Measures whether the production SSO service itself is running	View Check
TEST Service Uptime	Measures whether the test SSO service itself is running	View Check
DEV Service Uptime	Measures whether the dev SSO service itself is running	View Check

DNS Checks

DNS checks verify that each environment's hostname is correctly routing to the primary Gold cluster. If a DNS check fails while other uptime checks pass, the service may be running in disaster recovery mode with traffic routed to the DR cluster.

Environment	DNS Check
PROD	Prod DNS Check — Verifies `loginproxy.gov.bc.ca` routes to Gold cluster
TEST	Test DNS Check — Verifies `test.loginproxy.gov.bc.ca` routes to Gold cluster
DEV	Dev DNS Check — Verifies `dev.loginproxy.gov.bc.ca` routes to Gold cluster

Disaster Recovery Mode

When the primary cluster is unavailable, the SSO service automatically fails over to the disaster recovery cluster. Traffic is rerouted transparently, and service continues with minimal interruption.

When Failover Occurs

Failover may be triggered by:

OpenShift platform upgrades — the SSO team coordinates in advance and notifies the community via the Microsoft Teams Keycloak How-to Channel.
Incidents on the primary cluster — the SSO team sends notifications as failover is initiated and completed.

Communication During Failover

The SSO team communicates failover status through:

Email notifications — sent to all teams with active integrations
Microsoft Teams — updates posted in the Keycloak How-to Channel
Status page — status.loginproxy.gov.bc.ca

Important: Configuration Changes Are Not Preserved

Critical: Any configuration changes made to your integration during disaster recovery mode will be lost when the service returns to the primary cluster. The DR cluster runs from a backup snapshot and does not sync changes back to production.

If you need to make configuration changes during a failover, wait until the service is restored to the primary cluster.

Typical Failover Timeline

Failover initiated — "The Gold Keycloak instance is failing over to the DR cluster"
- The CSS App may be put into maintenance mode
- Check the status page for updates
Failover complete — "The Gold Keycloak instance has failed over to the DR cluster"
- End users can continue logging in to integrated applications
- The SSO service prioritizes availability for end-user authentication and automation
- Do not make configuration changes
Service restored — "The Gold Keycloak instance has been restored to the primary cluster"
- Normal operations resume
- Configuration changes made during DR are not present — re-apply them if needed
- Service returns to normal SLA targets

Service Level Agreement (SLA)

The SSO service targets 99.95% availability, translating to approximately 22 minutes of downtime per month. However, actual availability is constrained by several external dependencies beyond the SSO team's direct control.

Availability Guarantees

The SSO service operates 24/7, except during planned maintenance windows. Planned outages within the Kamloops and Calgary data centers are communicated in advance through the Microsoft Teams Keycloak How-to Channel.

Business Hours Support:

Hours: Weekdays 9:00 AM–5:00 PM Pacific Time (excluding statutory holidays)
Scope: Client provisioning questions, feature requests, and general guidance
Response time: Best effort during business hours

After-Hours Support:

Scope: Service outages and incidents impacting authentication
Availability: 24/7 incident response
Contact: Email bcgov.sso@gov.bc.ca or the Microsoft Teams Keycloak How-to Channel

Incident Response Times

The SSO team classifies incidents into four priority levels with target response times:

Priority	Severity	Example	Business Hours	After Hours
P1 — Critical	Service outage	Keycloak or authentication is unavailable	< 15 min	< 30 min
P2 — High	Stability warning	Stability issue that may cause an outage if not addressed	< 30 min	< 60 min
P3 — Moderate	Degradation warning	Moderate issue not requiring immediate intervention	< 30 min	Best effort
P4 — Low	Minor warning	Non-critical system health issue	< 45 min	Best effort

All incidents are detected and escalated through the SSO team's 24/7 monitoring system.

Limitations: Zero-Downtime Deployments

The current version of Red Hat Build of Keycloak does not support zero-downtime (blue-green) deployments. When the SSO team upgrades Keycloak or applies security patches, brief downtime is required.

During upgrades and patches:

Users with active sessions must re-authenticate
Users are notified in advance via the Teams channel with the specific maintenance window

Change Communications

The SSO team provides advance notice for all planned changes:

Change Type	Advance Notice	Announcement Channel	Example
Minor	24 hours	Microsoft Teams Keycloak How-to Channel	Bug fixes, low-impact changes
Medium/Major	5 business days	Microsoft Teams Keycloak How-to Channel	Keycloak version upgrades
Emergency	As soon as possible	Microsoft Teams Keycloak How-to Channel + SSO Team Email	Security vulnerabilities, service recovery

Factors That Impact SSO Service SLA

The SSO service is one component in a larger infrastructure stack. Outages or performance issues in any of the following layers can affect the availability of SSO:

1. Private Cloud Platform Services (OpenShift)

The SSO service runs on the BC Government's Private Cloud Platform (OpenShift), hosted in data centers in Kamloops (primary) and Calgary (disaster recovery).

Factor	Details
Availability Commitment	99.95% uptime
Impact of Planned Outages	Minimal — the SSO team uses automated failover to the DR cluster (typically ≤15 minutes)
Impact of Unplanned Outages	Affects SSO availability if both primary and DR clusters are impacted

2. Data Center Infrastructure

The Kamloops and Calgary data centers provide the underlying physical and network infrastructure for the OpenShift platform. The SSO team relies on the SLAs negotiated between the Province and the data center operators.

Factor	Details
Availability Commitment	99.5% uptime
Responsibility	Data center operators and the Province; not controlled by the SSO team
Impact	Any unplanned outage at the data center level directly impacts SSO availability

3. Upstream Identity Providers (IDPs)

The SSO service depends on external identity providers for authentication. Outages in these services are entirely outside the SSO team's control.

IDP	Operated By	Impact If Down
IDIR / IDIR - MFA	Access Directory Management Services (ADMS/WAM)	Users cannot authenticate with IDIR or IDIR - MFA
BCeID	Provincial Identity Information Management (IDIM)	Users cannot authenticate with BCeID
BC Services Card	BC Services Card team	Users cannot authenticate with BC Services Card
Other IDPs	GitHub, Digital Credential, etc.	Users cannot authenticate with that specific IDP

Note: While the SSO service itself may be running, if all upstream IDPs are unavailable, end users cannot log in to any applications.

Overall SLA Calculation

The effective SLA of the SSO service is constrained by the weakest link in the dependency chain. Because the SSO team depends on 99.5% availability from data centers (rather than 99.95%), the realistic SLA target is 99.5% when accounting for all infrastructure dependencies.

Upstream IDP outages further reduce the effective end-user availability, as users cannot authenticate even if SSO is operational.

How the SSO Team Manages Availability

Automated failover — detects primary cluster failures and automatically routes traffic to the DR cluster with minimal latency
Monitoring and alerts — 24/7 monitoring of service health and upstream IDP status
Maintenance windows — coordinated in advance via the Teams Keycloak How-to Channel to minimize impact

For the complete list of current monitoring checks and real-time status, visit status.loginproxy.gov.bc.ca.

Cost

There is no cost for the SSO service for BC Government ministries, central agencies, and Crown corporations. Any future changes to the cost model will be communicated in advance

Overview​

Do you think SSO service is down?​

Uptime Monitoring Checks​

DNS Checks​

Disaster Recovery Mode​

When Failover Occurs​

Communication During Failover​

Important: Configuration Changes Are Not Preserved​

Typical Failover Timeline​

Service Level Agreement (SLA)​

Availability Guarantees​

Incident Response Times​

Limitations: Zero-Downtime Deployments​

Change Communications​

Factors That Impact SSO Service SLA​

1. Private Cloud Platform Services (OpenShift)​

2. Data Center Infrastructure​

3. Upstream Identity Providers (IDPs)​

Overall SLA Calculation​

How the SSO Team Manages Availability​

Cost​