Service Uptime and Monitoring

Overview

The SSO service operates across a primary cluster and a disaster recovery (DR) cluster to ensure high availability. This page links to the uptime monitoring dashboards, explains what each health check measures, and describes what happens when the service enters disaster recovery mode.

For current service status and incident updates, visit status.loginproxy.gov.bc.ca.


Do you think the SSO service is down?

  1. Check status.loginproxy.gov.bc.ca.
  2. We will post live updates in our Microsoft Teams Keycloak How-to Channel as we learn more.
  3. If you see other alerts, please know that we are working to resolve them. We will post in the Microsoft Teams Keycloak How-to Channel and update the history and incident section.
  4. You can always reach a human via our Microsoft Teams Keycloak How-to Channel or by emailing the SSO Team.

Uptime Monitoring Checks

The following checks measure different aspects of the SSO service health:

Check | Purpose | Link
PROD Service Uptime | Measures whether the production SSO service itself is running | View Check
TEST Service Uptime | Measures whether the test SSO service itself is running | View Check
DEV Service Uptime | Measures whether the dev SSO service itself is running | View Check

DNS Checks

DNS checks verify that each environment's hostname is correctly routing to the primary Gold cluster. If a DNS check fails while other uptime checks pass, the service may be running in disaster recovery mode with traffic routed to the DR cluster.

Environment | DNS Check
PROD | Prod DNS Check: verifies loginproxy.gov.bc.ca routes to the Gold cluster
TEST | Test DNS Check: verifies test.loginproxy.gov.bc.ca routes to the Gold cluster
DEV | Dev DNS Check: verifies dev.loginproxy.gov.bc.ca routes to the Gold cluster
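
The way the uptime and DNS checks combine can be read as a small decision table. The sketch below is illustrative only: the function name and the returned labels are assumptions made for this example and are not part of the SSO team's tooling or official status values.

```python
def interpret_checks(uptime_ok: bool, dns_ok: bool) -> str:
    """Map a pair of check results to a likely service state.

    Hypothetical helper: the labels are illustrative, not official statuses.
    """
    if uptime_ok and dns_ok:
        return "normal: served from the primary Gold cluster"
    if uptime_ok and not dns_ok:
        # The service responds, but the hostname no longer routes to Gold.
        return "possible DR mode: traffic may be routed to the DR cluster"
    # A failing uptime check, whatever the DNS result, suggests the
    # service itself cannot be reached.
    return "possible outage: check the status page"


print(interpret_checks(uptime_ok=True, dns_ok=False))
```

In every case the status page remains the authoritative source; a helper like this only summarizes what the two checks suggest.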

Disaster Recovery Mode

When the primary cluster is unavailable, the SSO service automatically fails over to the disaster recovery cluster. Traffic is rerouted transparently, and service continues with minimal interruption.

When Failover Occurs

Failover may be triggered by:

  • OpenShift platform upgrades — the SSO team coordinates in advance and notifies the community via the Microsoft Teams Keycloak How-to Channel.
  • Incidents on the primary cluster — the SSO team sends notifications as failover is initiated and completed.

Communication During Failover

The SSO team communicates failover status through the Microsoft Teams Keycloak How-to Channel and the status page at status.loginproxy.gov.bc.ca.

Important: Configuration Changes Are Not Preserved

Critical: Any configuration changes made to your integration during disaster recovery mode will be lost when the service returns to the primary cluster. The DR cluster runs from a backup snapshot and does not sync changes back to production.

If you need to make configuration changes during a failover, wait until the service is restored to the primary cluster.

Typical Failover Timeline

  1. Failover initiated — "The Gold Keycloak instance is failing over to the DR cluster"

    • The CSS App may be put into maintenance mode
    • Check the status page for updates
  2. Failover complete — "The Gold Keycloak instance has failed over to the DR cluster"

    • End users can continue logging in to integrated applications
    • The SSO service prioritizes availability for end-user authentication and automation
    • Do not make configuration changes
  3. Service restored — "The Gold Keycloak instance has been restored to the primary cluster"

    • Normal operations resume
    • Configuration changes made during DR are not present — re-apply them if needed
    • Service returns to normal SLA targets
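
For teams that automate configuration changes, the timeline above can be modelled as three phases. The following is a minimal sketch that assumes announcements use the wording quoted in the timeline; the enum, the phrase table, and the function names are hypothetical and exist only for illustration.

```python
from enum import Enum
from typing import Optional


class FailoverPhase(Enum):
    INITIATED = "failover initiated"
    COMPLETE = "failover complete"
    RESTORED = "service restored"


# Key phrases taken from the announcement wording quoted in the timeline.
# Assumption: real announcements contain these phrases verbatim.
PHRASES = {
    "is failing over to the DR cluster": FailoverPhase.INITIATED,
    "has failed over to the DR cluster": FailoverPhase.COMPLETE,
    "has been restored to the primary cluster": FailoverPhase.RESTORED,
}


def phase_from_announcement(message: str) -> Optional[FailoverPhase]:
    """Return the phase an announcement refers to, or None if unrecognized."""
    for phrase, phase in PHRASES.items():
        if phrase in message:
            return phase
    return None


def hold_config_changes(phase: Optional[FailoverPhase]) -> bool:
    """True while a failover is in progress or the service is running in DR."""
    return phase in (FailoverPhase.INITIATED, FailoverPhase.COMPLETE)


msg = "The Gold Keycloak instance has failed over to the DR cluster"
print(hold_config_changes(phase_from_announcement(msg)))  # prints True
```

Holding changes in both the initiated and complete phases follows the guidance to wait until the service is restored to the primary cluster.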

Service Level Agreement (SLA)

The SSO service targets 99.95% availability, translating to approximately 22 minutes of downtime per month. However, actual availability is constrained by several external dependencies beyond the SSO team's direct control.
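
The downtime figure follows directly from the availability target. As a worked example, assuming a 30-day month (the function name is illustrative):

```python
def downtime_budget_minutes(availability_pct: float, days: float = 30) -> float:
    """Minutes of allowed downtime for a given availability over `days` days."""
    total_minutes = days * 24 * 60
    return (1 - availability_pct / 100) * total_minutes


# 0.05% of 43,200 minutes in a 30-day month
print(round(downtime_budget_minutes(99.95), 1))  # prints 21.6
print(round(downtime_budget_minutes(99.5), 1))   # prints 216.0
```

The second line shows the same calculation at 99.5%, which corresponds to roughly 3.6 hours per 30-day month.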

Availability Guarantees

The SSO service operates 24/7, except during planned maintenance windows. Planned outages within the Kamloops and Calgary data centers are communicated in advance through the Microsoft Teams Keycloak How-to Channel.

Business Hours Support:

  • Hours: Weekdays 9:00 AM–5:00 PM Pacific Time (excluding statutory holidays)
  • Scope: Client provisioning questions, feature requests, and general guidance
  • Response time: Best effort during business hours

After-Hours Support:


Incident Response Times

The SSO team classifies incidents into four priority levels with target response times:

Priority | Severity | Example | Business Hours | After Hours
P1 - Critical | Service outage | Keycloak or authentication is unavailable | < 15 min | < 30 min
P2 - High | Stability warning | Stability issue that may cause an outage if not addressed | < 30 min | < 60 min
P3 - Moderate | Degradation warning | Moderate issue not requiring immediate intervention | < 30 min | Best effort
P4 - Low | Minor warning | Non-critical system health issue | < 45 min | Best effort

All incidents are detected and escalated through the SSO team's 24/7 monitoring system.
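
The response targets above can be expressed as a lookup keyed by priority and time of day. This is a rough sketch, not the SSO team's alerting configuration: the names are hypothetical, the caller is assumed to pass Pacific local time, and statutory holidays are not handled.

```python
from datetime import datetime
from typing import Optional

# Target response times in minutes, copied from the table above.
# None stands for "best effort" (no fixed target).
RESPONSE_TARGETS = {
    "P1": {"business": 15, "after": 30},
    "P2": {"business": 30, "after": 60},
    "P3": {"business": 30, "after": None},
    "P4": {"business": 45, "after": None},
}


def is_business_hours(local_time: datetime) -> bool:
    """True on weekdays from 9:00 up to 17:00 (Pacific local time assumed)."""
    return local_time.weekday() < 5 and 9 <= local_time.hour < 17


def response_target_minutes(priority: str, local_time: datetime) -> Optional[int]:
    """Look up the target response time for a priority at a given time."""
    window = "business" if is_business_hours(local_time) else "after"
    return RESPONSE_TARGETS[priority][window]


# Monday 8 January 2024, 10:00: business hours
print(response_target_minutes("P1", datetime(2024, 1, 8, 10, 0)))   # prints 15
# Saturday 13 January 2024, 10:00: after hours
print(response_target_minutes("P1", datetime(2024, 1, 13, 10, 0)))  # prints 30
```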


Limitations: Zero-Downtime Deployments

The current version of Red Hat build of Keycloak does not support zero-downtime (blue-green) deployments. When the SSO team upgrades Keycloak or applies security patches, brief downtime is required.

During upgrades and patches:

  • Users with active sessions must re-authenticate
  • Users are notified in advance via the Teams channel with the specific maintenance window

Change Communications

The SSO team provides advance notice for all planned changes:

Change Type | Advance Notice | Announcement Channel | Example
Minor | 24 hours | Microsoft Teams Keycloak How-to Channel | Bug fixes, low-impact changes
Medium/Major | 5 business days | Microsoft Teams Keycloak How-to Channel | Keycloak version upgrades
Emergency | As soon as possible | Microsoft Teams Keycloak How-to Channel + SSO Team Email | Security vulnerabilities, service recovery

Factors That Impact SSO Service SLA

The SSO service is one component in a larger infrastructure stack. Outages or performance issues in any of the following layers can affect the availability of SSO:

1. Private Cloud Platform Services (OpenShift)

The SSO service runs on the BC Government's Private Cloud Platform (OpenShift), hosted in data centers in Kamloops (primary) and Calgary (disaster recovery).

Factor | Details
Availability Commitment | 99.95% uptime
Impact of Planned Outages | Minimal; the SSO team uses automated failover to the DR cluster (typically ≤15 minutes)
Impact of Unplanned Outages | Affects SSO availability if both primary and DR clusters are impacted

2. Data Center Infrastructure

The Kamloops and Calgary data centers provide the underlying physical and network infrastructure for the OpenShift platform. The SSO team relies on the SLAs negotiated between the Province and the data center operators.

Factor | Details
Availability Commitment | 99.5% uptime
Responsibility | Data center operators and the Province; not controlled by the SSO team
Impact | Any unplanned outage at the data center level directly impacts SSO availability

3. Upstream Identity Providers (IDPs)

The SSO service depends on external identity providers for authentication. Outages in these services are entirely outside the SSO team's control.

IDP | Operated By | Impact If Down
IDIR / IDIR - MFA | Access Directory Management Services (ADMS/WAM) | Users cannot authenticate with IDIR or IDIR - MFA
BCeID | Provincial Identity Information Management (IDIM) | Users cannot authenticate with BCeID
BC Services Card | BC Services Card team | Users cannot authenticate with BC Services Card
Other IDPs | GitHub, Digital Credential, etc. | Users cannot authenticate with that specific IDP

Note: While the SSO service itself may be running, if all upstream IDPs are unavailable, end users cannot log in to any applications.

Overall SLA Calculation

The effective SLA of the SSO service is constrained by the weakest link in the dependency chain. Because the SSO team depends on 99.5% availability from data centers (rather than 99.95%), the realistic SLA target is 99.5% when accounting for all infrastructure dependencies.

Upstream IDP outages further reduce the effective end-user availability, as users cannot authenticate even if SSO is operational.
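
To make the dependency reasoning concrete, the sketch below compares two ways of combining the layer commitments quoted on this page. The weakest-link figure is an upper bound; if the layers are additionally assumed to fail independently and in series (an assumption of this example, not a statement from the SSO team), their availabilities multiply and the result is slightly lower. Upstream IDP availability is not included because no figures are given for it.

```python
from math import prod

# Availability commitments (percent) quoted on this page.
LAYERS = {
    "openshift_platform": 99.95,
    "data_center": 99.5,
}


def weakest_link(availabilities_pct):
    """Upper bound on combined availability: the lowest single commitment."""
    return min(availabilities_pct)


def serial_availability(availabilities_pct):
    """Combined availability if layers fail independently and all are required."""
    return prod(a / 100 for a in availabilities_pct) * 100


values = list(LAYERS.values())
print(weakest_link(values))                   # prints 99.5
print(round(serial_availability(values), 2))  # prints 99.45
```

Either way, the data center commitment dominates the result, which is why 99.5% is used as the realistic target.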

How the SSO Team Manages Availability

  • Automated failover: detects primary cluster failures and automatically routes traffic to the DR cluster with minimal delay
  • Monitoring and alerts — 24/7 monitoring of service health and upstream IDP status
  • Maintenance windows — coordinated in advance via the Teams Keycloak How-to Channel to minimize impact

For the complete list of current monitoring checks and real-time status, visit status.loginproxy.gov.bc.ca.


Cost

There is no cost for the SSO service for BC Government ministries, central agencies, and Crown corporations. Any future changes to the cost model will be communicated in advance.