Azure Central US region outage impacted multiple Kyro services (during the CrowdStrike incident)

Incident Report for Kyro Technologies Inc

Postmortem

Description

All three availability zones in the Azure Central US region went down, causing an outage of the Kyro Prod app, which is hosted in Central US. During this window, multiple services were impacted, resulting in customer impact.

How the issue was identified

We received the following alerts from our alert setup:

  • formio status - appi-mkrwkcpmhlans
  • VM Availability - form-admin-Prod (Sev0)
  • Multiple 500 failures; the alert above was received based on these
  • Changestream events restart alert - PROD (Sev0)
  • Deployment Error - PROD (Sev1)
  • API gateway status - appi-mkrwkcpmhlans

Customer Impact:

  1. "Trying to finish up my daily report for today. It was saved in drafts but will not open now. Error message said 'Internal Server Error'. Had to manually sync but still same error. Logged out and back in. Same error message." - Joy Cox, Think Power Solutions (FCR AEP), joy.cox@thinkpowersolutions.com

There are a few more customer impact reports; we have to check Intercom / Zendesk.

Impact

Impact Statement from Azure

Impact Statement: Starting at approximately 21:56 UTC on 18 Jul 2024 | 3:26 AM IST on July 19th, a subset of customers may experience issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services.

Current Status: We are aware of this issue and have engaged multiple teams. We’ve determined the underlying cause. A backend cluster management workflow deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks. Mitigation has been confirmed for all Azure Storage clusters, and the majority of services are now recovered. A small subset of services is still experiencing residual impact. Impacted customers will continue to be updated through the Azure service health portal.

The Central US region is located in Iowa and boasts three availability zones – meaning Microsoft operates three discrete physical facilities that are fewer than two milliseconds apart in terms of network connection speed. Like its hyperscale rivals, Microsoft promotes availability zones as improving resilience and enabling faster disaster recovery.

But that idea relies on at least one availability zone being available – right now all three are out.

Impacted Services:

  • APIM
  • App Insights
  • Cosmos DB
  • Storage Account
  • Key Vault
  • Logic App
  • Container Registry
  • Virtual machine (Changestream: 4:12 to 7:07; Formio: 3:49 to 6:19)

Go to Log Analytics and run the query; sample failed operations from the outage window are below:

| CorrelationId | OperationNameValue | Resource | ErrorCode | ErrorMessage | TimeGenerated | ErrorCount |
| --- | --- | --- | --- | --- | --- | --- |
| 6303208d-c37f-4d93-b5ee-6d7da5674265 | MICROSOFT.STORAGE/STORAGEACCOUNTS/LISTACCOUNTSAS/ACTION | kyroprod | InternalServerError | Encountered internal server error. Diagnostic information: timestamp '20240718T221645Z', subscription id 'c97a7671-7dae-4f08-a7f8-b26c4b5f75ea', tracking id '6303208d-c37f-4d93-b5ee-6d7da5674265', request correlation id '6303208d-c37f-4d93-b5ee-6d7da5674265'. | 2024-07-18T22:16:45.2876893Z | 1 |
| 039ef870-ae33-4cfa-9bfb-285bc67e9932 | Microsoft.Resourcehealth/healthevent/InProgress/action | form-admin |   |   | 2024-07-18T22:18:14.102Z | 1 |
| 91dcb917-bc79-40a7-b5e8-d2010e0d417e | MICROSOFT.STORAGE/STORAGEACCOUNTS/LISTACCOUNTSAS/ACTION | kyroprod | InternalServerError | Encountered internal server error. Diagnostic information: timestamp '20240718T221815Z', subscription id 'c97a7671-7dae-4f08-a7f8-b26c4b5f75ea', tracking id '91dcb917-bc79-40a7-b5e8-d2010e0d417e', request correlation id '91dcb917-bc79-40a7-b5e8-d2010e0d417e'. | 2024-07-18T22:18:15.1938552Z | 1 |
| f034de56-1233-4fcf-8291-884ba9d47978 | MICROSOFT.INSIGHTS/COMPONENTS/METADATA/ACTION | appi-mkrwkcpmhlans | InternalServerError | Encountered internal server error. Diagnostic information: timestamp '20240718T222016Z', subscription id 'c97a7671-7dae-4f08-a7f8-b26c4b5f75ea', tracking id 'f034de56-1233-4fcf-8291-884ba9d47978', request correlation id 'f034de56-1233-4fcf-8291-884ba9d47978'. | 2024-07-18T22:20:16.7003297Z | 1 |
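For reference, a minimal sketch of how a query like this could be run programmatically against the Log Analytics workspace. The workspace ID is a placeholder, and the KQL filters are an assumption that mirrors the columns above, not the exact query that produced the table.

```python
# Minimal sketch: pull failed control-plane operations from the outage window
# out of Log Analytics. The workspace ID is a placeholder and the KQL filters
# are assumptions, not the exact query used for the table above.
from datetime import datetime, timezone
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

QUERY = """
AzureActivity
| where TimeGenerated between (datetime(2024-07-18T21:56:00Z) .. datetime(2024-07-19T05:40:00Z))
| where ActivityStatusValue == "Failure" or Properties has "InternalServerError"
| project CorrelationId, OperationNameValue, Resource, TimeGenerated, Properties
| order by TimeGenerated asc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    WORKSPACE_ID,
    QUERY,
    timespan=(
        datetime(2024, 7, 18, 21, 0, tzinfo=timezone.utc),
        datetime(2024, 7, 19, 6, 0, tzinfo=timezone.utc),
    ),
)

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```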

Not Impacted

  • Container app

Timeline

| UTC Time | IST Time | Event |
| --- | --- | --- |
| 21:56, 18 Jul 2024 | 3:26 AM, 19 Jul 2024 | Azure reported the issue on its status page |
|  | 3:37 AM | Encountered multiple 500s for GET and other requests, plus Changestream exceptions (Sev0) |
|  | 3:54 AM | Uday reported in the Teams group: https://app.kyro.ai/ is very slow; multiple requests are ending with 500 ("Request taking longer than usual, please try again later") |
|  | 4:00 AM | Uday reported that nothing is loading for him in the portal |
|  | 4:03 AM | Ping test from South Central US failed; 2 more failed ping tests observed at 4:18 AM and 4:22 AM |
|  | 4:23 AM | One ping test from North Central US failed; later requests succeeded |
|  | 4:24 AM and 4:25 AM | APIM ping test failures observed in East US (2), South Central US, West US and North Central US |
| 22:56 | 4:26 AM | Azure officially updated its status page |
|  | 8:30 AM | app.kyro.ai is working fine |
|  | 9:00 AM | Sample projects not created for new signups; the page keeps toggling between the main page and the project listing page after the user signs in. Uday: timelog is synced but totals are 0, even after adding |
|  | 9:43 AM | Event Grid started delivering a few events |
|  | 9:54 AM | Avaneesh tested document upload (the mark-as-complete event handler triggers events from blob upload directly); that is working |
|  | 10:22 AM | Changestream turned back on; monitoring auditlog creation |
| 5:40:06, 19 Jul 2024 | 11:10:06 AM, 19 Jul 2024 | Azure mitigated the issue |

Impact from Kyro's end

If the response is a 0, the data will remain in the queue if it is from mobile. Worklog IDs present in both lists (see the reconciliation sketch after the lists below):

6699975769c65b4e153d10fc 6699a202f53f2c928c3afb0e 6699bb7c46c46518100dface 66999edb5d778af2894e90ec 6699a5468a8594fa500d96a7 6699c2fc072ea864d57090c6 6699b5daa814c1e38dacc773 66999f135d778af2894e90f1 6699cc6ef9bcea2cac275489 6699a8a040a3d55390bb304f 66999c2dfa47cc3a7e413d98 6699a8e440a3d55390bb3055 6699a55f8a8594fa500d96a8 6699978bc5d9c91f5a585fd6 6699983cda40f78057d3daae

Worklog IDs present in the report but not in the logs:

6699a417492a0e31b987f626 6699b49002f1e72407074622 6699c74e9e433852de1499dc 66999d23d216ed69b4e92913 66999dced216ed69b4e9291f 669932702424c7de696d933a 66999b86988ab20b2ac542b4

Worklog IDs present in the logs but not in the report:

6699edb7389fb60b532f91e3 6699df518a0f59174dfb47f4 6699cda743dce842294e242c 6699c5a28c64a2023ba339a2 6699cef7bf7c640fcb2e5fa4 6699ce1d434668b48845a651 6699c9ab3c52487f8920a43f 6699bfc48331e3431bdda6a1 6699c26a93e07227ef8b3d2a 6699a5c428fd916b76d8018a
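For reference, a minimal sketch of how these three buckets can be derived from the two ID sets; the file names are placeholders for the actual report and log exports.

```python
# Minimal sketch: reconcile worklog IDs between the mobile report and the logs.
# The input file names are placeholders for the actual exports.
def load_ids(path: str) -> set[str]:
    """Read whitespace-separated worklog IDs into a set."""
    with open(path) as f:
        return set(f.read().split())

report_ids = load_ids("report_worklog_ids.txt")  # IDs from the mobile report
log_ids = load_ids("log_worklog_ids.txt")        # IDs seen in App Insights / container logs

print("Present in both lists:", sorted(report_ids & log_ids))
print("In report but not in logs:", sorted(report_ids - log_ids))
print("In logs but not in report:", sorted(log_ids - report_ids))
```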

These submission IDs do not match the report.

Two of the 0-response submission IDs (6699b4548331e3431bdd9964 and 66990c842bcf9beaa348dff5) are already in the DB, so the report does not contain those entries.

The other 500 errors for the submissions do not have any customEvents in Azure. Since the missing_submissions script used for the mobile report is based on customEvents, these are not part of the report.

From the worklog IDs present in the report but not in the logs:

These worklogs are not present in the database, the App Insights logs, or the container app logs:

6699a417492a0e31b987f626 6699b49002f1e72407074622 6699c74e9e433852de1499dc 66999d23d216ed69b4e92913 66999dced216ed69b4e9291f 669932702424c7de696d933a 66999b86988ab20b2ac542b4

This worklog returned a 400 - {"detail":{"message":"Cannot add worklog to this timesheet","error_code":"T_005"}} - at 2024-07-18T15:19:12.8115796Z:

669932702424c7de696d933a

For missing auditlogs, we plan to call the POST auditlog service with the current document so that an insert event is generated to backfill the missing auditlogs.
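A minimal sketch of that backfill, assuming the missing document IDs have already been collected; the base URL, endpoint paths, and auth token are placeholders, not Kyro's actual API.

```python
# Minimal sketch: backfill missing auditlogs by re-posting the current document
# to the auditlog service so an insert event is generated.
# BASE_URL, endpoint paths, and TOKEN are placeholders, not Kyro's real API.
import requests

BASE_URL = "https://api.example.com"  # placeholder
TOKEN = "<service-token>"             # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

missing_doc_ids = ["<document-id>"]   # output of the missing-auditlogs check

for doc_id in missing_doc_ids:
    # Fetch the current state of the document (placeholder endpoint).
    doc = requests.get(f"{BASE_URL}/worklogs/{doc_id}", headers=HEADERS, timeout=30)
    doc.raise_for_status()

    # POST it to the auditlog service so an insert event is generated.
    resp = requests.post(f"{BASE_URL}/auditlogs", json=doc.json(), headers=HEADERS, timeout=30)
    resp.raise_for_status()
    print(f"Backfilled auditlog for {doc_id}: {resp.status_code}")
```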

From looking into the worklogs that were in the logs but not in the report:

In DB (successful): 6699cda743dce842294e242c

6699c9ab3c52487f8920a43f 6699c5a28c64a2023ba339a2 6699bfc48331e3431bdda6a1 6699c26a93e07227ef8b3d2a

Actions to be Taken

Actions Taken by Kyro Team

  1. Analysed the Azure status page and confirmed the downtime was on their end.
  2. Communicated to users/customers via David.
  3. To analyse the impact, we tested the mobile app by submitting timesheets and reports, and monitored whether the changestreams would have to be handled properly once the services were back up. Outcome: forms and timesheets were not loading for Uday.
  4. As an action item, get an estimate of the impact by looking at daily usage. Auditlogs are not being created due to Cosmos unavailability; we might need to either restart the changestreams or re-run them for some failed documents (see the change-stream sketch after this list).
  5. Validate that everything is working properly: check whether we have data for failed requests from the outage timeframe and whether the changestream processes are running.
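A minimal sketch of what restarting (resuming) a changestream could look like, assuming it is a Cosmos DB (Mongo API) change stream consumed with pymongo and that the last resume token is persisted; all names are placeholders, and Cosmos DB's Mongo API adds pipeline restrictions not shown here.

```python
# Minimal sketch: resume a change stream after the outage. Assumes the
# changestream is consumed with pymongo and the last resume token was saved;
# collection and token-store names are placeholders.
from pymongo import MongoClient

client = MongoClient("<cosmos-mongo-connection-string>")  # placeholder
db = client["kyro"]                                       # placeholder database
worklogs = db["worklogs"]                                 # watched collection (placeholder)
token_store = db["changestream_tokens"]                   # persisted resume tokens (placeholder)

saved = token_store.find_one({"_id": "worklogs"})
resume_token = saved["token"] if saved else None

# Resume from the last processed event if a token exists; otherwise start fresh.
with worklogs.watch(resume_after=resume_token) as stream:
    for change in stream:
        # ... create the auditlog for this change ...
        token_store.replace_one(
            {"_id": "worklogs"},
            {"_id": "worklogs", "token": change["_id"]},
            upsert=True,
        )
```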

If Application Insights captured the incoming requests, we will have to look at what types of POST/PATCH requests failed.

  1. https://github.com/kyro-saas/kyro-one/pull/7734 - generated missing_auditlogs.
  2. Checked the POST calls during the outage to get a list of impacted users - https://github.com/kyro-saas/kyro-one/pull/7735 - failed-worklogs.
  3. Failed POST submissions.
  4. Added the script to check for missing auditlogs and create the auditlog for the same (ashraf-kyro, Pull Request #7734, kyro-saas/kyro-one) - after running it, Uday's 3 missing auditlogs got added.
What happens when you submit a time log from the project?

I am able to log time from the projects page, and it gets saved in unsynced entries.

  1. I had added 4 worklogs during the outage. Only the auditlog for the 1st worklog got created. I added a fifth worklog right now. The audit log got created. So 3 worklog auditlogs were skipped.

Root cause / Reason

Action Item

  1. Get an estimate of the impact by looking at daily usage. Auditlogs are not being created due to Cosmos unavailability; we might need to either restart the changestreams or re-run them for some failed documents.

If Application Insights captured the incoming requests, we will have to look at what types of POST/PATCH requests failed.
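As a reference, this is the kind of query we could run over the Application Insights requests table to see which POST/PATCH requests failed during the outage window. The filters are an assumption; it can be pasted into the Logs blade or run with the same LogsQueryClient approach shown earlier.

```python
# Minimal sketch: KQL over the Application Insights "requests" table to find
# failed POST/PATCH requests during the outage window. The filters are an
# assumption; run it in the Logs blade or via LogsQueryClient as shown earlier.
FAILED_WRITES_QUERY = """
requests
| where timestamp between (datetime(2024-07-18T21:56:00Z) .. datetime(2024-07-19T05:40:00Z))
| where success == false
| where name startswith "POST" or name startswith "PATCH"
| summarize failures = count() by name, resultCode, bin(timestamp, 15m)
| order by timestamp asc
"""
```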

  1. Create a GitHub issue to have the mobile app go to offline mode when it detects API failures. - Mobile
  2. Update the Kyro Statuspage. - DevOps
  3. Read the "recovery from region loss" documentation and list what we need to do to set up failover in such scenarios.
  4. Prepare the list of items needed to fail over from an infrastructure perspective and to get the mobile app functioning in complete offline mode.
  5. Look at the time of the last good backups and have the restore procedures ready in case we require them.
  6. Ensure we determine the customer impact, any inconsistencies in the DB, and the steps for complete recovery.
  7. Check the POST calls during the outage to get a list of impacted users. - Uday
  8. Write a script to check which documents are missing auditlogs. - Ashraf
  9. Re-run the missing worklog and submission report for mobile to check if there are entries in the queue. - Mohan
  10. Test the services one by one to check that everything is working fine. - QA

Prevention and Avoidance

Posted Nov 21, 2024 - 08:18 UTC

Resolved

Cosmos DB and compute services were impacted, leaving the Kyro app unable to function properly.

Azure Response
Impact Statement: Starting at approximately 21:56 UTC on 18 Jul 2024 | 3:26 AM IST on July 19th, a subset of customers may experience issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services.

Current Status: We are aware of this issue and have engaged multiple teams. We’ve determined the underlying cause. A backend cluster management workflow deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks. Mitigation has been confirmed for all Azure Storage clusters, and the majority of services are now recovered. A small subset of services is still experiencing residual impact. Impacted customers will continue to be updated through the Azure service health portal.

The Central US region is located in Iowa and boasts three availability zones – meaning Microsoft operates three discrete physical facilities that are fewer than two milliseconds apart in terms of network connection speed. Like its hyperscale rivals, Microsoft promotes availability zones as improving resilience and enabling faster disaster recovery.

But that idea relies on at least one availability zone being available – right now all three are out.
Posted Jul 19, 2024 - 08:17 UTC