awork slow or unavailable

Incident Report for awork.com

Postmortem

English

Customer Post-Mortem: Database Outage (SQL Overload)

Summary

On Feb 26, 2026 (09:08–10:35), awork was unavailable because the database CPU became saturated. A change intended to speed up slow edge-case queries (enabling a SQL recompile setting) interacted with one specific high-impact query pattern from our Connect/external "type of work" endpoint, causing runaway load and a full service disruption.

Impact

awork was not accessible during the incident window
No data loss and no security impact

Root Cause

A performance optimization (enabling the SQL recompile setting) increased the database's processing cost under load. A single problematic query pattern then drove database CPU to 100%, and long-running queries were not cancelled quickly enough, which made the overload worse.
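
To illustrate what cancelling long-running queries can look like, here is a minimal sketch of a per-query time budget. It uses Python's built-in sqlite3 module purely as a stand-in and is not awork's database or code; the helper name run_with_timeout and the timeout values are hypothetical.

```python
import sqlite3
import time


def run_with_timeout(conn, sql, params=(), timeout_s=1.0):
    """Run a query and abort it if it exceeds timeout_s seconds.

    Illustrative only: sqlite3 stands in for a production database.
    """
    deadline = time.monotonic() + timeout_s

    def over_budget():
        # A non-zero return value tells SQLite to interrupt the statement.
        return 1 if time.monotonic() > deadline else 0

    # Call the handler roughly every 1000 SQLite virtual-machine instructions.
    conn.set_progress_handler(over_budget, 1000)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        # Remove the handler so later queries are not affected.
        conn.set_progress_handler(None, 0)
```

A query that exceeds its budget raises sqlite3.OperationalError instead of continuing to consume CPU. Production database drivers usually offer their own command timeout setting, which would be the natural place to enforce this.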

Resolution

We restored service by rolling back the backend to the previous version and disabling the recompile setting.

Timeline

09:08 Incident detected/declared (alerts fired)
09:08–09:45 Triage and load reduction attempts; root cause investigation
09:45 Compatibility level adjusted during investigation (did not resolve)
09:55 Backend rolled back to previous version
10:20 SQL recompile setting disabled
10:35 Service fully recovered

Prevention

Rework the affected Connect/external "type of work" endpoint/query
Enforce query timeouts/cancellation so that queries cannot run for minutes
Safer rollouts, with canary and rollback triggers, for database-impacting changes
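
As a rough illustration of the rollback-trigger idea, the sketch below signals a rollback once database CPU stays above a threshold for several consecutive samples. The class name RollbackTrigger, the 90% threshold, and the three-sample window are assumptions made for this example, not values from awork's setup.

```python
from collections import deque


class RollbackTrigger:
    """Signal a rollback when CPU stays high for N consecutive samples.

    Hypothetical sketch: the threshold and window size are illustrative.
    """

    def __init__(self, threshold_pct=90.0, sustained_samples=3):
        self.threshold_pct = threshold_pct
        # Keep only the most recent samples in the window.
        self.recent = deque(maxlen=sustained_samples)

    def observe(self, cpu_pct):
        """Record one CPU sample; return True if a rollback should start."""
        self.recent.append(cpu_pct)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(s >= self.threshold_pct for s in self.recent)
```

Requiring several consecutive high samples avoids rolling back on a single short spike, at the cost of reacting a few samples later.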

Posted Feb 26, 2026 - 14:20 CET

Resolved

Database is stable again. We will provide a Post-Mortem for this incident later today.

Posted Feb 26, 2026 - 11:24 CET

Update

Database has recovered but we are still monitoring and investigating the root cause.

Posted Feb 26, 2026 - 10:39 CET

Monitoring

We implemented a fix and are monitoring the results.

Posted Feb 26, 2026 - 10:33 CET

Identified

We are seeing an increased database load and working on a solution.

Posted Feb 26, 2026 - 09:43 CET

This incident affected: Web-App and API (Login, Core (Projects, Tasks, Time Tracking), Workspace, Search).