Summary:
Users with a subscription could not log in to the system. During login, the system fetches billing and subscription information from the external billing system's API. That API was unavailable, and because login depended on its response, login failed.
Customer Impact:
Users could not log in to the platform.
Timeline:
13h03: users start reporting to the support team that they cannot log in
13h07: after several reports, the support team reports the incident to the dev team
13h15: after an initial attempt to diagnose the incident, the problem is escalated to the head of the dev team
13h35: the source of the login failures is identified
13h55: a patch is applied to the system to mitigate the issue; login is restored
15h10: a fix is deployed to correctly handle the issue
Contributing factors:
The external billing system was not responding, and the login process depended on it.
Lessons Learned:
We should write code that depends on a third party with the expectation that the third party will sometimes fail. Every such dependency should have a fallback.
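As an illustration of this lesson, the sketch below shows one possible way for a login path to tolerate a billing outage. It is not the fix that was deployed. All names (Subscription, fetch_subscription_with_fallback, login, the in-memory cache) are hypothetical, and it assumes the billing client signals an outage with Python's built-in TimeoutError or ConnectionError. A real HTTP client may raise its own exception types instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Subscription:
    plan: str
    verified: bool  # False when the billing system could not be reached


def fetch_subscription_with_fallback(
    user_id: str,
    fetch: Callable[[str], Subscription],  # hypothetical billing client call
    cache: Dict[str, Subscription],        # hypothetical last-known-good store
) -> Subscription:
    """Return fresh billing data, or a degraded value if billing is down."""
    try:
        subscription = fetch(user_id)
    except (TimeoutError, ConnectionError):
        # Billing is unreachable: fall back to the last known plan, or to
        # an "unknown" plan, and mark the result as unverified.
        cached = cache.get(user_id)
        plan = cached.plan if cached else "unknown"
        return Subscription(plan=plan, verified=False)
    # Remember the successful answer so it can serve as a later fallback.
    cache[user_id] = subscription
    return subscription


def login(
    user_id: str,
    fetch: Callable[[str], Subscription],
    cache: Dict[str, Subscription],
) -> dict:
    """Log the user in even when billing data is unavailable."""
    subscription = fetch_subscription_with_fallback(user_id, fetch, cache)
    return {
        "user_id": user_id,
        "logged_in": True,
        "plan": subscription.plan,
        "billing_verified": subscription.verified,
    }
```

With this shape, a billing outage degrades the session (the plan is unverified) instead of blocking login. Which features remain available to an unverified session is a product decision that this sketch does not address.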
Action items: