CWB Connection errors (Prod IRL)

Major incident Production IRL Cognitive WorkBench
2023-03-22 09:38 UTC · 1 hour, 36 minutes

Updates

Post-mortem

Summary:

Our Platform Operations teams were informed of an issue within the Production IRL environment that was preventing access to the Cognitive Workbench (CWB) feature across all projects.
The teams determined that the errors preventing access were due to connection requests being rejected by the database. The priority was to restore connectivity for users. Resetting the database connections and adjusting the timeout configuration were identified, validated and executed as a temporary mitigating action, which restored service.
Connections continued to be monitored whilst the teams worked on a full resolution to the issue.
The ongoing investigation narrowed the cause to 2 specific connections that were incorrectly configured, which caused the database to reject a flood of connection attempts. The configuration on these connections was corrected, providing a permanent resolution to this issue.

Customer Impact:

Users were unable to access Cognitive Workbench to view recommendations.

Root Cause:

Misconfigured credentials on 2 project connections caused a flood of rejection errors from the database. As a result, the maximum threshold for connection errors was exceeded and CWB became inaccessible.

Remediations:

  • The temporary remediation was to reset the database connections and increase the threshold for rejection errors
  • Full resolution occurred when the configuration of the 2 connections identified as triggering the errors was updated

Future Mitigating Actions:

  • Alerting will be implemented to detect when the threshold for rejected connections is approaching its limit
  • A full review of database configuration related to this scenario has been planned for all environments
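
The alerting action above could be sketched roughly as follows. This is a minimal illustration only, not the implementation used on the platform: the function name, the 80% warning ratio and the example counts are all assumptions, and how the rejected-connection count and its limit are read depends on the database and monitoring tooling, which this report does not specify.

```python
from dataclasses import dataclass


@dataclass
class RejectionStatus:
    rejected: int   # rejected connection attempts counted so far
    limit: int      # configured maximum before connections are blocked
    ratio: float    # rejected / limit
    alert: bool     # True when the ratio has reached the warning level


def check_rejected_connections(rejected: int, limit: int,
                               warn_ratio: float = 0.8) -> RejectionStatus:
    """Flag when rejected connections approach the configured limit.

    All names and the default warn_ratio are hypothetical; the caller is
    expected to supply the counts from its own monitoring source.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    if rejected < 0:
        raise ValueError("rejected must not be negative")
    ratio = rejected / limit
    return RejectionStatus(rejected, limit, ratio, ratio >= warn_ratio)


if __name__ == "__main__":
    # Illustrative values only.
    status = check_rejected_connections(rejected=90, limit=100)
    if status.alert:
        print(f"WARNING: {status.rejected}/{status.limit} rejected connections")
```

Alerting on a ratio rather than a fixed count keeps the check valid if the limit is later raised, as it was during the temporary remediation.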

March 27, 2023 · 08:08 UTC
Resolved

We have confirmed internally and with our customers that the Aera platform is now fully restored.

We appreciate your patience during this incident and apologise for any inconvenience that this issue may have caused. Our teams are now working on documenting a comprehensive root cause analysis which we will share with you shortly.

If you have any questions or experience any further problems, please don’t hesitate to reach out to our Support team at support@aeratechnology.com.

March 22, 2023 · 11:15 UTC
Update

We have identified a workaround for the reported issues. Our engineers have flushed connections to the database and service should now be restored. We will continue to monitor to ensure no additional issues arise and will send a further update to confirm full resolution.

You should now be able to resume normal activities. If you continue to experience any problems, please contact our Support team at support@aeratechnology.com.

Thank you for your patience and understanding whilst our engineers restored service.

March 22, 2023 · 10:26 UTC
Investigating

We are continuing to investigate the CWB connection issues. Our engineers are actively working to restore service as quickly as possible. Thank you for bearing with us whilst we work through these issues.

March 22, 2023 · 10:12 UTC
Issue

This notice is to advise you that we are receiving reports of our customers experiencing difficulties with the platform. We are actively investigating and will provide regular updates until the issues are resolved.

Our apologies for the inconvenience this may be causing and we appreciate your patience as we investigate further.

March 22, 2023 · 09:44 UTC
