PRODIRL - Cortex services are unavailable

Minor incident Production IRL Cortex
2025-01-20 05:30 UTC · 10 hours, 28 minutes, 1 second

Updates

Post-mortem

Summary:

On 20th January 2025 at approximately 05:30 UTC, internal monitoring tools, internal teams and customers became unable to access Cortex services within the PRODIRL environment.

Customer Impact:

The outage temporarily disrupted customers’ access to Cortex services within the PRODIRL01 environment.

Root Cause:

Several hours after infrastructure maintenance on the PRODIRL environment, the CoreDNS services entered an unexpected unhealthy state. The infrastructure team redeployed the service, after which all services became healthy again.

Remediations:

The infrastructure teams promptly addressed the issue by redeploying the CoreDNS build to the host cluster. This remediation action successfully restored access to Cortex services for all affected users.

Future Mitigating Actions:

  • Enhanced Monitoring and Alerting: Strengthen monitoring capabilities for CoreDNS, including real-time alerts for any deviations from expected behaviour or performance degradation.

  • Regular Version Updates and Testing: Implement a robust process for regular updates and thorough testing of CoreDNS versions within a controlled environment (e.g., staging) before deploying them to production.

  • Enhanced Runbooks: Slight differences between deployments delayed the identification and remediation of the affected services. The runbooks for these deployments have been updated to capture these differences. These deficiencies have been remediated in later builds of the Aera platform.
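
As an illustration of the monitoring action above, the following is a minimal sketch of a DNS resolution probe, not the tooling actually deployed on the Aera platform. The function names (probe_dns, unhealthy), the latency threshold and the example hostname are hypothetical. A real deployment would feed the result into whatever alerting system is already in use.

```python
import socket
import time
from typing import Callable, Dict, Iterable, Optional


def probe_dns(
    names: Iterable[str],
    resolve: Callable[[str], object] = lambda name: socket.getaddrinfo(name, None),
    max_latency_s: float = 1.0,  # hypothetical threshold, tune per environment
) -> Dict[str, Optional[str]]:
    """Resolve each name once. Maps name -> None if healthy, else a reason."""
    results: Dict[str, Optional[str]] = {}
    for name in names:
        started = time.monotonic()
        try:
            resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[name] = f"resolution failed: {exc}"
            continue
        # Latency is measured after the lookup returns; this does not
        # enforce a timeout on the lookup itself.
        elapsed = time.monotonic() - started
        if elapsed > max_latency_s:
            results[name] = f"slow lookup: {elapsed:.2f}s"
        else:
            results[name] = None
    return results


def unhealthy(results: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Keep only the names that failed or were slow to resolve."""
    return {name: reason for name, reason in results.items() if reason is not None}


if __name__ == "__main__":
    # Hypothetical name; substitute hostnames that matter in the environment.
    failures = unhealthy(probe_dns(["cortex.internal.example"]))
    for name, reason in failures.items():
        print(f"ALERT {name}: {reason}")
```

Running a probe like this on a schedule from inside the cluster would surface a DNS failure as soon as lookups begin to fail or slow down, rather than when dependent services become unreachable.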

January 28, 2025 · 17:17 UTC
Resolved

We have confirmed internally and with our customers that the Aera platform is now fully restored.

We appreciate your patience during this incident and apologise for any inconvenience that this issue may have caused. Our teams are now working on documenting a comprehensive root cause analysis which we will share with you shortly.

If you have any questions or experience any further problems, please don’t hesitate to reach out to our Support team via the Aera Support Portal.

January 20, 2025 · 15:57 UTC
Investigating

We are continuing to investigate the Cortex issues. Our engineers are actively working to restore service as quickly as possible. Thank you for bearing with us whilst we work through these issues.

January 20, 2025 · 12:02 UTC
Issue

This notice is to inform you that we are receiving reports of our customers experiencing difficulties with Cortex services. We are actively investigating and will provide regular updates until the issues are resolved.

We apologise for the inconvenience this may be causing and appreciate your patience as we investigate further.

January 20, 2025 · 11:33 UTC
