PRODIRL - Cortex services are unavailable
Updates
Summary:
On 20th January 2025 at approximately 5:30 AM UTC, internal monitoring tools, internal teams, and customers were unable to access Cortex services within the PRODIRL environment.
Customer Impact:
The outage temporarily disrupted customers’ access to Cortex services within the PRODIRL01 environment.
Root Cause:
Several hours after infrastructure maintenance on the PRODIRL environment, the CoreDNS services entered an unexpected unhealthy state. The infrastructure team redeployed the service, after which all services became healthy again.
Remediations:
The infrastructure teams promptly addressed the issue by redeploying the CoreDNS build to the host cluster. This remediation action successfully restored access to Cortex services for all affected users.
Future Mitigating Actions:
- Enhanced Monitoring and Alerting: Strengthen monitoring capabilities for CoreDNS, including real-time alerts for any deviations from expected behaviour or performance degradation.
- Regular Version Updates and Testing: Implement a robust process for regular updates and thorough testing of CoreDNS versions within a controlled environment (e.g., staging) before deploying them to production.
- Enhanced Runbooks: Slight differences between deployments delayed the identification and remediation of the affected services. The runbooks for these deployments have been updated to capture those differences. These deficiencies have been remediated in later builds of the Aera platform.
We have confirmed internally and with our customers that the Aera platform is now fully restored.
We appreciate your patience during this incident and apologise for any inconvenience that this issue may have caused. Our teams are now working on documenting a comprehensive root cause analysis which we will share with you shortly.
If you have any questions or experience any further problems, please don’t hesitate to reach out to our Support team via the Aera Support Portal.
We are continuing to investigate the Cortex issues. Our engineers are actively working to restore service as quickly as possible. Thank you for bearing with us whilst we work through these issues.
This notice is to inform you that we are receiving reports of our customers experiencing difficulties with Cortex services. We are actively investigating and will provide regular updates until the issues are resolved.
We apologise for the inconvenience this may be causing and appreciate your patience as we investigate further.