Atera Root Cause Analysis (RCA) Report - July 28, 2025

gilgi Internal Posts: 581 ✭✭✭

Hello Atera Community,
This is a cross-post for both our public r/Atera subreddit and the Community Forums.

Over the past couple of weeks, we had several outages that disrupted our service for some of our customers.

As a community, you came through: you flagged the problem and confirmed with your peers that there was an issue. Thank you once again; it’s one of your strengths as a community!
Along with the support tickets, this was a main factor in pushing the issue forward and getting the developers to investigate.

Below are the threads associated with the incidents:
Forums thread
Reddit threads #1 #2 #3

I wanted to share our RCA report which provides a better technical explanation of the events, as well as the next steps to mitigate future incidents stemming from the root cause.
To keep the formal RCA formatting simple in these channels, I’m also including the report below as part of this post. You can reach out to your CSM or open a support ticket for the formal document.

As part of my own little ‘retro’ - I’ve taken your notes and feedback regarding updating both channels.
Reddit has the advantage that I can post updates on the go from my mobile, while the forums have an integrated pop-up when the status page is updated, so I’ll do my best to combine the strengths of each space.

My commitment remains to keep you up to date, sharing information as soon as it’s approved.

Atera Root Cause Analysis (RCA) Report


General information

Report title: Atera agents incorrectly displayed as offline

Date of report: July 28, 2025

Incident date/time: July 17th to July 24th, 2025

Prepared by: Office of the CTO


Incident description

Summary: 

Atera experienced several incidents where a large number of agents were inaccurately flagged as offline. This led to incorrect availability data across the platform, impacting real-time monitoring and visibility for end-users.

Impact:  

  • Customers received erroneous alerts indicating agents were offline when they were operational.
  • This led to a surge in customer support inquiries and a decline in confidence regarding agent monitoring.
  • The incident heightened the risk of overlooking critical alerts or misinterpreting system health.

Root cause

The problem stemmed from inaccurate agent status messages transmitted through a third-party vendor. The vendor has confirmed that the issue originated on their end and has acknowledged an internal problem affecting presence accuracy.

Mitigation and action plan

Double-verification logic: To mitigate risks and enhance reliability, the Atera team has implemented double-verification logic on the platform. This new measure validates an agent’s status before any update is applied.

Up-to-date and granular signals: The heartbeat frequency has been increased to ensure more frequent and detailed signals are sent to the vendor.

Presence indicators: Added further presence indicators to help triangulate true agent status.

Internal fallback checks: Lowered the interval for Atera’s internal fallback checks, allowing for faster correction of inaccurate offline flags.
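
For readers who want a concrete picture, the double-verification idea above can be sketched roughly as follows. This is an illustrative sketch only, not Atera's actual implementation: the names AgentState and verify_status_update, the 90-second grace window, and the choice to accept "online" reports without a second check are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AgentState:
    agent_id: str
    online: bool
    last_heartbeat: float  # epoch seconds of the last heartbeat we recorded ourselves


# Hypothetical grace window; the real interval is not stated in the report.
HEARTBEAT_GRACE_SECONDS = 90.0


def verify_status_update(state: AgentState, vendor_says_online: bool, now: float) -> bool:
    """Return the status to store after checking the vendor's report
    against a second, independent signal (our own heartbeat record)."""
    if vendor_says_online:
        # Assumption: an "online" report is accepted as-is.
        return True
    # The vendor reports "offline": only apply it if our own heartbeat
    # record is also stale. Otherwise keep the agent marked online.
    heartbeat_is_fresh = (now - state.last_heartbeat) <= HEARTBEAT_GRACE_SECONDS
    return heartbeat_is_fresh
```

In this sketch, an "offline" message from the vendor only takes effect once the agent's own recent heartbeat has also gone quiet, which is one way a single inaccurate presence message could be kept from triggering a false alert.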

Next steps

As a next step, the Atera team will continue to closely monitor agent status accuracy in real time to ensure consistency and reliability. We will also explore alternative or supplemental presence-detection mechanisms to strengthen system fault tolerance and minimize the risk of similar disruptions in the future.
