Why RTO explodes when architecture complexity grows

Jan 25, 2026

For years, I have been working on software and architecture modernization.
Mostly existing systems. Sometimes brand new products.

This week made one thing painfully clear: RTO is not an operational problem. It is an architectural one.

When modernization meets reality

I recently participated in a Chaos Engineering EBA (AWS Experience-Based Acceleration) with a customer in the UK.
Think AWS-style hackathon, but under real constraints, real pressure, and real failure scenarios.

Three teams were involved:

  • A red team actively attacking the system
  • A chaos engineering team injecting failures
  • A disaster recovery team trying to keep the platform alive

I was leading the disaster recovery effort.

What followed was not theoretical.

Database connections were nuked.
Availability Zones were taken down.
IAM access was tampered with.

While the system was under attack, we had to react live.

Disaster recovery under fire

In parallel with the failures, we were:

  • Fixing infrastructure via CloudFormation
  • Enabling AWS security services
  • Reviewing IAM permissions
  • Writing steering documents
  • Protecting the application while it was actively degrading

This is where architecture stops being slides and becomes muscle memory.

DORA, the EU Digital Operational Resilience Act, makes it explicit: restore procedures must be rehearsed regularly, not documented and forgotten.

RTO is not linear

One of the key observations from the exercise is simple to state, but hard to accept.

RTO does not grow linearly.

As the number of manual configuration parameters increases, recovery time explodes.

To illustrate this, consider the following model:

  • X = number of configuration parameters to restore manually
    Examples: IAM rules, disk mappings, networking settings, ordering constraints
  • RTO(X) = time needed to recover when X parameters are restored manually, measured in hours, to be compared against the recovery time objective

Figure: RTO growth as configuration complexity increases

At low values of X, recovery feels manageable.
Beyond a threshold, humans become the bottleneck.
Past that point, RTO exceeds any acceptable objective.
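
The exercise did not produce a formula, so the following is only a toy model to make the shape of the curve concrete. It assumes (my assumptions, not measurements) that every manual parameter takes a fixed number of minutes, that each one has a small independent chance of being set wrong, and that a mistake is only caught by the final validation, which forces a full replay. The names `expected_recovery_hours`, `max_manual_parameters`, `minutes_per_step` and `error_rate` are all hypothetical.

```python
def expected_recovery_hours(x, minutes_per_step=5.0, error_rate=0.02):
    """Expected recovery time in hours for x manual parameters (toy model).

    Assumption: one mistake anywhere invalidates the whole attempt, so the
    attempt succeeds with probability (1 - error_rate) ** x and the expected
    number of attempts is the inverse of that probability.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if minutes_per_step <= 0:
        raise ValueError("minutes_per_step must be positive")
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")

    attempt_hours = x * minutes_per_step / 60.0
    success_probability = (1.0 - error_rate) ** x
    return attempt_hours / success_probability


def max_manual_parameters(rto_hours, minutes_per_step=5.0, error_rate=0.02):
    """Largest x whose expected recovery time still fits the objective."""
    x = 0
    # The model grows without bound in x, so this loop always terminates.
    while expected_recovery_hours(x + 1, minutes_per_step, error_rate) <= rto_hours:
        x += 1
    return x


if __name__ == "__main__":
    for x in (10, 50, 100, 200):
        print(f"X={x:>3}  expected recovery = {expected_recovery_hours(x):.1f} h")
    print("Max X for a 4 h objective:", max_manual_parameters(4.0))
```

With a zero error rate the model collapses to a straight line. The blow-up comes entirely from replays, which is one plausible reading of "humans become the bottleneck". Real recoveries will not follow this curve exactly.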

Automation helps, but architecture defines the ceiling.

What Chaos Engineering really reveals

Chaos Engineering is not about breaking systems for fun.

It reveals:

  • Hidden manual steps
  • Fragile dependencies
  • Restore procedures that do not scale
  • Architectural debt disguised as operational knowledge

Many teams practice deployments daily.
Very few practice restores with the same intensity.
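
Practicing restores with that intensity means timing them. Here is a minimal sketch of a drill harness, assuming each restore step can be wrapped in a Python callable. The function name `run_restore_drill` and the step names are hypothetical placeholders, not part of any AWS tooling.

```python
import time


def run_restore_drill(steps, rto_seconds):
    """Run restore steps in order, time each one, and compare to the objective.

    steps: list of (name, callable) pairs; each callable performs one step.
    rto_seconds: recovery time objective for the whole drill, in seconds.
    """
    timings = []
    drill_start = time.perf_counter()
    for name, action in steps:
        step_start = time.perf_counter()
        action()  # an exception here aborts the drill, which is itself a finding
        timings.append((name, time.perf_counter() - step_start))
    total_seconds = time.perf_counter() - drill_start
    return {
        "timings": timings,
        "total_seconds": total_seconds,
        "within_rto": total_seconds <= rto_seconds,
    }


if __name__ == "__main__":
    # Placeholder steps; a real drill would call the actual restore automation.
    drill = [
        ("restore database snapshot", lambda: time.sleep(0.02)),
        ("reapply IAM policies", lambda: time.sleep(0.01)),
    ]
    report = run_restore_drill(drill, rto_seconds=1.0)
    for name, seconds in report["timings"]:
        print(f"{name}: {seconds:.3f} s")
    print("within RTO:", report["within_rto"])
```

Keeping the per-step timings from each drill shows which step drifts over time, long before the total breaks the objective.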

Key takeaways

  • Resilience is architecture
  • Security is architecture
  • Compliance is architecture
  • Manual recovery steps are measurable risk
  • RTO grows faster than teams expect

If you want to improve your DORA posture, do not start with tools.
Start by reducing what must be restored manually.
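
One way to make "what must be restored manually" measurable is to keep the runbook as data and count it. This is a small sketch under my own assumptions: the `RestoreStep` type, its fields, and the sample runbook entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreStep:
    name: str
    manual: bool         # True if a human has to perform this step
    parameters: int = 1  # values a human must supply or check by hand


def manual_parameter_count(steps):
    """Total number of parameters restored by hand (the X in the model above)."""
    return sum(step.parameters for step in steps if step.manual)


def automation_candidates(steps):
    """Manual steps, ordered so the ones with the most parameters come first."""
    manual_steps = [step for step in steps if step.manual]
    return sorted(manual_steps, key=lambda step: step.parameters, reverse=True)


if __name__ == "__main__":
    # Illustrative runbook, not taken from the exercise.
    runbook = [
        RestoreStep("redeploy stack from template", manual=False),
        RestoreStep("re-create IAM rules", manual=True, parameters=12),
        RestoreStep("remap disks", manual=True, parameters=4),
    ]
    print("manual parameters:", manual_parameter_count(runbook))
    for step in automation_candidates(runbook):
        print(f"automate next: {step.name} ({step.parameters} parameters)")
```

Tracking that single number from one drill to the next gives a trend to manage, independent of whichever tool performs the restore.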

Chaos Engineering is one of the most honest mirrors you can put in front of your system.

And it rarely lies.
