Why RTO explodes when architecture complexity grows

Jan 25, 2026

For years, I have been working on software and architecture modernization.
Mostly existing systems. Sometimes brand new products.

This week made one thing painfully clear: RTO is not an operational problem. It is an architectural one.

When modernization meets reality

I recently participated in a Chaos Engineering EBA (Experience-Based Acceleration) with a customer in the UK.
Think AWS-style hackathon, but under real constraints, real pressure, and real failure scenarios.

Three teams were involved:

A red team actively attacking the system
A chaos engineering team injecting failures
A disaster recovery team trying to keep the platform alive

I was leading the disaster recovery effort.

What followed was not theoretical.

Database connections were nuked.
Availability Zones were taken down.
IAM access was tampered with.

While the system was under attack, we had to react live.

Disaster recovery under fire

In parallel with the failures, we were:

Fixing infrastructure via CloudFormation
Enabling AWS security services
Reviewing IAM permissions
Writing steering documents
Protecting the application while it was actively degrading

This is where architecture stops being slides and becomes muscle memory.

DORA, the EU Digital Operational Resilience Act, makes it explicit: restore procedures must be replayed regularly, not documented and forgotten.

RTO is not linear

One of the key observations from the exercise is simple to state, but hard to accept.

RTO does not grow linearly.

As the number of manual configuration parameters increases, recovery time explodes.

To illustrate this, consider the following model:

X = number of configuration parameters to restore manually
Examples: IAM rules, disk mappings, networking settings, ordering constraints
RTO(X) = achievable recovery time for a given X, measured in hours, i.e. the lowest recovery time objective you can realistically commit to

Figure: RTO growth as configuration complexity increases

At low values of X, recovery feels manageable.
Beyond a threshold, humans become the bottleneck.
Past that point, RTO exceeds any acceptable objective.
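
To make the shape of RTO(X) concrete, here is a minimal sketch in Python. It is not the model behind the exercise, only one plausible way to get this kind of curve. The assumptions are mine: every parameter takes the same time to restore by hand, every parameter has the same independent chance of being entered wrong, and a single mistake is only noticed at the end, so the whole restore is replayed. The names minutes_per_param and error_rate and their default values are hypothetical.

```python
def expected_recovery_hours(x, minutes_per_param=5.0, error_rate=0.02):
    """Expected hours to restore x manual parameters.

    Assumed model (illustrative only):
    - one attempt costs x * minutes_per_param minutes
    - an attempt succeeds only if all x parameters are right,
      with probability (1 - error_rate) ** x
    - failed attempts are replayed from scratch, so the expected
      number of attempts is 1 / (1 - error_rate) ** x
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    attempt_minutes = x * minutes_per_param
    success_probability = (1.0 - error_rate) ** x
    return attempt_minutes / success_probability / 60.0


def max_parameters_within(objective_hours, limit=10_000, **model):
    """Largest x whose expected recovery time still fits the objective."""
    x = 0
    while x < limit and expected_recovery_hours(x + 1, **model) <= objective_hours:
        x += 1
    return x


# Print a few points of the curve under the assumed defaults.
for x in (5, 10, 25, 50, 100):
    print(f"X={x:>3}  RTO ~ {expected_recovery_hours(x):6.1f} h")

print("Max X for a 4 h objective:", max_parameters_within(4.0))
```

Under these assumptions, ten times more parameters costs far more than ten times the recovery time. The exact numbers do not matter. The shape does.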

Automation helps, but architecture defines the ceiling.

What Chaos Engineering really reveals

Chaos Engineering is not about breaking systems for fun.

It reveals:

Hidden manual steps
Fragile dependencies
Restore procedures that do not scale
Architectural debt disguised as operational knowledge

Many teams practice deployments daily.
Very few practice restores with the same intensity.

Key takeaways

Resilience is architecture
Security is architecture
Compliance is architecture
Manual recovery steps are measurable risk
RTO grows faster than teams expect

If you want to improve your DORA posture, do not start with tools.
Start by reducing what must be restored manually.
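
A simple way to start is to count what is manual today. The sketch below is only an illustration: the RestoreStep type, the runbook entries, and the parameter counts are hypothetical, not taken from the exercise. It computes X, the number of parameters a human must restore by hand, and ranks the manual steps so the heaviest ones can be automated first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreStep:
    name: str
    manual: bool      # True if a human performs the step during recovery
    parameters: int   # values a human must enter or verify by hand


def manual_parameter_count(steps):
    """X: total parameters that must be restored manually."""
    return sum(step.parameters for step in steps if step.manual)


def manual_steps_by_weight(steps):
    """Manual steps, heaviest first: the best candidates for automation."""
    manual = [step for step in steps if step.manual]
    return sorted(manual, key=lambda step: step.parameters, reverse=True)


# Hypothetical runbook, for illustration only.
RUNBOOK = [
    RestoreStep("Recreate IAM rules", manual=True, parameters=12),
    RestoreStep("Remap disks", manual=True, parameters=6),
    RestoreStep("Redeploy network stack via CloudFormation", manual=False, parameters=0),
    RestoreStep("Re-enter networking settings", manual=True, parameters=4),
]

print("X =", manual_parameter_count(RUNBOOK))
for step in manual_steps_by_weight(RUNBOOK):
    print(f"{step.parameters:>3}  {step.name}")
```

Tracking X over time turns "manual recovery steps are measurable risk" into a number you can review after every restore drill.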

Chaos Engineering is one of the most honest mirrors you can put in front of your system.

And it rarely lies.