For years, I have been working on software and architecture modernization.
Mostly existing systems. Sometimes brand-new products.
This week made one thing painfully clear: RTO (recovery time objective) is not an operational problem. It is an architectural one.
When modernization meets reality
I recently participated in a Chaos Engineering EBA (Experience-Based Acceleration) with a customer in the UK.
Think AWS-style hackathon, but under real constraints, real pressure, and real failure scenarios.
Three teams were involved:
- A red team actively attacking the system
- A chaos engineering team injecting failures
- A disaster recovery team trying to keep the platform alive
I was leading the disaster recovery effort.
What followed was not theoretical.
Database connections were nuked.
Availability Zones were taken down.
IAM access was tampered with.
While the system was under attack, we had to react live.
Disaster recovery under fire
In parallel with the failures, we were:
- Fixing infrastructure via CloudFormation
- Enabling AWS security services
- Reviewing IAM permissions
- Writing steering documents
- Protecting the application while it was actively degrading
This is where architecture stops being slides and becomes muscle memory.
DORA, the EU Digital Operational Resilience Act, makes it explicit: restore procedures must be replayed, not documented and forgotten.
RTO is not linear
One of the key observations from the exercise is simple to state, but hard to accept.
RTO does not grow linearly.
As the number of manual configuration parameters increases, recovery time explodes.
To illustrate this, consider the following model:
- X = number of configuration parameters to restore manually
  Examples: IAM rules, disk mappings, networking settings, ordering constraints
- RTO(X) = recovery time for X parameters, measured in hours and compared against the recovery time objective

At low values of X, recovery feels manageable.
Beyond a threshold, humans become the bottleneck.
Past that point, RTO exceeds any acceptable objective.
Automation helps, but architecture defines the ceiling.
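The shape of that curve can be sketched with a toy model. This is an illustration under stated assumptions, not data from the exercise: every manual parameter takes a fixed number of minutes, each one has a small independent chance of being restored incorrectly, and a single mistake forces the whole pass to be repeated. The function names, the 5 minutes per parameter, and the 3% error rate are all hypothetical.

```python
def estimated_recovery_hours(x: int,
                             minutes_per_param: float = 5.0,
                             error_rate: float = 0.03) -> float:
    """Hypothetical model: expected hours to restore x manual parameters.

    Assumptions (illustrative only):
    - each parameter costs `minutes_per_param` of human time,
    - each parameter is wrong with independent probability `error_rate`,
    - any mistake means the whole restore pass is repeated.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    one_pass_minutes = x * minutes_per_param
    p_clean_pass = (1.0 - error_rate) ** x      # all x parameters correct
    expected_passes = 1.0 / p_clean_pass        # mean of a geometric distribution
    return one_pass_minutes * expected_passes / 60.0


def max_params_within(target_hours: float, limit: int = 10_000, **model_kwargs) -> int:
    """Largest x whose estimated recovery time still fits the target."""
    x = 0
    while x < limit and estimated_recovery_hours(x + 1, **model_kwargs) <= target_hours:
        x += 1
    return x


if __name__ == "__main__":
    for x in (5, 10, 25, 50, 100):
        print(f"X={x:>3}  estimated recovery: {estimated_recovery_hours(x):7.1f} h")
    print("Max X for a 4 h objective:", max_params_within(4.0))
```

With these made-up defaults the estimate is roughly an hour for 10 parameters and several days for 100. The exact numbers mean nothing; the point is that the linear term is multiplied by a retry factor that grows exponentially with X.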
What Chaos Engineering really reveals
Chaos Engineering is not about breaking systems for fun.
It reveals:
- Hidden manual steps
- Fragile dependencies
- Restore procedures that do not scale
- Architectural debt disguised as operational knowledge
Many teams practice deployments daily.
Very few practice restores with the same intensity.
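One way to make hidden manual steps visible is to keep the restore runbook as data and count what a human still has to do by hand. The sketch below is a minimal example of that idea; the step names and parameter counts are invented for illustration and do not describe the customer's platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreStep:
    name: str
    manual: bool       # True if a human must perform or verify the step
    parameters: int    # values that must be supplied or checked by hand


def manual_parameter_count(steps: list[RestoreStep]) -> int:
    """X for this runbook: parameters that are still restored manually."""
    return sum(step.parameters for step in steps if step.manual)


# Hypothetical runbook, for illustration only.
RUNBOOK = [
    RestoreStep("recreate network stack from template", manual=False, parameters=0),
    RestoreStep("re-attach data volumes", manual=True, parameters=6),
    RestoreStep("restore IAM role policies", manual=True, parameters=12),
    RestoreStep("restart services in dependency order", manual=True, parameters=4),
]

if __name__ == "__main__":
    print("Manual parameters to restore (X):", manual_parameter_count(RUNBOOK))
    for step in RUNBOOK:
        if step.manual:
            print(f"  still manual: {step.name} ({step.parameters} parameters)")
```

Tracking this count from one restore drill to the next gives a simple, measurable signal of whether architectural changes are actually removing manual work.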
Key takeaways
- Resilience is architecture
- Security is architecture
- Compliance is architecture
- Manual recovery steps are measurable risk
- RTO grows faster than teams expect
If you want to improve your DORA posture, do not start with tools.
Start by reducing what must be restored manually.
Chaos Engineering is one of the most honest mirrors you can put in front of your system.
And it rarely lies.
