7 best practices to shift focus to scalability, reliability and security

Nov 17

As digital products and services become more deeply embedded in critical industries and infrastructure, and the consequences of failures grow in scope, engineering organizations are renewing their focus on building platforms that are scalable, reliable and secure. This is an important shift: from teams and cultures that valued speed of delivery above all else to teams and cultures that make high-quality systems the foundation of everything they do.

Here are 7 best practices you can implement to begin shifting focus away from speed and toward scalability, reliability and security.

1. Evangelize the Narrative That the Systems You Work on Matter

At the end of the day, your systems are there to serve your customers and users. Leadership and teammates who embody these values are incredibly important. People should worry about letting their customers down, not about being yelled at by their manager or receiving a bad performance review.

2. Set Explicit Expectations about Uptime

When you say that your app needs to have 99.99% uptime, it doesn’t have the same impact as saying it can be down for only about 52.6 minutes a year. Translate uptime numbers into tangible terms so they are on people’s minds every time there’s an outage.
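
The conversion from an uptime percentage to a downtime budget is simple arithmetic, and it can help to keep a small helper around for it. The sketch below is only an illustration: the function name `allowed_downtime_minutes` is made up for this example, and it assumes an average year of 365.25 days.

```python
# Assumption: an average year of 365.25 days (accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, implied by an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (1 - uptime_percent / 100)


# Print the yearly downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime -> {minutes:.1f} minutes of downtime per year")
```

For a 99.99% target this works out to roughly 52.6 minutes a year, which matches the figure above. Passing a different `period_minutes` (for example, the minutes in a 30-day month) gives the budget for a shorter window.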

3. Don’t Sweep Problems Under the Rug

After the dust settles from an incident, it’s important to get together and talk about it. Be transparent about what happened, which person or team made the error and what the impact on users was. The idea isn’t to point fingers or assign blame but to bring the team together so that you can learn, evolve and figure out how to prevent the same problem from happening again.

Here are some questions you can use:

Who wrote the code?
How was it tested?
Should our automated testing have caught this?
Who reviewed this code?
Were they the right people to review the code?
Why wasn’t it caught in the staging environment?
Who released it and who went through the release validation checklist?

Formalizing this process lets everyone on the team know that mistakes are part of the job and that there’s a standard way that problems are handled. This is a great way to help engineers feel free to do their best work without fear or avoidance.

4. Define Individual Owners for Every Piece of the System

If there’s any confusion about who owns what within your engineering org, you need to address that right away; it’s a massive problem. Not only do you need an owner for every part of the system, but they must also be empowered to fix issues and know that they’re the ones on the hook.

5. Invest in Documentation, Runbooks and Logs

The owner of each piece of the application is also responsible for making sure that documentation is up to date, runbooks are clear and easy to follow and logs are readily accessible. When you’re getting paged at 3:00 a.m. about a service you’ve never heard of before, these resources are key.

6. Prioritize the Delivery Pipeline Like It’s a Key Feature

You should always be thinking about how to build safeguards around your workflows, like automated testing or new tools. This may mean you need to go toe-to-toe with product managers to make sure the work required to build a robust delivery system isn’t getting kicked down to the backlog.

7. Get Business Buy-In by Using Terms They Understand

Unreliable code is technical debt. And debt is something all leaders understand, regardless of their technical abilities. Stripe published a survey in 2018 that found the average developer spends more than 17 hours a week dealing with maintenance issues, costing $300 billion per year globally. If you’re pushing to invest in SRE practices, those numbers might help you make your case.
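
To put the debt in terms of your own organization, you can run the same kind of estimate for a single team. The sketch below is a rough illustration, not a method from the survey: the 17 hours a week comes from the figure quoted above, while the team size, the hourly cost and the 48 working weeks are hypothetical placeholders you would replace with your own numbers.

```python
def annual_maintenance_cost(developers: int,
                            hours_per_week: float,
                            hourly_cost: float,
                            working_weeks: int = 48) -> float:
    """Estimate the yearly cost of time spent on maintenance issues."""
    return developers * hours_per_week * hourly_cost * working_weeks


# Hypothetical inputs: 20 developers at a fully loaded cost of $75 per hour.
cost = annual_maintenance_cost(developers=20, hours_per_week=17, hourly_cost=75)
print(f"Estimated annual maintenance cost: ${cost:,.0f}")
```

With these placeholder inputs the estimate is $1,224,000 a year. The exact figure matters less than having a number that business leaders can weigh against the cost of the reliability work you are proposing.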

Culture comes first, but the right tools can reinforce the culture and drive accountability over time. Tools that help teams keep track of critical information such as service owners, documentation and runbooks can help tremendously. Additionally, having easily available dependency maps of the architecture is key to quickly understanding complex systems. Lastly, setting organization-wide standards for best practices, and providing templates for spinning up new services that follow those standards, can help ensure that nothing is forgotten.
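
As one illustration of what such tooling might check, the sketch below models a tiny service catalog and flags entries that lack an owner, documentation or a runbook, or that depend on a service the catalog doesn't know about. Everything in it is hypothetical: the `ServiceEntry` fields, the service names and the example.com URLs are invented for this example and don't come from any particular tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ServiceEntry:
    """One record in a hypothetical service catalog."""
    name: str
    owner: Optional[str] = None
    docs_url: Optional[str] = None
    runbook_url: Optional[str] = None
    depends_on: Tuple[str, ...] = ()


# Assumed organization-wide standard: every service needs these fields.
REQUIRED_FIELDS = ("owner", "docs_url", "runbook_url")


def audit(catalog: List[ServiceEntry]) -> Dict[str, List[str]]:
    """Map each non-compliant service name to a list of its problems."""
    known = {entry.name for entry in catalog}
    report: Dict[str, List[str]] = {}
    for entry in catalog:
        problems = [f"missing {name}" for name in REQUIRED_FIELDS
                    if not getattr(entry, name)]
        problems += [f"unknown dependency {dep}" for dep in entry.depends_on
                     if dep not in known]
        if problems:
            report[entry.name] = problems
    return report


catalog = [
    ServiceEntry("checkout", owner="payments-team",
                 docs_url="https://example.com/docs/checkout",
                 runbook_url="https://example.com/runbooks/checkout",
                 depends_on=("inventory",)),
    ServiceEntry("inventory", owner="warehouse-team",
                 depends_on=("pricing",)),
]

for service, problems in audit(catalog).items():
    print(f"{service}: {', '.join(problems)}")
```

A check like this could run in the delivery pipeline or on a schedule, so that gaps in ownership and documentation surface before someone is paged at 3:00 a.m.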

Source: devops.com