Small things that multiply our effort*
* Sometimes not that small.
Let’s consider how we deliver the bits around our code and how improving them might help us: from planning to deprecation and the things in between. Making small changes in these areas will help us and our team members work better. But first we need to talk about why.
Documenting the Why
Throughout most of our careers, we will not be working in a brand new greenfield environment, nor will we own our systems for their entire lifespan. We need to plan for new people joining our team and plan around the limits of our own memory. Catering to both of these aspects is the same process: we want to record our reasons. When we build out a new system or split out a function, let us record the business goals for why we did it. When we make architectural choices, we need a log recording why we did things in certain ways. This isn’t for oversight or audit (though it benefits both), but for future us.
When we talk about the business reasons for doing something, we want to include enough information to help the people who come later. We want to know what the business goals are, and what metrics and values define success or failure. This provides concrete measures for our post-implementation review. We also want to know how we measure the value provided by a system. We need to know what business value is provided to determine when the system is no longer optimal or when other systems can provide the value at lower cost.
We want to ensure that we cover the important factors in our record. What does success or failure look like? What was the problem we faced? What other options did we consider, and how did we make the decision? Finally, we need to think about deprecation plans and start planning a component’s deprecation when we plan its creation.
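One lightweight way to capture all of this is a short decision record kept alongside each choice. A minimal sketch follows; the section names and the scenario are illustrative, not a standard we must follow:

```markdown
# Decision: Split billing out of the monolith

## Problem
Billing changes currently require a full monolith release.

## Business goal and success measures
Cut billing change lead time to under a day; billing error rate stays under 0.1%.

## Options considered
1. Extract a billing service (chosen)
2. Modularise in place
3. Do nothing

## Why we chose this
Option 2 keeps billing coupled to the monolith's release cadence; option 3 fails the lead-time goal.

## Deprecation plan
Revisit when billing volume no longer justifies a separate service.
```

A record this small takes minutes to write and gives future us the success criteria, the alternatives, and the exit plan in one place.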
Regardless of how rigorous we are or how many people we consult, there will always be gaps and mistakes in these plans. The point is not to put down perfectly correct documentation, but to put something down so that future us can iterate and improve on it once we know more. So next time we start a feature, let’s make a small difference and write down the reasoning behind our choices. But we still need to put it somewhere, and that brings us to git.
Documentation lives with the Code
We likely use some form of version control system to store our code. It doesn’t matter if it’s Git or Mercurial, Perforce or TFS (Team Foundation Server). We want to minimise the impedance of keeping our documentation up to date. Keeping it with our source code gives us easy access and concurrent versioning and, importantly, keeps it in the same processes and tooling as our code. This applies to people writing Go or Ruby, YAML or Terraform. A single uniform place to make changes allows our reviewers to check whether we’ve updated the documentation. We can even build CI tooling that validates examples or runs checks around the documentation. Documentation errors become issues in our normal ticket and issue tools. We can use our CI & CD tooling to generate, validate and deploy these changes, ensuring that the content stays in sync.
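As a sketch of the kind of CI tooling this enables, here is a hypothetical check (the block format and function names are my own invention, not any particular tool) that pulls fenced python snippets out of a Markdown file and fails the build if any of them don’t parse:

```python
import re

# Hypothetical docs check: extract ```python fenced blocks from Markdown
# and make sure each one at least parses. Run it in CI over the docs tree.

def python_blocks(markdown: str) -> list[str]:
    """Return the body of every ```python fenced block."""
    return re.findall(r"```python\n(.*?)```", markdown, re.DOTALL)

def check_examples(markdown: str) -> list[str]:
    """Return a parse-error message per broken snippet (empty list = all good)."""
    errors = []
    for i, snippet in enumerate(python_blocks(markdown)):
        try:
            compile(snippet, f"<doc snippet {i}>", "exec")
        except SyntaxError as exc:
            errors.append(f"snippet {i}: {exc.msg}")
    return errors
```

A check like this, wired into the same pipeline that builds the code, is how documentation errors end up in our normal issue tooling rather than rotting quietly in a wiki.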
Many of these benefits also apply to other parts of our documentation: business cases and decision logs, run-books or API docs. A process that tightly integrates with our development process can easily outweigh the cost of teaching BAs, iteration managers or anyone else to update Markdown in the unlikely case that this is required. These are educable people, and we do a disservice to ourselves and them if we underestimate them.
We don’t even have to give up the traditional wiki presentation layer. If systems like Docsy are not suitable, we can use our build process to push into the CMS. Even the opaque box of Confluence has some tools available. These tools allow us to move to a new pattern while maintaining interoperability with the rest of the organisation.
Product Maturity and API support objectives
Hyrum’s law applies to everything we do; all we can hope to do is guide it. We guide it by being totally upfront about the expectations of our systems and services: explicitly calling out in documentation what stability guarantees we offer for both the API interface and our service level objectives. Kubernetes does this with their General Availability process, which allows experimentation in the Alpha stage, stabilisation in the Beta stage and compatibility guarantees in the Generally Available (GA) stage. Later we can seek agreement with our users around our deprecation process. With our new shared expectations around interface changes, we need to ensure we have shared expectations around the operation and performance of the API. This brings us to Service Level Objectives (SLOs).
Long before we have commercial requirements on our APIs and products, we need to know when our product is under-performing. We do that with Service Level Objectives (SLOs), which define our expected performance ranges. Combined with status pages, these demonstrate our capabilities and give reasoned confidence to our consumers. Inside our team, SLOs inform our prioritisation efforts. As we deliver features and value, our SLO tracking gives us confidence that we haven’t destabilised the product. If we start trending down, we change focus to stabilisation or remediation efforts. This smooths our experiments over time and helps with long-term feature planning.
We don’t want to aim for the high 9s of reliability or rapid response times at the start. Like most of this post, we’re talking about systems and processes that we iterate on. We start with safe SLOs that won’t aggravate our team or give false confidence to our users. Over time, as we learn more about our product and users, we make them tighter. Start with something easy and inaccurate and iteratively improve on it. For more on SLOs and how we can improve our process with them, I strongly recommend Implementing Service Level Objectives by Alex Hidalgo and the excellent SLIs, SLOs, SLAs, oh my video by Seth Vargo and Liz Fong-Jones.
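To make “safe SLOs” concrete, the arithmetic behind an error budget is simple enough to sketch. A minimal version (the function names are mine, not from any particular SLO tool):

```python
# Error budget: the unavailability an SLO target permits over a window.
# A 99.9% target over 30 days allows about 43 minutes of downtime;
# a looser starting target of 99% allows about 7.2 hours.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability over the window."""
    return window_days * 24 * 60 * (1 - slo_target)

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the window's budget still unspent (negative = SLO blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

When the remaining budget trends toward zero, that is the signal to shift focus from feature work to stabilisation. Starting at 99% instead of 99.9% gives a new team roughly ten times the room to learn without constantly paging anyone.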
Ownership and Responsibility
To give ourselves the best chance of delivering quality software, we need to reconsider how we address ownership of and responsibility for our products. Traditional organisations operating with a siloed resource model introduce substantial impedance to context sharing and trust. In many of these same organisations, structuring teams around short-term projects with hand-offs prevents retention of context or a sense of ownership in these products. If a team doesn’t feel they own a product, they won’t have the confidence to change it. If a team isn’t responsible for a product through its entire life-cycle, they lose information and context on the value it provides and how it operates. We need to look at how we manage these factors and how we can internalise them.
This includes the responsibilities that traditional siloed organisations would delegate to teams that may or may not be empowered to resolve issues. It may have originally been the responsibility of Operations¹ or Release Management, or even Security, Accessibility or Quality Assurance. If we want to own our product, we need to be responsible for these aspects. Handing them off to specialists and making them rote reviewers is a waste of their skills. We must consider these aspects throughout our development life-cycle. This doesn’t mean we don’t need these specialists; it means we need to maximise how we apply their knowledge.
As the specialists move from rote review to building practices and tooling that can be used throughout the org, we see a change in how we scale their knowledge. These practices begin to scale with the organisation. Further, as more staff are exposed to these aspects of software ownership, we get a wider review of the practices and greater testing of them. This is similar to how automated testing catches known bugs, preventing regression and freeing us to make new and exciting ~~bugs~~ features. The work of the specialists in our organisation is, of course, much like everything else: not a short-term project but a constant stream of work and exploration.²
But this is a post about small things, so let’s start with one. Go find one of the specialists in your path to production. Work with them, understand their processes and limitations, and then pair with them on smoothing them out. Perhaps this means expanding the test suite to cover more cases, smoothing out one part of the deployment pipeline, or even internalising part of their processes.
Always be deploying
I believe fostering an environment of early and frequent deploys helps to deliver many important outcomes. If we’re deploying our systems frequently, we’re going to find the sharp edges of our processes and be pushed to smooth them. These could be slow or cumbersome deployments and change systems. If we’re deploying multiple times a day, a slow deployment or onerous change process will rapidly become the constraint in our pipeline, and we’ll be forced to resolve it or at least reduce the pain it causes.
Frequent deployments help us deliver smaller changes as we don’t need to bundle them together. Smaller changes are easier to understand, easier to test and easier to revert. As our releases are smaller we can run canary releases and get significant data about our experiments faster.
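The decision a canary release automates can be sketched as a simple comparison: keep rolling out only while the canary’s error rate stays near the baseline’s. A toy version, with the function name and tolerance invented for illustration:

```python
def canary_healthy(canary_errors: int, canary_requests: int,
                   baseline_errors: int, baseline_requests: int,
                   tolerance: float = 0.01) -> bool:
    """Keep rolling out only if the canary's error rate is within
    `tolerance` of the baseline's error rate."""
    if canary_requests == 0 or baseline_requests == 0:
        return False  # not enough traffic yet to judge either side
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests
    return canary_rate <= baseline_rate + tolerance
```

Because small releases change little, a failed comparison points at a small diff, which is exactly what makes the rollback cheap.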
Frequent deployments are not only one of the key metrics but help every other aspect. If we’re deploying frequently, the process is at the front of our mind; it’s well tested, well understood, and we feel confident doing it. If deployment is part of our normal everyday work, we don’t want to waste time, so we make it faster. When we deploy every change, we make smaller releases. Smaller releases can be more reliable, more testable and easier to debug when they go wrong. Deploying frequently can force us to do it well.
Now we have a process that’s fast enough, reliable enough and one that we are confident using. Our product owners and managers love it because it lets us deliver features faster and more safely. A fast, safe, well-tested, well-understood process for pushing changes into production is an excellent tool for people responding to incidents: one path to production for business-as-usual changes and incidents.³ If you would like to know more, may I suggest the Continuous Delivery book by Jez Humble and Dave Farley, or any of the excellent work by Dr. Nicole Forsgren, such as Accelerate (co-starring Jez Humble and Gene Kim).
It’s on us (TL;DR)
So let’s take the smallest bits of the above practices and go do them.
- Write down why we’re building something and our success criteria.
- Stick it next to the code even if we publish it somewhere else later.
- Let’s be explicit about what our consumers can expect from our product and protect our experimental parts and the parts our consumers depend on.
- Let’s look to our path to production and see what sharp edges we leave for others and how we can reduce the toil we spread around the organisation.
- Deploy to prod.
To be continued
Like all of our backlogs, this isn’t a complete list, and it’s one that will grow and change over time. If I can leave you with one last thing: aim for changes that will help you own, communicate, understand and share. These will help you deliver, iterate and deprecate throughout your product’s lifespan.
This post builds on the advice, reviewing and critiquing of C and Delfick, and I have to thank everyone else who read earlier versions of this post.
1. The practice of putting devs on call is a contentious matter and something that has a bunch more context than I can include in this footnote. The TL;DR of the post on my todo list is: developers make better products when they are responsible for the production environment of their product, but this needs to be done in a way that doesn’t suck. On call shouldn’t suck for anyone. Any implemented system may or may not include developers being the first line of incident response at all times. But operations must never be the only ones responsible for the production environment or the pipeline that delivers products to it. It needs to be a shared responsibility that ops can hand back to the product owners when it’s being consistently shitty. ↩︎
2. Software Engineering at Google: The Limits of Automated Testing ↩︎
3. Having a common process for changes, regardless of incident response or BAU, makes the life of anyone on call much easier. ↩︎