#NoDeployFriday: helpful or harmful?
A fun tweet, no?
Well, maybe not.
Should there be particular times in which production deploys are forbidden? Or is #NoDeployFriday a relic of a time before comprehensive integration tests and continuous deployment?
You may face a similar dilemma in your team. Who's right and who's wrong? Is not deploying on a Friday a sensible risk-averse strategy or is it a harmful culture that prevents us from building better and more resilient systems?
Ring ring
I'm sure that engineers who have had the pleasure of being on call have had their weekend ruined by a Friday change that's blown up. I've been there too. That robot phone call strikes during a family outing or in the middle of the night stating that the application is down. After scrambling to the computer to check rapidly filling logs, it becomes apparent that a subtle edge case and an uncaught exception have killed things. Badly.
On investigation, it turns out that there were no tests written for the scenario that caused the failure, presumably because it wasn't thought possible. After a variety of day-ruining calls to other engineers to work out the best way to revert the change and fix the mess, everything's back online again. Phew.
A Five Whys session happens on the following Monday.
"Let's just stop deploying on Friday. That way everything'll be stable going into the weekend, and we'll be around during the week after all of our deploys."
Everyone nods. If it hasn't hit production by Thursday afternoon, it's waiting until Monday morning. But is this behavior doing more harm than good?
As we all know, interactions on Twitter are often strongly opinionated. Although the logic behind forbidding Friday deploys may seem reasonable, others are quick to point out that it's a band-aid over underlying fragility in a platform, caused by poor tests and bad deployment processes.
Some go as far as to suggest that the pleasure of worry-free deployment is better than the weekend itself.
Another user suggests that feature flags are probably the solution to this problem.
This user suggests that risky deploys shouldn't be an issue with the processes and tooling that we have available today.
Who decides?
What these exchanges highlight is that, as a community of engineers, we can be strongly opinionated and not necessarily agree. Who'd have thought? It perhaps also shows that the full picture of #NoDeployFriday contains nuances of arguments that don't translate too well to Twitter. Is it correct that we should all be practicing continuous delivery, else we're "doing it wrong"?
One aspect is the psychology involved in the decision. The aversion to Friday deployments stems from a fear of making mistakes due to the time of the week (tiredness, rushing) and also the potential that those mistakes might cause harm while most staff are getting two days of rest. After all, the Friday commit that contains a potential production issue could end up bothering a whole host of people over the weekend: the on-call engineer, the other engineers that get contacted to solve the problem, and perhaps even specialist infrastructure engineers who mend corrupted data caused by the change. If it blows up badly, then others in the business may potentially need to be involved for client communications and damage limitation.
Taking the stance of the idealist, we could reason that in a perfect world with perfect code, perfect test coverage and perfect QA, no change should ever go live that causes a problem. But we are humans, and humans will always make mistakes. There's always going to be some bizarre edge case that doesn't get picked up during development. That's just life. So #NoDeployFriday makes sense, at least theoretically. However, it's a blunt instrument. I would argue that we should consider changes on a case-by-case basis: our default stance should be to deploy them whenever, even on Fridays, while being able to isolate the few that should wait until Monday instead.
There are some considerations that we can work with. I've grouped them into the following categories:
Understanding the blast radius of a change
The maturity of the deployment pipeline
The ability to automatically detect errors
The time it takes to fix problems
Let's have a look at these in turn.
Understanding the blast radius
Something vital is always missed when differences of opinion butt heads online about Friday deploys: the nature of the change itself. No two changes to a codebase are equal. Some commits make small changes to the UI and nothing else. Some refactor hundreds of classes with no changes in the functionality of the program. Some alter database schemas and make breaking changes to how a real-time data ingest works. Some may restart one instance whereas others may trigger a rolling restart of a global fleet of different services.
Engineers should be able to look at their code and have a good idea of the blast radius of their change. How much of the code and application estate is affected? What could fail if this new code fails? Is it just a button click that will throw an error, or will all new writes get dropped on the floor? Is the change in one isolated service or have many services and dependencies changed in lockstep?
I can't see why anyone would be averse to shipping changes with small blast radii and straightforward deployment at any time of the week, yet I would expect major - especially storage infrastructure-related - changes to a platform to be considered more carefully, perhaps being done at the time when the fewest users are online. Even better, such large-scale changes should run in parallel in production so that they can be tested and measured with real system load without anyone ever knowing.
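To make "run in parallel in production" a little more concrete, here is a minimal sketch of one way to do it: serve every request from the existing code path, quietly run the new one alongside, and record any differences. The names shadow_call, primary, candidate and mismatches are hypothetical and chosen for illustration. A real setup would more likely run the candidate asynchronously and ship the comparisons to a metrics system.

```python
from typing import Any, Callable, List, Tuple


def shadow_call(
    primary: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    request: Any,
    mismatches: List[Tuple[Any, Any, Any]],
) -> Any:
    """Answer from the primary path; run the candidate only for comparison."""
    result = primary(request)
    try:
        shadow_result = candidate(request)
        if shadow_result != result:
            # Record the disagreement so it can be reviewed later.
            mismatches.append((request, result, shadow_result))
    except Exception as exc:
        # A failure in the candidate must never reach the user.
        mismatches.append((request, result, exc))
    return result


if __name__ == "__main__":
    log: List[Tuple[Any, Any, Any]] = []
    old_total = lambda order: order["qty"] * order["price"]
    new_total = lambda order: order["qty"] * order["price"] + order["shipping"]
    print(shadow_call(old_total, new_total, {"qty": 2, "price": 5, "shipping": 1}, log))
    print(log)
```

Users only ever see the result of the primary path, so the new code can be measured against real traffic without its bugs becoming anyone's weekend.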
Good local decisions are key here. Does each engineer understand the blast radius of their changes in the production environment and not just on their development environment? If not, why not? Could there be better documentation, training and visibility into how code changes impact production?
Tiny blast radius? Ship it on Friday.
Gigantic blast radius? Wait until Monday.
The maturity of the deployment pipeline
One way of reducing risk is by continually investing in the deployment pipeline. If getting the latest version of the application live still involves specialist knowledge of which scripts to run and which files to copy where, then it's time to automate, automate, automate. The quality of tools in this area has improved greatly over the last few years. We've been using Jenkins Pipeline and Concourse a lot, which allow the build, test and deploy pipeline to be defined as code.
The process of fully automating your deployment is interesting. It lets you step back and try to abstract what should be going on from the moment that a pull request is raised through to applications being pushed out into production. Defining these steps in code, such as in the tools mentioned previously, also lets you generalize your step definitions and reuse them across all of your applications. It also does wonders at highlighting some of the wild or lazy decisions you've made in the past and have been putting up with since.
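As a rough illustration of what "defining these steps in code" and reusing them can look like, here is a small sketch. It is deliberately not Jenkins Pipeline or Concourse syntax. Step, run_pipeline and standard_pipeline are hypothetical names, and each step is assumed to be a function that returns True on success.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], bool]  # assumed to return True on success


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order, stopping at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for step in steps:
        if not step.run():
            break  # later steps (such as deploy) are never attempted
        completed.append(step.name)
    return completed


def standard_pipeline(
    build: Callable[[], bool],
    test: Callable[[], bool],
    deploy: Callable[[], bool],
) -> List[Step]:
    """One shared definition that every application can reuse."""
    return [Step("build", build), Step("test", test), Step("deploy", deploy)]


if __name__ == "__main__":
    steps = standard_pipeline(lambda: True, lambda: False, lambda: True)
    print(run_pipeline(steps))  # the failing test step stops the deploy
```

The point is less the code itself than the shape: one definition of the build, test and deploy sequence, with each application supplying only the parts that differ.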
For every engineer that has read the previous two paragraphs and reacted in a way such as "But of course! We've been doing that for years!", I can guarantee you that there are nine others picturing their application infrastructure and grimacing at the amount of work that it would take to move their system to a modern deployment pipeline. Such a pipeline takes advantage of the latest tools that not only perform continuous integration, but also allow continuous deployment by publishing artifacts and allowing engineers to press a button to deploy them into production (or even automatically, if you're feeling brave).
Investing in the deployment pipeline needs buy-in, and it needs proper staffing: it's definitely not a side-project. Having a team dedicated to improving internal tooling can work well here. If they don't already know the pressing issues - and they probably do - they can gather information on the biggest frustrations around the deployment process, then prioritize them and work with teams on fixing them. Slowly but surely, things will improve: code will move to production faster and with fewer problems. More people will be able to learn best practice and make improvements themselves. And as things improve, practices begin to spread, and that new project will get done the right way from the start, rather than copying old bad habits ad infinitum.
The journey between a pull request being merged and the commits going live should be automated to the point that you don't need to think about it. Not only does this help isolate real problems in QA, since the changed code is the only variable, it also makes the job of writing code much more fun. The power to deploy to production becomes decentralized, increasing individual autonomy and responsibility, which in turn breeds more considered decisions about when and how to roll out new code.
Solid deployment pipeline? Deploy on Friday.
Copying scripts around manually? Wait until Monday.
The ability to detect errors
Deployment to production doesn't stop once the code has gone live. If something goes wrong, we need to know, and preferably we should be told rather than needing to hunt out this information ourselves. This involves the application logs being automatically scanned for errors, the explicit tracking of key metrics (such as messages processed per second, or error rates), and an alerting system that lets engineers know when there are critical issues or particular metrics that have shown a trend in the wrong direction.
Production is always a different beast to development, and engineers should be able to view the health of the parts of the system they care about, and also be able to compose dashboards that allow them to view trends over time. These dashboards should allow questions to be answered about each subsequent change: has it made the system faster, or slower? Are we seeing more or fewer timeouts? Are we CPU bound or I/O bound?
Tracking of metrics and errors should also feed into an alerting system. Teams should be able to identify which signals from telemetry mean that something bad has happened, and route automated alerts through to a pager system. We happen to use PagerDuty for our teams and top-level major incident rota.
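As a sketch of what turning a telemetry signal into an alert can mean, the snippet below tracks the error rate over a sliding window of recent requests and reports when it crosses a threshold. ErrorRateAlert, the window size and the threshold are illustrative assumptions. The hand-off to a pager system is left out on purpose, since that depends entirely on the tool in use.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Record one request outcome; return True if an alert should fire."""
        self.outcomes.append(ok)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet to judge a trend
        errors = sum(1 for outcome in self.outcomes if not outcome)
        return errors / len(self.outcomes) > self.threshold


if __name__ == "__main__":
    alert = ErrorRateAlert(window=10, threshold=0.2)
    for ok in [True] * 10 + [False] * 3:
        if alert.record(ok):
            print("error rate above threshold: page the on-call engineer")
```

In practice the thresholds come from knowing what normal looks like for each service, which is another reason to track these metrics long before anything goes wrong.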
A focus on measurement of production system metrics means that engineers can see if something has changed as the result of each deployment, whether that change is for the better or for worse, and in the absolute worst case, the system will automatically let somebody know if something has broken.
Good monitoring, alerts and on-call rota? Deploy on Friday.
Scanning through the logs manually via ssh? Wait until Monday.
The time it takes to fix problems
Finally, a key consideration is the time that it will take to fix problems, which is somewhat related to the blast radius of a change. Even if you have a slick deployment pipeline, some changes to your system may be tricky to fix quickly. Reverting a change to the data ingest and to the schema of a search index might involve an arduous reindex as well as fixing the actual line of code. The average time to deploy, check, fix and redeploy a CSS change may be a matter of minutes, whereas a bad change to underlying storage might require days of work.
For all of the deployment pipeline work that can increase the reliability of making changes at a macro level, no two changes are equal, so we need to consider them individually. If something goes wrong, can we quickly make this right?
Totally fixable with one revert commit? Deploy on Friday.
Potentially a massive headache if it goes wrong? Wait until Monday.
Picking your battles
So where am I on #NoDeployFriday? My answer is that it depends on each individual deploy. For changes with a small blast radius that are straightforward to revert if needed, I'm all for deploying at any time of the day or week. For any big, major version changes that need their effects carefully monitored in the production system, I'd strongly advise waiting until Monday instead.
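If it helps, the four considerations can be read as a checklist. The sketch below encodes them as plain yes/no answers and lists the reasons to wait; an empty list means ship it. Change and reasons_to_wait are hypothetical names, and real decisions will rarely be this binary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    small_blast_radius: bool      # e.g. a UI tweak rather than a schema change
    automated_pipeline: bool      # no copying scripts around manually
    monitored_with_alerts: bool   # metrics, alerting and an on-call rota exist
    quick_to_revert: bool         # fixable with one revert commit


def reasons_to_wait(change: Change) -> List[str]:
    """Return the reasons to hold a deploy until Monday (empty means go ahead)."""
    reasons: List[str] = []
    if not change.small_blast_radius:
        reasons.append("large blast radius")
    if not change.automated_pipeline:
        reasons.append("manual deployment process")
    if not change.monitored_with_alerts:
        reasons.append("no automated monitoring or alerting")
    if not change.quick_to_revert:
        reasons.append("slow or difficult to revert")
    return reasons


if __name__ == "__main__":
    css_fix = Change(True, True, True, True)
    reindex = Change(False, True, True, False)
    print(reasons_to_wait(css_fix))
    print(reasons_to_wait(reindex))
```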
Fundamentally, whether or not you deploy on Friday is up to you. If you're struggling with a creaking and fragile system, then maybe don't until the necessary work has been put in to improve how changes go into production. But do that work - don't put it off. Banning Friday deploys as a band-aid to cover temporary underinvestment in infrastructure is fine. That's sensible damage limitation for the good of the business. However, using it to cover permanent underinvestment is bad.
If you're not entirely certain of the potential effects of a change, then delay until Monday. But work out what could be in place next time to make those effects clearer, and invest in the surrounding infrastructure to make it so. As with life, there are always nuances in the details of each of our decisions. It's not black or white, wrong or right: as long as we're doing our best for the business, our applications and each other, whilst improving our systems as we go along, then we're doing just fine.
Happy deploying.