Production Deployments Don’t Have to Be a High-Wire Act
It’s 3 AM in California, and you are still awake trying to sort through a release process that has been delayed by several hours. The deployment to the main application cluster took several extra hours due to an unanticipated problem with the servers, and now everyone is waiting on the lead database administrator to call into a conference bridge so you can all move on to Step 53.5b of the deployment. Your production deployments always seem to be problematic and this one might be the worst you’ve experienced yet.
Everyone’s Tired: Bad Decisions Abound
By “everyone,” I mean the 30 QA testers and 3 project managers in California, several developers in Sydney and India, and an operations team spread throughout the EU and the US. You can tell by some of the ambient noise on this conference call that people are starting to falter.
Your California team is up at 3 AM and exhausted to the point that they are making irrational decisions about risk and pushing changes without adequate testing. Some people on the bridge are yawning while others are just waking up. You work for an impressively global organization, but it’s times like these that make you realize your release process needs to be refactored to allow for more sleep.
What Time Zone is this Spreadsheet Using?
Your developers are attentive and well rested, but the deployment playbook, an Excel spreadsheet, lists a sequence of events to be completed with times like 10 AM and 11 AM: “10 AM shift all traffic to backup database. 10:30 AM upgrade database. 11:00 AM deploy new code to application servers.”
There’s no mention of a time zone in this spreadsheet, which means that the Sydney team is constantly asking, “This release plan is in PST, right?” You run a global company, but no one bothered to add a time zone to the release plan. Unfortunately, this caused problems when the team in India decided that “10 AM” meant “10 AM Bangalore time.” Twelve and a half hours before the release was even scheduled to start, your production database was taken offline and upgraded. You make a mental note to yourself: “specify time zones on all release spreadsheets.”
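To see how a bare “10 AM” turns into a twelve-and-a-half-hour head start, here is a minimal Python sketch using the standard library’s zoneinfo module. The release date, the choice of IANA zone names, and the helper names local_time and render_step are assumptions made for illustration; none of them come from the release plan in the story.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Hypothetical release date (the story gives none); California is on daylight time.
RELEASE_DAY = (2024, 6, 15)


def local_time(hour, minute, tz_name):
    """Build a time-zone-aware datetime on the release day in the given zone."""
    return datetime(*RELEASE_DAY, hour, minute, tzinfo=ZoneInfo(tz_name))


def render_step(when, zones):
    """Show one aware timestamp as local wall-clock time for each team's zone."""
    return {
        zone: when.astimezone(ZoneInfo(zone)).strftime("%Y-%m-%d %H:%M %Z")
        for zone in zones
    }


# What the plan meant versus how the Bangalore team read it.
intended = local_time(10, 0, "America/Los_Angeles")
misread = local_time(10, 0, "Asia/Kolkata")

# Subtracting aware datetimes compares the underlying instants, not the wall clocks.
gap_hours = (intended - misread).total_seconds() / 3600
print(f"Database taken offline {gap_hours} hours early")

# One unambiguous step, rendered for every team on the call.
teams = ["America/Los_Angeles", "Asia/Kolkata", "Australia/Sydney"]
for zone, stamp in render_step(intended, teams).items():
    print(f"{zone}: {stamp}")
```

The point of the sketch is that a step stored with its zone can be rendered for every team, so nobody has to ask which “10 AM” the plan means.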
“No one else can run Step 53.5b, really? Can someone call Jack?”
Six months ago you scheduled an emergency meeting to convey with a sense of urgency that every project must have a repeatable, automated release process. Most of the teams responded quickly: the application team perfected a series of scripts to automate deploys, but the DBAs are another story. Six months later you have a DBA on the phone informing you that no one other than the lead DBA – Jack – can run the necessary scripts to modify replication settings in production.
You make another note: “Tell DBA team manager to ensure that all production deployments have coverage.” Tired and annoyed, you tell the team, “Ok, can someone get Jack on the phone so we can move forward?” Minutes turn into hours. At 4 AM your team reports that Jack isn’t picking up his phone, and someone remembers that he might just be camping in the middle of the Australian outback. There’s no wireless signal within miles of Jack, so your entire release now depends on whether the intern can figure out what steps Jack would have performed on your production database. Everyone is blocked.
Time to Rollback?
You are beginning to have that sinking feeling that you might have to tell everyone to execute the rollback plan. That’s not something you want to start unless you have to, but it’s 4 AM and your deployment window is about to end. There’s one problem: no one bothered to update the rollback plan, so you are not even sure a rollback is possible.
A rollback also requires the input of management…
Another Broken Production Deployment? How Surprising.
The VP calls into the bridge to check on status and makes a sarcastic comment about how surprised he is that you are having problems. He isn’t really. He’s used to production deployments reflecting poorly on his leadership skills. In that sarcastic remark is a hidden message to you: fix this or I will find someone who can, and I’m not waiting much longer.
This happens every time your team does a release, and you keep on having to explain it away to management as something that comes with the territory of running a “complex system.” The only reliable constant in your organization is that your release process causes serious downtime, and after months of effort it still isn’t getting better.
Your valid excuse consists of pointing to the tools used to manage the release process: a combination of Excel spreadsheets, wiki pages, and Remedy tickets. You tell your management that teams can’t communicate with each other because there’s no coordination between systems. One team might be focused on JIRA and Rally while another might be focused on BMC Remedy. Instead of having one tool to hold teams accountable, your team ends up starting 120-person email threads, and you are sure several teams don’t read your release announcements.
A Hundred Wikis with an Excel Spreadsheet Playing Catch-up
The problem isn’t entirely tool-focused, but there’s no reliable way for your teams to surface management issues and other problems in a transparent way. Your broken deployment process is a symptom of a larger issue: you lack a consistent approach to deployments because your deployment process is spread across a hundred wiki pages, fifty development teams, and a high-level Excel spreadsheet that is always playing catch-up.
It doesn’t have to be like this. Your deployments don’t have to be a high-wire act that always seems to end in one of your systems failing in production.
With Plutora’s Deployment Manager you can keep track of your deployments alongside the releases and environments they are designed to orchestrate. You can identify the people responsible for each step and ensure that everyone understands exactly what is scheduled to happen and when. You’ll have a single source of truth for release status so that you don’t have to spend 40 hours on the phone asking, “What step are we on now?”
If you are looking for release playbooks that can be measured and evolved over time, it’s time to start using Plutora.