Reliable releases and rollbacks CRE life lessons By Adrian Hilton, Customer Reliability Engineer Editor�s note : One of the most common causes of service outages is releasing a new version of the service binaries; no matter how good your testing and QA might be, some bugs only surface when the affected code is running in production. Over the years, Google Site Reliability Engineering has seen many outages caused by releases, and now assumes that every new release may contain one or more bugs. As software engineers, we all like to add new features to our services; but every release comes with the risk of something breaking. Even assuming that we are appropriately diligent in adding unit and functional tests to cover our changes, and undertaking load testing to determine if there are any material effects on system performance, live traffic has a way of surprising us. These are rarely pleasant surprises. The release of a new binary is a common source of outages. From the point of v...