It is now time to add the concept of “DevOps-Driven Development” to our repertoire.
“Test-driven” development, which originated around the same time as Extreme Programming and Agile Development, encourages us to think about testing as we architect our software and plan our tasks. Similarly, a “DevOps-Driven Development” approach, ensures that we consider operational implementation as well as deployment process during the design phase. To be clear, DevOps thinking needs to augment (and not replace) testing strategy.
Definition and Motivation
First a definition: I am using the word DevOps here as a shortcut to include both DevOps (build and deployment tools) and Ops (IT/data center Operations).
How many times have you heard “ … but it works on my machine!!” from a developer whose code was found to have a bug in the QA environment or, worse, in production? We all agree that these situations are a horrible waste of time for all involved, most of all customers. This post thus advocates that DevOps-thinking, just as quality-thinking, must occur at the design phase and continue throughout the development of the software until the software is released to production, and even after it has been released in production.
Practicing DevOps-Driven Development
I have always advocated: “If you don’t know how to test it, you don’t know how to design it.” (Who Owns Quality? Part 3), to articulate the fact that “quality cannot be debugged out, it has to be designed in”. Similarly, if we want to know – before our customers call us – when our code crashes in Production, or becomes unusably slow, then we must build into our code the proper instrumentation and administration capabilities.
We now must add this mantra “If you don’t know how to deploy it and manage it in Production, you don’t know how to design it”.
Just like we don’t allow code to be merged into Trunk (main branch) without complete unit tests, code cannot be merged into Trunk without correct deployment scripts, release notes, and production instrumentation.
Here is a “thinking DevOps” check list:
First of all, we must ensure that the code deploys successfully not only in Production but in all environments: Dev, QA, Stage, etc
- Developers write/update release notes: e.g. highlighting any changes required in the configuration of the environments: open new port, add a column in database, a new property in config files, etc
- Developers in collaboration with DevOps team update deployment scripts, e.g. to account for a new executable, or schema changes in the database
The management of Config/Property files is beyond the scope of this blog, but I strongly recommend the “Infrastructure as code” approach: i.e. fully automating server/image configuration for deployment and, managing configuration, deployment scripts and application property files under source code control.
If we want to detect problems before our (irate) customers call us, our code needs to be monitor-able – not only at the physical server level, but also each virtual machine, service and process, as well as networking and storage systems.
Monitor-ability needs surpass keeping track of CPU load, disk space and network bandwidth. We, developers, (should) know what parameter(s) indicate when our system is mis-behaving, whether it is a queue exceeding a given size, or certain operations timing out. As a consequence, we must publish these parameters to interfaces compatible with Ops monitoring tools, of which there are several categories:
Furthermore, by making performance metrics easily observable, we ensure that each new release maintains (or improves) the performance of the prior release.
Despite our best intentions, we must humbly assume that at some point our code will crash, or seriously mis-behave, and thus require troubleshooting. In the worst case, Development will be called in (usually in the wee hours of the night) to assist the Ops team. As any one who has had to figure out why a given system intermittently crashes will attest, having log files capture meaningful information prior to the incident is invaluable. Having to add logging statements after-the-fact is a painful process. Consequently, a solid Logging Hygiene is critical (and worthy of a dedicated post):
- Log statements must be written in a format compatible with the log management system (Splunk, GrayLog2, …)
- All log statements used during the coding and QA phase must be removed
- Comprehensive Operations-focused logging must be added to document all operations that may fail due to environmental and data-related problems: out-of-memory, disk full, time out, user not found, access denied, etc. These are not bugs, but failures due to either environment (e.g. a server or connection is down) or incorrect data (e.g. the user has been deleted).
- The hierarchy of logging levels must be enforced so that in normal operations log files are kept small, and conversely meaningful information is output when troubleshooting is required
- Log statements must include all the information necessary to bind all operations across various services that are related to a single user-level transaction (e.g. clicking on a link to a new page, adding an item to cart) – more details below in “Tunable”.
This again is worthy of its own post, but code that is deployed to Production must both support the security practices implemented by the Ops team (e.g. Authentication protocols, networking infrastructure), and ensure that the code itself is secure (e.g. no SQL injection, buffer overflow, etc).
Business continuity is often overlooked, but we must ensure that any persistent data is stored in a storage system that is backed up by the Ops team. In other words, if we add a new database, we’d better ask the Ops team to add it to their backup scripts.
Similarly, if our infrastructure is deployed (or even just deployable) across multiple data-centers, our code must support this though configuration.
The above requirements represent the basic DevOps requirements that any developer must address before even thinking that his/her code is ready to release. The following details additional practices that are highly recommended, but not strictly necessary.
The code must be designed so that the Ops team can scale it in the datacenter without needing help from Development.
This may involve deploying the code to a bigger server. This implies that the code can be configured (and documented for the Ops team) to make use of the expanded resources, whether it is number of cores, RAM, threads, I/O, etc
This may also involve adding instances to a cluster. Consequently, the code must be discoverable (the load balancer must find out that a new instance has been added/subtracted), as well as cluster-aware (e.g. stateless).
Because it is so hard to simulate all real-life user activities and behaviors in non-production environments, we must provide tools to the Ops team to tune the performance of our code through configuration rather than code deployment (e.g. size of JVM, number of threads, queue sizes, hash table size, etc).
We must thus provide the metrics to observe performance. Let’s take the example of response time: depending on the complexity of the application a user request may be handled by tens, or even hundreds of services. In order to allow the Ops team to build a timeline of the interactions between all the services involved, each log entry must carry at least one tag that identifies the root transaction that generated the request. Otherwise it is impossible to determine whether the performance degradation comes from a given service, or a unique server, or even from the network infrastructure.
The same tagging will be used to troubleshoot failures (e.g. to discover why a given service fails intermittently).
As I mentioned in an earlier blog, QA does not stop in QA: we have to anticipate “unknown unknowns”, i.e. usage (or performance) scenarios that we have not modeled in our QA environments. By definition, there is not much we can do other than ensuring that our code is easy to trouble-shoot (see above) and that logs and associated data can be made available easily and rapidly to developers and QA team (e.g. by giving them access to the log management console).
Sometimes this requirement is more complex than it sounds, e.g. when user data must be deleted or obfuscated for privacy or security reasons. Again, this should be thought through before code is deployed.
Analytics – Growth Hacking – Usability
This last requirement stems from Marketing and Sales rather than Operations, but it is equally important since it drives revenue growth.
In most companies, marketing and sales rely on usage reports to drive new marketing campaigns, pricing, product offerings and even new features. As a consequence, any new feature must integrate with the Analytics infrastructure whether via integration with usage tracking applications (e.g. Mixpanel, Flurry, …) or simply log management consoles (Splunk, GrayLog2, …). However, I highly recommend using separate logging infrastructure for operations monitoring and for usage analytics, if only because usage analytics requires additional data that is not useful for Operations monitoring (e.g. the time a user spends on a page is extremely valuable for usage analytics but irrelevant for Operations)
Even More So for Microservices
As we migrate towards a microservices architecture, early “DevOps thinking” becomes even more critical. As the “Microservices: Four Essential Checklists when Getting Started” advises: “Microservices introduces a lot of moving parts that were previously non-existent in a monolithic system”.
What was a monolithic application running in a single virtual machine can morph into 5, 10 or even 20 microservices. Consequently, Development, DevOps and Ops must collaborate on microservices infrastructure tools: service registration, scaling up/down each service independently, health monitoring, error detection, etc. to provide visibility on the status of these 20 microservices as a whole. This challenge has even prompted dedicated product categories (SignalFx, Nirmata, etc)
Only with a holistic approach to product architecture can we ensure customer satisfaction with software that works the first time, and all the time. Deployment and operations management concerns, just like testability, must be addressed at design time, so that these capabilities are meshed natively into the code rather than “bolted on” after the fact. Failing to do so will likely impact the delivery schedule, or worse, create outages in production.
More importantly, there is so much we can learn from observing how our code behaves in Production: operational efficiency, stability, performance, usability, that we would do a disservice to ourselves if we did not avail ourselves of this valuable information to drive further improvements to our product.