DevOps-Driven Development

It is now time to add the concept of “DevOps-Driven Development” to our repertoire.

“Test-driven” development, which originated around the same time as Extreme Programming and Agile Development, encourages us to think about testing as we architect our software and plan our tasks. Similarly, a “DevOps-Driven Development” approach ensures that we consider operational implementation as well as the deployment process during the design phase. To be clear, DevOps thinking needs to augment (and not replace) testing strategy.

Definition and Motivation

First, a definition: I am using the word DevOps here as a shortcut to include both DevOps (build and deployment tools) and Ops (IT/data center Operations).

How many times have you heard “… but it works on my machine!!” from a developer whose code was found to have a bug in the QA environment or, worse, in production? We all agree that these situations are a horrible waste of time for all involved, most of all customers. This post thus advocates that DevOps thinking, just like quality thinking, must start at the design phase and continue throughout the development of the software, up to and even after its release to production.

Practicing DevOps-Driven Development

I have always advocated “If you don’t know how to test it, you don’t know how to design it” (Who Owns Quality? Part 3) to articulate the fact that “quality cannot be debugged out, it has to be designed in”. Similarly, if we want to know – before our customers call us – when our code crashes in Production, or becomes unusably slow, then we must build into our code the proper instrumentation and administration capabilities.

We must now add this mantra: “If you don’t know how to deploy it and manage it in Production, you don’t know how to design it”.

Just as we don’t allow code to be merged into Trunk (the main branch) without complete unit tests, code cannot be merged into Trunk without correct deployment scripts, release notes, and production instrumentation.

Here is a “thinking DevOps” checklist:

Deployable

First of all, we must ensure that the code deploys successfully not only in Production but in all environments: Dev, QA, Stage, etc.

This implies:

  • Developers write/update release notes, e.g. highlighting any changes required in the configuration of the environments: opening a new port, adding a column in the database, adding a new property in config files, etc.
  • Developers, in collaboration with the DevOps team, update deployment scripts, e.g. to account for a new executable or schema changes in the database

The management of Config/Property files is beyond the scope of this blog, but I strongly recommend the “Infrastructure as code” approach: i.e. fully automating server/image configuration for deployment, and managing configuration, deployment scripts and application property files under source code control.
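As a minimal illustration of keeping environment configuration under source control, here is a sketch that loads a per-environment property file and fails fast if a key called out in the release notes is missing before a deployment proceeds. The file layout, key names, and helper are hypothetical assumptions, not a prescription for any particular tool.

```python
import json
import sys

# Hypothetical layout: config/dev.json, config/qa.json, config/stage.json and
# config/prod.json all live in source control next to the code they configure.
REQUIRED_KEYS = {"db.host", "db.port", "queue.size", "feature.new_checkout"}

def load_environment_config(environment: str) -> dict:
    """Load one environment's property file and stop the deployment if a key
    announced in the release notes is missing."""
    with open(f"config/{environment}.json") as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        # Better to stop here than to fail mysteriously at runtime.
        sys.exit(f"{environment}: missing required configuration keys: {sorted(missing)}")
    return config

if __name__ == "__main__":
    cfg = load_environment_config(sys.argv[1] if len(sys.argv) > 1 else "dev")
    print(f"deploying with db.host={cfg['db.host']} queue.size={cfg['queue.size']}")
```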

Monitor-able

If we want to detect problems before our (irate) customers call us, our code needs to be monitor-able – not only at the physical server level, but also at the level of each virtual machine, service and process, as well as the networking and storage systems.

Monitor-ability goes beyond keeping track of CPU load, disk space and network bandwidth. We, developers, (should) know what parameters indicate when our system is misbehaving, whether it is a queue exceeding a given size, or certain operations timing out. As a consequence, we must publish these parameters through interfaces compatible with the Ops team’s monitoring tools.
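As one example of publishing such an application-level parameter, the sketch below emits a queue-depth gauge using the plain-text StatsD line protocol over UDP, which many monitoring stacks can ingest. The metric name, host and port are assumptions for illustration, and StatsD is only one of several possible interfaces.

```python
import socket

STATSD_HOST, STATSD_PORT = "127.0.0.1", 8125  # assumed address of a local StatsD agent
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def publish_gauge(name: str, value: float) -> None:
    """Send one gauge sample in StatsD line format: '<name>:<value>|g'."""
    _sock.sendto(f"{name}:{value}|g".encode("ascii"), (STATSD_HOST, STATSD_PORT))

# The developers know that a growing work queue signals trouble, so they
# publish its depth instead of keeping that knowledge in their heads.
work_queue = ["job-1", "job-2", "job-3"]
publish_gauge("orders.work_queue.depth", len(work_queue))
```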

Furthermore, by making performance metrics easily observable, we ensure that each new release maintains (or improves) the performance of the prior release.

Diagnosable

Despite our best intentions, we must humbly assume that at some point our code will crash, or seriously misbehave, and thus require troubleshooting. In the worst case, Development will be called in (usually in the wee hours of the night) to assist the Ops team. As anyone who has had to figure out why a given system intermittently crashes will attest, having log files capture meaningful information prior to the incident is invaluable. Having to add logging statements after the fact is a painful process. Consequently, solid logging hygiene is critical (and worthy of a dedicated post):

  • Log statements must be written in a format compatible with the log management system (Splunk, GrayLog2, …)
  • All log statements used during the coding and QA phase must be removed
  • Comprehensive Operations-focused logging must be added to document all operations that may fail due to environmental and data-related problems: out-of-memory, disk full, time out, user not found, access denied, etc. These are not bugs, but failures due to either environment (e.g. a server or connection is down) or incorrect data (e.g. the user has been deleted).
  • The hierarchy of logging levels must be enforced so that, in normal operations, log files are kept small and, conversely, meaningful information is output when troubleshooting is required
  • Log statements must include all the information necessary to bind all operations across various services that are related to a single user-level transaction (e.g. clicking on a link to a new page, adding an item to the cart) – see the sketch below and more details in “Tunable”.
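Here is a minimal sketch of what this logging hygiene can look like with Python’s standard logging module: consistent severity levels, a format a log management system can parse, and a transaction identifier stamped on every entry. The field names and the transaction-ID scheme are illustrative assumptions, not a standard.

```python
import logging
import uuid
from typing import Optional

# One consistent, machine-parseable format for every service.
logging.basicConfig(
    level=logging.INFO,  # DEBUG stays off in normal operations to keep files small
    format="%(asctime)s %(levelname)s %(name)s txn=%(txn_id)s %(message)s",
)

def get_transaction_logger(name: str, txn_id: Optional[str] = None) -> logging.LoggerAdapter:
    """Return a logger that stamps every entry with the root transaction id."""
    txn_id = txn_id or uuid.uuid4().hex
    return logging.LoggerAdapter(logging.getLogger(name), {"txn_id": txn_id})

log = get_transaction_logger("checkout-service")
log.info("add_to_cart succeeded item_id=%s", "sku-42")
# Environmental failures are logged at the right severity, not swallowed.
log.error("payment-gateway timeout after 30s, order left in PENDING state")
```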

Security

This again is worthy of its own post, but code that is deployed to Production must both support the security practices implemented by the Ops team (e.g. authentication protocols, networking infrastructure), and ensure that the code itself is secure (e.g. no SQL injection, buffer overflows, etc.).

Business Continuity

Business continuity is often overlooked, but we must ensure that any persistent data is stored in a storage system that is backed up by the Ops team. In other words, if we add a new database, we’d better ask the Ops team to add it to their backup scripts.

Similarly, if our infrastructure is deployed (or even just deployable) across multiple data centers, our code must support this through configuration.

The above represent the basic DevOps requirements that any developer must address before even thinking that his/her code is ready to release. The following sections detail additional practices that are highly recommended, but not strictly necessary.

Scalable

The code must be designed so that the Ops team can scale it in the datacenter without needing help from Development.

This may involve deploying the code to a bigger server. This implies that the code can be configured (and documented for the Ops team) to make use of the expanded resources, whether it is the number of cores, RAM, threads, I/O, etc.

This may also involve adding instances to a cluster. Consequently, the code must be discoverable (the load balancer must find out that a new instance has been added or removed), as well as cluster-aware (e.g. stateless), as sketched below.
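One way to make an instance discoverable is to have it register itself with a registry (or with the load balancer’s API) on startup and deregister on shutdown; keeping the service itself stateless means any registered instance can serve any request. In the sketch below, the registry URL and payload shape are hypothetical.

```python
import atexit
import json
import socket
import urllib.request

REGISTRY_URL = "http://registry.internal:8500/v1/instances"  # hypothetical endpoint
INSTANCE = {"service": "catalog", "host": socket.gethostname(), "port": 8080}

def _post(url: str, payload: dict) -> None:
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=2)

def register() -> None:
    """Announce this instance so the load balancer can start routing to it."""
    _post(REGISTRY_URL, {**INSTANCE, "action": "register"})

def deregister() -> None:
    """Remove this instance before shutdown so no traffic is routed to a dead node."""
    _post(REGISTRY_URL, {**INSTANCE, "action": "deregister"})

if __name__ == "__main__":
    register()
    atexit.register(deregister)
    # ... start serving requests; no per-user state is kept in this process ...
```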

Tunable

Because it is so hard to simulate all real-life user activities and behaviors in non-production environments, we must provide tools to the Ops team to tune the performance of our code through configuration rather than code deployment (e.g. JVM heap size, number of threads, queue sizes, hash table sizes, etc.).

We must thus provide the metrics to observe performance. Let’s take the example of response time: depending on the complexity of the application, a user request may be handled by tens, or even hundreds, of services. In order to allow the Ops team to build a timeline of the interactions between all the services involved, each log entry must carry at least one tag that identifies the root transaction that generated the request. Otherwise it is impossible to determine whether the performance degradation comes from a given service, a particular server, or even from the network infrastructure.
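A hedged sketch of such tagging: every outgoing call carries the root transaction id (here in a made-up `X-Root-Txn-Id` header), and the caller logs the duration of each hop so the Ops team can reconstruct a timeline from the aggregated logs. The header name, URLs and log fields are assumptions for illustration only.

```python
import logging
import time
import urllib.request
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("gateway")

def call_service(url: str, root_txn_id: str) -> bytes:
    """Call a downstream service, propagating the root transaction id and
    logging how long the hop took."""
    req = urllib.request.Request(url, headers={"X-Root-Txn-Id": root_txn_id})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.read()
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        log.info("txn=%s call=%s duration_ms=%.1f", root_txn_id, url, elapsed_ms)

# One user-level transaction fans out to several services under the same tag.
txn = uuid.uuid4().hex
for service_url in ("http://cart.internal/items", "http://pricing.internal/quote"):
    try:
        call_service(service_url, txn)
    except OSError as exc:
        log.error("txn=%s call=%s failed: %s", txn, service_url, exc)
```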

The same tagging will be used to troubleshoot failures (e.g. to discover why a given service fails intermittently).

QA-able

As I mentioned in an earlier blog, QA does not stop in QA: we have to anticipate “unknown unknowns”, i.e. usage (or performance) scenarios that we have not modeled in our QA environments. By definition, there is not much we can do other than ensuring that our code is easy to troubleshoot (see above) and that logs and associated data can be made available easily and rapidly to developers and the QA team (e.g. by giving them access to the log management console).

Sometimes this requirement is more complex than it sounds, e.g. when user data must be deleted or obfuscated for privacy or security reasons. Again, this should be thought through before code is deployed; a sketch follows below.
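For instance, when logs must be shared with developers but raw user identifiers cannot leave production, a simple pseudonymization pass like the one below can run before the export. The field names and salt handling are assumptions, and real privacy requirements may well demand more than hashing.

```python
import hashlib
import json

SALT = b"rotate-me-regularly"  # assumption: managed as a production secret

def pseudonymize(event: dict, sensitive_fields=("user_id", "email")) -> dict:
    """Replace sensitive values with a stable hash so sessions can still be
    correlated during troubleshooting without exposing who the user is."""
    clean = dict(event)
    for field in sensitive_fields:
        if field in clean:
            digest = hashlib.sha256(SALT + str(clean[field]).encode()).hexdigest()[:12]
            clean[field] = f"anon-{digest}"
    return clean

raw = {"user_id": "u-1234", "email": "jane@example.com", "action": "upload_failed"}
print(json.dumps(pseudonymize(raw)))
```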

Analytics – Growth Hacking – Usability

This last requirement stems from Marketing and Sales rather than Operations, but it is equally important since it drives revenue growth.

In most companies, marketing and sales rely on usage reports to drive new marketing campaigns, pricing, product offerings and even new features. As a consequence, any new feature must integrate with the Analytics infrastructure, whether via integration with usage tracking applications (e.g. Mixpanel, Flurry, …) or simply log management consoles (Splunk, GrayLog2, …). However, I highly recommend using separate logging infrastructure for operations monitoring and for usage analytics, if only because usage analytics requires additional data that is not useful for Operations monitoring (e.g. the time a user spends on a page is extremely valuable for usage analytics but irrelevant for Operations).
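A minimal way to keep the two streams separate is to emit usage events to their own sink rather than to the operational log. The sketch below appends JSON events to a dedicated file that an analytics pipeline could ingest; the event names, fields and file path are illustrative assumptions.

```python
import json
import time

ANALYTICS_FILE = "analytics_events.jsonl"  # assumed pickup point for the analytics pipeline

def track(event_name: str, **properties) -> None:
    """Append one usage-analytics event; nothing here goes to the Ops log."""
    event = {"event": event_name, "ts": time.time(), **properties}
    with open(ANALYTICS_FILE, "a") as f:
        f.write(json.dumps(event) + "\n")

# Time-on-page matters to Marketing, not to Operations.
track("page_view", page="/pricing", seconds_on_page=42.5, user="anon-7f3a")
```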

Even More So for Microservices

As we migrate towards a microservices architecture, early “DevOps thinking” becomes even more critical. As “Microservices: Four Essential Checklists when Getting Started” advises: “Microservices introduces a lot of moving parts that were previously non-existent in a monolithic system”.

What was a monolithic application running in a single virtual machine can morph into 5, 10 or even 20 microservices. Consequently, Development, DevOps and Ops must collaborate on microservices infrastructure tools (service registration, scaling each service up/down independently, health monitoring, error detection, etc.) to provide visibility on the status of these 20 microservices as a whole. This challenge has even prompted dedicated product categories (SignalFx, Nirmata, etc.).

Summary

Only with a holistic approach to product architecture can we ensure customer satisfaction with software that works the first time, and all the time. Deployment and operations management concerns, just like testability, must be addressed at design time, so that these capabilities are meshed natively into the code rather than “bolted on” after the fact. Failing to do so will likely impact the delivery schedule, or worse, create outages in production.

More importantly, there is so much we can learn from observing how our code behaves in Production (operational efficiency, stability, performance, usability) that we would do ourselves a disservice if we did not avail ourselves of this valuable information to drive further improvements to our product.

Scalable Software Architecture for a Startup

Say we are the founders of a startup and we just got a big fat check for our A-round funding. The VCs love our idea, and we all know that our app will attract millions of users in no time. This means that from day one we architect for millions of page-views per day…

But wait … do we really need to deploy Hadoop now? Do we need to design for geographical redundancy now? Or should we just build something that’s going to take us through the next 3 months, so that we can focus our energy on customer development and fine-tuning our product features? …

This is a dilemma that most startups face.

Architecting for Scale

The main argument for architecting for scale from the get-go is akin to: “do it right the first time”: we know that lots of users will be using our app, so we want to be ready when they come, and we certainly don’t want the site going down just as our product catches fire.

In addition, for those of us who have been through the pain of a complete rewrite, a rewrite is something we want to avoid at all costs: it is a complex task that is fun under the right circumstances, but very painful under time pressure, e.g. when the current version of the product is breaking under load and we risk turning away customers, potentially forever.

On a more modest level, working on big complex problems keeps the engineering team motivated, and working on bleeding or leading edge technology makes it easier to attract talent.

Keeping It Simple

On the other hand, keeping the technology as simple as possible allows the engineering team to be responsive to the product team during the customer development phase. If you believe, as I do, one of Steve Blank’s principles of customer development, “No Business Plan Survives First Contact with Customers”, then you need to prepare for its corollary, namely: “no initial product roadmap survives first contact with customers”. Said differently, attempting to optimize the product for scale before the company has clearly validated its business assumptions and product roadmap is premature.

On the contrary, the most important qualities that are needed from the Engineering team in the early stages of the company are velocity and adaptability. Velocity, in order to reduce time-to-market, and adaptability, so that the team can rapidly adapt to feedback from “outside the building”.

Spending time designing and implementing a scalable architecture is time that is not spent responding to customer needs. Similarly, having built a complex system makes it more difficult to adapt to changes.

Worst of all, the investment in early optimization may be all for naught: as the product evolves with customer feedback, so do the scalability constraints.

Case Study: Cloudtalk

I lived through such an example at Cloudtalk. Cloudtalk is designed as a social communication platform with an emphasis on voice. The first two products, “Cloudtalk” and “Let’s Talk”, are mobile apps that implement various flavors of group messaging with voice (as well as text and other media). Predicting rapid success, Cloudtalk was designed around the highly scalable NoSQL database Cassandra.

I came on board to launch “Just Sayin”, another mobile app that runs on the same backend (a very astute design). Just Sayin is targeted at celebrities and allows them to cross-post voice messages to Twitter and Facebook. One of my initial tasks coming on board was to scale the app, and it was suggested that we needed to move it to Amazon Web Services so that we could scale rapidly as more celebrities (such as Ricky Gervais) adopted our product. However, a quick analysis revealed that unlike the first two products (Let’s Talk and Cloudtalk), Just Sayin’s impact on the database was relatively light, because communications were 1-to-many (e.g. Lady Gaga to her 10M fans). Rather, in order to scale, we first needed a Content Delivery Network (CDN) so that we could feed millions of fans their celebrities’ messages with low response time.

Furthermore, while Cassandra is a great product, it was somewhat immature at the time (stability, management tools) and consequently slowed down our development. It also took us a long time to train new engineers.

While Cassandra would have been a good choice in the long run, we would have been better served in the formative stages of the company by more established technology like MySQL. Our velocity in developing new features, and our ability to respond to changes in product strategy, would have been significantly higher.

Architecting for Scale is a Process, not an Event

A startup needs to earn the right to design for scale, by first proving that it has found a legitimate market. During this first phase adaptability and velocity are its most important attributes.

This being said, we also need to anticipate that we will need to scale the system at some point. Here is how I like to approach the problem:

  • First of all, scaling is an on-going process. Even if traffic increases dramatically over a short period of time, not all parts of the system need to be scaled at the same time. Yet, as usage increases, it is likely that at any point in time some part of the system will need to be scaled.
  • In order to avoid complete rewrites of the system, we need to break it into independent components. This allows us to redesign each component independently, and have different teams work on different problems concurrently. As a consequence, good modularization of the system is much more important early on than designing for scale.
  • Every release cycle needs to budget time and resources for redesign – including both modularization and scalability. This is just like maintenance on the Golden Gate bridge: the painters are always working; when they finish at one end, they start all over at the other end.
  • We need to treat our software architecture the same way, and budget maintenance work every release cycle: dollars, time, people. CEOs have to be trained to not only think about the “shiny features” – those that are customer-facing – but also about the “continuous improvements” to the architecture that have to be factored into every release cycle.
  • We also need to instrument the code to tell us where it is under strain (see the sketch after this list). Unlike the Golden Gate bridge, we can’t always see where it’s breaking, or even rationalize it. Scaling sometimes works in mysterious ways that are not always easy to predict.
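One lightweight way to get that instrumentation, sketched here under the assumption that metrics are simply logged for later aggregation, is a decorator that times hot code paths and flags calls that blow past an expected budget. The function, budget and sleep are made up for illustration.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("strain")

def timed(budget_ms: float):
    """Log the duration of every call and warn when it exceeds its budget,
    so we can see which part of the system is under strain as usage grows."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = (time.monotonic() - start) * 1000
                level = logging.WARNING if elapsed > budget_ms else logging.INFO
                log.log(level, "%s duration_ms=%.1f budget_ms=%.0f", func.__name__, elapsed, budget_ms)
        return wrapper
    return decorator

@timed(budget_ms=50)
def render_timeline(user_id: str) -> str:   # hypothetical hot path
    time.sleep(0.08)                        # simulate work that is slower than budgeted
    return f"timeline for {user_id}"

render_timeline("u-42")
```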

 

In summary, designing for scale is a high-class problem, one we only get to work on once we have demonstrated true demand for our product. During this first phase, velocity and adaptability are critical, and are better served with well-understood technologies and a well-modularized design. Once our product reaches an adoption phase, designing for scale becomes a continuous process that hopefully can be focused on individual modules in turn – guided by proper instrumentation of the code.

 

QA does not stop in QA

Quality Assurance does not stop after the software receives the “thumbs up” from the QA team. QA must continue while the product is live, because QA is not perfect and real users only exist on a Production system. We need to be humble and accept that our design, development and quality processes will not catch all the issues. Consequently, we must equip ourselves with tools that will allow us to catch these problems in Production as early as possible, rather than “wait for the phone to ring”.

When the product exits QA, it simply means that we have run out of ideas on how to make the system fail. Unfortunately, this does not imply that the system, once in Production, will not fail. If we are successful and get a high volume of traffic, the simple law of large numbers guarantees that our users will find never-thought-of ways to – unintentionally – make the system fail. These are part of the “unknown unknowns”, as Mr. Donald Rumsfeld would say. Deploying the product on the production servers, and handing off (abdicating?) to the Ops team the responsibility of keeping it up, shows wishful thinking or naïveté, or both.

Why QA must continue in Production

There are a few categories of issues that one needs to anticipate in Production:

  • Functional defects: in essence, bugs that neither developers nor QA caught – while this is the obvious category that comes to mind, it is far from being the only source of issues
  • User experience (UX) defects: the product works “as spec’d”, but users either can’t figure out how to make it work, or don’t like it. A typical example is a high abandon rate in a purchasing experience or other workflow, a feature that’s never used, or a button that’s never clicked.
    This is not limited to new products: by improving the layout of a given page, we may have broken another feature on that same page
  • Performance issues: while we may have run performance and load tests in our QA environments, the real world always offers surprises. Furthermore, if we are lucky enough to have the kind of traffic that Google or Facebook have, there is no other way but to test and fine-tune performance in production.
    Running tests on non-production systems requires not only simulating the load on the system, but also simulating the “weight” of existing data (e.g. in the database and file system), as well as longevity, to ensure that there are no resource leaks (memory, threads, etc.)
  • Operational issues: while all cloud applications are typically clustered for high availability, there are other sources of failure than equipment failure:
    • External resources, such as partner APIs or data feeds, can fail, have bugs of their own, or simply not keep up their response time. Sometimes the partner updates the API without notification.
    • User-provided data can be malformed, in an unexpected format, or a new data format can be introduced after the launch of the product
    • System resources can be consumed at an unexpected rate. Databases are notorious for having non-linear response times based on load: as long as the load is under a given threshold, response time is stable, but once the load exceeds this threshold, response time can deteriorate very rapidly.

 

A couple of examples:

  • At my previous company, weeks after the product had been launched, we started receiving occasional complaints that some of the user-created videos were not showing up in their timeline. After (reluctantly) poking around in our log files, we found that about 10% of the videos uploaded to our site over the previous 2 weeks (but not earlier) had not been processed properly. Our transcoder simply failed. Worse, it failed silently. The root cause was a minor modification to the video format introduced by Apple after our product was released. Since this failure was occurring for a small fraction of our users, and we had no “operational instrumentation” in our code, it took us a long time to even become aware of it.
  • Recently, we launched a product that exchanges data with a partner. Their API is well documented, and we tested our product in their sandbox environment, as well as their production environment. However, after launch, we had reports of occasional failures. It turned out that users on our partner’s site were modifying the data in ways that we did not expect, causing the API to return error codes that we had never seen. Our code duly logged this problem in our log files each time it occurred … among the thousands of other log events generated every minute.

 

Performing QA on Production Systems

As I mentioned, the Googles and Facebooks of the world do a lot (if not most) of their QA on Production systems. Because they run hundreds of thousands of servers, they can use a small subset to run tests with live user data. This is clearly a fantastic option.

Similarly, “A/B comparison” techniques are typically used in Marketing to compare two different user experiences, where the outcome (e.g. a purchase) can be measured. The same technique can be applied in testing, e.g. to validate that a fix for an intermittent, difficult-to-reproduce bug does work.

 

More generally, Production code needs to be instrumented:

  • To detect failures, or QoS (Quality of Service) degradations, with internal causes (e.g. database is slowing down)
  • To detect failures, or QoS degradations, with external causes (e.g. partner API times out a lot)
  • To monitor resource utilization for each service or application – at a finer grain than provided by Operations monitoring tools, which are typically at the server level.

The point is that if a user can’t buy a book on our website because our servers crash under load, this is a bug. Even if the crash is not due to incorrectly written code, the absence of code warning us that the system was running out of steam is still a bug.

 

In order to monitor quality in Production, we need to:

  • Clean up the code that writes to log files: eliminate all logging used for code testing, or statements such as “the code should never reach here”. Instead, write messages that will be meaningful to the poor soul who, a few weeks later, will be poring over megabytes of log files on a Sunday night trying to figure out why the system crashed
  • Ensure that log messages have consistent severity levels (e.g. as recommended by RFC 5424; Wikipedia has a nice table), so that meaningful alerts can be triggered
  • Use a log aggregation system, like GrayLog2 (open source), so that log files from multiple nodes in the same cluster, as well as from nodes in different services, can (a) be searched from a console and (b) viewed, time-aligned, on a single page (critical for troubleshooting). GrayLog2 can handle hundreds of millions of log events and terabytes of data.
  • MEASURE: establish a baseline for response time, resource consumption and errors – and trigger alerts when the metrics deviate from the baseline beyond a predetermined threshold (see the sketch after this list)
  • Track that core functions – from a user perspective – complete, and log when, and ideally why, they fail, along with key parameters. E.g.: are users able to upload files to our system? Are failures related to file size, time of day, location of the user, etc.?
  • Log UX and operationally meaningful events to track how users actually use the system, what features are most used and track them over time. These metrics are critical for the Product Management team
  • Monitor resource utilization and correlate it with usage patterns. Quantify key usage parameters in order to scale the right resources in advance of demand. For example, as traffic grows, the media servers and the database servers may grow at different rates.
  • Integrate alarms from application errors into the Ops monitoring tools: e.g. too many “can’t connect” errors should trigger an Ops alert that our partner is down; slow response time on a single server in a cluster may indicate that its disk is failing
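As a sketch of the MEASURE item above, the snippet below keeps a rolling baseline of response times and emits a warning-level log (which the Ops monitoring tools can turn into an alert) whenever the current sample deviates from the baseline by more than a chosen factor. The window size, threshold and metric name are arbitrary illustrative values.

```python
import collections
import logging
import statistics

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("qa-in-prod")

class BaselineMonitor:
    """Track a metric against its own recent history and flag deviations."""

    def __init__(self, name: str, window: int = 100, threshold: float = 2.0):
        self.name = name
        self.samples = collections.deque(maxlen=window)
        self.threshold = threshold  # alert when a sample exceeds threshold x baseline

    def record(self, value: float) -> None:
        if len(self.samples) >= 10:  # wait for a minimal baseline
            baseline = statistics.median(self.samples)
            if baseline > 0 and value > self.threshold * baseline:
                log.warning("%s deviation: value=%.1f baseline=%.1f", self.name, value, baseline)
        self.samples.append(value)

monitor = BaselineMonitor("checkout.response_ms")
for sample in [120, 118, 125, 119, 122, 121, 117, 124, 120, 119, 123, 460]:
    monitor.record(sample)   # the last sample triggers a deviation warning
```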

 

Quality is not a one-time event; it is an everyday activity, because users change their behaviors, partners change their APIs, systems fill up and slow down. What used to work yesterday may not work today, or may no longer be good enough for our customers. As a consequence, the concept of “test-driven” development must be extended to Production systems, and our code must be instrumented to provide metrics that confirm that everything works as desired, and alerts when it doesn’t. But that’s not sufficient: developers and QA engineers must also take the time to look at the data, not just when a fire drill has been called, but on a regular basis, to understand how the system is being used and how resources are consumed as the system scales, and apply this knowledge to subsequent releases.

Cloud Computing – The Miracle Tool for Testing

Cloud Computing eliminates restrictions due to the number of servers in the QA lab, and thus allows concurrent testing by developers and QA engineers. By making it easy to test often, and to expose early releases to the outside world, Cloud Computing will improve product quality.

Does this story ring familiar? You are in a planning meeting for the next release, and learn that in addition to supporting Oracle 11g, the product will also need to support Microsoft SQL Server 2008 (or DB2, or MySQL, or PostgreSQL). Once the typical brouhaha dies down about how complicated this will be, how the whole code will need to be ripped apart, and how much time this will take, the Director of QA turns to you and asks for a couple of additional servers for the QA lab, so that the software can be tested on the two databases in parallel: a minimum of three servers, one for the database, one for our software, and one for the test fixtures. The following day, it’s the development lead’s turn to ask for more servers: at least one “populated” database against which the developers can test, plus another set up for the daily build, etc. Makes perfect sense … except that no budget has been allocated for these servers! Soon you find yourself with your beggar’s cup in the CEO’s office, explaining to him and the CFO why your team needs these extra servers when “you already have so many!!”

Rejoice! Here comes Cloud Computing to the rescue …

Cloud Computing not only eliminates the need to purchase servers for testing, but actually radically improves your ability to test, and thus improves product quality.

Cloud Computing, such as Amazon EC2, offers the ability to deploy (and un-deploy) software on demand. One pays “by the hour” for computing used, and for storage and bandwidth consumed. This is perfect for testing (by developers and by QA): compute load varies greatly over the cycle of the day, as well as over the cycles of the release.

First of all, every developer can now have his/her own test setup against which to test. There is no hardware limitation, no begging, borrowing or stealing from your colleagues for underutilized servers. One can just deploy at will. Furthermore, there is no restriction on the number of servers. So if you need to test a four-server cluster, you don’t have to hunt around for free servers, you just do it.

Similarly the daily build can deploy to multiple test environments concurrently and thus accelerate the validation of the build.

Finally, the QA team can also test in multiple environments simultaneously, e.g. Oracle and SQL Server at the same time! This offers the potential benefit of being able to test a much larger number of deployment scenarios than would be possible using one’s own hardware.

Naturally, leveraging a Cloud Computing infrastructure, requires new tools.

First and foremost, all the tests must be automated. While technology has created virtual servers, it has not yet invented virtual test engineers. Secondly, one will have to build tools to automatically deploy the new version of the software and the test fixtures, e.g. from the build environment, as well as collect the results of the test runs.
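As an illustration, assuming Amazon EC2 and the boto3 SDK (and AWS credentials already configured), the sketch below launches a short-lived instance from an AMI that has the product and test fixtures baked in, waits for it to come up, and terminates it when the run is over. The AMI id, instance type and tags are placeholders; actually triggering the test suite and collecting results is left out.

```python
import boto3

ec2 = boto3.resource("ec2")

def run_disposable_test_instance(ami_id: str, instance_type: str = "t3.medium"):
    """Launch one throwaway test server and hand it back to the caller, who
    terminates it when the test run has finished."""
    (instance,) = ec2.create_instances(
        ImageId=ami_id,            # placeholder: an AMI with product + test fixtures pre-installed
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "nightly-build-test"}],
        }],
    )
    instance.wait_until_running()
    instance.reload()              # refresh attributes such as the public DNS name
    return instance

instance = run_disposable_test_instance("ami-0123456789abcdef0")  # placeholder AMI id
try:
    print(f"test environment ready at {instance.public_dns_name}")
    # ... deploy the nightly build, run the automated suite, collect results ...
finally:
    instance.terminate()   # pay only for the hours the tests actually ran
```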

One can be quite creative with the test management tools. For example, if a test setup encounters a high-severity bug, you could configure your test software to pause the test, deploy to a second environment and continue testing in the second environment. This allows you to go back to the first test setup to troubleshoot, and find the cause of the crash.

Another fascinating advantage is that you can deploy demo or beta systems at will (assuming your deployment model allows it), and let your sales team or prospective customers “play with” the early release. By making it easier to expose early releases of the product to the outside world, Cloud Computing further improves the quality of your product.

Will you save money by testing in a Cloud Computing infrastructure?

Obviously the answer depends … on your usage, but also on factors like how much data you need to keep permanently in the cloud. For example you may need to permanently store a synthetic database of a million users (it would be too slow to upload it each time). You will also incur higher networking traffic.

In addition, you may not want to move all your tests to the cloud. For example, you may want to keep your stress tests or longevity tests in-house, since these will be running 24×7, and you may want the option of running them on bare metal.

At the end of the day, to me the attraction of Cloud Computing for testing is that it will increase quality (in addition to reducing costs). It will allow each developer to have access to a test environment at will. It will create an additional impetus for test automation. Cloud Computing will also allow the concurrent deployment of tests to an arbitrary number of computing environments, and make it easier to give early access to your customers. Net-net, this translates to more tests in the same amount of time with less effort. It’s all goodness.