DevOps-Driven Development

It is now time to add the concept of “DevOps-Driven Development” to our repertoire.

“Test-driven” development, which originated around the same time as Extreme Programming and Agile Development, encourages us to think about testing as we architect our software and plan our tasks. Similarly, a “DevOps-Driven Development” approach, ensures that we consider operational implementation as well as deployment process during the design phase. To be clear, DevOps thinking needs to augment (and not replace) testing strategy.

Definition and Motivation

First a definition: I am using the word DevOps here as a shortcut to include both DevOps (build and deployment tools) and Ops (IT/data center Operations).

How many times have you heard “ … but it works on my machine!!” from a developer whose code was found to have a bug in the QA environment or, worse, in production? We all agree that these situations are a horrible waste of time for all involved, most of all customers. This post  thus advocates that DevOps-thinking, just as quality-thinking, must occur at the design phase and continue throughout the development of the software until the software is released to production, and even after it has been released in production.

Practicing DevOps-Driven Development

I have always advocated: “If you don’t know how to test it, you don’t know how to design it.” (Who Owns Quality? Part 3), to articulate the fact that “quality cannot be debugged out, it has to be designed in”. Similarly, if we want to know – before our customers call us – when our code crashes in Production, or becomes unusably slow, then we must build into our code the proper instrumentation and administration capabilities.

We now must add this mantra “If you don’t know how to deploy it and manage it in Production, you don’t know how to design it”.

Just like we don’t allow code to be merged into Trunk (main branch) without complete unit tests, code cannot be merged into Trunk without correct deployment scripts, release notes, and production instrumentation.

Here is a “thinking DevOps” check list:

Deployable

First of all, we must ensure that the code deploys successfully not only in Production but in all environments: Dev, QA, Stage, etc

This implies:

  • Developers write/update release notes: e.g. highlighting any changes required in the configuration of the environments: open new port, add a column in database, a new property in config files, etc
  • Developers in collaboration with DevOps team update deployment scripts, e.g. to account for a new executable, or schema changes in the database

The management of Config/Property files is beyond the scope of this blog, but I strongly recommend the “Infrastructure as code” approach: i.e. fully automating  server/image configuration for deployment and, managing configuration, deployment scripts and application property files under source code control.

Monitor-able

If we want to detect problems before our (irate) customers call us, our code needs to be monitor-able – not only at the physical server level, but also each virtual machine, service and process, as well as networking and storage systems.

Monitor-ability needs surpass keeping track of CPU load, disk space and network bandwidth. We, developers, (should) know what parameter(s) indicate when our system is mis-behaving, whether it is a queue exceeding a given size, or certain operations timing out. As a consequence, we must publish these parameters to interfaces compatible with Ops monitoring tools, of which there are several categories:

Furthermore, by making performance metrics easily observable, we ensure that each new release maintains (or improves) the performance of the prior release.

Diagnosable

Despite our best intentions, we must humbly assume that at some point our code will crash, or seriously mis-behave, and thus require troubleshooting. In the worst case, Development will be called in (usually in the wee hours of the night) to assist the Ops team. As any one who has had to figure out why a given system intermittently crashes will attest, having log files capture meaningful information prior to the incident is invaluable. Having to add logging statements after-the-fact is a painful process. Consequently, a solid Logging Hygiene is critical (and worthy of a dedicated post):

  • Log statements must be written in a format compatible with the log management system (Splunk, GrayLog2, …)
  • All log statements used during the coding and QA phase must be removed
  • Comprehensive Operations-focused logging must be added to document all operations that may fail due to environmental and data-related problems: out-of-memory, disk full, time out, user not found, access denied, etc. These are not bugs, but failures due to either environment (e.g. a server or connection is down) or incorrect data (e.g. the user has been deleted).
  • The hierarchy of logging levels must be enforced so that in normal operations log files are kept small, and conversely  meaningful information is output when troubleshooting is required
  • Log statements must include all the information necessary to bind all operations across various services that are related to a single user-level transaction (e.g. clicking on a link to a new page, adding an item to cart) – more details below in “Tunable”.

Security

This again is worthy of its own post, but code that is deployed to Production must both support the security practices implemented by the Ops team (e.g. Authentication protocols, networking infrastructure), and ensure that the code itself is secure (e.g. no SQL injection, buffer overflow, etc).

Business Continuity

Business continuity is often overlooked, but we must ensure that any persistent data is stored in a storage system that is backed up by the Ops team. In other words, if we add a new database, we’d better ask the Ops team to add it to their backup scripts.

Similarly, if our infrastructure is deployed (or even just deployable) across multiple data-centers, our code must support this though configuration.

The above requirements represent the basic DevOps requirements that any developer must address before even thinking that his/her code is ready to release. The following details additional practices that are highly recommended, but not strictly necessary.

Scalable

The code must be designed so that the Ops team can scale it in the datacenter without needing help from Development.

This may involve deploying the code to a bigger server. This implies that the code can be configured (and documented for the Ops team) to make use of the expanded resources, whether it is number of cores, RAM, threads, I/O, etc

This may also involve adding instances to a cluster. Consequently, the code must be discoverable (the load balancer must find out that a new instance has been added/subtracted), as well as cluster-aware (e.g. stateless).

Tunable

Because it is so hard to simulate all real-life user activities and behaviors in non-production environments, we must provide tools to the Ops team to tune the performance of our code through configuration rather than code deployment (e.g. size of JVM, number of threads, queue sizes, hash table size, etc).

We must thus provide the metrics to observe performance. Let’s take the example of response time: depending on the complexity of the application a user request may be handled by tens, or even hundreds of services. In order to allow the Ops team to build a timeline of the interactions between all the services involved, each log entry must carry at least one tag that identifies the root transaction that generated the request. Otherwise it is impossible to determine whether the performance degradation comes from a given service, or a unique server, or even from the network infrastructure.

The same tagging will be used to troubleshoot failures (e.g. to discover why a given service fails intermittently).

QA-able

As I mentioned in an earlier blog, QA does not stop in QA: we have to anticipate “unknown unknowns”, i.e. usage (or performance) scenarios that we have not modeled in our QA environments. By definition, there is not much we can do other than ensuring that our code is easy to trouble-shoot (see above) and that logs and associated data can be made available easily and rapidly to developers and QA team (e.g. by giving them access to the log management console).

Sometimes this requirement is more complex than it sounds, e.g. when user data must be deleted or obfuscated for privacy or security reasons. Again, this should be thought through before code is deployed.

Analytics – Growth Hacking – Usability

This last requirement stems from Marketing and Sales rather than Operations, but it is equally important since it drives revenue growth.

In most companies, marketing and sales rely on usage reports to drive new marketing campaigns, pricing, product offerings and even new features. As a consequence, any new feature must integrate with the Analytics infrastructure whether via integration with usage tracking applications (e.g. Mixpanel, Flurry, …) or simply log management consoles (Splunk, GrayLog2, …). However, I highly recommend using separate logging infrastructure for operations monitoring and for usage analytics, if only because usage analytics requires additional data that is not useful for Operations monitoring (e.g. the time a user spends on a page is extremely valuable for usage analytics but irrelevant for Operations)

Even More So for Microservices

As we migrate towards a microservices architecture, early “DevOps thinking” becomes even more critical. As the “Microservices: Four Essential Checklists when Getting Started” advises: “Microservices introduces a lot of moving parts that were previously non-existent in a monolithic system”.

What was a monolithic application running in a single virtual machine can morph into 5, 10 or even 20 microservices. Consequently, Development, DevOps and Ops must collaborate on microservices infrastructure tools: service registration, scaling up/down each service independently, health monitoring, error detection, etc. to provide visibility on the status of these 20 microservices as a whole. This challenge has even prompted dedicated product categories (SignalFx,  Nirmata, etc)

Summary

Only with a holistic approach to product architecture can we ensure customer satisfaction with software that works the first time, and all the time. Deployment and operations management concerns, just like testability, must be addressed at design time, so that these capabilities are meshed natively into the code rather than “bolted on” after the fact. Failing to do so will likely impact the delivery schedule, or worse, create outages in production.

More importantly, there is so much we can learn from observing how our code behaves in Production: operational efficiency, stability, performance, usability, that we would do a disservice to ourselves if we did not avail ourselves of this valuable information to drive further improvements to our product.

Scalable Software Architecture for a Startup

Say we are the founders of a startup and we just got a big fat check for our A-round funding. The VCs love our idea, and we all know that our app will attract millions of users in no time. This means that from day one we architect for millions of page-views per day…

But wait … do we really need to deploy Hadoop now? Do we need to design for geographical redundancy now? OR should we just build something that’s going to take us through the next 3 months, so that we can focus our energy on customer development and fine-tuning our product features? …

This is a dilemma that most startups face.

Architecting for Scale

The main argument for architecting for scale from the get-go is akin to: “do it right the first time”: we know that lots of users will be using our app, so we want to be ready when they come, and we certainly don’t want the site going down just as our product catches fire.

In addition, for those of us who have been through the pain of a complete rewrite, a rewrite is something we want to avoid at all costs: it is a complex task that is fun under the right circumstances, but very painful under time pressure, e.g. when the current version of the product is breaking under load, and we risk turning away customers, potentially for ever.

On a more modest level, working on big complex problems keeps the engineering team motivated, and working on bleeding or leading edge technology makes it easier to attract talent.

Keeping It Simple

On the other hand, keeping the technology as simple as possible allows the engineering team to be responsive to the product team during the customer development phase. If you believe, as I do, one of Steve Blank’s principles of customer development: “No Business Plan Survives First Contact with Customers”, then you need to prepare for its corollary namely: “no initial product roadmap survives first contact with customers”. Said differently, attempting to optimize the product for scale until the company has reached clear validation of its business assumptions, and product roadmap, is premature.

On the contrary, the most important qualities that are needed from the Engineering team in the early stages of the company are velocity and adaptability. Velocity, in order to reduce time-to-market, and adaptability, so that the team can rapidly adapt to feedback from “outside the building”.

Spending time designing and implementing a scalable architecture is time that is Not spent responding to customer needs. Similarly, having built a complex system makes it more difficult to adapt to changes.

Worst of all, the investment in early optimization may be all for naught: as the product evolves with customer feedback, so do the scalability constraints.

Case Study: Cloudtalk

I lived through such an example at Cloudtalk. Cloudtalk is designed as a social communication platform with emphasis on voice. The first 2 products “Cloudtalk” and “Let’s Talk” are mobile apps that implement various flavors of group messaging with voice (as well as text and other media). Predicint rapid success, Cloudtalk was designed around the highly scalable noSQL database Cassandra.

I came on board to launch “Just Sayin”, another mobile app that runs on the same backend (very astute design). Just Sayin is targeted to celebrities and allows them to cross-post voice messages to Twitter and Facebook. One of my initial tasks coming on board was to scale the app, and it was suggested that we needed it to move it to Amazon Web Services so that we can scale rapidly as more celebrities (such as Ricky Gervais) adopt our product. However, a quick analysis revealed that unlike the first two products (Let’s Talk and Cloudtalk), Just Sayin’ impact on the database was relatively light, because communications were 1-to-many (e.g. Lady Gaga to her 10M fans). Rather, in order to scale, we first needed a Content Delivery Network (CDN) so that we could feed the millions of fans the messages from their celebrities with low response time.

Furthermore, while Cassandra is a great product, it was somewhat immature at the time (stability, management tools) and consequently slowed down our development. It also took us a long time to train new engineers.

While Cassandra will have been a good choice in the long run, we would have been better served in the formative stages of the company to use more established technology like mySQL. Our velocity in developing new features, and our ability to respond to changes in product strategy would have been significantly faster.

Architecting for Scale is a Process, not an Event

A startup needs to earn the right to design for scale, by first proving that it has found a legitimate market. During this first phase adaptability and velocity are its most important attributes.

This being said, we also need to anticipate that we will need to scale the system at some point. Here is how I like to approach the problem:

  • First of all, scaling is an on-going process. Even if traffic increases dramatically over a short period of time, not all parts of the system need to be scaled at the same time. Yet, as usage increases, it is likely that any point in time, some part of the system will need to be scaled.
  • In order to avoid complete rewrites of the system, we need to break it into independent components. This allows us to redesign each component independently, and have different teams work on different problems concurrently. As a consequence, good modularization of the system is much more important early on, than designing for scale
  • Every release cycle needs to budget time and resources for redesign – including both modularization and scalability. This is just like maintenance on the Golden Gate bridge: the painters are always working; when they finish at one end, they start all over at the other end.
  • We need to treat our software architecture the same way, and budget maintenance work every release cycle: dollars, time, people. CEOs have to be trained to not only think about the “shiny features” – those that are customer-facing – but also about the “continuous improvements” of the architecture that has to be factored in every release cycle.
  • We also need to instrument the code to tell us were it is under strain. Unlike the Golden Gate bridge, we can’t always see where it’s breaking, or even rationalize it. Scaling sometimes works in mysterious ways that are not always obvious to predict.

 

In summary, designing for scale is a high-class problem, on which we only get to work once we have demonstrated true demand for our product. During this first phase, velocity and adaptability are critical, and are better served with well-understood technologies, and a well modularized design. Once our product reaches an adoption phase, then designing for scale is a continuous process that hopefully can be focused on individual modules in turn – guided by proper instrumentation of the code

 

Migrating a Self-hosted Architecture to the Cloud

While it may possible to migrate a self-hosted architecture to the cloud with servers in identical configuration, it almost certainly will lead to a sub-optimal architecture in terms of performance, and higher costs, in some cases prohibitively so.

The common objectives for moving to the cloud are:

  • Ability to scale transparently as the business grows
  • Reduce costs
  • Benefit from a word-class IT infrastructure without having to hire the talent

 

We’ll focus on the first two objectives as the third one is achieved – by nature – the moment you flip the switch to the cloud.

Memory Drives Pricing in the Cloud – not CPU

in the cloud, whether with Amazon EC2 or other vendors, the primary dimension driving pricing is the amount of memory (RAM) available in the server. In addition, CPU allocated is roughly proportional to the amount of RAM.

For example, as of this writing, per the Amazon EC2 pricing and the Amazon EC2 Instance Types definitions:

  • A Small instance has (only) 1.7 GB of RAM – and 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit) and costs: $0.08 per hour On-Demand
  • An Extra Large instance is 8 times bigger than a small instance and costs 8 times as much: 15 GB of RAM – and 8 EC2 Compute Unit and costs: $0.64 per hour On-Demand
  • In order to get 32 GB of RAM, one needs to move to the High-Memory Double Extra Large Instance aka m2.2xlarge: $0.90 per hour On-Demand

Note that the prices quoted here are for US East (N. Virginia) zone. Prices for US West (Northern California) are more expensive (about 12% based on a few data points I correlated)

Database Servers

Database servers have unique requirements:

  • They require fast I/O to disk. While Amazon recommends using networked storage, this is typically not practical from a performance perspective.
  • So database servers also require large local disks: to hold the data
  • Most databases require a fair amount of memory at least 16 GB. We use Cassandra, and they recommend 32 GB per server.
  • They hate noisy neighbors (see previous blos). While virtualization technology does a fairly good job at partitioning CPU and RAM, it does a much poorer job at sharing I/O bandwidth. Having another virtual machine running on your database server can kill its performance, even if the neighbor does not do much, it can kill the I/O efficiency. All the tricks that databases use to optimize I/O performance assume that the database is in control of all I/O busses.

As a consequence, one should first of all use a reserved instance – simply because the cost of getting data in and out of the local disks makes it impossible to set-up / tear-down database servers “at will”.

Secondly, one should buy a large enough instance (e.g. m2.4xlarge) so that we are the only tenant on the server. This will cost $7,203 per year – based on Heavy Utilization Reserved Instances pricing, and get us 68.4 GB memory, 4 cores (8 virtual – with Intel Hyper-Threading) and 1.69 TB of local storage.

SSD

As Adrian Cockcroft from Netflix illustrates in his detailed post: Benchmarking High Performance I/O with SSD for Cassandra on AWS, moving to SSD instances for I/O and compute intensive systems can bring significant cost reductions. In his example, he compares a traditional system with 36 x m2.xlarge + 48 x m2.4xlarge instances at a cost of $772,806 (Total 3 Year Heavy Use Cost) – with a 15 x hi1.4xlarge system at a cost of $354,405 – a 54% savings.

As the article illustrates, selecting one versus the other requires careful understanding of the computational profile of the application, and some changes in the application’s architecture

Do I want to use Proprietary Amazon Solutions?

Following the logic that motivates us to move to the Cloud forces to consider using Amazon proprietary solutions: reduce need for sys admin talent, leverage out-of-the-box a scalable high-availability solution, etc

  • Should I replace mySQL with RDS? Or Casssandra, HBase with DynamoDB?
  • Should I replace my ActiveMQ (e.g.) message queue with SQS?
  • … and similarly for AWS many products

 

These are excellent products, battle tested by Amazon. However, there are 2 very important considerations to examine:

  • First, these products are obviously proprietary – making a move to another cloud provider like Rackspace or Joyent, will be take an extensive code rewrite. This may turn out to be impractical.
  • Secondly, cost can be a (bad) surprise once the application is deployed live. For both RDS or SQS, pricing is driven by data bandwidth AND the number of operations performed using the service – which requires careful analysis to estimate ahead of time. For example, polling every 10 seconds to check whether new data is present in SQS generates 250K operations per month (assuming each check requires only 1 operation). This is fine if this function is performed by a few servers, but would break the bank if it’s performed by 100,000 end-user clients. This adds up to $25,000 ($0.000001 per Request).

Algorithm Tuning and Server Selection

More generally, Amazon offers seven families of servers: Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, and High I/O (SSD). Porting an existing application will thus require an iterative process evaluating the following questions:

  • How do I best match each of my system’s components with an Amazon instance types>
  • Can I fine-tune, or even re-write, my algorithms to maximize RAM & CPU utilization? In particular, would I make the same memory vs computation trade-offs? Do I need this hash-table, or can I re-compute the query?
  • How does my architecture evolve as I scale out? For example, do I need to replicate shared resources – like caches – or will sharding (e.g.) avoid this duplication of data – which will directly impact my cost since pricing is memory driven. An algorithm may work best using a approach favoring memory (and minimizing CPU) when running on a single server but it may be more cost-effective when optimized for memory when scaled out over many servers.
  • How do new technologies like SSD impact my architecture? As the Netflix article illustrates, the cost impact can be radical, but it required substantial architecture redesign, not just a simple server replacement

 

In conclusion, moving from a hosted environment (where each server can be configured at will) to the cloud where servers come in pre-determined configurations requires not only an architecture review, but a sophisticated excel spreadsheet to compare the costs of various architectures. This upfront financial modeling is absolutely necessary in order to avoid unpleasant surprises as the business scales up.

Deploying to the Cloud? Hang on to your Trousers!

My team and I have spent the past months investigating a deployment to the Cloud with vendors such as Amazon, Rackspace, GoGrid … to name a few who provide Infrastructure As A Service (IaaS).

A few conclusions have surfaced:

  • One needs to clear about one’s motivations to migrate to the Cloud- different motivations will lead to different outcomes, for a given product
  • It is almost impossible to predict the cost of a cloud-hosted system – without deploying a test system with the selected vendor. As a corollary, precise comparison shopping is almost impossible.
  • It is almost impossible to design, let alone deploy, your system architecture – without prior hands-on experimentation with your selected vendor. Also, the optimal architecture once deployed in the Cloud is likely to be radically different than one deployed on your own servers.
  • Some Cloud vendors are moving aggressively up the value chain by offering innovative software technologies on top of their infrastructure. They are thus becoming PaaS (Platform As A Service) vendors. For example, as we commented in a previous post “Is Amazon After Oracle and Microsoft?” Amazon is deploying an array of software technologies – combined with services – that are tailored specifically for the Cloud, and are technically very advanced

We expand each of these points in upcoming posts, starting with the first one today.

The main arguments advanced in favor of a cloud infrastructure are:

  • Offload the system management responsibilities to the Cloud services provider:
    This is more than an economic trade-off: managing systems for high-volume Internet applications is a complex task requiring a broad set of technical skills – where said skills are in permanent evolution. Acquiring all these skills typically requires multiple engineers with varied backgrounds: computer hardware, operating systems, storage, networking, scripting, security, etc. These system administrators have been in high-demand for the past couple of years, demand high compensation, and usually want to work for companies which offer challenging work … namely those with a very large number of systems. As a result, some companies are simply unable to hire the necessary system administration talent in-house, and are forced to move to the Cloud for this single reason.
  • Leverage best practices established by Cloud vendors.
    Cloud services providers have optimized every aspect of running a datacenter. For example, Facebook released the Open Compute project in 2011 for Server and Data Center Technology. RackSpace launched the OpenStack initiative in late 2010, to standardize and share software for Compute (systems management, Storage, Media, Security, as well as Identity and Dashboard. Even managing systems at a hosting provider requires constant tuning of system management tools –  whereas a Cloud service provider will take on this burden
  • Benefit from the economies of scale that the Cloud vendors have created for themselves
    Building data centers, finding cheap sources of power, buying and racking computers, creating high-bandwidth links to the Internet, etc. are all activities whose cost drops with volume. However, to me, the impact of price is much smaller than that of pure skills. The aforementioned tasks are becoming more and more complex, to the point where only the largest companies are capable of investing enough to keep up with the state-of-the-art.
    In particular, Cloud vendors offer high-availability and recoverability “for free” – namely: free from a technical perspective, but not from a financial one.
  • Ability to rapidly scale systems up or down according to load
    This is one of the main theoretical benefits of the cloud. However, it requires a few architectural components to be in place:
    (a) the software architecture has to be truly scalable and free of bottlenecks. For example, traditional N-tier architectures were advertised to be scalable because web servers could be added easily. Unfortunately, the database rapidly becomes the throttling component as the load rises. Scaling up traditional database sub-systems, while maintaining high-availability , is both difficult and expensive.
    (b) Tools and algorithms are required to detect variations in load, and to provision/decommission the appropriate servers. This requires a good understanding of how each component of the system contributes to the performance of the whole system. The complexity increases when the performance of components does not behave linearly with load.
    (c) Data repositories are slow and expensive to migrate. For example, doubling the size of a Cassandra (noSQL database) cluster is time consuming, uses a lot of bandwidth (for which the vendor may charge) and creates load on the nodes in the cluster.
  • Ability to create/delete complete system instances (most useful to development and testing)
    The Cloud definitely meets this promise for the front-end and business logic layers, but if an instance requires a large amount of data to be populated, you must either pay the time & cost at each deployment or keep the data tier up at all times.  This being said, deploying complete instances in the Cloud is still a lot cheaper and faster than doing it in one’s data center, assuming it can be done at all.
  • The Cloud is cheaper:
    This is a simple proposition, with a complex answer. As we’ll examine in the next blog: figuring out pricing in the cloud is a lot more complex than adding the cost of servers.

Appreciating the business and technical drivers that motivate a migration to the Cloud will drive how we approach the next steps in the process: system architecture design, vendor selection, and pricing analysis. As always, different goals will lead to different outcomes.