Why Digital Transformations Fail – the Monolith Syndrome

Previously published on Silicon Valley Software Group Insights in March 2023.

A number of our engagements come from clients who experience a similar pattern of symptoms: release velocity is trending down, critical bugs pop up with each release, yet hiring more developers does not seem to improve anything. In parallel, the digital imperative, which has gained momentum over the past couple of years, whether imposed by the pandemic, or simply overall evolution, keeps building the pressure: consumers require a flawless digital experience. When the technology team does not deliver, the consequences for the business are painful: customers are disappointed, competition edges ahead and, even more heartbreaking, our clients are unable to capture the demand that their marketing has generated.

The goal of this post is to inform both CEOs and CTOs on how to diagnose what we term the “Monolith Syndrome”. As with any condition, early diagnosis vastly improves the chances of success. It is thus critical for CEOs and CTOs to know how to recognize this pattern, and take the necessary early actions. Further, it often falls on the CEO to identify the situation, because the CTO is usually consumed in trying to just keep up.

Symptoms

The symptoms of what we term the “Monolith Syndrome” look like this:

  • The application’s response time keeps degrading;
  • Outages are becoming more frequent;
  • As outages occur, new feature requests do not get delivered. Customer complaints rise;
  • Re-prioritization of the product roadmap occurs before the main features of the previous roadmap are delivered (because they took too long);
  • Distrust between the executive and the technology teams grows.

Like any challenge, each company faces its own flavor of the “Monolith Syndrome”, yet to the experienced eye, the pattern is easily recognizable. More fundamentally, it is absolutely normal: it occurs when a company has grown into a new stage of maturity – where a new way of running the business, including the technology, is now necessary. Like most living organisms, when looking on a short time horizon, companies grow incrementally. However, when taking a step back, discrete stages become evident. On the technical front, transitioning between maturity stages calls for what is called a “Digital Transformation”.

There are multiple scenarios that require a digital transformation; the Monolith Syndrome is one of them. We will explore the others in subsequent posts.

Causes

From a technical perspective, the root causes of the “Monolith Syndrome” are often a combination of:

  • The architecture of the current codebase was developed more than five years ago, and has changed little since;
  • The code is built on a single codebase and uses a single database – hence the term “monolith”;
  • Development expediency has been the priority, which has led to poorly organized code, little documentation, few tests, and even fewer automated tools for QA, release and operational management;
  • Critical areas of functionality are implemented in “dark code”: code that was written by developers who are no longer employed by the company, and which current developers are scared to touch, because the code is difficult to understand and there is no documentation.

The Monolith Syndrome encapsulates scenarios of pain when the technology team cannot keep up with the needs of the business through “business as usual”. We described the symptoms above in technical terms. Yet, the underlying cause is that the company has grown into a different maturity level – where “what got you here” no longer works.

To be clear, a monolithic codebase is usually the right way to go in the early stages of a company: there are a handful of developers, a manageable number of lines of code, and few features that are quick to test manually. Yet, at some point in the company’s growth, the nimbleness and expediency become a detriment rather than an asset. For example, it becomes cumbersome to develop, let alone release, when twenty-plus developers are writing code in a monolith: different developers’ new code interacts in ways that create unforeseen bugs.

The underlying cause of the Monolith Syndrome is that the company has grown into a different maturity level, but not the technology team.

As a company battles through the Monolith Syndrome, the CEO and CTO have a heart-to-heart: the CEO asks “what do you need to develop new features faster?” – to which the CTO invariably answers “I need more engineers”, and then proceeds to build a “better monolith”, i.e., continue to work on the same codebase with the same processes and tools. Yet with poor architecture, software organization, and documentation, the extra developers only create more confusion and barely accelerate development velocity. The root cause of this lack of progress is that the business side has gone through a change of paradigm, but not the technology team.

Again, this is why it is the CEO, who understands the business context, who needs to recognize the pattern.

The Proper Mindset

In order for the transformation to be successful, everyone needs to have the proper mindset:

  • Recognize that this effort is the “price of success”. Understand that current architecture, code, tools, etc. were not a mistake – no one deserves blame. On the contrary, they were optimal for the previous stage of maturity. Now that the business has grown, and evolved, technology also has to transform to a more mature architecture.
  • The goal of the transformation is not to update to the latest and greatest technologies, but rather to identify the technologies most appropriate for the foreseeable needs of the business.
  • The transformation will require a set of skills that is typically not present in-house. Rare are the CTOs who have successfully led digital transformations. Hence, it is usually wise to enlist the help of technical leaders who do have this experience.

SVSG’s Framework

SVSG uses the following framework:

  • Re-align the technology to the business: understand the main stakeholder journeys (customer and employee), which have likely evolved since the current architecture was designed.
  • Design the architecture – and data models – before coding, based on the new stakeholder experiences, as well as needs for scale, resilience, security, etc.
  • Incorporate the full business context such as scale, security, resiliency, etc.
  • Design an incremental migration path from the current state to the desired state. For example, start breaking up the monolith by creating one additional microservice, and validate its design before moving on to a second microservice.
  • Evangelize that the transformation goes beyond architecture and code. The whole development process, from end to end, must align with the company’s new stage of growth.
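
The incremental migration step can be sketched as a thin routing layer that sends one extracted capability to a new microservice while everything else still goes to the monolith. This is a minimal illustration of what is commonly called a strangler facade, not SVSG's actual tooling; the handler functions, the "/billing" prefix and the MIGRATED_PREFIXES table are all hypothetical.

```python
from typing import Callable, Dict


# Hypothetical handlers standing in for the two backends.
def monolith_handler(path: str) -> str:
    return f"monolith handled {path}"


def billing_service_handler(path: str) -> str:
    return f"billing-service handled {path}"


# Only prefixes listed here have been migrated. The table grows one
# microservice at a time, after each new design has been validated.
MIGRATED_PREFIXES: Dict[str, Callable[[str], str]] = {
    "/billing": billing_service_handler,
}


def route(path: str) -> str:
    """Send migrated paths to their new service, everything else to the monolith."""
    for prefix, handler in MIGRATED_PREFIXES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return handler(path)
    return monolith_handler(path)


if __name__ == "__main__":
    print(route("/billing/invoices/42"))
    print(route("/catalog/items"))
```

Because the routing table is the only thing that changes when a service is extracted, a problematic migration can be rolled back by removing a single entry.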

Final Thoughts

Digital transformations are rare events in the life of a company. Technology leaders are usually selected and trained to design and build technology incrementally. Unless you have gone through it before, detecting that your company might be experiencing the Monolith Syndrome is an unusual, and difficult, challenge for both CTOs and CEOs; but when the symptoms arise, it’s important to act swiftly if the business is to keep up with its growth.

Technical Due Diligence For Companies On The Cusp Of High Growth

Published on Forbes Technology Council 12/27/2022

You are ecstatic: You just executed a term sheet with a startup, which, thanks to your large investment, will grow two to three times each year for the foreseeable future (i.e., two years). Now begins the hard work of ensuring that the CTO delivers the technology and features laid out on the product roadmap. Yet, sustaining high growth, defined (arbitrarily here) as growing revenues at more than 100% per year for at least two years, requires a different playbook than a more mundane growth rate. For example, bigger hardware may accommodate the first doubling of traffic, but the second or third will likely require substantially different software and data architectures, which need to be planned long in advance.

While it is not an investor’s job to identify or address these challenges, the return on investment will ultimately depend on how well and how timely the portfolio company manages them. This article provides pointers on what investors should know and look out for during technical due diligence, as well as post-investment.

The Difference Between High Growth And Regular Growth

In general, growing at a high rate raises four types of challenges.

• Tough Technical Challenges

Handling twice the traffic, with twice the amount of data stored, leads to a different category of problems, technically, compared to handling 10% more traffic. In addition, when you decide to build a new architecture because your traffic is doubling every year, you actually need to design for 10 times the traffic so that you do not go through the same exercise again each year.
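
The arithmetic behind this rule of thumb can be sketched as follows. Assuming traffic grows by a constant factor each year (2x by default, matching the doubling scenario above), an architecture sized for N times today's traffic lasts log(N) / log(growth) years. The function name is illustrative only.

```python
import math


def years_of_headroom(capacity_multiple: float, yearly_growth: float = 2.0) -> float:
    """Years until traffic, multiplied by `yearly_growth` each year, reaches capacity."""
    if capacity_multiple <= 0 or yearly_growth <= 1:
        raise ValueError("capacity_multiple must be > 0 and yearly_growth > 1")
    return math.log(capacity_multiple) / math.log(yearly_growth)


if __name__ == "__main__":
    for multiple in (2, 4, 10):
        print(f"{multiple}x capacity lasts {years_of_headroom(multiple):.1f} years")
```

Under these assumptions, designing for 2x buys one year, while designing for 10x buys a little over three years.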

• Incremental Changes No Longer Effective

Changes need to be performed in discrete steps. As illustrated above, as traffic surges, incremental measures (e.g., bigger hardware) will keep the business going for a while, but a new architecture needs to be analyzed, designed, implemented and deployed rapidly. Because this work is complex, it needs to start early—well before the real pain starts. Furthermore, the transition to the new architecture often presents a more complex challenge than the new architecture itself.

• The Need For Everything To Change At Once

Along with technical changes in the architecture and the tech stack comes the need to deliver more features faster. This, in turn, requires more engineers as well as a new team organization, along with new tools and new processes.

• Changing Nonfunctional Requirements (NFR)

As the company grows and acquires bigger customers, securing data, meeting regulatory compliance, protecting privacy, preventing downtime and ensuring business continuity take on heightened importance. While security might not appear critical for a company managing $10 million worth of transactions, it becomes critical when $100 million flows through the platform. Growing companies often miss this because a slow evolution over time eventually adds up to a category-changing situation.

Where Technical Due Diligence Should Focus

The first step when reviewing a company prior to investment is to identify and quantify impediments to growth. For example, is the amount of technical debt such that even a minor increase in traffic or features will create serious risks of downtime? Do the CTO and the technical leadership have the talent and experience for the design and implementation of the next-generation architecture? Does the CTO have the business acumen, in addition to the technical expertise, to align technical operations with the evolving business?

Next, the plans for growth need to be examined. Are they aggressive enough in scope as well as technology to meet the anticipated growth? How well developed are the plans: Are they conceptual, or do detailed designs exist along with development plans? How robust is the new architecture design? Without detailed plans, the product roadmap is aspirational rather than achievable.

In our investigations, we often see parallel roadmaps for the product, technology and NFR, each assuming access to the same resources. This is a recipe for disaster; fuzzy resource plans lead to fuzzy budgets, misalignment with the CEO and confusion about the allocation of the newly invested funds. The worst-case scenario is to find out six months after a deal has closed that the engineering budget needs a 25% increase to deliver the product roadmap because the resources to upgrade the architecture, scalability or security were double counted.
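
One way to surface double counting is to total the engineering demand of every roadmap against a single capacity figure. The sketch below assumes each roadmap is expressed in engineer-months per quarter; the roadmap names and all numbers are made up for illustration.

```python
from typing import Dict


def capacity_shortfall(demand_by_roadmap: Dict[str, float], capacity: float) -> float:
    """Return engineer-months requested beyond capacity (0.0 if the plans fit)."""
    total_demand = sum(demand_by_roadmap.values())
    return max(0.0, total_demand - capacity)


if __name__ == "__main__":
    # Hypothetical quarterly demand, in engineer-months.
    demand = {"product": 30.0, "technology": 6.0, "nfr": 4.0}
    capacity = 32.0  # hypothetical: what the team can deliver per quarter
    shortfall = capacity_shortfall(demand, capacity)
    print(f"Shortfall: {shortfall} engineer-months "
          f"({shortfall / capacity:.0%} over capacity)")
```

With these illustrative figures, the three roadmaps together ask for 40 engineer-months against a capacity of 32, a 25% overrun that stays invisible while each roadmap is reviewed on its own.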

Recruiting and new employee onboarding are often overlooked activities, but when they’re performed poorly, they are a huge, yet hidden, drain on productivity. Because high growth often entails increasing the size of the team quickly, engineers must spend time interviewing prospects. When the recruiting process is poor, candidates do not meet standards, and desirable prospects accept offers from other companies.

As a consequence, engineers end up spending a lot more time in interviews, and building the team takes longer than it should, thus delaying the product roadmap. In addition, frustration builds because time spent in interviews is rarely factored into project scoping, causing delays in projects. Investing time upfront in building efficient recruiting and onboarding processes will be recovered many times over.

Companies rarely have everything figured out. The purpose of the review is not to give a “beauty contest” score but rather to determine whether critical changes need to take place before the company is ready to fully “step on the accelerator,” as well as how much these changes will cost and how long they will take. Getting technical debt to an acceptable level, hiring a new CTO, building a baseline of automated regression tests—all these projects can easily take one or two quarters and commensurately affect the growth rate and revenue.

Conclusion

High growth differs materially from traditional growth by the breadth and speed of the changes that are needed, thus requiring a different playbook. Investors need to know whether a company is ready from day one, whether it will require time to pay down technical debt and whether its growth plans are ready for execution. A lack of readiness can easily consume two quarters, which is a long time in the startup world. It may determine whether the company will dominate its market or get edged out by a faster competitor.

How To Maximize The Value Of Technical Due Diligence

Previously published on Forbes on 11/16/2021

Technical due diligence (TDD) is typically requested by investors prior to closing a growth-stage investment or when acquiring a company. A smart investor should expect a lot more out of TDD than a “yes or no” answer to the question “Are there any red flags that warrant canceling the investment or acquisition?” 

Instead, as I highlighted previously in my article “The Art of Technical Due Diligence,” “Technical due diligence should provide actionable information about the upcoming 24 months, including critical dependencies, risk factors and major technical milestones that will usher in product milestones.” 

TDD allows future board members to track technical milestones and thus anticipate the financial ones. Technical milestones typically precede some of the financial milestones by three to six months — for example, when software needs to be re-architected to deliver the scale to serve the expected growth. 

A good technical due diligence identifies: 

• When and where the past is no longer a predictor of the future.

• What new skills will need to be developed in the technology and product teams.

• What new risks need to be handled.

Here are some examples: 

Scale will hit a wall.

This is almost a universal concern in technical due diligence projects. The deal is based on four times or 10 times revenue growth in the next 24 months, but can the software keep up? If the answer is “no,” investors will want to know what it will take to meet the growth projections: architecture redesign, implementation plan along with schedule, resources and budget estimates.

There is a large amount of technical debt.

Only close inspection of the code by a talented CTO can identify whether the code is ready for the next phase of growth. Some of the more frequent scenarios include:

• The company is generating millions of dollars of revenues on code based on its first prototype, typically a monolith, with layers of dead code that supported use cases that were abandoned in the quest for product-market-fit. This not only impacts operational performance but also hinders development velocity once the team grows beyond a dozen developers.

• The code base is “legacy” and poorly maintained. This often happens with companies that were early on the market, persevered through years of slow growth and now suddenly take off. The code is based on old technology, has been updated — expediently — over time by different teams of developers and has poor documentation. In this situation, a rewrite from scratch is usually the only practical solution.

• For enterprise companies, another common scenario occurs when the software and the data storage are still single-tenant. Transitioning to a multi-tenant architecture is a problem with a known solution, but it is time-consuming and costly.

Development velocity will tank.

Probably the hardest transition to navigate for a startup is when the size of the userbase dictates that quality trumps new features. When a company has a large number of customers, the cost of a serious bug — let alone a DOA release — becomes prohibitive.

This is when test automation and CI/CD automation (including Infrastructure as Code) need to be deployed, which is usually a painful process because existing code must be “retrofitted” with automated regression tests. In addition, development velocity temporarily stalls before accelerating again once a critical mass of automation has been reached.
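
Retrofitting usually starts with characterization tests that pin down what the existing code does today, before anyone refactors it. The sketch below uses Python's standard unittest module; legacy_shipping_cost is a hypothetical stand-in for untested legacy code, and its pricing rule is invented for the example.

```python
import unittest


def legacy_shipping_cost(weight_kg: float, express: bool) -> float:
    """Hypothetical legacy function whose current behavior we want to lock in."""
    cost = 5.0 + 1.5 * weight_kg
    if express:
        cost *= 2
    return round(cost, 2)


class ShippingCostRegressionTest(unittest.TestCase):
    """Record current outputs so that refactors cannot change them silently."""

    def test_standard_shipping(self):
        self.assertEqual(legacy_shipping_cost(2.0, express=False), 8.0)

    def test_express_doubles_cost(self):
        self.assertEqual(legacy_shipping_cost(2.0, express=True), 16.0)
```

Saved in a file, these tests can be run with "python -m unittest" and wired into the CI pipeline, so that every later change to the legacy code is checked against the recorded behavior.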

Another common scenario occurs when the target company is developing products like “three founders in a garage,” i.e., with very little documentation, limited QA, manual deployments. Scaling the team will require changing processes as well as attitudes and, possibly, the CTO.

Risk arbitration is drastically different.

A company with one million users should look at security — and business continuity — very differently than a company that has 10,000 users. At the risk of oversimplifying, the cost of implementing state-of-the-art security is the same in both scenarios, yet the ROI is different: The cost of being hacked is much greater for the former than the latter. Similarly for business continuity: The cost of a one-day outage may be acceptable for the latter company, but may kill the former company.

One of the companies we reviewed at my organization had grown organically from a prototype to one that stored hundreds of thousands of credit cards in its database. Because the growth had been organic and moderate, no one in the executive team noticed that the company had reached a scale where a hacker could destroy the company.

There is an inefficient development process.

An often-overlooked factor affecting development velocity is the alignment, or misalignment, between the executive team, product team and technology team.

This shows up in two ways: a product road map that is aspirational (i.e., dates are not backed up by engineering estimates) and a product road map that zig-zags (i.e., changes every quarter). This situation is normal, and possibly desired, when the company is searching for product-market-fit but counterproductive when it is attempting to conquer the large market that it has discovered.

Moving from chasing opportunities to a mode where formal business cases for new features are developed cooperatively is challenging for the company’s leadership but essential to ensure stability in the product road map, which, in turn, allows the technology team to develop a technology road map as well as predictable releases.

Conclusion

None of the issues presented above are deal killers, but they can lead to a modification of the terms of the deal. For example, investors may want to increase their investment to cover the rewrite of major components of the products. In all situations, even with a well-performing technical team, TDD delivers a list of major milestones that can be tracked by the investors as the company grows.

How Machine Learning Will Disrupt The Established Cloud Providers

Previously published in Forbes on October 24, 2017

In the past few years, new categories of products have emerged thanks to the extraordinary advances in machine learning (ML) and deep learning (DL). These new techniques power product recommendations, computer-aided diagnosis in medical imaging and self-driving cars, just to name a few.

Most ML and DL algorithms require compute profiles (hardware, software, storage, networking) that are significantly different from those optimized for traditional applications. Consequently, as more and more companies develop their own ML/DL solutions and deploy them to production, the demand for the ML-optimized compute resources will grow dramatically and create opportunities for new entrants to offer solutions that compete with today’s dominant cloud providers: Amazon AWS, Microsoft Azure and Google Cloud.

The ML/DL Cloud Is Different

In an article on Mesosphere’s blog page, Edward Hsu presented the case that web applications are now primarily data-driven. Consequently, a new set of frameworks (a.k.a. stacks), namely SMACK (Spark, Mesos, Akka, Cassandra, Kafka), must replace the traditional LAMP (Linux, Apache, MySQL, PHP) stack used to build web-based applications. In my view, rather than replacing LAMP, SMACK will coexist side by side with, and feed data to, traditional web-based frameworks, which are still needed to present nice-looking webpages and interface with mobile phones.

Yet the main point is well-taken. We need to update Marc Andreessen’s famous line about how “Software is eating the world” to “Data is eating the world.” Let’s unpack this statement and derive the consequences.

Hardware

The disruption created by machine learning and deep learning extends well beyond the software stack into chips, servers and cloud providers. This disruption is rooted in the simple fact that GPUs are much more efficient processors for ML and DL than traditional CPUs.

Up until recently, the solution was to augment traditional servers with GPU add-on cards. We are now at a point where demand for ML/DL computing is such that special-purpose servers, optimized for ML/DL compute loads, are being built.

Data centers are also being re-architected to support the extremely large amount of data consumed by ML and DL. Imagine you are designing the brains for self-driving cars. You need to process thousands and thousands of hours of video (and other such signals as GPS, gyroscopes, LIDAR) to train your algorithms. The amount of data that a Tesla on the road records in one second is a million times larger than a tweet or a post on Facebook.

ML/DL data centers thus require both huge amounts of storage and extremely high bandwidth.

Software

The software side is even more complex. A new infrastructure stack, typically using machine learning-specific frameworks such as TensorFlow (originally developed by Google) or PyTorch (originally developed at Facebook), is required to shepherd data around and manage the execution of the compute jobs. Furthermore, open-source code libraries (pandas, scikit-learn, matplotlib) are used to implement the models (e.g., neural networks, data displays). These model libraries are critical because they are optimized to be both easy to use for algorithm research and offer high performance for use in production.

Finally, each vendor offers complete building blocks for specific use cases. For example, Amazon Lex, Google Cloud Speech and Microsoft Bing Speech provide speech recognition and can even recognize intent. Each has its own API and unique behavior, making the migration from one vendor to the other time-consuming.

New Entrants

In addition to the Big Three cloud providers (Amazon AWS, Microsoft Azure and Google Cloud) that have offered GPU-accelerated instances for a few years, new ML-optimized offerings have emerged:

• NVIDIA, which is already the dominant provider of GPUs that power the graphics cards that drive computer displays, recently introduced a portfolio of “purpose-built AI supercomputers” known as its DGX systems.

• Servers.com offers its Prisma Cloud with dedicated GPU-optimized servers.

• Rescale, one of the niche cloud providers that focuses on high-performance computing (HPC), just announced the availability of the latest generation of GPU-powered servers, along with high-bandwidth interconnect, to create high-performance multi-node clusters.

What’s At Stake

The Big Three cloud providers are the ones most immediately at risk of being disrupted by new entrants such as NVIDIA, Servers.com and Rescale. ML/DL innovation is still running at a torrid pace thanks to innovation in algorithms as well as compute efficiency. This is creating a small arms race where end users are constantly looking for the provider that can give that extra edge.

On one hand, end users are benefiting hugely from this arms race to provide the best software and hardware compute environment. On the other, this requires constant vigilance to keep abreast of the latest offerings. Even more importantly, when deploying ML/DL products to production, CEOs and CTOs need to pick the winner — or at least a future survivor — that will keep their edge for the next two to five years. This is not an easy task.

We will delve deeper into these two topics in future posts — stay tuned.

The Machine Learning Imperative

Previously published in Forbes on June 28, 2017

There’s no longer a debate as to whether companies should invest in machine learning (ML); rather, the question is, “Do you have a valid reason not to invest in ML now?”

Machine learning is here, and it’s finally mature enough to cause a major seismic shift in virtually every industry. For example, Matt Swanson, founder of SVSG, wrote an article last year about how chatbots will disrupt a $200 billion industry. While ML cannot solve every problem, it has demonstrated a game-changing impact in enough markets that every CEO and CTO must ask themselves whether they understand ML well enough to rule it out for their own business. While appreciating the rewards of ML may be difficult, we do know the risks: ML has already disrupted several industries, including e-commerce, autonomous driving and customer engagement. The risk of ignoring ML today is one that is probably too large for any established company to take.

Machine Learning Changes The Game

While artificial intelligence grabs most of the spotlight in discussions about machine learning (primarily due to its easily graspable life-altering implications), it is but one of many disciplines in ML. Big data has demonstrated the enormous value of data: Netflix and Amazon recommend films and products based on our own purchase history and those of customers like us. Thus, big data has helped us answer questions we already knew to ask, questions such as, “What more can I sell to my customers?”

Machine learning allows us to make even better use of the data we have, as well as the data we don’t currently possess, and answer the questions we didn’t know we should ask.

Machine Learning Uses Data We Don’t Yet Have

Analytics and business intelligence extract information from structured data (i.e., data stored in databases: customer information, purchase history, etc.). But thanks to ML, we can now extract information from unstructured data such as texts, phone calls, images and videos.

Search engines used to return pages based on the exact words of the query. ML takes this text analysis a few steps further. First, it extracts concepts out of words and associates pages that discuss the same concept with different words: A search for “artificial intelligence” will produce results that mention machine learning and robotics but not explicitly the words “artificial intelligence.” Beyond this, ML is now becoming proficient at sentiment analysis and determining intent in a given context. This means that ML can deduce, via our posts on social media, if we are happy or angry (sentiment analysis), for whom we are likely to vote, or what purchase we are considering next (intent).

Similarly, ML techniques like natural language processing (NLP) and image categorization interpret and translate people’s speech as well as the content of images (e.g., facial recognition on Facebook).

This means that, thanks to ML, the huge amount of publicly available content — which, up until recently, was of little use — can now give us useful new insights.

Machine Learning Makes Better Use Of The Data We Have

Machine learning provides a new class of algorithms that manipulates structured data that we already possess. AWS has a nice blog, including code, on how to build a prediction engine for customer churn. BlackRock is using machines to manage funds.

In addition, data that every company gathers from its customers (emails, chats, comments, support requests, etc.) can now be analyzed by ML to extract accurate customer sentiment (satisfaction with the service, suggestions, identifying emergency requests). Even polls and surveys may be replaced by ML algorithms that can mine Facebook, Twitter and news sites to capture the sentiment of millions of people expressing themselves openly.

Machine Learning Answers Questions We Didn’t Know To Ask

At the risk of stating the obvious, the power of machine learning is that it learns. The more information provided, the faster it learns and the better it answers.

While traditional business intelligence techniques can tell us how often products A and B are purchased together, these techniques fail in the face of a massive organization such as Amazon, which sells over 368 million products. However, ML can digest the flow of purchase transactions and identify patterns of joint purchases. ML can even use these predictions to automatically make purchase decisions (see German e-commerce merchant Otto as an example).
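
For reference, the traditional pair-counting approach mentioned above can be sketched in a few lines. The order data is hypothetical. This is the baseline technique, not an ML model: the number of candidate pairs grows quadratically with the size of the catalog, which is why it struggles at the scale described.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def co_purchase_counts(orders: Iterable[List[str]]) -> Counter:
    """Count how often each unordered pair of products appears in the same order."""
    pair_counts: Counter = Counter()
    for order in orders:
        # Deduplicate and sort so ("A", "B") and ("B", "A") share one key.
        for pair in combinations(sorted(set(order)), 2):
            pair_counts[pair] += 1
    return pair_counts


if __name__ == "__main__":
    # Hypothetical orders, each a list of product IDs.
    orders = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A"]]
    for pair, count in co_purchase_counts(orders).most_common(3):
        print(pair, count)
```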

Furthermore, by leveraging data we don’t have — such as stock market indices, weather data, political news and government statistics — we can correlate external events with our business data and thus enrich the accuracy of our predictions and decisions.

Why Now?

The rapid growth of machine learning leads to uncertainty, which may entice business leaders to hesitate in utilizing it. Yes, machine learning is complex, but it is also a powerful force of disruption. Because ML is still developing, it presents an opportunity to pull ahead of the competition by taking advantage of this maturation period. The choice is simple: disrupt or be disrupted.

It will take some time to ascertain what use cases are relevant to your company, so it is important to start this investigation now. ML is complex and challenging to master, yet the tools for machine learning are all readily available to you and are already being employed by Amazon, Google and Microsoft.

The journey to machine learning must start now.

DevOps-Driven Development

It is now time to add the concept of “DevOps-Driven Development” to our repertoire.

“Test-driven” development, which originated around the same time as Extreme Programming and Agile Development, encourages us to think about testing as we architect our software and plan our tasks. Similarly, a “DevOps-Driven Development” approach ensures that we consider operational implementation as well as deployment process during the design phase. To be clear, DevOps thinking needs to augment (and not replace) testing strategy.

Definition and Motivation

First a definition: I am using the word DevOps here as a shortcut to include both DevOps (build and deployment tools) and Ops (IT/data center Operations).

How many times have you heard “ … but it works on my machine!!” from a developer whose code was found to have a bug in the QA environment or, worse, in production? We all agree that these situations are a horrible waste of time for all involved, most of all customers. This post thus advocates that DevOps-thinking, just like quality-thinking, must occur at the design phase and continue throughout the development of the software until it is released to production, and even after it has been released.

Practicing DevOps-Driven Development

I have always advocated: “If you don’t know how to test it, you don’t know how to design it.” (Who Owns Quality? Part 3), to articulate the fact that “quality cannot be debugged out, it has to be designed in”. Similarly, if we want to know – before our customers call us – when our code crashes in Production, or becomes unusably slow, then we must build into our code the proper instrumentation and administration capabilities.

We now must add this mantra “If you don’t know how to deploy it and manage it in Production, you don’t know how to design it”.

Just like we don’t allow code to be merged into Trunk (main branch) without complete unit tests, code cannot be merged into Trunk without correct deployment scripts, release notes, and production instrumentation.

Here is a “thinking DevOps” check list:

Deployable

First of all, we must ensure that the code deploys successfully not only in Production but in all environments: Dev, QA, Stage, etc

This implies:

  • Developers write/update release notes: e.g. highlighting any changes required in the configuration of the environments: open new port, add a column in database, a new property in config files, etc
  • Developers in collaboration with DevOps team update deployment scripts, e.g. to account for a new executable, or schema changes in the database

The management of Config/Property files is beyond the scope of this blog, but I strongly recommend the “Infrastructure as code” approach: i.e. fully automating server/image configuration for deployment, and managing configuration, deployment scripts and application property files under source code control.

Monitor-able

If we want to detect problems before our (irate) customers call us, our code needs to be monitor-able – not only at the physical server level, but also each virtual machine, service and process, as well as networking and storage systems.

Monitor-ability needs go beyond keeping track of CPU load, disk space and network bandwidth. We, developers, (should) know which parameter(s) indicate that our system is misbehaving, whether it is a queue exceeding a given size, or certain operations timing out. As a consequence, we must publish these parameters to interfaces compatible with the monitoring tools used by Ops, of which there are several categories.

Furthermore, by making performance metrics easily observable, we ensure that each new release maintains (or improves) the performance of the prior release.
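As an illustration, here is a minimal sketch of how application-level parameters could be published for a monitoring agent to poll. The `Metrics` class, the metric names and the limits are all hypothetical; a real system would hand these values to whatever interface the Ops monitoring tool expects.

```python
import threading


class Metrics:
    """Hypothetical in-process registry of application-level health parameters."""

    def __init__(self):
        self._lock = threading.Lock()
        self._gauges = {}      # current values, e.g. queue depth
        self._counters = {}    # ever-increasing counts, e.g. timeouts
        self._limits = {}      # per-gauge alert limits chosen by the developers

    def set_gauge(self, name, value, limit=None):
        with self._lock:
            self._gauges[name] = value
            if limit is not None:
                self._limits[name] = limit

    def increment(self, name, amount=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def snapshot(self):
        """Return plain dicts that a monitoring agent could poll or scrape."""
        with self._lock:
            return {"gauges": dict(self._gauges), "counters": dict(self._counters)}

    def breaches(self):
        """Names of gauges currently above their declared limit."""
        with self._lock:
            return sorted(
                name for name, value in self._gauges.items()
                if name in self._limits and value > self._limits[name]
            )


metrics = Metrics()
metrics.set_gauge("order_queue_depth", 1200, limit=1000)  # a queue exceeding a given size
metrics.set_gauge("worker_threads_busy", 8, limit=16)
metrics.increment("payment_timeouts")                      # an operation timing out
```

The point of the design is that the developers, who know which parameters matter, declare both the value and its limit in code, so Ops does not have to guess.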

Diagnosable

Despite our best intentions, we must humbly assume that at some point our code will crash, or seriously misbehave, and thus require troubleshooting. In the worst case, Development will be called in (usually in the wee hours of the night) to assist the Ops team. As anyone who has had to figure out why a given system intermittently crashes will attest, having log files capture meaningful information prior to the incident is invaluable. Having to add logging statements after the fact is a painful process. Consequently, solid logging hygiene is critical (and worthy of a dedicated post):

  • Log statements must be written in a format compatible with the log management system (Splunk, GrayLog2, …)
  • All log statements used during the coding and QA phase must be removed
  • Comprehensive Operations-focused logging must be added to document all operations that may fail due to environmental and data-related problems: out-of-memory, disk full, time out, user not found, access denied, etc. These are not bugs, but failures due to either environment (e.g. a server or connection is down) or incorrect data (e.g. the user has been deleted).
  • The hierarchy of logging levels must be enforced so that in normal operations log files are kept small, and conversely meaningful information is output when troubleshooting is required
  • Log statements must include all the information necessary to bind all operations across various services that are related to a single user-level transaction (e.g. clicking on a link to a new page, adding an item to cart) – more details below in “Tunable”.
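A few of these rules can be sketched with Python's standard `logging` module. The JSON field names, the `txn_id` tag and the service name are assumptions made for illustration; the actual format should be whatever the log management system in use expects.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (field names are assumptions)."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "txn_id": getattr(record, "txn_id", None),  # tag binding entries to one user transaction
            "msg": record.getMessage(),
        })


def build_logger(service, stream, level=logging.WARNING):
    """Create a logger whose level can be lowered when troubleshooting is required."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


buffer = io.StringIO()   # stands in for the real log file
log = build_logger("cart-service", buffer)
log.debug("cache miss for sku 42", extra={"txn_id": "txn-001"})  # dropped in normal operations
log.error("user not found", extra={"txn_id": "txn-001"})          # kept: an operational failure
```

In normal operations only the error reaches the log; calling `build_logger` with `level=logging.DEBUG` would output the debug line as well.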

Security

This again is worthy of its own post, but code that is deployed to Production must both support the security practices implemented by the Ops team (e.g. Authentication protocols, networking infrastructure), and ensure that the code itself is secure (e.g. no SQL injection, buffer overflow, etc).
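To make the SQL injection example concrete, here is a small sketch using Python's built-in `sqlite3` module. The table, the user names and the two lookup functions are hypothetical; the contrast between string splicing and parameter binding is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "viewer")])


def find_user_unsafe(name):
    # Vulnerable: user input is spliced directly into the SQL text.
    return conn.execute(f"SELECT name FROM users WHERE name = '{name}'").fetchall()


def find_user(name):
    # Safe: the value is bound as a parameter, separately from the SQL text.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


malicious = "nobody' OR '1'='1"
```

With the `malicious` input, the unsafe version returns every row in the table, while the parameterized version returns nothing, because no user has that literal name.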

Business Continuity

Business continuity is often overlooked, but we must ensure that any persistent data is stored in a storage system that is backed up by the Ops team. In other words, if we add a new database, we’d better ask the Ops team to add it to their backup scripts.

Similarly, if our infrastructure is deployed (or even just deployable) across multiple data-centers, our code must support this through configuration.

The above requirements represent the basic DevOps requirements that any developer must address before even thinking that his/her code is ready to release. The following details additional practices that are highly recommended, but not strictly necessary.

Scalable

The code must be designed so that the Ops team can scale it in the datacenter without needing help from Development.

This may involve deploying the code to a bigger server. This implies that the code can be configured (and documented for the Ops team) to make use of the expanded resources, whether it is number of cores, RAM, threads, I/O, etc

This may also involve adding instances to a cluster. Consequently, the code must be discoverable (the load balancer must find out that a new instance has been added/subtracted), as well as cluster-aware (e.g. stateless).

Tunable

Because it is so hard to simulate all real-life user activities and behaviors in non-production environments, we must provide tools to the Ops team to tune the performance of our code through configuration rather than code deployment (e.g. size of JVM, number of threads, queue sizes, hash table size, etc).
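One simple way to expose such knobs, sketched below, is to read them from environment variables with documented defaults, so that Ops can change them without a code deployment. The variable names (`APP_WORKER_THREADS`, etc.) and the default values are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Tuning:
    """Tunables with their documented defaults (values are illustrative)."""
    worker_threads: int = 8
    queue_size: int = 1000
    request_timeout_s: float = 5.0


def load_tuning(env=None):
    """Read tunables from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    defaults = Tuning()
    return Tuning(
        worker_threads=int(env.get("APP_WORKER_THREADS", defaults.worker_threads)),
        queue_size=int(env.get("APP_QUEUE_SIZE", defaults.queue_size)),
        request_timeout_s=float(env.get("APP_REQUEST_TIMEOUT_S", defaults.request_timeout_s)),
    )
```

For example, `load_tuning({"APP_WORKER_THREADS": "32"})` raises the thread count to 32 and leaves the other two settings at their defaults.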

We must thus provide the metrics to observe performance. Let’s take the example of response time: depending on the complexity of the application a user request may be handled by tens, or even hundreds of services. In order to allow the Ops team to build a timeline of the interactions between all the services involved, each log entry must carry at least one tag that identifies the root transaction that generated the request. Otherwise it is impossible to determine whether the performance degradation comes from a given service, or a unique server, or even from the network infrastructure.

The same tagging will be used to troubleshoot failures (e.g. to discover why a given service fails intermittently).
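Here is a rough sketch of what the tag buys us once log entries from several services have been collected. The entry fields (`txn_id`, `start_ms`, `end_ms`) and the service names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical log entries already collected from several services.
entries = [
    {"txn_id": "t1", "service": "gateway",   "start_ms": 0,  "end_ms": 480},
    {"txn_id": "t1", "service": "catalog",   "start_ms": 10, "end_ms": 60},
    {"txn_id": "t2", "service": "gateway",   "start_ms": 5,  "end_ms": 90},
    {"txn_id": "t1", "service": "inventory", "start_ms": 65, "end_ms": 470},
]


def timelines(log_entries):
    """Group entries by root transaction and order them by start time."""
    grouped = defaultdict(list)
    for entry in log_entries:
        grouped[entry["txn_id"]].append(entry)
    return {txn: sorted(items, key=lambda e: e["start_ms"])
            for txn, items in grouped.items()}


def slowest_downstream(timeline, root="gateway"):
    """Return the non-root service that spent the most time on the request."""
    downstream = [e for e in timeline if e["service"] != root]
    if not downstream:
        return None
    worst = max(downstream, key=lambda e: e["end_ms"] - e["start_ms"])
    return worst["service"]
```

For transaction `t1`, the timeline shows that most of the 480 ms spent at the gateway was actually spent waiting on the inventory service. Without the shared tag, the four entries could not be related to each other at all.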

QA-able

As I mentioned in an earlier blog, QA does not stop in QA: we have to anticipate “unknown unknowns”, i.e. usage (or performance) scenarios that we have not modeled in our QA environments. By definition, there is not much we can do other than ensuring that our code is easy to trouble-shoot (see above) and that logs and associated data can be made available easily and rapidly to developers and QA team (e.g. by giving them access to the log management console).

Sometimes this requirement is more complex than it sounds, e.g. when user data must be deleted or obfuscated for privacy or security reasons. Again, this should be thought through before code is deployed.
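As one possible approach, e-mail addresses can be replaced with a keyed hash before logs are shared: the person is no longer identifiable, but all entries for the same user can still be correlated. The regular expression and the `user-` prefix are simplifying assumptions, and this sketch is not a complete privacy solution.

```python
import hashlib
import hmac
import re

# Deliberately simple pattern; real e-mail matching has more edge cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def pseudonymize(message, secret):
    """Replace each e-mail address with a stable keyed hash."""
    def replace(match):
        digest = hmac.new(secret, match.group(0).encode(), hashlib.sha256).hexdigest()
        return f"user-{digest[:12]}"
    return EMAIL.sub(replace, message)
```

Because the hash is keyed with a secret held by Ops, the same address always maps to the same pseudonym, yet developers and QA cannot recompute it from a guessed address.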

Analytics – Growth Hacking – Usability

This last requirement stems from Marketing and Sales rather than Operations, but it is equally important since it drives revenue growth.

In most companies, marketing and sales rely on usage reports to drive new marketing campaigns, pricing, product offerings and even new features. As a consequence, any new feature must integrate with the Analytics infrastructure whether via integration with usage tracking applications (e.g. Mixpanel, Flurry, …) or simply log management consoles (Splunk, GrayLog2, …). However, I highly recommend using separate logging infrastructure for operations monitoring and for usage analytics, if only because usage analytics requires additional data that is not useful for Operations monitoring (e.g. the time a user spends on a page is extremely valuable for usage analytics but irrelevant for Operations).

Even More So for Microservices

As we migrate towards a microservices architecture, early “DevOps thinking” becomes even more critical. As the “Microservices: Four Essential Checklists when Getting Started” advises: “Microservices introduces a lot of moving parts that were previously non-existent in a monolithic system”.

What was a monolithic application running in a single virtual machine can morph into 5, 10 or even 20 microservices. Consequently, Development, DevOps and Ops must collaborate on microservices infrastructure tools: service registration, scaling up/down each service independently, health monitoring, error detection, etc. to provide visibility on the status of these 20 microservices as a whole. This challenge has even prompted dedicated product categories (SignalFx, Nirmata, etc.).
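To give a feel for what service registration and health monitoring involve, here is a toy sketch of a heartbeat-based registry. The `ServiceRegistry` class, the 30-second time-to-live and the instance names are hypothetical; it is not modeled on any particular product.

```python
class ServiceRegistry:
    """Hypothetical registry: instances register by sending periodic heartbeats."""

    def __init__(self, ttl_s=30.0):
        self.ttl_s = ttl_s
        self._last_seen = {}   # (service, instance) -> time of last heartbeat

    def heartbeat(self, service, instance, now):
        self._last_seen[(service, instance)] = now

    def healthy(self, service, now):
        """Instances of a service whose last heartbeat is within the TTL."""
        return sorted(
            inst for (svc, inst), seen in self._last_seen.items()
            if svc == service and now - seen <= self.ttl_s
        )

    def status(self, now):
        """View across all services: (healthy instances, registered instances)."""
        services = {svc for svc, _ in self._last_seen}
        return {
            svc: (len(self.healthy(svc, now)),
                  sum(1 for s, _ in self._last_seen if s == svc))
            for svc in sorted(services)
        }
```

A load balancer would call `healthy()` to discover which instances can take traffic, while `status()` gives Ops the view of the microservices as a whole. Time is passed in explicitly here only to keep the sketch deterministic.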

Summary

Only with a holistic approach to product architecture can we ensure customer satisfaction with software that works the first time, and all the time. Deployment and operations management concerns, just like testability, must be addressed at design time, so that these capabilities are meshed natively into the code rather than “bolted on” after the fact. Failing to do so will likely impact the delivery schedule, or worse, create outages in production.

More importantly, there is so much we can learn from observing how our code behaves in Production: operational efficiency, stability, performance, usability, that we would do a disservice to ourselves if we did not avail ourselves of this valuable information to drive further improvements to our product.

Scalable Software Architecture for a Startup

Say we are the founders of a startup and we just got a big fat check for our A-round funding. The VCs love our idea, and we all know that our app will attract millions of users in no time. This means that from day one we architect for millions of page-views per day…

But wait … do we really need to deploy Hadoop now? Do we need to design for geographical redundancy now? OR should we just build something that’s going to take us through the next 3 months, so that we can focus our energy on customer development and fine-tuning our product features? …

This is a dilemma that most startups face.

Architecting for Scale

The main argument for architecting for scale from the get-go is akin to: “do it right the first time”: we know that lots of users will be using our app, so we want to be ready when they come, and we certainly don’t want the site going down just as our product catches fire.

In addition, for those of us who have been through the pain of a complete rewrite, a rewrite is something we want to avoid at all costs: it is a complex task that is fun under the right circumstances, but very painful under time pressure, e.g. when the current version of the product is breaking under load, and we risk turning away customers, potentially forever.

On a more modest level, working on big complex problems keeps the engineering team motivated, and working on bleeding or leading edge technology makes it easier to attract talent.

Keeping It Simple

On the other hand, keeping the technology as simple as possible allows the engineering team to be responsive to the product team during the customer development phase. If you believe, as I do, in one of Steve Blank’s principles of customer development: “No Business Plan Survives First Contact with Customers”, then you need to prepare for its corollary, namely: “no initial product roadmap survives first contact with customers”. Said differently, attempting to optimize the product for scale before the company has reached clear validation of its business assumptions, and product roadmap, is premature.

On the contrary, the most important qualities that are needed from the Engineering team in the early stages of the company are velocity and adaptability. Velocity, in order to reduce time-to-market, and adaptability, so that the team can rapidly adapt to feedback from “outside the building”.

Spending time designing and implementing a scalable architecture is time that is not spent responding to customer needs. Similarly, having built a complex system makes it more difficult to adapt to changes.

Worst of all, the investment in early optimization may be all for naught: as the product evolves with customer feedback, so do the scalability constraints.

Case Study: Cloudtalk

I lived through such an example at Cloudtalk. Cloudtalk is designed as a social communication platform with emphasis on voice. The first 2 products “Cloudtalk” and “Let’s Talk” are mobile apps that implement various flavors of group messaging with voice (as well as text and other media). Predicting rapid success, Cloudtalk was designed around the highly scalable NoSQL database Cassandra.

I came on board to launch “Just Sayin”, another mobile app that runs on the same backend (very astute design). Just Sayin is targeted to celebrities and allows them to cross-post voice messages to Twitter and Facebook. One of my initial tasks coming on board was to scale the app, and it was suggested that we needed to move it to Amazon Web Services so that we could scale rapidly as more celebrities (such as Ricky Gervais) adopted our product. However, a quick analysis revealed that unlike the first two products (Let’s Talk and Cloudtalk), Just Sayin’s impact on the database was relatively light, because communications were 1-to-many (e.g. Lady Gaga to her 10M fans). Rather, in order to scale, we first needed a Content Delivery Network (CDN) so that we could feed the millions of fans the messages from their celebrities with low response time.

Furthermore, while Cassandra is a great product, it was somewhat immature at the time (stability, management tools) and consequently slowed down our development. It also took us a long time to train new engineers.

While Cassandra would have been a good choice in the long run, we would have been better served in the formative stages of the company by using more established technology like MySQL. Our velocity in developing new features, and our ability to respond to changes in product strategy, would have been significantly better.

Architecting for Scale is a Process, not an Event

A startup needs to earn the right to design for scale, by first proving that it has found a legitimate market. During this first phase adaptability and velocity are its most important attributes.

This being said, we also need to anticipate that we will need to scale the system at some point. Here is how I like to approach the problem:

  • First of all, scaling is an on-going process. Even if traffic increases dramatically over a short period of time, not all parts of the system need to be scaled at the same time. Yet, as usage increases, it is likely that at any point in time, some part of the system will need to be scaled.
  • In order to avoid complete rewrites of the system, we need to break it into independent components. This allows us to redesign each component independently, and have different teams work on different problems concurrently. As a consequence, good modularization of the system is much more important early on than designing for scale.
  • Every release cycle needs to budget time and resources for redesign – including both modularization and scalability. This is just like maintenance on the Golden Gate bridge: the painters are always working; when they finish at one end, they start all over at the other end.
  • We need to treat our software architecture the same way, and budget maintenance work every release cycle: dollars, time, people. CEOs have to be trained to not only think about the “shiny features” – those that are customer-facing – but also about the “continuous improvements” of the architecture that has to be factored in every release cycle.
  • We also need to instrument the code to tell us where it is under strain. Unlike the Golden Gate bridge, we can’t always see where it’s breaking, or even rationalize it. Scaling sometimes works in mysterious ways that are not always obvious to predict.
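The last point can start very small. The sketch below records how long each call takes per component, so that the next scaling effort can be aimed at the module that is actually under strain. The decorator name, the `timings` store and the `search` function are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)   # component name -> list of call durations in seconds


def timed(component):
    """Record how long each call takes, grouped by component."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are timed too.
                timings[component].append(time.perf_counter() - start)
        return wrapper
    return decorator


def hottest(top=1):
    """Components ranked by total time spent, highest first."""
    totals = {name: sum(values) for name, values in timings.items()}
    return sorted(totals, key=totals.get, reverse=True)[:top]


@timed("search")
def search(query):
    return [query.upper()]   # placeholder for real work
```

Because the system is modular, `hottest()` points at a component that can be redesigned on its own rather than at the system as a whole.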

 

In summary, designing for scale is a high-class problem, on which we only get to work once we have demonstrated true demand for our product. During this first phase, velocity and adaptability are critical, and are better served with well-understood technologies, and a well modularized design. Once our product reaches an adoption phase, then designing for scale is a continuous process that hopefully can be focused on individual modules in turn – guided by proper instrumentation of the code.

QA does not stop in QA

Quality Assurance does not stop after the software receives the “thumbs up” from the QA team. QA must continue while the product is Live! … because QA is not perfect, and real users only exist on a Production system. We need to be humble and accept that our design, development and quality processes will not catch all the issues. Consequently, we must equip ourselves with tools that will allow us to catch these problems in Production as early as possible … rather than “wait for the phone to ring”.

When the product exits QA, it simply means that we’ve run out of ideas on how to make the system fail. Unfortunately, this does not imply that the system, once in Production, will not fail. If we are successful and get a high volume of traffic, the simple law of large numbers guarantees that our users will find yet-never-thought-of ways to – unintentionally – make the system fail. These are part of the “unknown unknowns” as Mr. Donald Rumsfeld would say. Deploying the product on the production servers, and handing-off (abdicating?) the responsibility for keeping it up to the Ops team shows wishful thinking or naïveté, or both.

Why QA must continue in Production

There are a few categories of issues that one needs to anticipate in Production:

  • Functional defects: in essence, bugs that neither developers, nor QA caught – while this is the obvious category that comes to mind, it is far from being the only source of issues
  • User experience (UX) defects: the product works “as spec’d”, but users either can’t figure out how to make the product work, or don’t like it. A typical example is a high abandonment rate in a purchasing experience, or any kind of workflow, or a feature that’s never used, a button that’s never clicked.
    This is not reserved to new products: by improving the layout of a given page, we may have broken another feature on that same page.
  • Performance issues: while we may have run performance, and load tests, in our QA environments, the real world always offers surprises. Furthermore, if we are lucky enough to have the kind of traffic that Google or Facebook have, there is no other way but to test and fine-tune performance in production.
    Running tests on non-production systems requires simulating not only the load on the system, but also the “weight” of existing data (e.g. in database, file system), as well as longevity to ensure that there is no resource leak (memory, threads, etc.).
  • Operational issues: while all cloud applications are typically clustered for high-availability, there are other sources of failure than equipment failure:
    • External resources, such as partners, data feeds, can fail, or have bugs of their own, or simply not keep up their response time. Sometimes, the partner updates the API without notification.
    • User-provided data can be malformed, or in an unexpected format, or a new data format can be introduced after the launch of the product.
    • System resources can be consumed at an unexpected rate. Databases are notorious for having non-linear response times based on load: as long as the load is under a given threshold, response time stays low, but once the load exceeds this threshold, response time can deteriorate very rapidly.
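The non-linear behavior in the last point can be illustrated with the textbook M/M/1 queueing formula, where the mean response time is 1 / (service rate − arrival rate). This is a simplified model brought in purely for illustration, not a description of any particular database, and the capacity of 100 queries per second is a made-up number.

```python
def mean_response_time(arrival_rate, service_rate):
    """M/M/1 mean time in system: 1 / (service_rate - arrival_rate)."""
    if arrival_rate >= service_rate:
        return float("inf")   # the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


SERVICE_RATE = 100.0  # hypothetical capacity: 100 queries per second
for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    t_ms = 1000 * mean_response_time(utilization * SERVICE_RATE, SERVICE_RATE)
    print(f"{utilization:.0%} load -> {t_ms:.0f} ms")
```

Under this model, going from 50% to 80% load adds only 30 ms, while going from 90% to 99% adds 900 ms. This is why a system that looked comfortable in QA can fall over with a modest increase in Production traffic.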

 

A couple of examples:

  • At my previous company, weeks after the product had been launched, we started receiving occasional complaints that some of the user-created videos were not showing up in their timeline. After (reluctantly) poking around in our log files, we did find out that about 10% of the videos that had been uploaded to our site for the past 2 weeks (but not earlier) were not processed properly. Our transcoder simply failed. Worse, it failed silently. The root cause was a minor modification to the video format introduced by Apple after our product was released. Since this failure was occurring for a small fraction of our users, and we had no “operational instrumentation” in our code, it took us a long time to even become aware of it.
  • Recently, we launched a product that exchanges data with our partner. Their API is well documented, and we tested our product in their sandbox environment, as well as their production environment. However, after launch, we had reports of occasional failures. It turns out that users on our partner’s site were modifying the data in ways that we did not expect, and causing the API to return error codes that we had never seen. Our code duly logged this problem each time it occurred in our log files … among the thousands of other log events generated every minute.

Performing QA on Production Systems

As I mentioned, the Googles and Facebooks of the world do a lot (if not most) of their QA on Production systems. Because they run hundreds of thousands of servers, they can use a small subset to run tests with live user data. This is clearly a fantastic option.

Similarly, “A/B comparison” techniques are typically used in Marketing to compare 2 different user experiences, where the outcome (e.g. a purchase) can be measured. The same technique can be applied in testing, e.g. to validate that a fix for an intermittent, difficult-to-reproduce bug does work.
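As a sketch of how this could look in code: users are deterministically assigned to the build with the fix (“treatment”) or the current build (“control”), and failure rates are then compared per bucket. The hashing scheme, the 10% share and the function names are assumptions for the example.

```python
import hashlib


def bucket(user_id, experiment, treatment_share=0.1):
    """Deterministically assign a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    position = int(digest[:8], 16) / 0x100000000   # evenly spread over [0, 1)
    return "treatment" if position < treatment_share else "control"


def failure_rates(events):
    """events: iterable of (bucket_name, failed) pairs -> failure rate per bucket."""
    totals, failures = {}, {}
    for name, failed in events:
        totals[name] = totals.get(name, 0) + 1
        failures[name] = failures.get(name, 0) + int(failed)
    return {name: failures[name] / totals[name] for name in totals}
```

Hashing the user ID means a given user always lands in the same bucket, which keeps their experience consistent. With enough traffic, a failure rate that drops in the treatment bucket but not in the control bucket is good evidence that the fix works.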

 

More generally, Production code needs to be instrumented:

  • To detect failures, or QoS (Quality of Service) degradations, with internal causes (e.g. database is slowing down)
  • To detect failures, or QoS degradations, with external causes (e.g. partner API times out a lot)
  • To monitor resource utilization for each service or application – at a finer grain than provided by Operations monitoring tools which are typically at the server level.

The point is that if a user can’t buy a book on our website because our servers crash under load – this is a bug. While the crash is not due to code written incorrectly, it is due to the absence of code warning us that the system was running out of steam … this is still a bug.

 

In order to monitor quality in Production, we need to:

  • Clean up the code that writes to log files: eliminate all logging used for code testing, or statements such as “the code should never reach here”. Instead, write messages that will be meaningful to the poor soul who, a few weeks later, will be poring over megabytes of log files on a Sunday night trying to figure out why the system crashed
  • Ensure that log messages have consistent severity levels (e.g. as recommended by RFC 5424; Wikipedia has a nice table), so that meaningful alerts can be triggered
  • Use a log aggregation system, like GrayLog2 (open source), so log files from multiple nodes in the same cluster, as well as nodes from different services can (a) be searched from a console and (b) viewed, time-aligned, on a single page (critical for troubleshooting). GrayLog2 can handle hundreds of millions of log events and terabytes of data.
  • MEASURE: establish a base line for response time, resources consumption, errors – and trigger alerts when the metrics deviate from the baseline beyond a predetermined threshold
  • Track that core functions – from a user perspective – complete, and log when, and ideally, why, they fail along with key parameters. E.g.: are users able to upload files to our system, are failures related to file size, time of day, location of user, etc?
  • Log UX and operationally meaningful events to track how users actually use the system, what features are most used and track them over time. These metrics are critical for the Product Management team
  • Monitor resource utilization and correlate with usage patterns. Quantify key usage parameters in order to scale the right resources in advance of the demand. For example, as traffic grows, the media server and the database servers may grow at different rates.
  • Integrate alarms from application errors into the Ops monitoring tools: e.g. too many “can’t connect” errors should trigger an Ops alert that our partner is down – slow response time on a single server in a cluster may indicate the disk is failing
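The “MEASURE” step above can be sketched as follows. The rule used here (alert when a value is more than three standard deviations from the baseline mean) is only one possible choice of threshold, and the sample response times are made up.

```python
from statistics import mean, stdev


def deviates(baseline, current, max_sigmas=3.0):
    """True when `current` is more than `max_sigmas` standard deviations
    away from the mean of the baseline samples."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > max_sigmas * sigma


# Hypothetical response times (ms) sampled during normal operation.
baseline_ms = [118, 122, 120, 125, 119, 121, 117, 123]
```

With this baseline, a response time of 124 ms stays within the normal range, while 180 ms would trigger an alert. The baseline itself needs to be refreshed regularly, since normal changes as usage grows.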

 

Quality is not a one-time event, it is an everyday activity, because users change their behaviors, partners change their APIs, systems get full and slow down. What used to work yesterday may not work today, or may no longer be good enough for our customers. As a consequence, the concept of “test-driven” development must be extended to the Production systems, and our code must be instrumented to provide metrics that confirm that everything works as desired, and alerts when it doesn’t. But that’s not sufficient: developers and QA engineers must also take the time to look at the data, not just when a fire drill has been called, but also on a regular basis to understand how the system is being used, and how resources are consumed as the system scales, and apply this knowledge to subsequent releases.

Cloud Computing – The Miracle Tool for Testing

Cloud Computing eliminates restrictions due to the number of servers in the QA lab, and thus allows concurrent testing by developers and QA engineers. By making it easy to test often, and to expose early releases to the outside world, Cloud Computing will improve product quality.

Does this story ring familiar? You are in a planning meeting for the next release, and learn that in addition to supporting Oracle 11g, the product will also need to support Microsoft SQL Server 2008 (or DB2, or MySQL, or PostgreSQL). Once the typical brouhaha dies down about how complicated this will be, how the whole code will need to be ripped apart, and how much time this will take, the Director of QA turns to you and asks for a couple of additional servers for the QA lab, so that the software can be tested on the two databases in parallel; minimum of three servers: 1 for the database, 1 for our software, and 1 for the test fixtures. The following day, it’s the developer lead’s turn to ask for more servers: need at least 1 “populated” database against which the developers can test, plus another set up for the daily build, etc. Makes perfect sense … Except that no budget has been allocated for these servers! Soon you find yourself with your beggar’s cup in the CEO’s office, explaining to him, and the CFO, why your team needs these extra servers when “you already have so many!!”

Rejoice! Here comes Cloud Computing to the rescue …

Cloud Computing can not only eliminate the need to purchase servers for testing, but also radically improve your ability to test, and thus improve product quality.

Cloud Computing, such as Amazon EC2, offers the ability to deploy (and un-deploy) software on demand. One pays “by the hour” of computing used, and storage and bandwidth consumed. This is perfect for testing (by developers and by QA): compute load varies greatly over the cycle of the day, as well as the cycles of the release.

First of all, every developer can now have his/her own test setup against which to test. There is no limitation of hardware, no begging, borrowing or stealing from your colleagues for unutilized servers. One can just deploy at will. Furthermore, there is no restriction on the number of servers. So if you need to test a four-server cluster, you don’t have to hunt around for free servers, you just do it.

Similarly the daily build can deploy to multiple test environments concurrently and thus accelerate the validation of the build.

Finally, the QA team can also test in multiple environments simultaneously, e.g. Oracle and SQL Server at the same time! This offers the potential benefit of being able to test a much larger number of deployment scenarios, than would be possible using one’s own hardware.

Naturally, leveraging a Cloud Computing infrastructure requires new tools.

First and foremost, all the tests must be automated. While technology has created virtual servers, it has not yet invented virtual test engineers :-). Secondly, one will have to build tools to automatically deploy, e.g. from the build environment, the new version of the software, and the test fixtures, as well as collect the results of the test runs.

One can be quite creative with the test management tools. For example, if a test setup encounters a high-severity bug, you could configure your test software to pause the test, deploy to a second environment and continue testing in the second environment. This allows you to go back to the first test setup to troubleshoot, and find the cause of the crash.
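The “pause and continue elsewhere” idea could be sketched like this. The `run_suite` function, the three outcome strings and the environment names are hypothetical; in a real setup each test would deploy to, and run against, an actual cloud environment.

```python
def run_suite(tests, environments):
    """Run tests in order; on a high-severity failure, park that environment
    for troubleshooting and continue the remaining tests on the next one."""
    pool = list(environments)
    current = pool.pop(0)
    parked, results = [], []
    for name, test in tests:
        outcome = test(current)            # hypothetical: "pass", "fail" or "critical"
        results.append((name, current, outcome))
        if outcome == "critical":
            parked.append(current)         # left untouched for the developers
            if not pool:
                break                      # no spare environment left
            current = pool.pop(0)
    return results, parked
```

A run where the second test hits a critical bug on `env-a` leaves `env-a` in the `parked` list, frozen in its failed state, while the remaining tests carry on against `env-b`.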

Another fascinating advantage is that you can deploy demo or beta systems at will (assuming your deployment model allows it), and let your sales team or prospective customers “play with” the early release. By making it easier to expose early releases of the product to the outside world, Cloud Computing further improves the quality of your product.

Will you save money by testing in a Cloud Computing infrastructure?

Obviously the answer depends … on your usage, but also on factors like how much data you need to keep permanently in the cloud. For example you may need to permanently store a synthetic database of a million users (it would be too slow to upload it each time). You will also incur higher networking traffic.

In addition, you may not want to move all your tests to the cloud. For example, you may want to keep your stress-tests, or longevity tests in-house, since these will be running 24×7, and you may want the option of running them on bare-metal.
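A back-of-the-envelope calculation helps decide which tests to move. The sketch below compares pay-by-the-hour usage with the monthly cost of an owned server; all rates and costs are placeholders to be replaced with your own numbers, and networking charges are left out for simplicity.

```python
def monthly_cloud_cost(hours_per_day, hourly_rate, storage_gb=0.0, storage_rate=0.0, days=30):
    """Compute cost for the hours actually used, plus permanently stored data."""
    return hours_per_day * days * hourly_rate + storage_gb * storage_rate


def break_even_hours_per_day(owned_monthly_cost, hourly_rate, days=30):
    """Daily usage above which an owned server becomes cheaper than renting."""
    return owned_monthly_cost / (hourly_rate * days)
```

With a placeholder rate of 0.50 per hour and an owned server costing 180 per month all-in, the break-even point is 12 hours per day. Tests that run 8 hours a day are cheaper in the cloud, while 24×7 stress or longevity tests are cheaper in-house.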

At the end of the day, to me the attraction of Cloud Computing for testing is that it will increase quality (in addition to reducing costs). It will allow each developer to have access to a test environment at will.  It will create an additional impetus for test automation. Cloud Computing will also allow the concurrent deployment of tests to an arbitrary number of computing environments, and make it easier to give early access to your customers. Net-net, this translates to more tests in the same amount of time with less effort. It’s all goodness.