Scalable Software Architecture for a Startup

Say we are the founders of a startup and we just got a big fat check for our A-round funding. The VCs love our idea, and we all know that our app will attract millions of users in no time. This means that from day one we architect for millions of page-views per day…

But wait … do we really need to deploy Hadoop now? Do we need to design for geographical redundancy now? OR should we just build something that’s going to take us through the next 3 months, so that we can focus our energy on customer development and fine-tuning our product features? …

This is a dilemma that most startups face.

Architecting for Scale

The main argument for architecting for scale from the get-go amounts to “do it right the first time”: we know that lots of users will be using our app, so we want to be ready when they come, and we certainly don’t want the site going down just as our product catches fire.

In addition, those of us who have been through the pain of a complete rewrite want to avoid another one at all costs: it is a complex task that is fun under the right circumstances, but very painful under time pressure, e.g. when the current version of the product is breaking under load and we risk turning away customers, potentially forever.

On a more modest level, working on big complex problems keeps the engineering team motivated, and working on bleeding or leading edge technology makes it easier to attract talent.

Keeping It Simple

On the other hand, keeping the technology as simple as possible allows the engineering team to be responsive to the product team during the customer development phase. If you believe, as I do, in one of Steve Blank’s principles of customer development, “No Business Plan Survives First Contact with Customers”, then you need to prepare for its corollary, namely: “no initial product roadmap survives first contact with customers”. Said differently, attempting to optimize the product for scale before the company has clearly validated its business assumptions and product roadmap is premature.

On the contrary, the most important qualities that are needed from the Engineering team in the early stages of the company are velocity and adaptability. Velocity, in order to reduce time-to-market, and adaptability, so that the team can rapidly adapt to feedback from “outside the building”.

Spending time designing and implementing a scalable architecture is time that is not spent responding to customer needs. Similarly, a complex system, once built, is more difficult to adapt to changes.

Worst of all, the investment in early optimization may be all for naught: as the product evolves with customer feedback, so do the scalability constraints.

Case Study: Cloudtalk

I lived through such an example at Cloudtalk. Cloudtalk is designed as a social communication platform with emphasis on voice. The first 2 products, “Cloudtalk” and “Let’s Talk”, are mobile apps that implement various flavors of group messaging with voice (as well as text and other media). Predicting rapid success, the team designed Cloudtalk around the highly scalable NoSQL database Cassandra.

I came on board to launch “Just Sayin”, another mobile app that runs on the same backend (very astute design). Just Sayin is targeted to celebrities and allows them to cross-post voice messages to Twitter and Facebook. One of my initial tasks coming on board was to scale the app, and it was suggested that we needed to move it to Amazon Web Services so that we could scale rapidly as more celebrities (such as Ricky Gervais) adopted our product. However, a quick analysis revealed that unlike the first two products (Let’s Talk and Cloudtalk), Just Sayin’s impact on the database was relatively light, because communications were 1-to-many (e.g. Lady Gaga to her 10M fans). Rather, in order to scale, we first needed a Content Delivery Network (CDN) so that we could feed the millions of fans the messages from their celebrities with low response time.
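The reasoning behind that analysis can be sketched as back-of-the-envelope arithmetic. This is only an illustration: the function name and the 99% cache-hit figure are assumptions made up for the example, not measurements from Cloudtalk; only the 10M audience size comes from the example above.

```python
def origin_requests(listeners: int, cache_hit_percent: int) -> int:
    """Requests that still reach our own servers when a CDN serves the rest.

    cache_hit_percent is a hypothetical figure, not a measured one.
    """
    if not 0 <= cache_hit_percent <= 100:
        raise ValueError("cache_hit_percent must be between 0 and 100")
    # Integer arithmetic keeps the estimate exact.
    return listeners * (100 - cache_hit_percent) // 100


# One celebrity post is a single database write, whatever the audience size.
writes_per_post = 1

# The same message is then fetched by every fan (10M in the example above).
fans = 10_000_000

without_cdn = origin_requests(fans, cache_hit_percent=0)
with_cdn = origin_requests(fans, cache_hit_percent=99)  # assumed hit rate

print(f"writes per post:       {writes_per_post}")
print(f"reads without a CDN:   {without_cdn}")
print(f"reads reaching origin: {with_cdn}")
```

Under these assumptions the write side barely registers, while the read side is dominated by identical requests for the same content, which is exactly the kind of load a CDN absorbs.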

Furthermore, while Cassandra is a great product, it was somewhat immature at the time (stability, management tools) and consequently slowed down our development. It also took us a long time to train new engineers.

While Cassandra may well have been a good choice in the long run, we would have been better served in the formative stages of the company by a more established technology like MySQL. We would have developed new features significantly faster, and responded more quickly to changes in product strategy.

Architecting for Scale is a Process, not an Event

A startup needs to earn the right to design for scale, by first proving that it has found a legitimate market. During this first phase adaptability and velocity are its most important attributes.

This being said, we also need to anticipate that we will need to scale the system at some point. Here is how I like to approach the problem:

  • First of all, scaling is an on-going process. Even if traffic increases dramatically over a short period of time, not all parts of the system need to be scaled at the same time. Yet, as usage increases, it is likely that at any point in time, some part of the system will need to be scaled.
  • In order to avoid complete rewrites of the system, we need to break it into independent components. This allows us to redesign each component independently, and have different teams work on different problems concurrently. As a consequence, good modularization of the system is much more important early on than designing for scale.
  • Every release cycle needs to budget time and resources for redesign – including both modularization and scalability. This is just like maintenance on the Golden Gate bridge: the painters are always working; when they finish at one end, they start all over at the other end.
  • We need to treat our software architecture the same way, and budget maintenance work every release cycle: dollars, time, people. CEOs have to be trained to not only think about the “shiny features” – those that are customer-facing – but also about the “continuous improvements” of the architecture that has to be factored in every release cycle.
  • We also need to instrument the code to tell us where it is under strain. Unlike the Golden Gate bridge, we can’t always see where it’s breaking, or even reason our way to it. Scaling sometimes works in mysterious ways that are hard to predict.
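As a minimal sketch of the modularization and instrumentation points above, here is one way they could look in Python. Every name in it (MessageStore, InMemoryStore, Timings, post_message) is hypothetical and invented for illustration; it is not how Cloudtalk was built.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List, Protocol


class MessageStore(Protocol):
    """Hypothetical boundary between the app and whatever database sits behind it."""

    def save(self, user: str, text: str) -> None: ...

    def recent(self, user: str) -> List[str]: ...


class InMemoryStore:
    """Simple stand-in; a MySQL- or Cassandra-backed class could replace it later."""

    def __init__(self) -> None:
        self._messages: Dict[str, List[str]] = defaultdict(list)

    def save(self, user: str, text: str) -> None:
        self._messages[user].append(text)

    def recent(self, user: str) -> List[str]:
        return list(self._messages.get(user, []))


class Timings:
    """Collects wall-clock durations per named component."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def measure(self, component: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped call raises.
            self.samples[component].append(time.perf_counter() - start)

    def slowest(self) -> str:
        """Component with the highest average duration (needs at least one sample)."""
        return max(
            self.samples,
            key=lambda name: sum(self.samples[name]) / len(self.samples[name]),
        )


def post_message(store: MessageStore, timings: Timings, user: str, text: str) -> None:
    # The caller depends only on the MessageStore interface, so the storage
    # component can be redesigned without touching this function.
    with timings.measure("store.save"):
        store.save(user, text)


store = InMemoryStore()
timings = Timings()
post_message(store, timings, "celebrity", "hello fans")
print(store.recent("celebrity"), timings.slowest())
```

Because the rest of the code sees only the interface, the storage component can be rewritten on its own schedule, and the recorded timings give a first hint about which component deserves that attention.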

 

In summary, designing for scale is a high-class problem, on which we only get to work once we have demonstrated true demand for our product. During this first phase, velocity and adaptability are critical, and are better served by well-understood technologies and a well-modularized design. Once our product reaches an adoption phase, designing for scale becomes a continuous process that hopefully can be focused on individual modules in turn, guided by proper instrumentation of the code.