Deploying to the Cloud? Hang on to your Trousers!

My team and I have spent the past few months investigating a deployment to the Cloud with vendors such as Amazon, Rackspace, and GoGrid, to name a few of the providers of Infrastructure as a Service (IaaS).

A few conclusions have surfaced:

  • One needs to be clear about one’s motivations for migrating to the Cloud – different motivations will lead to different outcomes for a given product
  • It is almost impossible to predict the cost of a cloud-hosted system – without deploying a test system with the selected vendor. As a corollary, precise comparison shopping is almost impossible.
  • It is almost impossible to design, let alone deploy, your system architecture – without prior hands-on experimentation with your selected vendor. Also, the optimal architecture, once deployed in the Cloud, is likely to be radically different from one deployed on your own servers.
  • Some Cloud vendors are moving aggressively up the value chain by offering innovative software technologies on top of their infrastructure. They are thus becoming PaaS (Platform as a Service) vendors. For example, as we commented in a previous post, “Is Amazon After Oracle and Microsoft?”, Amazon is deploying an array of software technologies – combined with services – that are tailored specifically for the Cloud and are technically very advanced.

We will expand on each of these points in upcoming posts, starting with the first one today.

The main arguments advanced in favor of a cloud infrastructure are:

  • Offload the system management responsibilities to the Cloud services provider:
    This is more than an economic trade-off: managing systems for high-volume Internet applications is a complex task requiring a broad set of technical skills – skills that are in permanent evolution. Acquiring all these skills typically requires multiple engineers with varied backgrounds: computer hardware, operating systems, storage, networking, scripting, security, etc. These system administrators have been in high demand for the past couple of years, command high compensation, and usually want to work for companies that offer challenging work … namely those with a very large number of systems. As a result, some companies are simply unable to hire the necessary system administration talent in-house, and are forced to move to the Cloud for this reason alone.
  • Leverage best practices established by Cloud vendors.
    Cloud services providers have optimized every aspect of running a datacenter. For example, Facebook released the Open Compute Project in 2011 for server and data center technology. Rackspace launched the OpenStack initiative in 2010 to standardize and share software for Compute (systems management), Storage, Media, Security, as well as Identity and Dashboard. Even managing systems at a hosting provider requires constant tuning of system management tools – whereas a Cloud service provider will take on this burden.
  • Benefit from the economies of scale that the Cloud vendors have created for themselves
    Building data centers, finding cheap sources of power, buying and racking computers, creating high-bandwidth links to the Internet, etc. are all activities whose cost drops with volume. However, to me, the impact of price is much smaller than that of pure skills. The aforementioned tasks are becoming more and more complex, to the point where only the largest companies are capable of investing enough to keep up with the state-of-the-art.
    In particular, Cloud vendors offer high-availability and recoverability “for free” – namely: free from a technical perspective, but not from a financial one.
  • Ability to rapidly scale systems up or down according to load
    This is one of the main theoretical benefits of the cloud. However, it requires a few architectural components to be in place:
    (a) the software architecture has to be truly scalable and free of bottlenecks. For example, traditional N-tier architectures were advertised to be scalable because web servers could be added easily. Unfortunately, the database rapidly becomes the throttling component as the load rises. Scaling up traditional database sub-systems, while maintaining high availability, is both difficult and expensive.
    (b) Tools and algorithms are required to detect variations in load, and to provision or decommission the appropriate servers (a minimal sketch of such a control loop follows after this list). This requires a good understanding of how each component of the system contributes to the performance of the whole system. The complexity increases when the performance of components does not scale linearly with load.
    (c) Data repositories are slow and expensive to migrate. For example, doubling the size of a Cassandra (NoSQL database) cluster is time-consuming, uses a lot of bandwidth (for which the vendor may charge), and creates load on the existing nodes in the cluster.
  • Ability to create/delete complete system instances (most useful to development and testing)
    The Cloud definitely meets this promise for the front-end and business logic layers, but if an instance requires a large amount of data to be populated, you must either pay the time & cost at each deployment or keep the data tier up at all times.  This being said, deploying complete instances in the Cloud is still a lot cheaper and faster than doing it in one’s data center, assuming it can be done at all.
  • The Cloud is cheaper:
    This is a simple proposition, with a complex answer. As we’ll examine in the next blog: figuring out pricing in the cloud is a lot more complex than adding the cost of servers.
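
As promised under point (b) above, here is a minimal sketch, in Python, of the kind of load-driven control loop that such provisioning requires. The thresholds, the get_average_cpu() metric source, and the add_server()/remove_server() hooks are all hypothetical placeholders, not any particular vendor’s API.

```python
import time

# Hypothetical thresholds: real values must come from load-testing the
# actual system, since components rarely behave linearly under load.
SCALE_UP_CPU = 0.75      # add capacity above 75% average CPU
SCALE_DOWN_CPU = 0.25    # remove capacity below 25% average CPU
COOLDOWN_SECONDS = 300   # wait between scaling actions to avoid thrashing


def autoscale_loop(cluster):
    """Naive control loop: watch load, grow or shrink the web tier."""
    last_action = 0.0
    while True:
        cpu = cluster.get_average_cpu()   # assumed monitoring hook
        now = time.time()
        if now - last_action >= COOLDOWN_SECONDS:
            if cpu > SCALE_UP_CPU:
                cluster.add_server()      # assumed provisioning hook
                last_action = now
            elif cpu < SCALE_DOWN_CPU and cluster.size() > cluster.min_size():
                cluster.remove_server()   # assumed decommissioning hook
                last_action = now
        time.sleep(60)                    # sample the load once a minute
```

Even this toy version shows why point (b) is harder than it sounds: the thresholds and the cooldown only make sense once you understand how each tier actually behaves under load.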

Appreciating the business and technical drivers that motivate a migration to the Cloud will drive how we approach the next steps in the process: system architecture design, vendor selection, and pricing analysis. As always, different goals will lead to different outcomes.

Is Amazon After Oracle and Microsoft?

Amazon is quietly, slowly, but surely becoming a software vendor (in addition to being the largest e-tailer), with product offerings that compete directly with, and in some cases are broader than, those of “traditional” software vendors such as Oracle and Microsoft.

For example, a simple review of Amazon’s product list shows no fewer than three database options: Amazon Relational Database Service (RDS), SimpleDB, and DynamoDB (launched earlier this year), the last of which offers almost infinite scale and reliability.
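
To give a sense of what a managed database means in practice, here is a minimal sketch, using the boto3 Python SDK, that creates a DynamoDB table and writes an item; the table name, key, and item contents are invented for the example, and AWS credentials are assumed to be configured.

```python
import boto3

# Assumes AWS credentials and a default region are already configured.
dynamodb = boto3.resource("dynamodb")

# Create a table keyed on a single hash attribute. Throughput is provisioned
# up front; Amazon handles servers, storage, and replication behind the API.
table = dynamodb.create_table(
    TableName="customers",  # hypothetical table name
    KeySchema=[{"AttributeName": "customer_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "customer_id", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# Write and read back a single item.
table.put_item(Item={"customer_id": "c-1001", "name": "Alice"})
print(table.get_item(Key={"customer_id": "c-1001"})["Item"])
```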

Amazon also offers an in-memory cache – ElastiCache. You can also use their “Simple” services: Workflow Service (SWF) – e.g. for business processes, Queue Service (SQS) – for asynchronous inter-process communications, Notification Service (SNS) – for push notifications, as well as email (SES). Amazon calls them all “simple”, yet a number of startups have been built, gone public, or been acquired over the past couple of decades on the basis of just one of these products: PointCast, Tibco, IronPort, to name a few.
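
To illustrate just how “simple” the queue service is to consume, here is a hedged sketch of asynchronous producer/consumer communication over SQS using the boto3 Python SDK; the queue name and message body are invented for the example.

```python
import boto3

# Assumes AWS credentials and a default region are already configured.
sqs = boto3.client("sqs")

# Producer side: create (or look up) a queue and enqueue a message.
queue_url = sqs.create_queue(QueueName="orders")["QueueUrl"]  # hypothetical queue
sqs.send_message(QueueUrl=queue_url, MessageBody="order-42:ship")

# Consumer side: long-poll for a message, process it, then delete it so it
# is not redelivered to another worker.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    print("processing", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```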

This is not all … Amazon offers additional services in other product categories: storage, of course, with S3 and EBS (Elastic Block Store), Web traffic monitoring, identity management, load balancing, application containers, payment services (FPS), billing software (DevPay), backup software, content delivery network, MapReduce … my head spins trying to name all the companies whose business is to provide just a single one of these products.
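
For instance, storing and retrieving an object in S3 takes only a few lines; below is a hedged sketch using the boto3 Python SDK, where the bucket name and key are made up and the bucket is assumed to already exist.

```python
import boto3

# Assumes AWS credentials, a default region, and an existing bucket.
s3 = boto3.client("s3")
BUCKET = "example-backups"  # hypothetical bucket name

# Upload a small object, then read it back.
s3.put_object(Bucket=BUCKET, Key="reports/2012-05.txt", Body=b"monthly report")
body = s3.get_object(Bucket=BUCKET, Key="reports/2012-05.txt")["Body"].read()
print(body.decode())
```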

Furthermore, Amazon is not just packaging mature technologies and slapping a “cloud” label on them. Some of them, like DynamoDB, are truly leading edge. Yet, what is most impressive, and where Amazon’s offering is arguably superior to that of Oracle, Microsoft or the product category competitors, is that Amazon commits to supporting and deploying these products at “Internet scale” – namely as large as they are. This is not only a software “tour-de-force” but also an operational one – as anyone who has tried to run high-availability and high-throughput Oracle or SQL Server clusters can testify.

Given the breadth of its products and its ability to operate them at Internet scale with high availability, Amazon could become the default software stack: a foundation on which to architect products, displacing traditional stacks such as .NET, LAMP, or {MySQL, Oracle}-Java-Apache-JavaScript.

The cost of deploying software on the Amazon stack is another story … and the topic of a future post.

Setting Expectations about Formal Releases with the Business Team

Product Management sets features and priorities – Engineering sets schedule … and meets schedule

While the business team may desire, or be obligated, to fix both the date of a release and the features that will comprise it, it is our job as Engineers to educate them on how unrealistic this approach is [under the assumption that staffing cannot be increased and quality is not negotiable]. It is also our job to offer alternatives.

By working collaboratively, we can redefine the desired outcome in a way that still meets the business needs and allows for a speedier implementation, without compromising quality.

The fundamental rule of engagement is that …

Product Management Sets Features And Priorities — Engineering Estimates The Schedule … And Meets The Schedule.

Engineering is a key contributor to the product roadmap, sometimes even the primary contributor. However, the Product Management (PM) team has, by definition, the ownership of the business derived from the product, and as such they call the shots when it comes to the definition of features and priorities.

On the other hand, PM has no business pronouncing how long it will take to implement the desired feature sets – this is Engineering’s purview. This is no different from dealing with a contractor to remodel our kitchen: we can tell them what we want the kitchen to look like, but they are not going to do business with us if we dictate how much they can charge and how quickly the job will be completed.

Why? It simply boils down to ownership and accountability.

Engineers, by and large, are good sports and will do whatever they can to meet even the most unrealistic schedule. Ultimately, however, hard work cannot compensate for a schedule that is plainly not feasible.

Since it is Engineering’s job to develop the product, having anyone other than Engineering make estimates, or worse, commitments about schedule, removes the ownership and accountability of meeting the schedule from the Engineering team.

Schedules Only Have Value If There Is A Reasonable Expectation That They Will Be Met.

We set a schedule for the release of a product for a reason: so that other teams inside (e.g. marketing, sales) and outside the company (e.g. customers, partners) can make their own plans based on the availability of the product at a given date. If we don’t meet the schedule, these other teams will have to redo their plans and will resent us for it. Worse, if we establish a habit of missing schedules, they will stop making plans and simply wait and see until the product is actually delivered. This can create a vicious circle where the Engineering team sees that the Sales team does not plan on the product being ready on time, and thus does not feel the pressure to deliver on time – which in turn reinforces the Sales team’s attitude.

The best way – by far – to build reliable schedules is to let the engineers who are responsible for the delivery of the product estimate their own schedule. There are two reasons for this: first, the estimates will be more reliable; second, the engineers then have ownership of the schedule.

I have worked with a few CEOs who did not trust Engineering with estimates, and who were convinced that giving impossible tasks to the Engineering team ensured that they could get every last drop of blood/work out of the team. This is plain wrong.

Once in a while you can indeed rally the troops to meet an impossible target and “save the company”. Over time, however, this quickly becomes counter-productive. People will not accept arbitrary challenges and will simply disengage.

On the other hand, when empowered to estimate their own schedule, Engineers will then feel accountable for meeting it, and it will become a matter of personal, and team, pride to deliver on-time. Furthermore, this fosters a culture of success – and a virtuous circle of people being able to rely on teammates’ commitments.

Schedule And Feature Set Result From A Collaborative Effort

Pushing the kitchen remodeling analogy further: usually, the first bid is too expensive. What follows is a discussion about how flexible the dates are, what the critical elements driving the dates and price are, and a series of “what if …” discussions. The same needs to happen between Engineering and Product Management: what flexibility do we have in the dates? For example, will the customer to whom we promised certain features actually go into production on that date, or will they accept a beta release because they need to do their own tests in a staging environment? Are all the options and variations of a particular capability required by the release date, or can some of them be pushed out to the next release?

One of the most satisfying moments of the job is when product managers and engineers brainstorm on how to meet the business needs of our customers in innovative ways. By bringing these two teams together and sharing knowledge – of customer needs, of why a certain feature requires a lot of work, or of how softening a specific requirement makes the implementation much easier – we truly create value for the company. At the end of such a brainstorm, we have optimized both the features we offer AND the effort required to deliver them. The continuous repetition of this exercise allows us to deliver more, faster.


The final ingredient is to provide transparency into the Engineering process – which is frequently considered a black box. It is because of this lack of visibility that people on the outside tend to buy themselves insurance and ask for a schedule that is more aggressive than it needs to be. In the next blog, we will show how Agile software engineering provides, among other things, not only visibility into the progress of the Engineering team, but also the ability to adjust course.

Who Owns Quality? Part 1

Understanding what role in the Engineering team owns quality is critical to determining how we run our projects

Over the past twelve years, I have had the opportunity to lead the Engineering team in over a half-dozen companies, and have observed an incredible variance in how engineers answered this question: “Who owns quality?”

At only one of the companies I joined did the answer match my own.

In my experience, answering this question properly – and building corresponding software engineering processes – is critical. How an Engineering team addresses the ownership of quality has fundamental implications on how it operates. It impacts just about everything!

  • The daily tasks of each developer
  • The daily tasks of each QA engineer
  • The selection of software development tools and artifacts
  • The sequencing of tasks in software releases
  • The ability of the team to deliver a quality product on time

The vast majority of answers fall into two bins: it is either “Everybody” or “QA”.

While it is hard to argue against the philosophy that everyone owns quality, this is an empty, and non-actionable, answer. When “everybody” is responsible, no one takes responsibility.

QA certainly has a big role to play in ensuring that we deliver high quality products. However, there is a fundamental reason why QA does not own Quality: they have little control over it: QA does not write the code, developers do. Asking QA to own quality is akin to asking the proverbial blind man to define the elephant! Asking QA to own quality implies a process where Quality is added after the fact, once the code has been written. Let us remember what QA stands for: Quality Assurance, not Quality Addition, or Quality Creation.

We all know that quality has to be built in, not added on.

To me, the right answer is: Developers Own Quality.

… to be continued