Migrating a Self-hosted Architecture to the Cloud

While it may possible to migrate a self-hosted architecture to the cloud with servers in identical configuration, it almost certainly will lead to a sub-optimal architecture in terms of performance, and higher costs, in some cases prohibitively so.

The common objectives for moving to the cloud are:

  • Ability to scale transparently as the business grows
  • Reduce costs
  • Benefit from a word-class IT infrastructure without having to hire the talent

 

We’ll focus on the first two objectives as the third one is achieved – by nature – the moment you flip the switch to the cloud.

Memory Drives Pricing in the Cloud – not CPU

in the cloud, whether with Amazon EC2 or other vendors, the primary dimension driving pricing is the amount of memory (RAM) available in the server. In addition, CPU allocated is roughly proportional to the amount of RAM.

For example, as of this writing, per the Amazon EC2 pricing and the Amazon EC2 Instance Types definitions:

  • A Small instance has (only) 1.7 GB of RAM – and 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit) and costs: $0.08 per hour On-Demand
  • An Extra Large instance is 8 times bigger than a small instance and costs 8 times as much: 15 GB of RAM – and 8 EC2 Compute Unit and costs: $0.64 per hour On-Demand
  • In order to get 32 GB of RAM, one needs to move to the High-Memory Double Extra Large Instance aka m2.2xlarge: $0.90 per hour On-Demand

Note that the prices quoted here are for US East (N. Virginia) zone. Prices for US West (Northern California) are more expensive (about 12% based on a few data points I correlated)

Database Servers

Database servers have unique requirements:

  • They require fast I/O to disk. While Amazon recommends using networked storage, this is typically not practical from a performance perspective.
  • So database servers also require large local disks: to hold the data
  • Most databases require a fair amount of memory at least 16 GB. We use Cassandra, and they recommend 32 GB per server.
  • They hate noisy neighbors (see previous blos). While virtualization technology does a fairly good job at partitioning CPU and RAM, it does a much poorer job at sharing I/O bandwidth. Having another virtual machine running on your database server can kill its performance, even if the neighbor does not do much, it can kill the I/O efficiency. All the tricks that databases use to optimize I/O performance assume that the database is in control of all I/O busses.

As a consequence, one should first of all use a reserved instance – simply because the cost of getting data in and out of the local disks makes it impossible to set-up / tear-down database servers “at will”.

Secondly, one should buy a large enough instance (e.g. m2.4xlarge) so that we are the only tenant on the server. This will cost $7,203 per year – based on Heavy Utilization Reserved Instances pricing, and get us 68.4 GB memory, 4 cores (8 virtual – with Intel Hyper-Threading) and 1.69 TB of local storage.

SSD

As Adrian Cockcroft from Netflix illustrates in his detailed post: Benchmarking High Performance I/O with SSD for Cassandra on AWS, moving to SSD instances for I/O and compute intensive systems can bring significant cost reductions. In his example, he compares a traditional system with 36 x m2.xlarge + 48 x m2.4xlarge instances at a cost of $772,806 (Total 3 Year Heavy Use Cost) – with a 15 x hi1.4xlarge system at a cost of $354,405 – a 54% savings.

As the article illustrates, selecting one versus the other requires careful understanding of the computational profile of the application, and some changes in the application’s architecture

Do I want to use Proprietary Amazon Solutions?

Following the logic that motivates us to move to the Cloud forces to consider using Amazon proprietary solutions: reduce need for sys admin talent, leverage out-of-the-box a scalable high-availability solution, etc

  • Should I replace mySQL with RDS? Or Casssandra, HBase with DynamoDB?
  • Should I replace my ActiveMQ (e.g.) message queue with SQS?
  • … and similarly for AWS many products

 

These are excellent products, battle tested by Amazon. However, there are 2 very important considerations to examine:

  • First, these products are obviously proprietary – making a move to another cloud provider like Rackspace or Joyent, will be take an extensive code rewrite. This may turn out to be impractical.
  • Secondly, cost can be a (bad) surprise once the application is deployed live. For both RDS or SQS, pricing is driven by data bandwidth AND the number of operations performed using the service – which requires careful analysis to estimate ahead of time. For example, polling every 10 seconds to check whether new data is present in SQS generates 250K operations per month (assuming each check requires only 1 operation). This is fine if this function is performed by a few servers, but would break the bank if it’s performed by 100,000 end-user clients. This adds up to $25,000 ($0.000001 per Request).

Algorithm Tuning and Server Selection

More generally, Amazon offers seven families of servers: Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, and High I/O (SSD). Porting an existing application will thus require an iterative process evaluating the following questions:

  • How do I best match each of my system’s components with an Amazon instance types>
  • Can I fine-tune, or even re-write, my algorithms to maximize RAM & CPU utilization? In particular, would I make the same memory vs computation trade-offs? Do I need this hash-table, or can I re-compute the query?
  • How does my architecture evolve as I scale out? For example, do I need to replicate shared resources – like caches – or will sharding (e.g.) avoid this duplication of data – which will directly impact my cost since pricing is memory driven. An algorithm may work best using a approach favoring memory (and minimizing CPU) when running on a single server but it may be more cost-effective when optimized for memory when scaled out over many servers.
  • How do new technologies like SSD impact my architecture? As the Netflix article illustrates, the cost impact can be radical, but it required substantial architecture redesign, not just a simple server replacement

 

In conclusion, moving from a hosted environment (where each server can be configured at will) to the cloud where servers come in pre-determined configurations requires not only an architecture review, but a sophisticated excel spreadsheet to compare the costs of various architectures. This upfront financial modeling is absolutely necessary in order to avoid unpleasant surprises as the business scales up.