Notes from SF Data Mining Meetup: Recommendation Engines

Excellent talks on how each of the presenting companies approaches the design of its recommendation engine based on the specifics of its market and users.

Recommendation Engines

Thursday, Apr 4, 2013, 6:30 PM

Pandora HQ
2101 Webster Street, Suite 1650 Oakland, CA

200 Data Scientists Went

6:30 – 7:00pm Social and Food
7:00 – 8:30pm Talks
8:30 – 9:00pm Social

We’re excited to have three sets of speakers:
1. Trulia: Todd Holloway will be giving a talk on Trulia Suggest.
2. Rich Relevance: John Jensen and Mike Sherman will be giving their perspectives on recommendation engines.
3. Pandora: Eric Bieschke will be giving his perspec…


Here are my notes on their respective technology stacks. Hadoop, Hive, Memcached, and Java are used by all three.

1. Trulia: Todd Holloway on Trulia Suggest.

  • Hadoop
  • Hive
  • R on each Hadoop Server
  • Memcached
  • Java

2. Rich Relevance: John Jensen and Mike Sherman

  • Hadoop
  • Hive
  • Pig
  • Crunch

Starting to deploy

  • Kafka
  • Storm

3. Pandora: Eric Bieschke

  • Python, Hadoop, Hive for offline processing
  • Memcached, Redis for nearline & online
  • Java & PostgreSQL for online

Memcached: used as a “key-value store in the sky” – as long as you don’t care about losing data.

Redis: “persistent Memcached”
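The distinction above – a cache you are allowed to lose vs. a persistent store – is essentially the cache-aside pattern. Here is a minimal sketch in Python, with a plain dict standing in for the Memcached client (a hypothetical stand-in; a real deployment would use a client library) and a stubbed-out backing computation:

```python
cache = {}  # stand-in for Memcached: fast, but contents may vanish at any time

def expensive_lookup(user_id):
    # Stand-in for a slow backing computation (e.g. an offline Hive job).
    return {"user": user_id, "recs": [user_id * 7 % 10, user_id * 3 % 10]}

def get_recommendations(user_id):
    key = f"recs:{user_id}"
    value = cache.get(key)
    if value is None:                # cache miss: recompute and repopulate
        value = expensive_lookup(user_id)
        cache[key] = value
    return value

first = get_recommendations(42)
cache.clear()                        # simulate Memcached losing its data
second = get_recommendations(42)     # still correct, just slower
assert first == second
```

Because every miss falls back to the backing computation, flushing the cache costs only latency, never correctness – which is exactly why Memcached is acceptable “as long as you don’t care about losing data”.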

Want to Predict your Cost in the Cloud? Roll Up Your Sleeves!

The selection of a cloud service provider is a critical decision for any software service provider. Cost is, naturally, a key driver in this selection. However, predicting the cost of running servers in the cloud is a project in and of itself, because the only way to build a reliable model of costs is to actually deploy our systems with the service providers.

Why is it not possible to forecast costs with pen and paper?

The main reason that pricing is so hard to forecast is that our system architecture in the cloud will likely be different from the one currently running in our own datacenter: the server configurations are different, the networking is different, and most likely we want to take advantage of the new features that come “for free” with a deployment in the cloud: higher availability, geographical redundancy, larger scale, etc. We’ll cover this in detail in an upcoming post.

Another reason why it is hard to predict costs is that we don’t really know what we are getting:

When one considers the primary attributes of a server – RAM, CPU, storage, I/O (network bandwidth) – only RAM and storage capacity are guaranteed by cloud vendors. Vendors provide varying degrees of specificity about CPU and other key characteristics. Amazon defines EC2 Compute Units: “One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor”. Rackspace’s price sheet categorizes servers by available RAM and disk space (the more RAM, the more disk space). Their FAQ mentions the number of virtual cores each server receives, based on the amount of RAM allocated, but I could not find their definition of a virtual core. GoGrid and Joyent provide similarly limited information.

As a side note, one needs to be aware that vendors typically refer to “virtual cores” – as opposed to real (physical) cores. A virtual core corresponds to one of the two hyperthreads that run on modern Intel processors (since 2002). Thus, a server with a quad-core Intel Xeon processor runs 8 virtual cores. You can read this 2009 post, plus the comments thread, for more specifics. While the data is dated, the observations are still relevant.

So, there is a lot that we don’t know about the servers on which we will run our system: CPU clock, size of the L1 and L2 caches, RAM and I/O bus speeds, disk spindle rotation speed, network card bandwidth, etc.

Furthermore, performance will vary across servers (since cloud vendors have a diverse fleet of servers of different ages), and thus, each time a new image is deployed, it will land on a random server with the same nominal specs (RAM, storage) but unknown other physical characteristics (CPU clock, I/O bandwidth, etc).

Another well-documented problem is that of noisy neighbors. While the hypervisors do a fairly good job at controlling allocation of CPU and memory, they are not as effective at controlling the multitude of other factors that affect performance. I/O in particular is very sensitive to contention. While VMware affirms that vSphere solves this problem, most (all?) cloud vendors use open source hypervisors.

In any event, this problem is systemic and cannot be solved by the hypervisor. For example, we did a lot of research on the best configuration for our Cassandra servers (database for big data). One of the main performance optimizations driving Cassandra’s design is to maximize “append” (rather than update) operations, thus minimizing random movement of the read/write head of the disk, and thus maximizing disk I/O. Unfortunately, all this clever optimization goes out the window if we share the server – and thus the disks – with a noisy neighbor who is performing random read-write operations. I had the chance to discuss this a couple of months ago with members of the Cassandra team at Netflix (one of the largest users of Cassandra and almost 100% deployed on Amazon): they solve the issue by only using m2.4xlarge instances on AWS, which (today) ensures that they are the only tenant on the physical server – and don’t have any noisy neighbor.

 

Adding all this together makes it pretty clear that vendor comparison on paper is practically fruitless.

Let’s Try It Out

The only practical way to create a realistic budget forecast is to actually deploy systems on the selected cloud vendor(s) and “play” with them. Here are some areas to investigate and characterize, beyond simply validating functionality:

  • Optimal server configuration for each server role (web, database, search, middle tier, cache, etc). We need to make sure that each server role is adequately served by one of the configurations offered by the vendor. For example, very few offer servers with more than 64 GB of RAM
  • Performance at scale (since we only pay for the servers we rent, we can run full-scale performance tests for a few hours or days at relatively low cost – e.g. a few hundred dollars) – Netflix tested Cassandra performance, “Over a million writes per second”, on AWS for less than $600 and clusters as large as 288 nodes
  • End-to-end latency (measured from an end-user perspective) – since latency will be impacted by the physical distribution of the servers
  • Pricing model

 

For these tests to be meaningful, one needs to ensure that deployments are realistic: for example, across service zones and regions, if we plan on leveraging these capabilities – as they impact not only performance (due to increased network latency) but also pricing (data transfer charges).

 

In addition, each test must be run several (10 – 20) times – with fresh deployments – at different times of day – in order to have a representative sample of servers and neighbors.
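That sampling discipline can be sketched as a small harness: run the benchmark over many fresh deployments and summarize the mean, an approximate 95th percentile, and the spread. Here `run_benchmark()` is a hypothetical stand-in – deterministic for the sake of the sketch, whereas in practice it would deploy a fresh image and measure real latency:

```python
import statistics

def run_benchmark(trial):
    # Stand-in: pretend latency depends on which physical host the
    # deployment lands on (deterministic "host lottery" for this sketch).
    base_ms = 120.0
    host_penalty = (trial * 37) % 5 * 8  # 0..32 ms, varies per deployment
    return base_ms + host_penalty

# 10-20 fresh deployments, per the recommendation above:
samples = [run_benchmark(t) for t in range(15)]

mean_ms = statistics.mean(samples)
p95_ms = sorted(samples)[int(0.95 * len(samples)) - 1]
spread = max(samples) - min(samples)

print(f"mean={mean_ms:.1f}ms p95~{p95_ms:.1f}ms spread={spread:.1f}ms")
```

The spread between the best and worst deployment is the number to watch: if it is large relative to the mean, a single test run would have told you almost nothing about what you will actually get in production.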

 

As important as the technical performance validation, the pricing model must be validated as vendors charge for a variety of services in addition to the lease of the servers: most notably bandwidth for data transfers (e.g. across regions), but also optional services (e.g. AWS Monitoring or Auto-Scaling), as well as per operation fees (e.g. Elastic Block Store). The “per operation” fees can add up to very large amounts, if one is not careful. For example, see the Amazon SimpleDB price calculator – we have to run SimpleDB under real load in order to figure out what numbers to plug in. Overlooking this step can be costly.
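A toy version of such a pricing model might look like the following. Every rate below is a hypothetical placeholder, not a real vendor price; the point is only that the bandwidth and per-operation lines must appear in the model alongside the server lease:

```python
# Hypothetical placeholder rates - real numbers must come from the
# vendor's price sheet and, ultimately, from an actual bill.
RATES = {
    "server_hour": 0.48,         # per instance-hour (hypothetical)
    "gb_transfer_out": 0.12,     # egress / cross-region, per GB (hypothetical)
    "million_io_requests": 0.10, # per-operation fee, per million (hypothetical)
}

def monthly_cost(servers, gb_out, io_requests_millions, hours=730):
    lease = servers * hours * RATES["server_hour"]
    bandwidth = gb_out * RATES["gb_transfer_out"]
    operations = io_requests_millions * RATES["million_io_requests"]
    return {"lease": lease, "bandwidth": bandwidth,
            "operations": operations,
            "total": lease + bandwidth + operations}

# Per-operation fees look negligible per unit but add up under real load:
bill = monthly_cost(servers=10, gb_out=5000, io_requests_millions=20000)
```

With these placeholder rates, the per-operation line alone contributes $2,000 of a roughly $6,100 monthly bill – the kind of surprise the SimpleDB calculator exercise is meant to surface before, not after, the first invoice.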

 

Once the technical tests have been completed and the system configuration validated, I recommend at least a full billing cycle of simulated operations, in order to obtain an actual bill from the vendor, from which we can build our pricing model.

Deploying to the Cloud? Hang on to your Trousers!

My team and I have spent the past months investigating a deployment to the Cloud with vendors such as Amazon, Rackspace, GoGrid … to name a few providers of Infrastructure as a Service (IaaS).

A few conclusions have surfaced:

  • One needs to be clear about one’s motivations to migrate to the Cloud – different motivations will lead to different outcomes for a given product
  • It is almost impossible to predict the cost of a cloud-hosted system – without deploying a test system with the selected vendor. As a corollary, precise comparison shopping is almost impossible.
  • It is almost impossible to design, let alone deploy, your system architecture – without prior hands-on experimentation with your selected vendor. Also, the optimal architecture once deployed in the Cloud is likely to be radically different than one deployed on your own servers.
  • Some Cloud vendors are moving aggressively up the value chain by offering innovative software technologies on top of their infrastructure. They are thus becoming PaaS (Platform as a Service) vendors. For example, as we commented in a previous post, “Is Amazon After Oracle and Microsoft?”, Amazon is deploying an array of software technologies – combined with services – that are tailored specifically for the Cloud and are technically very advanced.

We will expand on each of these points in upcoming posts, starting with the first one today.

The main arguments advanced in favor of a cloud infrastructure are:

  • Offload the system management responsibilities to the Cloud services provider:
    This is more than an economic trade-off: managing systems for high-volume Internet applications is a complex task requiring a broad set of technical skills – where said skills are in permanent evolution. Acquiring all these skills typically requires multiple engineers with varied backgrounds: computer hardware, operating systems, storage, networking, scripting, security, etc. These system administrators have been in high demand for the past couple of years, command high compensation, and usually want to work for companies that offer challenging work … namely those with a very large number of systems. As a result, some companies are simply unable to hire the necessary system administration talent in-house, and are forced to move to the Cloud for this single reason.
  • Leverage best practices established by Cloud vendors.
    Cloud services providers have optimized every aspect of running a datacenter. For example, Facebook released the Open Compute Project in 2011 for server and data center technology. Rackspace launched the OpenStack initiative in late 2010 to standardize and share software for compute (systems management), storage, media, and security, as well as identity and dashboard services. Even managing systems at a hosting provider requires constant tuning of system management tools – whereas a Cloud service provider will take on this burden.
  • Benefit from the economies of scale that the Cloud vendors have created for themselves
    Building data centers, finding cheap sources of power, buying and racking computers, creating high-bandwidth links to the Internet, etc. are all activities whose cost drops with volume. However, to me, the impact of price is much smaller than that of pure skills. The aforementioned tasks are becoming more and more complex, to the point where only the largest companies are capable of investing enough to keep up with the state-of-the-art.
    In particular, Cloud vendors offer high-availability and recoverability “for free” – namely: free from a technical perspective, but not from a financial one.
  • Ability to rapidly scale systems up or down according to load
    This is one of the main theoretical benefits of the cloud. However, it requires a few architectural components to be in place:
    (a) The software architecture has to be truly scalable and free of bottlenecks. For example, traditional N-tier architectures were advertised to be scalable because web servers could be added easily. Unfortunately, the database rapidly becomes the throttling component as the load rises. Scaling up traditional database sub-systems, while maintaining high availability, is both difficult and expensive.
    (b) Tools and algorithms are required to detect variations in load, and to provision/decommission the appropriate servers. This requires a good understanding of how each component of the system contributes to the performance of the whole system. The complexity increases when the performance of components does not behave linearly with load.
    (c) Data repositories are slow and expensive to migrate. For example, doubling the size of a Cassandra (noSQL database) cluster is time consuming, uses a lot of bandwidth (for which the vendor may charge) and creates load on the nodes in the cluster.
  • Ability to create/delete complete system instances (most useful to development and testing)
    The Cloud definitely meets this promise for the front-end and business logic layers, but if an instance requires a large amount of data to be populated, you must either pay the time & cost at each deployment or keep the data tier up at all times. This being said, deploying complete instances in the Cloud is still a lot cheaper and faster than doing it in one’s data center, assuming it can be done at all.
  • The Cloud is cheaper:
    This is a simple proposition, with a complex answer. As we’ll examine in the next blog: figuring out pricing in the cloud is a lot more complex than adding the cost of servers.
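Point (b) in the scaling discussion above – detecting load variations and provisioning accordingly – can be sketched as a simple threshold rule with hysteresis. The capacity and threshold figures below are hypothetical placeholders; a real controller would derive them from the kind of measurements discussed earlier:

```python
import math

# Hypothetical figures - in practice these come from load testing:
CAPACITY_PER_SERVER = 1000   # requests/sec one server can handle (assumed)
SCALE_UP_AT = 0.75           # utilization above this: add servers
SCALE_DOWN_AT = 0.40         # utilization below this: remove one server

def desired_servers(current, load_rps):
    """Map observed load to a target server count, with a dead band so
    that small fluctuations between the thresholds trigger no change."""
    utilization = load_rps / (current * CAPACITY_PER_SERVER)
    if utilization > SCALE_UP_AT:
        # Provision enough capacity to bring utilization back under the cap.
        return math.ceil(load_rps / (SCALE_UP_AT * CAPACITY_PER_SERVER))
    if utilization < SCALE_DOWN_AT and current > 1:
        return current - 1           # decommission gradually, one at a time
    return current                   # inside the dead band: do nothing

# Example: a spike to 3800 req/s on 4 servers forces a scale-up to 6;
# the same fleet at 2000 req/s (50% utilization) is left alone.
```

Note that a constant CAPACITY_PER_SERVER is itself an assumption: as point (b) warns, when component performance does not behave linearly with load, this figure becomes a function of fleet size rather than a constant.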

Appreciating the business and technical drivers that motivate a migration to the Cloud will drive how we approach the next steps in the process: system architecture design, vendor selection, and pricing analysis. As always, different goals will lead to different outcomes.

Is Amazon After Oracle and Microsoft?

Amazon is quietly, slowly, but surely becoming a software vendor (in addition to being the largest e-tailer), with product offerings that compete directly with – and in some cases are broader than – those of “traditional” software vendors such as Oracle and Microsoft.

For example, a simple review of Amazon Products shows no fewer than 3 database options: Amazon Relational Database Service (RDS), SimpleDB, and DynamoDB (launched earlier this year), which offers almost infinite scale and reliability.

Amazon also offers an in-memory cache – ElastiCache. You can also use their SIMPLE services: Workflow Service (SWF) – e.g. for business processes, Queue Service (SQS) – for asynchronous inter-process communications, Notification Service (SNS) – for push notifications, as well as email (SES). Amazon calls them all “simple”, yet a number of startups have been built and gone public or been acquired in the past couple of decades on the basis of a single one of these products: PointCast, Tibco, IronPort, to name just a few.

This is not all … Amazon offers additional services in other product categories: storage, of course, with S3 and EBS (Elastic Block Store), web traffic monitoring, identity management, load balancing, application containers, payment services (FPS), billing software (DevPay), backup software, a content delivery network, MapReduce … my head spins trying to name all the companies whose business is to provide just a single one of these products.

Furthermore, Amazon is not just packaging mature technologies and slapping a “cloud” label on them. Some of them, like DynamoDB, are truly leading edge. Yet, what is most impressive, and where Amazon’s offering is arguably superior to that of Oracle, Microsoft or the product category competitors, is that Amazon commits to supporting and deploying these products at “Internet scale” – namely as large as they are. This is not only a software “tour-de-force” but also an operational one – as anyone who has tried to run high-availability and high-throughput Oracle or SQL Server clusters can testify.

Given the breadth of its products and its ability to operate them at Internet scale with high availability, Amazon could become the default software stack: a foundation on which to architect products, displacing traditional stacks such as .NET, LAMP, or {MySQL, Oracle}-Java-Apache-JavaScript.

The costs of deploying software on the Amazon stack are another story … and the topic of a future post.