At the risk of sounding harsh, I cannot remember the last time I interviewed a QA engineer who could clearly explain to me what a Performance Test is. Even more worrisome, when they described how they usually run performance tests, they could only describe the mechanics of the test (“I ran JMeter”). More often than not, the tests were wrong, and, in any case, they could not interpret the results (largely because they did not know what numbers to look at).
So here is a basic description of what a performance test is and the important steps to follow – as well as one of the most common mistakes.
Note: for illustration purposes, I’ll use the term site (e.g. a retail e-commerce site), but this post applies identically to SaaS services.
Typical Performance Test
The primary performance test that engineers want to run on a product is: “How many users will our product support?”. This typically translates to “How many concurrent users can the current code and infrastructure support with acceptable response time?”
Another performance question is to determine how many total users can be supported. This problem is rarer, and we won’t cover it here.
The first challenge is: “what is a concurrent user?” While the answer is intuitively obvious: “the number of people using the site at the same time”, we still need to define what “at the same time” means. In particular, it does NOT mean “exactly” at the same time, i.e. within the same micro-second.
This incorrect “within the same microsecond” interpretation actually leads to the most common error in performance testing: testers set up N JMeter clients (where N is the target number of concurrent users), launch them all at the same time, and measure the time it takes until the last client receives a response. This test does not measure concurrent users – in real life, users arrive on a site randomly, not at the exact same instant. This is illustrated in the two figures below, both representing 1,000 users using the site during a period of one hour. The “Burst” figure illustrates the “bad” test, where all 1,000 users hit the site in the first minute (and there is no activity during the remaining 59 minutes). The correct scenario is likely to be closer to the second figure, where during each of the 60 minutes an average of 16.66 users hit the site (1,000 users / 60 minutes) – although in any given minute the number of users can vary, say between 5 and 25.
Another way of expressing that a site has 1,000 visitors per hour is that a new visitor hits the site every 3.6 seconds (3,600 seconds per hour / 1,000 visitors). Consequently, we can program JMeter to hit the site following a pseudo-random sequence with a mean of 3.6 seconds and a standard deviation of, for example, 1 second.
Depending on the length of a typical visit, we will need to deploy multiple clients. For example, if each visit simulation takes a minute, we will need an average of 16.6 clients (60 seconds per visit / 3.6 seconds between arrivals) to perform the simulation.
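This arrival model is easy to sketch in code. The following is a minimal illustration (JMeter has its own timers for this; the function name and the fixed seed are my own choices, not part of any tool):

```python
import random

def arrival_times(visitors_per_hour, duration_sec, stddev=1.0, seed=42):
    """Generate pseudo-random arrival times (in seconds), spaced by a
    Gaussian inter-arrival gap around the mean derived from traffic."""
    mean_gap = 3600.0 / visitors_per_hour   # e.g. 3.6 s for 1,000 visitors/hour
    rng = random.Random(seed)               # fixed seed => reproducible test runs
    t, times = 0.0, []
    while t < duration_sec:
        gap = max(0.1, rng.gauss(mean_gap, stddev))  # clamp to avoid negative gaps
        t += gap
        times.append(t)
    return times

times = arrival_times(1000, 3600)
print(len(times))   # roughly 1,000 arrivals spread over the hour
print(60 / 3.6)     # clients needed if each simulated visit lasts 60 s: ~16.6
```

The fixed seed matters: it makes the arrival sequence reproducible across runs, which we will need later for benchmarking.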
How Many Concurrent Users to Use?
The excellent article How do you find the peak concurrent users on your site? describes in detail how to find the peak number of visits during a given hour on the site (for performance testing, we care about visits rather than visitors).
Of course, the key is to know our site well enough to identify the actual day(s) and hour(s) when we have the peak traffic.
However, we cannot stop there. The purpose of the test is to give us confidence that our site will be able to handle the load in the future, so the historic peak number of visits needs to be extrapolated over the anticipated lifetime of the release we are certifying. E.g. if we want to be ready for the upcoming Cyber Monday, we need to take last year’s numbers and increase them based on the projected traffic from Sales & Marketing … plus an extra 25% as a safety buffer.
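The extrapolation is simple arithmetic; a sketch (the peak and growth figures are hypothetical inputs, e.g. from Analytics and from Sales & Marketing):

```python
def projected_peak(last_year_peak, projected_growth, safety_buffer=0.25):
    """Scale last year's peak visits by projected growth, plus a safety buffer."""
    return last_year_peak * (1 + projected_growth) * (1 + safety_buffer)

# e.g. 40,000 peak visits/hour last year, 30% projected growth:
print(projected_peak(40_000, 0.30))  # 65,000 visits/hour to test against
```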
What is a Visit? What User Actions to Execute?
Now that we have figured out how many visits we need to run, and how often, we need to figure out: what’s a visit? There are 2 principal ways to answer the question:
- a visit is an isolated user action
- a visit is the simulation of an actual visit of a typical user.
Testing the Performance of an Isolated User Action
While we assume, and hope, that visitors to our site don’t limit themselves to a single action per visit, we may still want to measure the performance of a single action under 2 generic conditions:
- this is new functionality and we want to ensure that it is fast enough
- as part of performance regression testing – we want to ensure that the new release is no worse than the past releases
The question now is, what isolated user actions should we test? Here are some examples:
- Login / Home Page / Landing Page: the 1st page that is loaded when a user arrives on the site. It is important for it to be fast, since it is the “first impression” of the site for the user – and – it usually requires a lot of work server-side, since, by definition, we have no (or little – in the case of a returning user) cached data, so a lot of data needs to be computed or refreshed
- All pages related to checkout and payment: don’t want to lose a purchase because the site is too slow
- Pages which Analytics in Production identify as slow
- New Pages / heavily re-factored pages that are content- and/or compute-heavy
Testing the Performance of a Typical Visit
A “typical visit” can be constructed either “synthetically”, or “by replay”.
A synthetic visit is one that we build based on the nature of the site and statistics gleaned from the Production system – such as the average number of pages visited. A typical visit for a retail site would entail: login – select a category – then a sub-category – browse a couple pages – select an item – checkout and pay.
A “replay” visit is based on tracking the actual navigation of a random user on the site (e.g. using log files).
It is important to note that I should be using the plural – we need to identify multiple “typical visits”. For example, a second typical visit for the retail site above would involve a couple of searches rather than browsing by category/sub-category.
We can also break down our visits to focus on specific sections or functionality of the site – search, browsing, check-out – which makes it easier to interpret the results.
Let’s Not Forget Background Load!
In order for our performance test to be realistic and representative of Production, we need to simulate all the types of traffic taking place on the site – qualitatively and quantitatively – during our experiment – because each type of traffic consumes shared resources differently – whether it’s cache, database resources, access to disk, etc.
What I call “background load” is a simulation of the typical traffic on our site. The best way to simulate it is to record it and replay it – using a proxy server, log files, or other tools – e.g. Improving testing by using real traffic from production.
One exception: if we want to characterize the performance of a specific code path – and/or benchmark it against previous versions – then we should not run any background load.
How Long Should We Run the Test?
When thinking about the timing of a performance test, we first need to address the “ramp-up” or “warm-up” phase. When we launch our performance test, our system also starts “cold”: typically it has an empty cache, a small number of threads running, a few database connections set up, etc – so to obtain a realistic measurement of performance we need to wait for the system to reach its steady state. It’s not always straightforward to tell when the system reaches steady state, but we can get clues from monitoring critical system parameters: CPU, RAM, I/O on key servers (e.g. the database). Also, response time should stabilize.
Secondly, because the simulated visitors access the system randomly, we need to take our measurements over an extended period of time so as to smooth out the randomness. An easy way to figure out the required duration is to experiment: if the results we get after 10 minutes are the same – over 10 experiments – as those we get over 1 hour, then 10 minutes is enough.
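The “10 minutes vs 1 hour” experiment can be framed as a simple comparison of means. A sketch (the p95 figures and the 5% tolerance are hypothetical; the acceptable delta is a judgment call):

```python
from statistics import mean

def durations_equivalent(short_run_p95s, long_run_p95, max_delta=0.05):
    """Compare the mean p95 response time of several short runs against one
    long run. If they agree within max_delta, the short duration suffices."""
    short_mean = mean(short_run_p95s)
    return abs(short_mean - long_run_p95) / long_run_p95 <= max_delta

# Ten 10-minute runs vs one 1-hour run (hypothetical p95s, in ms):
ten_minute_runs = [410, 425, 398, 415, 420, 405, 412, 418, 409, 416]
print(durations_equivalent(ten_minute_runs, long_run_p95=414))  # True
```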
How Many Tests Do We Run?
Depending on the test, and the environment, we may have to run it multiple times. For example, if we are running on AWS, even if we are running on reserved instances, the performance may vary between runs, depending on network traffic, load from other tenants of the physical servers, etc.
Interpreting the Results
We can look at the results in at least 2 different ways:
1/ In absolute terms: does the response time meet our target?
2/ Relative to prior releases: is our response time no worse in this release than in prior releases?
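Both checks are easy to automate alongside the test run; a minimal sketch (the target, prior-release figure, and tolerance are hypothetical):

```python
def check_response_time(p95_ms, target_ms, prior_p95_ms, tolerance=0.05):
    """Absolute check: meet the target. Relative check: no worse than the
    prior release (within a small tolerance for environmental noise)."""
    meets_target = p95_ms <= target_ms
    no_regression = p95_ms <= prior_p95_ms * (1 + tolerance)
    return meets_target, no_regression

print(check_response_time(p95_ms=480, target_ms=500, prior_p95_ms=470))
# (True, True): within target and within 5% of the prior release
```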
Ideally, performance tests are automated and run regularly – at least after each Sprint. This allows us to catch performance regressions early.
Conversely, running performance tests after “code complete” is almost a waste of time, since this leaves us no time to remedy any serious performance issue, and thus puts us in a quandary: release a slow product on time vs release a good product late?
Common Mistake: Initial Conditions
Finally, we need to address a very common mistake, namely ignoring initial conditions.
To ground the discussion, let me give an example: we can all agree that the query “select * from TableA” will execute much faster if TableA has 1 row vs if TableA has 100M rows.
The same applies to performance testing. Just as we let the system reach steady-state before we start measuring performance, we also need to ensure that all assets impacting performance are fully loaded.
To be more specific, let’s say today is March 15, and I am working on the release that will be in Production for Black Friday in November. In order to have a meaningful performance test, I need to make sure I load my E-commerce database not just with the same number of customers that we have today (March 15), but with the projected number of customers we’ll have by the end of November! Similarly, with the number of products in the catalog, documents in the search database, etc.
This is a complex – but absolutely critical – step. Otherwise, our tests will tell us about the past, not the future.
The final consideration is to ensure that the initial conditions – as well as the test itself – are 100% reproducible: e.g. by restoring from a previously archived database, using the same pseudo-random sequence to trigger visits, and the same logs to replay background traffic. Otherwise, we are simply running a different test each time, and any benchmarking would be meaningless.
Performance testing is complex, and requires a lot of thought, careful planning and detailed work to produce results that are meaningful. Specifically, we need to:
- Model one or more individual visit profiles constructed from traffic patterns on our site / in our service
- Model visit rate based on our site’s Analytics and extrapolate it to the projected level during the lifetime of the release
- Generate pseudo-random sequences to model users’ arrival on the site
- Generate a model of background load and/or a mixture of individual visits that together are a good approximation of actual traffic
- Make sure to give the site/service time to warm-up (i.e. reach steady state) before starting measurements, and run the test long enough to smooth out the pseudo-random patterns. Also run multiple tests if environmental conditions cannot be fully controlled
- Finally, make sure to properly initialize the whole system – in a reproducible fashion – so as to account for all the data already present in the system.
- Finally finally, ensure that all test conditions are reproducible, and that tests are automated so that they can be run regularly – preferably upon the completion of each sprint. This ensures that performance bugs are caught early, rather than 2 weeks before the target release date.
The most lively debates that I regularly encounter leading an Engineering team revolve around the allocation of resources between bug fixing and the development of new features: “Why doesn’t Engineering fix all the bugs?” exclaims a customer support person – “Why don’t we allocate all Engineering resources to New_Shiny_Feature_X?” wonders the salesperson whose major deal depends on this feature.
These are both absolutely legitimate questions! … It does not mean that their answer is easy.
The main challenge in satisfying these two rightful requests is that they compete for the same resources, and that different people within the company have strongly-held different perspectives. The same person can even switch camps in a matter of days – it all depends on the last sales call: do we have a customer threatening not to renew until we fix “their bugs”, or do we have a big deal pending on the delivery of a new feature?
As a consequence, it is imperative to create a business and technical framework for decision making, in which every stakeholder can not only express their perspective but also be satisfied with the decision process – and thus with the decisions that come out of it.
Framework for Decision Making
What’s more important? Or more precisely, what’s more important to implement in this release cycle?
- New features driven by product roadmap and corporate strategy
- Customer-driven enhancement requests
- Bug fixes requested by existing customers
- Paying down technical debt: upgrade architecture, refactor ugly code, optimize operational infrastructure, etc
The process to reach a decision is basically the same as for any business decision: we weigh how much income each item will generate and how much investment it will require.
Implementing a new feature, fixing a bug, enhancing a released feature or paying down technical debt demand the same activities: define requirements, design, code, test, deploy. They also all draw from the same pool of product managers, developers, QA and DevOps engineers. As a consequence, it is relatively easy to define the “investment” side of the equation.
Estimating the income side is a bit more complex, because it comes in multiple flavors. However, the process is the same as prioritizing the backlog of new features: we need to articulate the business case:
- Expected revenue stream (new features & enhancements)
- Reduction in subscription churn (enhancements & bug fixes, as well as new features)
- Cost reduction (technical debt / architecture) through increased future development velocity
- Customer satisfaction (bugs & enhancements), which translates into better advocacy for the brand and churn reduction
- Strategic objectives (market positioning, competitive move, commitment to win a major deal)
Each of these categories is important in its own right. Since they cannot all be translated into a common unit of measure (e.g. dollars), I recommend quantifying each of these elements relative to one another (e.g. using T-shirt sizes: S, M, L, XL, …) for each item on the list.
Practically, I create a matrix with rows listing each feature, bug, enhancement request, technical debt, and the following columns:
- Short Description
- Link to longer description (Jira, Wiki, …)
- Summary business case
- Estimated engineering effort
- Estimated calendar duration
- Expected increase in revenue (if any)
- Expected cost reduction (if any)
- Customer satisfaction impact
- Strategic value
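A sketch of such a matrix, with T-shirt sizes mapped to relative weights (the items, sizes, and weights below are hypothetical illustrations of the mechanics, not a formula):

```python
SIZE = {"S": 1, "M": 2, "L": 3, "XL": 5}   # relative weights, not dollars

backlog = [
    # (item, effort, revenue, cost reduction, satisfaction, strategic value)
    ("New checkout flow",  "L",  "XL", "S", "M", "L"),
    ("Fix search timeout", "S",  "S",  "S", "L", "S"),
    ("Refactor billing",   "XL", "S",  "L", "S", "M"),
]

for item, effort, *benefits in backlog:
    benefit = sum(SIZE[b] for b in benefits)
    # A benefit-to-effort ratio surfaces the no-brainers; the rest still
    # needs discussion against the business context of the company.
    print(f"{item:20s} benefit={benefit:2d} effort={SIZE[effort]} "
          f"ratio={benefit / SIZE[effort]:.1f}")
```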
While this is not perfect – ideally we’d want to assign a single score to each item – it allows us to (a) resolve the no-brainers (high benefits at a low cost, or low benefits at a high cost) and (b) frame the discussion for the remainder against the business context of the company:
- Are we in a tight competitive race where we need to show momentum in our innovation?
- Do we have one, or more, major deals dependent on a given set of features?
- Are our customers grumbling about our product quality, or worse threatening to leave?
- Is our scalability at risk because of legacy code?
- Are we being hampered in our ability to deliver new features by too much legacy code?
While this will not eliminate passionate debates at Product Council, it will hopefully bound them, particularly if we can first agree on high-level priorities for the business.
Why Not Have a Dedicated Sustaining Engineering Team?
There are two primary reasons why a Sustaining Engineering team is a bad idea: first, it “does not answer the question” of prioritization, and secondly, it is a bad practice as it creates a class of “second-class citizens” engineers.
Say you want to have a Sustaining Engineering team. How large should it be? 5%, 10%, 20%, 50% of all engineering? Why? Should its size remain constant? Or are we allowed to shift resources in and out depending on business priorities? Answering these questions requires the same analysis and decision making as I propose above, but is burdened by the inflexibility of a split organization.
Regardless of whom you assign to Sustaining Engineering, these engineers will be considered second-class by the self-proclaimed hotshots who get to work on new features. Worse, it promotes laziness with respect to quality in the “new feature team”: they know that Sustaining will clean up whatever mess they leave. This attitude is pervasive, and over time can even lead to cherry-picking of work, where Sustaining ends up completing the “new feature” work. For example, the “new feature” team releases a new product on Chrome (so that they can meet “their date”), but Sustaining gets to make it work on Internet Explorer.
A Useful Best Practice
Any bug older than 12 months should be removed from the bug backlog: either marked as “Won’t Fix”, or assigned to a secondary backlog list (which, I predict, will never be reviewed). The justification is simple: if a given bug has lived through a year’s worth of bug triages without rising to the top and being fixed, it is almost certain that it will never be prioritized for resolution. Better to put it out of its misery. Furthermore, this keeps the bug backlog at a reasonable size and bug triage a manageable task. Finally, if for some reason the visibility of this bug rises anew, it can be returned to the active backlog.
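The 12-month rule is easy to automate against any bug-tracker export; a sketch (the field names are hypothetical, not any tracker’s actual API):

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=365)

def triage_backlog(bugs, today):
    """Split the backlog: bugs older than 12 months move to a 'wont_fix'
    (or secondary) list; the rest stay active."""
    active = [b for b in bugs if today - b["created"] <= MAX_AGE]
    wont_fix = [b for b in bugs if today - b["created"] > MAX_AGE]
    return active, wont_fix

bugs = [
    {"id": 101, "created": date(2024, 11, 1)},
    {"id": 102, "created": date(2022, 3, 15)},   # well over a year old
]
active, wont_fix = triage_backlog(bugs, today=date(2025, 6, 1))
print([b["id"] for b in active], [b["id"] for b in wont_fix])  # [101] [102]
```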
The adage “software always has bugs” remains true, not because it is impossible to write perfect software (I argue that it IS possible), but rather because in a business context, quality is not an end in and of itself. Don’t get me wrong: high quality is critical, but fixing ALL the bugs is not a requirement for business success.
As a consequence, only 1 criterion matters: “what moves the business forward the most effectively?”
Typically this means making customers happy. There are times when customers are happier if we fix bugs, at other times they prefer to see a new feature brought to market earlier. The answer depends on what drives their business. Do they prefer that we fix a bug that costs them an extra hour of work per day or that we launch a new feature that will allow them to grow their business by 10% in 6 months?
Excellent talks on how each of the presenting companies approaches the design of its recommendation engine, based on the specifics of its market and users.
Here are my notes on their respective technology stacks. Hadoop, Hive, Memcached, Java are used by all 3.
1. Trulia: Todd Holloway on Trulia Suggest.
- R on each Hadoop Server
2. Rich Relevance: John Jensen and Mike Sherman
Starting to deploy
3. Pandora: Eric Bieschke
- Python. Hadoop. Hive for Offline processing
- Memcached. Redis: for near-line & online
- Java & PostgreSQL for online
Memcached: used as a “key-value store in the sky” – as long as you don’t care about losing data
Redis: “Persistent Memcached”
Say we are the founders of a startup and we just got a big fat check for our A-round funding. The VCs love our idea, and we all know that our app will attract millions of users in no time. This means that from day one we architect for millions of page-views per day…
But wait … do we really need to deploy Hadoop now? Do we need to design for geographical redundancy now? OR should we just build something that’s going to take us through the next 3 months, so that we can focus our energy on customer development and fine-tuning our product features? …
This is a dilemma that most startups face.
Architecting for Scale
The main argument for architecting for scale from the get-go is akin to: “do it right the first time”: we know that lots of users will be using our app, so we want to be ready when they come, and we certainly don’t want the site going down just as our product catches fire.
In addition, for those of us who have been through the pain of a complete rewrite, a rewrite is something we want to avoid at all costs: it is a complex task that is fun under the right circumstances, but very painful under time pressure, e.g. when the current version of the product is breaking under load, and we risk turning away customers, potentially for ever.
On a more modest level, working on big complex problems keeps the engineering team motivated, and working on bleeding or leading edge technology makes it easier to attract talent.
Keeping It Simple
On the other hand, keeping the technology as simple as possible allows the engineering team to be responsive to the product team during the customer development phase. If you believe, as I do, one of Steve Blank’s principles of customer development: “No Business Plan Survives First Contact with Customers”, then you need to prepare for its corollary namely: “no initial product roadmap survives first contact with customers”. Said differently, attempting to optimize the product for scale until the company has reached clear validation of its business assumptions, and product roadmap, is premature.
On the contrary, the most important qualities that are needed from the Engineering team in the early stages of the company are velocity and adaptability. Velocity, in order to reduce time-to-market, and adaptability, so that the team can rapidly adapt to feedback from “outside the building”.
Spending time designing and implementing a scalable architecture is time not spent responding to customer needs. Similarly, having built a complex system makes it more difficult to adapt to changes.
Worst of all, the investment in early optimization may be all for naught: as the product evolves with customer feedback, so do the scalability constraints.
Case Study: Cloudtalk
I lived through such an example at Cloudtalk. Cloudtalk is designed as a social communication platform with an emphasis on voice. The first 2 products, “Cloudtalk” and “Let’s Talk”, are mobile apps that implement various flavors of group messaging with voice (as well as text and other media). Predicting rapid success, Cloudtalk was designed around the highly scalable NoSQL database Cassandra.
I came on board to launch “Just Sayin”, another mobile app that runs on the same backend (very astute design). Just Sayin is targeted at celebrities and allows them to cross-post voice messages to Twitter and Facebook. One of my initial tasks coming on board was to scale the app, and it was suggested that we needed to move it to Amazon Web Services so that we could scale rapidly as more celebrities (such as Ricky Gervais) adopted our product. However, a quick analysis revealed that, unlike the first two products (Let’s Talk and Cloudtalk), Just Sayin’s impact on the database was relatively light, because communications were 1-to-many (e.g. Lady Gaga to her 10M fans). Rather, in order to scale, we first needed a Content Delivery Network (CDN) so that we could serve millions of fans the messages from their celebrities with low response time.
Furthermore, while Cassandra is a great product, it was somewhat immature at the time (stability, management tools) and consequently slowed down our development. It also took us a long time to train new engineers.
While Cassandra would have been a good choice in the long run, we would have been better served in the formative stages of the company by a more established technology like MySQL. Our velocity in developing new features, and our ability to respond to changes in product strategy, would have been significantly greater.
Architecting for Scale is a Process, not an Event
A startup needs to earn the right to design for scale, by first proving that it has found a legitimate market. During this first phase adaptability and velocity are its most important attributes.
This being said, we also need to anticipate that we will need to scale the system at some point. Here is how I like to approach the problem:
- First of all, scaling is an on-going process. Even if traffic increases dramatically over a short period of time, not all parts of the system need to be scaled at the same time. Yet, as usage increases, it is likely that at any point in time some part of the system will need to be scaled.
- In order to avoid complete rewrites of the system, we need to break it into independent components. This allows us to redesign each component independently, and have different teams work on different problems concurrently. As a consequence, good modularization of the system is much more important early on than designing for scale
- Every release cycle needs to budget time and resources for redesign – including both modularization and scalability. This is just like maintenance on the Golden Gate bridge: the painters are always working; when they finish at one end, they start all over at the other end.
- We need to treat our software architecture the same way, and budget maintenance work every release cycle: dollars, time, people. CEOs have to be trained to not only think about the “shiny features” – those that are customer-facing – but also about the “continuous improvements” of the architecture that has to be factored in every release cycle.
- We also need to instrument the code to tell us where it is under strain. Unlike the Golden Gate Bridge, we can’t always see where it’s breaking, or even rationalize it. Scaling sometimes works in mysterious ways that are not obvious to predict.
In summary, designing for scale is a high-class problem, on which we only get to work once we have demonstrated true demand for our product. During this first phase, velocity and adaptability are critical, and are better served by well-understood technologies and a well-modularized design. Once our product reaches an adoption phase, designing for scale becomes a continuous process that can hopefully be focused on individual modules in turn – guided by proper instrumentation of the code.
Quality Assurance does not stop after the software receives the “thumbs up” from the QA team. QA must continue while the product is Live! … because QA is not perfect, and real users only exist on a Production system. We need to be humble and accept that our design, development and quality processes will not catch all the issues. Consequently, we must equip ourselves with tools that will allow us to catch these problems in Production as early as possible … rather than “wait for the phone to ring”
When the product exits QA, it simply means that we have run out of ideas on how to make the system fail. Unfortunately, this does not imply that the system, once in Production, will not fail. If we are successful and get a high volume of traffic, the simple law of large numbers guarantees that our users will find never-thought-of ways to – unintentionally – make the system fail. These are part of the “unknown unknowns”, as Mr. Donald Rumsfeld would say. Deploying the product on the production servers, and handing off (abdicating?) responsibility for keeping it up to the Ops team, shows wishful thinking, or naïveté, or both.
Why QA must continue in Production
There are a few categories of issues that one needs to anticipate in Production:
- Functional defects: in essence, bugs that neither developers, nor QA caught – while this is the obvious category that comes to mind, it is far from being the only source of issues
- User experience (UX) defects: the product works “as spec’d”, but users either can’t figure out how to make it work, or don’t like it. Typical examples are a high abandonment rate in a purchasing experience (or any kind of workflow), a feature that’s never used, or a button that’s never clicked.
This is not limited to new products: by improving the layout of a given page, we may have broken another feature on that same page
- Performance issues: while we may have run performance, and load tests, in our QA environments, the real world always offers surprises. Furthermore, if we are lucky enough to have the kind of traffic that Google or Facebook have, there is no other way but to test and fine-tune performance in production
Running tests on non-production systems requires us not only to simulate the load on the system, but also to simulate the “weight” of existing data (e.g. in the database and file system), as well as longevity, to ensure that there is no resource leak (memory, threads, etc)
- Operational issues: while all cloud applications are typically clustered for high availability, there are sources of failure other than equipment failure:
- External resources, such as partners, data feeds, can fail, or have bugs of their own, or simply not keep up their response time. Sometimes, the partner updates the API without notification.
- User-provided data can be mal-formed, or in an unexpected format, or a new data format can be introduced after the launch of the product
- System resources can be consumed at an unexpected rate. Databases are notorious for having non-linear response times based on load: as long as the load is under a given threshold, response time is stable, but once the load exceeds this threshold, response time can deteriorate very rapidly.
A couple of examples:
- At my previous company, weeks after the product had been launched, we started receiving occasional complaints that some user-created videos were not showing up in their timeline. After (reluctantly) poking around in our log files, we found that about 10% of the videos uploaded to our site over the past 2 weeks (but not earlier) had not been processed properly. Our transcoder simply failed. Worse, it failed silently. The root cause was a minor modification to the video format introduced by Apple after our product was released. Since this failure was occurring for a small fraction of our users, and we had no “operational instrumentation” in our code, it took us a long time to even become aware of it.
- Recently, we launched a product that exchanges data with a partner. Their API is well documented, and we tested our product in their sandbox environment, as well as their production environment. However, after launch, we had reports of occasional failures. It turned out that users on our partner’s site were modifying the data in ways that we did not expect, causing the API to return error codes that we had never seen. Our code duly logged this problem in our log files each time it occurred … among the thousands of other log events generated every minute
Performing QA on Production Systems
As I mentioned, the Googles and Facebooks of the world do a lot (if not most) of their QA on Production systems. Because they run hundreds of thousands of servers, they can use a small subset to run tests with live user data. This is clearly a fantastic option.
Similarly, “A/B comparison” techniques are typically used in Marketing to compare 2 different user experiences, where the outcome (e.g. a purchase) can be measured. The same technique can be applied in testing, e.g. to validate that a fix for an intermittent, difficult-to-reproduce bug actually works.
More generally, Production code needs to be instrumented:
- To detect failures, or QoS (Quality of Service) degradations, with internal causes (e.g. database is slowing down)
- To detect failures, or QoS degradations, with external causes (e.g. partner API times out a lot)
- To monitor resource utilization for each service or application – at a finer grain than that provided by Operations monitoring tools, which typically work at the server level
The point is that if a user can’t buy a book on our website because our servers crash under load – this is a bug. The crash may not be due to incorrectly written code, but the absence of code warning us that the system was running out of steam … this is still a bug.
In order to monitor quality in Production, we need to:
- Clean up the code that writes to log files: eliminate all logging used for code testing, or statements such as “the code should never reach here”. Instead, write messages that will be meaningful to the poor soul who, a few weeks later, will be poring over megabytes of log files on a Sunday night trying to figure out why the system crashed
- Ensure that log messages have consistent severity levels (e.g. as recommended by RFC 5424 – Wikipedia has a nice table), so that meaningful alerts can be triggered
- Use a log aggregation system, like Graylog2 (open source), so that log files from multiple nodes in the same cluster, as well as from nodes in different services, can (a) be searched from a console and (b) be viewed, time-aligned, on a single page (critical for troubleshooting). Graylog2 can handle hundreds of millions of log events and terabytes of data.
- MEASURE: establish a baseline for response time, resource consumption, and errors – and trigger alerts when the metrics deviate from the baseline beyond a predetermined threshold
- Track that core functions – from a user perspective – complete, and log when, and ideally why, they fail, along with key parameters. E.g.: are users able to upload files to our system? Are failures related to file size, time of day, location of the user, etc.?
- Log UX and operationally meaningful events to track how users actually use the system, what features are most used and track them over time. These metrics are critical for the Product Management team
- Monitor resource utilization and correlate it with usage patterns. Quantify key usage parameters in order to scale the right resources in advance of demand. For example, as traffic grows, the media servers and the database servers may grow at different rates.
- Integrate alarms from application errors into the Ops monitoring tools: e.g. too many “can’t connect” errors should trigger an Ops alert that our partner is down – slow response time on a single server in a cluster may indicate the disk is failing
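As a minimal sketch of the instrumentation described above – in Python, with hypothetical names (`buy_book`) and made-up baseline values – one might wrap core user-facing functions so that their duration and outcome are always logged, and deviations from a measured baseline trigger warnings:

```python
import logging
import time

# Severity levels map onto RFC 5424-style levels via Python's logging module.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("checkout")

# Hypothetical baseline, established from historical measurements.
BASELINE_MS = 250.0      # typical response time
ALERT_FACTOR = 3.0       # warn when 3x slower than baseline

def instrumented(fn):
    """Log duration and outcome of a core user-facing operation."""
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            # Failures are logged with severity, not swallowed silently.
            log.error("%s failed: %s", fn.__name__, exc)
            raise
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > BASELINE_MS * ALERT_FACTOR:
            log.warning("%s took %.0f ms (baseline %.0f ms)",
                        fn.__name__, elapsed_ms, BASELINE_MS)
        else:
            log.info("%s completed in %.0f ms", fn.__name__, elapsed_ms)
        return result
    return wrapper

@instrumented
def buy_book(user_id, book_id):
    # Stand-in for a real core function.
    return {"user": user_id, "book": book_id, "status": "purchased"}
```

In a real system, the baseline would be recomputed from production data and the warnings fed into the Ops monitoring tools, rather than merely written to a log file.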
Quality is not a one-time event; it is an everyday activity, because users change their behaviors, partners change their APIs, and systems fill up and slow down. What worked yesterday may not work today, or may no longer be good enough for our customers. As a consequence, the concept of “test-driven” development must be extended to Production systems, and our code must be instrumented to provide metrics that confirm everything works as desired – and alerts when it doesn’t. But that is not sufficient: developers and QA engineers must also take the time to look at the data, not just when a fire drill has been called, but on a regular basis, to understand how the system is being used and how resources are consumed as the system scales, and to apply this knowledge to subsequent releases.
While it may be possible to migrate a self-hosted architecture to the cloud with servers in identical configurations, it will almost certainly lead to a sub-optimal architecture in terms of performance, and to higher costs – in some cases prohibitively so.
The common objectives for moving to the cloud are:
- Ability to scale transparently as the business grows
- Reduce costs
- Benefit from a world-class IT infrastructure without having to hire the talent
We’ll focus on the first two objectives as the third one is achieved – by nature – the moment you flip the switch to the cloud.
Memory Drives Pricing in the Cloud – not CPU
In the cloud, whether with Amazon EC2 or other vendors, the primary dimension driving pricing is the amount of memory (RAM) available in the server. In addition, the CPU allocated is roughly proportional to the amount of RAM.
- A Small instance has (only) 1.7 GB of RAM and 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), and costs $0.08 per hour On-Demand
- An Extra Large instance is 8 times bigger than a Small instance and costs 8 times as much: 15 GB of RAM and 8 EC2 Compute Units, at $0.64 per hour On-Demand
- In order to get 32 GB of RAM, one needs to move to the High-Memory Double Extra Large instance, aka m2.2xlarge: $0.90 per hour On-Demand
Note that the prices quoted here are for the US East (N. Virginia) region. Prices for US West (Northern California) are about 12% higher, based on a few data points I correlated.
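A quick back-of-the-envelope calculation on the On-Demand prices quoted above (era-specific figures from this post, not current AWS pricing) makes the memory-driven pricing visible as a cost per GB of RAM per hour:

```python
# On-Demand prices as quoted in this post (US East, at the time of writing).
instances = {
    "Small (m1.small)":        {"ram_gb": 1.7,  "price_hr": 0.08},
    "Extra Large (m1.xlarge)": {"ram_gb": 15.0, "price_hr": 0.64},
    "m2.2xlarge":              {"ram_gb": 32.0, "price_hr": 0.90},
}

for name, spec in instances.items():
    per_gb = spec["price_hr"] / spec["ram_gb"]
    print(f"{name}: ${per_gb:.4f} per GB of RAM per hour")
```

The Small and Extra Large instances cost roughly the same per GB (the 8x rule), while the High-Memory family is noticeably cheaper per GB of RAM – one reason memory-hungry workloads gravitate to it.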
Database servers have unique requirements:
- They require fast I/O to disk. While Amazon recommends using networked storage, this is typically not practical from a performance perspective.
- So database servers also require large local disks to hold the data
- Most databases require a fair amount of memory: at least 16 GB. We use Cassandra, and they recommend 32 GB per server.
- They hate noisy neighbors (see previous blog). While virtualization technology does a fairly good job of partitioning CPU and RAM, it does a much poorer job of sharing I/O bandwidth. Having another virtual machine running on your database server can kill its performance: even if the neighbor does not do much, it can ruin I/O efficiency. All the tricks that databases use to optimize I/O performance assume that the database is in control of all I/O buses.
As a consequence, one should first of all use a reserved instance – simply because the cost of getting data in and out of the local disks makes it impossible to set up / tear down database servers “at will”.
Secondly, one should buy a large enough instance (e.g. m2.4xlarge) to be the only tenant on the server. This will cost $7,203 per year – based on Heavy Utilization Reserved Instance pricing – and provides 68.4 GB of memory, 4 cores (8 virtual, with Intel Hyper-Threading) and 1.69 TB of local storage.
As Adrian Cockcroft from Netflix illustrates in his detailed post: Benchmarking High Performance I/O with SSD for Cassandra on AWS, moving to SSD instances for I/O and compute intensive systems can bring significant cost reductions. In his example, he compares a traditional system with 36 x m2.xlarge + 48 x m2.4xlarge instances at a cost of $772,806 (Total 3 Year Heavy Use Cost) – with a 15 x hi1.4xlarge system at a cost of $354,405 – a 54% savings.
As the article illustrates, selecting one versus the other requires a careful understanding of the computational profile of the application, and some changes in the application’s architecture.
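The savings figure can be verified with simple arithmetic on the 3-year costs quoted from Adrian Cockcroft’s post:

```python
# 3-Year Heavy Use costs quoted in the Netflix post.
traditional_cost = 772_806   # 36 x m2.xlarge + 48 x m2.4xlarge
ssd_cost = 354_405           # 15 x hi1.4xlarge

savings = 1 - ssd_cost / traditional_cost
print(f"Savings from moving to SSD instances: {savings:.0%}")
```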
Do I want to use Proprietary Amazon Solutions?
Following the logic that motivates us to move to the Cloud forces us to consider using Amazon’s proprietary solutions: reduced need for sysadmin talent, a scalable high-availability solution out of the box, etc.
- Should I replace MySQL with RDS? Or Cassandra and HBase with DynamoDB?
- Should I replace my ActiveMQ (e.g.) message queue with SQS?
- … and similarly for Amazon’s many other products
These are excellent products, battle-tested by Amazon. However, there are two very important considerations to examine:
- First, these products are obviously proprietary – so a move to another cloud provider like Rackspace or Joyent will require an extensive code rewrite. This may turn out to be impractical.
- Secondly, cost can be a (bad) surprise once the application is deployed live. For both RDS and SQS, pricing is driven by data bandwidth AND the number of operations performed using the service – which requires careful analysis to estimate ahead of time. For example, polling every 10 seconds to check whether new data is present in SQS generates about 250K operations per month (assuming each check requires only 1 operation). This is fine if the function is performed by a few servers, but it would break the bank if it were performed by 100,000 end-user clients: at $0.000001 per request, this adds up to roughly $25,000 per month.
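The polling arithmetic above is easy to reproduce (per-request price as quoted in the post; a 30-day month assumed):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000 seconds in a 30-day month
POLL_INTERVAL = 10                   # seconds between checks
PRICE_PER_REQUEST = 0.000001         # $ per SQS request, as quoted

# Each polling client issues ~259,200 requests per month (~250K).
requests_per_client = SECONDS_PER_MONTH // POLL_INTERVAL

for clients in (10, 100_000):
    monthly_cost = clients * requests_per_client * PRICE_PER_REQUEST
    print(f"{clients:>7} clients: ${monthly_cost:,.2f} per month")
```

Ten polling servers cost a couple of dollars a month; 100,000 polling clients cost about $26K – the “break the bank” scenario.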
Algorithm Tuning and Server Selection
More generally, Amazon offers seven families of servers: Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, and High I/O (SSD). Porting an existing application will thus require an iterative process evaluating the following questions:
- How do I best match each of my system’s components with one of Amazon’s instance types?
- Can I fine-tune, or even re-write, my algorithms to maximize RAM & CPU utilization? In particular, would I make the same memory vs. computation trade-offs? Do I need this hash table, or can I re-compute the result?
- How does my architecture evolve as I scale out? For example, do I need to replicate shared resources – like caches – or will sharding (for example) avoid this duplication of data? This directly impacts my cost, since pricing is memory-driven. An algorithm may work best with an approach that favors memory (and minimizes CPU) when running on a single server, yet be more cost-effective when optimized to minimize memory once scaled out over many servers.
- How do new technologies like SSDs impact my architecture? As the Netflix article illustrates, the cost impact can be radical, but it required a substantial architecture redesign, not just a simple server replacement.
In conclusion, moving from a hosted environment (where each server can be configured at will) to the cloud, where servers come in pre-determined configurations, requires not only an architecture review, but also a sophisticated Excel spreadsheet to compare the costs of various architectures. This upfront financial modeling is absolutely necessary in order to avoid unpleasant surprises as the business scales up.
The selection of a cloud service provider is a critical decision for any software service provider. Cost is, naturally, a key driver in this selection. However, predicting the cost of running servers in the cloud is a project in and of itself, because the only way to build a reliable cost model is to go ahead and deploy our systems with the service providers.
Why is it not possible to forecast costs with pen and paper?
The main reason that pricing is so hard to forecast is that our system architecture in the cloud will likely be different from the one currently running in our own datacenter: the server configurations are different, the networking is different, and most likely we want to take advantage of the new features that come “for free” with a deployment in the cloud: higher availability, geographical redundancy, larger scale, etc. We’ll cover this in detail in an upcoming post.
Another reason why it is hard to predict costs is that we don’t really know what we are getting:
When one considers the primary attributes of a server: RAM, CPU, storage, I/O (network bandwidth) – only RAM and storage capacity are guaranteed by cloud vendors. Vendors provide varying degrees of specificity about CPU and other key characteristics. Amazon defines EC2 Compute Units: “One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor”. Rackspace’s price sheet categorizes servers by available RAM and disk space (the more RAM, the more disk space). Their FAQ mentions the number of virtual cores each server receives, based on the amount of RAM allocated, but I could not find their definition of a virtual core. GoGrid, or Joyent provide similarly limited information.
As a side note, one needs to be aware that vendors typically refer to “virtual cores” – as opposed to real (physical) cores. A virtual core corresponds to one of the two hyperthreads that run on modern Intel processors (since 2002). Thus, a server with a quad-core Intel Xeon processor runs 8 virtual cores. You can read this 2009 post, plus the comments thread, for more specifics. While the data is dated, the observations are still relevant.
So there is a lot that we don’t know about the servers on which we will run our system: CPU clock speed, size of the L1 and L2 caches, I/O bus speed, disk spindle rotation speed, network card bandwidth, etc.
Furthermore, performance will vary across servers (since cloud vendors have a diverse fleet of servers of different ages); thus, each time a new image is deployed, it will land on a random server with the same nominal specs (RAM, storage) but unknown other physical characteristics (CPU clock, I/O bandwidth, etc.).
Another well-documented problem is that of noisy neighbors. While hypervisors do a fairly good job of controlling the allocation of CPU and memory, they are not as effective at controlling the multitude of other factors that affect performance. I/O, in particular, is very sensitive to contention. While VMware affirms that vSphere solves this problem, most (all?) cloud vendors use open-source hypervisors.
In any event, this problem is systemic and cannot be solved by the hypervisor. For example, we did a lot of research on the best configuration for our Cassandra servers (a database for big data). One of the main performance optimizations driving Cassandra’s design is to maximize “append” (rather than update) operations, thus minimizing random movement of the disk’s read/write head and maximizing disk I/O. Unfortunately, all this clever optimization goes out the window if we share the server – and thus the disks – with a noisy neighbor performing random read-write operations. I had the chance to discuss this a couple of months ago with members of the Cassandra team at Netflix (one of the largest users of Cassandra, deployed almost 100% on Amazon): they solve the issue by only using m2.4xlarge instances on AWS, which (today) ensures that they are the only tenant on the physical server – and thus have no noisy neighbors.
Adding all this together, it is pretty clear that comparing vendors on paper is practically fruitless.
Let’s Try It Out
The only practical way to create a realistic budget forecast is to actually deploy systems on the selected cloud vendor(s) and “play” with them. Here are some areas to investigate and characterize, beyond simply validating functionality:
- Optimal server configuration for each server role (web, database, search, middle tier, cache, etc.). We need to make sure that each server role is adequately served by one of the configurations offered by the vendor. For example, very few offer servers with more than 64 GB of RAM.
- Performance at scale: since we only pay for the servers we rent, we can run full-scale performance tests for a few hours or days at relatively low cost (e.g. a few hundred dollars). Netflix tested Cassandra performance on AWS – “Over a million writes per second” – for less than $600, with clusters as large as 288 nodes.
- End-to-end latency (measured from an end-user perspective) – since latency will be impacted by the physical distribution of the servers
- Pricing model
For these tests to be meaningful, one needs to ensure that deployments are realistic: for example, across service zones and regions, if we plan on leveraging these capabilities – as they impact not only performance (due to increased network latency) but also pricing (data transfer charges).
In addition, each test must be run several (10 – 20) times – with fresh deployments – at different times of day – in order to have a representative sample of servers and neighbors.
As important as the technical performance validation, the pricing model must be validated as vendors charge for a variety of services in addition to the lease of the servers: most notably bandwidth for data transfers (e.g. across regions), but also optional services (e.g. AWS Monitoring or Auto-Scaling), as well as per operation fees (e.g. Elastic Block Store). The “per operation” fees can add up to very large amounts, if one is not careful. For example, see the Amazon SimpleDB price calculator – we have to run SimpleDB under real load in order to figure out what numbers to plug in. Overlooking this step can be costly.
Once the technical tests have been completed and the system configuration validated, I recommend at least a full billing cycle of simulated operations, in order to obtain an actual bill from the vendor, from which we can build our pricing model.
A few conclusions have surfaced:
- One needs to be clear about one’s motivations for migrating to the Cloud – different motivations will lead to different outcomes for a given product
- It is almost impossible to predict the cost of a cloud-hosted system – without deploying a test system with the selected vendor. As a corollary, precise comparison shopping is almost impossible.
- It is almost impossible to design, let alone deploy, your system architecture without prior hands-on experimentation with your selected vendor. Also, the optimal architecture once deployed in the Cloud is likely to be radically different from one deployed on your own servers.
- Some Cloud vendors are moving aggressively up the value chain by offering innovative software technologies on top of their infrastructure. They are thus becoming PaaS (Platform as a Service) vendors. For example, as we commented in a previous post, “Is Amazon After Oracle and Microsoft?”, Amazon is deploying an array of software technologies – combined with services – that are tailored specifically for the Cloud, and are technically very advanced.
We’ll expand on each of these points in upcoming posts, starting with the first one today.
The main arguments advanced in favor of a cloud infrastructure are:
- Offload the system management responsibilities to the Cloud services provider:
This is more than an economic trade-off: managing systems for high-volume Internet applications is a complex task requiring a broad set of technical skills – skills that are in permanent evolution. Acquiring them typically requires multiple engineers with varied backgrounds: computer hardware, operating systems, storage, networking, scripting, security, etc. These system administrators have been in high demand for the past couple of years, command high compensation, and usually want to work for companies that offer challenging work … namely, those with a very large number of systems. As a result, some companies are simply unable to hire the necessary system administration talent in-house, and are forced to move to the Cloud for this reason alone.
- Leverage best practices established by Cloud vendors.
Cloud services providers have optimized every aspect of running a datacenter. For example, Facebook released the Open Compute Project in 2011 for server and data center technology. Rackspace launched the OpenStack initiative in 2010 to standardize and share software for compute and storage, as well as identity, dashboard, and related services. Even managing systems at a hosting provider requires constant tuning of system management tools – whereas a Cloud service provider takes on this burden.
- Benefit from the economies of scale that the Cloud vendors have created for themselves
Building data centers, finding cheap sources of power, buying and racking computers, creating high-bandwidth links to the Internet, etc. are all activities whose cost drops with volume. However, to me, the impact of price is much smaller than that of pure skills. The aforementioned tasks are becoming more and more complex, to the point where only the largest companies are capable of investing enough to keep up with the state-of-the-art.
In particular, Cloud vendors offer high-availability and recoverability “for free” – namely: free from a technical perspective, but not from a financial one.
- Ability to rapidly scale systems up or down according to load
This is one of the main theoretical benefits of the cloud. However, it requires a few architectural components to be in place:
(a) The software architecture has to be truly scalable and free of bottlenecks. For example, traditional N-tier architectures were advertised as scalable because web servers could be added easily. Unfortunately, the database rapidly becomes the throttling component as load rises. Scaling up traditional database sub-systems while maintaining high availability is both difficult and expensive.
(b) Tools and algorithms are required to detect variations in load, and to provision/decommission the appropriate servers. This requires a good understanding of how each component of the system contributes to the performance of the whole system. The complexity increases when the performance of components does not behave linearly with load.
(c) Data repositories are slow and expensive to migrate. For example, doubling the size of a Cassandra (NoSQL database) cluster is time-consuming, uses a lot of bandwidth (for which the vendor may charge), and creates load on the existing nodes in the cluster.
- Ability to create/delete complete system instances (most useful to development and testing)
The Cloud definitely meets this promise for the front-end and business logic layers, but if an instance requires a large amount of data to be populated, you must either pay the time & cost at each deployment or keep the data tier up at all times. This being said, deploying complete instances in the Cloud is still a lot cheaper and faster than doing it in one’s data center, assuming it can be done at all.
- The Cloud is cheaper:
This is a simple proposition, with a complex answer. As we’ll examine in the next post, figuring out pricing in the cloud is a lot more complex than adding up the cost of servers.
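As an illustration of requirement (b) under “rapidly scale systems” above, a deliberately naive threshold-based sizing rule for a stateless tier might be sketched as follows. The thresholds and function name are hypothetical; real values must come from load testing each component, and real policies must handle the non-linear behavior noted above:

```python
# Hypothetical thresholds; calibrate from load tests, not guesses.
SCALE_UP_UTIL = 0.75     # add capacity above 75% utilization
SCALE_DOWN_UTIL = 0.30   # remove capacity below 30%
MIN_SERVERS = 2          # never drop below the high-availability floor

def desired_servers(current: int, utilization: float) -> int:
    """Naive threshold-based sizing for a stateless tier.

    Assumes near-linear scaling, which rarely holds for data tiers
    (e.g. growing a Cassandra cluster is slow and bandwidth-heavy).
    """
    if utilization > SCALE_UP_UTIL:
        return current + 1
    if utilization < SCALE_DOWN_UTIL and current > MIN_SERVERS:
        return current - 1
    return current

print(desired_servers(4, 0.85))  # -> 5 (overloaded: add a server)
print(desired_servers(4, 0.20))  # -> 3 (underused: remove one)
```

Even this toy version shows why a good understanding of each component’s contribution to overall performance is required: the thresholds encode that understanding.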
Appreciating the business and technical drivers that motivate a migration to the Cloud will drive how we approach the next steps in the process: system architecture design, vendor selection, and pricing analysis. As always, different goals will lead to different outcomes.
Amazon is quietly, slowly, but surely becoming a software vendor (in addition to being the largest e-tailer), with product offerings that compete directly with, and in some cases are broader than, those of “traditional” software vendors such as Oracle and Microsoft.
For example, a simple review of Amazon’s products shows no fewer than three database options: Amazon Relational Database Service (RDS), SimpleDB, and DynamoDB (launched earlier this year), which offers almost infinite scale and reliability.
Amazon also offers an in-memory cache – ElastiCache. You can also use their “Simple” services: Workflow Service (SWF) – e.g. for business processes, Queue Service (SQS) – for asynchronous inter-process communications, Notification Service (SNS) – for push notifications, as well as email (SES). Amazon calls them all “simple”, yet a number of startups have been built, and gone public or been acquired, over the past couple of decades on the basis of just one of these products: PointCast, Tibco, IronPort, to name a few.
This is not all … Amazon offers additional services in other product categories: storage, of course, with S3 and EBS (Elastic Block Store), plus web traffic monitoring, identity management, load balancing, application containers, payment services (FPS), billing software (DevPay), backup software, a content delivery network, MapReduce … my head spins trying to name all the companies whose business it is to provide just a single one of these products.
Furthermore, Amazon is not just packaging mature technologies and slapping a “cloud” label on them. Some of them, like DynamoDB, are truly leading edge. Yet, what is most impressive, and where Amazon’s offering is arguably superior to that of Oracle, Microsoft or the product category competitors, is that Amazon commits to supporting and deploying these products at “Internet scale” – namely as large as they are. This is not only a software “tour-de-force” but also an operational one – as anyone who has tried to run high-availability and high-throughput Oracle or SQL Server clusters can testify.
The cost of deploying software on the Amazon stack is another story … and the topic of a future post.
This post presents a practical guide of what happens during a typical Agile iteration – a sort of play-by-play for each role in the team, day by day. Please open the attached spreadsheet which models the day-by-day activities of a 2-week Agile development iteration, and describes the main activities for each role during this 10-day cycle of work. In addition, we will highlight how to successfully string iterations together, without any dead time; as the success of any given iteration is driven by preparation that has to take place in earlier iterations.
This is intended as a guide, rather than a prescription. While each iteration will have its own pace, a successful iteration will follow a sequence not too different from the one presented here.
Each company is different, each project is different, each team is different, each release is different, and each interpretation of Agile is different. The following states the immutable principles to which I personally adhere.
- Once Engineering and the Product Owner agree on the deliverables of an iteration, they are frozen for that iteration
- Engineering must deliver on time
- Features cannot be changed, added, or re-prioritized
- The only exception is a “customer down” escalation of a day or more
- Engineering delivers “almost shippable” quality code at the end of the iteration
- Each iteration is self-contained: all the activities pertaining to a given user story must be completed within the iteration, or explicitly slated for another iteration at the start of the iteration
- E.g.: QA, unit tests, code reviews, design documentation, update to build & deployment tools, etc
- Dev & QA engineers scope their individual tasks at the beginning of each iteration. The scope and deliverables of the iteration are based on these estimates.
- Engineers are accountable to meet their own estimates
The above implies that Engineers must plan realistically by
(a) accounting for all activities that will need to take place for this iteration, and
(b) accounting for typical levels of interruptions and activities not specifically related to the project (scheduled meetings, questions from support, beer bashes, vacations, etc).
Estimates must be made with the expectation that we are all accountable for meeting them. This sounds like a truism, except that it is rarely applied in practice.
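Point (b) above is just arithmetic, but it is the arithmetic teams most often skip. As a hypothetical illustration (the overhead figure and vacation days are made up; each team should calibrate its own):

```python
# Hypothetical figures; calibrate against your own team's history.
ITERATION_DAYS = 10        # a 2-week iteration
vacation_days = 1          # known absences for this engineer
OVERHEAD_FRACTION = 0.20   # meetings, support questions, beer bashes

# Capacity to plan against, not the calendar length of the iteration.
effective_days = (ITERATION_DAYS - vacation_days) * (1 - OVERHEAD_FRACTION)
print(f"Plan against {effective_days:.1f} effective days, not {ITERATION_DAYS}")
```

An engineer who commits 10 days of tasks against 7.2 effective days has already missed the deadline on Day 1.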
Preparation and planning prior to an iteration are critical to its success. As the spreadsheet highlights, the Product Manager spends the majority of his/her time during a given iteration planning the next iteration, by
(a) Prioritizing the tasks to be delivered in the next iteration
(b) Documenting the user stories to the level of detail required by developers
(c) Reviewing scope with Project Manager and Tech Lead
The following must be delivered to Engineering at the start of an iteration. The Product Owner, Project Lead and Tech Lead are responsible for providing:
- “A” list of user stories to be implemented during the iteration
- Detailed specs of the “A” list user stories
- Design of the “A” list features sufficient to derive the coding and QA tasks necessary to implement the features
- Estimated scope for each feature – rolling up to a target completion date for the iteration
These estimates are “budgetary”. Final estimates are given by the individual engineers.
The whole team gets together and kicks off the iteration: the PM presents the “A” list features to Engineering, and the Tech Lead presents the critical design elements. Tasks are assigned tentatively.
During the rest of the day, engineers review the specs of their individual tasks, with the assistance of PM and Tech Lead. This results in tasks entered into Jira, with associated scope and individual plans for the iteration.
The Project Lead combines all tasks into a project plan (using artifacts of his/her choice) to ensure that the sum of all activities adds up to a timely delivery of the iteration. The Project Lead also identifies any critical dependency, internal and external, that may impact the project.
A delivery date is computed from the individual estimates, and the team (led by the Product Owner, assisted by the Project Manager and Tech Lead) iterates to adjust tasks and date.
Day 1 activities continue if necessary – resulting in a committed list of deliverables and a committed delivery date.
The team, led by the Project Manager, also agrees on how the various tasks will be sequenced to optimize the use of resources, and to front-load deliverables to QA as much as possible.
Developers start coding, QA engineers start writing test cases and/or writing automation tests
By Day 6, the Product Manager provides the V1 Spec of the next iteration.
V1 Spec is a complete spec of all the user stories that the Product Owner estimates can be delivered in the next iteration. Typically, V1 will contain more than can be delivered, in order to provide flexibility in case some user stories turn out to be more complex to implement than originally thought.
During the remainder of the iteration, the Tech Lead (primarily) will work with the Product Owner to flesh out the details of the next iteration: designing its key components to a degree sufficient to (a) list out the tasks required to implement the user stories, (b) estimate their scope, and (c) ensure that enough detail has been provided for developers and QA engineers.
During the discussions of the next release, the Project Lead will identify any additional resources that will need to be procured, whether human or physical.
Release to QA means more than “feature complete”. It means feature complete and tested to the best of the developers’ knowledge and ability (see below).
By Day 9, all bugs must have been fixed, so that the QA team can spend the last day of the iteration running full regression tests (ideally automated) and ensuring that all new features still work properly in the final build.
By that time, the content and scope of the next iteration have been firmed up by the Product Owner, Tech Lead, and Project Manager, and tasks are tentatively assigned to individual engineers.
At the end of the last day of the iteration, Engineering demos all the new features to the PM, the CEO, and everyone in the company we can enlist.
We then celebrate.
Tools and Tips
- Depending on the complexity of the user stories, the Tech Lead (and other developers) may need to spend all of their time doing design, and may not be able to contribute any code.
- It is sometimes more productive to write automation tests once a given feature is stable. As a consequence, the QA team may adopt a cycle where they test manually during the current iteration and then automate the tests during the next iteration (once the code is stable)
- Exceptions to “almost shippable” are things like performance and stress testing, full browser compatibility testing, etc.
- These tasks are then planned in the context of the overall release, and allocated to specific iterations
The duration of a given iteration is at the discretion of the team. It is strongly recommended that iterations last between 2 and 4 weeks. It is also recommended that the duration of iteration be driven by its contents, in order to meet the Golden Rules. There is nothing wrong with a 12- or a 17-day iteration.
Similarly, the starting day of the iteration is up to the team. Starting on a Wednesday offers several advantages:
- The iteration does not start on a Monday :-). Mondays are often taken up by company & team meetings.
- Iteration finishes on a Tuesday rather than a Friday. Should the iteration slip by a day or two, it can be completed on Wednesday, or Thursday if need be. This means that the QA team is not always “stuck” having to work weekends in order to meet the deadline, nor do they have to scramble to make sure that developers are available during the weekend to fix their bugs, as would be the case if the iteration started on Monday
- By the second weekend of the iteration, the team will have good enough visibility into its progress to determine whether weekend work will be required in order to meet the schedule.
The artifacts, format, and level of detail through which specs are delivered to Engineering are a matter of mutual agreement between the Product Owner and Engineering, recognizing that Engineering is the consumer of the specs. As such, it is Engineering who determines the adequacy of the information provided (since Engineering cannot create a good product from incomplete specs).
Specs must be targeted at QA as well as Dev. In particular, they must be prescriptive enough that validation tests can be derived from them. For example, they may include UI mockups, flow charts, information flow diagrams, error handling behavior, platforms supported, and performance and scaling requirements, as necessary.
While the QA team has the primary responsibility of executing the tests that will validate quality, developers own the quality of the software (since they are the ones writing the software). As a consequence, when developers release to QA, they must have tested their code to ensure that no bugs of Severity 1 or 2 will be found by QA (or customers) – unless they explicitly agree in advance with the QA team that certain categories of tests will be run by QA.
Regardless of who runs the tests, the “release to QA” milestone is only reached when enough code introspection and testing has been performed to warrant confidence that no Severity 1 or 2 bugs will be found.
Developers and QA can agree on how code will be released to QA. While the spreadsheet shows a single Release to QA milestone, this was done for clarity of presentation. In practice, it is recommended that developers release to QA as often as possible. Again, this should be driven by mutual agreement.
Furthermore, each developer must demonstrate to his/her QA colleague that the code works properly before the code is considered to be released. This demo is accompanied by a knowledge transfer session, where the developer highlights any known limitations in the code, areas that should be tested with particular scrutiny, etc.
Estimating Scope Accurately
One of the typical debates is whether time estimates should be measured in “ideal time” (no interruptions, distractions, meetings), or “actual time” (in order to account for the typical non-project-related activities). This is a matter of personal preference – what counts is that everyone in the team uses the same system.
I prefer to use “ideal time”: each engineer keeps two “books” within an iteration: the actual iteration work – scoped in ideal time – and an “Other Activities” book, where all non-project-related activities are accounted for. This presents the advantages of (a) using a non-varying unit to measure the scope of tasks, so that you can compare across people, projects, and time, and (b) having a means to track “non-productive time” on your project – and thus having data on which to base decisions (e.g. pleading with management for fewer meetings).
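The two-book scheme above is easy to mechanize. As a minimal illustrative sketch (the class and field names here are my own, not part of any prescribed tooling), each engineer logs project tasks in ideal time in one book and meetings, interviews, etc. in the other, and the overhead ratio falls out directly:

```python
from dataclasses import dataclass, field


@dataclass
class IterationLog:
    """Two 'books' for one engineer within an iteration."""
    project: list = field(default_factory=list)  # (task, ideal hours)
    other: list = field(default_factory=list)    # (activity, hours)

    def log_task(self, task: str, hours: float) -> None:
        """Book 1: actual iteration work, scoped in ideal time."""
        self.project.append((task, hours))

    def log_other(self, activity: str, hours: float) -> None:
        """Book 2: non-project-related activities."""
        self.other.append((activity, hours))

    def overhead_ratio(self) -> float:
        """Fraction of total logged time spent outside the project."""
        p = sum(h for _, h in self.project)
        o = sum(h for _, h in self.other)
        total = p + o
        return o / total if total else 0.0


# Hypothetical week of entries
log = IterationLog()
log.log_task("implement checkout flow", 12)
log.log_task("code review", 3)
log.log_other("all-hands meeting", 2)
log.log_other("recruiting interview", 1.5)
print(f"overhead: {log.overhead_ratio():.0%}")  # prints "overhead: 19%"
```

Because the project book is kept in a non-varying unit, the ratio is comparable across people and iterations, which is exactly the data you want in hand when arguing for fewer meetings.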