Migrating a Self-hosted Architecture to the Cloud

While it may possible to migrate a self-hosted architecture to the cloud with servers in identical configuration, it almost certainly will lead to a sub-optimal architecture in terms of performance, and higher costs, in some cases prohibitively so.

The common objectives for moving to the cloud are:

  • Ability to scale transparently as the business grows
  • Reduce costs
  • Benefit from a word-class IT infrastructure without having to hire the talent

 

We’ll focus on the first two objectives as the third one is achieved – by nature – the moment you flip the switch to the cloud.

Memory Drives Pricing in the Cloud – not CPU

in the cloud, whether with Amazon EC2 or other vendors, the primary dimension driving pricing is the amount of memory (RAM) available in the server. In addition, CPU allocated is roughly proportional to the amount of RAM.

For example, as of this writing, per the Amazon EC2 pricing and the Amazon EC2 Instance Types definitions:

  • A Small instance has (only) 1.7 GB of RAM – and 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit) and costs: $0.08 per hour On-Demand
  • An Extra Large instance is 8 times bigger than a small instance and costs 8 times as much: 15 GB of RAM – and 8 EC2 Compute Unit and costs: $0.64 per hour On-Demand
  • In order to get 32 GB of RAM, one needs to move to the High-Memory Double Extra Large Instance aka m2.2xlarge: $0.90 per hour On-Demand

Note that the prices quoted here are for US East (N. Virginia) zone. Prices for US West (Northern California) are more expensive (about 12% based on a few data points I correlated)

Database Servers

Database servers have unique requirements:

  • They require fast I/O to disk. While Amazon recommends using networked storage, this is typically not practical from a performance perspective.
  • So database servers also require large local disks: to hold the data
  • Most databases require a fair amount of memory at least 16 GB. We use Cassandra, and they recommend 32 GB per server.
  • They hate noisy neighbors (see previous blos). While virtualization technology does a fairly good job at partitioning CPU and RAM, it does a much poorer job at sharing I/O bandwidth. Having another virtual machine running on your database server can kill its performance, even if the neighbor does not do much, it can kill the I/O efficiency. All the tricks that databases use to optimize I/O performance assume that the database is in control of all I/O busses.

As a consequence, one should first of all use a reserved instance – simply because the cost of getting data in and out of the local disks makes it impossible to set-up / tear-down database servers “at will”.

Secondly, one should buy a large enough instance (e.g. m2.4xlarge) so that we are the only tenant on the server. This will cost $7,203 per year – based on Heavy Utilization Reserved Instances pricing, and get us 68.4 GB memory, 4 cores (8 virtual – with Intel Hyper-Threading) and 1.69 TB of local storage.

SSD

As Adrian Cockcroft from Netflix illustrates in his detailed post: Benchmarking High Performance I/O with SSD for Cassandra on AWS, moving to SSD instances for I/O and compute intensive systems can bring significant cost reductions. In his example, he compares a traditional system with 36 x m2.xlarge + 48 x m2.4xlarge instances at a cost of $772,806 (Total 3 Year Heavy Use Cost) – with a 15 x hi1.4xlarge system at a cost of $354,405 – a 54% savings.

As the article illustrates, selecting one versus the other requires careful understanding of the computational profile of the application, and some changes in the application’s architecture

Do I want to use Proprietary Amazon Solutions?

Following the logic that motivates us to move to the Cloud forces to consider using Amazon proprietary solutions: reduce need for sys admin talent, leverage out-of-the-box a scalable high-availability solution, etc

  • Should I replace mySQL with RDS? Or Casssandra, HBase with DynamoDB?
  • Should I replace my ActiveMQ (e.g.) message queue with SQS?
  • … and similarly for AWS many products

 

These are excellent products, battle tested by Amazon. However, there are 2 very important considerations to examine:

  • First, these products are obviously proprietary – making a move to another cloud provider like Rackspace or Joyent, will be take an extensive code rewrite. This may turn out to be impractical.
  • Secondly, cost can be a (bad) surprise once the application is deployed live. For both RDS or SQS, pricing is driven by data bandwidth AND the number of operations performed using the service – which requires careful analysis to estimate ahead of time. For example, polling every 10 seconds to check whether new data is present in SQS generates 250K operations per month (assuming each check requires only 1 operation). This is fine if this function is performed by a few servers, but would break the bank if it’s performed by 100,000 end-user clients. This adds up to $25,000 ($0.000001 per Request).

Algorithm Tuning and Server Selection

More generally, Amazon offers seven families of servers: Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, and High I/O (SSD). Porting an existing application will thus require an iterative process evaluating the following questions:

  • How do I best match each of my system’s components with an Amazon instance types>
  • Can I fine-tune, or even re-write, my algorithms to maximize RAM & CPU utilization? In particular, would I make the same memory vs computation trade-offs? Do I need this hash-table, or can I re-compute the query?
  • How does my architecture evolve as I scale out? For example, do I need to replicate shared resources – like caches – or will sharding (e.g.) avoid this duplication of data – which will directly impact my cost since pricing is memory driven. An algorithm may work best using a approach favoring memory (and minimizing CPU) when running on a single server but it may be more cost-effective when optimized for memory when scaled out over many servers.
  • How do new technologies like SSD impact my architecture? As the Netflix article illustrates, the cost impact can be radical, but it required substantial architecture redesign, not just a simple server replacement

 

In conclusion, moving from a hosted environment (where each server can be configured at will) to the cloud where servers come in pre-determined configurations requires not only an architecture review, but a sophisticated excel spreadsheet to compare the costs of various architectures. This upfront financial modeling is absolutely necessary in order to avoid unpleasant surprises as the business scales up.

Want to Predict your Cost in the Cloud? Roll Up Your Sleeves!

 

The selection of a cloud service provider is a critical decision for any a software service provider. Cost is, naturally, a key driver in this selection. However, predicting the cost of running servers in the cloud is a project in, and of, itself, because the only way to build a reliable model of costs, is to go ahead and deploy our systems with the service providers.

 

Why is not possible to forecast costs with pen and paper?

The main reason that pricing is so hard to forecast is that our system architecture in the cloud will likely be different from the one currently running in our own datacenter: the server configurations are different, the networking is different, and most likely we want to take advantage of the new features that come “for free” with a deployment in the cloud: higher availability, geographical redundancy, larger scale, etc. We’ll cover this in details in an upcoming post.

 

Another reason why it is hard to predict costs is that we don’t really know what we are getting:

When one considers the primary attributes of a server: RAM, CPU, storage, I/O (network bandwidth) – only RAM and storage capacity are guaranteed by cloud vendors. Vendors provide varying degrees of specificity about CPU and other key characteristics. Amazon defines EC2 Compute Units: “One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor”. Rackspace’s price sheet categorizes servers by available RAM and disk space (the more RAM, the more disk space). Their FAQ mentions the number of virtual cores each server receives, based on the amount of RAM allocated, but I could not find their definition of a virtual core. GoGrid, or Joyent provide similarly limited information.

 

As a side note, one needs to be aware that vendors typically refer to “virtual cores” – as opposed to real (physical) cores. A virtual core corresponds to one of the two hyperthreads that run on modern Intel processors since 2002. Conversely, a server with a quad-core Intel Xeon processor runs 8 virtual cores. You can read this 2009 post, plus the comments thread, for more specifics. While the data is dated, the observations are still relevant.

 

So, there is a lot that we don’t know about the servers on which we will run our system: CPU clock, size of LI, L2 RAM, I/O bus speed, disk spindle rotation speed, network card bandwidth, etc.

Furthermore, performance will vary across servers (since cloud vendors have a diverse park of servers of different age) and thus, each time a new image is deployed, it will land on a random server, with the same nominal specs (RAM, storage), but unknown other physical characteristics (CPU clock, I/O bandwidth, etc).

 

Another well-documented problem is that of noisy neighbors. While the hypervisors do a fairly good job at controlling allocation of CPU and memory, they are not as effective at controlling the multitude of other factors that affect performance. I/O in particular is very sensitive to contention. While VMware affirms that vSphere solves this problem, most (all ?) cloud vendors use open source hypervisors.

In any event, this problem is systemic and cannot be solved by the hypervisor. For example, we did a lot of research on the best configuration for our Cassandra servers (database for big data). One of the main performance optimizations driving Cassandra’s design is to maximize “append” (rather than update) operations, thus minimizing random movement of the read/write head of the disk, and thus maximizing disk I/O. Unfortunately, all this clever optimization goes out the window if we share the server – and thus the disks – with a noisy neighbor who is performing random read-write operations. I had the chance to discuss this a couple of months ago with members of the Cassandra team at Netflix (one of the largest users of Cassandra and almost 100% deployed on Amazon): they solve the issue by only using m2.4xlarge instances on AWS, which (today) ensures that they are the only tenant on the physical server – and don’t have any noisy neighbor.

 

Adding all this together makes it pretty clear that vendor comparison on paper is practically fruitless.

Let’s Try It Out

The only practical way to create a realistic budget forecast is to actually deploy systems on the selected cloud vendor(s) and “play” with them. Here are some areas to investigate and characterize, beyond simply validating functionality:

  • Optimal server configuration for each server role (web, database, search, middle tier, cache, etc). We need to make sure that each server role is adequately served by one of the configurations offered by the vendor. For example, very few offer servers with more than 64 GB of RAM
  • Performance at scale (since we only pay for the servers we rent, we can run full-scale performance tests for a few hours or days at relatively low cost – e.g. a few hundred dollars) – Netflix tested Cassandra performance, “Over a million writes per second”, on AWS for less than $600 and clusters as large as 288 nodes
  • End-to-end latency (measured from an end-user perspective) – since latency will be impacted by the physical distribution of the servers
  • Pricing model

 

For these tests to be meaningful, one needs to ensure that deployments are realistic: for example, across service zones and regions, if we plan on leveraging these capabilities – as they impact not only performance (due to increased network latency) but also pricing (data transfer charges).

 

In addition, each test must be run several (10 – 20) times – with fresh deployments – at different times of day – in order to have a representative sample of servers and neighbors.

 

As important as the technical performance validation, the pricing model must be validated as vendors charge for a variety of services in addition to the lease of the servers: most notably bandwidth for data transfers (e.g. across regions), but also optional services (e.g. AWS Monitoring or Auto-Scaling), as well as per operation fees (e.g. Elastic Block Store). The “per operation” fees can add up to very large amounts, if one is not careful. For example, see the Amazon SimpleDB price calculator – we have to run SimpleDB under real load in order to figure out what numbers to plug in. Overlooking this step can be costly.

 

Once the technical tests have been completed, and the system configuration validated,

I recommend at least a full billing cycle of simulated operations, in order to obtain an actual bill from the vendor from which we can build our pricing model.

Deploying to the Cloud? Hang on to your Trousers!

My team and I have spent the past months investigating a deployment to the Cloud with vendors such as Amazon, Rackspace, GoGrid … to name a few who provide Infrastructure As A Service (IaaS).

A few conclusions have surfaced:

  • One needs to clear about one’s motivations to migrate to the Cloud- different motivations will lead to different outcomes, for a given product
  • It is almost impossible to predict the cost of a cloud-hosted system – without deploying a test system with the selected vendor. As a corollary, precise comparison shopping is almost impossible.
  • It is almost impossible to design, let alone deploy, your system architecture – without prior hands-on experimentation with your selected vendor. Also, the optimal architecture once deployed in the Cloud is likely to be radically different than one deployed on your own servers.
  • Some Cloud vendors are moving aggressively up the value chain by offering innovative software technologies on top of their infrastructure. They are thus becoming PaaS (Platform As A Service) vendors. For example, as we commented in a previous post “Is Amazon After Oracle and Microsoft?” Amazon is deploying an array of software technologies – combined with services – that are tailored specifically for the Cloud, and are technically very advanced

We expand each of these points in upcoming posts, starting with the first one today.

The main arguments advanced in favor of a cloud infrastructure are:

  • Offload the system management responsibilities to the Cloud services provider:
    This is more than an economic trade-off: managing systems for high-volume Internet applications is a complex task requiring a broad set of technical skills – where said skills are in permanent evolution. Acquiring all these skills typically requires multiple engineers with varied backgrounds: computer hardware, operating systems, storage, networking, scripting, security, etc. These system administrators have been in high-demand for the past couple of years, demand high compensation, and usually want to work for companies which offer challenging work … namely those with a very large number of systems. As a result, some companies are simply unable to hire the necessary system administration talent in-house, and are forced to move to the Cloud for this single reason.
  • Leverage best practices established by Cloud vendors.
    Cloud services providers have optimized every aspect of running a datacenter. For example, Facebook released the Open Compute project in 2011 for Server and Data Center Technology. RackSpace launched the OpenStack initiative in late 2010, to standardize and share software for Compute (systems management, Storage, Media, Security, as well as Identity and Dashboard. Even managing systems at a hosting provider requires constant tuning of system management tools –  whereas a Cloud service provider will take on this burden
  • Benefit from the economies of scale that the Cloud vendors have created for themselves
    Building data centers, finding cheap sources of power, buying and racking computers, creating high-bandwidth links to the Internet, etc. are all activities whose cost drops with volume. However, to me, the impact of price is much smaller than that of pure skills. The aforementioned tasks are becoming more and more complex, to the point where only the largest companies are capable of investing enough to keep up with the state-of-the-art.
    In particular, Cloud vendors offer high-availability and recoverability “for free” – namely: free from a technical perspective, but not from a financial one.
  • Ability to rapidly scale systems up or down according to load
    This is one of the main theoretical benefits of the cloud. However, it requires a few architectural components to be in place:
    (a) the software architecture has to be truly scalable and free of bottlenecks. For example, traditional N-tier architectures were advertised to be scalable because web servers could be added easily. Unfortunately, the database rapidly becomes the throttling component as the load rises. Scaling up traditional database sub-systems, while maintaining high-availability , is both difficult and expensive.
    (b) Tools and algorithms are required to detect variations in load, and to provision/decommission the appropriate servers. This requires a good understanding of how each component of the system contributes to the performance of the whole system. The complexity increases when the performance of components does not behave linearly with load.
    (c) Data repositories are slow and expensive to migrate. For example, doubling the size of a Cassandra (noSQL database) cluster is time consuming, uses a lot of bandwidth (for which the vendor may charge) and creates load on the nodes in the cluster.
  • Ability to create/delete complete system instances (most useful to development and testing)
    The Cloud definitely meets this promise for the front-end and business logic layers, but if an instance requires a large amount of data to be populated, you must either pay the time & cost at each deployment or keep the data tier up at all times.  This being said, deploying complete instances in the Cloud is still a lot cheaper and faster than doing it in one’s data center, assuming it can be done at all.
  • The Cloud is cheaper:
    This is a simple proposition, with a complex answer. As we’ll examine in the next blog: figuring out pricing in the cloud is a lot more complex than adding the cost of servers.

Appreciating the business and technical drivers that motivate a migration to the Cloud will drive how we approach the next steps in the process: system architecture design, vendor selection, and pricing analysis. As always, different goals will lead to different outcomes.

Is Amazon After Oracle and Microsoft?

Amazon is quietly, slowly, but surely becoming a software vendor (in addition to being the largest etailer), with product offerings that compete directly, and in some cases, are broader than the “traditional” software vendors such as Oracle and Microsoft.

For example, a simple review of Amazon Products shows no less than 3 database options Amazon Relational Database Service (RDS), SimpleDB and DynamoDB (launched earlier this year), which offers almost infinite scale and reliability.

Amazon also offers an in-memory cache – ElastiCache. You can also use their SIMPLE services: Workflow Service (SWF) – e.g for business processes, Queue Service (SQS) – for asynchronous inter-process communications, a Notification Service (SNS) – for push notifications, as well as email (SES). Amazon calls them all “simple”, yet a number of startups have been built and gone public or been acquired in the past couple decades on the basis of a single of these products: PointCast, Tibco, IronPort, just to name a few.

This is not all … Amazon offers additional services in other product categories: storage, of course, with S3 and EBS (Elastic Block Store), Web traffic monitoring, Identification management, load balancing, application containers, payment services (FPS), billing software (DevPay), backup software, content delivery network, MapReduce … my head spins trying to name all the companies whose business is to provide just a single one of these products.

Furthermore, Amazon is not just packaging mature technologies and slapping a “cloud” label on them. Some of them, like DynamoDB, are truly leading edge. Yet, what is most impressive, and where Amazon’s offering is arguably superior to that of Oracle, Microsoft or the product category competitors, is that Amazon commits to supporting and deploying these products at “Internet scale” – namely as large as they are. This is not only a software “tour-de-force” but also an operational one – as anyone who has tried to run high-availability and high-throughput Oracle or SQL Server clusters can testify.

Given its breadth of products, its ability to operate them at Internet-scale with high-availability, Amazon could become the default software stack: a foundation on which to architect products, displacing the traditional stacks such as: .Net, LAMP, or {mySQL,Oracle}-Java-Apache-JavaScript

The costs of deploying software on the Amazon stack is another story … and the topic of a future post

Day-by-Day Model of an Iteration

This post presents a practical guide of what happens during a typical Agile iteration – a sort of play-by-play for each role in the team, day by day.

This post presents a practical guide of what happens during a typical Agile iteration – a sort of play-by-play for each role in the team, day by day. Please open the attached spreadsheet which models the day-by-day activities of a 2-week Agile development iteration, and describes the main activities for each role during this 10-day cycle of work. In addition, we will highlight how to successfully string iterations together, without any dead time; as the success of any given iteration is driven by preparation that has to take place in earlier iterations.

This is intended as a guide, rather than a prescription. While each iteration will have its own pace – a successful release will follow a sequence not too different from the one presented here.

Golden Rules

Each company is different, each project is different, each team is different, each release is different, and each interpretation of Agile is different. The following states the immutable principles to which I personally adhere.

  • Once Engineering and Product Owner agree on the deliverables of an iteration, they are frozen for this iteration
    • Engineering must deliver on time
    • Features cannot be changed, added, or re-prioritized
    • Only exception is a “customer down” escalation of a day or more
  • Engineering delivers “almost shippable” quality code at the end of the iteration
  • Each release is self-contained: all the activities pertaining to a given user story must be completed within the iteration, or explicitly slated for another iteration at the start of the iteration
    • E.g.: QA, unit tests, code reviews, design documentation, update to build & deployment tools, etc
  • Dev & QA engineers scope their individual tasks at the beginning of each iteration. The scope and deliverables of the iteration are based on these estimates.
    • Engineers are accountable to meet their own estimates

The above implies that Engineers must plan realistically by
(a) accounting for all activities that will need to take place for this iteration, and
(b) accounting for typical levels of interruptions and activities not specifically related to the project (scheduled meetings, questions from support, beer bashes, vacations, etc).

Estimates must be made with the expectation that we are all accountable to meeting them. This sounds like a truism, except that it is rarely applied in practice.

Day by Day

Before the Start of an Iteration

Preparation and planning prior to an iteration are critical to its success. As the spreadsheet highlights, the Product Manager spends the majority of his/her time during a given iteration planning the next iteration, by

(a)  Prioritizing the tasks to be delivered in the next iteration
(b)  Documenting the user stories to the level of detail required by developers
(c)  Reviewing scope with Project Manager and Tech Lead

Pre-requisites at the Start of a Release

The following must be delivered to Engineering at the start of a release. The Product Owner, Project Lead and Tech lead are responsible for providing

  • “A” list of user stories to be implemented during the release
  • Detailed specs of the “A” list user stories
  • Design of the “A” list features sufficient to derive the coding  and QA tasks necessary to implement the features
  • Estimated scope for each feature – rolling up to a target completion date for the iteration

These estimates are “budgetary”. Final estimates are given by the individual engineers.

Day 1 – Kick-Off

The whole team gets together and kicks-off the iteration: the PM presents the “A” list features to Eng, and the Tech Lead presents the critical design elements. Tasks are assigned tentatively.

During the rest of the day, engineers review the specs of their individual tasks, with the assistance of PM and Tech Lead.  This results in tasks entered into Jira, with associated scope and individual plans for the iteration.

The Project Lead combines all tasks into a project plan (using artifacts of his/her choice) to ensure that the sum of all activities adds up to a timely delivery of the iteration. The Project Lead also identifies any critical dependency, internal and external, that may impact the project.

A delivery date is computed from the individual estimates, and the team (lead by Product Owner, assisted by Project Manager and Tech Lead) iterates to adjust tasks and date

Day 2 – Deliverables are Finalized

Day 1 activities continue if necessary – resulting into a committed list of deliverables and a committed delivery date

The team, lead by Project Manager, also agrees on how the various tasks will be sequenced to optimize use of resources, and to front-load deliverables to QA as much as possible.

Developers start coding, QA engineers start writing test cases and/or writing automation tests

Day 6  – V1 Spec of the Next Iteration

By Day 6, the Product Manager provides the V1 Spec of the next iteration.

V1 Spec is a complete spec of all the user stories that the Product Owner estimates can be delivered in the next iteration. Typically, V1 will contain more than can be delivered, in order to provide flexibility in case some user stories are more complex than originally thought to implement.

During the remainder of the release, the Tech Lead (primarily) will work with the Product Owner to flesh out the details of the next release, to design the key components of the next release to a degree sufficient to be able to (a) list out the tasks required to implement the user stories, (b) estimate their scope, and (c) ensure that enough details has been provided for developers and QA engineers.

During the discussions of the next release, the Project Lead will identify any additional resources that will need to be procured, whether human or physical.

Day 7 – Release to QA

Release to QA means more than “feature complete”. It means feature complete and tested to the best of the developers’ knowledge and ability (see below).

Day 9 – Code Freeze

By Day 9, all bugs must have been fixed, so that the QA team can spend the last day of the iteration running full regression tests (ideally automated) and ensuring that all new features still work properly in the final build

By that time, the content and scope of the next release has been firmed up by Product Owner, Tech lead, and Project Manager, and task are tentative assigned to individual engineers.

Day 10 – Show & Tell

At the end of the last day of the iteration, Eng demos all the new features to the PM, the CEO and everyone in the company we can enroll.

We then celebrate.

Tools and Tips

Sequencing Iterations

  • Depending on the complexity of the user stories, the Tech Lead (and other developers) may need to spend all of their time doing design, and may not be able to contribute any code.
  • It is sometimes more productive to write automation tests once a given feature is stable. As a consequence, the QA team may adopt a cycle where they test manually during the current iteration and then automate the tests during the next iteration (once the code is stable)
  • Exceptions to “almost shippable” are things like performance and stress testing, full browser compatibility testing, etc.
    • These tasks are then planned in the context of the overall release, and allocated to specific iterations

Release Duration

The duration of a given iteration is at the discretion of the team. It is strongly recommended that iterations last between 2 and 4 weeks.  It is also recommended that the duration of iteration be driven by its contents, in order to meet the Golden Rules. There is nothing wrong with a 12- or a 17-day iteration.

Start on Wednesday

Similarly, the starting day of the iteration is up to the team. Starting on a Wednesday offers several advantages:

  • The iteration does not start on a Monday -). Mondays are often taken up by company & team meetings.
  • Iteration finishes on a Tuesday rather than a Friday. Should the iteration slip by a day or two, it can be completed on Wednesday, or Thursday if need be. This means that the QA team is not always “stuck” having to work weekends in order to meet the deadline, nor do they have to scramble to make sure that developers are available during the weekend to fix their bugs, as would be the case if the iteration started on Monday
  • By the second weekend of the iteration, the team will have good enough visibility into its progress, and determine whether work during the weekend will be required in order to meet the schedule.

Specs

The artifacts, format and level of details through which specs are delivered to Engineering is a matter of mutual agreement between Product Owner and Engineering, recognizing that Engineering is the consumer of the specs. As such, it is Engineering  who determines the adequacy of the information provided (since Engineering cannot create a good product from incomplete specs).

Specs must be targeted for QA as well as Dev. In particular, they must be prescriptive enough so that validation tests can be derived from them. For example they may include UI mockups, flow charts, information flow diagrams, error handling behavior, platforms supported, performance and scaling requirements, as necessary.

Release to QA

While the QA team has the primary responsibility of executing the tests that will validate quality, developers own the quality of the software (since they are the ones writing the software). As a consequence, when developers release to QA, they must have tested their code to ensure that no bugs of Severity 1 or 2 will be found by QA (or customers) – unless they explicitly agree in advance with the QA team that certain categories of tests will be run by QA.

Regardless of who runs the tests, the “release to QA” milestone is only reached when enough code introspection and testing has been performed to warrant confidence that no Severity 1 or 2 bugs will be found.

Releasing to QA

Developers and QA can agree on how code will be released to QA. While the spreadsheet shows one Releate to QA  milestone, this was done for clarity of presentation. In practice, it is recommended that developers release to QA as often as possible. Again, this should be driven by mutual agreement.

Furthermore, each developer must demonstrate to his/her QA colleague that the code works properly before the code is considered to be released. This demo is accompanied by a knowledge transfer session, where the developer highlights any known limitations in the code, areas that should be tested with particular scrutiny, etc.

Estimating Scope Accurately

One of the typical debates is whether time estimates should be measured in “ideal time” (no interruptions, distractions, meetings), or “actual time” (in order to account for the typical non-project-related activities). This is a matter of personal preference – what counts is that everyone in the team uses the same system.

I prefer to use “Ideal time”: each engineer keeps 2 “books” within an iteration: the actual iteration work – scoped in “ideal time”, and a “Other Activities” book, where all non-project-related activities are accounted for. This presents the advantages of (a) using a non-varying unit to measure the scope of tasks so that you can compare across people, project, time, and (b) having a means to track “non-productive time” on your project – and thus have data on which to drive decisions (e.g. pleading management for less meetings)

Click here to get the spreadsheet

Software Specification is a Process Not a Document (2 of 2)

Engineering depends on the business team to create actionable specifications early enough before a release, to control the scope to a level commensurate to resources and time available, and to use artifacts that are relevant to the information to be conveyed.

Timing is Everything

Product Management delivering complete specifications in a timely fashion greatly improves the productivity of the Engineering team (Complete being relative the type of specifications – as we discussed in the previous blog). The more precise the information provided at the start of each phase (scoping, release or iteration), the more efficient and accurate will the resulting development work be.

This sounds boringly obvious, but I have seen the contrary scenario over and over again, where business leaders grumble that the Engineering team is not productive, while failing to provide more than a PowerPoint level specification at the start of  releases. As a consequence, developers spend the first third to half of the release working with the Product Managers to define the specs, instead of writing code – or even worse, developers start writing code without spec, and then having to do it over once the specs have been thought through.

Scoping is a 2-way Commitment

Another pitfall to avoid is “scope-creep”. While the name itself would imply that it should be avoided at all costs (who wants to be creepy?), scope creep is an all-too-common occurrence

Scope creep, on the surface, appears to stem from good intentions (we want to meet every customer request – even last minute ones), yet it is one of the most demoralizing behaviors for the Engineering team – akin to continuously pushing back the finish line, after the start of a race.

In order to avoid scope creep, we (Engineering) need to remind the business team that based on the information provided during the scoping phase, Engineering reserved a set of resources for the duration of the release, and committed to deliver the feature set in the allotted time. This in turn creates an implicit contract that the scope of the release – will be bound by the amount of resources allocated to the release. While changes are expected as we get closer to the release start, and even once the release has started, the business team can’t forget that there are only 24 hours in a day, and that no matter how cool it would be to add another 25% functionality, asking the Engineering team for such an increase in scope flies in the face of the process: If we could really do 25% more, we’d have said so the first time during the scoping phase.

In summary, once Engineering  allocates resources for a release and commits to deliverables and schedules, the business team, in turn, must commit to keep the scope of the release commensurate to the resources allocated.

Use the Right Artifacts for the Job

As we replaced Waterfall development process with Agile Software development, we also replaced Market/Product Requirement Documents with User Stories. I have to admit that I don’t get that part, or rather that I find that sometimes user stories are the best vehicle to express customer requirements, and other times, straight requirements do a better job.

For example, when a workflow needs to be implemented, nothing beats a flow chart or a state diagram to define it – we can dispense with the user story on the 3×5 card.

Write Things Down

There is no dispute that face to face discussions are the fastest way to nail down a user story. Often the expected behavior is self-evident from the software implementation itself. However, we must remember that multiple constituencies need to reach common understanding on the software’s behavior: not only the Product Champion and developers, but also, QA, support, services, etc.

Again, there is no way that more than 2 people can reach the same understanding of how a workflow should perform, or what a report is meant to compute unless it is written down, preferably in pictorial form

Technical Risk Must Be Eliminated Prior to Scoping

The business team expects estimates that are fairly accurate – say within 10%. You can see eyes roll when you present  your estimates and then add that the estimate is accurate within 30% … and it’s a fair reaction. As a consequence, time must be invested in research, design and/or prototyping, in order to reach the desired level of accuracy. Sometimes, we need to invest the time to build a prototype in order to validate a design or an architecture. While this initially may appear to be a prohibitive price to pay, a much much higher price would be paid if one embarks on a release, only to miss the deadline by a month or more, because we found out that the original design was inadequate.

Managing Perceptions

Which scenario is best?:

(A)  Promise to deliver 12 features and end up delivering 10 – OR –

(B)  Promise to deliver 9 features and end up delivering 9

In my experience, Scenario (A) is a perceived failure, while (B) will be perceived as a success.

If you agree with me, then you will want to think hard about your iteration plan, and about what features you implement in which iteration. Naturally, the later the iteration within the release, the more likely it is that its features will not be implemented (either because of schedule slips, or changes in priorities). Consequently, plan low-impact features for the last release(s); this way you’ll have to option of jettisoning them if necessary while still nailing the committed schedule. Conversely, if you high-impact features for the end, your only choices will be to disappoint — by taking them out in order to meet the schedule, or to disappoint — by forcing a schedule slip.

In conclusion, software development is a team activity – not only within the Engineering team but also with the business team: Engineering depends on the business team to create actionable specifications early enough before a release, to control the scope to a level commensurate to resources and time available, and to use artifacts that are relevant to the information to be conveyed.

Software Specification is a Process Not a Document (1 of 2)

Software specification needs to be thought of as a process, rather than a document. The three phases of the process are: (1) Release Scoping, (b) Release planning / iteration sequencing and (c) in-depth user story specifications

At each of the companies where I have worked a debate has always raged about how to document  new products specifications. As VP of Engineering, I am frequently asked to produce a template for  Requirements Documents. On the other hand, Agile does away with requirements, in favor of user stories. This, in turn, is in conflict with the business team, who wants to know six months ahead of time what they can promise to customers.

The first step towards reconciling these various perspectives  is to understand that Software Specification is a Process not a document: the value of a specification comes mostly from the process of creating it, and less so, from the final artifact. For one, the final specification rarely captures the features that were excluded, nor the business justifications behind any given feature.

The Specification Process comprises 3 different phases with different purposes and different deliverables.

  1. The first phase is Scoping: this phase typically takes place weeks before the start of the release. The output of the scoping phase is an estimate from the Engineering team that a certain bag of features can be delivered by a given date, with a given set of resources.
  2. The second phase is the Release Planning, ideally starting(shortly) before the official start of the release, where the engineering lead, with input from the product manager, creates the release plan, breaks out the release into iterations, and defines the major features to be built in each iteration
  3. The third phase involves the detailed specification of the features/user stories for each iteration.

Scoping

In my world of enterprise software, the customers, and the business team, want to know months in advance what features will be available by when. Both the release date and the features are determined before the start of the project (sometimes weeks before) and must be met. This is not Agile, but it is reality – see my earlier blog “Setting Expectations about Formal Releases with the Business Team

In order to produce a reliable estimate of what will be delivered when, the Engineering team needs a complete list of features, with a degree of specificity that only needs to be good enough for the Engineering team to appreciate the degree of difficulty of each task.
For example, the spec for a user registration page on a web site could be as simple as:

  • User enters Username, Password first time, Password second time.
  • The Username must be unique
  • The 2 entries for the Password must be identical

… but it could get a lot more complicated

  • The password must meet “strength of security” criteria
  • As the user types in the password, the strength of security of the password  will be computed and displayed graphically
  • The registration server must handle up to 2,000 registrations per minute with a response time of 3 seconds or less
  • System availability must be 99.99% uptime

The two scenarios are vastly different. However, the Engineering team does not need to know a lot more than the bullets above to engage in a discussion with the business team about the scope of the project. If the application’s software stack has not already been validated for performance or reliability, the second project is going to take weeks, compared to hours for the first one. Even the little visual indicator of password strength can add days to the scope of the project (if AJAX needs to be added to the app, or if the team does not have a graphic designer readily available).

While the spec can be very short and still allow the Engineering team to provide scope estimates, one should not underestimate the time it will take to scope. For example, if system performance is significantly increased, scoping will involve design and probably prototyping.

The scoping estimates are typically done based on experience by comparing the new project to previous ones, estimating the number of functional points, etc.

Release Planning / Iteration Sequencing

Release planning, or iteration sequencing, is an overlooked and underrated activity, and yet it often signifies the difference between perceived success and failure. Agile suggests that the user stories most important to the customers should be developed first. This is indeed the primary guide in sequencing activities within a release. However, other important factors need to be considered. For example:

  • Eliminating technical risks for some of the important features
  • Confirming ease of use and usability by mocking up or prototyping key components of the user interface so that they can be shown to customers for feedback early in the release cycle, thus leaving time for modifications.
  • Integration of new libraries, tools, or partners
  • Performance validation

By going through the release planning exercise, the team drills down further in the specifications, gets a more refined appreciation for the scope of the project and thus confirms, or infirms, the original scoping estimate. If necessary, adjustments can be made before the project  starts. Early preventive action is always a good thing!
In addition, release planning is important to ensure availability of critical resources whether human, or physical.
Finally, a proper release plan will align the coding effort with the integration and testing strategy. For example, it is simpler to test API calls when you implement both sides of it, or to test a DAO, when you simultaneously code the UI front end for it.

“Intra-Release Specification”: Detailed User Stories

Once a release has started, detailed user stories must be provided to the Engineering team prior to the start of each iteration – so that the iteration can be scoped at the start of the iteration,by the developers, and the features can be implemented during the iteration.
While interactions between Product Management and developers are encouraged during the iteration, having well-thought out user stories ahead of the iteration greatly improves efficiency.

By understanding that specifying product requirements is a process, rather than a document, both business and engineering teams will work effectively, by delivering the proper level of information to each other at the right time. In the next blog, I’ll cover tricks and best practices of this process.

New Year’s One Wish – and One Resolution

New Year’s One Wish: Specs on Time
New Year’s One Resolution: Fostering More Innovation

It is that time of year again …. New Year’s resolutions. Well, I won’t talk – in this blog – about staying fit or a better work-life balance, but rather I’ll keep it focused on software engineering. I will also keep it simple and limit to one wish: specs on time — and one resolution: fostering more innovation.

New Year’s One Wish: Specs on Time

If I could wave a magic wand, and change just one thing in my professional life, it would be: Engineering receives specs 8 weeks prior to the official start of the release

Why?

As obvious as the answer is, in general, the Business team (product management and execs who are involved in the product roadmap) don’t seem to grasp that if Engineering gets incomplete specs when we are supposed to start coding, well … we can’t start coding – which means we will not be able to deliver as many features as we could have.

For example, let’s say that one release finishes July 31, and that –as is typically the case – we only start discussing the December release on August 1. The result is that the specs do not get finalized until the end of August. Thus rather than having the expected 5 months to implement the new release, we have four. Actually, we have much less. …

Why “8 weeks prior”?

A lot of work must take place between the time Engineering receives the spec, and the time all developers can start coding in earnest:

  • Scope the deliverables: to figure out what subset of the proposed list of features submitted by PM can actually be delivered in the imposed timeframe
  • Finalize the exact deliverables: Based on the “cost” (scope) provided by Engineering, the PM team determines the final list of deliverables for the release
  • Architecture and design: some features require thinking before coding J
  • Technology analysis: evaluate and validate any new third-party tools, libraries, or packages
  • Task breakdown and allocation to individual developers

Only when all these tasks have been completed – at least on the most important features – can development start in earnest.

8 weeks seems like a long time, and there is obviously flexibility in the number. The newer the functionality, the more uncertainty on the scope, and the more research required, the more lead time is needed by Engineering to plan out the release.

A good rule of thumb is: when the previous release is feature-complete is the time when PM and Engineering need to start planning the next release, specs in hand.

What Spec?

… Enough to be able to scope, and plan

In order for Engineering to commit to deliver certain features by a certain date, Engineering needs enough information about these features to be able to scope them – in other words, something more substantial than a handful of bullets on a PowerPoint slide, or a wiki.

There are many ways to write up requirements, and in the spirit of Agile Development, we don’t really want, nor need, everything written down upfront. On the other hand, Engineering needs a modicum of clarity and specificity about what needs to be built. This can be delivered via use cases, UI mock-ups, flow-charts, feature list as long as the Engineering team can appreciate the scope of work: e.g. feature enhancement vs brand new functionality? Is new technology required? Are there performance challenges? Are there new partners or systems with whom we need to interface? Etc.

New Year’s One Resolution: Fostering More Innovation

The day-to-day, week-to-week, month-to-month pressure on the Engineering team is “to deliver on-time, on quality and on budget” The vast majority, not to say all, demands are short-term and reactionary: requests from customers, or responses to competitive pressures.

However, the Engineering team also has the responsibility to innovate: to add to our products something that nobody else has thought of. If we don’t, then within a couple of years, a competitor, or a new startup, will, forcing us to react and catch up.

So, how does one drive innovation in an Engineering team, when there is never enough time to do the things that the business team requires? The answer is that we have to make, sometimes steal, the time, and we have to be efficient – both are admittedly easier blogged about than done.  Please send me your suggestions!

Making the Time to Innovate

Few companies are as profitable as Google, and can afford to grant a blanket 20% of their time to employees to do self-directed research. On the other hand, in a world where technologies become obsolete within a couple of years, it is suicidal not to heed Steve Covey’s 7th Habit: “Sharpen the Saw”. This applies to each of us as individuals, but also as a team.

In practice, this means each engineer must, on his/her own, make time to research new technologies and approaches. It also means that as a team, we need to organize time to discuss promising technologies.

The best approach that I have found is “Tech Talks” where members of the Engineering team present their recent work; articles, books or blogs they have read; or simply a new technique that they have invented.

Directed Innovation

“Directed innovation” is admittedly a contradiction in terms. Yet, a small company cannot afford to disperse its resources. It is thus incumbent on the leadership to present the right context for the Engineering team in which to explore, why our customers buy our product and what will make a true difference for our business: performance, scalability, usability, reliability, etc?

With this context, we can direct our investigations and leverage our efforts. Again, team work, and group discussions usually accelerate the growth of ideas.

Getting a startup to articulate its vision and strategy, particularly how these translates to technology needs, has traditionally been a challenge in most of the startups where I have worked. The mode of operation has typically been more reactive – following the orders that will make the quarter …

This is exactly why fostering more innovation is my one and only New Year’s resolution.

Planning – and Executing the Plan – are Part of the Job

Along with writing good code, planning and meeting the plan are part of an engineer’s responsibilities, in order for the product to be successful and the business to thrive

Being an engineer entails more than writing good code. It also requires being a good corporate citizen. We write and test software so that it can be used by our customers. The Engineering team is one of the teams that constitute the business. As such, we need to coordinate our activities with those of the other teams in the business: Marketing, Sales, Operations, and Support. We are dependent on these other teams for our software to find its way into the hands of our customers. We also depend on them for the business to survive. Let’s not forget that Engineering is an expense center, and that without the Sales team, there would not be any paycheck.

Our obligation to the other teams in the company can be summarized fairly simply: we need to deliver what we promised, on time. We thus need to be able to forecast within a reasonable horizon what we will be able to create, and then deliver against our forecast.

Planning is Difficult but Necessary

Some argue that writing software is a creative and innovative endeavor, which, as such cannot be predicted. The comparison is made with Civil Engineering where designing a new building is akin to applying well documented formulas and following well defined processes lending themselves to formulaic forecasting. While there is truth to the argument, it cannot be taken to the limit. It does not means that forecasting a software project is impossible, but rather that it is hard.
This being said, we don’t have a choice. As I often point out to my colleagues, sales people have to forecast every quarter, and one can argue that forecasting sales is eminently more challenging, since it relies on the behavior of people over whom we have very little control: our customers. Yet, no company can operate without a sales forecast, and forecasting is one of the skills that salespeople need to develop, along with their sales acumen. Engineers are in the same situation.

More specifically, the reasons we make plans are:

  • To forecast when a given release will be complete. This in turn will drive forecasts for sales projections, staffing assignment in services – which in turn drive financial projections, and how the company manages its expenditures – such as our salaries
  • To make strategic decisions: for example, if certain set of features take too long, or too many resources, we may decide to postpone their implementation, and allocate resources to another product or set of features.
  • To make our own decisions: by knowing how much work each task will take allows us to staff projects appropriately, and thus be as efficient as possible.  Over-staffing and under-staffing both have negative consequences that are easy to understand
  • To align internal resources: the most obvious example is that the QA team needs to know when a certain feature will be ready to be tested.
    The above illustrates how important it is to meet our commitments, once we have announced our plans. If we don’t meet our plans, we let other people down, and force them to scramble to make alternate plans. Yet, meeting one’s commitments is not only about working hard. It starts with making good plans.

Making Good Plans

How does one make good plans?

  • First and foremost: include everything (easier said than done but none the less critical)
    • Think through ALL the tasks that are required to complete the job: create a new Maven project, become familiar with the idiosyncrasies of a new software package, upgrade libraries to a new version, organize design reviews, code, unit tests, integration tests, performance tests, error recovery tests, security intrusion tests, documentation, training, etc.
    • Account for everything that happens in a typical day/week: e.g. Meetings, interrupts from Ops, support, or other
  • Be realistic: Engineers tend to be optimistic – make sure that you take into account that something at some point is going to go wrong
    • The best technique that I know is to use history as a reference. Have you typically been late/early on your past projects. Are there activities that you typically fail to account for?
  • Build some buffer – because it is important to meet the commitment (and if you don’t need the buffer, you’ll use the time to implement an extra feature, or start the next release early)

Tracking Progress

A tool like Atlassian’s Jira allows each developer to enter their tasks and the time for each task. It is critical that each developer enter their own time estimate. No task should be longer than 2-3 days. If it is, it is best to break it up. I have found it to be the right balance between having enough detail in the task to grasp its whole scope, while keeping the total number of tasks manageable.

It is important to think of a task as a complete project: including reviewing requirements, design, code, integration, testing, documentation, hand-off to QA. Of course, each of these tasks can be spelled out when their scope warrants it. Again, include the typical daily overhead in the estimates .
Once we have entered the tasks in Jira, it is critical to track them accurately. Don’t be shy about entering time beyond your original estimates if you are running late: your teammates, and your team lead, need to know — so that they can make alternate plans if necessary. Progress tracking tools are not meant to find faults, but for project management and communication: it is a much worse offense to your team to keep quiet about your being late, or struggling, on a task, than the fact of being late.  Being late is a problem that can be dealt with – keeping quiet is a professional fault that hurts the project even more than bad code.

One important note: a task is DONE when you won’t need to put any more work into it. In particular, this means a piece of code is not done until it has been fully tested and validated.

Agile Processes for Formal Releases

2-4 week milestones culminating in a show-and-tell where Engineering and Product Owner(s) engage in a discussion about priorities deliver a lot of the advantages of Agile methodologies, even without official buy-in from the management team

Engineering can  follow a mostly Agile methodology, even if the rest of the company does not. For example, you can still break up the development effort into 2-4 week sprints/milestones, even if the Product Owner does not indulge in reviewing priorities for each milestone. In fact, I contend that by having frequent end-of-milestone review, you will in effect elicit prioritization from the Product Management team.

2-4 Week Sprints/Milestones

Regular milestones (every 2 to 4 weeks) are essential for a several reasons, each sufficient in its own right

  • 2-4 weeks is the proper horizon for planning. While it is not impossible to make plans over longer horizons, the accuracy of these plans drops significantly when they extend beyond 4 weeks. Per Agile, the plan for each Sprint needs to be made “bottoms-up” by the developers who are working on the project
  • Commitment to the plan – Since the developers created the plan themselves, we can ask them to commit to its timely execution. Accuracy in estimating one’s work is a skill that each developer must fine-tune
  • Visibility of progress. By having an “almost shippable” release tested at the end of each sprint, we can all assess progress realistically. As the Manifesto for Agile Software Development states “Working software is the primary measure of progress.”. My measure of progress is binary – if a feature passes all the tests then it is 100% done, if not it is not done (0% complete).

With the rhythm of 2-4 week milestones, every one on the team can see the product being built, with the confidence that true progress is being made and the expectation that no nasty surprises are lurking at the tail of the project.
Show-and-Tell

Show-and-tell is the culmination of the milestone, where the project team demonstrates the new features to their colleagues inside and outside of Engineering. It is critical to advertise the Show-and-Tell outside of the Engineering team, including to the CEO, VP Sales, VP Marketing, etc. The benefits of Show-and-Tell sessions are multiple:

  • Rewards for the engineers: The show-and-tell is a perfect opportunity to acknowledge the contribution of each engineer on the project and offer them public recognitions
  • Avoid surprises at the end: the last thing you want when you have toiled away for 3-6 months on a project is to hear something like “Nice work guys … but this is not what I expected!”, whether it is from your own team, or from customers. Thanks to regular show-and-tell, there are no surprises. We also give the tools to the Product Management team to share these early releases with customers, as appropriate.
  • Reassure Management: By demonstrating regular forward progress to the management team, we can relieve some of their anxiety as to whether we will be able to meet our deliverables. Equally important, when the project was too ambitious to start with, Engineering can give early warning to the management team that alternative plans need to be made, whether it is to reinforce the Engineering team, or to manage customer expectations.

Just like the milestones ensure that the Engineering team can manage its progress without surprises., the Show-and-Tell perform the same function for the business team.
Furthermore, the “Show-and-Tell” give “fair warning” to the rest of the company that the release is on its way, so that the marketing and sales machines can rev up in anticipation of the completion of the release (rather than “wait-and-see” until it’s officially released).
Engage into Discussions about Prioritization

While the business team may not embrace the Agile methodology, by holding the Show-and-Tell events, we effectively engage the Product Management team in a discussion about prioritization. Even when their perspective is “Everything is important, everything must be delivered”, by witnessing the progress, a discussion naturally ensues about whether we need to enhance, or rework, what has already been built, what features should be developed next, and the reaction of customers to the early releases.
There is no magic formula to make these discussions happen, but, in my experience, it simply happens naturally.

More than ever, in today’s fast-changing environment, Engineering must be both predictable and adaptable. Predictable so that the rest of the company can operate efficiently (e.g. by starting to market a release before it is actually complete) and adaptable in order to respond quickly to changes in the market and/or the competition. Having frequent milestones, and show-and-tell, give visibility to progress and set the stage to review priorities , and adjust them if necessary, with minimal impact on the efficiency of the Engineering team.