How To Make The CEO-CTO Relationship Work, Part Two

Previously published on Forbes on July 5, 2019

The relationship between CEO and CTO is pivotal to the success of technology-driven companies. Yet, the personalities and working styles of these driven individuals can be different, which sometimes leads to suboptimal results. I had the experience of joining a company with an established CEO and of greeting a new CEO to my company, so I decided to write two letters to help CEOs and CTOs get on the same page.

This is the counterpart to my last article, “How To Make The CEO-CTO Relationship Work”: it’s the letter that I wish I had received from my CEOs, and it gives CTOs tips on how to operate and communicate most effectively in service to the CEO and the executive team.

Dear CTO,

I know you have a brilliant and creative mind and an impressive mastery of technology, along with a solid track record of developing world-class products. As you may have guessed, your technical skills alone will not suffice for your success as an executive and as a productive working partner to me. To ensure our joint success, I want to share advice with you about how we can most profitably combine our efforts.

Let’s start with a pair of obvious observations. First, your colleagues on the executive team, myself included, do not have a technical background. Second, the purpose of the company is to grow as rapidly as possible by delivering products that users want and to generate income.

These two realities may clash with your natural tendencies as a gifted creator, particularly when it comes to the technical sophistication of products. Developing the coolest, fastest and slickest product is not always the best business strategy — particularly if it takes a long time. We will need to develop a partnership that allows us to make decisions that include both business needs and technical options. Not every release needs to be perfect in terms of scalability, usability, security, and every other technical consideration. Yet every release must meet the company’s business objectives of the moment. In order to achieve this, you can learn to never say “no,” but rather to present trade-offs, and explain them in terms of their business impact rather than their technical features (which we don’t understand). For example, if we need to deliver on an aggressive schedule, we need you to inform us of what is feasible within the desired time frame in order to achieve the desired business outcome. Do we need to license technology, take away specific features or limit some aspects of the product?

In a similar vein, the team as a whole will benefit enormously if you hone a new kind of creativity, or rather add a new dimension to your technical creativity. This new dimension is one that meets the needs of our customers in new ways, that identifies new markets that we can expand into easily, and that drives the growth of the company. This is a rare talent — one that combines creative understanding of the market with technical innovation.

Your (non-technical) peers on the executive team need you to use language that they understand; we know that you’ve mastered the technical ins and outs. Also, don’t mistake us for your sounding board — rather, you can go to members of your team for that. What is meaningful to us is the impact on the business. Often, it simply boils down to this binary outcome: whether or not we will meet our sales projections for the quarter. Meeting our quarterly objectives is paramount — it ensures we get to “fight another day” — and for that opportunity, we may occasionally ask you to temporarily compromise on technical purity or the efficiency of the engineering team.

We also ask you to be strong. At times, the executive team may “groupthink” into an idea that’s really bad from a technical perspective. Should we do so, we’ll need you to stand your ground and find a way to communicate to us — in terms that we understand — the errors of our ways. Use the technical facts as a foundation to illustrate the business outcomes. You are the only person in the company who knows what it will take to deliver a certain product, what technology, team, methodology, tools, and so on are best suited, and ultimately how long it will take to deliver the product to our customers.

I will do my best to listen when these situations arise. Even so, however, this process is not easy: You don’t want to give up simply because you are in the minority. Perhaps the hardest part is that, once you are confident that the executive team understands both engineering costs and the business consequences of their proposal, you’ll need to let the team make the decision. A typical scenario is when an important new feature is prioritized ahead of a major software re-architecture. Shipping the new feature on the old architecture will require rewriting it once the new architecture is complete. Yet, sometimes this inefficiency is the “right call”: for example, if it makes lighthouse customers happy and blocks out the competition.

Finally, understand that we welcome your input on all topics — not just technology and engineering. I’ve worked with remarkable CTOs who were brilliant business strategists, marketers, and even salespeople. While we seek your input, the final decision belongs to the designated executive team member.

These skills and contributions are all essential to the success of our shared enterprise, and you should develop them while retaining the qualities that inspired us to hire you in the first place. While I have emphasized communications and business acumen, your top priority remains to be a world-class innovator and technical leader. I will help you acquire these new skills over time so that your influence can reach its full potential within the executive team and as a partner to me, but you should continue (and I can’t help you here) to be a world-class technologist.

I hope you will find these tips useful, and I look forward to building a strong partnership together.

Sincerely,

CEO

How To Make The CEO-CTO Relationship Work

Previously published on Forbes on June 17, 2019

The success of a venture-backed company usually depends on two main factors: its technical innovation and the velocity with which it introduces new products. In order to sustain these competitive advantages throughout their growth, companies must ensure that the delicate relationship between the CEO and the CTO is effective.

The CEO and CTO have a fluid relationship that changes over time. As the company grows, the relationship evolves because of the expansion of the executive team beyond the original founders. As the company grows, investors may also replace the CEO with “a real business person.” Sometimes, the CTO decides to leave the company and its politics to found yet another company.

I’ve experienced this rapidly shifting dynamic from both sides: as an outside CTO coming in to replace, or supplement, the founding CTO, and as a CTO welcoming a new CEO after the VCs replaced the founder CEO. In both scenarios, I have observed (and suffered from) misaligned expectations between the CTO/VP of engineering and the CEO that lead to frustration and a lack of effectiveness on both sides.

With the benefit of hindsight, I have written two letters. The first, which I will present here, is one that I wish I had written to my CEOs so they could have understood the nature of my job, my contribution and how to get the best out of me. The other is the letter that I wish I had received from my CEOs, so that I could have understood how to be most effective, not only in leading the engineering team but also in my role on the executive team.

Here’s the letter that I, as a CTO, wish I had written to my CEOs:

Dear CEO,

I want to thank you for placing your trust in me to be the new CTO of your incredible company. During the interview process, I thoroughly enjoyed our exchanges, and I was equally impressed by your past accomplishments, your business sense, your knowledge of the market and your drive.

Since you mentioned that you are “not technical,” yet you are responsible for leading a company whose success is highly dependent on the strength of its technology, I thought that I would take a running start in our relationship-building by sharing my thoughts on what will make our relationship effective.

My primary advice is that you allow me to do the things I am good at without second-guessing me. You hired me because I have proven more than once that I can build and lead a team of world-class engineers and launch world-class products into the market. While I expect to be challenged, like every member of the executive staff, when I say that developing a new feature will take three months, please don’t ask if it could be done in two weeks. I too want to win. The three-month figure will not come out of thin air, as my team and I will have spent time coming up with this number. If we ever need to build something with roughly the same features in two weeks, it will have to be an extremely watered-down version that we’ll call “demo-ware” (which does have its place in certain circumstances), or we’ll need to pare the release down to one or two features.

For my team to succeed, I will also need you to work with the whole executive team to create an actionable product road map. By “actionable,” I mean that the priority of the features needs to be vetted by the business team and that the engineering team will need to be given the time to estimate the scope of major features so that the time frames published on the road map are realistic. If we follow this process, a sanitized version of the road map can be shared with the sales team and even customers.

The other major benefit of an actionable road map is that the engineering team can build a technology roadmap that will allow us to develop breakthrough features because we’ll have had time for research, experimentation and prototyping. Conversely, a road map that zigzags is not conducive to engineering efficiency because it wastes the time spent on design and planning work required for major features that are deprioritized. All of us in engineering understand that sometimes a major opportunity presents itself and that the whole company has to pivot to take advantage of it. We embrace those opportunities because we want to win just as strongly as you do. Yet the decision to pivot should consider the impact on engineering velocity as well as the new business potential.

Building a good product road map requires that we understand each other about schedule estimates: Loose requirements, changing priorities, a high velocity of development and accurate schedule estimates are not compatible. If you — and by extension, the business — require reliable schedule estimates, then engineering needs precise requirements that do not change, plus the time to work out a solid design from which a list of tasks and a schedule can be derived. If the nature of the business requires frequent changes of priorities, then let’s not bother with detailed estimates. Since it is a rare business that does not see priority changes, I strongly recommend that both the business and engineering teams embrace lean product and agile development methodologies.

Finally, at the risk of stating the obvious, engineers have different personalities than salespeople. When the engineering pen is quiet, it is not an indication of low morale. On the contrary, it shows that engineers are focused on writing code. I know that can be disconcerting to extroverts.

We’ll have to move fast in the journey we have undertaken together, and to do that, we need to communicate directly and trust each other. This letter is my attempt to do this, and if you’ve made it this far, there’s a good chance that we are at the start of a productive and fruitful partnership. I can’t wait.

Bernard Fraenkel

CTO

The letter I wish I had received from my CEOs will be published in a subsequent article.

Everything You Ever Wanted to Know About Technical Debt

Check out the white-paper I recently authored at the Silicon Valley Software Group.

Its main objective is to build a bridge between technical and non-technical executives to have rational discussions about technical debt, and then make rational decisions on how to tackle it.

Some of the main takeaways are:

  • Technical debt is on-going: Technical debt originates from a variety of sources, some legitimate, others less so, throughout the life of a product. This means that technical debt should be integrated into the product roadmap process.
  • There are different types of technical debt, characterized mainly by the risk they entail and the cost to remedy them. Consequently, there are different strategies to address different types of technical debt.
  • Ranking the various types of technical debt of a product on the two-dimensional plane of risk vs cost-to-fix provides a good vehicle to foster dialogue, and decisions, about engineering priorities between technical and business executives.
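
As a rough illustration of the ranking described in the last takeaway, the sketch below places a few debt items on the risk vs cost-to-fix plane and sorts them. The item names, the 1 to 5 scales and the "risk per unit of cost" ordering are all assumptions made for this example; they are not taken from the white-paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    name: str
    risk: int         # hypothetical scale: 1 (low) to 5 (high)
    cost_to_fix: int  # hypothetical scale: 1 (cheap) to 5 (expensive)


def rank_debt(items):
    """Order items so that high-risk, cheap-to-fix debt comes first.

    Ties on the risk/cost ratio are broken by higher risk, then by name.
    """
    return sorted(items, key=lambda d: (-d.risk / d.cost_to_fix, -d.risk, d.name))


# Hypothetical backlog, for illustration only.
backlog = [
    DebtItem("unsupported database version", risk=5, cost_to_fix=4),
    DebtItem("missing unit tests on billing", risk=4, cost_to_fix=2),
    DebtItem("inconsistent code style", risk=1, cost_to_fix=1),
    DebtItem("hand-rolled deployment scripts", risk=3, cost_to_fix=3),
]

ranked = rank_debt(backlog)
for item in ranked:
    print(f"{item.name}: risk={item.risk}, cost={item.cost_to_fix}")
```

A single sorted list is only one way to read the plane. In a discussion with business executives, plotting the same two numbers per item may be the more useful view.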

For more details, please download the white-paper at: svsg.co/sme

(Boosting) Morale in Engineering

The recent article by Jessica McKellar titled “This Is What Impactful Engineering Leadership Looks Like”, and the question “Any suggestions on how to inspire my team?” published on Everwise, prompted me to reflect on what impacts morale in Engineering teams.

At the risk of appearing to deflect my responsibilities as a VP of Engineering, I will assert that morale in Engineering is driven primarily by company culture. Consequently, in order to boost morale, my first priority is to focus outwards and educate the company leadership on how to create a culture that fosters productivity in Engineering.

In my experience, engineers, like most people, are motivated by a sense of purpose and accomplishment. Unrealistic deadlines imposed by the business teams, or constantly changing priorities, for example, will sap the morale of any team, no matter how capable, or charismatic, its leader.

Consequently, the answer to “How do you motivate your team?” is that I first eliminate everything that demotivates them – which is at least half the battle. Then I make sure that we employ the proper tools and methodologies, so that we are efficient collectively as well as individually. Only on rare occasions do I metaphorically stand on a soap box and deliver a rousing motivational speech.

ENGINEERS ARE SELF MOTIVATED

Does anyone really think that a professional football player needs a motivational speech before stepping on the field on Sunday? Heck no! He’s been waiting for that moment all week, all year! The rah-rah speech from the coaches or team captains that ESPN shows us, is just for the cameras. Said another way, if a player needs this pre-game sideline speech in order to go all out on the field, then he’s in the wrong business, and I certainly wouldn’t keep him on my team.

Well, it’s the same for Engineers.

“FIRST DO NO HARM” – ADDRESS THE COMPANY CULTURE

This list of “morale killers” will appear to be self-evident. Yet, I see these mistakes perpetuated over and over.

  • Imposing unrealistically aggressive schedules for releases – whether on purpose or not
  • Frequent (i.e. more than every 6 months) changes to the corporate strategy that nullify the existing product roadmap
  • Asking the Engineering team for an extraordinary effort to deliver a feature to win a major deal, only to fail to win the deal … more than a couple of times
  • Excluding engineers from customer meetings
  • Failing to publicly recognize accomplishments – whether collective or individual

One of the most counter-productive patterns is to purposely impose an unrealistic deadline based on the illusion that it will motivate engineers to work harder than they normally do. This pattern is ill-informed for the following reasons:

  • Engineers may work longer hours when required, but it is unlikely that they will produce their best work during these long hours. It could even be counter-productive if a higher proportion of bugs is introduced.
  • Sustained long hours foster neither creativity nor attention to detail.
  • The most aggressive schedule is accomplished by setting a realistically aggressive schedule at the outset. Just like a sprinter has to set progressively aggressive times as the season progresses, each release schedule has to be aggressive, yet achievable.
  • Unrealistic deadlines are rarely met. As a consequence, even if the team delivers an amazing product in an incredibly short amount of time, on release day, we all feel like we failed (since we did not meet the crazy deadline). It is hard to build on top of failures.
  • On the contrary, by setting realistic deadlines, and ensuring that we hit them, we build confidence in ourselves. Furthermore, our internal partners (e.g. Marketing, Sales), as well as our customers also start trusting us and our dates. Success beckons success.

PROVIDE THE PROPER ENVIRONMENT

Not only are Engineers driven by success, we also care about the products we build. We want to ship products on time, we want our users to be thrilled by the product and we want the company to grow. Consequently, my only job is to remove all impediments to these fundamental motivations. I thus focus on:

  • Providing clear strategy and tactics
    • Why are we doing what we are doing (vision, product roadmap, business context) as well as what are our immediate priorities.
    • Ensure that each team member has 1 –  and only 1 – top priority
  • Expecting, and nurturing, a culture of results and forward-looking attitude. Focus on the challenge at hand, rather than laying blame.
  • Making post-mortem reviews actionable: by deciding what we will do differently, and better, next time (rather than on an exhaustive list of things we did wrong) – and following up to ensure that we do do things differently the next time around
  • Making it “our team” rather than “my team” – by encouraging collaboration and ideation from everyone, particularly when it comes to development methodology. Adoption of best practices will be all the easier when recommendations come from peers.
  • Making it easier and simpler to ship products by creating product-focused teams, and limiting meetings to those that are determined useful by the team
  • Stimulating productivity by encouraging maximum use of tools and automation … and minimum number of meetings
  • Fostering team work by encouraging, even requiring, open and timely communications (good & bad news alike). Emphasize empathic cross-team communications (e.g. “be aware that the changes I had to make to the API have subtle implications for your component …”)

ALSO NURTURE THE INDIVIDUAL

In addition to removing impediments to productivity, and providing the right tools and environment for the Engineering team at large, one naturally needs to address each individual’s motivations:

  • Clarity of role: it must be made obvious to each engineer how their contribution feeds the success of the Engineering team and the company – both tactically and strategically
  • “Personalization”: understanding what drives each person in the team (technical, managerial challenges), how they prefer to communicate, their work style, etc
  • Responsibilities: ensure that everyone in the team is challenged to the best of their abilities (to the extent possible given the needs of the organization)
  • Personal rapport: team spirit is built from common aspirations, but also from one-to-one personal relationships, including with the VP of Engineering

Morale is a complex feeling that is not easy to nurture in a team. It is much easier to destroy it than to boost it. By removing the “morale killers” – typically originating from the company culture – one can bring a team to a level of enjoyment and productivity where only a little more effort creates a virtuous circle, in which team members themselves drive further improvements.

Sprint 0 “vs” Agile

Members of my teams usually look at me funny when I state at the start of a project that we need to plan. The boldest ones may even venture: “We’re doing Agile, so we don’t need to plan”, implying that planning is synonymous with waterfall, and that planning is certainly incompatible with, if not contrary to, Agile.

This is a misguided debate. It matters not whether planning is Agile; what really matters is whether it is a good Engineering practice and, secondarily, whether it can be blended with an Agile methodology.

WHEN TO PLAN

The need to plan arises whenever a complex set of features needs to be developed. Typically complexity arises because this new project is dissimilar to anything we have done before, the scope is large and/or we are dealing with “new stuff” (architecture, software framework, tool, people, performance, etc.)

The name Sprint 0

“Sprint 0” designates this planning phase … because the planning takes place before Sprint 1

However, it is partially a misnomer because it is not really a Sprint: it is not structured as a Sprint (the team may not have been formed yet), and its duration is not the typical sprint duration (it takes as long as it takes).

Analogy: Let’s go Hiking

Let’s say we’re going on a 5-day hiking trip in the wilderness. Before the hike, we will look at the map of the area, and the profile of our hike (e.g. identify how much elevation we’ll need to climb) so as to distribute our daily efforts evenly across the 5 days [rough scoping]. In particular, we will identify places to get water, and places to sleep each of the 4 nights, both really important [risk areas]. In addition, I’ll coordinate with my fellow hikers who is bringing the tents, who is buying/carrying what food, etc. [roles & responsibilities]. Finally, I’ll copy the schedule of the park shuttle that will bring us back to where we parked the cars [overall schedule].

This is the equivalent of Sprint 0.

Planning is NOT Synonymous with Waterfall

The fundamental difference between a Sprint 0 plan and a Waterfall plan is that Sprint 0 plans JUST ENOUGH to eliminate risk, versus preparing a complete design and exhaustive task schedule.

Sprint 0 wants to eliminate surprises, such as unnecessary refactoring (e.g. because the UI team and the mid-tier teams have a different vision on how to build the API). The fact that both participate in the same daily scrum does not necessarily expose these differences.

The purpose of Sprint 0 is to, almost literally, identify “the lay of the land”: key features, roles, and major risk factors.

This plan leaves plenty of room to be Agile: going back to the hiking analogy: on day 2, we can decide that we’ll walk extra hard on day 3, so that we can stay 2 days at campsite 4 which is on the shores of a beautiful lake.  We could even decide to extend the trip by 1 day … as long as we ration our food accordingly.

The plan does not dictate at what time we get up, who cooks on what day, or what activities we’ll do on our “relax day” (fishing, swimming, playing Frisbee, …). But the plan highlights a “relax day”, and thus the need to bring a Frisbee or fishing rods.

In addition, the plan sets yardsticks along the way so that we can measure our progress against our overall objective. For example, we’d better make sure that by day 3, we are past the mid-way point, if we want to finish our trip on day 5.

WHAT TO PLAN

The first activity of Sprint 0 is to review the main deliverables, both external (features) and internal (deploying a new framework, technical debt, performance). We also want to identify risks that could impact the technical solution or the schedule. It could be as trivial as having to finish a set of user stories before the key team member goes on vacation, or as complex as demonstrating that adding a cache does increase performance 10x.

We plan to a level of detail that gives us confidence that our design approach is solid and our schedule is realistic. How realistic depends on the needs of the business. Some companies commit to releases at a given date, others are fully agile.

Sprint 0 Deliverables

In InfoQ’s article “What is Sprint Zero? Why was it Introduced?”, one of the contributors, Mark Woyna, “uses Iteration Zero as a spike”:

“The planning team is responsible for producing 3 deliverables by the end of the planning iteration:

  • A list of all prioritized features/stories with estimates
  • A release plan that assigns each feature/story to an iteration/sprint
  • A high-level application architecture, i.e. how the features will likely be implemented”

To which, I add:

  • Design documentation relevant to the project: e.g. Interaction diagrams, Entity/Object definitions, APIs
  • A list of risks to monitor during the project: e.g. dependencies on external factors, critical results (e.g. validation of a new framework, or performance metric)
  • Detailed user stories for Sprint 1 – so that we can start Sprint 1 in earnest at the end of Sprint 0
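
To make the “release plan” deliverable concrete, here is a minimal sketch of how a prioritized, estimated list of stories could be turned into a first-draft assignment to sprints. It assumes a fixed team velocity in story points and fills each sprint in strict priority order; the story names, estimates and velocity are hypothetical, and a real plan would also account for dependencies and the risks listed above.

```python
def build_release_plan(stories, velocity):
    """Assign stories, in priority order, to consecutive sprints.

    stories  -- list of (name, estimate_in_points), highest priority first
    velocity -- points the team is assumed to complete per sprint
    Returns a list of sprints, each a list of story names (Sprint 1 first).
    """
    sprints = [[]]
    remaining = velocity
    for name, estimate in stories:
        if estimate > velocity:
            raise ValueError(f"{name!r} does not fit in one sprint; split it")
        if estimate > remaining:  # current sprint is full: open the next one
            sprints.append([])
            remaining = velocity
        sprints[-1].append(name)
        remaining -= estimate
    return sprints


# Hypothetical backlog and velocity, for illustration only.
stories = [
    ("login", 5),
    ("catalog API", 8),
    ("checkout", 8),
    ("search", 3),
    ("cache spike", 5),
]
plan = build_release_plan(stories, velocity=13)
for number, sprint in enumerate(plan, start=1):
    print(f"Sprint {number}: {', '.join(sprint)}")
```

The output is a starting point for discussion, not a commitment: the plan is expected to be revisited as each sprint completes.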

Plan to a Judicious Level of Details

Contrary to Waterfall practices, we don’t make all the decisions during Sprint 0, we make the minimum number of decisions necessary to “eliminate risk”.

Obviously, risk is only completely eliminated when the project is complete, but in most projects there are some critical decisions that reduce risk significantly. One example is writing out the interaction diagrams for the major use case. This exposes the core assumptions about the main objects in the system and their responsibilities, and clarifies whether interactions are synchronous or asynchronous, what message broker we use, etc. The whole point is that hashing out disagreements over a diagram is a lot more efficient, and less costly, than doing it once code has already been written.

ARGUMENTS AGAINST SPRINT 0

It’s Waterfallish

Scrum Methodology states “the common dysfunction called “Sprint Zero” is actually a contradiction in terms. Companies (and misinformed consultants and trainers) use this as a way to avoid changing waterfall habits.”

This argument totally misses the point – which is to have sound Engineering practices. Applying an “Agile vs Waterfall” litmus test does not inform the discussion as to whether this particular practice is sound engineering.

Does Not Deliver Value to the Customer

The Scrum Alliance article “What is Sprint Zero?” states that “Scrum believes that every sprint should deliver potentially usable value (… by the customer)”.

Dismissing Sprint 0 because it does not deliver value to the customer at the end of the Sprint is myopic: it misses the more important benefit of Sprint 0, namely, that it improves the velocity of the team for all the other Sprints. Because we lay out a high-level path to success in Sprint 0, we walk a straighter and faster line during the remainder of the Sprints. Equally important, we avoid “critical failures”, where a significant portion of the code needs to be refactored because we incremented our way into a design rather than taking the time to think it through.

Another way of saying this is that Sprint 0 brings value to us, the team, by providing better visibility to the whole project. We “return” this value to the customer, by being more efficient and faster overall.

CONCLUSION

When thinking about Engineering best practices, let us not corner ourselves into debating labels, e.g. Agile vs Waterfall. To me it simply makes sense to take time to reflect, think and plan before embarking on a complex project to:

  • Evaluate key features to be implemented
  • Agree on key design and architecture decisions, such as entities, APIs, protocols
  • Identify risk areas: schedule, resources, technology, performance
  • Map out tasks over time (a) to ensure that the project will be completed in a time frame commensurate with needs of the business and (b) set up yardsticks against which to calibrate our progress in the future
  • Lay groundwork for Sprint 1.

Basics of Performance Testing

At the risk of sounding harsh, I cannot remember the last time I interviewed a QA engineer who could clearly explain to me what a Performance Test is. Even more worrisome, when they described how they usually run performance tests, they could only describe the mechanics of the test (“I ran JMeter”). More often than not, the tests were wrong, and, in any case, they could not interpret the results (largely because they did not know what numbers to look at).

So here is a basic description of what a performance test is and the important steps to follow – as well as one of the most common mistakes.

Note: for illustration purposes, I’ll use the term site (e.g. retail e-commerce site), but this post applies identically to SaaS services.

Typical Performance Test

The primary performance test that engineers want to run on a product is: “How many users will our product support?”. This typically translates to “How many concurrent users can the current code and infrastructure support with acceptable response time?”

Another performance question is to determine how many total users can be supported. This problem is rarer, and we won’t cover it here.

The first challenge is: “what is a concurrent user?” While the answer is intuitively obvious: “the number of people using the site at the same time”, we still need to define what “at the same time” means. In particular, it does NOT mean “exactly” at the same time, i.e. within the same microsecond.

This incorrect “within the same microsecond” interpretation actually leads to the most common error in performance testing: testers set up N JMeter clients (where N is the target number of concurrent users), launch them at the same time, and measure the time it takes until the last client receives a response. This test does not measure concurrent users – in real life, users arrive randomly on a site, not at the exact same time. This is illustrated in the two figures below. Both represent 1,000 users using the site during a period of one hour. The “Burst” figure illustrates the “bad” test, in which all 1,000 users hit the site in the first minute (followed by no activity in the remaining 59 minutes). The correct scenario is likely to be closer to the second figure, where during each of the 60 minutes an average of 16.67 users hit the site (1,000 users / 60 minutes), although in any given minute the number of users can vary, for example between 5 and 25.

[Figure: random_vs_burst – “Burst” vs random arrival of 1,000 users over one hour]

Another way of expressing that a site has 1,000 visitors per hour is that a new visitor hits the site every 3.6 seconds on average (3,600 seconds per hour / 1,000). Consequently, we can program JMeter to hit the site following a pseudo-random sequence of intervals with, for example, a mean of 3.6 seconds and a standard deviation of 1 second.
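
To show what such a pseudo-random sequence looks like, here is a small, tool-independent sketch that generates start times for 1,000 simulated visits. It assumes the gaps between visits are normally distributed (mean 3.6 seconds, standard deviation 1 second, as above) and clamps them to a small positive minimum; the function name, the clamp value and the fixed seed are choices made for this example only. How the resulting schedule is fed into JMeter or another load tool is outside the scope of the sketch.

```python
import random


def arrival_times(visits, mean_gap=3.6, sd_gap=1.0, seed=42):
    """Return start times (in seconds) for `visits` simulated visits.

    Gaps between consecutive visits are drawn from a normal distribution
    and clamped to at least 0.1 s so that time never runs backwards.
    """
    rng = random.Random(seed)  # fixed seed: the schedule is repeatable
    t = 0.0
    times = []
    for _ in range(visits):
        t += max(0.1, rng.gauss(mean_gap, sd_gap))
        times.append(t)
    return times


times = arrival_times(1000)
gaps = [b - a for a, b in zip([0.0] + times, times)]
print(f"last arrival at {times[-1]:.0f} s, mean gap {sum(gaps) / len(gaps):.2f} s")
```

If we would rather model fully random (Poisson) arrivals, which produce more minute-to-minute variation, the gap could instead be drawn with `rng.expovariate(1 / mean_gap)`.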

Depending on the length of our standard visit, we will need to deploy multiple clients. For example, if each visit simulation takes a minute, we will need an average of about 16.7 clients (60 seconds per visit / 3.6 seconds between arrivals) to perform the simulation.
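
The arithmetic above is an instance of Little’s law (average number in the system = arrival rate × time spent in the system). A minimal sketch, with a hypothetical function name, that computes the average number of in-flight visits and rounds it up to a whole number of clients:

```python
import math


def clients_needed(visits_per_hour, visit_seconds):
    """Return (average in-flight visits, whole clients to provision)."""
    average = visits_per_hour * visit_seconds / 3600
    return average, math.ceil(average)


average, clients = clients_needed(visits_per_hour=1000, visit_seconds=60)
print(f"average in-flight visits: {average:.2f}, clients to provision: {clients}")
```

Since this is only an average, it is prudent to provision a few more clients than the rounded figure, so that the load generator itself does not become the bottleneck when arrivals bunch up.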

How Many Concurrent Users to Use?

The excellent article: How do you find the peak concurrent users on your site? describes in detail how we can find the peak number of visits during a given hour on the site (for performance testing, we care about visits – rather than visitors).

Of course, the key is to know our site well enough to identify the actual day(s) and hour(s) when we have the peak traffic.

However, we cannot stop there. The purpose of the test is to give us confidence that our site will be able to handle the load in the future. So the historic peak number of visits needs to be extrapolated over the anticipated lifetime of the release we are certifying. E.g. if we want to be ready for the upcoming Cyber Monday, we need to take last year’s numbers and increase them based on the projected traffic from Sales & Marketing … plus an extra 25% as a safety buffer.
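
As a worked example of this projection, the sketch below scales a historic peak by a growth figure and then adds the 25% safety buffer. The 1,000 visits per hour and the 30% growth are made-up inputs, and the function name is hypothetical; percentages are kept as integers so that the arithmetic is exact.

```python
def projected_peak(historic_peak_visits, growth_pct, buffer_pct=25):
    """Scale last year's peak by projected growth, then add a safety buffer.

    The result is rounded up to a whole number of visits per hour.
    """
    scaled = historic_peak_visits * (100 + growth_pct) * (100 + buffer_pct)
    return -(-scaled // 10_000)  # ceiling division on integers


# Hypothetical inputs: 1,000 visits in last year's peak hour, 30% growth.
target = projected_peak(historic_peak_visits=1000, growth_pct=30)
print(f"visits per hour to test for: {target}")
```

With these inputs the test target is 1,000 × 1.30 × 1.25 = 1,625 visits per hour.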

What is a Visit? What User Actions to Execute?

Now that we have figured out how many visits we need to run, and how often, we need to figure out “what’s a visit”? There are 2 principal ways to answer the question:

  • a visit is an isolated user action
  • a visit is the simulation of an actual visit of a typical user.

Testing the Performance of an Isolated User Action

While we assume, and hope, that the visitors on our site don’t limit themselves to a single action when they visit, we may still want to simulate the performance of a single action under 2 generic conditions:

  • this is new functionality and we want to ensure that it is fast enough
  • as part of performance regression testing – we want to ensure that the new release is no worse than the past releases

The question now is, what isolated user actions should we test? Here are some examples:

  • Login / Home Page / Landing Page: the 1st page that is loaded when a user arrives on the site. It is important for it to be fast, since it is the “first impression” of the site for the user. It also usually requires a lot of work server-side since, by definition, we have no cached data (or little, in the case of a returning user), so a lot of data needs to be computed or refreshed
  • All pages related to checkout and payment: don’t want to lose a purchase because the site is too slow
  • Pages which Analytics in Production identify as slow
  • New Pages / heavily re-factored pages that are content- and/or compute-heavy

Testing the Performance of a Typical Visit

A “typical visit” can be constructed either “synthetically”, or “by replay”.

A synthetic visit is one that we build based on the nature of the site and statistics gleaned from the Production system – such as the average number of pages visited. A typical visit for a retail site would entail: login – select a category – then a sub-category – browse a couple pages – select an item – checkout and pay.

A “replay” visit is based on tracking the actual navigation of a random user on the site (e.g. using log files).

It is important to note that I should be using the plural: we need to identify multiple “typical visits”. For example, in the example above, a second typical visit would involve a couple of searches rather than browsing by category/sub-category.

We can also break down our visits to focus on specific sections or functionality of the site (search, browsing, check-out), which makes it easier to interpret the results.

Let’s Not Forget Background Load!

In order for our performance test to be realistic and representative of Production, we need to simulate all the types of traffic taking place on the site during our experiment, both qualitatively and quantitatively, because each type of traffic consumes shared resources differently, whether it’s cache, database resources, access to disk, etc.

What I call “background load” is a simulation of the typical traffic on our site. The best way to simulate it is to record it and replay it, using a proxy server, log files, or other tools – e.g. Improving testing by using real traffic from production.

One exception: if we want to characterize the performance of a specific code path, and/or benchmark it against previous versions, then we should not run any background load.

How Long Should We Run the Test?

When thinking about the timing of a performance test, we first need to address the “ramp-up” or “warm-up” phase. When we launch our performance test, our system also starts “cold”: typically it has an empty cache, a small number of threads running, a few database connections set up, etc. So to obtain a realistic measurement of performance we need to wait for the system to reach its steady-state. It’s not always straightforward to tell when the system reaches steady-state, but we can get clues from monitoring critical system parameters: CPU, RAM, I/O on key servers (e.g. database). Also, response time should stabilize.

Secondly, because the simulated visitors access the system in a random fashion, we need to take our measurements over an extended period of time, so as to smooth out the randomness. An easy way to figure out the exact duration is to experiment: if the results we get after 10 minutes are the same – over 10 experiments – as those that we get over 1 hour – then 10 minutes is enough.
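The experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each run is a list of (timestamp in seconds, response time in ms) samples, and the 5-minute warm-up, the 5% tolerance and the synthetic data are hypothetical values of mine, not recommendations.

```python
from statistics import mean

def window_mean(samples, start_s, end_s):
    """Mean response time of the samples with start_s <= timestamp < end_s."""
    return mean(rt for ts, rt in samples if start_s <= ts < end_s)

def short_run_is_enough(runs, warmup_s=300, short_s=600, long_s=3600, tolerance=0.05):
    """True if, in every run, the short window agrees with the long one within tolerance."""
    for samples in runs:
        short = window_mean(samples, warmup_s, warmup_s + short_s)
        full = window_mean(samples, warmup_s, warmup_s + long_s)
        if abs(short - full) / full > tolerance:
            return False
    return True

# Synthetic data: one sample every 10 s; the first 300 s are the "cold" warm-up phase
steady = [(t, 800.0 if t < 300 else 200.0 + t % 7) for t in range(0, 3900, 10)]
drift = [(t, 200.0 + t / 10) for t in range(0, 3900, 10)]  # response time keeps growing

print(short_run_is_enough([steady]))  # stable after warm-up
print(short_run_is_enough([drift]))   # never reaches steady state
```

Note that the warm-up samples are excluded from both windows, so the comparison only covers the period after the system is assumed to have reached steady state.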

How Many Tests Do We Run?

Depending on the test, and the environment, we may have to run it multiple times. For example, if we are running on AWS, even if we are running on reserved instances, the performance may vary between runs, depending on network traffic, load from other tenants of the physical servers, etc.
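One simple way to decide whether more runs are needed is to look at the spread of the results across runs. The sketch below assumes we keep one summary number per run (e.g. the mean response time in ms); the 10% threshold on the coefficient of variation and the sample numbers are arbitrary illustrations.

```python
from statistics import mean, stdev

def summarize_runs(run_results_ms):
    """Mean, standard deviation and coefficient of variation across runs."""
    avg = mean(run_results_ms)
    spread = stdev(run_results_ms)
    return {"mean": avg, "stdev": spread, "cv": spread / avg}

def needs_more_runs(run_results_ms, max_cv=0.10):
    """Flag result sets whose run-to-run variation exceeds max_cv."""
    return summarize_runs(run_results_ms)["cv"] > max_cv

stable = [200, 205, 198, 202, 201]  # hypothetical per-run mean response times (ms)
noisy = [200, 320, 150, 410, 240]

print(needs_more_runs(stable))
print(needs_more_runs(noisy))
```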

Interpreting the Results

We can look at the results in at least 2 different ways:

1/ In absolute terms: does the response time meet our target?

2/ Relative to prior releases: is our response time no worse in this release than in prior releases?
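Both checks are easy to automate. The sketch below is one possible way to do it, assuming we compare the 95th percentile of response times and tolerate a 5% margin against the previous release; the percentile, the margin and the function names are my assumptions, not part of the method described above.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def evaluate(current_ms, baseline_ms, target_ms, pct=95, regression_margin=0.05):
    """Check the current release against an absolute target and a prior release."""
    cur = percentile(current_ms, pct)
    base = percentile(baseline_ms, pct)
    return {
        "meets_target": cur <= target_ms,                        # 1/ absolute
        "no_regression": cur <= base * (1 + regression_margin),  # 2/ relative
    }

current = list(range(100, 200))   # hypothetical response times in ms
previous = list(range(100, 200))
print(evaluate(current, previous, target_ms=200))
```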

Performance Hygiene

Ideally, performance tests are automated and run regularly – at least after each Sprint. This allows us to catch performance regressions early.

Conversely, running performance tests after “code complete” is almost a waste of time, since this leaves us no time to remedy any serious performance issue, and thus puts us in a quandary: release a slow product on time vs release a good product late?

Common Mistake: Initial Conditions

Finally, we need to address a very common mistake, namely ignoring initial conditions.

To ground the discussion, let me give an example: we can all agree that the query “select * from TableA” will execute much faster if TableA has 1 row vs if TableA has 100M rows.

The same applies to performance testing. Just as we let the system reach steady-state before we start measuring performance, we also need to ensure that all assets impacting performance are fully loaded.

To be more specific, let’s say today is March 15, and I am working on the release that will be in Production for Black Friday in November. In order to have a meaningful performance test, I need to make sure I load my E-commerce database not just with the same number of customers that we have today (March 15), but with the projected number of customers we’ll have by the end of November! Similarly, with the number of products in the catalog, documents in the search database, etc.

This is a complex – but absolutely critical – step. Otherwise, our tests will tell us about the past, not the future.

The final consideration is to ensure that the initial conditions – as well as the test – are 100% reproducible. This means that the “initial conditions” need to be exactly reproducible: e.g. by restoring from a previously archived database, by using the same pseudo-random sequence to trigger visits, the same logs to replay background traffic, etc. Otherwise, we are simply running a different test each time. As a consequence, any benchmarking would be meaningless.

Summary

Performance testing is complex, and requires a lot of thought, careful planning and detailed work to produce results that are meaningful. Specifically, we need to:

  • Model one or more individual visit profiles constructed from traffic patterns on our site / in our service
  • Model visit rate based on our site’s Analytics and extrapolate it to the projected level during the lifetime of the release
  • Generate pseudo-random sequences to model users’ arrival on the site
  • Generate a model of background load and/or a mixture of individual visits that together are a good approximation of actual traffic
  • Make sure to give the site/service time to warm-up (i.e. reach steady state) before starting measurements, and run the test long enough to smooth out the pseudo-random patterns. Also run multiple tests if environmental conditions cannot be fully controlled
  • Finally, make sure to properly initialize the whole system – in a reproducible fashion – so as to account for all the data already present in the system.
  • Finally finally, ensure that all tests conditions are reproducible, and tests are automated so that they can be run regularly – preferably upon the completion of each sprint. This ensures that performance bugs are caught early, rather than 2 weeks before the target release date.

How to Prioritize New Features vs Bug Fixes

The most lively debates that I regularly encounter leading an Engineering team revolve around the allocation of resources between bug fixing and the development of new features: “Why doesn’t Engineering fix all the bugs?” exclaims a customer support person – “Why don’t we allocate all Engineering resources to New_Shiny_Feature_X?” wonders the salesperson whose major deal depends on this feature.

These are both absolutely legitimate questions! … It does not mean that their answer is easy.

The main challenge in satisfying these two rightful requests is that they compete for the same resources, and that different people within the company have strongly-held different perspectives. The same person can even switch camps in a matter of days. It all depends on the last sales call. Do we have a customer threatening not to renew until we fix “their bugs”, or do we have a big deal pending on the delivery of a new feature?

As a consequence, it is imperative to create a business and technical framework that leads to decision making, where every stakeholder can not only express their perspective but also be satisfied about the decision process and thus about the decisions that come out of this process.

Framework for Decision Making

What’s more important? Or more precisely, what’s more important to implement in this release cycle?

  • New features driven by product roadmap and corporate strategy
  • Customer-driven enhancement requests
  • Bug fixes requested by existing customers
  • Paying down technical debt: upgrade architecture, refactor ugly code, optimize operational infrastructure, etc

The process to reach a decision is basically the same as for any business decision: we weigh how much income each item will generate and how much investment it will require.

Implementing a new feature, fixing a bug, enhancing a released feature or paying down technical debt demand the same activities: define requirements, design, code, test, deploy. They also all draw from the same pool of product managers, developers, QA and DevOps engineers. As a consequence, it is relatively easy to define the “investment” side of the equation.

Estimating the income side is a bit more complex, because it comes in multiple flavors. However, the process is the same as prioritizing the backlog of new features: we need to articulate the business case:

  • Expected revenue stream (new features & enhancements)
  • Reduction in subscription churn (enhancements & bug fixes, as well as new features)
  • Cost reduction (technical debt / architecture) through increased future development velocity
  • Customer satisfaction (bug fixes & enhancements), which translates into better advocacy for the brand and churn reduction
  • Strategic objectives (market positioning, competitive move, commitment to win a major deal)

Each of these categories is important in its own right. Since they cannot all be translated into a common unit of measure (e.g. dollars), I recommend quantifying each of these elements relative to one another (e.g. using T-shirt sizes: S, M, L, XL, …) for each item on the list.

Practically, I create a matrix with rows listing each feature, bug, enhancement request, technical debt, and the following columns:

  • Short Description
  • Link to longer description (Jira, Wiki, …)
  • Summary business case
  • Estimated engineering effort
  • Estimated calendar duration
  • Expected increase in revenue (if any)
  • Expected cost reduction (if any)
  • Customer satisfaction impact
  • Strategic value

While this is not perfect – ideally we’d want to assign a single score for each item – this allows us to (a) resolve the no-brainers (high benefits at a low cost, or high cost and low benefits) and (b) frame the discussion for the remainder against the business context of the company:

  • Are we in a tight competitive race where we need to show momentum in our innovation?
  • Do we have one, or more, major deals dependent on a given set of features?
  • Are our customers grumbling about our product quality, or worse threatening to leave?
  • Is our scalability at risk because of legacy code?
  • Are we being hampered in our ability to deliver new features by too much legacy code?

While this will not eliminate passionate debates at Product Council, it will hopefully bound them, particularly if we can first agree on high-level priorities for the business.
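To illustrate how such a matrix can surface the no-brainers, here is a small Python sketch. The backlog items, their T-shirt sizes and the classification rule (compare the largest benefit with the effort) are all hypothetical simplifications; the real discussion still has to happen against the business context.

```python
SIZE = {"S": 1, "M": 2, "L": 3, "XL": 4}

backlog = [
    # (item, effort, revenue, cost reduction, customer satisfaction, strategic value)
    ("New feature X", "XL", "L", "S", "M", "XL"),
    ("Fix export bug", "S", "S", "S", "XL", "M"),
    ("Refactor billing module", "L", "S", "M", "S", "S"),
]

def classify(effort, *benefits):
    """Separate the no-brainers from the items that need a real discussion."""
    best = max(SIZE[b] for b in benefits)
    cost = SIZE[effort]
    if best >= SIZE["L"] and cost <= SIZE["M"]:
        return "no-brainer: do it"      # high benefit at a low cost
    if best <= SIZE["M"] and cost >= SIZE["L"]:
        return "no-brainer: skip it"    # low benefit at a high cost
    return "discuss"

results = {item: classify(effort, *benefits) for item, effort, *benefits in backlog}
for item, verdict in results.items():
    print(f"{item}: {verdict}")
```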

Why Not Have a Dedicated Sustaining Engineering Team?

There are two primary reasons why a Sustaining Engineering team is a bad idea: first, it “does not answer the question” of prioritization, and secondly, it is a bad practice as it creates a class of “second-class citizen” engineers.

Say you want to have a Sustaining Engineering team. How large should it be? 5%, 10%, 20%, 50% of all engineering? Why? Should its size remain constant? Or are we allowed to shift resources in and out depending on business priorities? Answering these questions requires the same analysis and decision making as I propose above, but is burdened by the inflexibility of a split organization.

Regardless of whom you assign to Sustaining Engineering, these engineers will be considered second-class by the self-proclaimed hotshots who get to work on new features. Worse, it promotes laziness with respect to quality from the “new feature team”: they know that Sustaining will clean up whatever mess they leave. It is pervasive, and over time can even lead to cherry-picking of work, which means that Sustaining ends up completing the “new feature” work. For example, the “new feature” team releases a new product on Chrome (so that they can meet “their date”), but Sustaining gets to make it work on Internet Explorer.

A Useful Best Practice

Any bug older than 12 months should be removed from the bug backlog. It should either be marked as “Won’t Fix”, or assigned to a secondary backlog list (which, I predict, will never be reviewed). The justification is simple: if a given bug has lived through a year’s worth of bug triages without rising to the top and being fixed, then it is almost certain that it will never be prioritized for resolution. Better to put it out of its misery. Furthermore, this will keep the bug backlog to a reasonable size and bug triage a manageable task. Finally, if for some reason the visibility of this bug rises anew, it can be returned to the active backlog.
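This rule is simple enough to automate. The sketch below assumes each bug is a record with an “opened” date; the field names and sample bugs are hypothetical, and a real implementation would query the bug tracker instead.

```python
from datetime import date, timedelta

def split_backlog(bugs, today, max_age_days=365):
    """Separate bugs opened within the last year from those older than that."""
    cutoff = today - timedelta(days=max_age_days)
    active = [b for b in bugs if b["opened"] >= cutoff]
    stale = [b for b in bugs if b["opened"] < cutoff]  # candidates for "Won't Fix"
    return active, stale

bugs = [
    {"id": "BUG-1", "opened": date(2023, 1, 10)},
    {"id": "BUG-2", "opened": date(2024, 2, 1)},
]
active, stale = split_backlog(bugs, today=date(2024, 3, 15))
print("active:", [b["id"] for b in active])
print("stale:", [b["id"] for b in stale])
```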

Conclusion

The adage “Software always has bugs” remains true, not because it is impossible to write perfect software (I argue that this IS possible), but rather because in a business context, quality is not an end in and of itself. Don’t get me wrong: high quality is critical, but fixing ALL the bugs is not a requirement for business success.

As a consequence, only 1 criterion matters: “what moves the business forward the most effectively?”

Typically this means making customers happy. There are times when customers are happier if we fix bugs, at other times they prefer to see a new feature brought to market earlier. The answer depends on what drives their business. Do they prefer that we fix a bug that costs them an extra hour of work per day or that we launch a new feature that will allow them to grow their business by 10% in 6 months?

Notes from SF Data Mining Meetup: Recommendation Engines

Excellent talks on how each of the presenting companies approaches the design of its recommendation engine, based on the specifics of its market and users.

Recommendation Engines

Thursday, Apr 4, 2013, 6:30 PM

Pandora HQ
2101 Webster Street, Suite 1650 Oakland, CA

200 Data Scientists Went

6:30 – 7:00pm Social and Food
7:00 – 8:30pm Talks
8:30 – 9:00pm Social

We’re excited to have three sets of speakers:

1. Trulia: Todd Holloway will be giving a talk on Trulia Suggest.
2. Rich Relevance: John Jensen and Mike Sherman will be giving their perspectives on recommendation engines.
3. Pandora: Eric Bieschke will be giving his perspec…


Here are my notes on their respective technology stacks. Hadoop, Hive, Memcached, Java are used by all 3.

1. Trulia: Todd Holloway on Trulia Suggest.

  • Hadoop
  • Hive
  • R on each Hadoop Server
  • Memcached
  • Java

2. Rich Relevance: John Jensen and Mike Sherman

  • Hadoop
  • Hive
  • Pig
  • Crunch

Starting to deploy

  • Kafka
  • Storm

3. Pandora: Eric Bieschke

  • Python, Hadoop, Hive: for offline processing
  • Memcached, Redis: for near-line & online
  • Java & PostgreSQL: for online

Memcached: used as a “key-value store in the sky”, as long as you don’t care about losing data

Redis: “Persistent Memcached”

Migrating a Self-hosted Architecture to the Cloud

While it may be possible to migrate a self-hosted architecture to the cloud with servers in an identical configuration, it will almost certainly lead to a sub-optimal architecture in terms of performance, and to higher costs, in some cases prohibitively so.

The common objectives for moving to the cloud are:

  • Ability to scale transparently as the business grows
  • Reduce costs
  • Benefit from a world-class IT infrastructure without having to hire the talent

 

We’ll focus on the first two objectives as the third one is achieved – by nature – the moment you flip the switch to the cloud.

Memory Drives Pricing in the Cloud – not CPU

In the cloud, whether with Amazon EC2 or other vendors, the primary dimension driving pricing is the amount of memory (RAM) available in the server. In addition, the CPU allocated is roughly proportional to the amount of RAM.

For example, as of this writing, per the Amazon EC2 pricing and the Amazon EC2 Instance Types definitions:

  • A Small instance has (only) 1.7 GB of RAM – and 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit) and costs: $0.08 per hour On-Demand
  • An Extra Large instance is 8 times bigger than a Small instance and costs 8 times as much: 15 GB of RAM and 8 EC2 Compute Units, for $0.64 per hour On-Demand
  • In order to get 32 GB of RAM, one needs to move to the High-Memory Double Extra Large Instance aka m2.2xlarge: $0.90 per hour On-Demand

Note that the prices quoted here are for the US East (N. Virginia) region. Prices for US West (Northern California) are higher (by about 12%, based on a few data points I correlated).
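A quick way to see that memory drives pricing is to compute the price per GB of RAM for the instances quoted above. The sketch below uses only the figures from this post; for the m2.2xlarge it uses the 32 GB requirement mentioned above, which is not necessarily the instance’s exact specification.

```python
# (RAM in GB, On-Demand price in $ per hour), as quoted in this post
instances = {
    "Small": (1.7, 0.08),
    "Extra Large": (15.0, 0.64),
    "m2.2xlarge": (32.0, 0.90),  # 32 GB is the requirement cited above (assumption)
}

def price_per_gb_hour(ram_gb, price_per_hour):
    """Hourly price divided by the amount of RAM."""
    return price_per_hour / ram_gb

for name, (ram, price) in instances.items():
    print(f"{name}: ${price_per_gb_hour(ram, price):.4f} per GB-hour")
```

With these figures, the Small and Extra Large instances land within about 10% of each other per GB, while the high-memory instance is noticeably cheaper per GB.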

Database Servers

Database servers have unique requirements:

  • They require fast I/O to disk. While Amazon recommends using networked storage, this is typically not practical from a performance perspective.
  • So database servers also require large local disks: to hold the data
  • Most databases require a fair amount of memory, at least 16 GB. We use Cassandra, and they recommend 32 GB per server.
  • They hate noisy neighbors (see previous blogs). While virtualization technology does a fairly good job at partitioning CPU and RAM, it does a much poorer job at sharing I/O bandwidth. Having another virtual machine running on the same physical server as your database can kill its I/O efficiency, and thus its performance, even if the neighbor does not do much. All the tricks that databases use to optimize I/O performance assume that the database is in control of all I/O buses.

As a consequence, one should first of all use a reserved instance – simply because the cost of getting data in and out of the local disks makes it impossible to set-up / tear-down database servers “at will”.

Secondly, one should buy a large enough instance (e.g. m2.4xlarge) so that we are the only tenant on the server. This will cost $7,203 per year – based on Heavy Utilization Reserved Instances pricing, and get us 68.4 GB memory, 4 cores (8 virtual – with Intel Hyper-Threading) and 1.69 TB of local storage.

SSD

As Adrian Cockcroft from Netflix illustrates in his detailed post: Benchmarking High Performance I/O with SSD for Cassandra on AWS, moving to SSD instances for I/O and compute intensive systems can bring significant cost reductions. In his example, he compares a traditional system with 36 x m2.xlarge + 48 x m2.4xlarge instances at a cost of $772,806 (Total 3 Year Heavy Use Cost) – with a 15 x hi1.4xlarge system at a cost of $354,405 – a 54% savings.

As the article illustrates, selecting one versus the other requires a careful understanding of the computational profile of the application, and some changes in the application’s architecture.

Do I want to use Proprietary Amazon Solutions?

Following the logic that motivates us to move to the Cloud forces us to consider using Amazon proprietary solutions: reduce the need for sys admin talent, leverage a scalable high-availability solution out-of-the-box, etc.

  • Should I replace MySQL with RDS? Or Cassandra or HBase with DynamoDB?
  • Should I replace my message queue (e.g. ActiveMQ) with SQS?
  • … and similarly for many other AWS products

 

These are excellent products, battle tested by Amazon. However, there are 2 very important considerations to examine:

  • First, these products are obviously proprietary: making a move to another cloud provider like Rackspace or Joyent will take an extensive code rewrite. This may turn out to be impractical.
  • Secondly, cost can be a (bad) surprise once the application is deployed live. For both RDS and SQS, pricing is driven by data bandwidth AND the number of operations performed using the service, which requires careful analysis to estimate ahead of time. For example, polling every 10 seconds to check whether new data is present in SQS generates about 250K operations per month per client (assuming each check requires only 1 operation). This is fine if this function is performed by a few servers, but would break the bank if it’s performed by 100,000 end-user clients: at $0.000001 per request, this adds up to $25,000 per month.
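The polling estimate is worth putting in a small script, so that the number of clients and the polling interval can be varied. This sketch uses the per-request price quoted above and assumes a 30-day month, which is why it yields slightly more than the rounded 250K operations per client.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assuming a 30-day month

def monthly_request_cost(clients, poll_interval_s, price_per_request, requests_per_poll=1):
    """Monthly cost of clients polling a pay-per-request service at a fixed interval."""
    polls_per_client = SECONDS_PER_MONTH / poll_interval_s
    return clients * polls_per_client * requests_per_poll * price_per_request

print(f"${monthly_request_cost(10, 10, 0.000001):,.2f}")       # a few servers
print(f"${monthly_request_cost(100_000, 10, 0.000001):,.2f}")  # 100,000 end-user clients
```

The cost is linear in the number of clients and inversely proportional to the polling interval, so either lever can change the bill by orders of magnitude.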

Algorithm Tuning and Server Selection

More generally, Amazon offers seven families of servers: Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, and High I/O (SSD). Porting an existing application will thus require an iterative process evaluating the following questions:

  • How do I best match each of my system’s components with an Amazon instance type?
  • Can I fine-tune, or even re-write, my algorithms to maximize RAM & CPU utilization? In particular, would I make the same memory vs computation trade-offs? Do I need this hash-table, or can I re-compute the query?
  • How does my architecture evolve as I scale out? For example, do I need to replicate shared resources like caches, or will sharding (e.g.) avoid this duplication of data? This will directly impact my cost, since pricing is memory driven. An algorithm may work best using an approach favoring memory (and minimizing CPU) when running on a single server, but it may be more cost-effective when it minimizes memory (at the expense of CPU) once scaled out over many servers.
  • How do new technologies like SSD impact my architecture? As the Netflix article illustrates, the cost impact can be radical, but it required substantial architecture redesign, not just a simple server replacement.

 

In conclusion, moving from a hosted environment (where each server can be configured at will) to the cloud, where servers come in pre-determined configurations, requires not only an architecture review, but also a sophisticated Excel spreadsheet to compare the costs of various architectures. This upfront financial modeling is absolutely necessary in order to avoid unpleasant surprises as the business scales up.
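As a toy version of such a model, the Python sketch below compares replicating a cache on every server with sharding it across servers. Everything in it is a hypothetical assumption for illustration: cost is taken to be strictly proportional to RAM, and the price per GB-hour, the server count and the memory sizes are made up.

```python
def monthly_cost(servers, ram_per_server_gb, price_per_gb_hour, hours=730):
    """Monthly cost, assuming price is proportional to the RAM rented."""
    return servers * ram_per_server_gb * price_per_gb_hour * hours

def replicated_cache_ram(app_ram_gb, cache_gb):
    """Every server holds the application plus a full copy of the cache."""
    return app_ram_gb + cache_gb

def sharded_cache_ram(app_ram_gb, cache_gb, servers):
    """Every server holds the application plus one slice of the cache."""
    return app_ram_gb + cache_gb / servers

# Hypothetical deployment: 10 servers, 4 GB for the application, 12 GB of cached data
servers, app_gb, cache_gb, price = 10, 4.0, 12.0, 0.04

replicated = monthly_cost(servers, replicated_cache_ram(app_gb, cache_gb), price)
sharded = monthly_cost(servers, sharded_cache_ram(app_gb, cache_gb, servers), price)
print(f"replicated cache: ${replicated:,.0f} per month")
print(f"sharded cache:    ${sharded:,.0f} per month")
```

A real model would also have to account for the fixed instance sizes on offer, the extra network hops that sharding introduces, and any per-operation fees.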

Want to Predict your Cost in the Cloud? Roll Up Your Sleeves!

 

The selection of a cloud service provider is a critical decision for any software service provider. Cost is, naturally, a key driver in this selection. However, predicting the cost of running servers in the cloud is a project in and of itself, because the only way to build a reliable model of costs is to go ahead and deploy our systems with the service providers.

 

Why is it not possible to forecast costs with pen and paper?

The main reason that pricing is so hard to forecast is that our system architecture in the cloud will likely be different from the one currently running in our own datacenter: the server configurations are different, the networking is different, and most likely we want to take advantage of the new features that come “for free” with a deployment in the cloud: higher availability, geographical redundancy, larger scale, etc. We’ll cover this in detail in an upcoming post.

 

Another reason why it is hard to predict costs is that we don’t really know what we are getting:

When one considers the primary attributes of a server (RAM, CPU, storage, I/O and network bandwidth), only RAM and storage capacity are guaranteed by cloud vendors. Vendors provide varying degrees of specificity about CPU and other key characteristics. Amazon defines EC2 Compute Units: “One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor”. Rackspace’s price sheet categorizes servers by available RAM and disk space (the more RAM, the more disk space). Their FAQ mentions the number of virtual cores each server receives, based on the amount of RAM allocated, but I could not find their definition of a virtual core. GoGrid and Joyent provide similarly limited information.

 

As a side note, one needs to be aware that vendors typically refer to “virtual cores”, as opposed to real (physical) cores. A virtual core corresponds to one of the two hyperthreads that have run on each core of Intel processors since 2002. Thus, a server with a quad-core Intel Xeon processor runs 8 virtual cores. You can read this 2009 post, plus the comments thread, for more specifics. While the data is dated, the observations are still relevant.

 

So, there is a lot that we don’t know about the servers on which we will run our system: CPU clock, size of L1 and L2 caches, I/O bus speed, disk spindle rotation speed, network card bandwidth, etc.

Furthermore, performance will vary across servers (since cloud vendors have a diverse fleet of servers of different ages) and thus, each time a new image is deployed, it will land on a random server, with the same nominal specs (RAM, storage), but unknown other physical characteristics (CPU clock, I/O bandwidth, etc).

 

Another well-documented problem is that of noisy neighbors. While the hypervisors do a fairly good job at controlling allocation of CPU and memory, they are not as effective at controlling the multitude of other factors that affect performance. I/O in particular is very sensitive to contention. While VMware affirms that vSphere solves this problem, most (all?) cloud vendors use open source hypervisors.

In any event, this problem is systemic and cannot be solved by the hypervisor. For example, we did a lot of research on the best configuration for our Cassandra servers (database for big data). One of the main performance optimizations driving Cassandra’s design is to maximize “append” (rather than update) operations, thus minimizing random movement of the read/write head of the disk, and thus maximizing disk I/O. Unfortunately, all this clever optimization goes out the window if we share the server – and thus the disks – with a noisy neighbor who is performing random read-write operations. I had the chance to discuss this a couple of months ago with members of the Cassandra team at Netflix (one of the largest users of Cassandra and almost 100% deployed on Amazon): they solve the issue by only using m2.4xlarge instances on AWS, which (today) ensures that they are the only tenant on the physical server – and don’t have any noisy neighbor.

 

Adding all this together makes it pretty clear that vendor comparison on paper is practically fruitless.

Let’s Try It Out

The only practical way to create a realistic budget forecast is to actually deploy systems on the selected cloud vendor(s) and “play” with them. Here are some areas to investigate and characterize, beyond simply validating functionality:

  • Optimal server configuration for each server role (web, database, search, middle tier, cache, etc). We need to make sure that each server role is adequately served by one of the configurations offered by the vendor. For example, very few offer servers with more than 64 GB of RAM.
  • Performance at scale (since we only pay for the servers we rent, we can run full-scale performance tests for a few hours or days at relatively low cost – e.g. a few hundred dollars) – Netflix tested Cassandra performance, “Over a million writes per second”, on AWS for less than $600 and clusters as large as 288 nodes
  • End-to-end latency (measured from an end-user perspective) – since latency will be impacted by the physical distribution of the servers
  • Pricing model

 

For these tests to be meaningful, one needs to ensure that deployments are realistic: for example, across service zones and regions, if we plan on leveraging these capabilities – as they impact not only performance (due to increased network latency) but also pricing (data transfer charges).

 

In addition, each test must be run several (10 – 20) times – with fresh deployments – at different times of day – in order to have a representative sample of servers and neighbors.

 

As important as the technical performance validation, the pricing model must be validated as vendors charge for a variety of services in addition to the lease of the servers: most notably bandwidth for data transfers (e.g. across regions), but also optional services (e.g. AWS Monitoring or Auto-Scaling), as well as per operation fees (e.g. Elastic Block Store). The “per operation” fees can add up to very large amounts, if one is not careful. For example, see the Amazon SimpleDB price calculator – we have to run SimpleDB under real load in order to figure out what numbers to plug in. Overlooking this step can be costly.

 

Once the technical tests have been completed, and the system configuration validated, I recommend at least a full billing cycle of simulated operations, in order to obtain an actual bill from the vendor from which we can build our pricing model.