Why Digital Transformations Fail – the Monolith Syndrome

Previously published on Silicon Valley Software Group Insights in March 2023.

A number of our engagements come from clients who experience a similar pattern of symptoms: release velocity is trending down, critical bugs pop up with each release, yet hiring more developers does not seem to improve anything. In parallel, the digital imperative, which has gained momentum over the past couple of years, whether driven by the pandemic or simply by overall market evolution, keeps building the pressure: consumers expect a flawless digital experience. When the technology team does not deliver, the consequences for the business are painful: customers are disappointed, competition edges ahead and, even more heartbreaking, our clients are unable to capture the demand that their own marketing has generated.

The goal of this post is to inform both CEOs and CTOs on how to diagnose what we term the “Monolith Syndrome”. As with any condition, early diagnosis vastly improves the chances of success. It is thus critical for CEOs and CTOs to know how to recognize this pattern, and take the necessary early actions. Further, it often falls on the CEO to identify the situation, because the CTO is usually consumed in trying to just keep up.

Symptoms

The symptoms of what we term the “Monolith Syndrome” look like this:

  • The application’s response time keeps degrading;
  • Outages are becoming more frequent;
  • As outages occur, new feature requests do not get delivered, and customer complaints rise;
  • Re-prioritization of the product roadmap occurs before the main features of the previous roadmap are delivered (because they took too long);
  • Distrust between the executive and the technology teams grows.

As with any challenge, each company faces its own flavor of the “Monolith Syndrome”, yet to the experienced eye the pattern is easily recognizable. More fundamentally, it is absolutely normal: it occurs when a company has grown into a new stage of maturity – where a new way of running the business, including the technology, is now necessary. Like most living organisms, when looking on a short time horizon, companies grow incrementally. However, when taking a step back, discrete stages become evident. On the technical front, transitioning between maturity stages calls for what is called a “Digital Transformation”.

The Monolith Syndrome encapsulates scenarios of pain when the technology team cannot keep up with the needs of the business through “business as usual”.

There are multiple scenarios that require a digital transformation, the Monolith Syndrome is one of them. We will explore the others in subsequent posts.

Causes

From a technical perspective, the root causes of the “Monolith Syndrome” are often a combination of:

  • The architecture of the current codebase was developed more than five years ago, and has changed little since;
  • The code is built on a single codebase and uses a single database – hence the term “monolith”;
  • Development expediency has been the priority, which has led to poorly organized code, little documentation, few tests, and even fewer automated tools for QA, release and operational management;
  • Critical areas of functionality are implemented in “dark code”: code that was written by developers who are no longer employed by the company, and which current developers are scared to touch, because the code is difficult to understand and there is no documentation.

The Monolith Syndrome encapsulates scenarios of pain when the technology team cannot keep up with the needs of the business through “business as usual”. We described the symptoms above in technical terms. Yet, the underlying cause is that the company has grown into a different maturity level – where “what got you here” no longer works.

To be clear, a monolithic codebase is usually the right way to go in the early stages of a company: there are a handful of developers, a manageable number of lines of code, and few features that are quick to test manually. Yet, at some point in the company’s growth, the nimbleness and expediency become a detriment rather than an asset. For example, it becomes cumbersome to develop, let alone release, when twenty-plus developers are writing code in a monolith: different developers’ changes interact with one another in ways that create unforeseen bugs.

The underlying cause of the Monolith Syndrome is that the company has grown into a different maturity level, but not the technology team.

As a company battles through the Monolith Syndrome, the CEO and CTO have a heart-to-heart: the CEO asks “what do you need to develop new features faster?” – to which the CTO invariably answers “I need more engineers”, and then proceeds to build a “better monolith”, i.e., to continue working on the same codebase with the same processes and tools. Yet with poor architecture, software organization, and documentation, the extra developers only create more confusion and barely accelerate development velocity. The root cause of this lack of progress is that the business side has gone through a change of paradigm, but not the technology team.

Again, this is why it is the CEO, who understands the business context, who needs to recognize the pattern.

The goal of the transformation is not to update to the latest and greatest technologies, but rather to identify the technologies most appropriate for the foreseeable needs of the business.

The Proper Mindset

In order for the transformation to be successful, everyone needs to have the proper mindset:

  • Recognize that this effort is the “price of success”. Understand that current architecture, code, tools, etc. were not a mistake – no one deserves blame. On the contrary, they were optimal for the previous stage of maturity. Now that the business has grown, and evolved, technology also has to transform to a more mature architecture.
  • The goal of the transformation is not to update to the latest and greatest technologies, but rather to identify the technologies most appropriate for the foreseeable needs of the business.
  • The transformation will require a set of skills that is typically not present in-house. Rare are the CTOs who have successfully led digital transformations. Hence, it is usually wise to enlist the help of technical leaders who do have this experience.

SVSG’s Framework

SVSG applies the following framework:

  • Re-align the technology to the business: understand the main stakeholder journeys (customer and employee), which have likely evolved since the current architecture was designed.
  • Design the architecture – and data models – before coding, based on the new stakeholder experiences, as well as needs for scale, resilience, security, etc.
  • Incorporate the full business context such as scale, security, resiliency, etc.
  • Design an incremental migration path from the current state to the desired state. For example, start breaking up the monolith by creating one additional microservice, and validate its design before moving on to a second microservice.
  • Evangelize that the transformation goes beyond architecture and code. The whole development process, from end to end, must align with the company’s new stage of growth.

Final Thoughts

Digital transformations are rare events in the life of a company. Technology leaders are usually selected and trained to design and build technology incrementally. Unless you have gone through it before, detecting that your company might be experiencing the Monolith Syndrome is an unusual, and difficult, challenge for both CTOs and CEOs; but when the symptoms arise, it’s important to act swiftly if the business is to keep up with its growth.

Growth Is A Feature: Five Immediate Actions CTOs Can Take When Growth Skyrockets

Previously published on Forbes Technology Council, July 22, 2020

The magic moment for which you have been working for so long has finally arrived: Usage of the product is accelerating — the company is taking off!

As a CTO, this is wonderful news and the validation of years of dedication. Having gone through this critical stage a few times, and having advised many companies through this transition, I have seen that many companies forget that reaching success requires more than just “feeding the beast” with more and more new features.

Growth is a long game, which requires its own dedicated share of mind. Having worked so hard to pull ahead of the competition, making the proper investments now will ensure your market dominance. Focusing on team organization, alignment of success metrics, software architecture, quality, user experience and automation in parallel with new feature development may initially seem a distraction, but it soon pays off in increased efficiency and averted disasters.

1. Celebrate And Prepare The Team 

Because the pace of work will soon increase for everyone in the team, it is important to directly acknowledge your success in order to prepare the company mentally and organizationally for the future. 

In particular, it is important for everyone in the company to acknowledge that growth is a feature. This means that in addition to “doing one’s job,” everyone must invest additional time to support the growth. For example, more time will be spent interviewing candidates. In addition, developing new features will take longer than in the past because of higher demands in quality and reliability, among others. In this instance, be sure to allocate time for growth in your schedule and task estimates. Get help early — because consultants can bring in expertise on short notice.

2. Update Business Operational Metrics 

Most often, a high growth rate is not only generated by a growing number of users, but also by attracting new types of users. When “early majority” users join “early adopters,” they bring new ways of using the product, they navigate the product differently, have new favorite features, etc.

This new cohort of users is probably less emotionally invested in the product and, thus, needs a simpler onboarding process. They have lower tolerance for bugs and higher expectations for uptime, security and response time. For the development team, everything needs to go faster: page load, new features, new releases and new hires. And the cost of failure is higher: any outage now impacts 10 times more users than it did last year.

You must make sure to review and update key success factors (KSFs) with the whole business team to match the new needs of the business. For example, does quality now become as important as the rate of releasing new features? The conversation around KSFs — and the process of getting teams all across the business aligned — is more important than the actual numbers assigned to each KSF. This is an ideal time to pay down technical debt in usage and conversion tracking tools, as well as analytics.

3. Improve Quality Tenfold

As a developer, there is nothing worse than being interrupted in the middle of developing a new feature to fix a critical bug from the previous release. As usage grows, bugs that were previously “acceptable” now gather enough customer ire to be classified as “must fix.” In addition, as the product reaches a broader market, new users may be less educated about, and less patient with, the product.

Rather than wait for the avalanche of bug reports to drown the development team, it is best to anticipate and raise the breadth and depth of testing in the development phase, pre-release. A 10-times increase in volume requires a 10-times improvement in quality to keep the same number of trouble tickets and, thus, keep the size of the support team from growing 10 times.

As the number of users increases, the definition of quality must be expanded to include ease of use, in addition to “absence of bugs.” Know — and instrument — your app. Instrument the code so that performance can be easily measured. Similarly, instrument the app in production to accurately track usage, as well as conversion, since new users may have different patterns.

4. Refactor To Match Dominant Use Case(s)

A typical growth strategy involves moving to new segments of the market. Frequently, a startup will target a beachhead of a broader market when launching the first version of the product. Over time, as the product’s capabilities expand, the market expands as well. As a corollary, the predominant use case at launch may no longer be the most favored once a company reaches the growth stage. In order to keep the product easy to use as new dominant use cases emerge, the user experience needs to be redesigned and the code needs to be refactored (and sometimes re-architected) to support these new use cases at scale.

Increasing modularization (i.e., breaking services into smaller independent services) and refactoring APIs is usually a good strategy to support new use cases. Other factors may motivate refactoring, including performance, scaling, ease of operations and even being able to scale the development team. Increased componentization will also make testing more efficient. Finally, calibrate the degree of modularization of the architecture to the traffic on the app. Only a limited number of companies have the traffic that justifies going all out on microservices.

5. Automate

As the development team delivers more features faster, tasks that were done once a week must now be done several times a day. With this increased pace, manual tasks become more error-prone and affect the team’s velocity. Consequently, all processes must be considered for automation: testing, CI/CD, DevOps, SysOps and even security and business continuity.

For maximum efficiency, you can coordinate efforts around actions three through five in the same project, as they are mutually reinforcing.

With these tips, you should be well on your way toward embracing a mindset that not only continues to spur growth, but also embraces it.

Time-Tested Engineering Leadership Principles

I put together the first three of these four leadership principles during my first VP of Engineering gig, twenty years ago. Thirteen companies later, and having shared them with hundreds of engineers, I feel it is time to share the secret :-)

These leadership principles have been honed (a) for Engineers and (b) in the context of startups, typically with fewer than 150 employees. No claim is being made outside of these parameters.

1.   I commit to give you more responsibilities than you can handle … and help you succeed

The vast majority of Engineers are highly motivated (see my previous blog, “(Boosting) Morale in Engineering”). They are motivated by their career, naturally, yet they are primarily driven by a need to accomplish and an intense desire to learn.

Another way of articulating this commitment is: “I am going to challenge you, and let you work as hard as you want, and exercise as many of your skills as possible”. Engineers hate being bored. On the contrary, they work extra hard when challenged. So my job is to continuously provide new challenges to each engineer in my team, and remove any impediments to their desire to fulfill these challenges.

2.   I commit to give you clarity, both strategic & tactical

I work hard to ensure that everyone knows where we, as a company and as an Engineering team, are going, what our objectives are (strategic), and how we plan to get there (tactical).

In practice, I make sure, during our periodic 1-on-1s, that each engineer understands how his/her own project and role align with the company mission and Engineering’s product roadmap.

Included in this commitment is a promise to each member of the Engineering team that on any given day, his/her #1 priority is clear. As a logical consequence, this implies that each engineer has only one #1 priority (I have seen a lot of companies where this logic is violated). Their manager, or I as a last resort, will handle situations where, for example, 3 VPs are breathing down an engineer’s neck, each with their own “top priority”.

Having everyone in the team understand and share the same strategic context empowers developers to make correct micro-decisions every day. As a side benefit, this frees me and their managers to work on bigger problems.

 

Taking a step back, if I’ve correctly communicated commitments 1 and 2, then everyone in the team is working at the maximum of their ability, and all are working in the same direction. This is a good foundation for solid productivity.

Having made two commitments to everyone in the team, I ask for two in return.

3.   In return, I demand teamwork & 3-D communications

I put teamwork and communications in the same sentence because one is meaningless without the other. Teamwork can’t exist without meaningful communications, and if we communicate but don’t work together, we don’t go very far.

No interview question will ever suss out whether a candidate is a team player or not. Instead, I explicitly declare that they should not join my team if they are not a team player.

Teamwork is important because product development is a team effort. Every engineer interacts with product managers, UX designers, front-end engineers, middle-tier, backend, data, QA, tech support, etc. Poor interactions with other team members result in poor individual efficiency.

Teamwork means that “together, we succeed”. Teamwork is not merely about helping out a teammate who needs help. More importantly, being a team player means asking for help when we need it, so as not to delay the whole team.

3-D communications simply expands the definition of “team” beyond one’s daily scrum. We are all inter-dependent, and we each must ensure that information gets to the people who need it, no matter where their name sits in the org chart. Making sure information is received in a timely fashion, rather than waiting for questions to be asked, is incumbent upon each of us.

In particular, this means that everyone on my team has the responsibility to inform me if I am not meeting commitments #1 and #2 stated above. I don’t read minds, and I can only take corrective actions if someone lets me know that they are bored, confused, pulled in too many directions, or under-utilized, etc.

4.   At the end of the day, we need to be proud of our work

I added this fourth principle a few years later. I had been working at a company for about a year, had delivered a handful of successful releases, yet sensed burnout and a loss of creativity in the team.

A startup demands almost contradictory qualities from its Engineering team: speed and creativity (quality is a given). Because the demand on speed is often explicit, while the demand on creativity is often implicit, it is easy to fall into the trap of focusing only on execution at the detriment of innovation, or even the beauty of the code.

Yet, if we continuously succumb to the mantra of “ship, ship, ship”, and give up trying to build something cool, then we start on a slippery downward slope towards creating “blah” products. There are always pressures to ship more features faster, but if each of us is not proud of the product we are releasing to our customers then our customers won’t be excited about the product, and we won’t be having fun at work. Life is too short for us to accept either of these issues.

Making It All Work

There is nothing new, or magic, about these four leadership practices. The magic is in their daily practice. They work for me because I force myself to apply them on a daily basis, and I remind my teammates of their existence, their rationale and their own commitments, whether welcoming a new member, during a 1-on-1, during my weekly staff meetings, at exec staff, in monthly Engineering updates, or even at the water cooler.

DevOps-Driven Development

It is now time to add the concept of “DevOps-Driven Development” to our repertoire.

“Test-driven” development, which originated around the same time as Extreme Programming and Agile Development, encourages us to think about testing as we architect our software and plan our tasks. Similarly, a “DevOps-Driven Development” approach ensures that we consider operational implementation, as well as the deployment process, during the design phase. To be clear, DevOps thinking needs to augment (and not replace) the testing strategy.

Definition and Motivation

First, a definition: I am using the word DevOps here as a shortcut that covers both DevOps (build and deployment tools) and Ops (IT/data center operations).

How many times have you heard “… but it works on my machine!!” from a developer whose code was found to have a bug in the QA environment or, worse, in production? We all agree that these situations are a horrible waste of time for everyone involved, most of all customers. This post thus advocates that DevOps thinking, just like quality thinking, must occur at the design phase and continue throughout the development of the software, until the software is released to production and even after it has been released.

Practicing DevOps-Driven Development

I have always advocated that “if you don’t know how to test it, you don’t know how to design it” (Who Owns Quality? Part 3), to articulate the fact that “quality cannot be debugged out, it has to be designed in”. Similarly, if we want to know – before our customers call us – when our code crashes in Production, or becomes unusably slow, then we must build into our code the proper instrumentation and administration capabilities.

We must now add a second mantra: “If you don’t know how to deploy it and manage it in Production, you don’t know how to design it.”

Just like we don’t allow code to be merged into Trunk (main branch) without complete unit tests, code cannot be merged into Trunk without correct deployment scripts, release notes, and production instrumentation.

Here is a “thinking DevOps” checklist:

Deployable

First of all, we must ensure that the code deploys successfully not only in Production but in all environments: Dev, QA, Stage, etc.

This implies:

  • Developers write/update release notes, e.g. highlighting any changes required in the configuration of the environments: opening a new port, adding a column in the database, a new property in config files, etc.
  • Developers, in collaboration with the DevOps team, update deployment scripts, e.g. to account for a new executable or for schema changes in the database.

The management of config/property files is beyond the scope of this blog, but I strongly recommend the “Infrastructure as code” approach: i.e. fully automating server/image configuration for deployment, and managing configuration, deployment scripts and application property files under source code control.
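As a minimal sketch of that spirit applied to application properties (the directory layout, file names and keys below are hypothetical assumptions, not a prescription), configuration for each environment can live under source control next to the code and be selected at deployment time:

    # Sketch: load environment-specific properties kept under source control.
    # The layout (config/dev.json, config/qa.json, ...) and the keys shown
    # here are illustrative assumptions.
    import json
    import os

    def load_config(env=None):
        """Read config/<env>.json from the repository, e.g. config/qa.json."""
        env = env or os.environ.get("APP_ENV", "dev")
        path = os.path.join("config", env + ".json")
        with open(path) as f:
            return json.load(f)

    if __name__ == "__main__":
        cfg = load_config()
        print(cfg.get("db_host"), cfg.get("listen_port"))

Because these files are versioned with the application, a deployment to QA or Production picks up exactly the configuration that was reviewed along with the release.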

Monitor-able

If we want to detect problems before our (irate) customers call us, our code needs to be monitor-able – not only at the physical server level, but also at the level of each virtual machine, service and process, as well as the networking and storage systems.

Monitor-ability needs go beyond keeping track of CPU load, disk space and network bandwidth. We, developers, (should) know what parameters indicate when our system is misbehaving, whether it is a queue exceeding a given size or certain operations timing out. As a consequence, we must publish these parameters to interfaces compatible with the Ops team’s monitoring tools.

Furthermore, by making performance metrics easily observable, we ensure that each new release maintains (or improves) the performance of the prior release.
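For illustration, here is a minimal sketch of publishing application-level indicators in a form a monitoring system can scrape; it assumes the prometheus_client Python package, and the metric names are hypothetical:

    # Sketch: expose application-level health indicators to the Ops monitoring stack.
    # Assumes the prometheus_client package; metric names are illustrative.
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    queue_depth = Gauge("work_queue_depth", "Items waiting in the processing queue")
    partner_timeouts = Counter("partner_api_timeouts_total",
                               "Calls to the partner API that timed out")

    if __name__ == "__main__":
        start_http_server(9100)  # endpoint scraped by the monitoring system
        while True:
            queue_depth.set(random.randint(0, 50))  # stand-in for the real queue size
            time.sleep(5)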

Diagnosable

Despite our best intentions, we must humbly assume that at some point our code will crash, or seriously misbehave, and thus require troubleshooting. In the worst case, Development will be called in (usually in the wee hours of the night) to assist the Ops team. As anyone who has had to figure out why a given system intermittently crashes will attest, having log files capture meaningful information prior to the incident is invaluable. Having to add logging statements after the fact is a painful process. Consequently, solid logging hygiene is critical (and worthy of a dedicated post); a brief illustration follows the list below:

  • Log statements must be written in a format compatible with the log management system (Splunk, GrayLog2, …)
  • All log statements used during the coding and QA phase must be removed
  • Comprehensive Operations-focused logging must be added to document all operations that may fail due to environmental and data-related problems: out-of-memory, disk full, time out, user not found, access denied, etc. These are not bugs, but failures due to either environment (e.g. a server or connection is down) or incorrect data (e.g. the user has been deleted).
  • The hierarchy of logging levels must be enforced so that in normal operations log files are kept small and, conversely, meaningful information is output when troubleshooting is required
  • Log statements must include all the information necessary to bind all operations across various services that are related to a single user-level transaction (e.g. clicking on a link to a new page, adding an item to cart) – more details below in “Tunable”.
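As a brief illustration of the points above, a sketch (the logger name, format and messages are hypothetical) of operations-focused logging with enforced severity levels:

    # Sketch: operations-focused logging with enforced severity levels.
    # Logger name, format and messages are illustrative assumptions.
    import logging

    logging.basicConfig(
        level=logging.INFO,  # keep log files small in normal operations
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    logger = logging.getLogger("order-service")

    def charge_customer(order_id, user_id):
        try:
            # ... call the payment partner ...
            logger.info("payment ok order=%s user=%s", order_id, user_id)
        except TimeoutError:
            # Environmental failure, not a bug: log it so Ops can see the trend.
            logger.warning("payment partner timed out order=%s user=%s",
                           order_id, user_id)
            raise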

Security

This again is worthy of its own post, but code that is deployed to Production must both support the security practices implemented by the Ops team (e.g. Authentication protocols, networking infrastructure), and ensure that the code itself is secure (e.g. no SQL injection, buffer overflow, etc).
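To illustrate the second point, one hedged sketch of avoiding SQL injection is to always use parameterized queries rather than string concatenation (the table and column names below are hypothetical):

    # Sketch: parameterized queries keep user input out of the SQL text,
    # which prevents SQL injection. Table and column names are illustrative.
    import sqlite3

    def find_user(conn, username):
        # The driver binds the value; it is never concatenated into the statement.
        cur = conn.execute("SELECT id, email FROM users WHERE username = ?",
                           (username,))
        return cur.fetchone()

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (id INTEGER, email TEXT, username TEXT)")
        conn.execute("INSERT INTO users VALUES (1, 'alice@example.com', 'alice')")
        print(find_user(conn, "alice' OR '1'='1"))  # returns None, not every row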

Business Continuity

Business continuity is often overlooked, but we must ensure that any persistent data is stored in a storage system that is backed up by the Ops team. In other words, if we add a new database, we’d better ask the Ops team to add it to their backup scripts.

Similarly, if our infrastructure is deployed (or even just deployable) across multiple data centers, our code must support this through configuration.

The above requirements represent the basic DevOps requirements that any developer must address before even thinking that his/her code is ready to release. The following details additional practices that are highly recommended, but not strictly necessary.

Scalable

The code must be designed so that the Ops team can scale it in the datacenter without needing help from Development.

This may involve deploying the code to a bigger server. This implies that the code can be configured (and documented for the Ops team) to make use of the expanded resources, whether it is the number of cores, RAM, threads, I/O, etc.

This may also involve adding instances to a cluster. Consequently, the code must be discoverable (the load balancer must find out that a new instance has been added/subtracted), as well as cluster-aware (e.g. stateless).
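A minimal sketch of the first point, making resource usage configurable so the Ops team can exploit a bigger server without a code change (the environment variable name is a hypothetical example to be documented in the release notes):

    # Sketch: size the worker pool through configuration rather than code changes.
    # APP_WORKER_THREADS is a hypothetical variable documented for the Ops team.
    import os
    from concurrent.futures import ThreadPoolExecutor

    WORKER_THREADS = int(os.environ.get("APP_WORKER_THREADS", "8"))
    executor = ThreadPoolExecutor(max_workers=WORKER_THREADS)

    def handle(request):
        # Each incoming request is processed on the configurable pool.
        return executor.submit(process, request)

    def process(request):
        ...  # actual work goes here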

Tunable

Because it is so hard to simulate all real-life user activities and behaviors in non-production environments, we must provide tools to the Ops team to tune the performance of our code through configuration rather than code deployment (e.g. size of JVM, number of threads, queue sizes, hash table size, etc).

We must thus provide the metrics to observe performance. Let’s take the example of response time: depending on the complexity of the application a user request may be handled by tens, or even hundreds of services. In order to allow the Ops team to build a timeline of the interactions between all the services involved, each log entry must carry at least one tag that identifies the root transaction that generated the request. Otherwise it is impossible to determine whether the performance degradation comes from a given service, or a unique server, or even from the network infrastructure.

The same tagging will be used to troubleshoot failures (e.g. to discover why a given service fails intermittently).
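As one possible sketch of such tagging (the service name and log format are hypothetical), every log entry can carry the identifier of the root transaction, propagated from service to service:

    # Sketch: stamp every log entry with the id of the root user transaction so
    # that entries from many services can be time-aligned for a single request.
    # Service name and log format are illustrative assumptions.
    import logging
    import uuid
    from contextvars import ContextVar

    transaction_id = ContextVar("transaction_id", default="-")

    class TransactionFilter(logging.Filter):
        def filter(self, record):
            record.txid = transaction_id.get()
            return True

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s txid=%(txid)s %(message)s")
    logger = logging.getLogger("cart-service")
    logger.addFilter(TransactionFilter())

    def handle_add_to_cart(incoming_txid=None):
        # Reuse the caller's id if one was propagated; otherwise start a new one.
        transaction_id.set(incoming_txid or uuid.uuid4().hex)
        logger.info("add_to_cart started")

    if __name__ == "__main__":
        handle_add_to_cart()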

QA-able

As I mentioned in an earlier blog, QA does not stop in QA: we have to anticipate “unknown unknowns”, i.e. usage (or performance) scenarios that we have not modeled in our QA environments. By definition, there is not much we can do other than ensuring that our code is easy to troubleshoot (see above) and that logs and associated data can be made available easily and rapidly to the developers and QA team (e.g. by giving them access to the log management console).

Sometimes this requirement is more complex than it sounds, e.g. when user data must be deleted or obfuscated for privacy or security reasons. Again, this should be thought through before code is deployed.

Analytics – Growth Hacking – Usability

This last requirement stems from Marketing and Sales rather than Operations, but it is equally important since it drives revenue growth.

In most companies, marketing and sales rely on usage reports to drive new marketing campaigns, pricing, product offerings and even new features. As a consequence, any new feature must integrate with the analytics infrastructure, whether via integration with usage tracking applications (e.g. Mixpanel, Flurry, …) or simply log management consoles (Splunk, GrayLog2, …). However, I highly recommend using a separate logging infrastructure for operations monitoring and for usage analytics, if only because usage analytics requires additional data that is not useful for operations monitoring (e.g. the time a user spends on a page is extremely valuable for usage analytics but irrelevant for Operations).

Even More So for Microservices

As we migrate towards a microservices architecture, early “DevOps thinking” becomes even more critical. As the post “Microservices: Four Essential Checklists when Getting Started” advises: “Microservices introduces a lot of moving parts that were previously non-existent in a monolithic system”.

What was a monolithic application running in a single virtual machine can morph into 5, 10 or even 20 microservices. Consequently, Development, DevOps and Ops must collaborate on microservices infrastructure tools – service registration, scaling each service up/down independently, health monitoring, error detection, etc. – to provide visibility on the status of these 20 microservices as a whole. This challenge has even prompted dedicated product categories (SignalFx, Nirmata, etc.).

Summary

Only with a holistic approach to product architecture can we ensure customer satisfaction with software that works the first time, and all the time. Deployment and operations management concerns, just like testability, must be addressed at design time, so that these capabilities are meshed natively into the code rather than “bolted on” after the fact. Failing to do so will likely impact the delivery schedule, or worse, create outages in production.

More importantly, there is so much we can learn from observing how our code behaves in Production – operational efficiency, stability, performance, usability – that we would do ourselves a disservice if we did not avail ourselves of this valuable information to drive further improvements to our product.

QA does not stop in QA

Quality Assurance does not stop after the software receives the “thumbs up” from the QA team. QA must continue while the product is Live! … because QA is not perfect, and real users only exist on a Production system. We need to be humble and accept that our design, development and quality processes will not catch all the issues. Consequently, we must equip ourselves with tools that will allow us to catch these problems in Production as early as possible, rather than “wait for the phone to ring”.

When the product exits QA, it simply means that we have run out of ideas on how to make the system fail. Unfortunately, this does not imply that the system, once in Production, will not fail. If we are successful and get a high volume of traffic, the simple law of large numbers guarantees that our users will find ways we never thought of to – unintentionally – make the system fail. These are part of the “unknown unknowns”, as Mr. Donald Rumsfeld would say. Deploying the product on the production servers, and handing off (abdicating?) the responsibility for keeping it up to the Ops team, shows wishful thinking or naïveté, or both.

Why QA must continue in Production

There are a few categories of issues that one needs to anticipate in Production:

  • Functional defects: in essence, bugs that neither developers nor QA caught – while this is the obvious category that comes to mind, it is far from being the only source of issues
  • User experience (UX) defects: Product works “as spec’d”, but users either can’t figure how to make the product work, or don’t like it. A typical example is a high abandon rate in a purchasing experience, or any kind of work flow, or a feature that’s never used, a button that’s never clicked.
    This is not reserved to new products: by improving the layout of a given page, we may have broken another feature on that same page
  • Performance issues: while we may have run performance, and load tests, in our QA environments, the real world always offers surprises. Furthermore, if we are lucky enough to have the kind of traffic that Google or Facebook have, there is no other way but to test and fine-tune performance in production
    Running tests on non-production systems requires simulating not only the load on the system, but also the “weight” of existing data (e.g. in the database or file system), as well as longevity, to ensure that there is no resource leak (memory, threads, etc.)
  • Operational issues: while all cloud applications are typically clustered for high-availability, there are other sources of failure than equipment failure:
  • External resources, such as partners or data feeds, can fail, have bugs of their own, or simply fail to keep up their response time. Sometimes the partner updates the API without notification.
  • User-provided data can be malformed, or in an unexpected format, or a new data format can be introduced after the launch of the product
  • System resources can be consumed at an unexpected rate. Databases are notorious for having non-linear response times based on load: as long as the load is under a given threshold response time is high, but once the load exceeds this threshold response time can deteriorate very rapidly.

 

A couple of examples:

  • At my previous company, weeks after the product had been launched, we started receiving occasional complaints that some of the user-created videos were not showing up in their timeline. After (reluctantly) poking around in our log files, we did find out that about 10% of the videos that had been uploaded to our site for the past 2 weeks (but not earlier) were not processed properly. Our transcoder simply failed. Worse, it failed silently. The root cause was a minor modification to the video format introduced by Apple after our product was released. Since this failure was occurring for a small fraction of our users, and we had no “operational instrumentation” in our code, it took us a long time to even become aware of it.
  • Recently, we launched a product that exchanges data with our partner. Their API is well documented, and we tested our product in their sandbox environment, as well as their production environment. However, after launch, we had reports of occasional failures. It turns out that users on our partner’s site were modifying the data in ways that we did not expect, causing the API to return error codes that we had never seen. Our code duly logged this problem in our log files each time it occurred … among the thousands of other log events generated every minute.

 

Performing QA on Production Systems

As I mentioned, the Googles and Facebooks of the world do a lot (if not most) of their QA on Production systems. Because they run hundreds of thousands of servers, they can use a small subset to run tests with live user data. This is clearly a fantastic option.

Similarly, “A/B comparison” techniques are typically used in Marketing to compare 2 different user experiences, where the outcome (e.g. a purchase) can be measured. The same technique can be applied to testing, e.g. to validate that a fix for an intermittent, difficult-to-reproduce bug actually works.

 

More generally, Production code needs to be instrumented:

  • To detect failures, or QoS (Quality of Service) degradations, with internal causes (e.g. database is slowing down)
  • To detect failures, or QoS degradations, with external causes (e.g. partner API times out a lot)
  • To monitor resource utilization for each service or application – at a finer grain than provided by Operations monitoring tools which are typically at the server level.

The point is that if a user can’t buy a book on our website because our servers crash under load, this is a bug. The crash may not be due to code written incorrectly, but the absence of code warning us that the system was running out of steam is still a bug.

 

In order to monitor quality in Production, we need to:

  • Clean up the code that writes to log files: eliminate all logging used for code testing, or statements such as “the code should never reach here”. Instead, write messages that will be meaningful to the poor soul who, a few weeks later, will be poring over megabytes of log files on a Sunday night trying to figure out why the system crashed
  • Ensure that log messages have consistent severity levels (e.g. as recommended by RFC 5424; Wikipedia has a nice table), so that meaningful alerts can be triggered
  • Use a log aggregation system, like GrayLog2 (open source), so log files from multiple nodes in the same cluster, as well as nodes from different services can (a) be searched from a console and (b) viewed, time-aligned, on a single page (critical for troubleshooting). GrayLog2 can handle hundreds of millions of log events and terabytes of data.
  • MEASURE: establish a baseline for response time, resource consumption and errors – and trigger alerts when the metrics deviate from the baseline beyond a predetermined threshold (see the sketch after this list)
  • Track that core functions – from a user perspective – complete, and log when, and ideally, why, they fail along with key parameters. E.g.: are users able to upload files to our system, are failures related to file size, time of day, location of user, etc?
  • Log UX and operationally meaningful events to track how users actually use the system and which features are most used, and follow these metrics over time. They are critical for the Product Management team
  • Monitor resource utilization and correlate with usage patterns. Quantify key usage parameters in order to scale the right resources in advance of the demand. For example, as traffic grows, the media server and the database servers may grow at different rates.
  • Integrate alarms from application errors into the Ops monitoring tools: e.g. too many “can’t connect” errors should trigger an Ops alert that our partner is down; slow response time on a single server in a cluster may indicate that its disk is failing
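A toy sketch of the MEASURE point above (the metric, the numbers and the alert hook are purely illustrative):

    # Sketch: compare a live metric against its established baseline and alert
    # when it deviates beyond a predetermined threshold. Numbers are illustrative.
    BASELINE_P95_MS = 350   # baseline p95 response time established from history
    THRESHOLD = 1.5         # alert when 50% worse than the baseline

    def check_response_time(current_p95_ms, alert):
        if current_p95_ms > BASELINE_P95_MS * THRESHOLD:
            alert("p95 response time %dms exceeds baseline %dms by more than %d%%"
                  % (current_p95_ms, BASELINE_P95_MS, (THRESHOLD - 1) * 100))

    if __name__ == "__main__":
        check_response_time(600, alert=print)  # stand-in for the real alerting hook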

 

Quality is not a one-time event; it is an everyday activity, because users change their behaviors, partners change their APIs, and systems get full and slow down. What used to work yesterday may not work today, or may no longer be good enough for our customers. As a consequence, the concept of “test-driven” development must be extended to the Production systems, and our code must be instrumented to provide metrics that confirm that everything works as desired, and alerts when it does not. But that is not sufficient: developers and QA engineers must also take the time to look at the data, not just when a fire drill has been called, but on a regular basis, to understand how the system is being used and how resources are consumed as the system scales, and apply this knowledge to subsequent releases.

Day-by-Day Model of an Iteration


This post presents a practical guide of what happens during a typical Agile iteration – a sort of play-by-play for each role in the team, day by day. Please open the attached spreadsheet which models the day-by-day activities of a 2-week Agile development iteration, and describes the main activities for each role during this 10-day cycle of work. In addition, we will highlight how to successfully string iterations together, without any dead time; as the success of any given iteration is driven by preparation that has to take place in earlier iterations.

This is intended as a guide, rather than a prescription. While each iteration will have its own pace, a successful release will follow a sequence not too different from the one presented here.

Golden Rules

Each company is different, each project is different, each team is different, each release is different, and each interpretation of Agile is different. The following states the immutable principles to which I personally adhere.

  • Once Engineering and Product Owner agree on the deliverables of an iteration, they are frozen for this iteration
    • Engineering must deliver on time
    • Features cannot be changed, added, or re-prioritized
    • Only exception is a “customer down” escalation of a day or more
  • Engineering delivers “almost shippable” quality code at the end of the iteration
  • Each iteration is self-contained: all the activities pertaining to a given user story must be completed within the iteration, or explicitly slated for another iteration at the start of the iteration
    • E.g.: QA, unit tests, code reviews, design documentation, update to build & deployment tools, etc
  • Dev & QA engineers scope their individual tasks at the beginning of each iteration. The scope and deliverables of the iteration are based on these estimates.
    • Engineers are accountable to meet their own estimates

The above implies that Engineers must plan realistically by
(a) accounting for all activities that will need to take place for this iteration, and
(b) accounting for typical levels of interruptions and activities not specifically related to the project (scheduled meetings, questions from support, beer bashes, vacations, etc).

Estimates must be made with the expectation that we are all accountable to meeting them. This sounds like a truism, except that it is rarely applied in practice.

Day by Day

Before the Start of an Iteration

Preparation and planning prior to an iteration are critical to its success. As the spreadsheet highlights, the Product Manager spends the majority of his/her time during a given iteration planning the next iteration, by

(a)  Prioritizing the tasks to be delivered in the next iteration
(b)  Documenting the user stories to the level of detail required by developers
(c)  Reviewing scope with Project Manager and Tech Lead

Pre-requisites at the Start of a Release

The following must be delivered to Engineering at the start of a release. The Product Owner, Project Lead and Tech Lead are responsible for providing:

  • “A” list of user stories to be implemented during the release
  • Detailed specs of the “A” list user stories
  • Design of the “A” list features sufficient to derive the coding and QA tasks necessary to implement the features
  • Estimated scope for each feature – rolling up to a target completion date for the iteration

These estimates are “budgetary”. Final estimates are given by the individual engineers.

Day 1 – Kick-Off

The whole team gets together and kicks off the iteration: the PM presents the “A” list features to Eng, and the Tech Lead presents the critical design elements. Tasks are assigned tentatively.

During the rest of the day, engineers review the specs of their individual tasks, with the assistance of PM and Tech Lead.  This results in tasks entered into Jira, with associated scope and individual plans for the iteration.

The Project Lead combines all tasks into a project plan (using artifacts of his/her choice) to ensure that the sum of all activities adds up to a timely delivery of the iteration. The Project Lead also identifies any critical dependency, internal and external, that may impact the project.

A delivery date is computed from the individual estimates, and the team (led by the Product Owner, assisted by the Project Manager and Tech Lead) iterates to adjust tasks and date.

Day 2 – Deliverables are Finalized

Day 1 activities continue if necessary – resulting in a committed list of deliverables and a committed delivery date.

The team, led by the Project Manager, also agrees on how the various tasks will be sequenced to optimize the use of resources, and to front-load deliverables to QA as much as possible.

Developers start coding; QA engineers start writing test cases and/or automation tests.

Day 6 – V1 Spec of the Next Iteration

By Day 6, the Product Manager provides the V1 Spec of the next iteration.

V1 Spec is a complete spec of all the user stories that the Product Owner estimates can be delivered in the next iteration. Typically, V1 will contain more than can be delivered, in order to provide flexibility in case some user stories turn out to be more complex to implement than originally thought.

During the remainder of the release, the Tech Lead (primarily) will work with the Product Owner to flesh out the details of the next release, and to design the key components of the next release to a degree sufficient to (a) list out the tasks required to implement the user stories, (b) estimate their scope, and (c) ensure that enough detail has been provided for developers and QA engineers.

During the discussions of the next release, the Project Lead will identify any additional resources that will need to be procured, whether human or physical.

Day 7 – Release to QA

Release to QA means more than “feature complete”. It means feature complete and tested to the best of the developers’ knowledge and ability (see below).

Day 9 – Code Freeze

By Day 9, all bugs must have been fixed, so that the QA team can spend the last day of the iteration running full regression tests (ideally automated) and ensuring that all new features still work properly in the final build.

By that time, the content and scope of the next release have been firmed up by the Product Owner, Tech Lead, and Project Manager, and tasks are tentatively assigned to individual engineers.

Day 10 – Show & Tell

At the end of the last day of the iteration, Eng demos all the new features to the PM, the CEO and everyone in the company we can enroll.

We then celebrate.

Tools and Tips

Sequencing Iterations

  • Depending on the complexity of the user stories, the Tech Lead (and other developers) may need to spend all of their time doing design, and may not be able to contribute any code.
  • It is sometimes more productive to write automation tests once a given feature is stable. As a consequence, the QA team may adopt a cycle where they test manually during the current iteration and then automate the tests during the next iteration (once the code is stable)
  • Exceptions to “almost shippable” are things like performance and stress testing, full browser compatibility testing, etc.
    • These tasks are then planned in the context of the overall release, and allocated to specific iterations

Release Duration

The duration of a given iteration is at the discretion of the team. It is strongly recommended that iterations last between 2 and 4 weeks.  It is also recommended that the duration of iteration be driven by its contents, in order to meet the Golden Rules. There is nothing wrong with a 12- or a 17-day iteration.

Start on Wednesday

Similarly, the starting day of the iteration is up to the team. Starting on a Wednesday offers several advantages:

  • The iteration does not start on a Monday :-). Mondays are often taken up by company & team meetings.
  • Iteration finishes on a Tuesday rather than a Friday. Should the iteration slip by a day or two, it can be completed on Wednesday, or Thursday if need be. This means that the QA team is not always “stuck” having to work weekends in order to meet the deadline, nor do they have to scramble to make sure that developers are available during the weekend to fix their bugs, as would be the case if the iteration started on Monday
  • By the second weekend of the iteration, the team will have good enough visibility into its progress, and determine whether work during the weekend will be required in order to meet the schedule.

Specs

The artifacts, format and level of detail through which specs are delivered to Engineering are a matter of mutual agreement between the Product Owner and Engineering, recognizing that Engineering is the consumer of the specs. As such, it is Engineering who determines the adequacy of the information provided (since Engineering cannot create a good product from incomplete specs).

Specs must be targeted for QA as well as Dev. In particular, they must be prescriptive enough so that validation tests can be derived from them. For example, they may include UI mockups, flow charts, information flow diagrams, error handling behavior, platforms supported, and performance and scaling requirements, as necessary.

Release to QA

While the QA team has the primary responsibility of executing the tests that will validate quality, developers own the quality of the software (since they are the ones writing the software). As a consequence, when developers release to QA, they must have tested their code to ensure that no bugs of Severity 1 or 2 will be found by QA (or customers) – unless they explicitly agree in advance with the QA team that certain categories of tests will be run by QA.

Regardless of who runs the tests, the “release to QA” milestone is only reached when enough code introspection and testing has been performed to warrant confidence that no Severity 1 or 2 bugs will be found.

Releasing to QA

Developers and QA can agree on how code will be released to QA. While the spreadsheet shows one Release to QA milestone, this was done for clarity of presentation. In practice, it is recommended that developers release to QA as often as possible. Again, this should be driven by mutual agreement.

Furthermore, each developer must demonstrate to his/her QA colleague that the code works properly before the code is considered to be released. This demo is accompanied by a knowledge transfer session, where the developer highlights any known limitations in the code, areas that should be tested with particular scrutiny, etc.

Estimating Scope Accurately

One of the typical debates is whether time estimates should be measured in “ideal time” (no interruptions, distractions, meetings), or “actual time” (in order to account for the typical non-project-related activities). This is a matter of personal preference – what counts is that everyone in the team uses the same system.

I prefer to use “ideal time”: each engineer keeps 2 “books” within an iteration: the actual iteration work, scoped in “ideal time”, and an “Other Activities” book, where all non-project-related activities are accounted for. This presents the advantages of (a) using a non-varying unit to measure the scope of tasks, so that you can compare across people, projects and time, and (b) having a means to track “non-productive time” on your project – and thus having data on which to base decisions (e.g. pleading with management for fewer meetings).

Click here to get the spreadsheet

Software Specification is a Process Not a Document (2 of 2)

Engineering depends on the business team to create actionable specifications early enough before a release, to control the scope to a level commensurate to resources and time available, and to use artifacts that are relevant to the information to be conveyed.

Timing is Everything

Product Management delivering complete specifications in a timely fashion greatly improves the productivity of the Engineering team (complete being relative to the type of specification, as we discussed in the previous blog). The more precise the information provided at the start of each phase (scoping, release or iteration), the more efficient and accurate the resulting development work will be.

This sounds boringly obvious, but I have seen the contrary scenario over and over again, where business leaders grumble that the Engineering team is not productive, while failing to provide more than a PowerPoint-level specification at the start of releases. As a consequence, developers spend the first third to half of the release working with the Product Managers to define the specs, instead of writing code – or even worse, developers start writing code without a spec, and then have to do it over once the specs have been thought through.

Scoping is a 2-way Commitment

Another pitfall to avoid is “scope creep”. While the name itself would imply that it should be avoided at all costs (who wants to be creepy?), scope creep is an all-too-common occurrence.

Scope creep, on the surface, appears to stem from good intentions (we want to meet every customer request – even last minute ones), yet it is one of the most demoralizing behaviors for the Engineering team – akin to continuously pushing back the finish line, after the start of a race.

In order to avoid scope creep, we (Engineering) need to remind the business team that, based on the information provided during the scoping phase, Engineering reserved a set of resources for the duration of the release, and committed to deliver the feature set in the allotted time. This in turn creates an implicit contract that the scope of the release will be bound by the amount of resources allocated to it. While changes are expected as we get closer to the release start, and even once the release has started, the business team can’t forget that there are only 24 hours in a day, and that no matter how cool it would be to add another 25% of functionality, asking the Engineering team for such an increase in scope flies in the face of the process: if we could really do 25% more, we’d have said so the first time, during the scoping phase.

In summary, once Engineering allocates resources for a release and commits to deliverables and schedules, the business team, in turn, must commit to keep the scope of the release commensurate to the resources allocated.

Use the Right Artifacts for the Job

As we replaced Waterfall development process with Agile Software development, we also replaced Market/Product Requirement Documents with User Stories. I have to admit that I don’t get that part, or rather that I find that sometimes user stories are the best vehicle to express customer requirements, and other times, straight requirements do a better job.

For example, when a workflow needs to be implemented, nothing beats a flow chart or a state diagram to define it – we can dispense with the user story on the 3×5 card.

Write Things Down

There is no dispute that face-to-face discussions are the fastest way to nail down a user story. Often the expected behavior is self-evident from the software implementation itself. However, we must remember that multiple constituencies need to reach a common understanding of the software’s behavior: not only the Product Champion and developers, but also QA, support, services, etc.

Again, there is no way that more than 2 people can reach the same understanding of how a workflow should perform, or what a report is meant to compute, unless it is written down, preferably in pictorial form.

Technical Risk Must Be Eliminated Prior to Scoping

The business team expects estimates that are fairly accurate – say within 10%. You can see eyes roll when you present your estimates and then add that the estimate is accurate within 30% … and it’s a fair reaction. As a consequence, time must be invested in research, design and/or prototyping in order to reach the desired level of accuracy. Sometimes, we need to invest the time to build a prototype in order to validate a design or an architecture. While this may initially appear to be a prohibitive price to pay, a much higher price would be paid if one embarked on a release, only to miss the deadline by a month or more, because the original design turned out to be inadequate.

Managing Perceptions

Which scenario is better?

(A)  Promise to deliver 12 features and end up delivering 10 – OR –

(B)  Promise to deliver 9 features and end up delivering 9

In my experience, Scenario (A) is a perceived failure, while (B) will be perceived as a success.

If you agree with me, then you will want to think hard about your iteration plan, and about which features you implement in which iteration. Naturally, the later the iteration within the release, the more likely it is that its features will not be implemented (either because of schedule slips, or changes in priorities). Consequently, plan low-impact features for the last iteration(s); this way you’ll have the option of jettisoning them if necessary while still nailing the committed schedule. Conversely, if you keep high-impact features for the end, your only choices will be to disappoint — by taking them out in order to meet the schedule, or to disappoint — by forcing a schedule slip.

In conclusion, software development is a team activity – not only within the Engineering team but also with the business team: Engineering depends on the business team to create actionable specifications early enough before a release, to control the scope to a level commensurate with the resources and time available, and to use artifacts that are relevant to the information to be conveyed.

Software Specification is a Process Not a Document (1 of 2)

Software specification needs to be thought of as a process, rather than a document. The three phases of the process are: (a) release scoping, (b) release planning / iteration sequencing, and (c) in-depth user story specification.

At each of the companies where I have worked, a debate has always raged about how to document new product specifications. As VP of Engineering, I am frequently asked to produce a template for Requirements Documents. On the other hand, Agile does away with requirements in favor of user stories. This, in turn, conflicts with the needs of the business team, who want to know six months ahead of time what they can promise to customers.

The first step towards reconciling these perspectives is to understand that software specification is a process, not a document: the value of a specification comes mostly from the process of creating it, and less so from the final artifact. For one, the final specification rarely captures the features that were excluded, or the business justifications behind any given feature.

The Specification Process comprises 3 different phases with different purposes and different deliverables.

  1. The first phase is Scoping: it typically takes place weeks before the start of the release. The output of the scoping phase is an estimate from the Engineering team that a certain bag of features can be delivered by a given date, with a given set of resources.
  2. The second phase is Release Planning, ideally starting (shortly) before the official start of the release, where the engineering lead, with input from the product manager, creates the release plan, breaks the release into iterations, and defines the major features to be built in each iteration.
  3. The third phase is the detailed specification of the features/user stories for each iteration.

Scoping

In my world of enterprise software, the customers, and the business team, want to know months in advance which features will be available by when. Both the release date and the features are determined before the start of the project (sometimes weeks before) and must be met. This is not Agile, but it is reality – see my earlier blog, “Setting Expectations about Formal Releases with the Business Team”.

In order to produce a reliable estimate of what will be delivered when, the Engineering team needs a complete list of features, with just enough specificity for the Engineering team to appreciate the difficulty of each task.
For example, the spec for a user registration page on a web site could be as simple as:

  • User enters Username, Password first time, Password second time.
  • The Username must be unique
  • The 2 entries for the Password must be identical

… but it could get a lot more complicated:

  • The password must meet “strength of security” criteria
  • As the user types in the password, the strength of security of the password will be computed and displayed graphically
  • The registration server must handle up to 2,000 registrations per minute with a response time of 3 seconds or less
  • System availability must be 99.99% uptime

The two scenarios are vastly different. However, the Engineering team does not need to know much more than the bullets above to engage in a discussion with the business team about the scope of the project. If the application's software stack has not already been validated for performance and reliability, the second project is going to take weeks, compared to hours for the first one. Even the little visual indicator of password strength can add days to the scope of the project (if AJAX needs to be added to the app, or if the team does not have a graphic designer readily available).
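
To make the contrast concrete, here is a rough sketch of what the "simple" version of the spec could amount to in Java. The class, method names, and in-memory storage are invented for illustration, which is precisely why this variant scopes in hours rather than weeks.

    import java.util.HashSet;
    import java.util.Set;

    // Minimal sketch of the "simple" registration spec: unique username,
    // two identical password entries. Persistence is faked with an in-memory set.
    public class Registration {
        private final Set<String> existingUsernames = new HashSet<>();

        public String register(String username, String password1, String password2) {
            if (existingUsernames.contains(username)) {
                return "Error: username already taken";
            }
            if (!password1.equals(password2)) {
                return "Error: the two password entries do not match";
            }
            existingUsernames.add(username);
            return "OK";
        }

        public static void main(String[] args) {
            Registration r = new Registration();
            System.out.println(r.register("alice", "s3cret", "s3cret")); // OK
            System.out.println(r.register("alice", "other", "other"));   // duplicate username
            System.out.println(r.register("bob", "s3cret", "typo"));     // mismatched passwords
        }
    }

Nothing in a sketch this small addresses the password-strength, throughput, or availability requirements of the second scenario – which is exactly where the weeks go.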

While the spec can be very short and still allow the Engineering team to provide scope estimates, one should not underestimate the time it will take to scope. For example, if system performance requirements are significantly increased, scoping will involve design and probably prototyping.

Scoping estimates are typically made based on experience: by comparing the new project to previous ones, by estimating the number of function points, etc.

Release Planning / Iteration Sequencing

Release planning, or iteration sequencing, is an overlooked and underrated activity, yet it often makes the difference between perceived success and failure. Agile suggests that the user stories most important to the customers should be developed first. This is indeed the primary guide for sequencing activities within a release. However, other important factors need to be considered. For example:

  • Eliminating technical risks for some of the important features
  • Confirming usability by mocking up or prototyping key components of the user interface, so that they can be shown to customers for feedback early in the release cycle, leaving time for modifications
  • Integration of new libraries, tools, or partners
  • Performance validation

By going through the release planning exercise, the team drills further down into the specifications, gets a more refined appreciation of the scope of the project, and thus confirms, or invalidates, the original scoping estimate. If necessary, adjustments can be made before the project starts. Early preventive action is always a good thing!
In addition, release planning is important to ensure availability of critical resources whether human, or physical.
Finally, a proper release plan will align the coding effort with the integration and testing strategy. For example, it is simpler to test an API call when you implement both sides of it, or to test a DAO when you simultaneously code the UI front end that uses it.
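
As an illustration of the first point, here is a hedged sketch using only JDK built-ins: the /ping endpoint and its "pong" payload are invented, but because both sides of the call are implemented together, a single smoke test exercises the whole round trip.

    import com.sun.net.httpserver.HttpServer;
    import java.net.InetSocketAddress;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch: implement and exercise both sides of a hypothetical /ping API in one place.
    public class PingApiSmokeTest {
        public static void main(String[] args) throws Exception {
            // Server side: a minimal handler for GET /ping.
            HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
            server.createContext("/ping", exchange -> {
                byte[] body = "pong".getBytes();
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();

            // Client side: call the endpoint we just implemented.
            int port = server.getAddress().getPort();
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://localhost:" + port + "/ping")).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            // One check validates both the server handler and the client call.
            System.out.println(response.statusCode() == 200 && "pong".equals(response.body())
                    ? "PASS" : "FAIL");
            server.stop(0);
        }
    }

Sequencing the release so that the producer and consumer of an interface land in the same iteration is what makes this kind of end-to-end check possible early.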

“Intra-Release Specification”: Detailed User Stories

Once a release has started, detailed user stories must be provided to the Engineering team prior to the start of each iteration, so that the iteration can be scoped by the developers at its start, and the features can be implemented during the iteration.
While interactions between Product Management and developers are encouraged during the iteration, having well-thought out user stories ahead of the iteration greatly improves efficiency.

By understanding that specifying product requirements is a process, rather than a document, both business and engineering teams will work effectively, by delivering the proper level of information to each other at the right time. In the next blog, I’ll cover tricks and best practices of this process.

Planning – and Executing the Plan – are Part of the Job

Along with writing good code, planning and meeting the plan are part of an engineer's responsibilities, in order for the product to be successful and the business to thrive.

Being an engineer entails more than writing good code. It also requires being a good corporate citizen. We write and test software so that it can be used by our customers. The Engineering team is one of the teams that constitute the business. As such, we need to coordinate our activities with those of the other teams in the business: Marketing, Sales, Operations, and Support. We are dependent on these other teams for our software to find its way into the hands of our customers. We also depend on them for the business to survive. Let’s not forget that Engineering is an expense center, and that without the Sales team, there would not be any paycheck.

Our obligation to the other teams in the company can be summarized fairly simply: we need to deliver what we promised, on time. We thus need to be able to forecast within a reasonable horizon what we will be able to create, and then deliver against our forecast.

Planning is Difficult but Necessary

Some argue that writing software is a creative and innovative endeavor which, as such, cannot be predicted. The comparison is made with civil engineering, where designing a new building is akin to applying well-documented formulas and following well-defined processes that lend themselves to formulaic forecasting. While there is truth to the argument, it cannot be taken to the limit: it does not mean that forecasting a software project is impossible, but rather that it is hard.
This being said, we don't have a choice. As I often point out to my colleagues, salespeople have to forecast every quarter, and one can argue that forecasting sales is even more challenging, since it relies on the behavior of people over whom we have very little control: our customers. Yet no company can operate without a sales forecast, and forecasting is one of the skills that salespeople need to develop, along with their sales acumen. Engineers are in the same situation.

More specifically, the reasons we make plans are:

  • To forecast when a given release will be complete. This drives sales projections and staffing assignments in services, which in turn drive financial projections and how the company manages its expenditures – such as our salaries.
  • To make strategic decisions: for example, if a certain set of features takes too long, or too many resources, we may decide to postpone its implementation and allocate resources to another product or set of features.
  • To make our own decisions: knowing how much work each task will take allows us to staff projects appropriately, and thus be as efficient as possible. Over-staffing and under-staffing both have negative consequences that are easy to understand.
  • To align internal resources: the most obvious example is that the QA team needs to know when a certain feature will be ready to be tested.

The above illustrates how important it is to meet our commitments once we have announced our plans. If we don't meet our plans, we let other people down and force them to scramble to make alternate plans. Yet meeting one's commitments is not only about working hard: it starts with making good plans.

Making Good Plans

How does one make good plans?

  • First and foremost: include everything (easier said than done, but nonetheless critical)
    • Think through ALL the tasks that are required to complete the job: create a new Maven project, become familiar with the idiosyncrasies of a new software package, upgrade libraries to a new version, organize design reviews, code, unit tests, integration tests, performance tests, error-recovery tests, security intrusion tests, documentation, training, etc.
    • Account for everything that happens in a typical day/week: e.g. meetings, interruptions from Ops, support, or others
  • Be realistic: engineers tend to be optimistic – make sure you take into account that something, at some point, is going to go wrong
    • The best technique I know is to use history as a reference. Have you typically been late or early on your past projects? Are there activities that you typically fail to account for?
  • Build in some buffer, because it is important to meet the commitment (and if you don't need the buffer, you'll use the time to implement an extra feature, or start the next release early). A back-of-the-envelope sketch of this arithmetic follows the list.
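
Here is that sketch in Java. The task list, the overhead share, and the buffer percentage are invented numbers, meant to be replaced by your own history; the point is only to show how overhead and buffer turn "ideal days" into a commitment you can actually meet.

    import java.util.Map;

    // Back-of-the-envelope plan: raw estimates, daily overhead, and a buffer.
    // All numbers are illustrative assumptions, not recommendations.
    public class PlanEstimate {
        public static void main(String[] args) {
            // Raw engineering estimates, in ideal days.
            Map<String, Double> tasks = Map.of(
                    "set up new Maven module", 1.0,
                    "learn new library quirks", 2.0,
                    "code + unit tests", 8.0,
                    "integration + performance tests", 4.0,
                    "docs and hand-off to QA", 2.0);

            double idealDays = tasks.values().stream().mapToDouble(Double::doubleValue).sum();

            // Assume ~25% of a typical day goes to meetings, support, and interruptions,
            // and keep a 15% buffer because something always goes wrong.
            double overheadFactor = 1.0 / 0.75;
            double bufferFactor = 1.15;

            double plannedDays = idealDays * overheadFactor * bufferFactor;
            System.out.printf("Raw estimate: %.1f days, planned commitment: %.1f days%n",
                    idealDays, plannedDays);
        }
    }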

Tracking Progress

A tool like Atlassian’s Jira allows each developer to enter their tasks and the time for each task. It is critical that each developer enter their own time estimates. No task should be longer than 2-3 days; if it is, it is best to break it up. I have found 2-3 days to be the right balance between having enough detail in a task to grasp its whole scope, and keeping the total number of tasks manageable.

It is important to think of a task as a complete project, including reviewing requirements, design, code, integration, testing, documentation, and hand-off to QA. Of course, each of these sub-tasks can be spelled out when its scope warrants it. Again, include the typical daily overhead in the estimates.
Once we have entered the tasks in Jira, it is critical to track them accurately. Don't be shy about entering time beyond your original estimates if you are running late: your teammates, and your team lead, need to know, so that they can make alternate plans if necessary. Progress-tracking tools are not meant to find fault, but to enable project management and communication: keeping quiet about being late, or struggling, on a task is a much worse offense to your team than being late itself. Being late is a problem that can be dealt with; keeping quiet is a professional fault that hurts the project even more than bad code.

One important note: a task is DONE when you won’t need to put any more work into it. In particular, this means a piece of code is not done until it has been fully tested and validated.

Agile Processes for Formal Releases

2-4 week milestones, culminating in a show-and-tell where Engineering and Product Owner(s) engage in a discussion about priorities, deliver a lot of the advantages of Agile methodologies, even without official buy-in from the management team.

Engineering can follow a mostly Agile methodology even if the rest of the company does not. For example, you can still break up the development effort into 2-4 week sprints/milestones, even if the Product Owner does not formally review priorities for each milestone. In fact, I contend that by having frequent end-of-milestone reviews, you will in effect elicit prioritization from the Product Management team.

2-4 Week Sprints/Milestones

Regular milestones (every 2 to 4 weeks) are essential for several reasons, each sufficient in its own right:

  • 2-4 weeks is the proper horizon for planning. While it is not impossible to make plans over longer horizons, the accuracy of these plans drops significantly when they extend beyond 4 weeks. Per Agile, the plan for each sprint needs to be made bottom-up by the developers who are working on the project
  • Commitment to the plan: since the developers created the plan themselves, we can ask them to commit to its timely execution. Accuracy in estimating one's work is a skill that each developer must fine-tune
  • Visibility of progress: by having an “almost shippable” release tested at the end of each sprint, we can all assess progress realistically. As the Manifesto for Agile Software Development states, “Working software is the primary measure of progress.” My measure of progress is binary: if a feature passes all its tests, it is 100% done; if not, it is not done (0% complete). A small sketch of this measure follows the list.
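
The sketch below uses made-up features and test results; the only rule it encodes is that a feature counts as done when every one of its tests passes, and progress is simply the share of done features.

    import java.util.List;
    import java.util.Map;

    // Binary progress: a feature is either 100% done (all tests pass) or 0% done.
    public class SprintProgress {
        public static void main(String[] args) {
            // Hypothetical features mapped to their test results (true = pass).
            Map<String, List<Boolean>> featureTests = Map.of(
                    "user registration", List.of(true, true, true),
                    "password strength meter", List.of(true, false),
                    "admin report", List.of(true, true));

            long done = featureTests.values().stream()
                    .filter(results -> results.stream().allMatch(Boolean::booleanValue))
                    .count();

            System.out.printf("Progress: %d of %d features done (%.0f%%)%n",
                    done, featureTests.size(), 100.0 * done / featureTests.size());
        }
    }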

With the rhythm of 2-4 week milestones, everyone on the team can see the product being built, with the confidence that true progress is being made and the expectation that no nasty surprises are lurking at the tail end of the project.

Show-and-Tell

Show-and-tell is the culmination of the milestone, where the project team demonstrates the new features to their colleagues inside and outside of Engineering. It is critical to advertise the Show-and-Tell outside of the Engineering team, including to the CEO, VP Sales, VP Marketing, etc. The benefits of Show-and-Tell sessions are multiple:

  • Reward the engineers: the show-and-tell is a perfect opportunity to acknowledge the contribution of each engineer on the project and to offer them public recognition
  • Avoid surprises at the end: the last thing you want when you have toiled away for 3-6 months on a project is to hear something like “Nice work, guys … but this is not what I expected!”, whether from your own team or from customers. Thanks to regular show-and-tells, there are no surprises. We also give the Product Management team the tools to share these early releases with customers, as appropriate.
  • Reassure Management: By demonstrating regular forward progress to the management team, we can relieve some of their anxiety as to whether we will be able to meet our deliverables. Equally important, when the project was too ambitious to start with, Engineering can give early warning to the management team that alternative plans need to be made, whether it is to reinforce the Engineering team, or to manage customer expectations.

Just as the milestones ensure that the Engineering team can manage its progress without surprises, the show-and-tell performs the same function for the business team.
Furthermore, the show-and-tell gives “fair warning” to the rest of the company that the release is on its way, so that the marketing and sales machines can rev up in anticipation of the completion of the release (rather than “wait and see” until it is officially released).

Engage in Discussions about Prioritization

While the business team may not embrace the Agile methodology, by holding show-and-tell events we effectively engage the Product Management team in a discussion about prioritization. Even when their perspective is “Everything is important, everything must be delivered”, witnessing the progress naturally leads to a discussion about whether we need to enhance or rework what has already been built, which features should be developed next, and how customers are reacting to the early releases.
There is no magic formula to make these discussions happen but, in my experience, they simply happen naturally.

More than ever, in today’s fast-changing environment, Engineering must be both predictable and adaptable: predictable so that the rest of the company can operate efficiently (e.g. by starting to market a release before it is actually complete), and adaptable in order to respond quickly to changes in the market and/or the competition. Having frequent milestones and show-and-tells gives visibility into progress and sets the stage to review priorities, and adjust them if necessary, with minimal impact on the efficiency of the Engineering team.