Digital Transformations – Preparing Your Data for AI

Previously published on Silicon Valley Software Group Insights in July 2023.

Digital transformation and data

In an earlier post, “Why Digital Transformations Fail – Future Proofing”, we advocate that digital transformations must “design for capabilities for which both a strong business case and well defined requirements exist”. We recommend “future proofing enough – but not too much”. Yet, in this post, we present what seems to be a contrarian view: to invest in data in a way that, at first sight, might be guilty of future-proofing.

A digital transformation is a critical event intended, among other goals, to position the company for a new stage of growth for the next 2-5 years. Typically, at the time when a company reaches a maturity level where scaling has become a strategic priority, the value of its data becomes meaningful. An intuitive explanation is that data has reached the critical mass where insights that go beyond intuition can be harvested. As a corollary, data needs to be properly architected in order to yield these insights. Furthermore, proper data collection and curation is a foundational prerequisite for building an Artificial Intelligence (AI) sub-system into the product.

Why think about data during a digital transformation?

Digital transformation empowers a company’s growth to a new stage of maturity – and new business practices. During this transformation, data generated by the product also evolves in three major directions.

The rearchitecture of data

First, data needs to be rearchitected along with the code. Often, data is the primary dimension, more important than code, that drives the architecture.

Localizing data to each microservice is a core design goal of any re-architecture project.
Optimizing performance of data access is often a key driver to increase scale. While code can be scaled horizontally almost ad infinitum, it is much more difficult to do so with data.
With growth, data has to meet more onerous security and compliance requirements – for example data locality to meet GDPR.

Future-proofing your data

When a company is small, the amount of data it holds is small. Insights can be derived by combing over a spreadsheet. As the company grows, and the data it holds becomes larger and more varied, business intelligence and data science tools can discover insights that intuition alone could not have imagined. Consequently, ensuring that data is consistent across the product, as well as with internal company data, will save a huge amount of time in the future. This is where “future-proofing” comes in. Even if there is no immediate plan to harvest product data, it is important to:

Have consistent data formats and meaning across the product, accompanied by data dictionaries.
Promote data sharing along with access and discoverability across all functions of the company, while maintaining proper security, so that each department can experiment with the data in order to gain more insights on its own operations.
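
As a minimal sketch of what a data dictionary can look like in practice, the Python snippet below defines a few dictionary entries and checks a record against them. The field names (user_id, signup_date, plan), the FieldSpec type and the validate helper are hypothetical and chosen only for illustration; a real dictionary would describe the company's own entities.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FieldSpec:
    """One data dictionary entry: the agreed type and meaning of a field."""
    name: str
    py_type: type
    meaning: str
    allowed: frozenset = frozenset()  # empty set means any value of py_type


# Hypothetical dictionary shared by the product and internal systems.
DATA_DICTIONARY = {
    spec.name: spec
    for spec in (
        FieldSpec("user_id", str, "Stable identifier of the account owner"),
        FieldSpec("signup_date", date, "Calendar date the account was created"),
        FieldSpec("plan", str, "Current subscription tier",
                  frozenset({"free", "pro", "enterprise"})),
    )
}


def validate(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        if not isinstance(value, spec.py_type):
            problems.append(f"{name}: expected {spec.py_type.__name__}")
        elif spec.allowed and value not in spec.allowed:
            problems.append(f"{name}: value {value!r} not in allowed set")
    # Fields nobody documented are flagged too, so meaning stays consistent.
    for name in record.keys() - DATA_DICTIONARY.keys():
        problems.append(f"undocumented field: {name}")
    return problems
```

Running such a check wherever data enters a shared store is one inexpensive way to keep formats and meaning aligned before any plan to harvest the data exists.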

Opportunity for increased revenue

Finally, data can be used to increase revenues – which we cover in the next section.

Examples of how data increases revenue

The power of data lies in the diversity of ways it can be applied. Rather than attempt to provide an exhaustive list of applications, this section is meant to provide examples that stimulate the imagination.

Improve decision making

Data can be universally used to improve decision making.

The simplest approach is to combine data gathered in the product with data from internal operations, for example to track which marketing campaigns are the most effective, or to predict demand and churn.

In addition, by instrumenting the product, product managers can track which features are used, or not, particularly to confirm that a newly introduced feature is seen and used by end users. Similarly, product managers can track usage patterns to identify areas of the product that are confusing, or that follow patterns that lend themselves to simplification. Finally, tracking usage patterns should confirm how users perceive the value of the product, and thus lead to pricing optimization.

Leveraging user analytics, growth marketers can directly drive revenue growth by using data generated from individuals’ interactions with the product to prompt them to purchase additional features relevant to their usage. For some companies, this is the primary driver of revenue growth.

Generate new sources of revenue

The examples below show various means to increase revenue, either by increasing engagement and the perceived value of the product (and thus increasing retention and the ability to raise prices), by increasing usage by better understanding users’ needs, or by monetizing the data directly.

Trend analysis and recommendation systems increase product and services unit sales by suggesting additional purchases based on purchase history, product similarity or purchases of users with similar profiles. While seasons or news are well understood influencers of purchase decisions, other trends can only be discovered through the application of machine learning.
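
To make "suggesting additional purchases based on purchase history" concrete, here is a minimal sketch that counts which items appear together in past orders. It is a simple co-occurrence count, not a full recommendation system, and the function names and sample baskets are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def co_purchases(baskets):
    """Count, for every item, how often each other item shared an order with it."""
    counts = {}
    for basket in baskets:
        # set() ignores duplicate lines in one order; sorted() keeps pairs stable
        for a, b in combinations(sorted(set(basket)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def also_bought(item, counts, k=3):
    """Up to k items most often purchased together with `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(k)]


# Hypothetical order history.
orders = [["tent", "stove"], ["tent", "stove"], ["tent", "lamp"]]
print(also_bought("tent", co_purchases(orders)))
```

Production systems typically add product similarity and user-profile signals on top of this kind of count, which is where machine learning comes in.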

AI-based language analysis allows a company to ‘read the minds of its users’ by analyzing all text-based and voice-based exchanges from users, as well as prospects, across all communication channels, internal or external to the company such as phone, email, chat, and social media. Companies can thus discover friction with existing features, as well as unmet needs.

Analysis of data aggregated across all of the company’s customers may reveal trends that are not visible at a smaller level, or local trends may be generalized – simply because the aggregated data pool is bigger and broader. As for all complex endeavors, a progressive approach, with measurable success milestones, is recommended. For example, a capability-driven progression could be:

1. Descriptive analytics: Document ‘what happened?’ (e.g. ‘Alert, a server crashed’, ‘N customers bought item X today’.) Nowadays, this capability is expected from any non-demo software.

2. Diagnostic analytics: Explain ‘why did it happen?’ (e.g. ‘what specific service/line of code caused the server to crash?’, ‘what drove this customer to purchase item X?’) This is expected from mature software. It is important information to improve the product on both technical and business fronts.

3. Predictive analytics: Predict what will happen. Provide insights into the future. (e.g. ‘this service requires data to be cached’, ‘people who bought this product also bought this other product’.) Thanks to the insights derived from predictive analytics, companies can drive additional revenues and optimize costs. This technology is now widely available.

4. Prescriptive analytics: Forecast ‘how can we make it happen?’ (e.g. ‘automatically increase the compute capacity for a service based on intelligence gathered from data’, ‘automatically order more supplies, or buy more advertising based on algorithms and data’). Decisions are made faster, without a human in the loop, based on the data collected. Billion-dollar companies do this. For smaller companies, gaining and applying this expertise is a clear opportunity to differentiate themselves, and get the associated lift in revenues.

5. AI-driven operations: Discover unknown unknowns. (e.g. improve predictive analytics even further by applying AI algorithms to the company’s data, or leveraging generative AI, which trains its algorithms on vast amounts of data publicly available.) AI-driven operations is leading edge technology, which requires an internal team of experts as well as sustained investment over time to fine tune the technology to the company’s use cases. At the time of this writing, generative AI is an emerging technology, whose applications are yet to be fully discovered.

Finally, provided the company obtains users’ consent, the company can sell its user-generated data.

Preparing for AI

In SVSG’s experience, it is dangerous to attempt to skip steps in the progression presented in the previous section, for the simple reason that analytics always produce a result, but do not tell you whether the result is correct, or optimal. It is easy to make a prediction; it is much harder to make a good prediction.

Capture relevant data

A critical first step is to capture all the company’s relevant data, in a clean way, as described earlier in the section “Why think about data during a digital transformation?” The importance of clean data with correct meaning cannot be overstated. Incorrect data will lead to incorrect decisions (to state the often-overlooked obvious). A commonly cited rule of thumb is that 80% of the cost of AI projects is spent on data preparation. Hence, the earlier tools and processes are put in place to curate data, the lower the cost.

Progressing through the first four levels of data-skills demonstrates the company’s skill at collecting and analyzing data correctly, and thus its readiness for AI.

Grow your AI talent

The second step is to acquire AI talent. AI is a different engineering field from software development. The best software engineer without AI education will not deliver quality AI capabilities. To be clear, both skills are needed, yet AI algorithm development is more akin to science. Once the AI team has figured out the algorithms (and the data required) to generate new revenue, then the software team jumps in to productize it.

In practice this means that time and resources for experimentation need to be budgeted for the AI team to research algorithms, tune them, and optimize them to the company’s use cases, and demonstrate business value. Naturally, as for any research project, success is not guaranteed.

AI research and development

Finally, investment in AI, both research and data operations, must be maintained. Unlike software, which can be left alone once it works, AI requires constant optimization as data and the people who generate it change. In addition, processes must be set up to guard against known side effects such as drift and bias.
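
As an illustration of what a drift check can look like, the sketch below computes the Population Stability Index (PSI) between the data a model was built on and more recent data. PSI is one common choice among several and is not prescribed by this article; the function name, the bin count and the floor value are assumptions. Both samples are assumed to be non-empty.

```python
import math


def psi(baseline: list[float], recent: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-6) for c in counts]

    b, r = shares(baseline), shares(recent)
    return sum((ri - bi) * math.log(ri / bi) for bi, ri in zip(b, r))


# Hypothetical feature values at training time and a few months later.
training = [float(x) for x in range(100)]
shifted = [x + 50 for x in training]
print(psi(training, training), psi(training, shifted))
```

A value near zero means the two distributions match; a frequently quoted rule of thumb treats values above roughly 0.25 as a shift worth investigating, although the right threshold depends on the use case.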

Data is a product

The examples above are far from exhaustive, yet they illustrate the value of data. While the harvesting of data in a data warehouse, or data mesh, may lag the digital transformation effort, it is critical that data be properly architected during the digital transformation, primarily because the cost and time to do so after the fact are so much greater.

In practice, data must be treated as a product with its own product manager(s) and development team(s). The data team’s role is to:

Nurture data to ensure accuracy, completeness, as well as correctness.
Ensure quality so that data has a well-defined and consistent meaning and format across the product and internal systems.
Provide access and tools to harvest the data across the whole organization, while maintaining security. Insights come from unplanned places and people.
Drive the company’s capabilities along the skill-progression outlined earlier. Exploring the potential use of AI requires enlisting qualified AI engineers, as well as patience in terms of time and budget for the research to demonstrate customer-value and business case.

Final thoughts

With the emergence of generative AI, ‘don’t forget data’ seems like a timid recommendation, yet for most companies, it is the necessary, difficult, first step in a pivot to a world that has become data-first.

Why Digital Transformations Fail – the Monolith Syndrome

Previously published on Silicon Valley Software Group Insights in March 2023.

A number of our engagements come from clients who experience a similar pattern of symptoms: release velocity is trending down, critical bugs pop up with each release, yet hiring more developers does not seem to improve anything. In parallel, the digital imperative, which has gained momentum over the past couple of years, whether imposed by the pandemic, or simply overall evolution, keeps building the pressure: consumers require a flawless digital experience. When the technology team does not deliver, the consequences for the business are painful: customers are disappointed, competition edges ahead and, even more heartbreaking, our clients are unable to capture the demand that their marketing has generated.

The goal of this post is to inform both CEOs and CTOs on how to diagnose what we term the “Monolith Syndrome”. As with any condition, early diagnosis vastly improves the chances of success. It is thus critical for CEOs and CTOs to know how to recognize this pattern, and take the necessary early actions. Further, it often falls on the CEO to identify the situation, because the CTO is usually consumed in trying to just keep up.

Symptoms

The symptoms of what we term the “Monolith Syndrome” look like this:

The application’s response time keeps degrading;
Outages are becoming more frequent;
As outages occur, new feature requests do not get delivered. Customer complaints rise;
Re-prioritization of the product roadmap occurs before the main features of the previous roadmap are delivered (because they took too long);
Distrust between the executive and the technology teams grows.

Like any challenge, each company faces its own flavor of the “Monolith Syndrome”, yet to the experienced eye, the pattern is easily recognizable. More fundamentally, it is absolutely normal: it occurs when a company has grown into a new stage of maturity – where a new way of running the business, including the technology, is now necessary. Like most living organisms, companies appear to grow incrementally when viewed over a short time horizon. However, when taking a step back, discrete stages become evident. On the technical front, transitioning between maturity stages calls for what is called a “Digital Transformation”.

There are multiple scenarios that require a digital transformation; the Monolith Syndrome is one of them. We will explore the others in subsequent posts.

Causes

From a technical perspective, the root causes of the “Monolith Syndrome” are often a combination of:

The architecture of the current codebase was developed more than five years ago, and has changed little since;
The code is built on a single codebase and uses a single database – hence the term “monolith”;
Development expediency has been the priority which has led to: poorly organized code, little documentation, few tests, and even fewer automated tools for QA, release and operational management;
Critical areas of functionality are implemented in “dark code”: code that was written by developers who are no longer employed by the company, and which current developers are scared to touch, because the code is difficult to understand and there is no documentation.

The Monolith Syndrome encapsulates scenarios of pain when the technology team cannot keep up with the needs of the business through “business as usual”. We described the symptoms above in technical terms. Yet, the underlying cause is that the company has grown into a different maturity level – where “what got you here” no longer works.

To be clear, a monolithic codebase is usually the right way to go in the early stages of a company: there are a handful of developers, a manageable number of lines of code, and few features that are quick to test manually. Yet, at some point in the company’s growth, the nimbleness and expediency become a detriment rather than an asset. For example, it becomes cumbersome to develop, let alone release, when twenty-plus developers are writing code in a monolith: different developers’ new code interacts in ways that create unforeseen bugs.

The underlying cause of the Monolith Syndrome is that the company has grown into a different maturity level, but not the technology team.

As a company battles through the Monolith Syndrome, the CEO and CTO have a heart-to-heart: the CEO asks “what do you need to develop new features faster?” – to which the CTO invariably answers “I need more engineers”, and then proceeds to build a “better monolith”, i.e., to continue working on the same codebase with the same processes and tools. Yet with poor architecture, software organization, and documentation, the extra developers only create more confusion and barely accelerate development velocity. The root cause of this lack of progress is that the business side has gone through a change of paradigm, but not the technology team.

Again, this is why it is the CEO, who understands the business context, who needs to recognize the pattern.

The Proper Mindset

In order for the transformation to be successful, everyone needs to have the proper mindset:

Recognize that this effort is the “price of success”. Understand that current architecture, code, tools, etc. were not a mistake – no one deserves blame. On the contrary, they were optimal for the previous stage of maturity. Now that the business has grown, and evolved, technology also has to transform to a more mature architecture.
The goal of the transformation is not to update to the latest and greatest technologies, but rather to identify the technologies most appropriate for the foreseeable needs of the business.
The transformation will require a set of skills that is typically not present in-house. Rare are the CTOs who have successfully led digital transformations. Hence, it is usually wise to enlist the help of technical leaders who do have this experience.

SVSG’s Framework

SVSG uses the following framework:

Re-align the technology to the business: understand the main stakeholder journeys (customer and employee), which have likely evolved since the current architecture was designed.
Design the architecture – and data models – before coding, based on the new stakeholder experiences, as well as needs for scale, resilience, security, etc.
Incorporate the full business context such as scale, security, resiliency, etc.
Design an incremental migration path from the current state to the desired state. For example, start breaking up the monolith by creating one additional microservice, validating its design before moving on to a second microservice.
Evangelize that the transformation goes beyond architecture and code. The whole development process, from end to end, must align with the company’s new stage of growth.
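
The incremental migration path in the framework above can be sketched as a routing rule that sends only the already-migrated part of the product to its new microservice and leaves everything else on the monolith. This is one common way to stage such a migration, not necessarily the approach SVSG uses; the path prefix and service names are hypothetical.

```python
# Hypothetical path prefixes that already have their own microservice.
# The table starts with a single entry and grows one validated service at a time.
MIGRATED = {"/invoices": "invoice-service"}


def route(path: str) -> str:
    """Name the backend that should handle a request path."""
    for prefix, service in MIGRATED.items():
        # Match the prefix itself or anything below it, but not "/invoicesX".
        if path == prefix or path.startswith(prefix + "/"):
            return service
    return "monolith"  # everything not yet migrated stays where it is


for p in ("/invoices/42", "/users/7"):
    print(p, "->", route(p))
```

Because the fallback is always the monolith, a second entry is added to the table only once the first microservice has proven its design in production.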

Final Thoughts

Digital transformations are rare events in the life of a company. Technology leaders are usually selected and trained to design and build technology incrementally. Unless you have gone through it before, detecting that your company might be experiencing the Monolith Syndrome is an unusual, and difficult, challenge for both CTOs and CEOs; but when the symptoms arise, it’s important to act swiftly if the business is to keep up with its growth.

Technical Due Diligence For Companies On The Cusp Of High Growth

Published on Forbes Technology Council 12/27/2022

You are ecstatic: You just executed a term sheet with a startup, which, thanks to your large investment, will grow two to three times each year for the foreseeable future (i.e., two years). Now begins the hard work of ensuring that the CTO delivers the technology and features laid out on the product roadmap. Yet, sustaining high growth, defined (arbitrarily here) as growing revenues at more than 100% per year for at least two years, requires a different playbook than a more mundane growth rate. For example, bigger hardware may accommodate the first doubling of traffic, but the second or third will likely require substantially different software and data architectures, which need to be planned long in advance.

While it is not an investor’s job to identify or address these challenges, the return on investment will ultimately depend on how well and how timely the portfolio company manages them. This article provides pointers on what investors should know and look out for during technical due diligence, as well as post-investment.

The Difference Between High Growth And Regular Growth

In general, growing at a high rate raises four types of challenges.

• Tough Technical Challenges

Handling twice the traffic, with twice the amount of data stored, leads to a different category of problems, technically, compared to handling 10% more traffic. In addition, when you decide to build a new architecture because your traffic is doubling every year, you actually need to design for 10 times the traffic so that you do not go through the same exercise again each year.

• Incremental Changes No Longer Effective

Changes need to be performed in discrete steps. As illustrated above, as traffic surges, incremental measures (e.g., bigger hardware) will keep the business going for a while, but a new architecture needs to be analyzed, designed, implemented and deployed rapidly. Because this work is complex, it needs to start early—well before the real pain starts. Furthermore, the transition to the new architecture often presents a more complex challenge than the new architecture itself.

• The Need For Everything To Change At Once

Along with technical changes in the architecture and the tech stack comes the need to deliver more features faster. This, in turn, requires more engineers as well as a new team organization, along with new tools and new processes.

• Changing Nonfunctional Requirements (NFR)

As the company grows and acquires bigger customers, securing data, meeting regulatory compliance, protecting privacy, preventing downtime and ensuring business continuity take heightened importance. While security might not appear critical for a company managing $10 million worth of transactions, it becomes critical when $100 million flows through the platform. Growing companies often miss this because a slow evolution over time eventually adds up to a category-changing situation.
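
The reasoning behind the "design for 10 times the traffic" guideline under Tough Technical Challenges can be checked with a short calculation: if traffic multiplies by a constant factor each year, the time a given capacity lasts is a ratio of logarithms. The function name is an assumption made for illustration, and constant year-over-year growth is a simplification.

```python
import math


def years_of_headroom(capacity_multiple: float, annual_growth: float) -> float:
    """Years until traffic growing `annual_growth`x per year fills the capacity."""
    return math.log(capacity_multiple) / math.log(annual_growth)


# Traffic doubling every year, as in the example above.
for multiple in (2, 4, 10):
    print(f"{multiple}x capacity lasts {years_of_headroom(multiple, 2.0):.1f} years")
# prints:
# 2x capacity lasts 1.0 years
# 4x capacity lasts 2.0 years
# 10x capacity lasts 3.3 years
```

Designing for only twice the current traffic therefore buys a single year, while a 10x target buys a little over three years at a doubling rate, and about two years if traffic triples annually.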

Where Technical Due Diligence Should Focus

The first step when reviewing a company prior to investment is to identify and quantify impediments to growth. For example, is the amount of technical debt such that even a minor increase in traffic or features will create serious risks of downtime? Do the CTO and the technical leadership have the talent and experience for the design and implementation of the next-generation architecture? Does the CTO have the business acumen, in addition to the technical expertise, to align technical operations with the evolving business?

Next, the plans for growth need to be examined. Are they aggressive enough in scope as well as technology to meet the anticipated growth? How well developed are the plans: Are they conceptual, or do detailed designs exist along with development plans? How robust is the new architecture design? Without detailed plans, the product roadmap is aspirational rather than achievable.

In our investigations, we often see parallel roadmaps for the product, technology and NFR, each assuming access to the same resources. This is a recipe for disaster; fuzzy resource plans lead to fuzzy budgets, misalignment with the CEO and confusion about the allocation of the newly invested funds. The worst case scenario is to find out six months after a deal has closed that the engineering budget needs a 25% increase to deliver the product roadmap because the resources to upgrade the architecture, scalability or security were double counted.
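
The double-counting scenario above can be made concrete with a few lines of arithmetic. In this sketch all figures are hypothetical: three roadmaps each claim engineers from the same funded headcount, and the function reports the fractional increase needed to staff them all at once.

```python
# Hypothetical full-time-engineer demand per roadmap for the same period.
roadmap_demand = {"product": 16, "technology": 3, "nfr": 1}
budgeted_engineers = 16  # headcount the budget actually funds (assumed)


def budget_gap(demand: dict[str, float], budgeted: float) -> float:
    """Fractional budget increase needed to staff every roadmap at once."""
    return max(sum(demand.values()) - budgeted, 0) / budgeted


gap = budget_gap(roadmap_demand, budgeted_engineers)
print(f"Shortfall: {gap:.0%} more engineering capacity needed")
# prints: Shortfall: 25% more engineering capacity needed
```

Summing demand across all roadmaps against a single resource pool during due diligence surfaces this kind of gap before the deal closes rather than six months after.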

Recruiting and new employee onboarding are often overlooked activities, but when they’re performed poorly, they are a huge, yet hidden, drain on productivity. Because high growth often entails increasing the size of the team quickly, engineers must spend time interviewing prospects. When the recruiting process is poor, candidates do not meet standards, and desirable prospects accept offers from other companies.

As a consequence, engineers end up spending a lot more time in interviews, and building the team takes longer than it should, thus delaying the product roadmap. In addition, frustration builds because time spent in interviews is rarely factored in project scoping, causing delays in projects. Investing time upfront in building efficient recruiting and onboarding processes will be recovered many times over.

Companies rarely have everything figured out. The purpose of the review is not to give a “beauty contest” score but rather to determine whether critical changes need to take place before the company is ready to fully “step on the accelerator,” as well as how much these changes will cost and how long they will take. Getting technical debt to an acceptable level, hiring a new CTO, building a baseline of automated regression tests—all these projects can easily take one or two quarters and commensurately affect the growth rate and revenue.

Conclusion

High growth differs materially from traditional growth by the breadth and speed of the changes that are needed, thus requiring a different playbook. Investors need to know whether a company is ready from day one, whether it will require time to pay down technical debt and whether its growth plans are ready for execution. A lack of readiness can easily consume two quarters, which is a long time in the startup world. It may determine whether the company will dominate its market or get edged out by a faster competitor.

Lessons Learned From 50 Technical Due Diligence Reviews For Acquirers

Previously published on Forbes on August 12, 2022

Management teams seem to forget a critical rule when acquiring another company: The original product road maps of both acquiring and acquired companies must be delayed by at least one quarter. The reason is simple: Resources from both acquiring and acquired teams need to dedicate this time to merging the technology stacks, tools and processes of the two companies.

In a prior article, I covered the technical review needed prior to an investment. An acquisition requires additional work, which I’ll cover here.

The benefits of buying a company are easy to get excited about: New market segment, new customers to which to upsell the current product, new technology, etc. Yet, the effort and time needed to realize these benefits are often overlooked. Whether because of time pressures or over-exuberance, the acquiring management team often glosses over the intricacies of integration, oversimplifying the work needed, which results in a vastly underestimated budget, human resources and time.

In the worst case, the impact goes beyond delaying the benefits of the acquisition—because existing resources must be reallocated to the integration of the acquired company, the acquiring company’s original product road map itself is delayed, resulting in lower revenues. By engaging in thorough technical due diligence (tech DD) the acquiring management team can avoid these pitfalls.

Tech DD will force answers to tough questions on the future operation of the combined entities:

• Will the two products run side-by-side (simpler initially but likely costlier to operate), or will they merge into a single platform (challenging initial integration efforts and generating multiple long-term benefits)?

• What is the long-term technology stack—and how much effort will it take to get there? Even with similar technology stacks, framework versions have to be aligned, along with templates, design patterns, log aggregation, performance monitoring, etc. Tool stacks must be evaluated: code repository, CI/CD toolchain, identity framework, test automation, application monitoring and alerting, security, etc. There are often dozens of such evaluations to make.

• For each tool or framework that differs between the two companies, an analysis of “merge” versus “siloed” must be made comparing the upfront costs of merging versus the long-term savings. The absence of automated tests often increases the effort and risk of merging, whether it entails refactoring code or changing tools.

• On the other hand, keeping siloed not only duplicates costs but reduces knowledge sharing and increases the overall complexity in releasing features, as well as managing a more fractured team.

• On the operations side, migrating data centers is no easy task. The more a product leverages the services offered by a cloud provider, the more complex the migration is, whether it is for databases, container orchestration or management consoles.

• Unifying data is another challenge: Something as apparently simple as standardizing the attributes and representation of core entities in the system (e.g., a user) demands lengthy detailed analysis and code refactoring.

• Who will execute the technical integration? At least initially, the most valuable members of both teams are needed to make the critical evaluations. As a corollary, what projects will be neglected, and which new features will be delayed? How does this impact customers and projected revenues?

• Alternatively, outside contractors can be brought on to handle the temporary surge of work caused by the integration. In practice, because of the overhead of onboarding contractors, this approach works best if working with an existing partner—or one that the company intends to work with for the long term.

• How quickly, and through what processes, must the acquired company rise to the security and compliance requirements of the acquiring (larger) company?

• Were expectations properly managed? In the euphoria of the deal, double-dipping often happens. The sales team expects that the two companies’ road maps will be delivered unaltered, while the financial team expects cost savings from the two companies’ synergies. In addition, the integration budget is often severely underestimated.

As an illustration, imagine a company running on AWS with a tech stack based on Node.js and RDS/PostgreSQL acquiring a company running on Azure with a .NET tech stack. What is the cost/benefit of running the two products “as is” on separate software infrastructure, versus migrating to AWS and/or Node.js? An alternative might be to acquire a competitor of the target company that runs natively on an AWS/Node tech stack, if one exists, even if its business position is not as strong. A simpler integration will accelerate the time-to-market for the combined company, making up for the initial comparative disadvantage.
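
One simple way to frame each "merge" versus "siloed" decision is a break-even calculation: how many months of lower running costs it takes to repay the one-time cost of merging. The sketch below uses made-up figures and a hypothetical function name; it ignores risk, delayed revenue and discounting, all of which a real analysis would need to consider.

```python
from typing import Optional


def breakeven_months(merge_cost: float, siloed_monthly: float,
                     merged_monthly: float) -> Optional[float]:
    """Months until the one-time merge cost is repaid by lower running costs.

    Returns None when merging never pays back (no monthly saving).
    """
    saving = siloed_monthly - merged_monthly
    if saving <= 0:
        return None
    return merge_cost / saving


# Hypothetical figures: a 600,000 migration that cuts monthly operating
# costs from 90,000 (two stacks) to 60,000 (one stack).
print(breakeven_months(600_000, 90_000, 60_000))  # prints 20.0
```

Repeating this estimate for each tool or framework that differs between the two companies gives a rough ordering of which merges to tackle first and which to leave siloed.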

In short, the amount paid to transfer ownership of the acquired company may only be a fraction of the total cost of the acquisition. Other costs stem from additional resources, financial and human, needed for the integration and from revenue offsets from delays due to integration.

At a minimum, tech DD for an acquisition will present a more realistic view of the total cost of acquisition. While tech DD will only outline the myriad “merge” versus “siloed” technical decisions that will eventually need to be made, this will force a critical examination of the integration road map, along with refined estimates of the effort and time required. With this information, the management team can de-risk the decision to acquire, build post-deal milestones and accelerate the time-to-market of the combined products.

Seven Critical Technical Due Diligence Questions For Technology Investors

Previously published by Forbes on June 20, 2022

In the excitement of having signed a term sheet, investors may be tempted to consider technical due diligence (tech DD) as a formality to assuage their colleagues and limited partners. Tech DD, however, should be considered more than a defensive tool to avoid embarrassment and the loss of the money invested.

Tech DD, when performed correctly, can limit risk and ultimately increase an investment’s return by laying out the technology milestones critical to the success of the business. With proper tech DD, investors gain agency, and thus peace of mind, in shepherding a company’s growth.

While situations such as Theranos or WeWork are extreme, my organization has encountered “unexpected” situations in the course of tech DD projects, such as:

• A company running tens of thousands of users on the Ruby-on-Rails code that it demoed for its seed round.

• A company where the code had yet to be written for a large proportion of the advertised functionality.

• A founder/CTO who had reached his/her limit of expertise and was unlikely to be the right person to lead the company in its next stage of growth.

• A company with large amounts of legacy code running core functionality without any of the engineers who wrote the code still working for the company.

Being alerted to the scenarios above, along with the estimates of the time and effort required to put the company on a solid footing for scaling, allowed the investors to rebase the financial projections with more realistic time frames.

Seven Crucial Questions For Tech DD

None of the scenarios are intrinsically deal killers, yet they likely warrant action from investors pre- or post-investment. These, and countless other scenarios like them, can often be missed if tech DD is treated as a “check-the-box” exercise. In order to limit the risk of investments, as well as provide visibility on deliverables over the next couple of years, the following questions have proven to be particularly important:

1. How reliable is the delivery schedule of the product road map? Delays in the product road map are indicators of delayed revenues since delayed features make it harder to attract new customers. In addition, the efficiency of product and engineering in managing the product road map and the associated release schedule is critical to the overall development velocity of the company.

2. Will the technology handle the user growth over the next couple of years (taking into account the technology upgrades on the road map)? Has the technology team properly scoped the complexity, time and effort for the refactoring or re-architecting needed to reach the projected scale?

3. Are non-customer-facing aspects of technology aligned with the maturity, size and market of the company? Companies in high-growth mode can easily lose track of the product’s security, resiliency and business continuity. Similarly, it is difficult to ensure that tools and processes for QA, CI/CD, operations are upgraded in line with growth.

4. Does the tech team have a plan to maintain its velocity while scaling? This question should go beyond the software architecture and address how and when the organization, tools, processes and metrics will adapt in engineering and operations.

5. Does a new CTO need to be hired (or other technical leaders)? Is the technology leadership team ready for the next phase? How well have they mapped out the next big set of projects?

6. Are all the technology projects in the budget? Do they have the proper funding, staffing and time estimates?

7. Does the company have uniquely differentiated intellectual property? Intellectual property is rarely about patents. Rather, investors want to know whether the company has built a “defensible competitive moat” through market research, unique use of available technologies, proprietary technology or algorithms (e.g., for data science or machine learning).

How Investors Can Leverage Tech DD Findings

The benefits to investors who embrace the tech DD process outlined above materialize in the form of one evaluation and two numbers.

• The ultimate evaluation is that of risk. Has the riskiness of the investment increased dramatically? It’s crucial to understand whether the investor will need to be more involved than planned in monitoring how well the company executes or possibly spend time supporting the management team.

• The first number is the set of quarterly revenue projections, and whether they need to be adjusted based on the information received during the review. A delay in features, or in scalability, will likely delay revenues and thus ultimately reduce the value of the company. In the worst case, the company could lose out to a more nimble competitor.

• The second number is the amount to be invested in the company. Does this number need to be adjusted to account for delayed revenues, increased costs from a larger than planned technology team or unanticipated development?

An important additional benefit of this effort occurs when investors review the tech DD findings with the company’s management team and align expectations. This reduces the likelihood of unpleasant surprises post-investment.

In terms of deliverables, investors should expect an overall assessment of the technology and the technical team’s ability to deliver the features, customer-facing and not, that underlie the product road map and thus the revenue projections.

Whether this assessment matches their own will determine whether their risk projection for the deal needs to be adjusted. In addition, investors should receive a quarter-by-quarter list of technology deliverables that are critical to the success of the company. With this information, investors improve the odds of the company meeting its plan by taking actions early, in collaboration with the company, to set it up on a path to success.

Lessons Learned From 50 Technical Due Diligence Reviews, Part 1

Previously published on Forbes on 3/18/2022

Over the past couple of years, I’ve led, in collaboration with other CTOs in my company, about 50 technical due diligence reviews, primarily for the benefit of venture capital firms and sometimes for M&A deals. The target companies ranged in maturity from early stage to a hundred million dollars in revenues.

Occurring after a term sheet has been signed and before the full contract is executed, a proper technical due diligence review is far more than an evaluation of a snapshot in the life of a company. It evaluates the ability of the target company’s technical team to deliver the technology that underlies the growth objectives of the company in the next two years.

Having performed these technical due diligence reviews across a variety of industries and company sizes has allowed us to empirically identify patterns, which I’m sharing here in a series of articles for the benefit of founders, CEOs, CTOs, investors and acquirers. My goal is to help each participant be more effective in these situations. In this first part of the series, I’ll start with founders, CEOs and CTOs.

1. Embrace the technical due diligence process.

First, technical due diligence is good news: It means that an investor, or acquirer, is committed to investing in your company. Furthermore, the presumption about the technology is positive—after all, it got you this far.

Don’t ruin this positive vibe by being coy with information or holding back on providing detailed information about your technology and what makes it unique under the guise of protecting the company’s intellectual property (European companies seem more prone to doing this). NDAs protect you. Withholding information simply causes more questions and, thus, more emails and more time on Zoom.

In the worst case, resisting standard requests for information raises questions on what you have to hide. Said differently, it’s impossible for your technical due diligence review provider to provide good recommendations on something they haven’t seen.

In fact, the worst conclusion that we can report to our clients is that the target company doesn’t have any differentiated intellectual property. Consequently, rather than be secretive about your algorithms and technology, “sell” your review provider on how innovative you are. Impress them, and they’re likely to share their enthusiasm with your investors.

2. Use technical due diligence to your advantage.

There’s no right or wrong architecture. By definition, the current architecture is pretty good because it allowed your company to grow to this stage. During the many technical due diligence reviews we’ve performed, we’ve seen the same categories of problems solved successfully with different technical stacks. A good technical due diligence review provider should be polyglot and agnostic. What matters is its understanding of the strengths and weaknesses of the technical stack and how it needs to evolve based on how the business will evolve.

We also know from personal experience that nothing is ever perfect, particularly in a high-growth company. What matters is to demonstrate the awareness of what works and what doesn’t, as well as the decision-making process that’s guided trade-offs over time.

Granted, having to provide documents and answer questions for the technical due diligence review may seem like a huge waste of time. But how often do you get a chance to have experienced peers review your architecture, code, processes and tools? Most CTOs we work with tell us that they learn a lot from the questions that are asked. A question like, “What factors led to the selection of a given framework?” implicitly guides the discussion toward whether these factors will be relevant in the next two years and whether others need to be included as well.

3. Technical due diligence is all-encompassing.

Investors care less about your current state than whether you have a realistic assessment of it (including the problems you haven’t yet solved) and of what it will take to meet the growth numbers you posted in the pitch deck. A good technical due diligence review provider will evaluate the team (the CTO, specifically) as well as the technology.

Sharing how you think, your approach to problem-solving and how trade-offs are made among budget, features, time and resources shows that you’re open and confident. In addition, it lays a solid foundation for the working relationship with the investors over the next three to 10 years. This is similar to job interviews: The interviewer cares more about how you think about a problem than whether you’ve memorized the correct answer.

4. Technical due diligence is nonbinary.

As mentioned, technical due diligence occurs between the signing of the term sheet and that of the contract—in other words, after the deal has already been made. The investors, or acquirers, really want the deal to happen. They don’t ask our opinion about the deal. They just want us to help them paint a picture of what the technology journey will be over the next two years.

In cases in which we’ve suggested that the CTO has reached their peak or the software needs to be rewritten, this has rarely canceled a deal. Instead, investors may decide to reduce their pre-money valuation, increase their investment amount (e.g., to pay for the rewrite) or rework the business plan with the management team. Very rarely do they walk away from the deal, and to our knowledge, it’s never because of technology only.

5. Technical due diligence is forward-looking.

Technical due diligence is the start of the collaboration between the CEO, CTO and the future board members. As technical leaders, you’ll want to demonstrate that you understand the needs of the business and how to architect the technology, team, tools and processes to support these needs over the next two years.

As I’ll illustrate in a forthcoming article in this series, technical due diligence should be considered not as a test but as an opportunity to have a conversation about what lies ahead. Ideally, this conversation should take place internally prior to fundraising, which will likely result in a smooth technical due diligence review.

The CTO’s Yearly Checklist

Previously published on Forbes on 8/19/2020

In a startup, as in any adventure, one needs to raise one’s head toward the horizon once in a while to ensure that one is still headed in the right direction. Well-run companies typically hold quarterly executive off-sites, and at least once per year, the product road map is refreshed.

This is the perfect impetus to refresh everything in engineering: technology stack, tools, methodology, team and employee roles. Technology, tools or processes that used to work may become inadequate, or even break, as the company grows. A well-executed yearly review will identify the key challenges and opportunities for the following year, and thus allow you to identify the key decisions to be made inside engineering and to prepare for these decisions.

While the executive review of the product road map will focus on the execution part of the road map, it is equally important to lead an innovation review within the engineering team to ensure that you retain your technology leadership against the competition.

Finally, in order to have an effective yearly review, a lot of work must be done prior to the review (in order to inform the product road map decisions), as well as after it (in order to reflect the new product road map).

Before The Product Road Map Review

During the product road map review, the executive team will usually concentrate on customer-facing features and will ask for dates for key deliverables. In order to make this discussion as effective as possible, you need to research what the likely top requests will be. In addition, you need to identify technical debt, as well as noncustomer-facing features (quality, robustness, performance, business continuity, compliance/security) that must be addressed — and build a business case for each of these, along with timing and resource allocations.

Because your development capacity, and thus your velocity on both technical debt payback and customer-facing work, is determined by the resources available, you need to negotiate your budget for the coming year in parallel with building your plans. Conversely, making commitments to a product road map without a clear idea of resources available will lead to uncomfortable discussions later.

With a good idea of the major engineering projects in place, you can refresh your technology road map and discuss the new technologies you need to acquire in order to deliver next year — whether this technology is inside the product or part of your internal tools. For example, have there been any significant advances in AI, cloud computing or analytics that will improve your efficiency or increase your competitive differentiation?

Finally, a good retrospective of the team will complete the preparation for the annual review. Based on this year’s accomplishments and next year’s objectives, how does the team need to evolve? How do you need to evolve? Do you need to radically improve quality? Will your market demand a step up in security? Who on the team has delivered beyond expectations? Do you need to take new classes or get a mentor? A thorough retrospective should involve a broad consultation with people inside and outside the engineering team.

During The Product Road Map Review

Product road map review meetings — particularly when part of an executive off-site — are usually intense affairs with lots of passionate discussions (usually a good thing). As CTOs, we must accomplish two critical objectives:

1. Avoid committing to any delivery dates on the spot, unless we have absolute clarity on both requirements and resource availability. However, you must provide estimates of scope for key features to inform decisions on priorities.

2. Ensure that the most important deliverables on the road map have well-documented business cases, from which it will be straightforward to extract precise requirements.

After The Product Road Map Review

Even when the yearly product road map review does not bring major surprises, the aftermath always entails a lot of work, which consists of delivering the actionable product road map and figuring out the changes necessary to execute this road map — beyond writing the code.

An actionable product road map is a commitment from the engineering team to deliver certain features by certain dates. This implies that the budget has been finalized, requirements and resources are clear, and you have done a detailed-enough design and task breakdown to make these commitments with enough confidence and buffer that you will not disappoint your customers.

In parallel, you must solidify your plans to refresh how you innovate, as well as how you execute.

On the technical side, you need to complement the customer-facing product road map with your internal technology road map, your technical debt payback plan, and your tools and infrastructure upgrade plans.

Finally, and too often forgotten, the organization must be refreshed: Team structure, culture, metrics, methodology, communication processes, technical skills and talent all need to be reevaluated with the active contribution of the teams’ leaders.

This massive effort culminates with extensive communications: The product road map, once it has become actionable, is shared with the business teams inside the company. In addition, when sharing the road map with the engineering team, it is critical to highlight the planned improvements in engineering, which will make this road map realistic, along with associated growth opportunities for each individual. This communication must be well orchestrated through all-hands, team and individual meetings so that every single engineer continues to be motivated, challenged and rewarded by the year ahead.

Finally, you need to give your team the tools for success, whether building up your direct reports and delegating more, defining new challenges to feed your continued motivation, learning new ways to lead, or implementing new technologies.

It is a lot of work to properly prepare and execute this yearly review. Yet, like most planning exercises, it usually bears fruit through the very process of thinking about the future. Going into a new year with a well-thought-out and well-communicated actionable product road map provides a guiding path for everyone inside, and outside, the engineering department.

For Machine Learning, It’s All About GPUs

Previously published in Forbes on December 1, 2017

Isn’t it curious that two of the top conferences on artificial intelligence are organized by NVIDIA and Intel? What do chip companies have to teach us about algorithms? The answer is that nowadays, for machine learning (ML), and particularly deep learning (DL), it’s all about GPUs.

In a previous article, I made the case to every CEO and CTO that “Machine learning allows us to make even better use of the data we have, as well as the data we don’t currently possess, and answer the questions we didn’t know we should ask.”

As more companies build AI-driven products, technology providers are responding to this demand by providing products that are computationally more powerful and easier to use and manage in production.

GPUs are driving the next wave of breakthroughs.

Why GPUs Are So Important To Machine Learning

GPUs can have almost 200 times more cores per chip than a CPU. For example, an Intel Xeon Platinum 8180 processor has 28 cores, while an NVIDIA Tesla K80 has 4,992 CUDA cores. While a CPU core is more powerful than a GPU core, the vast majority of this power goes unused by ML applications. A CPU core is designed to support an extremely broad variety of tasks (e.g., render a webpage, drive word processors and enterprise software, manage peripherals) in addition to performing computations, whereas a GPU core is optimized exclusively for data computations. Because of this singular focus, a GPU core is simpler and has a smaller die area than a CPU core, allowing many more GPU cores to be crammed onto a single chip. Consequently, ML applications, which perform large numbers of computations on a vast amount of data, can see huge (i.e., 5 to 10 times) performance improvements when running on a GPU versus a CPU.
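The core-count comparison above can be checked with a few lines of arithmetic. This is only a sketch using the two figures quoted in the paragraph; the function name `core_ratio` is introduced here for illustration.

```python
def core_ratio(gpu_cores: int, cpu_cores: int) -> float:
    """Return how many GPU cores there are for each CPU core."""
    return gpu_cores / cpu_cores


# Core counts as quoted in the text above.
XEON_PLATINUM_8180_CORES = 28
TESLA_K80_CUDA_CORES = 4992

if __name__ == "__main__":
    ratio = core_ratio(TESLA_K80_CUDA_CORES, XEON_PLATINUM_8180_CORES)
    print(f"GPU cores per CPU core: about {ratio:.0f}x")  # about 178x
```

The raw ratio (about 178) is far larger than the 5 to 10 times speedup cited, which is consistent with the point that each individual GPU core is much less powerful than a CPU core.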

Having recognized this fundamental fact a few years ago, the tech industry, particularly the ML crowd, has focused its efforts on taking advantage of the GPU. However, this is not a simple task. All layers of the compute stack have to be redesigned to take advantage of the GPU’s power.

Recent Developments For GPUs

NVIDIA has so far been the main provider of GPU chips for ML acceleration. The company has powered the AWS compute-optimized instances for the past year.

Furthermore, chip manufacturers are about to release chips that are architected specifically for ML from the ground up (rather than continuing to optimize GPUs, which were originally designed for graphics processing). NVIDIA is shipping the Tesla V100, which incorporates Tensor Cores designed specifically for DL, in addition to GPU cores. Google announced its Tensor Processing Unit (TPU) last year that powers its main services: Google Search, Street View, Photos and Google Translate. Finally, Intel announced this month its Nervana Neural Processor, which was also architected, in collaboration with Facebook, to optimize neural network computing.

Building The GPU Compute Stack

Having super-fast GPUs is a great starting point. In order to take full advantage of their power, the compute stack has to be re-engineered from top to bottom.

• Servers

A new category of servers needs to be built to feed the beast. This is necessary to send (and store) data to the GPU at the rate at which it is capable of consuming it, requiring up to 10x improvement in bandwidth.

NVIDIA just started shipping its DGX-1 server. Data throughput and storage have been optimized in order to take full advantage of the processing power of the eight Tesla-V100 processors included in the box.

Facebook recently announced its second generation of AI-hardware (“Big Basin”) to power its own core services: speech and text translations, photo classifiers and real-time video classification.

• Data Center

An article I wrote last month highlighted the impact of ML for cloud providers. Since then, new GPU-related developments have emerged.

Google just made its TPUs available on its compute platform.

Intel just announced its Nervana DevCloud, which is limited for the time being to research and experimentation.

Finally, a super-computing veteran of 45 years is entering the fray. Leveraging its decades of experience in high-performance computing (HPC), Cray will soon be offering its supercomputers for rent on Microsoft Azure. These servers can host a large number of NVIDIA Tesla GPUs.

• Frameworks, Models And Algorithms

Optimized hardware requires optimized software. All cloud providers have optimized the major frameworks (TensorFlow, PyTorch, Caffe, MXNet) for their platforms. Furthermore, GPU vendors are rewriting the major models and algorithms (NVIDIA DIGITS, Intel Nervana Graph) to take full advantage of the GPU’s power.

Through the GPU Open Analytics Initiative, companies such as MapD (DB, visualization) and H2O (ML) are rewriting fundamental technologies like databases and programming languages in order to eliminate data copies, which, if ignored, may significantly increase overall execution time.

Finally, some technologies have reached a degree of fidelity high enough to be offered as services: AWS, Google and Microsoft each offer various flavors of speech recognition, translation and synthesis. Similarly, China’s Megvii’s face recognition service has become very popular.

• The Edge

For some applications, the ML models that have been trained in the data center must be computed at the edge (i.e., close to the end user). In the case of autonomous driving, for example, the car’s brain is trained in the data center but must be run in the car.

Now that machine learning has become mainstream in the data center, dedicated products are being released for edge computing. For example, NVIDIA provides the Drive PX family of accelerator cards that host 1-4 GPUs, as well as multiple video and other sensor inputs. They can thus power anything from simple highway driving today to fully autonomous driving in the future.

A New GPU-Driven ML Landscape

From this whirlwind survey of innovation driven by GPUs, one can anticipate increases in processing power of two to five times over the coming months, from which a second wave of machine learning breakthroughs is bound to emerge, allowing us to solve a brand-new class of challenges.

How Machine Learning Will Disrupt The Established Cloud Providers

Previously published in Forbes on October 24, 2017

In the past few years, new categories of products have emerged thanks to the extraordinary advances in machine learning (ML) and deep learning (DL). These new techniques power product recommendations, computer-aided diagnosis in medical imaging and self-driving cars, just to name a few.

Most ML and DL algorithms require compute profiles (hardware, software, storage, networking) that are significantly different from those optimized for traditional applications. Consequently, as more and more companies develop their own ML/DL solutions and deploy them to production, the demand for the ML-optimized compute resources will grow dramatically and create opportunities for new entrants to offer solutions that compete with today’s dominant cloud providers: Amazon AWS, Microsoft Azure and Google Cloud.

The ML/DL Cloud Is Different

In an article on Mesosphere’s blog page, Edward Hsu presented the case that web applications are now primarily data-driven. Consequently, a new set of frameworks (a.k.a. stacks), namely SMACK (Spark, Mesos, Akka, Cassandra, Kafka), must replace the traditional LAMP (Linux, Apache, MySQL, PHP) stack used to build web-based applications. In my view, rather than replacing LAMP, SMACK will coexist side by side with, and feed data to, traditional web-based frameworks, which are still needed to present nice-looking webpages and interface with mobile phones.

Yet the main point is well-taken. We need to update Marc Andreessen’s famous line about how “Software is eating the world” to “Data is eating the world.” Let’s unpack this statement and derive the consequences.

Hardware

The disruption created by machine learning and deep learning extends well beyond the software stack into chips, servers and cloud providers. This disruption is rooted in the simple fact that GPUs are much more efficient processors for ML and DL than traditional CPUs.

Up until recently, the solution was to augment traditional servers with GPU add-on cards. We are now at a point where demand for ML/DL computing is such that special-purpose servers, optimized for ML/DL compute loads, are being built.

Data centers are also being re-architected to support the extremely large amount of data consumed by ML and DL. Imagine you are designing the brains for self-driving cars. You need to process thousands and thousands of hours of video (and other such signals as GPS, gyroscopes, LIDAR) to train your algorithms. The amount of data that a Tesla on the road records in one second is a million times larger than a tweet or a post on Facebook.

ML/DL data centers thus require both huge amounts of storage and extremely high bandwidth.

Software

The software side is even more complex. A new infrastructure stack, typically using machine learning-specific frameworks such as TensorFlow (originally developed by Google) or PyTorch (originally developed at Facebook), is required to shepherd data around and manage the execution of the compute jobs. Furthermore, open-source code libraries (pandas, scikit-learn, matplotlib) are used to implement the models (e.g., neural networks, data displays). These model libraries are critical because they are optimized both for ease of use in algorithm research and for high performance in production.

Finally, each vendor offers complete building blocks for specific use cases. For example, Amazon Lex, Google Cloud Speech and Microsoft Bing Speech provide speech recognition and can even recognize intent. Each has its own API and unique behavior, making the migration from one vendor to the other time-consuming.

New Entrants

In addition to the Big Three cloud providers (Amazon AWS, Microsoft Azure and Google Cloud) that have offered GPU-accelerated instances for a few years, new ML-optimized offerings have emerged:

• NVIDIA, which is already the dominant provider of GPUs that power the graphics cards that drive computer displays, recently introduced a portfolio of servers, described as “purpose-built AI supercomputers” and known as its DGX systems.

• Servers.com offers its Prisma Cloud with dedicated GPU-optimized servers.

• Rescale, one of the niche cloud providers that focuses on high-performance computing (HPC), just announced the availability of the latest generation of GPU-powered servers, along with high-bandwidth interconnect, to create high-performance multi-node clusters.

What’s At Stake

The Big Three cloud providers are the ones most immediately at risk of being disrupted by new entrants such as NVIDIA, Servers.com and Rescale. ML/DL innovation is still running at a torrid pace thanks to innovation in algorithms as well as compute efficiency. This is creating a small arms race where end users are constantly looking for the provider that can give that extra edge.

On one hand, end users are benefiting hugely from this arms race to provide the best software and hardware compute environment. On the other, this requires constant vigilance to keep abreast of the latest offerings. Even more importantly, when deploying ML/DL products to production, CEOs and CTOs need to pick the winner — or at least a future survivor — that will keep their edge for the next two to five years. This is not an easy task.

We will delve deeper into these two topics in future posts — stay tuned.

DevOps-Driven Development

It is now time to add the concept of “DevOps-Driven Development” to our repertoire.

“Test-driven” development, which originated around the same time as Extreme Programming and Agile Development, encourages us to think about testing as we architect our software and plan our tasks. Similarly, a “DevOps-Driven Development” approach ensures that we consider operational implementation as well as the deployment process during the design phase. To be clear, DevOps thinking needs to augment (and not replace) testing strategy.

Definition and Motivation

First, a definition: I am using the word DevOps here as a shortcut to include both DevOps (build and deployment tools) and Ops (IT/data center Operations).

How many times have you heard “ … but it works on my machine!!” from a developer whose code was found to have a bug in the QA environment or, worse, in production? We all agree that these situations are a horrible waste of time for all involved, most of all customers. This post thus advocates that DevOps-thinking, just as quality-thinking, must occur at the design phase and continue throughout the development of the software until the software is released to production, and even after it has been released in production.

Practicing DevOps-Driven Development

I have always advocated: “If you don’t know how to test it, you don’t know how to design it.” (Who Owns Quality? Part 3), to articulate the fact that “quality cannot be debugged out, it has to be designed in”. Similarly, if we want to know – before our customers call us – when our code crashes in Production, or becomes unusably slow, then we must build into our code the proper instrumentation and administration capabilities.

We now must add this mantra “If you don’t know how to deploy it and manage it in Production, you don’t know how to design it”.

Just like we don’t allow code to be merged into Trunk (main branch) without complete unit tests, code cannot be merged into Trunk without correct deployment scripts, release notes, and production instrumentation.
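A merge gate of this kind can be approximated with a simple check on the files touched by a change. The sketch below is a minimal illustration, not a complete policy: the directory names (`tests/`, `deploy/`, `release_notes/`, `monitoring/`) are hypothetical repository conventions, and a real gate would need exceptions for changes that genuinely require no deployment or instrumentation update.

```python
from typing import Dict, Iterable, List

# Hypothetical repository layout, assumed purely for illustration.
REQUIRED_ARTIFACTS: Dict[str, str] = {
    "unit tests": "tests/",
    "deployment scripts": "deploy/",
    "release notes": "release_notes/",
    "production instrumentation": "monitoring/",
}


def missing_artifacts(changed_files: Iterable[str]) -> List[str]:
    """Return the artifact categories that the change set does not touch."""
    files = list(changed_files)
    return [
        name
        for name, prefix in REQUIRED_ARTIFACTS.items()
        if not any(path.startswith(prefix) for path in files)
    ]


if __name__ == "__main__":
    change_set = [
        "src/billing/invoice.py",
        "tests/test_invoice.py",
        "deploy/billing.sh",
    ]
    gaps = missing_artifacts(change_set)
    if gaps:
        print("Merge blocked, missing:", ", ".join(gaps))
    else:
        print("Merge allowed")
```

Run as a pre-merge step, a check like this turns the mantra into something the build can enforce rather than something reviewers have to remember.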

Here is a “thinking DevOps” check list:

Deployable

First of all, we must ensure that the code deploys successfully not only in Production but in all environments: Dev, QA, Stage, etc.

This implies:

Developers write/update release notes, e.g. highlighting any changes required in the configuration of the environments: open a new port, add a column in the database, add a new property in config files, etc.
Developers, in collaboration with the DevOps team, update deployment scripts, e.g. to account for a new executable or schema changes in the database.

The management of Config/Property files is beyond the scope of this blog, but I strongly recommend the “Infrastructure as code” approach: i.e. fully automating server/image configuration for deployment and, managing configuration, deployment scripts and application property files under source code control.
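As an illustration, a forgotten property is easier to catch before a release than during one. The sketch below assumes each environment's property file has already been parsed into a dictionary; the function name, environment names and property keys are hypothetical, chosen only for the example.

```python
def find_missing_keys(env_configs):
    """Return {environment: [keys it lacks]} relative to the union of all keys."""
    all_keys = set()
    for settings in env_configs.values():
        all_keys.update(settings)

    gaps = {}
    for env, settings in env_configs.items():
        missing = sorted(all_keys - set(settings))
        if missing:
            gaps[env] = missing
    return gaps


# Hypothetical property files, already parsed into dictionaries.
configs = {
    "dev": {"db.url": "jdbc:dev", "cache.ttl": "60"},
    "qa": {"db.url": "jdbc:qa", "cache.ttl": "60"},
    "prod": {"db.url": "jdbc:prod"},  # cache.ttl was never added here
}
gaps = find_missing_keys(configs)
```

Run as part of the build, a check of this kind could fail the merge when `gaps` is not empty, which is one way of holding deployment artifacts to the same standard as unit tests.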

Monitor-able

If we want to detect problems before our (irate) customers call us, our code needs to be monitor-able – not only at the physical server level, but also each virtual machine, service and process, as well as networking and storage systems.

Monitor-ability needs go beyond keeping track of CPU load, disk space and network bandwidth. We, as developers, (should) know which parameter(s) indicate that our system is mis-behaving, whether it is a queue exceeding a given size or certain operations timing out. As a consequence, we must publish these parameters to interfaces compatible with Ops monitoring tools, of which there are several categories:

Functionality (e.g. Nagios, …) – is the service up and processing requests?
Performance (e.g. New Relic, AppDynamics, Dynatrace, …)
Usability (e.g. MixPanel, Flurry, …)

Furthermore, by making performance metrics easily observable, we ensure that each new release maintains (or improves) the performance of the prior release.
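To make this concrete, here is a minimal sketch of an in-process registry that publishes application-level parameters as JSON. It is not the API of any of the tools named above: the class, the metric names and the thresholds are assumptions for illustration, and a real system would expose the snapshot through whatever interface its monitoring tool expects.

```python
import json
import threading
import time


class Metrics:
    """Thread-safe registry of application-level gauges and counters."""

    def __init__(self):
        self._lock = threading.Lock()
        self._gauges = {}
        self._counters = {}

    def set_gauge(self, name, value):
        with self._lock:
            self._gauges[name] = value

    def increment(self, name, amount=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def snapshot(self):
        with self._lock:
            return {
                "timestamp": time.time(),
                "gauges": dict(self._gauges),
                "counters": dict(self._counters),
            }

    def to_json(self):
        return json.dumps(self.snapshot(), sort_keys=True)


def check_thresholds(snapshot, max_values):
    """Return the sorted names of metrics that exceed their allowed maximum."""
    observed = {**snapshot["gauges"], **snapshot["counters"]}
    return sorted(
        name for name, limit in max_values.items() if observed.get(name, 0) > limit
    )
```

The application would call `set_gauge("order_queue_depth", ...)` or `increment("payment_timeouts")` at the points where developers know trouble starts, and the same snapshot can be compared from one release to the next.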

Diagnosable

Despite our best intentions, we must humbly assume that at some point our code will crash, or seriously mis-behave, and thus require troubleshooting. In the worst case, Development will be called in (usually in the wee hours of the night) to assist the Ops team. As anyone who has had to figure out why a given system intermittently crashes will attest, having log files capture meaningful information prior to the incident is invaluable. Having to add logging statements after the fact is a painful process. Consequently, solid Logging Hygiene is critical (and worthy of a dedicated post):

Log statements must be written in a format compatible with the log management system (Splunk, GrayLog2, …)
All log statements used only for debugging during the coding and QA phases must be removed
Comprehensive Operations-focused logging must be added to document all operations that may fail due to environmental and data-related problems: out-of-memory, disk full, time out, user not found, access denied, etc. These are not bugs, but failures due to either environment (e.g. a server or connection is down) or incorrect data (e.g. the user has been deleted).
The hierarchy of logging levels must be enforced so that in normal operations log files are kept small, and conversely meaningful information is output when troubleshooting is required
Log statements must include all the information necessary to bind all operations across various services that are related to a single user-level transaction (e.g. clicking on a link to a new page, adding an item to cart) – more details below in “Tunable”.
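A minimal sketch of these rules using Python's standard logging module follows. I am assuming a log management system that ingests one JSON object per line; the field names, the `transaction_id` variable and the `build_logger` helper are illustrative, not a prescribed format.

```python
import contextvars
import io
import json
import logging

# ID of the user-level transaction currently being handled ("-" when unknown).
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "transaction_id": transaction_id.get(),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service_name, stream, level=logging.WARNING):
    """Create a logger that stays quiet in normal operations (WARNING and above)."""
    logger = logging.getLogger(service_name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


# Example: only the operational failure is written at the default level.
stream = io.StringIO()
log = build_logger("cart-service", stream)
transaction_id.set("txn-42")
log.info("cache warmed")                   # suppressed in normal operations
log.error("user not found: id=%s", 1234)   # data-related failure, not a bug
```

Lowering the level to `logging.DEBUG` through configuration is what turns on the more verbose output when troubleshooting is required.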

Security

This again is worthy of its own post, but code that is deployed to Production must both support the security practices implemented by the Ops team (e.g. Authentication protocols, networking infrastructure), and ensure that the code itself is secure (e.g. no SQL injection, buffer overflow, etc).

Business Continuity

Business continuity is often overlooked, but we must ensure that any persistent data is stored in a storage system that is backed up by the Ops team. In other words, if we add a new database, we’d better ask the Ops team to add it to their backup scripts.

Similarly, if our infrastructure is deployed (or even just deployable) across multiple data-centers, our code must support this through configuration.

The above requirements represent the basic DevOps requirements that any developer must address before even thinking that his/her code is ready to release. The following details additional practices that are highly recommended, but not strictly necessary.

Scalable

The code must be designed so that the Ops team can scale it in the datacenter without needing help from Development.

This may involve deploying the code to a bigger server. This implies that the code can be configured (and documented for the Ops team) to make use of the expanded resources, whether it is the number of cores, RAM, threads, I/O, etc.

This may also involve adding instances to a cluster. Consequently, the code must be discoverable (the load balancer must find out that an instance has been added or removed), as well as cluster-aware (e.g. stateless).

Tunable

Because it is so hard to simulate all real-life user activities and behaviors in non-production environments, we must provide tools to the Ops team to tune the performance of our code through configuration rather than code deployment (e.g. size of JVM, number of threads, queue sizes, hash table size, etc).
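As a sketch of what “tuning through configuration” can look like, the code below reads two knobs from a settings mapping (for example `os.environ`) and sizes a thread pool accordingly. The setting names `WORKER_THREADS` and `QUEUE_SIZE` and their defaults are assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Tunables:
    worker_threads: int
    queue_size: int


def load_tunables(settings):
    """Read tuning knobs from a mapping, applying defaults and sanity checks."""
    default_threads = os.cpu_count() or 1  # use the cores of a bigger server
    threads = int(settings.get("WORKER_THREADS", default_threads))
    queue_size = int(settings.get("QUEUE_SIZE", 1000))
    if threads < 1 or queue_size < 1:
        raise ValueError("WORKER_THREADS and QUEUE_SIZE must be positive")
    return Tunables(worker_threads=threads, queue_size=queue_size)


def build_executor(tunables):
    """Create the worker pool with the size chosen by the Ops team."""
    return ThreadPoolExecutor(max_workers=tunables.worker_threads)
```

With this in place, the Ops team can change `WORKER_THREADS` and restart the service; no code deployment is involved, and a nonsensical value fails at startup rather than under load.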

We must thus provide the metrics to observe performance. Let’s take the example of response time: depending on the complexity of the application, a user request may be handled by tens, or even hundreds, of services. In order to allow the Ops team to build a timeline of the interactions between all the services involved, each log entry must carry at least one tag that identifies the root transaction that generated the request. Otherwise it is impossible to determine whether the performance degradation comes from a given service, from a single server, or even from the network infrastructure.

The same tagging will be used to troubleshoot failures (e.g. to discover why a given service fails intermittently).
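The sketch below shows how such tags can be used once the logs are collected. It assumes each entry is a dictionary with `transaction_id`, `service` and `timestamp_ms` fields (hypothetical names), and it measures a service simply as the time between its first and last entry in a transaction, which ignores nested calls.

```python
from collections import defaultdict


def build_timelines(entries):
    """Group log entries by transaction ID, each group sorted by timestamp."""
    timelines = defaultdict(list)
    for entry in entries:
        timelines[entry["transaction_id"]].append(entry)
    for events in timelines.values():
        events.sort(key=lambda e: e["timestamp_ms"])
    return dict(timelines)


def slowest_service(timeline):
    """Return (service, duration_ms) for the longest first-to-last span."""
    first_seen, last_seen = {}, {}
    for event in timeline:  # timeline is already sorted by timestamp
        first_seen.setdefault(event["service"], event["timestamp_ms"])
        last_seen[event["service"]] = event["timestamp_ms"]
    durations = {s: last_seen[s] - first_seen[s] for s in first_seen}
    return max(durations.items(), key=lambda item: item[1])


# Entries from several services, interleaved as they would be in a log store.
entries = [
    {"transaction_id": "t1", "service": "gateway", "timestamp_ms": 0},
    {"transaction_id": "t2", "service": "gateway", "timestamp_ms": 5},
    {"transaction_id": "t1", "service": "cart", "timestamp_ms": 50},
    {"transaction_id": "t1", "service": "gateway", "timestamp_ms": 10},
    {"transaction_id": "t1", "service": "pricing", "timestamp_ms": 460},
    {"transaction_id": "t1", "service": "cart", "timestamp_ms": 450},
    {"transaction_id": "t1", "service": "pricing", "timestamp_ms": 500},
]
timelines = build_timelines(entries)
culprit = slowest_service(timelines["t1"])
```

Without the transaction tag, the `t2` entry would be indistinguishable from the `t1` entries and the grouping step would not be possible.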

QA-able

As I mentioned in an earlier blog, QA does not stop in QA: we have to anticipate “unknown unknowns”, i.e. usage (or performance) scenarios that we have not modeled in our QA environments. By definition, there is not much we can do other than ensuring that our code is easy to troubleshoot (see above) and that logs and associated data can be made available easily and rapidly to the developers and QA team (e.g. by giving them access to the log management console).

Sometimes this requirement is more complex than it sounds, e.g. when user data must be deleted or obfuscated for privacy or security reasons. Again, this should be thought through before code is deployed.
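One possible approach, sketched below, is to replace user identifiers with keyed pseudonyms before logs are shared. I am assuming here that the sensitive values are email addresses and that a secret key is held by the Ops team; the regular expression is deliberately simple and the function names are hypothetical. Whether pseudonymization is sufficient for a given privacy requirement is a separate question.

```python
import hashlib
import hmac
import re

# Simplified email pattern; real data may need a stricter or broader one.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value, secret_key):
    """Map a value to a stable pseudonym that cannot be reversed without the key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user-{digest[:12]}"


def scrub_line(line, secret_key):
    """Replace every email address in a log line with its pseudonym."""
    return EMAIL_PATTERN.sub(
        lambda match: pseudonymize(match.group(0), secret_key), line
    )
```

Because the same address always maps to the same pseudonym under a given key, developers can still follow one user's activity across log lines without seeing who the user is.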

Analytics – Growth Hacking – Usability

This last requirement stems from Marketing and Sales rather than Operations, but it is equally important since it drives revenue growth.

In most companies, marketing and sales rely on usage reports to drive new marketing campaigns, pricing, product offerings and even new features. As a consequence, any new feature must integrate with the Analytics infrastructure, whether via integration with usage tracking applications (e.g. Mixpanel, Flurry, …) or simply log management consoles (Splunk, GrayLog2, …). However, I highly recommend using separate logging infrastructure for operations monitoring and for usage analytics, if only because usage analytics requires additional data that is not useful for Operations monitoring (e.g. the time a user spends on a page is extremely valuable for usage analytics but irrelevant for Operations).

Even More So for Microservices

As we migrate towards a microservices architecture, early “DevOps thinking” becomes even more critical. As “Microservices: Four Essential Checklists when Getting Started” advises: “Microservices introduces a lot of moving parts that were previously non-existent in a monolithic system”.

What was a monolithic application running in a single virtual machine can morph into 5, 10 or even 20 microservices. Consequently, Development, DevOps and Ops must collaborate on microservices infrastructure tools (service registration, scaling each service up or down independently, health monitoring, error detection, etc.) to provide visibility into the status of these 20 microservices as a whole. This challenge has even prompted dedicated product categories (SignalFx, Nirmata, etc.).
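To illustrate what “status as a whole” can mean in code, here is a small sketch that runs one health probe per service and rolls the results up. The probes are plain callables and the service names are hypothetical; this is not how the products mentioned above work, only the general idea.

```python
def aggregate_health(checks):
    """Run each service's health probe and summarize the system as a whole."""
    results = {}
    for service, check in checks.items():
        try:
            results[service] = "up" if check() else "down"
        except Exception as exc:  # a probe that fails counts as an outage
            results[service] = f"error: {exc}"
    unhealthy = sorted(name for name, state in results.items() if state != "up")
    overall = "healthy" if not unhealthy else "degraded"
    return {"overall": overall, "unhealthy": unhealthy, "services": results}


def failing_probe():
    raise TimeoutError("probe timed out")


report = aggregate_health(
    {"cart": lambda: True, "pricing": lambda: False, "search": failing_probe}
)
```

In practice each probe would call the service's own health endpoint, which is one more capability that has to be designed into every microservice from the start.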

Summary

Only with a holistic approach to product architecture can we ensure customer satisfaction with software that works the first time, and all the time. Deployment and operations management concerns, just like testability, must be addressed at design time, so that these capabilities are meshed natively into the code rather than “bolted on” after the fact. Failing to do so will likely impact the delivery schedule, or worse, create outages in production.

More importantly, there is so much we can learn from observing how our code behaves in Production (operational efficiency, stability, performance, usability) that we would do ourselves a disservice if we did not use this valuable information to drive further improvements to our product.
