How To Make The CEO-CTO Relationship Work, Part Two

Previously published on Forbes on July 5, 2019

The relationship between CEO and CTO is pivotal to the success of technology-driven companies. Yet, the personalities and working styles of these driven individuals can be different, which sometimes leads to suboptimal results. I had the experience of joining a company with an established CEO and of greeting a new CEO to my company, so I decided to write two letters to help CEOs and CTOs get on the same page.

This is the counterpart to my last article, “How To Make The CEO-CTO Relationship Work”: It’s the letter that I wish I had received from my CEOs, and it gives CTOs tips on how to operate and communicate most effectively in service to the CEO and the executive team.

Dear CTO,

I know you have a brilliant and creative mind and an impressive mastery of technology, along with a solid track record of developing world-class products. As you may have guessed, your technical skills alone will not suffice for your success as an executive and as a productive working partner to me. To ensure our joint success, I want to share advice with you about how we can most profitably combine our efforts.

Let’s start with a pair of obvious observations. First, your colleagues on the executive team, myself included, do not have a technical background. Second, the purpose of the company is to grow as rapidly as possible by delivering products that users want and to generate income.

These two realities may clash with your natural tendencies as a gifted creator, particularly when it comes to the technical sophistication of products. Developing the coolest, fastest and slickest product is not always the best business strategy — particularly if it takes a long time. We will need to develop a partnership that allows us to make decisions that include both business needs and technical options. Not every release needs to be perfect in terms of scalability, usability, security, and every other technical consideration. Yet every release must meet the company’s business objectives of the moment. In order to achieve this, you can learn to never say “no,” but rather to present trade-offs, and explain them in terms of their business impact rather than their technical features (which we don’t understand). For example, if we need to deliver on an aggressive schedule, we need you to inform us of what is feasible within the desired time frame in order to achieve the desired business outcome. Do we need to license technology, take away specific features or limit some aspects of the product?

In a similar vein, the team as a whole will benefit enormously if you hone a new kind of creativity, or rather add a new dimension to your technical creativity. This new dimension is one that meets the needs of our customers in new ways, that identifies new markets that we can expand into easily, and that drives the growth of the company. This is a rare talent — one that combines creative understanding of the market with technical innovation.

Your (non-technical) peers on the executive team need you to use language that they understand; we know that you’ve mastered the technical ins and outs. Also, don’t mistake us for your sounding board — rather, you can go to members of your team for that. What is meaningful to us is the impact on the business. Often, it simply boils down to this binary outcome: whether or not we will meet our sales projections for the quarter. Meeting our quarterly objectives is paramount — it ensures we get to “fight another day” — and for that opportunity, we may occasionally ask you to temporarily compromise on technical purity or the efficiency of the engineering team.

We also ask you to be strong. At times, the executive team may “groupthink” into an idea that’s really bad from a technical perspective. Should we do so, we’ll need you to stand your ground and find a way to communicate to us — in terms that we understand — the errors of our ways. Use the technical facts as a foundation to illustrate the business outcomes. You are the only person in the company who knows what it will take to deliver a certain product, what technology, team, methodology, tools, and so on are best suited, and ultimately how long it will take to deliver the product to our customers.

I will do my best to listen when these situations arise. Even so, this process is not easy: You don’t want to give up simply because you are in the minority. Perhaps the hardest part is that, once you are confident that the executive team understands both the engineering costs and the business consequences of its proposal, you’ll need to let the team make the decision. A typical scenario is when an important new feature is prioritized ahead of a major software re-architecture. Shipping the new feature on the old architecture will require rewriting it once the new architecture is complete. Yet, sometimes this inefficiency is the “right call”: for example, if it makes lighthouse customers happy and blocks out the competition.

Finally, understand that we welcome your input on all topics — not just technology and engineering. I’ve worked with remarkable CTOs who were brilliant business strategists, marketers, and even salespeople. While we seek your input, the final decision belongs to the designated executive team member.

These skills and contributions are all essential to the success of our shared enterprise, and you should develop them while retaining the qualities that inspired us to hire you in the first place. While I have emphasized communications and business acumen, your top priority remains to be a world-class innovator and technical leader. I will help you acquire these new skills over time so that your influence can reach its full potential within the executive team and as a partner to me, but you should continue (and I can’t help you here) to be a world-class technologist.

I hope you will find these tips useful, and I look forward to building a strong partnership together.

Sincerely,

CEO

How To Make The CEO-CTO Relationship Work

Previously published on Forbes on June 17, 2019

The success of a venture-backed company usually depends on two main factors: its technical innovation and the velocity with which it introduces new products. In order to sustain these competitive advantages throughout their growth, companies must ensure that the delicate relationship between the CEO and the CTO is effective.

The CEO and CTO have a fluid relationship that changes over time. As the company grows, the relationship evolves because the executive team expands beyond the original founders. Investors may also replace the CEO with “a real business person.” Sometimes, the CTO decides to leave the company and its politics to found yet another company.

I’ve experienced this rapidly shifting dynamic from both sides: as an outside CTO coming in to replace, or supplement, the founding CTO, and as a CTO welcoming a new CEO after the VCs replaced the founder CEO. In both scenarios, I have observed (and suffered from) misaligned expectations between the CTO/VP of engineering and the CEO that lead to frustration and a lack of effectiveness on both sides.

With the benefit of hindsight, I have written two letters. The first, which I will present here, is one that I wish I had written to my CEOs so they could have understood the nature of my job, my contribution and how to get the best out of me. The other is the letter that I wish I had received from my CEOs, so that I could have understood how to be most effective not only in leading the engineering team but also in my role on the executive team.

Here’s the letter that I, as a CTO, wish I had written to my CEOs:

Dear CEO,

I want to thank you for placing your trust in me to be the new CTO of your incredible company. During the interview process, I thoroughly enjoyed our exchanges, and I was equally impressed by your past accomplishments, your business sense, your knowledge of the market and your drive.

Since you mentioned that you are “not technical,” yet you are responsible for leading a company whose success is highly dependent on the strength of its technology, I thought that I would take a running start in our relationship-building by sharing my thoughts on what will make our relationship effective.

My primary advice is that you allow me to do the things I am good at without second-guessing me. You hired me because I have proven more than once that I can build and lead a team of world-class engineers and launch world-class products into the market. While I expect to be challenged, like every member of the executive staff, when I say that developing a new feature will take three months, please don’t ask if it could be done in two weeks. I too want to win. The three-month figure will not come out of thin air, as my team and I will have spent time coming up with this number. If we ever need to build something with roughly the same features in two weeks, it will have to be an extremely watered-down version that we’ll call “demo-ware” (which does have its place in certain circumstances), or we’ll need to pare the release down to one or two features.

For my team to succeed, I will also need you to work with the whole executive team to create an actionable product road map. By “actionable,” I mean that the priority of the features needs to be vetted by the business team and that the engineering team will need to be given the time to estimate the scope of major features so that the time frames published on the road map are realistic. If we follow this process, a sanitized version of the road map can be shared with the sales team and even customers.

The other major benefit of an actionable road map is that the engineering team can build a technology road map that will allow us to develop breakthrough features because we’ll have had time for research, experimentation and prototyping. Conversely, a road map that zigzags is not conducive to engineering efficiency because it wastes the time spent on design and planning work required for major features that are deprioritized. All of us in engineering understand that sometimes a major opportunity presents itself and that the whole company has to pivot to take advantage of it. We embrace those opportunities because we want to win just as strongly as you do. Yet the decision to pivot should consider the impact on engineering velocity as well as the new business potential.

Building a good product road map requires that we understand each other about schedule estimates: Loose requirements, changing priorities, a high velocity of development and accurate schedule estimates are not compatible. If you — and by extension, the business — require reliable schedule estimates, then engineering needs precise requirements that do not change, plus the time to work out a solid design from which a list of tasks and a schedule can be derived. If the nature of the business requires frequent changes of priorities, then let’s not bother with detailed estimates. Since it is a rare business that does not see priority changes, I strongly recommend that both the business and engineering teams embrace lean product and agile development methodologies.

Finally, at the risk of stating the obvious, engineers have different personalities than salespeople. When the engineering pen is quiet, it is not an indication of low morale. On the contrary, it shows that engineers are focused on writing code. I know that can be disconcerting to extroverts.

We’ll have to move fast in the journey we have undertaken together, and to do that, we need to communicate directly and trust each other. This letter is my attempt to do this, and if you’ve made it this far, there’s a good chance that we are at the start of a productive and fruitful partnership. I can’t wait.

Bernard Fraenkel

CTO

The letter I wish I had received from my CEOs will be published in a subsequent article.

The Art Of Technical Due Diligence

Previously published in Forbes on April 11, 2019

Technical due diligence (TDD) takes place once an investor (such as a venture capitalist, private equity manager or another company) has decided to invest in, or acquire, a technology company. Once they make this decision, they have limited time to dig into the company in order to ensure that its technology, its engineering team and its development velocity are as advertised.

As someone who has experienced this process from both sides — and whose company provides it — I understand the stress TDD sometimes causes. That’s because a poorly executed technical due diligence initiative can derail a deal and hurt the bottom line for investors — as well as the management team of the company receiving the investment — so it is worth understanding how to do it right as a CTO.

TDD Isn’t A Beauty Contest

After participating in dozens of TDD projects, I’ve learned that there are many ways to solve a technical challenge — including using frameworks or programming languages that I personally wouldn’t touch. It is thus critical to put aside one’s own ideas of technical purity and “the right way of doing things” during TDD and to have an open mind about how technology can be used. (You could even learn something along the way.)

Furthermore, it’s important to be clear about the purpose of TDD. Investors who have already made the decision to invest in a company should assume that its technology is good enough today. What they want to know instead is whether the company can execute its business plan. A technical review should thus stay away from judging the beauty of today’s product architecture and take a more dynamic view to validate whether the technical team can deliver the future that the company has drawn for itself.

This dynamic perspective is all the more important because, for companies lucky enough to grow at a fast pace, life is messy. Code architecture is constantly evolving, and documentation is often incomplete and out of date. Concurrently, penetrating new markets and creating new major features often requires introducing novel technology or considerable re-architecting. Companies rein in this apparent chaos through major core projects that temporarily do not produce user-facing features yet allow the engineering team to maintain velocity in the long run. A wise TDD will use these projects to discern between the “normal growth-driven chaos” and signs of any additional structure the company may need to reach a new stage of growth. An investor should expect one, two or even more of these fundamental projects in a two-year time frame.

It’s All About The Product Road Map

The product road map is the engineering team’s commitment to the company to deliver specific features, products and capabilities on a given schedule. The management team, in turn, makes revenue projections based on the availability of these new features. Consequently, delays in the product road map can have a direct impact on the company’s revenue stream, and thus on its valuation.

Beyond giving the product road map a simple thumbs up or down, your technical due diligence should provide actionable information about the upcoming 24 months, including critical dependencies, risk factors and major technical milestones that will usher in product milestones. As a TDD assessor, you should gather this information to track the success of your investment over the short- to mid-term.

In order to evaluate what the technical team must accomplish in order to execute the product road map, you should:

• Capture the business context of the road map

• Understand the business objectives for the next two years or more

• Evaluate today’s technical foundation to appreciate whether it can support future plans

• Internalize the future plans

• Evaluate the team’s ability to deliver these plans — and to mitigate risks

Understanding The Business Context Is Critical

Technology serves the business. It follows that you should assess technology in the business context of the company: Consider market (consumer, enterprise, or government), space (finance, health, social, tools and so on) and company maturity (five versus 10,000 enterprise seats and 1,000 versus 1 million daily users) as a few obvious dimensions. “Scalability” or “security” have very different meanings depending on the company’s business context — and so do the solutions. Similarly, you should evaluate talent, processes, tools and operational playbooks differently based on the business context.

Skills And Experience Matter

Nowadays, many companies use multiple technology stacks. As a consequence, if you’re a CTO performing TDD, you should be “multilingual” so you can evaluate all components of the technology.

To assess development velocity, your investigation should also show how well the code is written and organized and include an evaluation of the tools for test automation, continuous integration/continuous deployment, data center deployment, monitoring, alerting, business intelligence, data science and so on. In addition, assessing a company’s specific expertise in artificial intelligence has become a must in many industries.

As if all that were not enough, TDD assessors should understand engineers as well. It is critical to assess individual and collective talent on the team, as well as organizational dynamics and methodology.

Finally, because so many of the risks and critical milestones can depend on the maturity of the company, one of the most important skills that you can bring as a CTO performing TDD is the ability to identify the inflection points in the company’s growth, assess the impact on technology and translate insights to the technology team: For example, what new technology requirements will the company have when it has reached product-market fit and enters the growth stage? For this work, there’s no substitute for “I’ve been there.”

The Good News Is Also Important

In parallel to identifying what could go wrong, it is critical to highlight the company’s unique strengths. This starts with its intellectual property (whether it’s patentable or not), and it includes unique sources of talent, internally developed tools and methodologies that increase development velocity and difficult-to-recreate data sets … all of which may have been overlooked by non-technologically-inclined investors. Ultimately, it is the balance of a company’s unique strengths and weaknesses that will determine its success, and a good due diligence report will highlight that.

For a company seeking investment, TDD may seem like an unnecessary hurdle; however, when it’s properly conducted, TDD adds value and insight for both the investor and the startup.

Prediction: Self-Driving Car Manufacturers Will Own The Car Insurance Business

Previously published in Forbes on July 2, 2018

Can you picture the day when your car insurance bill drops every month? This could very well happen as self-driving car manufacturers (SDCMs) take over the car insurance business.

As it turns out, SDCMs have several powerful incentives to do so.

Their primary motivation is to remove an adoption barrier to self-driving cars: The cost, and even the availability, of car insurance could deter consumers, as well as the new generation of “taxi” companies, from purchasing an autonomous vehicle. Today’s incumbent car insurance companies do not have statistical tables for accidents and fatalities for self-driving cars since self-driving cars are not yet in circulation. As a consequence, they are likely to be conservative and set high initial costs for insuring autonomous vehicles.

By contrast, SDCMs will have the next best thing to real-life statistics — they have data centers full of data not only about accidents but also about near misses (albeit for their own cars only). This means they can generate accurate statistics about accidents of their own cars as often as they want and thus estimate the cost to insure their cars. An SDCM will be able to turn a barrier to adoption into a potential sale.

Furthermore, by offering car insurance themselves, the self-driving car manufacturers not only remove a barrier to the adoption of their product but also project confidence in it. In addition, SDCMs will improve a customer’s purchase experience by eliminating one painful step in the car purchase process (because who enjoys shopping for car insurance?), as well as a third party in the process. Even better, pricing for car insurance will be greatly simplified since the most important variable in the pricing equations, the human, will be taken out of the system. The price of insurance will be determined by the hardware and software installed in the car, not by the human driver. Whether it’s a 16-year-old who just passed their driver’s license exam, a soccer mom with 15 years of accident-free driving or a retired senior, the price will be the same assuming the technology in the car is the same in all three scenarios.

By the way, according to a US Market Research Report on automobile insurance by IBISWorld, the industry’s revenues totaled $259 billion in 2017. This is no small market, which, in and of itself, provides ample motivation for the self-driving car manufacturers to enter it.

Since they will have actual real-time data on accidents and fatalities, SDCMs will radically drive down the cost of car insurance and make car ownership more affordable, thus expanding their market. Furthermore, reducing accidents is one of their primary business drivers to increase adoption. This will provide another incentive to drive down the cost of car insurance.

Since self-driving cars will only be commercialized once SDCMs have proved that they are safer than human-driven cars, at that point in time, SDCMs will be in a position to compute the exact probabilities of accidents and their cost because they will have all the data in their data centers. Beth Buczynski is correct in predicting in her article “With Self-Driving Cars, Auto Insurance’s Time Is Limited” that the cost of auto insurance will fall significantly over time and that consumers will no longer pay directly for auto insurance. However, auto insurance will not disappear, because self-driving cars won’t eliminate all accidents. This liability will no longer be carried by consumers but either by the “robot-taxi” companies or by the SDCMs.
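To make the idea of computing accident probabilities and their cost concrete, here is a minimal sketch of how an expected-cost premium could be derived from fleet data. Everything in it is an assumption made for illustration: the accident classes, the incident rates, the claim costs, the 25% loading factor and the function names are hypothetical and do not come from any manufacturer or insurer.

```python
from dataclasses import dataclass


@dataclass
class AccidentClass:
    name: str
    incidents_per_million_miles: float  # hypothetical rate observed in fleet data
    average_claim_cost: float           # hypothetical average cost per incident, in dollars


def expected_annual_claim_cost(classes, miles_per_year):
    """Sum of (incident probability per mile x miles driven x cost per incident)."""
    return sum(
        c.incidents_per_million_miles / 1_000_000 * miles_per_year * c.average_claim_cost
        for c in classes
    )


def annual_premium(classes, miles_per_year, loading=0.25):
    """Expected claim cost plus a hypothetical loading for overhead and margin."""
    return expected_annual_claim_cost(classes, miles_per_year) * (1 + loading)


# Hypothetical fleet statistics for one hardware/software release.
fleet_stats = [
    AccidentClass("minor collision", 2.0, 4_000.0),
    AccidentClass("major collision", 0.2, 50_000.0),
]

premium = annual_premium(fleet_stats, miles_per_year=12_000)
print(f"Annual premium: ${premium:.2f}")  # prints "Annual premium: $270.00"
```

Note that the driver does not appear anywhere in the calculation: the only inputs are statistics tied to the technology in the car, so lower incident rates for a new release would translate directly into a lower premium.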

Most importantly, SDCMs will be able to offer car insurance from the get-go as soon as they market self-driving cars because they will be able to offer it at a much lower price than traditional insurance companies. SDCMs will need this cost reduction to help offset the additional cost of the autonomous driving equipment in order to reduce the total cost of ownership of their product.

Finally, since the price of insurance will be determined by how smart the autonomous driving system is, each time the car manufacturer (or the vendor of the autonomous driving software) publishes a new release, the cost of insurance could come down. I can’t wait.

For Machine Learning, It’s All About GPUs

Previously published in Forbes on December 1, 2017

Isn’t it curious that two of the top conferences on artificial intelligence are organized by NVIDIA and Intel? What do chip companies have to teach us about algorithms? The answer is that nowadays, for machine learning (ML), and particularly deep learning (DL), it’s all about GPUs.

In a previous article, I made the case to every CEO and CTO that “Machine learning allows us to make even better use of the data we have, as well as the data we don’t currently possess, and answer the questions we didn’t know we should ask.”

As more companies build AI-driven products, technology providers are responding to this demand by providing products that are computationally more powerful and easier to use and manage in production.

GPUs are driving the next wave of breakthroughs.

Why GPUs Are So Important To Machine Learning

GPUs have almost 200 times more processors per chip than a CPU. For example, an Intel Xeon Platinum 8180 processor has 28 cores, while an NVIDIA Tesla K80 has 4,992 CUDA cores. While a CPU core is more powerful than a GPU core, the vast majority of this power goes unused by ML applications. A CPU core is designed to support an extremely broad variety of tasks (e.g., render a webpage, drive word processors and enterprise software, manage peripherals) in addition to performing computations, whereas a GPU core is optimized exclusively for data computations. Because of this singular focus, a GPU core is simpler and has a smaller die area than a CPU core, allowing many more GPU cores to be crammed onto a single chip. Consequently, ML applications, which perform large numbers of computations on a vast amount of data, can see huge (i.e., 5 to 10 times) performance improvements when running on a GPU versus a CPU.
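As a quick check of the arithmetic, the two core counts quoted above give a ratio of roughly 178, which is where “almost 200 times” comes from. The short sketch below uses only those two figures; the variable names are illustrative.

```python
# Core counts as quoted in the paragraph above.
xeon_platinum_8180_cores = 28
tesla_k80_cuda_cores = 4992

ratio = tesla_k80_cuda_cores / xeon_platinum_8180_cores
print(f"GPU cores per CPU core: {ratio:.0f}")  # prints "GPU cores per CPU core: 178"
```

The gap between this ratio and the 5 to 10 times speedup cited above reflects the point made in the text: each CPU core is individually more powerful than a GPU core.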

Having recognized this fundamental fact a few years ago, the tech industry, particularly the ML crowd, has focused its efforts on taking advantage of the GPU. However, this is not a simple task. All layers of the compute stack have to be redesigned to take advantage of the GPU’s power.

Recent Developments For GPUs

NVIDIA has so far been the main provider of GPU chips for ML acceleration. The company has powered the AWS compute-optimized instances for the past year.

Furthermore, chip manufacturers are about to release chips that are architected specifically for ML from the ground up (rather than continuing to optimize GPUs, which were originally designed for graphics processing). NVIDIA is shipping the Tesla V100, which incorporates Tensor Cores designed specifically for DL, in addition to GPU cores. Last year, Google announced its Tensor Processing Unit (TPU), which powers its main services: Google Search, Street View, Photos and Google Translate. Finally, Intel announced this month its Nervana Neural Processor, which was also architected, in collaboration with Facebook, to optimize neural network computing.

Building The GPU Compute Stack

Having super-fast GPUs is a great starting point. In order to take full advantage of their power, the compute stack has to be re-engineered from top to bottom.

• Servers

A new category of servers needs to be built to feed the beast. This is necessary to send (and store) data to the GPU at the rate at which it is capable of consuming it, requiring up to 10x improvement in bandwidth.

NVIDIA just started shipping its DGX-1 server. Data throughput and storage have been optimized in order to take full advantage of the processing power of the eight Tesla V100 processors included in the box.

Facebook recently announced its second generation of AI-hardware (“Big Basin”) to power its own core services: speech and text translations, photo classifiers and real-time video classification.

• Data Center

An article I wrote last month highlighted the impact of ML for cloud providers. Since then, new GPU-related developments have emerged.

Google just made its TPUs available on its compute platform.

Intel just announced its Nervana DevCloud, which is limited for the time being to research and experimentation.

Finally, a supercomputing veteran of 45 years is entering the fray. Leveraging its decades of experience in high-performance computing (HPC), Cray will soon be offering its supercomputers for rent on Microsoft Azure. These servers can host a large number of NVIDIA Tesla GPUs.

• Frameworks, Models And Algorithms

Optimized hardware requires optimized software. All cloud providers have optimized the major frameworks (TensorFlow, PyTorch, Caffe, MXNet) for their platforms. Furthermore, GPU vendors are rewriting the major models and algorithms (NVIDIA Digits, Intel Nervana Graph) to take full advantage of the GPU’s power.

Through the GPU Open Analytics Initiative, companies such as MapD (DB, visualization) and H2O.ai (ML) are rewriting fundamental technologies like databases and programming languages in order to eliminate data copies, which, if ignored, may significantly increase overall execution time.

Finally, some technologies have reached a degree of fidelity high enough to be offered as services: AWS, Google and Microsoft each offer various flavors of speech recognition, translation and synthesis. Similarly, the face recognition service from China’s Megvii has become very popular.

• The Edge

For some applications, the ML models that have been trained in the data center must be computed at the edge (i.e., close to the end user). In the case of autonomous driving, for example, the car’s brain is trained in the data center but must be run in the car.
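The split described above, training in the data center and running at the edge, can be illustrated with a deliberately tiny sketch. It is an assumption-laden toy, not how an autonomous driving stack works: the “model” is a one-variable least-squares line, the exported artifact is a JSON string, and the function names are hypothetical.

```python
import json


def train_in_data_center(samples):
    """Fit y = slope * x + intercept by ordinary least squares (the expensive step)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    variance = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def export_model(model):
    """Serialize the trained parameters so they can be shipped to the edge device."""
    return json.dumps(model)


def predict_at_edge(serialized_model, x):
    """Run inference on the device using only the shipped parameters."""
    model = json.loads(serialized_model)
    return model["slope"] * x + model["intercept"]


# Hypothetical training data that lies exactly on y = 2x + 1.
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
payload = export_model(train_in_data_center(training_data))
print(predict_at_edge(payload, 10.0))  # prints 21.0
```

The edge function never sees the training data; it needs only the small exported payload, which is why inference can run on much more constrained hardware than training.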

Now that machine learning has become mainstream in the data center, dedicated products are being released for edge computing. For example, NVIDIA provides the Drive PX family of accelerator cards that host 1-4 GPUs, as well as multiple video and other sensor inputs. They can thus power anything from simple highway driving today to fully autonomous driving in the future.

A New GPU-Driven ML Landscape

From this whirlwind survey of innovation driven by GPUs, one can anticipate increases in processing power of two to five times over the coming months, from which a second wave of machine learning breakthroughs is bound to emerge, allowing us to solve a brand-new class of challenges.

How Machine Learning Will Disrupt The Established Cloud Providers

Previously published in Forbes on October 24, 2017

In the past few years, new categories of products have emerged thanks to the extraordinary advances in machine learning (ML) and deep learning (DL). These new techniques power product recommendations, computer-aided diagnosis in medical imaging and self-driving cars, just to name a few.

Most ML and DL algorithms require compute profiles (hardware, software, storage, networking) that are significantly different from those optimized for traditional applications. Consequently, as more and more companies develop their own ML/DL solutions and deploy them to production, the demand for the ML-optimized compute resources will grow dramatically and create opportunities for new entrants to offer solutions that compete with today’s dominant cloud providers: Amazon AWS, Microsoft Azure and Google Cloud.

The ML/DL Cloud Is Different

In an article on Mesosphere’s blog page, Edward Hsu presented the case that web applications are now primarily data-driven. Consequently, a new set of frameworks (a.k.a. stacks), namely SMACK (Spark, Mesos, Akka, Cassandra, Kafka), must replace the traditional LAMP (Linux, Apache, MySQL, PHP) stack used to build web-based applications. In my view, rather than replacing LAMP, SMACK will coexist side by side with, and feed data to, traditional web-based frameworks, which are still needed to present nice-looking webpages and interface with mobile phones.

Yet the main point is well-taken. We need to update Marc Andreessen’s famous line about how “Software is eating the world” to “Data is eating the world.” Let’s unpack this statement and derive the consequences.

Hardware

The disruption created by machine learning and deep learning extends well beyond the software stack into chips, servers and cloud providers. This disruption is rooted in the simple fact that GPUs are much more efficient processors for ML and DL than traditional CPUs.

Up until recently, the solution was to augment traditional servers with GPU add-on cards. We are now at a point where demand for ML/DL computing is such that special-purpose servers, optimized for ML/DL compute loads, are being built.

Data centers are also being re-architected to support the extremely large amount of data consumed by ML and DL. Imagine you are designing the brains for self-driving cars. You need to process thousands and thousands of hours of video (and other such signals as GPS, gyroscopes, LIDAR) to train your algorithms. The amount of data that a Tesla on the road records in one second is a million times larger than a tweet or a post on Facebook.

ML/DL data centers thus require both huge amounts of storage and extremely high bandwidth.

Software

The software side is even more complex. A new infrastructure stack, typically using machine learning-specific frameworks such as TensorFlow (originally developed by Google) or PyTorch (originally developed at Facebook), is required to shepherd data around and manage the execution of the compute jobs. Furthermore, open-source code libraries (pandas, scikit-learn, matplotlib) are used to implement the models (e.g., neural networks, data displays). These model libraries are critical because they are optimized both for ease of use in algorithm research and for high performance in production.

Finally, each vendor offers complete building blocks for specific use cases. For example, Amazon Lex, Google Cloud Speech and Microsoft Bing Speech provide speech recognition and can even recognize intent. Each has its own API and unique behavior, making the migration from one vendor to the other time-consuming.

New Entrants

In addition to the Big Three cloud providers (Amazon AWS, Microsoft Azure and Google Cloud) that have offered GPU-accelerated instances for a few years, new ML-optimized offerings have emerged:

• NVIDIA, which is already the dominant provider of GPUs that power the graphics cards that drive computer displays, recently introduced its DGX systems, a portfolio of servers it describes as “purpose-built AI supercomputers.”

• Servers.com offers its Prisma Cloud with dedicated GPU-optimized servers.

• Rescale, one of the niche cloud providers that focuses on high-performance computing (HPC), just announced the availability of the latest generation of GPU-powered servers, along with high-bandwidth interconnect, to create high-performance multi-node clusters.

What’s At Stake

The Big Three cloud providers are the ones most immediately at risk of being disrupted by new entrants such as NVIDIA, Servers.com and Rescale. ML/DL innovation is still running at a torrid pace thanks to innovation in algorithms as well as compute efficiency. This is creating a small arms race where end users are constantly looking for the provider that can give that extra edge.

On one hand, end users are benefiting hugely from this arms race to provide the best software and hardware compute environment. On the other, this requires constant vigilance to keep abreast of the latest offerings. Even more importantly, when deploying ML/DL products to production, CEOs and CTOs need to pick the winner — or at least a future survivor — that will keep their edge for the next two to five years. This is not an easy task.

We will delve deeper into these two topics in future posts — stay tuned.

The Machine Learning Imperative

Previously published in Forbes on June 28, 2017

There’s no longer a debate as to whether companies should invest in machine learning (ML); rather, the question is, “Do you have a valid reason not to invest in ML now?”

Machine learning is here, and it’s finally mature enough to cause a major seismic shift in virtually every industry. For example, Matt Swanson, founder of SVSG, wrote an article last year about how chatbots will disrupt a $200 billion industry. While ML cannot solve every problem, it has demonstrated a game-changing impact in enough markets that every CEO and CTO must ask themselves whether they understand ML well enough to rule it out for their own business. While appreciating the rewards of ML may be difficult, we do know the risks: ML has already disrupted several industries, including e-commerce, autonomous driving and customer engagement. The risk of ignoring ML today is one that is probably too large for any established company to take.

Machine Learning Changes The Game

While artificial intelligence grabs most of the spotlight in discussions about machine learning (primarily due to its easily graspable life-altering implications), it is but one of many disciplines in ML. Big data has demonstrated the enormous value of data: Netflix and Amazon recommend films and products based on our own purchase history and those of customers like us. Thus, big data has helped us answer questions we already knew to ask, questions such as, “What more can I sell to my customers?”

Machine learning allows us to make even better use of the data we have, as well as the data we don’t currently possess, and answer the questions we didn’t know we should ask.

Machine Learning Uses Data We Don’t Yet Have

Analytics and business intelligence extract information from structured data (i.e., data stored in databases: customer information, purchase history, etc.). But thanks to ML, we can now extract information from unstructured data such as texts, phone calls, images and videos.

Search engines used to return pages based on the exact words of the query. ML takes this text analysis a few steps further. First, it extracts concepts out of words and associates pages that discuss the same concept with different words: A search for “artificial intelligence” will produce results that mention machine learning and robotics but not explicitly the words “artificial intelligence.” Beyond this, ML is now becoming proficient at sentiment analysis and determining intent in a given context. This means that ML can deduce, via our posts on social media, if we are happy or angry (sentiment analysis), for whom we are likely to vote, or what purchase we are considering next (intent).

Similarly, ML techniques like natural language processing (NLP) and image categorization interpret and translate people’s speech as well as the content of images (e.g., facial recognition on Facebook).

This means that, thanks to ML, the huge amount of publicly available content — which, up until recently, was of little use — can now give us useful new insights.

Machine Learning Makes Better Use Of The Data We Have

Machine learning provides a new class of algorithms that manipulates structured data that we already possess. AWS has a nice blog, including code, on how to build a prediction engine for customer churn. BlackRock is using machines to manage funds.

In addition, data that every company gathers from its customers (emails, chats, comments, support requests, etc.) can now be analyzed by ML to extract accurate customer sentiment (satisfaction with the service, suggestions, identifying emergency requests). Even polls and surveys may be replaced by ML algorithms that can mine Facebook, Twitter and news sites to capture the sentiment of millions of people expressing themselves openly.

Machine Learning Answers Questions We Didn’t Know To Ask

At the risk of stating the obvious, the power of machine learning is that it learns. The more information provided, the faster it learns and the better it answers.

While traditional business intelligence techniques can tell us how often products A and B are purchased together, these techniques fail in the face of a massive organization such as Amazon, which sells over 368 million products. However, ML can digest the flow of purchase transactions and identify patterns of joint purchases. ML can even use these predictions to automatically make purchase decisions (see German e-commerce merchant Otto as an example).
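To make the scale problem concrete, here is a minimal sketch of the traditional counting approach, assuming orders arrive as simple lists of product names (the sample orders are hypothetical). It is not an ML technique; it only illustrates why exhaustive pair counting stops being practical, since a catalog of n products has n(n-1)/2 possible pairs, which for 368 million products is roughly 6.8 × 10^16.

```python
from collections import Counter
from itertools import combinations


def count_joint_purchases(orders):
    """Count how often each unordered pair of products appears in the same order."""
    pair_counts = Counter()
    for order in orders:
        # Sort and de-duplicate so (A, B) and (B, A) count as the same pair.
        for pair in combinations(sorted(set(order)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical orders, for illustration only.
orders = [
    ["coffee", "filter", "mug"],
    ["coffee", "filter"],
    ["mug", "tea"],
]
pair_counts = count_joint_purchases(orders)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

On a small catalog this works well; the point of the paragraph above is that at very large scale the pair space can no longer be enumerated, and pattern-finding methods take over.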

Furthermore, by leveraging data we don’t have — such as stock market indices, weather data, political news and government statistics — we can correlate external events with our business data and thus enrich the accuracy of our predictions and decisions.

Why Now?

The rapid growth of machine learning leads to uncertainty, which may entice business leaders to hesitate in utilizing it. Yes, machine learning is complex, but it is also a powerful force of disruption. Because ML is still developing, it presents an opportunity to pull ahead of the competition by taking advantage of this maturation period. The choice is simple: disrupt or be disrupted.

It will take some time to ascertain what use cases are relevant to your company, so it is important to start this investigation now. ML is complex and challenging to master, yet the tools for machine learning are all readily available to you and are already being employed by Amazon, Google and Microsoft.

The journey to machine learning must start now.

Everything You Ever Wanted to Know About Technical Debt

Check out the white-paper I recently authored at the Silicon Valley Software Group.

Its main objective is to build a bridge between technical and non-technical executives to have rational discussions about technical debt, and then make rational decisions on how to tackle it.

Some of the main takeaways are:

  • Technical debt is on-going: Technical debt originates from a variety of sources, some legitimate, others less so, throughout the life of a product. This means that technical debt should be integrated into the product roadmap process
  • There are different types of technical debt, characterized mainly by the risk they entail, and the cost to remedy. Consequently, there are different strategies to address different types of technical debt
  • Ranking the various types of technical debt of a product on the two-dimensional plane risk vs cost-to-fix provides a good vehicle to foster dialogue, and decisions, about engineering priorities between technical and business executives.

For more details, please download the white-paper at: svsg.co/sme

Time Tested Engineering Leadership Principles

I put together the first three of these four leadership principles during my first VP of Engineering gig, twenty years ago. Thirteen companies later, and having shared them with hundreds of engineers, I feel it is time to share the secret :)

These leadership principles have been honed (a) for Engineers and (b) in the context of startups, typically with fewer than 150 employees. No claim is being made outside of these parameters.

1.   I commit to give you more responsibilities than you can handle … and help you succeed

The vast majority of Engineers are highly motivated (see my previous blog on “(Boosting) Morale in Engineering”). They are motivated by their career, naturally, yet they are primarily driven by a need to accomplish and an intense desire to learn.

Another way of articulating this commitment is: “I am going to challenge you, and let you work as hard as you want, and exercise as many of your skills as possible”. Engineers hate being bored. On the contrary, they work extra hard when challenged. So my job is to continuously provide new challenges to each engineer in my team, and remove any impediments to their desire to fulfill these challenges.

2.   I commit to give you clarity, both strategic & tactical

I work hard to ensure that everyone knows where we, as a company and as an Engineering team, are going, what our objectives are (strategic), and how we plan to get there (tactical).

In practice, I make sure, during our periodic 1on1s, that each engineer understands how his/her own project and role align with the company mission, and Engineering’s product roadmap.

Included in this commitment is a promise to each member of the Engineering team that on any given day, his/her #1 priority is clear. As a logical consequence, this implies that each engineer only has one #1 priority (I have seen a lot of companies where this logic is violated). Their manager, or I as a last resort, will handle situations where, for example, 3 VPs are breathing down an engineer’s neck, each with their own “top priority”.

Having everyone in the team understand and share the same strategic context empowers developers to make correct micro-decisions every day. As a side benefit, this frees me and their managers to work on bigger problems.

Taking a step back, if I’ve correctly communicated my commitments 1 and 2, then everyone in the team is working at the maximum of their ability – and – all are working in the same direction. This is a good foundation for solid productivity.

Having made two commitments to everyone in the team, I ask for two in return.

3.   In return, I demand teamwork & 3-D communications

I put teamwork and communications in the same sentence because one is meaningless without the other. Teamwork can’t exist without meaningful communications, and if we communicate but don’t work together, we don’t go very far.

No interview question will ever suss out whether a candidate is a team player or not. Instead, I explicitly declare that they should not join my team if they are not a team player.

Teamwork is important because product development is a team effort. Every engineer interacts with product managers, UX designers, front-end engineers, middle-tier, backend, data, QA, tech support, etc. Poor interactions with other team members result in poor individual efficiency.

Teamwork means that “together, we succeed”. Teamwork is not merely about helping out a teammate who needs help. More importantly, being a team player means asking for help when we need it, so as not to delay the whole team.

3-D communications simply expands the definition of “team” beyond one’s daily scrum. We are all inter-dependent, and we each must ensure that information gets to the people who need it, no matter where their name sits in the org chart. Making sure information is received in a timely fashion, rather than waiting for questions to be asked, is incumbent upon each of us.

In particular, this means that everyone on my team has the responsibility to inform me if I am not meeting commitments #1 and #2 stated above. I don’t read minds, and I can only take corrective actions if someone lets me know that they are bored, confused, pulled in too many directions, or under-utilized, etc.

4.   At the end of the day, we need to be proud of our work

I added this fourth principle a few years later. I had been working at a company for about a year, had delivered a handful of successful releases, yet sensed burn-out and loss of creativity in the team.

A startup demands almost contradictory qualities from its Engineering team: speed and creativity (quality is a given). Because the demand on speed is often explicit, while the demand on creativity is often implicit, it is easy to fall into the trap of focusing only on execution to the detriment of innovation, or even the beauty of the code.

Yet, if we continuously succumb to the mantra of “ship, ship, ship”, and give up trying to build something cool, then we start on a slippery downward slope towards creating “blah” products. There are always pressures to ship more features faster, but if each of us is not proud of the product we are releasing to our customers then our customers won’t be excited about the product, and we won’t be having fun at work. Life is too short for us to accept either of these issues.

Making It All Work

There is nothing new, or magic, about these four leadership practices. The magic is in their daily practice. They work for me because I force myself to apply them on a daily basis, and I remind my teammates of their existence, their rationale and their own commitments, whether when welcoming a new member, during a 1on1, during my weekly staff meetings, at exec staff, or monthly Engineering updates, or even at the water cooler.

DevOps-Driven Development

It is now time to add the concept of “DevOps-Driven Development” to our repertoire.

“Test-driven” development, which originated around the same time as Extreme Programming and Agile Development, encourages us to think about testing as we architect our software and plan our tasks. Similarly, a “DevOps-Driven Development” approach ensures that we consider operational implementation as well as deployment process during the design phase. To be clear, DevOps thinking needs to augment (and not replace) testing strategy.

Definition and Motivation

First a definition: I am using the word DevOps here as a shortcut to include both DevOps (build and deployment tools) and Ops (IT/data center Operations).

How many times have you heard “ … but it works on my machine!!” from a developer whose code was found to have a bug in the QA environment or, worse, in production? We all agree that these situations are a horrible waste of time for all involved, most of all customers. This post thus advocates that DevOps-thinking, just as quality-thinking, must occur at the design phase and continue throughout the development of the software until the software is released to production, and even after it has been released in production.

Practicing DevOps-Driven Development

I have always advocated: “If you don’t know how to test it, you don’t know how to design it.” (Who Owns Quality? Part 3), to articulate the fact that “quality cannot be debugged out, it has to be designed in”. Similarly, if we want to know – before our customers call us – when our code crashes in Production, or becomes unusably slow, then we must build into our code the proper instrumentation and administration capabilities.

We now must add this mantra “If you don’t know how to deploy it and manage it in Production, you don’t know how to design it”.

Just like we don’t allow code to be merged into Trunk (main branch) without complete unit tests, code cannot be merged into Trunk without correct deployment scripts, release notes, and production instrumentation.

Here is a “thinking DevOps” check list:

Deployable

First of all, we must ensure that the code deploys successfully not only in Production but in all environments: Dev, QA, Stage, etc.

This implies:

  • Developers write/update release notes: e.g. highlighting any changes required in the configuration of the environments: open new port, add a column in database, a new property in config files, etc
  • Developers in collaboration with DevOps team update deployment scripts, e.g. to account for a new executable, or schema changes in the database

The management of Config/Property files is beyond the scope of this blog, but I strongly recommend the “Infrastructure as code” approach: i.e. fully automating server/image configuration for deployment, and managing configuration, deployment scripts and application property files under source code control.

Monitor-able

If we want to detect problems before our (irate) customers call us, our code needs to be monitor-able – not only at the physical server level, but also each virtual machine, service and process, as well as networking and storage systems.

Monitor-ability needs surpass keeping track of CPU load, disk space and network bandwidth. We, developers, (should) know what parameter(s) indicate when our system is mis-behaving, whether it is a queue exceeding a given size, or certain operations timing out. As a consequence, we must publish these parameters to interfaces compatible with Ops monitoring tools, of which there are several categories.

Furthermore, by making performance metrics easily observable, we ensure that each new release maintains (or improves) the performance of the prior release.
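As an illustration of publishing application-level parameters, here is a minimal sketch using only the Python standard library. The `MetricsRegistry` class, the metric names and the thresholds are all hypothetical; a real system would expose the snapshot through whatever interface the Ops monitoring tool expects rather than printing it.

```python
import json
import threading


class MetricsRegistry:
    """Thread-safe holder for application-level gauges and counters."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def set_gauge(self, name, value):
        # A gauge is a point-in-time value, e.g. the current queue depth.
        with self._lock:
            self._values[name] = value

    def increment(self, name, amount=1):
        # A counter only goes up, e.g. the number of timed-out operations.
        with self._lock:
            self._values[name] = self._values.get(name, 0) + amount

    def snapshot(self):
        with self._lock:
            return dict(self._values)

    def to_json(self):
        # Serialized form that a monitoring agent could scrape or receive.
        return json.dumps(self.snapshot(), sort_keys=True)


def breached(snapshot, thresholds):
    """Return the names of metrics whose value exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items() if snapshot.get(name, 0) > limit
    )


# Hypothetical metric names and limits, for illustration only.
metrics = MetricsRegistry()
metrics.set_gauge("order_queue_depth", 1200)
metrics.increment("payment_timeouts")
metrics.increment("payment_timeouts")

thresholds = {"order_queue_depth": 1000, "payment_timeouts": 5}
alerts = breached(metrics.snapshot(), thresholds)
print(metrics.to_json(), alerts)
```

Comparing snapshots taken under the same load before and after a release is one simple way to check that performance has not regressed.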

Diagnosable

Despite our best intentions, we must humbly assume that at some point our code will crash, or seriously mis-behave, and thus require troubleshooting. In the worst case, Development will be called in (usually in the wee hours of the night) to assist the Ops team. As anyone who has had to figure out why a given system intermittently crashes will attest, having log files capture meaningful information prior to the incident is invaluable. Having to add logging statements after-the-fact is a painful process. Consequently, a solid Logging Hygiene is critical (and worthy of a dedicated post):

  • Log statements must be written in a format compatible with the log management system (Splunk, GrayLog2, …)
  • All log statements used during the coding and QA phase must be removed
  • Comprehensive Operations-focused logging must be added to document all operations that may fail due to environmental and data-related problems: out-of-memory, disk full, time out, user not found, access denied, etc. These are not bugs, but failures due to either environment (e.g. a server or connection is down) or incorrect data (e.g. the user has been deleted).
  • The hierarchy of logging levels must be enforced so that in normal operations log files are kept small and, conversely, meaningful information is output when troubleshooting is required
  • Log statements must include all the information necessary to bind all operations across various services that are related to a single user-level transaction (e.g. clicking on a link to a new page, adding an item to cart) – more details below in “Tunable”.
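Several of these rules can be sketched with Python's standard `logging` module. This is a minimal illustration, not a prescription: the JSON layout, the `checkout` logger name and the `transaction_id` field are assumptions, and the format your log management system actually ingests may differ. A production format would normally also carry a timestamp, omitted here to keep the example short.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (assumed target format)."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by callers through the `extra` argument; None when absent.
            "transaction_id": getattr(record, "transaction_id", None),
        }
        return json.dumps(entry, sort_keys=True)


# An in-memory stream stands in for the real log file or shipper.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("checkout")
log.setLevel(logging.WARNING)  # normal operations: keep log files small
log.addHandler(handler)
log.propagate = False

# Below the configured level, so nothing is written.
log.debug("cart contents loaded", extra={"transaction_id": "tx-42"})
# An environmental or data failure, not a bug: always recorded.
log.error("user not found", extra={"transaction_id": "tx-42"})

lines = stream.getvalue().splitlines()
print(lines)
```

Lowering the level to `logging.DEBUG` through configuration, without a code change, is what makes the detailed output available when troubleshooting is required.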

Security

This again is worthy of its own post, but code that is deployed to Production must both support the security practices implemented by the Ops team (e.g. Authentication protocols, networking infrastructure), and ensure that the code itself is secure (e.g. no SQL injection, buffer overflow, etc).

Business Continuity

Business continuity is often overlooked, but we must ensure that any persistent data is stored in a storage system that is backed up by the Ops team. In other words, if we add a new database, we’d better ask the Ops team to add it to their backup scripts.

Similarly, if our infrastructure is deployed (or even just deployable) across multiple data-centers, our code must support this through configuration.

The above requirements represent the basic DevOps requirements that any developer must address before even thinking that his/her code is ready to release. The following details additional practices that are highly recommended, but not strictly necessary.

Scalable

The code must be designed so that the Ops team can scale it in the datacenter without needing help from Development.

This may involve deploying the code to a bigger server. This implies that the code can be configured (and documented for the Ops team) to make use of the expanded resources, whether it is number of cores, RAM, threads, I/O, etc.

This may also involve adding instances to a cluster. Consequently, the code must be discoverable (the load balancer must find out that a new instance has been added/subtracted), as well as cluster-aware (e.g. stateless).

Tunable

Because it is so hard to simulate all real-life user activities and behaviors in non-production environments, we must provide tools to the Ops team to tune the performance of our code through configuration rather than code deployment (e.g. size of JVM, number of threads, queue sizes, hash table size, etc).
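One common way to achieve this is to read tuning parameters from the environment at startup, with safe defaults. The sketch below assumes environment variables as the configuration channel; the `APP_*` variable names, the `TuningConfig` fields and the default values are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningConfig:
    """Performance knobs the Ops team can change without a code deployment."""
    worker_threads: int = 8
    queue_size: int = 1000
    request_timeout_s: float = 2.5


def _read(env, name, cast, default):
    """Return env[name] converted with cast, or default when unset or empty."""
    raw = env.get(name)
    if raw is None or raw == "":
        return default
    try:
        return cast(raw)
    except ValueError:
        # Fail loudly at startup rather than run with a silently wrong value.
        raise ValueError(f"{name}={raw!r} is not a valid {cast.__name__}") from None


def load_tuning(env=os.environ):
    """Build the configuration from a mapping (the process environment by default)."""
    defaults = TuningConfig()
    return TuningConfig(
        worker_threads=_read(env, "APP_WORKER_THREADS", int, defaults.worker_threads),
        queue_size=_read(env, "APP_QUEUE_SIZE", int, defaults.queue_size),
        request_timeout_s=_read(
            env, "APP_REQUEST_TIMEOUT_S", float, defaults.request_timeout_s
        ),
    )


# Ops overrides one knob; everything else keeps its default.
config = load_tuning({"APP_WORKER_THREADS": "32"})
print(config)
```

Passing the environment in as a plain mapping keeps the loader easy to test, and the variable names double as the documentation the Ops team needs.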

We must thus provide the metrics to observe performance. Let’s take the example of response time: depending on the complexity of the application a user request may be handled by tens, or even hundreds of services. In order to allow the Ops team to build a timeline of the interactions between all the services involved, each log entry must carry at least one tag that identifies the root transaction that generated the request. Otherwise it is impossible to determine whether the performance degradation comes from a given service, or a unique server, or even from the network infrastructure.

The same tagging will be used to troubleshoot failures (e.g. to discover why a given service fails intermittently).
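Here is a minimal sketch of what the Ops team can do once every log entry carries the root transaction tag. The entry fields (`transaction_id`, `service`, `start_ms`, `duration_ms`) and the sample values are hypothetical; in practice the entries would come from the log management system rather than an in-memory list.

```python
from collections import defaultdict


def build_timelines(entries):
    """Group log entries by root transaction and order each group by start time."""
    timelines = defaultdict(list)
    for entry in entries:
        timelines[entry["transaction_id"]].append(entry)
    for steps in timelines.values():
        steps.sort(key=lambda e: e["start_ms"])
    return dict(timelines)


def slowest_service(steps):
    """Return the service that took the longest within one transaction."""
    return max(steps, key=lambda e: e["duration_ms"])["service"]


# Hypothetical entries collected from several services, in arrival order.
entries = [
    {"transaction_id": "tx-1", "service": "inventory", "start_ms": 12, "duration_ms": 240},
    {"transaction_id": "tx-2", "service": "gateway", "start_ms": 5, "duration_ms": 30},
    {"transaction_id": "tx-1", "service": "gateway", "start_ms": 0, "duration_ms": 15},
    {"transaction_id": "tx-1", "service": "pricing", "start_ms": 260, "duration_ms": 20},
]

timelines = build_timelines(entries)
call_order = [e["service"] for e in timelines["tx-1"]]
print(call_order, slowest_service(timelines["tx-1"]))
```

Without the shared tag, the three `tx-1` entries could not be told apart from the `tx-2` entry, and the slow step could not be attributed to a service.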

QA-able

As I mentioned in an earlier blog, QA does not stop in QA: we have to anticipate “unknown unknowns”, i.e. usage (or performance) scenarios that we have not modeled in our QA environments. By definition, there is not much we can do other than ensuring that our code is easy to trouble-shoot (see above) and that logs and associated data can be made available easily and rapidly to developers and QA team (e.g. by giving them access to the log management console).

Sometimes this requirement is more complex than it sounds, e.g. when user data must be deleted or obfuscated for privacy or security reasons. Again, this should be thought through before code is deployed.
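One possible approach to the obfuscation case is to replace identifying values with a keyed hash before logs are shared. The sketch below assumes that email addresses are the sensitive field and that a keyed SHA-256 digest is an acceptable pseudonym; the regular expression is deliberately simple, the secret is a placeholder, and real privacy requirements may call for something stricter.

```python
import hashlib
import hmac
import re

# Simplified pattern for illustration; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value, secret):
    """Map a sensitive value to a stable token that cannot be reversed without the key."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user-" + digest[:12]


def scrub_line(line, secret):
    """Replace every email address in a log line with its pseudonym."""
    return EMAIL_RE.sub(lambda match: pseudonymize(match.group(0), secret), line)


secret = b"rotate-me"  # hypothetical key; keep the real one out of source control
raw = "ERROR access denied for alice@example.com on /orders/17"
clean = scrub_line(raw, secret)
print(clean)
```

Because the same address always maps to the same token, developers can still follow one user across log lines without seeing who that user is.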

Analytics – Growth Hacking – Usability

This last requirement stems from Marketing and Sales rather than Operations, but it is equally important since it drives revenue growth.

In most companies, marketing and sales rely on usage reports to drive new marketing campaigns, pricing, product offerings and even new features. As a consequence, any new feature must integrate with the Analytics infrastructure whether via integration with usage tracking applications (e.g. Mixpanel, Flurry, …) or simply log management consoles (Splunk, GrayLog2, …). However, I highly recommend using separate logging infrastructure for operations monitoring and for usage analytics, if only because usage analytics requires additional data that is not useful for Operations monitoring (e.g. the time a user spends on a page is extremely valuable for usage analytics but irrelevant for Operations).

Even More So for Microservices

As we migrate towards a microservices architecture, early “DevOps thinking” becomes even more critical. As the “Microservices: Four Essential Checklists when Getting Started” advises: “Microservices introduces a lot of moving parts that were previously non-existent in a monolithic system”.

What was a monolithic application running in a single virtual machine can morph into 5, 10 or even 20 microservices. Consequently, Development, DevOps and Ops must collaborate on microservices infrastructure tools: service registration, scaling up/down each service independently, health monitoring, error detection, etc. to provide visibility on the status of these 20 microservices as a whole. This challenge has even prompted dedicated product categories (SignalFx, Nirmata, etc.).
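To illustrate what "status as a whole" can mean, here is a minimal sketch that rolls per-service health up into a single answer. The three-level status, the `desired` and `healthy` fields and the service names are assumptions for illustration; dedicated tools apply far richer rules.

```python
def overall_status(service_health):
    """Summarize per-service instance counts into one system-wide status.

    Returns a (status, affected_services) tuple where status is
    "down", "degraded" or "ok".
    """
    down = sorted(
        name for name, h in service_health.items() if h["healthy"] == 0
    )
    degraded = sorted(
        name
        for name, h in service_health.items()
        if 0 < h["healthy"] < h["desired"]
    )
    if down:
        return "down", down
    if degraded:
        return "degraded", degraded
    return "ok", []


# Hypothetical report gathered from a service registry or health checks.
service_health = {
    "cart": {"desired": 3, "healthy": 3},
    "search": {"desired": 4, "healthy": 2},
    "payments": {"desired": 2, "healthy": 2},
}

status, affected = overall_status(service_health)
print(status, affected)
```

The value of agreeing on such a roll-up early is that Development, DevOps and Ops share one definition of "healthy" before the twentieth service ships.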

Summary

Only with a holistic approach to product architecture can we ensure customer satisfaction with software that works the first time, and all the time. Deployment and operations management concerns, just like testability, must be addressed at design time, so that these capabilities are meshed natively into the code rather than “bolted on” after the fact. Failing to do so will likely impact the delivery schedule, or worse, create outages in production.

More importantly, there is so much we can learn from observing how our code behaves in Production: operational efficiency, stability, performance, usability, that we would do a disservice to ourselves if we did not avail ourselves of this valuable information to drive further improvements to our product.