What AI Means for Enterprise IT

Sandeep Uttamchandani
12 min read · Aug 30, 2018

1. Vision of Self-driving IT

Imagine Self-driving Enterprise IT, where machine intelligence completely replaces human Ops in managing business-critical application deployments — will this become a reality, or is it a pipe dream? To answer this question, we draw a parallel to self-driving cars, whose autonomy is defined on a scale of zero to five — Level 0 is entirely human-driven, while Level 5 is entirely machine-driven (no human). In this article, we define a similar scale for the evolution of Self-driving IT. Using this scale, we characterize the current state of the art and describe how technology trends are making Self-driving Enterprise IT a reality!

One of the key tasks for Enterprise IT management today is to ensure that Service Level goals for business-critical applications are continuously met. This is a critical task for any Enterprise, since downtime, data breaches, slow performance, and other risks can lead to millions of dollars in losses, as well as lasting damage to brand value. Human Ops activities today can be broadly classified into three buckets:

  • Reactive fire-fighting: Diagnosing and resolving ongoing issues
  • Proactive problem avoidance: Actively tracking known issues available across structured and unstructured knowledge sources
  • Cost optimization: Optimizing Opex and Capex (typically not a first-order consideration)

CIOs today are under constant pressure to align IT with the core business differentiator of their Enterprise. This translates to finding cheaper alternatives for the activities required by existing application deployments. Investing in Ops talent is becoming increasingly expensive and difficult, given skill-set scarcity — in the security domain alone, 1.5 million jobs are expected to go unfilled by 2019. Automation of Enterprise IT gives CIOs the best of both worlds: complete control of application platforms (in contrast to cloud deployments) without breaking the bank on Opex.

Automation needs to be considered an evolution rather than a revolution. Self-driving cars provide an inspiring analogy, with a standards definition of six levels of autonomy — the levels are briefly listed below (for more details, check out the standards document available here):

  • Level 0 — No Driving Automation
  • Level 1 — Driver Assistance
  • Level 2 — Partial Driving Automation
  • Level 3 — Conditional Driving Automation
  • Level 4 — High Driving Automation
  • Level 5 — Full Driving Automation

We define a similar scale to track the evolution of autonomy in Enterprise IT management. In this approach, existing management tasks are modeled as a variant of the OODA loop, composed of Monitor-Analyze-Recommend-Act (a minimal sketch of this loop follows the list below). The levels of autonomy are defined by which sub-tasks of the loop are machine-automated:

  • Level 0 — Monitor: Ability to automate collection and aggregation of telemetry and logs across the lifecycle of the application deployment.
  • Level 1 — Monitor + Analyze: Ability to automate the analysis of monitored information. The goal is to provide human Ops with insights on patterns, anomalies, intrusion attempts, predicted usage, etc.
  • Level 2 — Monitor + Analyze + Recommend (for reactive fire-fighting): Ability to automate the generation of actionable insights based on the analysis of monitoring data. At this stage, the focus is on diagnosing existing issues and recommending corrective actions.
  • Level 3 — Monitor + Analyze + Recommend (for proactive problem avoidance): Ability to recommend actions to avoid Ops fires. Proactive recommendations require deeper domain knowledge to analyze system context as well as unstructured information.
  • Level 4 — Monitor + Analyze + Recommend + Act (repetitive tasks): Ability to learn patterns from human Ops and automate repetitive as well as previously seen tasks.
  • Level 5 — Monitor + Analyze + Recommend + Act (on all tasks): This is the ultimate nirvana, where machines completely take over from humans.
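
To make the loop concrete, here is a minimal, hypothetical sketch of the Monitor-Analyze-Recommend-Act cycle in Python; the autonomy gating and the callable interfaces (collect, analyze, recommend, act, notify_ops) are assumptions for illustration, not a reference implementation.

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    MONITOR = 0      # Level 0: collect telemetry only
    ANALYZE = 1      # Level 1: surface insights
    RECOMMEND = 2    # Levels 2-3: propose corrective actions
    ACT = 4          # Levels 4-5: execute actions without a human

def mara_loop(level, collect, analyze, recommend, act, notify_ops):
    """One iteration of a Monitor-Analyze-Recommend-Act cycle.

    `collect`, `analyze`, `recommend`, `act`, and `notify_ops` are
    caller-supplied callables (hypothetical interfaces)."""
    telemetry = collect()                      # Level 0: always monitor
    if level < AutonomyLevel.ANALYZE:
        return telemetry
    insights = analyze(telemetry)              # Level 1: automated analysis
    if level < AutonomyLevel.RECOMMEND:
        notify_ops(insights)                   # humans decide what to do next
        return insights
    actions = recommend(insights)              # Levels 2-3: actionable recommendations
    if level < AutonomyLevel.ACT:
        notify_ops(actions)                    # humans approve and execute
        return actions
    return act(actions)                        # Levels 4-5: machine executes
```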

The current state of the art of Enterprise IT management is approaching Level 2 — the industry is focusing on using AI and Deep Learning to automate root-cause diagnosis and provide actionable recommendations. Several technology advancements are promising for moving the state of the art toward Level 5 in the coming years:

  • Monitoring capability: Over the years, the technologies to collect and aggregate large amounts of telemetry and logs from deployed applications have matured significantly.
  • Orchestration capability: Emerging platforms such as Kubernetes provide a logical specification for application deployment, with built-in intelligence for scheduling, scaling, fault tolerance, and self-healing. The ability to manage multiple versions and roll back significantly reduces the cost of an incorrect decision, lowering the bar for incorporating automation.
  • Reasoning capability: Advancements in AI-based reasoning are making it possible to automate not just for known, but also for unknown scenarios.
  • Reinforcement Learning capability: In contrast to monolithic applications, microservices that each implement a single piece of functionality make it easier for machine intelligence to learn and correlate observable patterns.
  • Operations Data visibility: Machine learning algorithms thrive on data. The growing adoption of cloud infrastructure makes operational data potentially available across thousands of customer deployments.

In summary, the vision of Self-driving Enterprise IT should not be considered a 0-or-1 binary, but rather a continuum along the 0–5 scale. The evolution has already started, with advancements in the underlying technology building blocks making Level 5 a realistic target in the coming years! We expect these advancements to be further propelled by the adoption needs of Enterprises facing shrinking IT budgets, rising costs of human domain experts, and increasing skill-set scarcity.

2. Use-cases for AI in Enterprise IT

As software eats every vertical market, Enterprise IT is transforming from the cost of doing business into a business differentiator. CIOs are busy with “digital transformation,” focusing on digitizing new and existing business workflows, as well as extracting competitive insights from growing amounts of data. In meeting these objectives, CIOs are actively evaluating on-prem and cloud-based platforms that provide agility in delivering new applications while meeting functionality expectations. The key metric is reducing human IT effort spent on day-to-day management, so the focus shifts to innovation instead of fire-fighting. The promise of “human-like intelligence” from AI resonates with CIOs who are under pressure to do more with less.

AI is really a collection of techniques. Intelligence is broadly defined as the ability to acquire knowledge and reason with it for problem-solving. The branch of AI that has gained the most popularity today is machine learning. AI has been in the making for 60 years, and is now being compared to electricity in how essential it may become. Skeptics claim that learning techniques such as neural networks are not new, so why now? Three trends are coming together: increasing availability of cheap parallel compute, a growing data corpus, and advancements in algorithms for Machine Learning and Data Science. As CIOs evaluate AI-based technologies, a few success metrics would be: 1) reducing Opex by automating a subset of manual tasks; 2) reducing losses due to data loss, security attacks, downtime, etc.; 3) cost optimization through better resource usage.

While it may appear that any problem can be solved using Machine Learning, the reality is that only a limited set of question categories is well suited to machine learning today. The usefulness of applying Machine Learning depends on how precisely the question is formulated and whether data is available to support the answers. In a simplified sense, the popular question categories are as follows (a minimal anomaly-detection sketch follows the list):

  1. Anomaly detection: Is the latency unusually high?; Is the CPU utilization abnormal?
  2. Clustering (Find similar patterns): What is the load during different times of the day?; What is the performance for this workload pattern?
  3. Classification: What is the category of error?; What percentage of workloads will be affected?
  4. Regression (Predict outcomes): What is the expected latency for this workload?; What is the overload value for this resource?
  5. Reinforcement learning (Learn action impact): What will be the impact of changing this parameter?; Given a specific value of this knob, what will be the impact?
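
As an illustration of the first category, here is a minimal sketch of anomaly detection over latency telemetry using scikit-learn's IsolationForest; the metric values and contamination rate are made-up assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-minute request latencies (milliseconds) from monitoring.
rng = np.random.default_rng(42)
latencies = rng.normal(loc=120, scale=15, size=1440)   # a "normal" day
latencies[700:710] = 400                                # injected latency spike

# Train an unsupervised anomaly detector on the observed distribution.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(latencies.reshape(-1, 1))    # -1 = anomaly, 1 = normal

anomalous_minutes = np.where(labels == -1)[0]
print(f"Flagged {len(anomalous_minutes)} anomalous minutes, e.g. {anomalous_minutes[:5]}")
```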

Enterprise IT management can be broadly defined as ensuring Service Levels for business-critical applications (whether internal or customer-facing). Service Level objectives are typically defined in terms of measurable metrics for security, performance, availability, and scaling. Different day-to-day activities are involved depending on the lifecycle stage of Enterprise IT: Day 0, Day 1, and Day 2 are common terminology for the initial deployment, configuration and optimization, and maintenance activities, respectively. Best practices such as ITIL define the activities involved in various Enterprise IT management tasks. A few key categories are:

  • Capacity Planning: Both initial planning and ongoing scaling of deployments (see the sketch after this list)
  • Continuous Monitoring: Continuously tracking telemetry information as well as logs
  • Configuration Management: Ensuring correctness of configuration parameters as well as optimization
  • Change Management: Ensuring timely patching, upgrades, service validation testing, etc.
  • Root-cause Diagnosis: End-to-end analysis of issues impacting applications
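
As an example of how the regression category above maps onto capacity planning, here is a minimal sketch that fits a linear trend to hypothetical daily peak CPU utilization and estimates when it crosses an assumed scaling threshold; all numbers are illustrative.

```python
import numpy as np

# Hypothetical daily peak CPU utilization (%) over the past 90 days.
days = np.arange(90)
peak_cpu = 40 + 0.25 * days + np.random.default_rng(7).normal(0, 2, size=90)

# Fit a simple linear trend (regression) to the historical peaks.
slope, intercept = np.polyfit(days, peak_cpu, deg=1)

# Estimate when the trend crosses an assumed 80% scaling threshold.
threshold = 80.0
days_to_threshold = (threshold - intercept) / slope - days[-1]
print(f"Projected to hit {threshold}% peak CPU in ~{days_to_threshold:.0f} days")
```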

In summary, Enterprise IT management is ripe for disruption. The democratization of the building blocks of AI, and of machine learning in particular, will make this space increasingly active. The key to a winning solution is deeply understanding today's IT workflows and being realistic about the strengths of AI!

3. Roadblocks for ML in production

In applying ML in production, a significant amount of time is spent aggregating, cleaning, transforming, and understanding data. Once the ML model is created, the process of deploying, monitoring, and iterating on versions of these models in production is ad hoc and error-prone today. This section covers the roadblocks associated with ML in production.

To summarize Sculley et al.: “only a fraction of the code … is actually doing machine learning. A mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code.” ML in production differs from ML in academia:

  • Systems come before algorithms. In academic machine learning, accuracy takes priority, even at the expense of long run times. In industry, faster is always better and slower has to be justified, meaning accuracy can often take a back seat.
  • Objective functions are messy. Academic machine learning is all about optimizing a single objective function. In production, clean objective functions rarely exist; there are typically many conflicting objectives, requiring a Pareto multi-objective approach (improve one objective without negatively affecting the others); a small sketch follows this list.
  • Understanding-optimization trade-off. Production ML is a process of coming up with hypotheses, testing them, and improving the system. Understanding is often more important than better results; experiments drive understanding.
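
To illustrate the multi-objective point, here is a minimal sketch that filters hypothetical candidate models down to the Pareto frontier over accuracy and latency; the candidates and their numbers are made up.

```python
# Hypothetical candidate models: (name, accuracy, p95 latency in ms).
candidates = [
    ("model_a", 0.91, 120),
    ("model_b", 0.89, 40),
    ("model_c", 0.93, 300),
    ("model_d", 0.88, 45),   # dominated by model_b (worse accuracy, slower)
]

def dominates(x, y):
    """True if x is at least as good as y on both objectives (higher accuracy,
    lower latency) and strictly better on at least one."""
    return (x[1] >= y[1] and x[2] <= y[2]) and (x[1] > y[1] or x[2] < y[2])

pareto = [c for c in candidates
          if not any(dominates(other, c) for other in candidates if other is not c)]
print([name for name, _, _ in pareto])   # model_a, model_b, model_c survive
```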

Change Anything, Change Everything

The biggest challenge in managing ML projects is change management:

  • Data distribution changes: Data is not constant but continuously evolving. The distributions of data in training versus production can be quite different (see the drift-check sketch after this list).
  • Feature changes: The importance of input features on the model can change over time.
  • Type of ML algorithm: Based on changes in data and features, the optimal ML algorithm may change over time.
  • Dependencies on other modules: Depending on the sources of data pipelines, there can be indirect dependencies on other modules that can be affected.
  • SLA constraints: Depending on the application of the ML algorithm, there may be constraints on prediction time, CPU, and memory resources that change with the scale of data.
  • Indirect user effects: The predictions or recommendations that the ML model provides to users can change user behavior and effectively modify the data distribution.
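
As a concrete guardrail for the first point, training-time and production feature distributions can be compared periodically; below is a minimal sketch using SciPy's two-sample Kolmogorov-Smirnov test, where the feature values and significance threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical distributions of one feature: at training time vs. in production.
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
production_feature = rng.normal(loc=0.4, scale=1.2, size=5000)   # drifted

result = ks_2samp(training_feature, production_feature)
if result.pvalue < 0.01:   # illustrative significance threshold
    print(f"Distribution drift detected (KS={result.statistic:.3f}, p={result.pvalue:.2e}); "
          "consider retraining or alerting an operator.")
```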

Dealing with change is analogous to untangling a giant spaghetti ball. Coupling between models, configuration, data distributions, data sources, and business logic is not well isolated. Provenance and dependencies are not well documented, leading to debugging nightmares.

Pipeline Jungles

Real-world data travels through a jungle of pipelines before being fed to ML models, with a lot of glue code to get data in and out. These systems are complex, and no one person understands all of them.

Messy Data

Real-world data is noisy and comes with the following analysis challenges (a small pandas sketch follows the list):

  • Missing data: Some of the input parameters may be missing.
  • Incorrect labels: Labels associated with the data can have errors which will result in incorrectly training the model.
  • Nested: Features can be part of a compound data structure, requiring fragile parsing or extraction logic.
  • Sparse: Only a small subset of samples may have interesting values.
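
Here is a minimal pandas sketch illustrating a few of these issues on made-up records: a nested field that needs flattening, a missing value, and a sparse column.

```python
import pandas as pd

# Hypothetical raw records: nested fields, a missing value, and a sparse column.
records = [
    {"host": "web-1", "metrics": {"cpu": 0.72, "mem": 0.61}, "error_code": None},
    {"host": "web-2", "metrics": {"cpu": 0.95},              "error_code": "E42"},
    {"host": "web-3", "metrics": {"cpu": 0.40, "mem": 0.33}, "error_code": None},
]

# Flatten the nested "metrics" struct into top-level columns.
df = pd.json_normalize(records)          # -> host, error_code, metrics.cpu, metrics.mem

# Handle missing values explicitly rather than letting them leak into training.
df["metrics.mem"] = df["metrics.mem"].fillna(df["metrics.mem"].median())

# Sparse column: only a small subset of rows carry an error code.
print(df["error_code"].notna().mean())   # fraction of rows with a value
print(df)
```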

Rudimentary Software Engineering for ML

Today, traditional programs are delivered via a software engineering process to develop, test, deploy, and update — essentially a process to get the program from the developer's laptop to running in production. What is the equivalent process for an ML project? While there is a lot of material available on ML algorithms, the equivalent software process for deploying ML in production is not well documented, and the techniques followed today are ad hoc and rudimentary compared to traditional software engineering. Basic building blocks for regression testing, checks and bounds on production deployments, managing multiple versions of models, and data disentanglement are rudimentary.
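
As one example of a missing building block, here is a minimal, hypothetical regression test that gates a candidate model against the deployed baseline on a held-out set before promotion; the metric, tolerance, and loader helpers are assumptions.

```python
def evaluate_accuracy(model, features, labels):
    """Fraction of held-out examples the model predicts correctly."""
    predictions = model.predict(features)
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def test_candidate_does_not_regress(load_holdout, load_model):
    """Gate a candidate model against the currently deployed baseline.

    `load_holdout` and `load_model` are caller-supplied (hypothetical) helpers."""
    features, labels = load_holdout()
    baseline = evaluate_accuracy(load_model("baseline"), features, labels)
    candidate = evaluate_accuracy(load_model("candidate"), features, labels)
    # Allow a small tolerance so noise does not block every release.
    assert candidate >= baseline - 0.005, (
        f"Candidate accuracy {candidate:.3f} regressed below baseline {baseline:.3f}"
    )
```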

4. Innovative pricing for AI-based solutions

The popular pricing models for Enterprise IT Management software are pay-per-managed-instance, pay-per-seat, or pay-per-GB-of-collected-metrics. The goal of AI-based automation is to improve the productivity of human Ops. Do the existing pricing models accurately reflect the value delivered by AI-based solutions to Enterprises? And are these pricing models viable for the software vendor? This section explores pricing for AI-based Enterprise IT Management solutions.

The nirvana for Enterprises is to make IT costs completely on-demand, correlated directly with business revenue. The industry has been evolving toward this ideal with the growing adoption of SaaS and cloud computing. Pay-as-you-go pricing is the de facto standard today for Enterprise IT operations in the cloud. Even for on-premise deployments, perpetual license models are being pushed to become more usage-oriented. The logical progression of pay-as-you-go models is an even finer-grained unit of consumption — AWS Lambda, Azure Functions, and GCP Cloud Functions are examples of a pay-per-execution model in which pricing is based on the number of invocations of a function — simply put, a serverless model where the customer does not pay for a running compute instance that is idle.

In the context of AI-based automation, each of these pricing models has limitations:

  • Pay-per-seat: This pricing model is tied to the number of human Ops using the software. By definition, it contradicts the value proposition: as productivity improves, the need for human Ops should shrink, which reduces the number of seats required — by adding intelligence that improves productivity, software vendors would actually reduce their own license revenue! This model is broken.
  • Pay-per-managed-instance: Traditionally, there was a direct correlation between the size of the cluster and the amount of human effort required to manage it. Given the changing landscape of IT automation, this correlation is becoming increasingly questionable. For instance, using Puppet or Chef for deployments, the effort to deploy 1,000 instances is only incrementally more than deploying a single instance. While this pricing model is feasible for software vendors, my personal experience is that customers perceive themselves as over-paying under this model.
  • Pay-per-GB-of-collected-metrics: This model is being used successfully for solutions that collect logs and metrics for analysis. Customers find the pricing intuitive since it correlates with the cost of persisting the monitored data and the effort of analysis. While this is a good model for data analysis automation, it becomes less intuitive for complete AI-based automation that includes analysis, optimization and planning across known solutions, and recommendations. Planning and optimization techniques are significantly more expensive to develop, and software vendors will not be fully compensated if they rely only on the pay-per-GB model.

The right pricing model is one of the key elements of the startup discovery process — should it be cost-based, value-based, or competition-based? Today, most startups adopt competitive pricing models, since customers are already familiar with them and tend to question them less (one less adoption friction to overcome!). To conclude this post, we share our experiences with cost- and value-based pricing, based on developing an AI-based service and interacting with customers:

  • Cost-based Pricing Strategy (i.e., what it takes for software vendors): AI-based systems typically build deep neural networks or similar models that require heavy compute resources. The same machine learning model can be applied to a customer cluster of 10 nodes or 1,000 nodes. The base cost of creating such models is significant, with the cluster size adding only incremental overhead. As such, pricing based on the number of instances aligns with cost-based pricing only for larger customer deployments (the sweet-spot deployment size will vary).
  • Value-based perspective (i.e., what the customer gets): The key metric for the customer is Ops productivity. Ideally, if a specific task takes a human 2 hours, the AI bot can be charged on a per-operation basis, as some fraction of the human cost (a toy calculation follows this list). But defining an intuitive benchmark for human time is non-trivial. Alternatively, the pricing can be structured as recruiting bots onto the Ops team. Similar to humans, bots can vary in proficiency from apprentice to expert. For instance, for a capacity-planning task, an apprentice bot might analyze historical load patterns to provide statistical distributions, while an expert bot not only provides the distributions but also optimizes across all known alternatives and provides a recommendation. Software vendors can price apprentice and expert bots differently (essentially a feature-based tiered model, but hopefully more intuitive).
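
To make the value-based framing concrete, here is a toy calculation comparing a per-operation bot charge (priced as a fraction of the displaced human cost) against a flat per-managed-instance license; every number is made up for illustration.

```python
# Hypothetical inputs for a single Ops task (e.g., a capacity-planning run).
human_hours_per_task = 2.0
human_hourly_rate = 80.0            # fully loaded cost, USD/hour
bot_fraction_of_human_cost = 0.25   # charge 25% of the displaced human cost
tasks_per_month = 120

per_operation_price = human_hours_per_task * human_hourly_rate * bot_fraction_of_human_cost
monthly_value_based = per_operation_price * tasks_per_month

# Flat per-managed-instance license for comparison.
instances = 500
price_per_instance = 10.0           # USD/instance/month
monthly_per_instance = instances * price_per_instance

print(f"Per-operation: ${per_operation_price:.0f}/task, ${monthly_value_based:.0f}/month")
print(f"Per-instance:  ${monthly_per_instance:.0f}/month")
```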
