Artificial Intelligence for IT operations (AIOps) is all about adding context to large volumes of data in order to refine the conclusions reached in such areas as performance analysis, IT service management (ITSM), and IT operations management (ITOM). It has particular value in event correlation, event analysis, anomaly detection, root cause analysis, natural language processing, automation, and diagnostics.
AIOps platforms use a variety of ways to harness AI, big data, and machine learning to analyze IT data. The ultimate goal is the discovery of patterns that can be used to predict incidents, spot emerging behavior, determine root causes, and drive automation.
AIOps is fairly new to the networking and IT management party. The field is largely emerging from the early adopter stage. However, Gartner predicts that by 2023, 40% of DevOps teams will have added AIOps capabilities to ongoing application and infrastructure monitoring efforts.
The big ITSM and ITOM vendors such as Splunk and BMC are now rushing to add AIOps capabilities to their existing suites, but for the moment the market is largely dominated by smaller companies and startups. In all likelihood, the next year or two will see some of the firms covered in this guide gobbled up by larger players.
Table of Contents
- Key AIOps Platform Features
- AIOps Use Cases
- Top AIOps Companies
There are many AIOps platforms out there as well as broader ITOM and ITSM suites that are introducing AIOps functionality. The key features to look for include:
- Data ingestion: AIOps tools must make it easy to ingest data from multiple sources. These sources vary from organization to organization. In some cases, broad ingestion is required that encompasses infrastructure, networks, applications, cloud monitoring tools, and existing management and monitoring tools. But in other cases, domain-specific data ingestion will be more important as a means of taking care of global issues on Salesforce, Azure, AWS, etc.
- Machine learning for real-time and historical analysis.
- Storage of, and access to, the data to be analyzed.
- The ability to provide recommendations and suggest courses of action based on analysis.
- Automation of actions based on analyses and IT policy setting.
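As an illustration of the data ingestion requirement above, the following is a minimal sketch of normalizing events from two different monitoring sources into one common record. The source names (`cloud_monitor`, `network_tool`), their payload fields, and the severity mappings are all hypothetical; real tools expose different payloads, and no particular AIOps product is being described.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Common schema that every ingested event is mapped onto."""
    source: str
    host: str
    severity: str  # "info", "warning", or "critical"
    message: str
    timestamp: datetime


def from_cloud_monitor(raw: dict) -> Event:
    # Hypothetical payload: LOW/MEDIUM/HIGH levels and epoch timestamps.
    levels = {"LOW": "info", "MEDIUM": "warning", "HIGH": "critical"}
    return Event(
        source="cloud_monitor",
        host=raw["resource"],
        severity=levels[raw["level"]],
        message=raw["description"],
        timestamp=datetime.fromtimestamp(raw["epoch_seconds"], tz=timezone.utc),
    )


def from_network_tool(raw: dict) -> Event:
    # Hypothetical payload: numeric severity (lower is worse) and ISO 8601 time.
    sev = raw["sev"]
    severity = "critical" if sev <= 2 else "warning" if sev <= 4 else "info"
    return Event(
        source="network_tool",
        host=raw["device"],
        severity=severity,
        message=raw["msg"],
        timestamp=datetime.fromisoformat(raw["time"]),
    )


# One parser per source; supporting a new source means adding one entry here.
PARSERS = {
    "cloud_monitor": from_cloud_monitor,
    "network_tool": from_network_tool,
}


def ingest(source: str, raw: dict) -> Event:
    """Convert a raw payload from a known source into the common schema."""
    return PARSERS[source](raw)
```

Once every source is mapped to the same schema, the later stages (analysis, recommendations, automation) can work on one event type instead of one per tool.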
AIOps is deployed for a variety of reasons. These include:
- Rapid growth in IT data and vast numbers of alerts from different monitoring tools
- Constant change of IT architectures making it hard to maintain observability
- Demand for automation of recurring tasks to relieve the work burden and move from reactive to proactive/predictive maintenance
- An overabundance of noise and false alarms from IT systems
- The need for speed in detection and resolution of abnormal conditions, behavior, and threats
- Detection of trends that may result in outages, or that may impact the business
- Digital transformation initiatives that demand data be gathered from a larger and more varied set of sources, and that require integration of that data as well as its rapid analysis
- A complement to DevOps adoption: AIOps adds visibility and automation so that DevOps can be effectively used to speed development.
We have reviewed the market for AIOps and judged the following to be among the top platforms available, in no particular order. These companies are among the early innovators in the AIOps field, and score well in analyst reports in terms of functionality and maturity.
Dynatrace addresses the fact that cloud complexity has expanded beyond the scope of manual management. The company provides a software intelligence platform beyond infrastructure and application monitoring that includes the user experience and business outcome key performance indicators (KPIs). The Dynatrace AI engine, Davis, automatically processes billions of dependencies in real-time, continuously monitors the full stack for system degradation and performance anomalies, and delivers root-cause determination, prioritized by business impact.
- The Dynatrace platform can harness and unify complex multi-clouds with out-of-the-box support for all major cloud platforms and technologies.
- Over 500 integrations out-of-the-box and further support for custom integrations through open APIs.
- The ability to drive automation in everything from development and releases to cloud operations, business processes, and application security. There is no need to configure, script, or know which applications or cloud platforms are running.
- Its AI engine, Davis, is built into the core of the platform rather than bolted on; according to Dynatrace, it therefore needs no learning period and does not rely on statistical guesses.
- Dynatrace modules for infrastructure monitoring, applications and microservices, application security, digital experience, business analytics, and cloud automation extend the Dynatrace platform for use cases such as time-to-market, efficiency, and cost of ownership.
Datadog Watchdog is a machine learning engine that identifies unknowns within cloud infrastructure, applications, and logs, discovers and alerts on root cause, and helps teams prevent issues before they impact users. Watchdog alerts on abnormal symptoms, accelerates troubleshooting by providing context, and connects the dots across an environment to provide root causes. Watchdog Alerts bring to light symptoms (anomalies, outliers, and other problem areas) without any manual work required to specify thresholds.
Another aspect of the solution, Watchdog Insights, surfaces meaningful signals with context within existing workflows, so issues can be understood in relation to the entire cloud environment. The last component, Watchdog RCA (root cause analysis), determines the underlying causes of issues to ensure the full impact of an issue or outage is addressed, and future incidents are prevented.
- Datadog Watchdog covers applications, infrastructure, and logs.
- It identifies issues with Kubernetes, bringing to light information from Datadog’s monitoring platform.
- A broad technology and cloud coverage across the platform allows it to identify issues from the entire stack and correlate disparate issues.
- No manual effort or setup from users required; it finds symptoms, insights, and root causes automatically.
- Accelerates the troubleshooting process without being disruptive, reducing alerts.
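To make the idea of alerting without manually specified thresholds more concrete, here is a minimal sketch of one common approach: comparing each new data point against a rolling baseline built from recent history. This is not Datadog's algorithm, which the source does not describe; the function name `find_anomalies` and the `window` and `z_limit` parameters are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(values, window=10, z_limit=3.0):
    """Return the indices of points that deviate strongly from recent history.

    Each point is compared with the mean and standard deviation of the
    preceding `window` points, so the baseline adapts to the metric itself
    instead of relying on a fixed, hand-set threshold.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            deviation = abs(value - mu)
            if sigma == 0:
                # Perfectly flat baseline: any change at all is unusual.
                is_anomaly = deviation > 0
            else:
                is_anomaly = deviation / sigma > z_limit
            if is_anomaly:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Example: response times in milliseconds with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 100, 250, 99]
spikes = find_anomalies(latency_ms)
```

A production system would typically also account for seasonality and keep anomalous points from polluting the baseline; this sketch leaves both out for brevity.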
Applied Intelligence from New Relic puts AI-assisted incident response in the hands of IT to detect, understand, and resolve incidents. It helps IT to eliminate guesswork and solve problems faster with automatic insight into the probable root cause of incidents. IT can see why each open issue occurred, which services and systems were impacted, and what actions are needed for resolution. Integration with incident management tools speeds remediation workflows and keeps incidents in sync across ITSM and observability tools.
- Automatically spot anomalies based on signals like throughput, errors, and latency across all applications, services, and log data.
- Zero configuration needed.
- Notifications to Slack and other collaboration tools.
- Troubleshoot faster with anomaly analytics to prevent potential problems before they impact customers.
- Reduce the flood of redundant alerts by up to 80% with automatic grouping of alerts and events from any source; instead of alert storms across multiple tools, engineers see one issue with all information needed to take action.
- Events are automatically correlated based on time, context from alert messages, and relationship data across systems.
- Pre-trained machine learning models eliminate steep learning curves.
- Trigger remediation workflows through two-way integrations with incident management tools like ServiceNow and PagerDuty.
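The correlation of events by time and relationship data described above can be sketched roughly as follows. This is an illustrative simplification, not New Relic's implementation: the `Alert` and `Issue` types, the `related` mapping, and the five-minute window are all assumptions, and the use of context from alert messages is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


@dataclass
class Issue:
    alerts: list = field(default_factory=list)


def correlate(alerts, related, window_seconds=300):
    """Group alerts into issues by time proximity and service relationships.

    `related` maps a service name to the set of services it is connected to.
    An alert joins an existing issue when it arrives within `window_seconds`
    of that issue's latest alert and concerns the same or a related service.
    """
    issues = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for issue in issues:
            last = issue.alerts[-1]
            close_in_time = alert.timestamp - last.timestamp <= window_seconds
            linked = any(
                alert.service == other.service
                or other.service in related.get(alert.service, set())
                or alert.service in related.get(other.service, set())
                for other in issue.alerts
            )
            if close_in_time and linked:
                issue.alerts.append(alert)
                break
        else:
            # No existing issue matched, so this alert opens a new one.
            issues.append(Issue(alerts=[alert]))
    return issues


alerts = [
    Alert(0, "db", "slow queries"),
    Alert(60, "checkout", "latency high"),
    Alert(90, "checkout", "error rate up"),
    Alert(5000, "search", "index lag"),
]
issues = correlate(alerts, related={"checkout": {"db"}})
```

In this example the three database and checkout alerts collapse into a single issue, while the unrelated and much later search alert stays separate, which is the kind of reduction that turns an alert storm into one actionable item.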
Broadcom inherited some of its AIOps technology from the acquisition of CA Technologies. Built on the Broadcom Automation.ai platform, it correlates data across users, applications, cloud-native architecture, hybrid infrastructures, and network services, then applies machine learning, advanced analytics, and automation to deliver visibility and insight. The goal is to turn data into action to drive continuous improvement, speed service delivery, increase IT efficiency, and accelerate innovation. It also makes it possible for operations teams to optimize service levels, operations, and business outcomes.
- Full-stack observability into the digital experience
- App-to-network monitoring visibility
- Support for cloud-native architectures
- Alignment to business services with measurable KPIs to prioritize issues and relate performance to business goals
- Domain-centric and domain-agnostic AIOps
- Broadcom offers deep domain expertise in network, application, and infrastructure
- Site reliability analytics and automation capabilities that provide cross-correlated BizOps insights into release deployment events and associated build metrics within the context of overall health and KPIs
- Unified, scalable AI-driven network monitoring for traditional, SDN, and cloud networks
- Algorithmic root-cause analytics along with intelligent automation to automatically identify and remediate issues.
Moogsoft delivers an enterprise-class, cloud-native platform that empowers customers to drive adoption at their own pace at lower cost. It reduces noise despite the presence of huge data volumes, to enable IT to detect and fix outages rapidly. Its enrichment features add context to ingested alerts from various data sources to provide actionable insights. Correlation makes logical connections between data from anywhere in technology stacks.
- Only elevates critical situations, so IT can resolve incidents before they cause outages.
- Reduces alert volumes via a consolidated monitoring panel and correlation of similar events to minimize actionable alerts.
- Aggregates all apps, services, and infrastructure alerts to a single console for increased agility, fewer alerts and faster resolution times.
- Partnership with Datadog for incident management.
- Identifies anomalies that are outside normal operating behavior and impede the customer experience.
- Users can build their own integration to any data source for full observability.
- Collaboration across teams to resolve complex multi-service incidents.
AppDynamics by Cisco is a way to empower IT to address the problems posed by real-time applications and the demand for business agility and responsiveness. Aimed especially at multi-cloud environments, it offers real-time performance monitoring backed by machine learning. Harnessing Cisco’s domain expertise in networking and storage, it offers detailed insight into IT operations.
- Detect issues with real-time monitoring before they impact customers
- Gain end-to-end visibility to plan with accuracy, migrate to the cloud with confidence, and validate success
- Improve the end-user experience and drive business results through enhanced application performance
- Reduce mean time to resolution with machine learning-based root-cause analysis
- Correlate software and business KPIs to diagnose performance issues
- Works in public, private, or multi-cloud environments
- Low-overhead monitoring agents
- Secure architecture and granular, role-based access controls.
Zenoss has been designed to optimize application performance in simple infrastructures as well as complex multi-cloud IT deployments. It collects and analyzes metrics, streaming data, dependency data, events, logs, and agent data. Machine learning is combined with real-time, dynamic models of IT services and applications to perform root-cause analysis. It also offers performance status and the status of all systems and applications at any point in time.
- Real-time modeling helps IT to gain awareness of end-to-end infrastructure-related risks
- Isolates problems to shorten time to recovery and eliminate service outage losses
- Visibility of overall IT service health with intelligent dashboards and reports
- Collaboration across teams to coordinate investigation and problem-solving
- Predictive analytics
- Visualizes performance and anomalies across all on-premises and cloud infrastructures
- Applies consistent monitoring policies across all cloud and on-premises systems
- Delivers management as a service for DevOps teams
- Shares data and insights with other ITOM tools.
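The role a dependency model plays in root-cause analysis can be shown with a small sketch. It is not Zenoss's method, which the source does not detail; the `depends_on` mapping, the function names, and the rule used (an unhealthy component whose direct dependencies are all healthy is a likely root cause) are assumptions made for illustration.

```python
def root_cause_candidates(depends_on, unhealthy):
    """Return unhealthy components that have no unhealthy direct dependency.

    `depends_on` maps each component to the set of components it relies on.
    A failing component whose dependencies are all healthy is more likely to
    be the origin of a problem than a victim of something beneath it.
    """
    unhealthy = set(unhealthy)
    return {
        node
        for node in unhealthy
        if set(depends_on.get(node, ())).isdisjoint(unhealthy)
    }


def impacted_by(depends_on, root):
    """Return every component that depends on `root`, directly or indirectly."""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for node, deps in depends_on.items():
            deps = set(deps)
            if node not in impacted and (root in deps or impacted & deps):
                impacted.add(node)
                changed = True
    return impacted


# Hypothetical service model: web -> api -> (db, cache), db -> storage.
depends_on = {
    "web": {"api"},
    "api": {"db", "cache"},
    "db": {"storage"},
    "cache": set(),
    "storage": set(),
}
candidates = root_cause_candidates(depends_on, unhealthy={"web", "api", "db"})
blast_radius = impacted_by(depends_on, "db")
```

Here `web`, `api`, and `db` are all reporting problems, but only `db` has no failing dependency of its own, so it is the single candidate, and `api` and `web` are the services affected by it.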
ScienceLogic discovers all components within the enterprise across physical, virtual, and cloud environments and stores the data in a data lake. It then helps IT to understand relationships between infrastructure, applications, and business services, using this context to gain actionable insights. The company claims a 60% reduction in incidents and a 25% improvement in time to recovery.
- Consolidates big data from all enterprise IT management tools and data sources into a real-time data lake.
- Exchanges and optimizes cross-ecosystem data for visibility.
- Auto-remediation of issues.
- Auto-mapping and tracking of relationships across infrastructure, clouds, applications, and business services.
- Multi-directional integrations to automate actions at cloud scale.