Intermediate · Automation & Monitoring
Tags: AIOps · AI · Network Automation · Python · LLM · Machine Learning · NOC

AI and AIOps for Network Engineers: Anomaly Detection, LLM-Assisted Configs, and Predictive Operations

March 13, 2026·16 min read

Overview

AI is not replacing network engineers. It is changing what takes time. The tasks that used to consume hours — correlating alerts across 2,100 nodes, spotting a gradual interface error rate increase before it becomes a P1, generating boilerplate configs for a new site deployment — are increasingly handled by models. The engineer's job shifts from execution to judgment: reviewing AI output, making architectural decisions, and managing incidents that models cannot resolve on their own.

This post is practical. It covers what is actually useful today — AIOps anomaly detection, LLM-assisted config work, predictive capacity planning, and vendor AI platforms — not speculative futures. Where AI still falls short, that is noted too.


What AIOps Actually Means

AIOps (Artificial Intelligence for IT Operations) applies machine learning to operational data — telemetry, logs, events, metrics — to automate detection, correlation, and in some cases remediation. For network operations specifically, the value is in three areas:

  1. Signal-to-noise reduction: A 42-country network generates thousands of SolarWinds alerts per week. Most are noise. ML separates real anomalies from normal variance.
  2. Pattern recognition at scale: A model can correlate interface flaps across 200 devices simultaneously and identify that they all share an upstream switch — in seconds. A human takes 20 minutes.
  3. Predictive action: Forecast bandwidth saturation 48 hours before it happens based on trend analysis — not after users are calling.
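Point 2 can be sketched in a few lines of pandas: given interface flap events and a device-to-upstream mapping from your topology data (all device names and column names below are illustrative), grouping by time window and shared upstream surfaces the common root cause instead of N separate incidents.

```python
import pandas as pd

# Hypothetical flap events and a device -> upstream mapping from your
# topology/CMDB; names here are illustrative only.
events = pd.DataFrame({
    "device": ["sw-01", "sw-02", "sw-03", "sw-07"],
    "timestamp": pd.to_datetime([
        "2026-03-13 02:14:35", "2026-03-13 02:14:36",
        "2026-03-13 02:14:37", "2026-03-13 04:00:00",
    ]),
})
upstream = {"sw-01": "dist-A", "sw-02": "dist-A", "sw-03": "dist-A", "sw-07": "dist-B"}

events["upstream"] = events["device"].map(upstream)

# Bucket flaps into 5-minute windows; many devices sharing one upstream in
# the same window points at a single root cause.
events["window"] = events["timestamp"].dt.floor("5min")
grouped = events.groupby(["window", "upstream"]).size().reset_index(name="flaps")
suspects = grouped[grouped["flaps"] >= 3]
print(suspects)  # dist-A flagged: three downstream devices flapped together
```

A real pipeline would stream syslog into this instead of a hand-built frame, but the correlation step itself is exactly this groupby.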
[Diagram: AIOps pipeline, telemetry to action. Network devices (routers and switches via SNMP/NetFlow/Syslog; firewalls via flow and security events) and the monitoring stack (SolarWinds/NMS alerts and metrics) feed data ingestion (Kafka, Logstash: normalise and enrich) into a time-series store (InfluxDB, Prometheus, long-term retention). An ML engine runs anomaly detection (Isolation Forest, LSTM) and a correlation engine (event grouping, root-cause ranking), producing alerts and actions (PagerDuty, tickets, auto-remediation). A feedback loop sends alert outcomes back to retrain the model; the false positive rate is tracked over time.]

Anomaly Detection: Statistical vs ML-Based

Statistical Baselining (What SolarWinds Does Today)

SolarWinds NPM already does basic anomaly detection: it calculates rolling averages and alerts when a metric exceeds baseline + N standard deviations. This works for simple, stable metrics but breaks down for:

  • Metrics with seasonal patterns (higher traffic on Monday mornings — normal)
  • Correlated anomalies (10 devices behaving oddly simultaneously — probably a shared root cause, not 10 separate incidents)
  • Slow-moving drift (interface error rate increasing 5% per week — below threshold, but trending to failure)
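The slow-drift case is easy to demonstrate with synthetic data: a rolling mean + 3-sigma threshold tracks the drift itself, so it rarely fires even as the metric climbs toward failure. A minimal illustration (NumPy only, made-up numbers, not a SolarWinds API):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic error-rate series: flat noise plus a slow upward drift
# (hourly samples over 8 weeks, roughly 5% growth per week).
hours = 24 * 7 * 8
drift = np.linspace(0, 0.40, hours)
series = 1.0 + drift + rng.normal(0, 0.05, hours)

# Rolling-baseline alert: mean + 3 sigma over the previous 7 days.
window = 24 * 7
alerts = 0
for i in range(window, hours):
    baseline = series[i - window:i]
    if series[i] > baseline.mean() + 3 * baseline.std():
        alerts += 1

# The baseline absorbs the drift, so alerts stay rare even though the
# metric has grown substantially by the end of the run.
print(f"Alerts fired: {alerts}")
print(f"Growth: {series[-window:].mean() / series[:window].mean():.2f}x")
```

The same series fed to an Isolation Forest with a util_delta-style trend feature gets flagged, which is the motivation for the next section.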

Isolation Forest for Network Anomaly Detection

Isolation Forest is an unsupervised ML algorithm that identifies anomalies by how few splits it takes to isolate a data point. Normal points require many splits; anomalies are isolated quickly.

```python
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Load interface metrics from SolarWinds export or Prometheus
# Columns: timestamp, interface_id, in_utilization, out_utilization,
#          in_errors, out_errors, in_discards, out_discards
df = pd.read_csv('interface_metrics_30d.csv', parse_dates=['timestamp'])
df = df.sort_values('timestamp')

# Feature engineering
df['error_rate'] = (df['in_errors'] + df['out_errors']) / (
    df['in_utilization'] + df['out_utilization'] + 1)
df['discard_rate'] = (df['in_discards'] + df['out_discards']) / (
    df['in_utilization'] + df['out_utilization'] + 1)
df['util_delta'] = df.groupby('interface_id')['in_utilization'].diff().abs()

features = ['in_utilization', 'out_utilization', 'error_rate',
            'discard_rate', 'util_delta']
X = df[features].fillna(0)

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train Isolation Forest
# contamination=0.01 means we expect ~1% of data points to be anomalies
model = IsolationForest(
    n_estimators=200,
    contamination=0.01,
    random_state=42
)
model.fit(X_scaled)

# Predict: -1 = anomaly, 1 = normal
df['anomaly'] = model.predict(X_scaled)
df['anomaly_score'] = model.score_samples(X_scaled)  # more negative = more anomalous

# Report anomalies
anomalies = df[df['anomaly'] == -1].copy()
anomalies = anomalies.sort_values('anomaly_score')
print(f"Total anomalies detected: {len(anomalies)}")
print(anomalies[['timestamp', 'interface_id', 'in_utilization',
                 'error_rate', 'anomaly_score']].head(20))
```

LSTM for Time-Series Prediction

LSTM (Long Short-Term Memory) networks learn temporal patterns — ideal for predicting bandwidth utilization 24–48 hours ahead based on historical trends.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler

# Prepare time-series data — hourly bandwidth utilization for one interface
df = pd.read_csv('bandwidth_hourly.csv', parse_dates=['timestamp'])
df = df.set_index('timestamp').resample('1H').mean().ffill()
values = df['utilization_pct'].values.reshape(-1, 1)

# Normalise to 0–1
scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)

# Create sequences: use last 168 hours (1 week) to predict next 24 hours
LOOK_BACK = 168
PREDICT_STEPS = 24

def create_sequences(data, look_back, predict_steps):
    X, y = [], []
    for i in range(len(data) - look_back - predict_steps):
        X.append(data[i:i+look_back])
        # take column 0 so y has shape (samples, predict_steps),
        # matching the Dense(PREDICT_STEPS) output
        y.append(data[i+look_back:i+look_back+predict_steps, 0])
    return np.array(X), np.array(y)

X, y = create_sequences(scaled, LOOK_BACK, PREDICT_STEPS)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Build LSTM model
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(LOOK_BACK, 1)),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),
    Dense(PREDICT_STEPS)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=20, batch_size=32,
          validation_data=(X_test, y_test), verbose=1)

# Predict next 24 hours
last_week = scaled[-LOOK_BACK:].reshape(1, LOOK_BACK, 1)
predicted_scaled = model.predict(last_week)
predicted = scaler.inverse_transform(predicted_scaled.reshape(-1, 1))

# Alert if any predicted hour exceeds 85%
if predicted.max() > 85:
    print(f"CAPACITY ALERT: Interface predicted to reach {predicted.max():.1f}% "
          f"utilization within 24 hours")
```

LLM-Assisted Network Operations

Large language models are genuinely useful for network engineers in three areas: config generation, config review/audit, and explaining unfamiliar output. They are not useful for real-time decision-making during incidents — the model has no live view of your network.

Config Generation

LLMs excel at generating boilerplate. You provide the intent and constraints; the model writes the first draft.

Effective prompt pattern:

```text
Context: Cisco IOS 15.x router, hub site, OSPF area 0 backbone
Task: Generate BGP config for dual-homed ISP connectivity
Requirements:
- AS 65001 (internal), ISP-A AS 1234, ISP-B AS 5678
- Prefer ISP-A for outbound (higher local-preference), fail over to ISP-B
- Accept only default route from both ISPs (no full table)
- Apply inbound prefix filter: deny anything more specific than /24
- MD5 authentication on both peers
- Log neighbor state changes
Output: Only the router bgp configuration block
```

The output will be ~90% correct for standard patterns. Your job: review for site-specific values, verify authentication keys, confirm prefix list names match your naming convention, and test in a lab before production.

What LLMs get wrong on network configs:

  • Vendor-specific syntax differences (IOS vs IOS-XE vs NX-OS have subtle differences)
  • Stateful context (the model doesn't know your existing BGP table or what's already configured)
  • Complex interactions (a route-map that interacts with an existing policy may conflict)

Config Review and Audit

Paste a config block and ask the model to identify security issues, missing hardening, or deviations from best practices. This is more reliable than generation because you are the ground truth — the model is just a fast second pair of eyes.

```text
Review this Cisco IOS router config for security issues.
Flag: missing hardening, insecure protocols, weak authentication,
      unnecessary services, logging gaps, and AAA issues.

[paste config block]
```

This catches things like service password-encryption being absent, no ip source-route not applied, the SNMPv2c community "public" still active, no ip proxy-arp missing on user-facing interfaces, and an undersized logging buffered setting.
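Part of this audit can also run as a deterministic pre-check before (or alongside) the model: a sketch that diffs a config dump against a hardening checklist. The required and forbidden lines below are illustrative examples, not a complete standard.

```python
# Illustrative hardening checklist -- extend with your own standard.
REQUIRED_LINES = [
    "service password-encryption",
    "no ip source-route",
    "no ip http server",
    "logging buffered 64000",
]
FORBIDDEN_SUBSTRINGS = [
    "snmp-server community public",   # default community left enabled
    "snmp-server community private",
]

def audit_config(config_text: str) -> list:
    """Return human-readable findings for a config dump."""
    lines = {line.strip() for line in config_text.splitlines()}
    findings = []
    for required in REQUIRED_LINES:
        if required not in lines:
            findings.append(f"MISSING: {required}")
    for bad in FORBIDDEN_SUBSTRINGS:
        if any(bad in line for line in lines):
            findings.append(f"INSECURE: {bad}")
    return findings

sample = """\
hostname RTR-CEBU-01
no ip source-route
snmp-server community public RO
logging buffered 64000
"""
for finding in audit_config(sample):
    print(finding)
```

The checklist catches the known-knowns cheaply and consistently; the LLM review earns its keep on the judgement calls the list cannot encode.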

Explaining Unfamiliar Output

Paste an unknown error, BGP state output, or debug trace and ask for an explanation. This is the highest-value, lowest-risk use — the model is a knowledgeable colleague explaining what something means.

```text
Explain what this BGP debug output means and what likely caused it:

*Mar 13 02:14:35.123: BGP: 203.0.113.1 went from Established to Idle
*Mar 13 02:14:35.123: %BGP-5-ADJCHANGE: neighbor 203.0.113.1 Down
  BGP Notification sent to neighbor: hold time expired
*Mar 13 02:14:38.456: BGP: 203.0.113.1 active, open failed - Connection refused
```

The model will correctly identify the sequence: the hold timer expired (no keepalives received from the neighbor within the configured hold time), then the reconnect attempt was refused (the peer may have an ACL or a process issue). It will suggest checking for an MTU mismatch causing large BGP UPDATE packets to be dropped, an ISP-side filter blocking TCP 179, or a restarted BGP process on the peer.
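For patterns like this, counting does not even need a model: a small sketch that parses %BGP-5-ADJCHANGE syslog lines and tallies Down events per neighbor, so you know whether you are looking at a one-off or a chronic flap before asking anything to explain it (sample lines are illustrative).

```python
import re
from collections import Counter

# Matches standard IOS adjacency-change syslog lines, e.g.:
#   %BGP-5-ADJCHANGE: neighbor 203.0.113.1 Down hold time expired
ADJCHANGE = re.compile(r"%BGP-5-ADJCHANGE: neighbor (\S+) (Up|Down)")

def count_flaps(syslog_text: str) -> Counter:
    """Count Down events per BGP neighbor from raw syslog text."""
    downs = Counter()
    for line in syslog_text.splitlines():
        m = ADJCHANGE.search(line)
        if m and m.group(2) == "Down":
            downs[m.group(1)] += 1
    return downs

sample = """\
*Mar 13 02:14:35: %BGP-5-ADJCHANGE: neighbor 203.0.113.1 Down hold time expired
*Mar 13 02:20:11: %BGP-5-ADJCHANGE: neighbor 203.0.113.1 Up
*Mar 13 03:02:44: %BGP-5-ADJCHANGE: neighbor 203.0.113.1 Down hold time expired
*Mar 13 03:05:01: %BGP-5-ADJCHANGE: neighbor 198.51.100.9 Down peer closed
"""
flaps = count_flaps(sample)
print(flaps.most_common())  # repeated Downs on one neighbor = chronic flap
```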


Vendor AI Platforms

| Platform | Vendor | Key AI Capabilities | Best For |
|---|---|---|---|
| Cisco AI Network Analytics | Cisco / DNA Center | Baseline deviation detection, client experience scoring, encrypted traffic analysis (ETA) | Campus networks, Catalyst / DNA Center deployments |
| Juniper Mist AI | Juniper / Mist | Marvis AI assistant (NLP queries), anomaly detection, SLE (Service Level Experience) scoring, root-cause identification | Wireless and wired access layer, SD-WAN |
| Aruba AIOps | Aruba / HPE | AI Insights for wireless, client health scoring, predictive alerts before user impact | Aruba wireless and CX switching |
| VeloCloud Edge Intelligence | VMware / Broadcom | Application path quality prediction, WAN link anomaly detection, QoS auto-tuning | SD-WAN environments with VeloCloud edges |
| Darktrace | Darktrace | Unsupervised ML for threat detection, "Enterprise Immune System" — learns normal, flags deviations | Security-focused anomaly detection across network and cloud |
| Splunk ITSI | Splunk | KPI monitoring, ML-based alert grouping, episode review, predictive analytics | Large NOC environments with existing Splunk investment |

Juniper Mist AI — Marvis in Practice

Marvis is worth calling out specifically. It is a natural language AI assistant embedded in the Mist dashboard. You can ask it: "Why did the wireless experience degrade in the Cebu office yesterday?" and it will correlate RF data, DHCP failures, authentication events, and uplink errors to give a ranked list of probable causes with specific AP and time references.

For wireless troubleshooting in particular — where there are dozens of radio variables, client types, and RF environment factors — correlation that would take an engineer 45 minutes manually takes Marvis about 3 seconds.


AI-Assisted Troubleshooting Workflow

[Diagram: AI-assisted troubleshooting workflow. A reported problem (PagerDuty alert, user ticket, NOC observation) feeds two parallel tracks: AI correlation (AIOps / vendor AI) and engineer triage (layer-by-layer checks). AI output (ranked probable causes, correlated events) plus LLM assistance (explain debug output, suggest fix commands) go to the engineer, who reviews the AI suggestions and confirms before applying. The fix goes through change control and deployment, then verification and ticket closure. AI never applies changes directly; the human stays in the loop.]

The key principle: AI assists, engineer decides. The model presents ranked hypotheses. The engineer validates with real commands. No AI system in production networking should apply config changes without human review and approval. The blast radius of a wrong automated change in a 42-country network is too large.
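That gate can be made structural in your tooling rather than left to discipline: a sketch of an approval wrapper, where push_config is a hypothetical callable standing in for your actual deployment mechanism (Netmiko, NETCONF, a change pipeline), not a real library API.

```python
def apply_with_approval(device: str, config_lines: list, push_config, confirm=input) -> bool:
    """Show the proposed change and require an explicit 'yes' before pushing.

    push_config is a hypothetical callable (device, lines) that does the real
    deployment. confirm defaults to input() so an engineer answers
    interactively; tests or dry runs can inject a stub.
    """
    print(f"Proposed change for {device}:")
    for line in config_lines:
        print(f"  {line}")
    answer = confirm(f"Apply {len(config_lines)} lines to {device}? Type 'yes': ")
    if answer.strip().lower() != "yes":
        print("Aborted: no changes applied.")
        return False
    push_config(device, config_lines)
    print("Change applied. Verify, then close the ticket.")
    return True

# Demo with stubs so this runs non-interactively; in the NOC, confirm=input.
applied = []
apply_with_approval(
    "ASR1001-CEBU-HQ",
    ["interface Gi0/0/2", " description uplink to SW-CEBU-03"],
    push_config=lambda d, lines: applied.append((d, lines)),
    confirm=lambda prompt: "yes",
)
```

Anything AI-generated flows through this same function, so "no change without human approval" is enforced by code, not convention.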


Predictive Capacity Planning

Combining time-series forecasting with your network inventory enables data-driven capacity decisions — replacing the "it feels like we need an upgrade" conversation with "this link hits 85% sustained in 6 weeks at current growth rate."

```python
import pandas as pd
import numpy as np
from prophet import Prophet  # Facebook/Meta's forecasting library

# Load 6 months of daily peak utilization per interface
df = pd.read_csv('daily_peak_util.csv')

# Prophet requires columns: ds (date), y (value)
interface_df = df[df['interface_id'] == 'Gi0/0/1-CEBU-HQ'].copy()
interface_df = interface_df.rename(columns={'date': 'ds', 'peak_util_pct': 'y'})

# Fit model
model = Prophet(
    yearly_seasonality=False,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05  # regularize — prevent overfitting to noise
)
model.fit(interface_df)

# Forecast 90 days ahead
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)

# Find first day forecast exceeds 80%
breach_date = forecast[forecast['yhat'] > 80]['ds'].min()
if pd.notna(breach_date):
    days_until = (breach_date - pd.Timestamp.today()).days
    print(f"Interface Gi0/0/1-CEBU-HQ predicted to exceed 80% on {breach_date.date()}")
    print(f"That is {days_until} days from today — recommend capacity review")
else:
    print("No capacity breach predicted in next 90 days")
```

Integrating with your CMDB: Run this forecast weekly across all monitored interfaces. Feed results into ServiceNow as proactive capacity tickets assigned to the site owner — before users feel anything.
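A sketch of that ServiceNow hand-off using the Table API; the instance URL, table choice (incident), and field values below are assumptions to adapt to your own setup and ticket categories.

```python
import base64
import json
import urllib.request

SNOW_INSTANCE = "https://example.service-now.com"  # assumed instance URL

def build_capacity_ticket(interface_id: str, breach_date: str, days_until: int) -> dict:
    """Build a ServiceNow Table API payload for a proactive capacity ticket."""
    return {
        "short_description": f"Capacity: {interface_id} forecast to exceed 80% on {breach_date}",
        "description": (
            f"Forecast predicts {interface_id} exceeds 80% sustained "
            f"utilization in {days_until} days. Recommend capacity review."
        ),
        "category": "network",
        "urgency": "3" if days_until > 30 else "2",
    }

def create_ticket(payload: dict, user: str, password: str) -> dict:
    """POST the payload to the ServiceNow Table API (stdlib only)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{SNOW_INSTANCE}/api/now/table/incident",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["result"]

# Build the payload from the Prophet result; create_ticket() does the POST.
payload = build_capacity_ticket("Gi0/0/1-CEBU-HQ", "2026-05-01", 49)
print(payload["short_description"])
```

Run weekly per interface, this turns the forecast into an assigned ticket with a deadline instead of a dashboard nobody checks.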


Building a Simple Network AI Chatbot

With the Anthropic API, you can build a network-specific assistant that has context about your environment — device inventory, current alerts, recent changes — and answers questions or generates configs in your naming convention.

```python
import anthropic
import json

client = anthropic.Anthropic()

# Your network context — inject from CMDB / SolarWinds API
network_context = {
    "site": "CEBU-HQ",
    "devices": ["ASR1001-CEBU-HQ", "Catalyst9300-DIST-01", "PA-3220-FW-01"],
    "naming_convention": "ROLE-SITE-NUMBER (e.g., SW-CEBU-01)",
    "ip_scheme": "10.10.0.0/16 for CEBU site",
    "active_alerts": ["Gi0/0/1 on ASR1001-CEBU-HQ at 78% utilization"],
    "vendor_standard": "Cisco IOS-XE 17.x, Palo Alto PAN-OS 11.x"
}

def ask_network_ai(question: str) -> str:
    system_prompt = f"""You are a senior network engineer assistant with deep expertise in
    Cisco IOS/IOS-XE, Palo Alto firewalls, and enterprise network design.

    Current network context:
    {json.dumps(network_context, indent=2)}

    Rules:
    - Generate configs in the vendor standard specified
    - Use the naming convention provided
    - Use IPs from the site IP scheme
    - Flag any security concerns in configs you generate
    - Always note when a config change requires a maintenance window
    - Never suggest changes that would cause a network outage without warning
    """

    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": question}]
    )
    return message.content[0].text

# Example usage
response = ask_network_ai(
    "Generate an OSPF configuration for a new distribution switch "
    "SW-CEBU-03 that connects to ASR1001-CEBU-HQ in area 0. "
    "Include MD5 authentication and BFD."
)
print(response)
```

Where AI Falls Short

Being honest about limitations matters more than hype.

| Task | AI Reliability | Why |
|---|---|---|
| Generating standard configs (OSPF, BGP, VLAN) | High (80–90%) | Well-represented in training data. Review always required for site-specific values. |
| Explaining error messages and debug output | High (85–95%) | Pattern matching against known error types. Cross-check with official docs. |
| Anomaly detection on telemetry data | Medium-High | Works well when trained on your specific baseline. Generic models have high false positive rates. |
| Novel vendor-specific syntax (new OS versions) | Medium | Training data may predate latest features. Always verify with official docs. |
| Complex multi-vendor interactions | Medium-Low | Edge cases in redistribution, route policies, and multi-vendor HA not well-covered. |
| Real-time incident decision-making | Low | No live network state. Cannot verify if a suggested fix will actually work on your topology. |
| Security policy design (what SHOULD be allowed) | Low | AI doesn't know your business requirements, compliance obligations, or risk tolerance. |

The engineer who understands both networking fundamentals and how to effectively direct AI tools is significantly more productive than either a traditional engineer or an AI system alone. The model handles the repetitive synthesis work; the engineer provides context, judgment, and accountability.


Practical Starting Points

If you want to start applying AI to your network operations today, in order of effort and payoff:

  1. Use an LLM to explain unfamiliar debug output — zero setup, immediate value. Paste error messages into Claude or GPT-4 and ask for an explanation. Verify against Cisco TAC/docs.

  2. Add Isolation Forest to your SolarWinds data — export interface metrics to CSV weekly, run the anomaly detection script above, review the top 20 anomalies. You will find things your threshold alerts miss.

  3. Build a 90-day capacity forecast — pull peak utilization per interface from NPM, run Prophet, feed results into ServiceNow as proactive tickets. Eliminates reactive capacity surprises.

  4. Pilot a vendor AI platform on one site — Juniper Mist AI or Cisco DNA Center AI Analytics on your smallest site. Run it for 30 days and compare mean-time-to-identify against your previous baseline.

  5. Build a site-context-aware config assistant — the Anthropic API example above, extended with your actual CMDB data. Gives you a config generator that knows your naming conventions, IP scheme, and standards.