Overview
AI is not replacing network engineers. It is changing what takes time. The tasks that used to consume hours — correlating alerts across 2,100 nodes, spotting a gradual interface error rate increase before it becomes a P1, generating boilerplate configs for a new site deployment — are increasingly handled by models. The engineer's job shifts from execution to judgment: reviewing AI output, making architectural decisions, and managing incidents that models cannot resolve on their own.
This post is practical. It covers what is actually useful today — AIOps anomaly detection, LLM-assisted config work, predictive capacity planning, and vendor AI platforms — not speculative futures. Where AI still falls short, that is noted too.
What AIOps Actually Means
AIOps (Artificial Intelligence for IT Operations) applies machine learning to operational data — telemetry, logs, events, metrics — to automate detection, correlation, and in some cases remediation. For network operations specifically, the value is in three areas:
- Signal-to-noise reduction: A 42-country network generates thousands of SolarWinds alerts per week. Most are noise. ML separates real anomalies from normal variance.
- Pattern recognition at scale: A model can correlate interface flaps across 200 devices simultaneously and identify that they all share an upstream switch — in seconds. A human takes 20 minutes.
- Predictive action: Forecast bandwidth saturation 48 hours before it happens based on trend analysis — not after users are calling.
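The upstream-correlation idea in the second bullet needs nothing more exotic than a group-by over the alert export, assuming you can join each alerting device to its upstream parent from your CMDB or topology data. A minimal sketch (device names and columns below are made up, not a SolarWinds schema):

```python
import pandas as pd

# Hypothetical flap-alert export joined with an upstream mapping from a CMDB.
flaps = pd.DataFrame({
    "device":   ["SW-01", "SW-02", "SW-03", "SW-04", "SW-09"],
    "upstream": ["DIST-01", "DIST-01", "DIST-01", "DIST-01", "DIST-02"],
})

# Count distinct flapping devices per shared upstream parent.
by_upstream = flaps.groupby("upstream")["device"].nunique()

# Several children flapping at once points at the shared parent, not the children.
suspects = by_upstream[by_upstream >= 3]
for parent, n in suspects.items():
    print(f"Probable shared root cause: {parent} ({n} downstream devices flapping)")
```

Production AIOps platforms do this over live event streams with topology awareness, but the core move — collapse N symptom alerts into one cause hypothesis — is exactly this aggregation.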
Anomaly Detection: Statistical vs ML-Based
Statistical Baselining (What SolarWinds Does Today)
SolarWinds NPM already does basic anomaly detection: it calculates rolling averages and alerts when a metric exceeds baseline + N standard deviations. This works for simple, stable metrics but breaks down for:
- Metrics with seasonal patterns (higher traffic on Monday mornings — normal)
- Correlated anomalies (10 devices behaving oddly simultaneously — probably a shared root cause, not 10 separate incidents)
- Slow-moving drift (interface error rate increasing 5% per week — below threshold, but trending to failure)
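The slow-drift failure mode is easy to demonstrate with synthetic data. NPM's actual baselining math is proprietary; the sketch below just mimics the mean-plus-N-sigma idea on a made-up error-rate series with a gradual upward trend:

```python
import numpy as np
import pandas as pd

# Synthetic hourly error-rate series: stable baseline plus slow upward drift.
rng = np.random.default_rng(0)
hours = 24 * 7 * 8                             # 8 weeks of hourly samples
drift = np.linspace(0, 0.4, hours)             # roughly +5% of baseline per week
s = pd.Series(1.0 + drift + rng.normal(0, 0.05, hours))

# NPM-style baseline: rolling mean + 3 standard deviations over 2 weeks.
window = 24 * 14
baseline = s.rolling(window).mean()
sigma = s.rolling(window).std()
breaches = int((s > baseline + 3 * sigma).sum())

print(f"Threshold breaches in 8 weeks: {breaches}")
# The baseline drifts along with the data, so sustained degradation rarely
# trips the threshold — exactly the case a trend test or Isolation Forest catches.
```

The series degrades by roughly a third over eight weeks, yet almost no samples ever clear the moving threshold, because the baseline quietly absorbs the trend.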
Isolation Forest for Network Anomaly Detection
Isolation Forest is an unsupervised ML algorithm that identifies anomalies by how few splits it takes to isolate a data point. Normal points require many splits; anomalies are isolated quickly.
```python
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Load interface metrics from SolarWinds export or Prometheus
# Columns: timestamp, interface_id, in_utilization, out_utilization,
# in_errors, out_errors, in_discards, out_discards
df = pd.read_csv('interface_metrics_30d.csv', parse_dates=['timestamp'])
df = df.sort_values('timestamp')

# Feature engineering
df['error_rate'] = (df['in_errors'] + df['out_errors']) / (
    df['in_utilization'] + df['out_utilization'] + 1)
df['discard_rate'] = (df['in_discards'] + df['out_discards']) / (
    df['in_utilization'] + df['out_utilization'] + 1)
df['util_delta'] = df.groupby('interface_id')['in_utilization'].diff().abs()

features = ['in_utilization', 'out_utilization', 'error_rate',
            'discard_rate', 'util_delta']
X = df[features].fillna(0)

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train Isolation Forest
# contamination=0.01 means we expect ~1% of data points to be anomalies
model = IsolationForest(
    n_estimators=200,
    contamination=0.01,
    random_state=42
)
model.fit(X_scaled)

# Predict: -1 = anomaly, 1 = normal
df['anomaly'] = model.predict(X_scaled)
df['anomaly_score'] = model.score_samples(X_scaled)  # more negative = more anomalous

# Report anomalies
anomalies = df[df['anomaly'] == -1].copy()
anomalies = anomalies.sort_values('anomaly_score')
print(f"Total anomalies detected: {len(anomalies)}")
print(anomalies[['timestamp', 'interface_id', 'in_utilization',
                 'error_rate', 'anomaly_score']].head(20))
```

LSTM for Time-Series Prediction
LSTM (Long Short-Term Memory) networks learn temporal patterns — ideal for predicting bandwidth utilization 24–48 hours ahead based on historical trends.
```python
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler

# Prepare time-series data — hourly bandwidth utilization for one interface
df = pd.read_csv('bandwidth_hourly.csv', parse_dates=['timestamp'])
df = df.set_index('timestamp').resample('1h').mean().ffill()
values = df['utilization_pct'].values.reshape(-1, 1)

# Normalise to 0–1
scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)

# Create sequences: use last 168 hours (1 week) to predict next 24 hours
LOOK_BACK = 168
PREDICT_STEPS = 24

def create_sequences(data, look_back, predict_steps):
    X, y = [], []
    for i in range(len(data) - look_back - predict_steps):
        X.append(data[i:i+look_back])
        # flatten targets to (predict_steps,) to match the Dense output shape
        y.append(data[i+look_back:i+look_back+predict_steps].flatten())
    return np.array(X), np.array(y)

X, y = create_sequences(scaled, LOOK_BACK, PREDICT_STEPS)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Build LSTM model
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(LOOK_BACK, 1)),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),
    Dense(PREDICT_STEPS)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=20, batch_size=32,
          validation_data=(X_test, y_test), verbose=1)

# Predict next 24 hours
last_week = scaled[-LOOK_BACK:].reshape(1, LOOK_BACK, 1)
predicted_scaled = model.predict(last_week)
predicted = scaler.inverse_transform(predicted_scaled.reshape(-1, 1))

# Alert if any predicted hour exceeds 85%
if predicted.max() > 85:
    print(f"CAPACITY ALERT: Interface predicted to reach {predicted.max():.1f}% "
          f"utilization within 24 hours")
```

LLM-Assisted Network Operations
Large language models are genuinely useful for network engineers in three areas: config generation, config review/audit, and explaining unfamiliar output. They are not useful for real-time decision-making during incidents — the model has no live view of your network.
Config Generation
LLMs excel at generating boilerplate. You provide the intent and constraints; the model writes the first draft.
Effective prompt pattern:
```
Context: Cisco IOS 15.x router, hub site, OSPF area 0 backbone
Task: Generate BGP config for dual-homed ISP connectivity
Requirements:
- AS 65001 (internal), ISP-A AS 1234, ISP-B AS 5678
- Prefer ISP-A for outbound (higher local-preference), fail over to ISP-B
- Accept only a default route from both ISPs (no full table)
- Apply inbound prefix filter: deny anything more specific than /24
- MD5 authentication on both peers
- Log neighbor state changes
Output: Only the router bgp configuration block
```

The output will be ~90% correct for standard patterns. Your job: review for site-specific values, verify authentication keys, confirm prefix-list names match your naming convention, and test in a lab before production.
What LLMs get wrong on network configs:
- Vendor-specific syntax differences (IOS vs IOS-XE vs NX-OS have subtle differences)
- Stateful context (the model doesn't know your existing BGP table or what's already configured)
- Complex interactions (a route-map that interacts with an existing policy may conflict)
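Because the model cannot see your running config, a cheap guard against the "stateful context" gap is to check that every route-map or prefix-list the generated block references actually exists before staging it. A minimal text-matching sketch (config snippets and names below are illustrative):

```python
import re

# Hypothetical running-config excerpt and an LLM-generated BGP block.
running_config = """
ip prefix-list ISP-A-IN seq 5 permit 0.0.0.0/0
route-map SET-LOCALPREF permit 10
"""

generated = """
router bgp 65001
 neighbor 198.51.100.1 route-map ISP-A-INBOUND in
 neighbor 198.51.100.1 prefix-list ISP-A-IN in
"""

def names(text, kind):
    """Collect route-map / prefix-list names a config defines or references."""
    return set(re.findall(rf"{kind}\s+(\S+)", text))

for kind in ("route-map", "prefix-list"):
    missing = names(generated, kind) - names(running_config, kind)
    for n in missing:
        # flags ISP-A-INBOUND as referenced but never defined
        print(f"WARNING: generated config references undefined {kind} '{n}'")
```

A real pre-change check would parse hierarchically (e.g., with a config-parsing library) rather than regex-match flat text, but even this level of validation catches the most common LLM failure: inventing an object name that exists nowhere on the box.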
Config Review and Audit
Paste a config block and ask the model to identify security issues, missing hardening, or deviations from best practices. This is more reliable than generation because you are the ground truth — the model is just a fast second pair of eyes.
```
Review this Cisco IOS router config for security issues.
Flag: missing hardening, insecure protocols, weak authentication,
unnecessary services, logging gaps, and AAA issues.

[paste config block]
```

This catches things like an absent service password-encryption, a missing no ip source-route, an SNMPv2c community "public" still active, no ip proxy-arp absent on user-facing interfaces, and a logging buffered size set too small.
Explaining Unfamiliar Output
Paste an unknown error, BGP state output, or debug trace and ask for an explanation. This is the highest-value, lowest-risk use — the model is a knowledgeable colleague explaining what something means.
```
Explain what this BGP debug output means and what likely caused it:

*Mar 13 02:14:35.123: BGP: 203.0.113.1 went from Established to Idle
*Mar 13 02:14:35.123: %BGP-5-ADJCHANGE: neighbor 203.0.113.1 Down BGP Notification sent to neighbor: hold time expired
*Mar 13 02:14:38.456: BGP: 203.0.113.1 active, open failed - Connection refused
```

The model will correctly identify: the hold timer expired (the neighbor stopped sending keepalives for the duration of the hold time — 180 seconds by default on IOS), followed by a connection refused on the reconnect attempt (the peer may have an ACL or process issue). It will suggest checking: an MTU mismatch causing large BGP UPDATE packets to be dropped, an ISP-side filter blocking TCP 179, or a restarted BGP process on the peer.
Vendor AI Platforms
Juniper Mist AI — Marvis in Practice
Marvis is worth calling out specifically. It is a natural language AI assistant embedded in the Mist dashboard. You can ask it: "Why did the wireless experience degrade in the Cebu office yesterday?" and it will correlate RF data, DHCP failures, authentication events, and uplink errors to give a ranked list of probable causes with specific AP and time references.
For wireless troubleshooting in particular — where dozens of radio variables, client types, and RF environment factors interact — Marvis performs in about 3 seconds the correlation that would take an engineer 45 minutes by hand.
AI-Assisted Troubleshooting Workflow
The key principle: AI assists, engineer decides. The model presents ranked hypotheses. The engineer validates with real commands. No AI system in production networking should apply config changes without human review and approval. The blast radius of a wrong automated change in a 42-country network is too large.
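The approval gate that principle implies can be made explicit in code rather than left to process documents. A deliberately simple sketch — the class, field names, and workflow are hypothetical, not any vendor's API:

```python
from dataclasses import dataclass

@dataclass
class ProposedChange:
    device: str
    config_lines: list
    hypothesis: str              # the AI's ranked root-cause explanation
    approved_by: str = ""        # empty until a human signs off

def push_change(change: ProposedChange, dry_run: bool = True) -> str:
    """Refuse to touch a device without a recorded human approval."""
    if not change.approved_by:
        return "REJECTED: no human approval recorded"
    if dry_run:
        return f"DRY-RUN for {change.device}: {len(change.config_lines)} lines staged"
    return f"PUSHED to {change.device} by {change.approved_by}"

change = ProposedChange(
    device="ASR1001-CEBU-HQ",
    config_lines=["interface Gi0/0/1", " bfd interval 300 min_rx 300 multiplier 3"],
    hypothesis="Flaps correlate with upstream carrier errors; enable BFD",
)

print(push_change(change))        # blocked: nobody has approved it
change.approved_by = "j.delacruz"
print(push_change(change))        # staged, still a dry run by default
```

Two design choices carry the principle: the approval field defaults to empty, and dry-run defaults to true — so the unsafe path always requires two explicit human actions.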
Predictive Capacity Planning
Combining time-series forecasting with your network inventory enables data-driven capacity decisions — replacing the "it feels like we need an upgrade" conversation with "this link hits 85% sustained in 6 weeks at current growth rate."
```python
import pandas as pd
from prophet import Prophet  # Facebook/Meta's forecasting library

# Load 6 months of daily peak utilization per interface
df = pd.read_csv('daily_peak_util.csv')

# Prophet requires columns: ds (date), y (value)
interface_df = df[df['interface_id'] == 'Gi0/0/1-CEBU-HQ'].copy()
interface_df = interface_df.rename(columns={'date': 'ds', 'peak_util_pct': 'y'})

# Fit model
model = Prophet(
    yearly_seasonality=False,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05  # regularize — prevent overfitting to noise
)
model.fit(interface_df)

# Forecast 90 days ahead
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)

# Find the first day the forecast exceeds 80%
breach_date = forecast[forecast['yhat'] > 80]['ds'].min()
if pd.notna(breach_date):
    days_until = (breach_date - pd.Timestamp.today()).days
    print(f"Interface Gi0/0/1-CEBU-HQ predicted to exceed 80% on {breach_date.date()}")
    print(f"That is {days_until} days from today — recommend capacity review")
else:
    print("No capacity breach predicted in next 90 days")
```

Integrating with your CMDB: Run this forecast weekly across all monitored interfaces. Feed results into ServiceNow as proactive capacity tickets assigned to the site owner — before users feel anything.
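The ServiceNow handoff can be as simple as a POST to the Table API. The payload fields, instance URL, and urgency mapping below are assumptions — many orgs use a custom table and their own field set rather than `incident`:

```python
# Sketch of turning a forecast breach into a proactive ServiceNow ticket.
SNOW_INSTANCE = "https://example.service-now.com"  # hypothetical instance

def capacity_ticket(interface: str, breach_date: str, days_until: int) -> dict:
    """Build a Table API payload for a proactive capacity ticket."""
    return {
        "short_description": (
            f"Capacity forecast: {interface} to exceed 80% on {breach_date}"
        ),
        "description": (
            f"Prophet forecast predicts {interface} exceeds 80% sustained "
            f"utilization on {breach_date} ({days_until} days out). "
            "Recommend capacity review before breach."
        ),
        "category": "network",
        # escalate urgency when the breach is less than a month away
        "urgency": "3" if days_until > 30 else "2",
    }

payload = capacity_ticket("Gi0/0/1-CEBU-HQ", "2025-08-01", 42)

# To create the ticket (import requests; auth per your instance):
# requests.post(f"{SNOW_INSTANCE}/api/now/table/incident",
#               auth=("svc_user", "***"), json=payload, timeout=10)
print(payload["short_description"])
```

Run it from the same weekly job as the forecast, and deduplicate on interface name so a slowly approaching breach raises one ticket, not one per week.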
Building a Simple Network AI Chatbot
With the Anthropic API, you can build a network-specific assistant that has context about your environment — device inventory, current alerts, recent changes — and answers questions or generates configs in your naming convention.
```python
import anthropic
import json

client = anthropic.Anthropic()

# Your network context — inject from CMDB / SolarWinds API
network_context = {
    "site": "CEBU-HQ",
    "devices": ["ASR1001-CEBU-HQ", "Catalyst9300-DIST-01", "PA-3220-FW-01"],
    "naming_convention": "ROLE-SITE-NUMBER (e.g., SW-CEBU-01)",
    "ip_scheme": "10.10.0.0/16 for CEBU site",
    "active_alerts": ["Gi0/0/1 on ASR1001-CEBU-HQ at 78% utilization"],
    "vendor_standard": "Cisco IOS-XE 17.x, Palo Alto PAN-OS 11.x"
}

def ask_network_ai(question: str) -> str:
    system_prompt = f"""You are a senior network engineer assistant with deep
expertise in Cisco IOS/IOS-XE, Palo Alto firewalls, and enterprise network design.

Current network context:
{json.dumps(network_context, indent=2)}

Rules:
- Generate configs in the vendor standard specified
- Use the naming convention provided
- Use IPs from the site IP scheme
- Flag any security concerns in configs you generate
- Always note when a config change requires a maintenance window
- Never suggest changes that would cause a network outage without warning
"""
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": question}]
    )
    return message.content[0].text

# Example usage
response = ask_network_ai(
    "Generate an OSPF configuration for a new distribution switch "
    "SW-CEBU-03 that connects to ASR1001-CEBU-HQ in area 0. "
    "Include MD5 authentication and BFD."
)
print(response)
```

Where AI Falls Short
Being honest about limitations matters more than hype. The gaps noted throughout this post are real and recurring: models have no live view of your network, stumble on vendor-specific syntax and stateful context, and cannot be trusted to push changes without human review.
The engineer who understands both networking fundamentals and how to effectively direct AI tools is significantly more productive than either a traditional engineer or an AI system alone. The model handles the repetitive synthesis work; the engineer provides context, judgment, and accountability.
Practical Starting Points
If you want to start applying AI to your network operations today, in order of effort and payoff:
- Use an LLM to explain unfamiliar debug output — zero setup, immediate value. Paste error messages into Claude or GPT-4 and ask for an explanation. Verify against Cisco TAC/docs.
- Add Isolation Forest to your SolarWinds data — export interface metrics to CSV weekly, run the anomaly detection script above, review the top 20 anomalies. You will find things your threshold alerts miss.
- Build a 90-day capacity forecast — pull peak utilization per interface from NPM, run Prophet, feed results into ServiceNow as proactive tickets. Eliminates reactive capacity surprises.
- Pilot a vendor AI platform on one site — Juniper Mist AI or Cisco DNA Center AI Analytics on your smallest site. Run it for 30 days and compare mean-time-to-identify against your previous baseline.
- Build a site-context-aware config assistant — the Anthropic API example above, extended with your actual CMDB data. Gives you a config generator that knows your naming conventions, IP scheme, and standards.