SolarWinds Alerting Strategies That Actually Work

April 12, 2026·7 min read

Contents

The Alert Fatigue Problem
Alert Design Principles
Alert Types and Thresholds
Alert Suppression Strategies
PagerDuty Integration
Common Alert Anti-Patterns to Avoid

The Alert Fatigue Problem

Every NOC eventually drifts into one of two failure modes:

Too few alerts — incidents go undetected until a client calls you
Too many alerts — engineers stop reading them because most are noise

Both kill your SLA. SolarWinds NPM can generate thousands of alerts per day out of the box if you're not careful. The goal of a good alerting strategy is surgical: every alert that fires must be actionable, meaningful, and routed to the right person.

Here is what held up after tuning alerts across a 42-country network.

Alert Design Principles

Before configuring a single alert, establish these rules with your team:

1. Alert on symptoms, not causes. "Interface down" is a symptom. "BGP peer lost" is often just a consequence of that same interface going down. Don't alert on both, because they are the same incident. Alert on the highest-level symptom that an engineer can act on.

2. Every alert needs a runbook. If there's no documented procedure for responding to an alert, it should not be generating PagerDuty pages. Create the runbook before enabling the alert.

3. Alerts must be stateful. SolarWinds supports trigger conditions (alert fires) and reset conditions (alert clears). Both must be configured. An alert that fires but never clears creates ghost notifications — engineers get paged for incidents that resolved themselves 20 minutes ago.

4. Maintenance window suppression is mandatory. Every change that causes expected device state changes (reloads, interface flaps) must suppress alerts. Use SolarWinds Scheduled Maintenance to suppress them automatically for the duration of the window, rather than disabling alerts ad hoc and forgetting to re-enable them.
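Rule 3 is easier to reason about as a small state machine. The following is a minimal Python sketch, not SolarWinds code: the `StatefulAlert` class and its `evaluate` method are hypothetical names used only to show that a notification should go out on a state transition (trigger or reset), never on every poll.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatefulAlert:
    """Tracks one alert so it notifies only on state transitions."""
    name: str
    active: bool = False

    def evaluate(self, trigger_met: bool, reset_met: bool) -> Optional[str]:
        """Return "fire", "clear", or None when nothing should be sent."""
        if not self.active and trigger_met:
            self.active = True
            return "fire"
        if self.active and reset_met:
            self.active = False
            return "clear"
        return None


alert = StatefulAlert("wan-interface-down")
# Simulated interface status as seen on five consecutive polls.
polls = ["Up", "Down", "Down", "Up", "Up"]
events = [alert.evaluate(s == "Down", s == "Up") for s in polls]
print(events)  # [None, 'fire', None, 'clear', None]
```

Without the reset branch, the alert would stay active after the fourth poll and nobody would be told the incident had cleared.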

Alert Types and Thresholds

Interface State Alerts

Alert	Condition	Severity	Notes
Critical uplink down	Interface operStatus = Down AND interface tagged "uplink-critical"	P1	Requires a custom property "uplink-critical" on critical interfaces
WAN interface down	Interface operStatus = Down AND interface tagged "wan"	P2	Fires after 2 consecutive polls (2 min at a 1-minute polling interval) to reduce flap noise
Interface high utilization	Interface utilization > 80% for 15 min sustained	P3	Use sustained threshold (not single spike) to avoid burst false positives
Interface error rate	Interface errors/discards > 0.5% of total packets in 10 min	P3	Good indicator of physical layer issues
Interface flapping	Interface went up/down > 3 times in 10 min	P2	SolarWinds Advanced Alert conditions support count-based triggers
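The flapping row is a count-based condition, which is worth seeing in code. This is a rough Python sketch of the logic only, assuming one status reading per poll; `FlapDetector` and its parameters are illustrative names, not part of SolarWinds.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Counts interface state changes inside a sliding time window."""

    def __init__(self, max_changes=3, window=timedelta(minutes=10)):
        self.max_changes = max_changes
        self.window = window
        self.changes = deque()   # timestamps of observed state changes
        self.last_status = None

    def observe(self, when, status):
        """Record a poll result; return True when the interface is flapping."""
        if self.last_status is not None and status != self.last_status:
            self.changes.append(when)
        self.last_status = status
        # Drop state changes that have aged out of the window.
        while self.changes and when - self.changes[0] > self.window:
            self.changes.popleft()
        return len(self.changes) > self.max_changes


detector = FlapDetector()
start = datetime(2026, 4, 15, 2, 0)
statuses = ["Up", "Down", "Up", "Down", "Up"]  # one reading per minute
results = [detector.observe(start + timedelta(minutes=i), s)
           for i, s in enumerate(statuses)]
print(results)  # [False, False, False, False, True]
```

The fourth state change inside ten minutes trips the condition. A single down/up pair does not, which leaves that case to the plain interface-down alerts.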

Node Availability Alerts

# SolarWinds Alert Condition — Node Down (best practice example)
#
# Trigger condition:
#   Node.Status = Down
#   AND Node.CustomProperties.Environment = "Production"
#   AND Node.CustomProperties.Tier != "Non-Critical"
#   Sustained for: 5 minutes
#   Evaluation interval: every 2 minutes
#
# Reset condition:
#   Node.Status = Up
#
# Key: the 5-minute sustained condition filters out ICMP timeouts
# from temporary congestion bursts. Without it, every congestion event
# generates a node-down false positive.
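To make the interaction between the sustain period and the evaluation interval concrete, here is a small Python sketch of the same condition. It assumes node properties arrive as a plain dictionary; `in_scope` and `should_fire` are hypothetical helper names.

```python
from datetime import datetime, timedelta

SUSTAIN = timedelta(minutes=5)


def in_scope(node):
    """Mirror the custom-property filter from the trigger condition."""
    return (node.get("Environment") == "Production"
            and node.get("Tier") != "Non-Critical")


def should_fire(node, down_since, now):
    """Fire only when an in-scope node has been Down for the full sustain period."""
    if not in_scope(node) or down_since is None:
        return False
    return now - down_since >= SUSTAIN


node = {"Environment": "Production", "Tier": "Core"}
down_since = datetime(2026, 4, 15, 2, 0)
for minutes in (2, 4, 6):  # evaluations every 2 minutes
    now = down_since + timedelta(minutes=minutes)
    print(minutes, should_fire(node, down_since, now))
# 2 False
# 4 False
# 6 True
```

In this run the alert first fires at the 6-minute evaluation, not at 5 minutes, because the condition is only checked every 2 minutes. Factor that delay into your detection-time targets.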

CPU and Memory Alerts

Metric	Warning Threshold	Critical Threshold	Duration
Router/switch CPU	70%	90%	Sustained 10 min
Firewall CPU	65%	85%	Sustained 5 min
Device memory	80%	92%	Sustained 15 min
Firewall connections	75% of max	90% of max	Sustained 5 min

Don't use single-poll thresholds for CPU/memory. CPU spikes happen during BGP reconvergence, interface state changes, and routing table updates. A single 95% CPU reading is normal during convergence. A 90% CPU reading sustained for 10 minutes is not.
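The difference between a spike and a sustained breach can be sketched in a few lines of Python. The thresholds come from the table above. The `classify` function, the one-minute poll default, and the "at or above" comparison are assumptions made for illustration.

```python
import math

THRESHOLDS = {
    # metric: (warning %, critical %, sustain minutes)
    "router_switch_cpu": (70, 90, 10),
    "firewall_cpu": (65, 85, 5),
    "device_memory": (80, 92, 15),
    "firewall_connections": (75, 90, 5),  # percent of max connections
}


def classify(metric, samples, poll_minutes=1):
    """Return "critical", "warning", or "ok" for the most recent samples.

    samples: percentage readings, oldest first, one per poll.
    A level applies only if every reading in the sustain period reaches it.
    """
    warning, critical, sustain = THRESHOLDS[metric]
    needed = max(1, math.ceil(sustain / poll_minutes))
    if len(samples) < needed:
        return "ok"  # not enough history to call anything sustained
    floor = min(samples[-needed:])
    if floor >= critical:
        return "critical"
    if floor >= warning:
        return "warning"
    return "ok"


print(classify("router_switch_cpu", [40] * 9 + [95]))  # ok (single spike)
print(classify("router_switch_cpu", [91] * 10))        # critical
print(classify("firewall_cpu", [70] * 5))              # warning
```

Taking the minimum over the window is the strictest reading of "sustained". An average over the window is a looser alternative if your devices oscillate around the threshold.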

Alert Suppression Strategies

Parent-Child Dependencies

Configure parent-child node relationships in SolarWinds. When a parent node (e.g., a core router) goes down, SolarWinds marks the child nodes reachable only through that router (access switches, servers) as Unreachable rather than Down, so their node-down alerts are suppressed automatically. This is the single most effective way to reduce alert floods during a P1.

Setup: Manage → Dependencies → Add Dependency

Parent: core-router-01
Children: all nodes in that site
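The effect of a dependency tree can be illustrated with a short Python sketch. It assumes a hypothetical child-to-parent map with no cycles, and it only models the idea; SolarWinds does this internally once dependencies are defined.

```python
# Hypothetical dependency map: child node -> parent node.
PARENT = {
    "access-sw-01": "core-router-01",
    "access-sw-02": "core-router-01",
    "server-01": "access-sw-01",
}


def has_down_ancestor(node, down):
    """True if any node on the path to the root is currently down."""
    parent = PARENT.get(node)
    while parent is not None:
        if parent in down:
            return True
        parent = PARENT.get(parent)
    return False


def alerts_to_send(down):
    """Keep only down nodes whose outage is not explained by a down ancestor."""
    return sorted(n for n in down if not has_down_ancestor(n, down))


down_nodes = {"core-router-01", "access-sw-01", "access-sw-02", "server-01"}
print(alerts_to_send(down_nodes))  # ['core-router-01']
```

Four unreachable devices collapse into one alert that points at the root cause.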

Maintenance Windows

# SolarWinds API: programmatically create a maintenance window
# Useful for scripting suppression into your change management workflow
# Replace the host (and the path, if your deployment differs) with your own.
$cred = Get-Credential  # account permitted to manage the node
$headers = @{ "Content-Type" = "application/json" }
$body = @{
    EntityType = "Orion.Nodes"
    EntityID   = "12345"  # SolarWinds Node ID
    StartTime  = "2026-04-15T02:00:00"
    EndTime    = "2026-04-15T04:00:00"
    Message    = "CHG0012345 - Router reload for IOS upgrade"
} | ConvertTo-Json
Invoke-RestMethod -Uri "https://solarwinds/api/v1/maintenance" -Method POST `
    -Credential $cred -Headers $headers -Body $body
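If you also filter or audit alerts outside SolarWinds, the same window data can answer "was this expected?". A minimal Python sketch, assuming a hypothetical in-memory copy of the scheduled windows keyed by node ID:

```python
from datetime import datetime

# Hypothetical copy of scheduled windows: node ID -> [(start, end, change ticket)].
WINDOWS = {
    "12345": [
        (datetime(2026, 4, 15, 2, 0), datetime(2026, 4, 15, 4, 0), "CHG0012345"),
    ],
}


def active_change(node_id, when):
    """Return the change ticket covering `when` for this node, or None."""
    for start, end, ticket in WINDOWS.get(node_id, []):
        if start <= when < end:
            return ticket
    return None


print(active_change("12345", datetime(2026, 4, 15, 3, 0)))   # CHG0012345
print(active_change("12345", datetime(2026, 4, 15, 4, 30)))  # None
```

An alert at 04:30 falls outside the window and should page as normal, which is exactly the behavior you lose with disable-and-forget.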

PagerDuty Integration

SolarWinds connects to PagerDuty via webhooks. The key is routing — not every alert should page the same team.

Alert Type	PagerDuty Service	Escalation	Response
P1 — Production node/link down	NOC-Critical	Immediate page → 15 min → manager	Phone call to on-call engineer
P2 — WAN degraded	NOC-High	Page → 30 min → team lead	Push notification
P3 — Utilization warning	NOC-Medium	Push notification only, no escalation	Morning review
P4 — Low-priority warning	NOC-Low	Email only, no PagerDuty page	Weekly review
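The routing table translates naturally into a lookup. The sketch below builds a PagerDuty Events API v2 trigger payload without sending it. The routing keys are placeholders, and the mapping of P2 and P3 onto the API's `error` and `warning` severities is an assumption, not something the table prescribes.

```python
import json

# Priority -> (PagerDuty service, Events API v2 severity). P4 is email only.
ROUTES = {
    "P1": ("NOC-Critical", "critical"),
    "P2": ("NOC-High", "error"),
    "P3": ("NOC-Medium", "warning"),
}

# Placeholder integration keys; each PagerDuty service has its own.
ROUTING_KEYS = {
    "NOC-Critical": "REPLACE_WITH_NOC_CRITICAL_KEY",
    "NOC-High": "REPLACE_WITH_NOC_HIGH_KEY",
    "NOC-Medium": "REPLACE_WITH_NOC_MEDIUM_KEY",
}


def build_event(priority, node, alert_name):
    """Return an Events API v2 trigger payload, or None for email-only priorities."""
    route = ROUTES.get(priority)
    if route is None:
        return None  # P4 and anything unmapped stays out of PagerDuty
    service, severity = route
    return {
        "routing_key": ROUTING_KEYS[service],
        "event_action": "trigger",
        "dedup_key": f"{node}:{alert_name}",
        "payload": {
            "summary": f"[{priority}] {alert_name} on {node}",
            "source": node,
            "severity": severity,
        },
    }


event = build_event("P1", "core-router-01", "Node Down")
print(json.dumps(event, indent=2))
print(build_event("P4", "lab-sw-07", "Fan warning"))  # None
```

Reusing the same `dedup_key` with `event_action` set to `resolve` is how the reset condition would close the PagerDuty incident.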

Alert Deduplication

Configure PagerDuty's Alert Grouping to prevent multiple SolarWinds alerts from the same incident from creating dozens of separate PagerDuty incidents. Group by: node hostname + alert type + time window (5 minutes).

Without deduplication, a single core router failure generates a node-down alert, interface-down alerts for every connected interface, BGP neighbor alerts for every peer, and child node alerts for downstream devices. That's 20+ PagerDuty pages for a single event.
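The grouping key described above (hostname + alert type + 5-minute window) can be sketched locally to check how many incidents a burst would produce. This illustrates the key only and is not how PagerDuty implements grouping; `group_key` is a hypothetical helper, and fixed 5-minute buckets are a simplifying assumption.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60


def group_key(hostname, alert_type, when):
    """Bucket alerts by host, alert type, and fixed 5-minute window."""
    bucket = int(when.timestamp()) // WINDOW_SECONDS
    return (hostname, alert_type, bucket)


def at(minute, second):
    """Helper: a UTC timestamp shortly after 02:00 on the change night."""
    return datetime(2026, 4, 15, 2, minute, second, tzinfo=timezone.utc)


alerts = [
    ("core-router-01", "interface-down", at(0, 10)),
    ("core-router-01", "interface-down", at(1, 5)),
    ("core-router-01", "interface-down", at(4, 59)),
    ("core-router-01", "bgp-neighbor-down", at(0, 30)),
]

groups = {}
for host, kind, when in alerts:
    groups.setdefault(group_key(host, kind, when), []).append(when)

print(len(alerts), "alerts ->", len(groups), "groups")  # 4 alerts -> 2 groups
```

Because the alert type is part of the key, interface and BGP alerts still open separate incidents. Parent-child dependencies, not grouping, are what collapse those into one.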

Common Alert Anti-Patterns to Avoid

Anti-Pattern	Problem	Fix
Single-poll thresholds	Transient spikes create false positives, so engineers tune out the pager	Use sustained thresholds (minimum 2–3 poll cycles)
No reset condition	Alerts stay "active" after auto-recovery, so engineers don't know the incident is resolved	Always define a reset condition; test it explicitly after alert creation
All alerts go to P1	P1 loses urgency when it fires 50 times/day	Reserve P1 for actual production outages; tune severity classification
No parent-child dependencies	Single outage generates 50 alerts; engineers can't find the root cause in the noise	Build node dependencies for every site; test by disabling parent node in a maintenance window
Alerting on every SNMP OID	"We monitor everything" sounds good but creates noise for things no one can act on	Only alert on metrics with a defined response procedure
No maintenance windows	Engineers get paged during planned maintenance; cry-wolf effect builds	Maintenance window creation is a mandatory step in every change request
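Several of these anti-patterns can be caught mechanically before an alert goes live. The sketch below assumes you can export alert definitions into a simple record; the `AlertDef` fields and the example runbook URL are hypothetical and do not reflect a SolarWinds schema.

```python
from dataclasses import dataclass


@dataclass
class AlertDef:
    """Hypothetical, simplified view of an exported alert definition."""
    name: str
    severity: str
    sustain_polls: int
    has_reset: bool
    runbook_url: str = ""


def lint(alert):
    """Return the anti-patterns this definition matches."""
    problems = []
    if alert.sustain_polls < 2:
        problems.append("single-poll threshold")
    if not alert.has_reset:
        problems.append("no reset condition")
    if not alert.runbook_url:
        problems.append("no runbook")
    return problems


good = AlertDef("Node Down", "P1", 3, True,
                "https://wiki.example.com/runbooks/node-down")
bad = AlertDef("CPU spike", "P1", 1, False)

print(lint(good))  # []
print(lint(bad))   # ['single-poll threshold', 'no reset condition', 'no runbook']
```

Running a check like this as part of alert review keeps the rules from eroding as new alerts are added.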