★★ Intermediate · Automation & Monitoring
SolarWinds · NPM · Monitoring · Alerting · PagerDuty · NOC · Operations · SNMP

SolarWinds Alerting Strategies That Actually Work

April 12, 2026·7 min read

The Alert Fatigue Problem

Every NOC eventually drifts into one of two failure modes:

  1. Too few alerts — incidents go undetected until a client calls you
  2. Too many alerts — engineers stop reading them because most are noise

Both kill your SLA. SolarWinds NPM can generate thousands of alerts per day out of the box if you're not careful. The goal of a good alerting strategy is surgical: every alert that fires must be actionable, meaningful, and routed to the right person.

After tuning alerts across a 42-country network, this is what works.


[Figure: Alert Pipeline: SolarWinds → PagerDuty → ServiceNow]

  • SolarWinds NPM: SNMP polling, ICMP ping, WMI / agent, NetFlow / NTA, alert engine, suppression rules, webhook / email
  • PagerDuty: routing rules, on-call schedule, escalation chains, dedup / group, mobile push, phone call (P1)
  • ServiceNow: auto ticket open, CMDB enrichment, SLA timer start, assignment group, client notification, incident management

Engineer + Runbook + Context = Resolution

Alert Design Principles

Before configuring a single alert, establish these rules with your team:

1. Alert on symptoms, not causes. "Interface down" is a symptom. "BGP peer lost" is often just a downstream consequence of that same interface going down. Don't alert on both: they're the same incident. Alert on the highest-level symptom that an engineer can act on.

2. Every alert needs a runbook. If there's no documented procedure for responding to an alert, it should not be generating PagerDuty pages. Create the runbook before enabling the alert.

3. Alerts must be stateful. SolarWinds supports trigger conditions (alert fires) and reset conditions (alert clears). Both must be configured. An alert that fires but never clears creates ghost notifications — engineers get paged for incidents that resolved themselves 20 minutes ago.

4. Maintenance window suppression is mandatory. Every change that causes expected device state changes (reloads, interface flaps) must suppress alerts. Use SolarWinds Scheduled Maintenance to automatically suppress during the window, not ad-hoc disable-and-forget.
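Principle 3 can be pictured as a small state machine. The sketch below is an illustrative model, not SolarWinds code: the `AlertState` class and the `"trigger"` / `"reset"` action names are hypothetical. It notifies once when the trigger condition first becomes true, once when the reset condition is met, and stays silent in between.

```python
from typing import Optional


class AlertState:
    """Tracks whether an alert is active so each transition notifies once."""

    def __init__(self) -> None:
        self.active = False

    def evaluate(self, trigger_met: bool, reset_met: bool) -> Optional[str]:
        """Return "trigger", "reset", or None for one evaluation cycle."""
        if not self.active and trigger_met:
            self.active = True
            return "trigger"   # page the on-call engineer
        if self.active and reset_met:
            self.active = False
            return "reset"     # clear the page so nobody chases a ghost
        return None            # no state change, no notification


alert = AlertState()
# Node goes down, stays down, then recovers.
actions = [alert.evaluate(down, not down) for down in (True, True, False, False)]
print(actions)  # ['trigger', None, 'reset', None]
```

Without the reset branch, the alert would stay active forever after the first trigger, which is exactly the ghost-notification case.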


Alert Types and Thresholds

Interface State Alerts

Alert | Condition | Severity | Notes
--- | --- | --- | ---
Critical uplink down | Interface operStatus = Down AND interface tagged "uplink-critical" | P1 | Requires a custom property "uplink-critical" on critical interfaces
WAN interface down | Interface operStatus = Down AND interface tagged "wan" | P2 | Fires after 2 consecutive polls (2 min) to reduce flap noise
Interface high utilization | Interface utilization > 80% for 15 min sustained | P3 | Use sustained threshold (not single spike) to avoid burst false positives
Interface error rate | Interface errors/discards > 0.5% of total packets in 10 min | P3 | Good indicator of physical layer issues
Interface flapping | Interface went up/down > 3 times in 10 min | P2 | SolarWinds Advanced Alert conditions support count-based triggers
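To make the count-based flapping rule concrete, here is a minimal sketch of a sliding-window counter using the "more than 3 changes in 10 minutes" threshold from the table. `FlapDetector` is a hypothetical helper for illustration; in SolarWinds the rule is configured in the alert definition rather than coded.

```python
from collections import deque
from typing import Deque


class FlapDetector:
    """Counts interface state changes inside a sliding time window."""

    def __init__(self, max_changes: int = 3, window_s: int = 600) -> None:
        self.max_changes = max_changes
        self.window_s = window_s
        self.changes: Deque[float] = deque()  # timestamps of state changes

    def record_change(self, ts: float) -> bool:
        """Record an up/down transition; return True if the interface is flapping."""
        self.changes.append(ts)
        # Drop transitions that have aged out of the window.
        while self.changes and ts - self.changes[0] > self.window_s:
            self.changes.popleft()
        return len(self.changes) > self.max_changes


detector = FlapDetector()
# Four transitions within three minutes: more than 3 in 10 min, so flapping.
results = [detector.record_change(t) for t in (0, 60, 120, 180)]
print(results)  # [False, False, False, True]
```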

Node Availability Alerts

# SolarWinds Alert Condition — Node Down (best practice example)
#
# Trigger condition:
#   Node.Status = Down
#   AND Node.CustomProperties.Environment = "Production"
#   AND Node.CustomProperties.Tier != "Non-Critical"
#   Sustained for: 5 minutes
#   Evaluation interval: every 2 minutes
#
# Reset condition:
#   Node.Status = Up
#
# Key: the 5-minute sustained condition filters out ICMP timeouts
# from temporary congestion bursts. Without it, every congestion event
# generates a node-down false positive.
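The sustained condition above can be sketched as a timer that restarts whenever the node answers a poll. This is a simplified model under stated assumptions: `SustainedCondition` is a hypothetical name, timestamps are in seconds, and the evaluation interval is 2 minutes as in the example.

```python
from typing import Optional


class SustainedCondition:
    """Fires only after a condition has held continuously for `hold_s` seconds."""

    def __init__(self, hold_s: int = 300) -> None:
        self.hold_s = hold_s
        self.since: Optional[float] = None  # when the condition first became true

    def evaluate(self, ts: float, condition_met: bool) -> bool:
        if not condition_met:
            self.since = None      # any "Up" poll restarts the clock
            return False
        if self.since is None:
            self.since = ts
        return ts - self.since >= self.hold_s


node_down = SustainedCondition(hold_s=300)
# Evaluations every 2 minutes: a brief ICMP timeout, then a real outage.
polls = [(0, True), (120, False), (240, True), (360, True), (480, True), (600, True)]
fired = [node_down.evaluate(ts, down) for ts, down in polls]
print(fired)  # [False, False, False, False, False, True]
```

In this model, a 5-minute hold checked every 2 minutes first fires on the evaluation 6 minutes after the outage is first seen, so the evaluation interval adds to the real detection delay.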

CPU and Memory Alerts

Metric | Warning Threshold | Critical Threshold | Duration
--- | --- | --- | ---
Router/switch CPU | 70% | 90% | Sustained 10 min
Firewall CPU | 65% | 85% | Sustained 5 min
Device memory | 80% | 92% | Sustained 15 min
Firewall connections | 75% of max | 90% of max | Sustained 5 min

Don't use single-poll thresholds for CPU/memory. CPU spikes happen during BGP reconvergence, interface state changes, and routing table updates. A single 95% CPU reading is normal during convergence. A 90% CPU reading sustained for 10 minutes is not.
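One way to express "sustained" is to require that the lowest reading in the window still meets the threshold. The sketch below applies that idea to the thresholds in the table. It makes two assumptions that are not in the original: one reading per 60-second poll, and a threshold counts as met at greater-than-or-equal. `THRESHOLDS` and `classify` are illustrative names.

```python
from typing import List, Optional

# metric -> (warning %, critical %, sustained minutes), from the table above
THRESHOLDS = {
    "router_cpu": (70, 90, 10),
    "firewall_cpu": (65, 85, 5),
    "device_memory": (80, 92, 15),
    "firewall_connections": (75, 90, 5),
}


def classify(metric: str, readings: List[float], poll_interval_s: int = 60) -> Optional[str]:
    """Classify the most recent readings (oldest first) as "critical", "warning", or None."""
    warning, critical, minutes = THRESHOLDS[metric]
    needed = (minutes * 60) // poll_interval_s  # polls that cover the window
    if len(readings) < needed:
        return None  # not enough history to call anything sustained
    floor = min(readings[-needed:])  # lowest reading inside the window
    if floor >= critical:
        return "critical"
    if floor >= warning:
        return "warning"
    return None


print(classify("router_cpu", [40] * 9 + [95]))  # None (single convergence spike)
print(classify("router_cpu", [92] * 10))        # critical (held for 10 minutes)
```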


Alert Suppression Strategies

Parent-Child Dependencies

Configure parent-child node relationships in SolarWinds. When a parent node (e.g., a core router) goes down, SolarWinds marks the child nodes reachable only through that router (access switches, servers) as Unreachable rather than Down, so their node-down alerts do not fire. This is the single most effective way to reduce alert floods during a P1.

Setup: Manage → Dependencies → Add Dependency

  • Parent: core-router-01
  • Children: all nodes in that site
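The effect of a dependency tree can be illustrated with a short walk up the parent chain: a down node only deserves an alert if nothing above it is also down. This is a conceptual sketch, not how SolarWinds implements dependencies internally. Apart from core-router-01, the node names are hypothetical, and the topology is assumed to be a tree with no cycles.

```python
from typing import Dict, Optional, Set

# Hypothetical site topology: child node -> parent node (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "core-router-01": None,
    "access-sw-01": "core-router-01",
    "access-sw-02": "core-router-01",
    "server-01": "access-sw-01",
}


def root_causes(down: Set[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the down nodes that have no down ancestor; only these should alert."""
    causes = set()
    for node in down:
        ancestor = parent.get(node)
        while ancestor is not None and ancestor not in down:
            ancestor = parent.get(ancestor)  # keep walking up the tree
        if ancestor is None:
            causes.add(node)                 # nothing above it is down
    return causes


# The core router fails and everything behind it stops responding.
everything = {"core-router-01", "access-sw-01", "access-sw-02", "server-01"}
print(root_causes(everything, PARENT))  # {'core-router-01'}
```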

Maintenance Windows

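# SolarWinds API: programmatically create a maintenance window
# Useful for scripting suppression into your change management workflow.
# Sketch against the Orion SDK (SWIS) REST endpoint and the
# Orion.AlertSuppression "SuppressAlerts" verb; confirm both on your Orion version.
# Change reference: CHG0012345 - Router reload for IOS upgrade
$orion   = "solarwinds"        # Orion server hostname
$cred    = Get-Credential      # Orion account permitted to mute alerts
# Entity URI for SolarWinds Node ID 12345; look up the exact value with:
#   SELECT Uri FROM Orion.Nodes WHERE NodeID = 12345
$nodeUri = "swis://$orion/Orion/Orion.Nodes/NodeID=12345"
$from    = ([datetime]"2026-04-15T02:00:00").ToUniversalTime().ToString("o")  # local -> UTC
$until   = ([datetime]"2026-04-15T04:00:00").ToUniversalTime().ToString("o")
# Verb arguments, in order: entity URIs, suppress-from, suppress-until
$body = ConvertTo-Json -InputObject @( @($nodeUri), $from, $until )
$uri  = "https://${orion}:17778/SolarWinds/InformationService/v3/Json/Invoke/Orion.AlertSuppression/SuppressAlerts"
Invoke-RestMethod -Uri $uri -Method POST -Credential $cred -ContentType "application/json" -Body $body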
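A change pipeline can also check for itself whether a node is inside an approved window before it forwards a page. The sketch below is a hypothetical helper, assuming windows are held as (node ID, start, end) tuples with naive local datetimes. It reuses the node ID and times from the example above.

```python
from datetime import datetime
from typing import List, Tuple

Window = Tuple[str, datetime, datetime]  # (node ID, start, end)

# Hypothetical window mirroring the change request above.
WINDOWS: List[Window] = [
    ("12345", datetime(2026, 4, 15, 2, 0), datetime(2026, 4, 15, 4, 0)),
]


def in_maintenance(node_id: str, when: datetime, windows: List[Window]) -> bool:
    """True if `when` falls inside any window scheduled for `node_id`."""
    return any(nid == node_id and start <= when < end for nid, start, end in windows)


print(in_maintenance("12345", datetime(2026, 4, 15, 2, 30), WINDOWS))  # True
print(in_maintenance("12345", datetime(2026, 4, 15, 4, 30), WINDOWS))  # False
```

The end time is treated as exclusive, so an alert that fires at exactly 04:00 pages as normal.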

PagerDuty Integration

SolarWinds connects to PagerDuty via webhooks. The key is routing — not every alert should page the same team.

Alert Type | PagerDuty Service | Escalation | Response
--- | --- | --- | ---
P1: Production node/link down | NOC-Critical | Immediate page → 15 min → manager | Phone call to on-call engineer
P2: WAN degraded | NOC-High | Page → 30 min → team lead | Push notification
P3: Utilization warning | NOC-Medium | Push notification only, no escalation | Morning review
P4: Low-priority warning | NOC-Low | Email only, no PagerDuty page | Weekly review
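The routing table maps naturally onto the PagerDuty Events API v2, where each service has its own integration (routing) key and events are posted to `https://events.pagerduty.com/v2/enqueue`. The sketch below only builds the request body and sends nothing. The routing key values are placeholders, and the P1/P2/P3 to critical/error/warning mapping is an assumption, not something the original setup prescribes.

```python
from typing import Optional

# Placeholder integration keys, one per PagerDuty service in the table above.
ROUTING_KEYS = {"P1": "KEY_NOC_CRITICAL", "P2": "KEY_NOC_HIGH", "P3": "KEY_NOC_MEDIUM"}
# Events API v2 severities are: critical, error, warning, info.
PD_SEVERITY = {"P1": "critical", "P2": "error", "P3": "warning"}


def build_event(priority: str, node: str, summary: str, action: str = "trigger") -> Optional[dict]:
    """Build an Events API v2 body, or None when the alert should not page (P4)."""
    if priority not in ROUTING_KEYS:
        return None  # P4: email only, no PagerDuty event
    return {
        "routing_key": ROUTING_KEYS[priority],
        "event_action": action,            # "trigger" on fire, "resolve" on reset
        "dedup_key": f"{node}:{summary}",  # same key lets "resolve" close the incident
        "payload": {
            "summary": f"[{priority}] {summary} on {node}",
            "source": node,
            "severity": PD_SEVERITY[priority],
        },
    }


event = build_event("P1", "core-router-01", "Node down")
print(event["payload"]["severity"])                # critical
print(build_event("P4", "lab-sw-09", "High CPU"))  # None
```

Sending the same `dedup_key` with `event_action` set to `"resolve"` when the SolarWinds reset condition fires is what closes the PagerDuty incident automatically.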

Alert Deduplication

Configure PagerDuty's Alert Grouping to prevent multiple SolarWinds alerts from the same incident creating dozens of separate PagerDuty incidents. Group by: node hostname + alert type + time window (5 minutes).

Without deduplication, a single core router failure generates: node-down alert, interface-down alerts for every connected interface, BGP neighbor alerts for every peer, child node alerts for downstream devices. That's 20+ PagerDuty pages for a single event.
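The grouping rule (hostname + alert type + 5-minute window) can be sketched as a composite key. This illustrates the idea only; PagerDuty's own Alert Grouping is configured in the service settings, and the function names and sample alerts here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_S = 300  # 5-minute grouping window


def group_key(hostname: str, alert_type: str, ts: float) -> Tuple[str, str, int]:
    """Alerts sharing host, type, and 5-minute bucket collapse into one incident."""
    return (hostname, alert_type, int(ts // WINDOW_S))


def group_alerts(alerts: List[Tuple[str, str, float]]) -> Dict[Tuple[str, str, int], int]:
    """Count how many raw alerts land in each group."""
    groups: Dict[Tuple[str, str, int], int] = defaultdict(int)
    for hostname, alert_type, ts in alerts:
        groups[group_key(hostname, alert_type, ts)] += 1
    return dict(groups)


# Hypothetical burst: 12 interface-down alerts plus one node-down from the same router.
burst = [("core-router-01", "interface-down", 10 + i) for i in range(12)]
burst.append(("core-router-01", "node-down", 5))
print(len(burst), "alerts ->", len(group_alerts(burst)), "incidents")
# 13 alerts -> 2 incidents
```

Fixed buckets are the simplest choice, but two alerts that straddle a bucket boundary land in separate groups; a sliding window avoids that at the cost of keeping more state.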


Common Alert Anti-Patterns to Avoid

Anti-Pattern | Problem | Fix
--- | --- | ---
Single-poll thresholds | Transient spikes create false positives; engineers tune out the pager | Use sustained thresholds (minimum 2–3 poll cycles)
No reset condition | Alerts stay "active" after auto-recovery; engineers don't know the incident is resolved | Always define a reset condition; test it explicitly after alert creation
All alerts go to P1 | P1 loses urgency when it fires 50 times/day | Reserve P1 for actual production outages; tune severity classification
No parent-child dependencies | A single outage generates 50 alerts; engineers can't find the root cause in the noise | Build node dependencies for every site; test by disabling the parent node in a maintenance window
Alerting on every SNMP OID | "We monitor everything" sounds good but creates noise for things no one can act on | Only alert on metrics with a defined response procedure
No maintenance windows | Engineers get paged during planned maintenance; the cry-wolf effect builds | Maintenance window creation is a mandatory step in every change request