Overview
At Conduent, the network monitoring stack covers 2,100+ clients across 42 countries. At that scale, manual monitoring is impossible — a single analyst cannot watch thousands of interfaces across every timezone. SolarWinds NPM is the foundation: it polls every device, tracks every interface, and fires alerts the moment something deviates from baseline.
The full monitoring stack layers four SolarWinds modules together:
- NPM (Network Performance Monitor) — SNMP polling, node/interface/volume health, alerting
- NTA (NetFlow Traffic Analyzer) — per-flow traffic analysis, top talkers, application breakdown
- NCM (Network Configuration Manager) — config backup, change detection, compliance policy
- IPAM (IP Address Manager) — IP allocation, subnet tracking, DHCP/DNS management
The integration chain that turns a network fault into an engineer's phone ringing:
NPM detects fault → Alert fires → REST API POST to ServiceNow → INC ticket created and auto-assigned → Webhook to PagerDuty → On-call engineer paged → Escalation if no acknowledgement
This post covers the full stack from SNMP configuration through PagerDuty escalation — everything needed to build a production NOC monitoring environment.
SNMP Configuration (Device Side)
SolarWinds polls devices over SNMP — no agent required. The configuration lives entirely on the network device. Getting SNMP right is the foundation of accurate monitoring.
SNMPv2c vs SNMPv3
SNMPv2c uses a plaintext community string for authentication. It is simple to configure and still works for non-regulated environments. SNMPv3 adds authentication (MD5/SHA) and optional encryption (DES/AES) — required for PCI-DSS, HIPAA, and any compliance framework that mandates encrypted management traffic.
| Feature | SNMPv2c | SNMPv3 |
|---|---|---|
| Authentication | Community string (plaintext) | Username + MD5/SHA HMAC |
| Encryption | None | DES / AES-128 / AES-256 |
| PCI-DSS compliant | No | Yes |
| Config complexity | Low | Medium |
| Orion support | Full | Full |
SNMPv3 Configuration on Cisco IOS
! Step 1: Create SNMP view — restrict what OIDs are readable
Router(config)# snmp-server view ORION-VIEW iso included
! Step 2: Create SNMPv3 group with auth+priv security
Router(config)# snmp-server group ORION-GROUP v3 priv read ORION-VIEW
! Step 3: Create SNMPv3 user — SHA auth, AES-128 priv
Router(config)# snmp-server user orion-monitor ORION-GROUP v3 auth sha Auth$ecret2024 priv aes 128 Priv$ecret2024
! Step 4: Set trap destination — Orion server IP
Router(config)# snmp-server host 10.0.0.10 version 3 priv orion-monitor
! Step 5: ACL restricting SNMP access to Orion only
Router(config)# ip access-list standard SNMP-ACL
Router(config-std-nacl)# permit 10.0.0.10
Router(config-std-nacl)# deny any log
Router(config-std-nacl)# exit
! Note: defining any community string — even a decoy name — re-enables SNMPv2c for that string;
! in a v3-only deployment, omit this line entirely and keep the ACL for any legacy strings
Router(config)# snmp-server community NOT-USED-DISABLED RO SNMP-ACL
! Step 6: Enable common traps
Router(config)# snmp-server enable traps snmp linkdown linkup coldstart
Router(config)# snmp-server enable traps bgp
Router(config)# snmp-server enable traps cpu threshold
! Verify SNMPv3 user was created
Router# show snmp user
! Expected output includes: User name orion-monitor, Authentication Protocol SHA, Privacy Protocol AES128
Community string hardening: Never use public or private. For SNMPv2c environments, use a randomly generated string (minimum 16 characters, mixed case, numbers, symbols). Apply an ACL to every community string — if Orion is your only poller, no other host should ever query SNMP.
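A throwaway way to mint a string that meets that policy — a hypothetical helper; any CSPRNG-backed generator works:

```python
import secrets
import string

SYMBOLS = "-_@#%"  # symbols chosen to survive copy-paste into IOS config

def generate_community_string(length: int = 24) -> str:
    """Random SNMP community string: mixed case, digits, symbols, >= 16 chars."""
    alphabet = string.ascii_letters + string.digits + SYMBOLS
    while True:
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        # Enforce the policy from the text: lower, upper, digit, symbol all present
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in SYMBOLS for c in candidate)):
            return candidate

print(generate_community_string())
```

Generate once per environment, store it in the credential vault, and rotate it on the same schedule as device passwords.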
Key OIDs Polled by SolarWinds NPM
| OID / MIB Object | What It Measures | Poll Use |
|---|---|---|
| ifOperStatus (.1.3.6.1.2.1.2.2.1.8) | Interface up/down state | Node and interface availability |
| ifInOctets / ifOutOctets | Bytes in/out per interface | Bandwidth utilization calculation |
| ifInErrors / ifOutErrors | Interface error counters | Hardware fault detection |
| sysUpTime (.1.3.6.1.2.1.1.3.0) | Device uptime in timeticks | Reboot detection |
| cpmCPUTotal5min | Cisco 5-min CPU average % | CPU threshold alerting |
| cbgpPeer2State | BGP neighbor FSM state | BGP down alerting |
| hrStorageUsed / hrStorageSize | Disk/volume utilization | Volume capacity alerting |
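The bandwidth calculation derived from ifInOctets/ifOutOctets can be sketched as follows — function names are illustrative, but the single-wrap Counter32 handling is what any poller of 32-bit octet counters must implement:

```python
COUNTER32_MAX = 2**32  # ifInOctets/ifOutOctets wrap at 2^32 on 32-bit counters

def octet_delta(prev: int, curr: int) -> int:
    """Delta between two Counter32 samples, accounting for a single wrap."""
    return curr - prev if curr >= prev else (COUNTER32_MAX - prev) + curr

def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, if_speed_bps: int) -> float:
    """Percent utilization over one polling interval (one direction)."""
    bits = octet_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (if_speed_bps * interval_s)

# Example: 1 Gbps link, 120-second poll, counter advanced by 1.5 GB
print(round(utilization_pct(1_000_000, 1_501_000_000, 120, 1_000_000_000), 1))  # → 10.0
```

On fast links polled at long intervals a 32-bit counter can wrap more than once between samples, which is undetectable — one reason to prefer the 64-bit ifHCInOctets/ifHCOutOctets counters where the device supports them.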
Adding Nodes to SolarWinds NPM
Step-by-Step Node Addition
Navigation path in Orion Web Console:
Settings → Manage Nodes → Add Node
Step 1: Enter hostname or IP address
Hostname/IP: 10.100.1.1
Polling method: Most Devices (ICMP + SNMP)
Step 2: Select SNMP version and credential
SNMP version: v3
Credential: [Select from Credential Library — "SNMPv3-Orion-Monitor"]
Step 3: Test connectivity
→ Click "Test" — confirms ICMP reachable + SNMP responding
→ Verify OIDs populate: sysDescr, sysUpTime, interface list
Step 4: Select resources to monitor
[x] All interfaces (includes port-channels, loopbacks, SVIs)
[x] CPU and memory
[x] Volumes (disk utilization)
[ ] Uncheck: loopback interfaces (reduce noise)
[ ] Uncheck: null0, management VRF if not needed
Step 5: Set Custom Properties
Site: CEBU-PH
Region: APAC
Device_Type: Router
Criticality: High
Team: Network-APAC
Step 6: Click Add Node — Orion begins polling immediately
Node Management States
- Managed: Normal polling state. Alerts fire on threshold breaches.
- Unmanaged: Polling suspended. Use during maintenance windows, device replacements, or planned outages. Alerts do not fire — prevents false pages.
- Deleted: Node removed from Orion entirely. Deleting a node also removes its historical statistics, so export any SLA or availability reports you need before removal.
To unmanage a node during a maintenance window:
Via Orion Web Console — right-click node → Unmanage
Set unmanage duration: Start time, End time (or indefinite)
Node status changes to "Unmanaged" — grey icon in dashboards
All child resources (interfaces, volumes) are also suppressed
Via Orion SDK (PowerShell) — bulk unmanage before maintenance:
# Requires the SwisPowerShell module from the Orion SDK
Import-Module SwisPowerShell
$swis = Connect-Swis -Hostname 10.0.0.10 -Credential (Get-Credential)
# Site is a custom property — query it through the CustomProperties navigation property
$nodes = Get-SwisData $swis "SELECT NodeID FROM Orion.Nodes WHERE CustomProperties.Site='CEBU-PH'"
foreach ($nodeId in $nodes) {
    # Unmanage takes the net-object ID ("N:<NodeID>"), start/end times, and an isRelative flag
    Invoke-SwisVerb $swis Orion.Nodes Unmanage @("N:$nodeId", "2026-03-15T02:00:00", "2026-03-15T06:00:00", "false")
}
SNMP Polling Intervals and Baselining
Polling Interval Strategy
| Resource Type | Default Interval | Critical Device | Notes |
|---|---|---|---|
| Node availability (ICMP) | 120 seconds | 60 seconds | Faster detection of outages |
| Interface utilization | 9 minutes | 2 minutes | More granular trending data |
| CPU / memory | 9 minutes | 2 minutes | Catch CPU spikes early |
| Volume (disk) | 15 minutes | 10 minutes | Disk fills slower — less critical |
| BGP neighbor state | 2 minutes | 60 seconds | BGP flaps need fast detection |
Decreasing polling intervals increases load on both the Orion server and the polled device. For 2,100+ nodes, use Additional Polling Engines to distribute load — a single Orion poller maxes out around 800–1,000 nodes at 2-minute intervals.
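A quick sanity check of engine count against the ~1,000-nodes-per-poller figure above (a rough sizing heuristic, not a SolarWinds formula):

```python
import math

def pollers_needed(node_count: int, nodes_per_poller: int = 1000) -> int:
    """Polling engines required at the stated per-engine capacity."""
    return math.ceil(node_count / nodes_per_poller)

# 2,100 nodes at 2-minute intervals → main poller plus Additional Polling Engines
print(pollers_needed(2100))  # → 3
```

In practice the per-engine ceiling also depends on polled elements per node and interval mix, so treat the result as a floor, not a target.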
Statistical Baselining
SolarWinds NPM calculates a rolling statistical baseline from 30 days of historical data. The baseline includes:
- Average: The mean value over the rolling window
- 95th percentile: Removes spike outliers — more useful than average for capacity planning
- Standard deviation: Measures how much the metric varies day-to-day
Baseline deviation alerting fires when current value exceeds baseline + (N × standard_deviation). This is more intelligent than a fixed threshold because a core router link running at 70% might be normal business hours traffic, while a 70% spike at 3am is anomalous.
Practical example: An MPLS uplink baseline is 45% utilization with a standard deviation of 8%. A baseline + 2σ alert fires at 61% — but only triggers during off-peak hours when that level represents a genuine anomaly. During business hours the alert may not fire at the same level because the baseline accounts for normal peak patterns by time-of-day segmentation.
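The deviation rule reduces to mean + N·σ over the rolling window. NPM computes this internally; the sketch below only mirrors the math with illustrative helpers:

```python
import statistics

def baseline_threshold(samples: list[float], n_sigma: float = 2.0) -> float:
    """Alert threshold = rolling mean + N standard deviations."""
    return statistics.fmean(samples) + n_sigma * statistics.stdev(samples)

def is_anomalous(current: float, samples: list[float], n_sigma: float = 2.0) -> bool:
    return current > baseline_threshold(samples, n_sigma)

# Off-peak utilization samples averaging ~45% — mirrors the MPLS uplink example
history = [45, 38, 52, 41, 47, 55, 36, 44, 49, 43]
print(round(baseline_threshold(history), 1))  # → 56.9 (mean + 2σ)
print(is_anomalous(70, history))              # → True — a genuine off-peak anomaly
```

With per-time-of-day sample windows, the same function yields different thresholds for business hours and 3am, which is exactly why the MPLS uplink example fires off-peak but not at midday.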
Custom Alerts — The 5 Most Useful for NOC
Navigate to: Alerts & Activity → Manage Alerts → Add New Alert
Each alert has three key components: Trigger condition (when to fire), Reset condition (when to clear), and Actions (what to do when fired/reset).
Alert 1: Node Down
Detects: Device completely unreachable — both ICMP ping and SNMP poll fail.
Trigger Condition:
Node.Status is equal to Down
AND Node.StatusDescription contains "Node Down"
Trigger after condition exists for: 3 consecutive polling cycles
(prevents false alerts from single poll timeout)
Trigger Actions:
1. Send email → noc-team@conduent.com
2. HTTP POST → ServiceNow Table API (create INC)
3. HTTP POST → PagerDuty Events API (page on-call)
Alert message: "CRITICAL: ${NodeName} (${IP_Address}) is DOWN — Site: ${Site}"
Reset Condition:
Node.Status is not equal to Down
Reset Actions:
1. Send email → noc-team@conduent.com (subject: RESOLVED)
2. HTTP POST → ServiceNow (resolve INC — state=6)
3. HTTP POST → PagerDuty (event_action=resolve)
Alert 2: Interface Utilization > 80%
Detects: Sustained bandwidth saturation — not momentary spikes.
Trigger Condition (OR across directions, AND the remaining filters):
(Interface.InPercentUtil is greater than 80
 OR Interface.OutPercentUtil is greater than 80)
AND Interface.OperStatus is equal to Up
AND Node.Criticality is equal to High
Trigger after condition exists for: 10 minutes
Scope: Only apply to WAN/uplink interfaces
Interface.Name contains "GigabitEthernet0/0"
OR Interface.ifType is equal to ethernetCsmacd
Alert message:
"HIGH UTIL: ${NodeName} ${InterfaceName} at ${InPercentUtil}% in / ${OutPercentUtil}% out"
"Baseline avg: ${BaselineInAvg}% — 95th pct: ${Baseline95th}%"
Alert 3: BGP Neighbor Down
Detects: BGP peering session drops — critical for MPLS/SD-WAN environments.
Trigger Condition:
Component.CbgpPeer2State is not equal to 6
(BGP states: 1=Idle 2=Connect 3=Active 4=OpenSent 5=OpenConfirm 6=Established)
AND Component.CbgpPeer2State is not equal to 0 (exclude unknown/unpolled)
Trigger after condition exists for: 2 minutes
Alert message:
"BGP DOWN: ${NodeName} — Peer ${CbgpPeer2RemoteAddr} state: ${CbgpPeer2State}"
"Remote ASN: ${CbgpPeer2RemoteAs} — Expected state: Established (6)"
Severity: Critical — BGP down means traffic blackholing on that peer
Alert 4: CPU > 85% Sustained
Detects: CPU exhaustion — can indicate routing loop, DDoS, or misconfiguration.
Trigger Condition:
Node.CPULoad is greater than 85
Trigger after condition exists for: 10 minutes
(short spikes during BGP convergence are normal — suppress those)
Alert message:
"CPU HIGH: ${NodeName} CPU at ${CPULoad}% for >10 minutes"
"Device: ${Device_Type} — Site: ${Site} — Region: ${Region}"
Reset condition (with hysteresis — prevents flapping):
Node.CPULoad is less than 70
Alert 5: Interface Error Rate
Detects: Hardware-level errors — duplex mismatch, bad SFP, failing cable.
Trigger Condition:
(Interface.InErrorsThisHour + Interface.OutErrorsThisHour) is greater than 100
AND Interface.OperStatus is equal to Up
Trigger after condition exists for: 2 polling cycles
Alert message:
"INTERFACE ERRORS: ${NodeName} ${InterfaceName}"
"Errors this hour: ${InErrorsThisHour} in / ${OutErrorsThisHour} out"
"Likely cause: duplex mismatch, bad SFP, or physical layer fault"
Severity: Warning — usually physical, rarely immediately traffic-impacting
Action: Create ServiceNow problem record for hardware investigation
Alert Suppression and Maintenance Windows
This is the most underutilised feature in SolarWinds — and the most important for a global deployment. When a single WAN uplink goes down, every device behind it becomes unreachable. Without suppression, one WAN failure generates hundreds of Node Down alerts. With parent/child dependency set correctly, it generates one alert for the WAN link and suppresses everything downstream.
Configuring Parent/Child Dependencies
In Orion: Node Details page → Dependencies tab → Add Dependency
Example topology:
CEBU-MPLS-RTR (parent WAN router)
└── CEBU-CORE-SW1 (depends on parent)
├── CEBU-ACC-SW1
├── CEBU-ACC-SW2
└── CEBU-ACC-SW3
Dependency configuration:
Parent Node: CEBU-MPLS-RTR
Parent Interface: GigabitEthernet0/0 (the WAN-facing interface)
Child Nodes: CEBU-CORE-SW1, CEBU-ACC-SW1, CEBU-ACC-SW2, CEBU-ACC-SW3
Result when WAN link fails:
→ CEBU-MPLS-RTR GigabitEthernet0/0: CRITICAL (1 alert fires, pages on-call)
→ CEBU-CORE-SW1: Unreachable — SUPPRESSED (no alert)
→ CEBU-ACC-SW1: Unreachable — SUPPRESSED (no alert)
→ CEBU-ACC-SW2: Unreachable — SUPPRESSED (no alert)
→ CEBU-ACC-SW3: Unreachable — SUPPRESSED (no alert)
Total alerts fired: 1 (not 5 — and not 200 in a large site)
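The suppression rule reduces to: a down node alerts only if no ancestor on its dependency chain is also down. A sketch over the topology above (hypothetical data structures, not the Orion SDK):

```python
def alerts_to_fire(down_nodes: set[str], parent_of: dict[str, str]) -> set[str]:
    """Return only root-cause alerts: a down node is suppressed when any
    ancestor on its dependency chain is also down."""
    fired = set()
    for node in down_nodes:
        ancestor, suppressed = parent_of.get(node), False
        while ancestor is not None:
            if ancestor in down_nodes:
                suppressed = True
                break
            ancestor = parent_of.get(ancestor)
        if not suppressed:
            fired.add(node)
    return fired

parent_of = {
    "CEBU-CORE-SW1": "CEBU-MPLS-RTR",
    "CEBU-ACC-SW1": "CEBU-CORE-SW1",
    "CEBU-ACC-SW2": "CEBU-CORE-SW1",
    "CEBU-ACC-SW3": "CEBU-CORE-SW1",
}
down = {"CEBU-MPLS-RTR", "CEBU-CORE-SW1", "CEBU-ACC-SW1",
        "CEBU-ACC-SW2", "CEBU-ACC-SW3"}
print(alerts_to_fire(down, parent_of))  # only the WAN router pages on-call
```

The walk up the chain is what makes multi-tier sites work: an access switch behind a down core switch behind a down WAN router is still suppressed by the topmost failure.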
Maintenance Windows — Scheduled Alert Suppression
Settings → Manage Maintenance Windows → New Window
Name: CEBU-SITE-POWER-MAINTENANCE
Description: Scheduled UPS replacement — all devices offline
Nodes: All nodes with Site=CEBU-PH (dynamic group filter)
Start: 2026-03-15 02:00 UTC+8
End: 2026-03-15 06:00 UTC+8
Suppress alerts: Yes
Unmanage nodes: Yes (stops polling — reduces Orion server load)
During maintenance window:
- All alerts for selected nodes are suppressed
- No ServiceNow tickets created
- No PagerDuty pages sent
- Nodes show "Unmanaged" status in all dashboards and maps
ServiceNow Integration
SolarWinds connects to ServiceNow via the REST API — specifically the Table API endpoint for the incident table. When an alert fires, Orion executes an HTTP POST action that creates an INC record with all relevant device context pre-populated.
Alert Action: HTTP POST to ServiceNow
In Alert definition → Add Action → HTTP GET/POST
URL: https://conduent.service-now.com/api/now/table/incident
Method: POST
Content-Type: application/json
Auth: Basic (service account credentials)
Request body — Orion substitutes ${variables} at alert fire time:
{
"caller_id": "solarwinds-service-account",
"category": "network",
"subcategory": "connectivity",
"short_description":"NETWORK ALERT: ${AlertName} on ${NodeName}",
"description": "${AlertMessage}\n\nNode: ${NodeName}\nIP: ${IP_Address}\nSite: ${Site}\nRegion: ${Region}\nDevice Type: ${Device_Type}\nSeverity: ${AlertSeverity}\nTime: ${AlertTriggerTime}",
"priority": "${ServiceNow_Priority}",
"assignment_group": "${ServiceNow_AssignmentGroup}",
"cmdb_ci": "${NodeName}",
"u_monitoring_tool":"SolarWinds NPM",
"u_alert_id": "${AlertID}"
}
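For testing the integration outside Orion, the same body can be built and POSTed with the Python standard library. A sketch with placeholder credentials and a subset of the fields — the values Orion substitutes via ${...} macros are passed in explicitly here:

```python
import base64
import json
import urllib.request

def build_incident(alert: dict) -> dict:
    """Mirror the Orion alert-action body with alert context filled in."""
    return {
        "caller_id": "solarwinds-service-account",
        "category": "network",
        "subcategory": "connectivity",
        "short_description": f"NETWORK ALERT: {alert['name']} on {alert['node']}",
        "description": (f"{alert['message']}\n\nNode: {alert['node']}\n"
                        f"IP: {alert['ip']}\nSite: {alert['site']}"),
        "assignment_group": alert["assignment_group"],
        "u_monitoring_tool": "SolarWinds NPM",
        "u_alert_id": alert["id"],
    }

def post_incident(instance: str, user: str, password: str, body: dict):
    """POST to the Table API incident endpoint with Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"https://{instance}/api/now/table/incident",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )
    return urllib.request.urlopen(req)  # raises on non-2xx responses

body = build_incident({"name": "Node Down", "node": "CEBU-MPLS-RTR",
                       "ip": "10.100.1.1", "site": "CEBU-PH", "message": "Node is DOWN",
                       "assignment_group": "Network-APAC-WAN", "id": "12345"})
print(body["short_description"])
```

Running the builder locally before wiring up the alert action catches field-name typos (u_alert_id vs alert_id) without generating test incidents.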
Custom Properties → ServiceNow Routing
The assignment_group is dynamically set using the node's custom properties. This routes alerts to the correct regional team automatically — no manual triage required:
| Region (Custom Property) | Device_Type | ServiceNow Assignment Group | Priority |
|---|---|---|---|
| APAC | Router | Network-APAC-WAN | P2 — High |
| EMEA | Router | Network-EMEA-WAN | P2 — High |
| AMER | Firewall | Network-AMER-Security | P1 — Critical |
| Any | Switch (Criticality=High) | Network-Core-Infra | P2 — High |
| Any | Switch (Criticality=Low) | Network-NOC-L1 | P3 — Moderate |
Auto-Resolve on Alert Reset
When SolarWinds clears an alert (reset condition met), a second HTTP POST closes the ServiceNow incident:
Reset action — update the existing incident by alert ID. The Table API only accepts PATCH against a specific record, so the reset is two steps:
Step 1 — GET to look up the sys_id of the open incident:
https://conduent.service-now.com/api/now/table/incident?sysparm_query=u_alert_id=${AlertID}^state!=7&sysparm_fields=sys_id
Step 2 — PATCH https://conduent.service-now.com/api/now/table/incident/{sys_id}
Body:
{
"state": "6",
"close_code": "Solved (Permanently)",
"close_notes": "Alert auto-resolved by SolarWinds NPM at ${AlertResetTime}. Node returned to normal state."
}
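Because the Table API addresses updates by sys_id, the resolve action amounts to a query followed by a PATCH per matching record. A stdlib-only Python sketch — instance name and custom field names as used above, nothing here is an official SolarWinds action:

```python
import base64
import json
import urllib.parse
import urllib.request

INSTANCE = "conduent.service-now.com"  # placeholder instance

def open_incident_query(alert_id: str) -> str:
    """Encoded sysparm_query matching open (state != 7) incidents for this alert."""
    return urllib.parse.quote(f"u_alert_id={alert_id}^state!=7")

def _request(method: str, path: str, user: str, password: str, body=None):
    """Minimal authenticated Table API call."""
    req = urllib.request.Request(
        f"https://{INSTANCE}{path}",
        data=json.dumps(body).encode() if body else None,
        headers={"Content-Type": "application/json",
                 "Authorization": "Basic " + base64.b64encode(
                     f"{user}:{password}".encode()).decode()},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def resolve_incident(alert_id: str, user: str, password: str, reset_time: str):
    # Step 1 — look up sys_id values of open incidents carrying this alert ID
    found = _request("GET",
                     f"/api/now/table/incident?sysparm_query="
                     f"{open_incident_query(alert_id)}&sysparm_fields=sys_id",
                     user, password)
    for record in found["result"]:
        # Step 2 — PATCH each matching incident to Resolved (state 6)
        _request("PATCH", f"/api/now/table/incident/{record['sys_id']}",
                 user, password,
                 {"state": "6",
                  "close_code": "Solved (Permanently)",
                  "close_notes": f"Alert auto-resolved by SolarWinds NPM at {reset_time}."})

print(open_incident_query("12345"))
```

The state != 7 filter keeps the PATCH from reopening incidents a human already closed.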
PagerDuty Integration
PagerDuty handles the human side of the alerting chain: who gets paged, in what order, and what happens if they don't respond. SolarWinds fires the webhook; PagerDuty manages the escalation from there.
Alert Action: HTTP POST to PagerDuty Events API v2
PagerDuty Events API v2 — trigger payload:
URL: https://events.pagerduty.com/v2/enqueue
Method: POST
Content-Type: application/json
{
"routing_key": "R3a7b2c9d4e5f6a7b8c9d0e1f2a3b4c5d",
"event_action": "trigger",
"dedup_key": "${NodeName}-${AlertName}",
"payload": {
"summary": "${AlertName}: ${NodeName} — ${Site}",
"source": "${IP_Address}",
"severity": "critical",
"timestamp": "${AlertTriggerTime}",
"component": "${Device_Type}",
"group": "${Region}",
"class": "network",
"custom_details": {
"node": "${NodeName}",
"ip_address": "${IP_Address}",
"site": "${Site}",
"region": "${Region}",
"alert_message": "${AlertMessage}",
"orion_url": "${NodeDetailsURL}"
}
}
}
Auto-resolve payload (fires from alert Reset action):
{
"routing_key": "R3a7b2c9d4e5f6a7b8c9d0e1f2a3b4c5d",
"event_action": "resolve",
"dedup_key": "${NodeName}-${AlertName}"
}
The dedup_key is critical. If SolarWinds fires a Node Down alert and then the interface flaps and fires again before the engineer acknowledges, PagerDuty uses the dedup_key to recognise it as the same incident — no duplicate pages. Without it, an unstable link could generate dozens of pages per hour.
Escalation Policy
The on-call schedule at Conduent uses a 24/5 weekday rotation with weekend coverage:
| Level | Who | Wait Time | Contact Method |
|---|---|---|---|
| L1 | NOC Analyst (on-call) | 0 min — immediate | SMS + Push notification |
| L2 | Network Engineer | 5 min — no ack from L1 | SMS + Phone call |
| L3 | Senior Network Engineer | 15 min — no ack from L2 | Phone call (repeated) |
| Manager | Network Team Lead | 30 min — no ack from L3 | Phone call + Email |
| Secondary | Backup on-call engineer | 30 min — no ack from Manager | SMS + Phone call |
On-Call Rotation Configuration
PagerDuty Schedule: Network-APAC-OnCall
Layer 1 (Primary): Weekly rotation — Mon 08:00 local → Mon 08:00 local
Layer 2 (Secondary): Weekly rotation — offset 1 week from Primary
Layer 3 (Manager): Always-on override during business hours
Per-user notification rules:
Immediately: Push notification (PagerDuty mobile app)
After 1 min: SMS to registered mobile
After 3 min: Phone call (auto-dialer)
After 5 min: Escalate to next policy level
NOC Dashboard Best Practices
A well-structured NOC dashboard answers one question immediately: Is anything on fire right now?
Dashboard Layout (Orion Web Console)
Top-of-screen: Alert summary bar
Critical: 2 | Major: 7 | Minor: 14 | Warning: 31
Left panel: World map resource
- IP geolocation of all 2,100+ nodes mapped by country
- Color-coded by status: green=up, red=down, yellow=warning
- Click country to drill into regional node list
- Shows at a glance which of the 42 countries has active issues
Center panel: Active alert table + key metrics
- Node availability % (target: 99.9%)
- Top 10 interfaces by current utilization
- BGP neighbor count: expected vs currently established
Right panel: Recent events feed
- Last 50 SNMP trap messages
- Last 10 config changes (NCM integration)
- Last 5 ServiceNow incidents created
Custom Views by Team
- NOC View: All critical and major alerts, unacknowledged only. Auto-refresh every 60 seconds.
- APAC Regional View: Nodes filtered Region=APAC. Shows local time, alerts, and regional map.
- EMEA Regional View: Nodes filtered Region=EMEA. Scoped to EMEA business hours context.
- Management View: 30-day SLA trend, availability % per site, monthly incident counts.
Monthly SLA Reporting
Orion Reports → New Report → Node Availability Summary
Report type: Node Availability
Time range: Last 30 days
Group by: Site, Region
Columns: Node Name, IP, Availability %, Downtime (min), Alert Count
SLA reference thresholds:
99.5% monthly = max 3.6 hours downtime per node per month
99.9% monthly = max 43.8 minutes downtime per node per month
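Those budgets follow from a 730-hour (43,800-minute) average month; the arithmetic, for reference:

```python
MINUTES_PER_MONTH = 730 * 60  # 43,800 — average month length

def downtime_budget_minutes(sla_pct: float) -> float:
    """Maximum allowed downtime per node per month at a given SLA."""
    return MINUTES_PER_MONTH * (1 - sla_pct / 100)

print(round(downtime_budget_minutes(99.5)))     # → 219 minutes ≈ 3.6 hours
print(round(downtime_budget_minutes(99.9), 1))  # → 43.8 minutes
```

Comparing each node's measured downtime against this budget is what turns the availability report into a pass/fail SLA statement.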
Export formats:
PDF — monthly stakeholder report (auto-email to management)
CSV — raw data for trending and capacity planning analysis
Troubleshooting SolarWinds
Node Shows "Unknown" Status
Symptom: Node icon is grey/yellow, status "Unknown" — polling failing
Check 1: ICMP reachability from Orion server
Orion-Server> ping 10.100.1.1
If ping fails: routing or firewall issue between Orion and device
Check 2: SNMP reachability from Orion server
Orion-Server> snmpwalk -v3 -u orion-monitor -l authPriv \
    -a SHA -A 'Auth$ecret2024' -x AES -X 'Priv$ecret2024' \
    10.100.1.1 sysDescr
(secrets single-quoted so the shell does not expand the $)
Timeout = firewall blocking UDP 161, wrong credentials, or SNMP ACL mismatch
Check 3: Test credential from Orion UI
Settings → Manage Credentials → Select credential → Test against node IP
"Failed" result = credential mismatch (community, v3 auth/priv wrong)
Check 4: Verify firewall permits
UDP 161: Orion IP → Device (SNMP queries, outbound from Orion)
UDP 162: Device → Orion IP (SNMP traps, inbound to Orion)
Alert Not Firing
Use Simulate Alert to test without waiting for a real fault condition:
Alerts → Manage Alerts → Select alert → Simulate Alert
Specify test node → "Trigger Alert Now"
Verify: email arrived? ServiceNow ticket created? PagerDuty fired?
Most common root causes:
1. Alert scope too narrow — node may not match all filter conditions
2. Node is in an active maintenance window — alerts are suppressed
3. Alert is already active — same alert firing for this node already exists
4. Trigger duration not elapsed — condition exists but timer not reached
ServiceNow Integration Failing
Test REST endpoint directly from Orion server CLI:
Orion-Server> curl -u 'solarwinds-svc:P@ssword!' \
-H "Content-Type: application/json" \
-X POST \
-d '{"short_description":"Test from SolarWinds","category":"network"}' \
https://conduent.service-now.com/api/now/table/incident
200 OK = endpoint works — issue is in Orion alert action configuration
401 = wrong credentials stored in Orion alert action
000/Conn = proxy required — add proxy settings in Orion HTTP action config
Proxy configuration in Orion alert HTTP action:
Proxy server: 10.0.0.254:8080
Proxy auth: [if required by proxy policy]
High Orion Server Load / Polling Queue Backup
Symptoms: nodes showing stale data, polling queue growing, Orion CPU high
Check polling queue depth:
Settings → Polling Settings → View Current Polling Queue
Queue > 1,000 entries = poller is overloaded
Resolution options (in order of preference):
1. Add Additional Polling Engine (APE)
— each APE handles 800-1,000 nodes at 2-min interval independently
2. Increase polling intervals for non-critical access layer devices
— access switches: 5-min node poll instead of 2-min
3. Reduce polled resources per node
— exclude loopback, null0, management VRF interfaces
4. Enable SNMP Bulk Walk (fewer packets per poll cycle)
Settings → Polling Settings → Use Bulk Walk: Enabled
The SolarWinds + ServiceNow + PagerDuty stack, when configured correctly, transforms a reactive NOC into a proactive one. Alerts fire before users notice outages, tickets route to the right team automatically, and engineers get paged with full context — not just "something is down." The key is investing time in the alert logic: parent/child dependencies, baseline deviation thresholds, and dedup keys in PagerDuty. Get those right and the 3am wake-up calls become targeted, actionable, and rare.