Overview
At Conduent, the network monitoring stack covers 2,100+ clients across 42 countries. At that scale, manual monitoring is impossible — a single analyst cannot watch thousands of interfaces across every timezone. SolarWinds NPM is the foundation: it polls every device, tracks every interface, and fires alerts the moment something deviates from baseline.
The full monitoring stack layers four SolarWinds modules together:
- NPM (Network Performance Monitor) — SNMP polling, node/interface/volume health, alerting
- NTA (NetFlow Traffic Analyzer) — per-flow traffic analysis, top talkers, application breakdown
- NCM (Network Configuration Manager) — config backup, change detection, compliance policy
- IPAM (IP Address Manager) — IP allocation, subnet tracking, DHCP/DNS management
The integration chain that turns a network fault into an engineer's phone ringing:
NPM detects fault → Alert fires → REST API POST to ServiceNow → INC ticket created and auto-assigned → Webhook to PagerDuty → On-call engineer paged → Escalation if no acknowledgement
This post covers the full stack from SNMP configuration through PagerDuty escalation — everything needed to build a production NOC monitoring environment.
SNMP Configuration (Device Side)
SolarWinds polls devices over SNMP — no agent required. The configuration lives entirely on the network device. Getting SNMP right is the foundation of accurate monitoring.
SNMPv2c vs SNMPv3
SNMPv2c uses a plaintext community string for authentication. It is simple to configure and still works for non-regulated environments. SNMPv3 adds authentication (MD5/SHA) and optional encryption (DES/AES) — required for PCI-DSS, HIPAA, and any compliance framework that mandates encrypted management traffic.
| Feature | SNMPv2c | SNMPv3 |
|---|---|---|
| Authentication | Community string (plaintext) | Username + MD5/SHA HMAC |
| Encryption | None | DES / AES-128 / AES-256 |
| PCI-DSS compliant | No | Yes |
| Config complexity | Low | Medium |
| Orion support | Full | Full |
SNMPv3 Configuration on Cisco IOS
! Step 1: Create SNMP view — restrict what OIDs are readable
Router(config)# snmp-server view ORION-VIEW iso included
! Step 2: Create SNMPv3 group with auth+priv security
Router(config)# snmp-server group ORION-GROUP v3 priv read ORION-VIEW
! Step 3: Create SNMPv3 user — SHA auth, AES-128 priv
Router(config)# snmp-server user orion-monitor ORION-GROUP v3 auth sha Auth$ecret2024 priv aes 128 Priv$ecret2024
! Step 4: Set trap destination — Orion server IP
Router(config)# snmp-server host 10.0.0.10 version 3 priv orion-monitor
! Step 5: ACL restricting SNMP access to Orion only
Router(config)# ip access-list standard SNMP-ACL
Router(config-std-nacl)# permit 10.0.0.10
Router(config-std-nacl)# deny any log
Router(config-std-nacl)# exit
! Note: defining any community string — even a decoy name — re-enables SNMPv2c for that string;
! in a v3-only deployment, omit this line entirely and keep the ACL for any legacy strings
Router(config)# snmp-server community NOT-USED-DISABLED RO SNMP-ACL
! Step 6: Enable common traps
Router(config)# snmp-server enable traps snmp linkdown linkup coldstart
Router(config)# snmp-server enable traps bgp
Router(config)# snmp-server enable traps cpu threshold
! Verify SNMPv3 user was created
Router# show snmp user
! Expected output includes: User name orion-monitor, Authentication Protocol SHA, Privacy Protocol AES128
Community string hardening: Never use public or private. For SNMPv2c environments, use a randomly generated string (minimum 16 characters, mixed case, numbers, symbols). Apply an ACL to every community string — if Orion is your only poller, no other host should ever query SNMP.
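A throwaway way to mint a string that meets that policy — a hypothetical helper; any CSPRNG-backed generator works:

```python
import secrets
import string

SYMBOLS = "-_@#%"  # symbols chosen to survive copy-paste into IOS config

def generate_community_string(length: int = 24) -> str:
    """Random SNMP community string: mixed case, digits, symbols, >= 16 chars."""
    alphabet = string.ascii_letters + string.digits + SYMBOLS
    while True:
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        # Enforce the policy from the text: lower, upper, digit, symbol all present
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in SYMBOLS for c in candidate)):
            return candidate

print(generate_community_string())
```

Generate once per environment, store it in the credential vault, and rotate it on the same schedule as device passwords.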
Key OIDs Polled by SolarWinds NPM
| OID / MIB Object | What It Measures | Poll Use |
|---|---|---|
| ifOperStatus (.1.3.6.1.2.1.2.2.1.8) | Interface up/down state | Node and interface availability |
| ifInOctets / ifOutOctets | Bytes in/out per interface | Bandwidth utilization calculation |
| ifInErrors / ifOutErrors | Interface error counters | Hardware fault detection |
| sysUpTime (.1.3.6.1.2.1.1.3.0) | Device uptime in timeticks | Reboot detection |
| cpmCPUTotal5min | Cisco 5-min CPU average % | CPU threshold alerting |
| cbgpPeer2State | BGP neighbor FSM state | BGP down alerting |
| hrStorageUsed / hrStorageSize | Disk/volume utilization | Volume capacity alerting |
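The bandwidth calculation derived from ifInOctets/ifOutOctets can be sketched as follows — function names are illustrative, but the single-wrap Counter32 handling is what any poller of 32-bit octet counters must implement:

```python
COUNTER32_MAX = 2**32  # ifInOctets/ifOutOctets wrap at 2^32 on 32-bit counters

def octet_delta(prev: int, curr: int) -> int:
    """Delta between two Counter32 samples, accounting for a single wrap."""
    return curr - prev if curr >= prev else (COUNTER32_MAX - prev) + curr

def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, if_speed_bps: int) -> float:
    """Percent utilization over one polling interval (one direction)."""
    bits = octet_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (if_speed_bps * interval_s)

# Example: 1 Gbps link, 120-second poll, counter advanced by 1.5 GB
print(round(utilization_pct(1_000_000, 1_501_000_000, 120, 1_000_000_000), 1))  # → 10.0
```

On fast links polled at long intervals a 32-bit counter can wrap more than once between samples, which is undetectable — one reason to prefer the 64-bit ifHCInOctets/ifHCOutOctets counters where the device supports them.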
Adding Nodes to SolarWinds NPM
Step-by-Step Node Addition
Navigation path in Orion Web Console:
Settings → Manage Nodes → Add Node
Step 1: Enter hostname or IP address
Hostname/IP: 10.100.1.1
Polling method: Most Devices (ICMP + SNMP)
Step 2: Select SNMP version and credential
SNMP version: v3
Credential: [Select from Credential Library — "SNMPv3-Orion-Monitor"]
Step 3: Test connectivity
→ Click "Test" — confirms ICMP reachable + SNMP responding
→ Verify OIDs populate: sysDescr, sysUpTime, interface list
Step 4: Select resources to monitor
[x] All interfaces (includes port-channels, loopbacks, SVIs)
[x] CPU and memory
[x] Volumes (disk utilization)
[ ] Uncheck: loopback interfaces (reduce noise)
[ ] Uncheck: null0, management VRF if not needed
Step 5: Set Custom Properties
Site: CEBU-PH
Region: APAC
Device_Type: Router
Criticality: High
Team: Network-APAC
Step 6: Click Add Node — Orion begins polling immediately
Node Management States
- Managed: Normal polling state. Alerts fire on threshold breaches.
- Unmanaged: Polling suspended. Use during maintenance windows, device replacements, or planned outages. Alerts do not fire — prevents false pages.
- Deleted: Node removed from Orion entirely. Deleting a node also removes its historical statistics, so export any SLA or availability reports you need before removal.
To unmanage a node during a maintenance window:
Via Orion Web Console — right-click node → Unmanage
Set unmanage duration: Start time, End time (or indefinite)
Node status changes to "Unmanaged" — grey icon in dashboards
All child resources (interfaces, volumes) are also suppressed
Via Orion SDK (PowerShell) — bulk unmanage before maintenance:
# Requires the SwisPowerShell module from the Orion SDK
Import-Module SwisPowerShell
$swis = Connect-Swis -Hostname 10.0.0.10 -Credential (Get-Credential)
# Site is a custom property — query it through the CustomProperties navigation property
$nodes = Get-SwisData $swis "SELECT NodeID FROM Orion.Nodes WHERE CustomProperties.Site='CEBU-PH'"
foreach ($nodeId in $nodes) {
    # Unmanage takes the net-object ID ("N:<NodeID>"), start/end times, and an isRelative flag
    Invoke-SwisVerb $swis Orion.Nodes Unmanage @("N:$nodeId", "2026-03-15T02:00:00", "2026-03-15T06:00:00", "false")
}
SNMP Polling Intervals and Baselining
Polling Interval Strategy
| Resource Type | Default Interval | Critical Device | Notes |
|---|---|---|---|
| Node availability (ICMP) | 120 seconds | 60 seconds | Faster detection of outages |
| Interface utilization | 9 minutes | 2 minutes | More granular trending data |
| CPU / memory | 9 minutes | 2 minutes | Catch CPU spikes early |
| Volume (disk) | 15 minutes | 10 minutes | Disk fills slower — less critical |
| BGP neighbor state | 2 minutes | 60 seconds | BGP flaps need fast detection |
Decreasing polling intervals increases load on both the Orion server and the polled device. For 2,100+ nodes, use Additional Polling Engines to distribute load — a single Orion poller maxes out around 800–1,000 nodes at 2-minute intervals.
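A quick sanity check of engine count against the ~1,000-nodes-per-poller figure above (a rough sizing heuristic, not a SolarWinds formula):

```python
import math

def pollers_needed(node_count: int, nodes_per_poller: int = 1000) -> int:
    """Polling engines required at the stated per-engine capacity."""
    return math.ceil(node_count / nodes_per_poller)

# 2,100 nodes at 2-minute intervals → main poller plus Additional Polling Engines
print(pollers_needed(2100))  # → 3
```

In practice the per-engine ceiling also depends on polled elements per node and interval mix, so treat the result as a floor, not a target.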
Statistical Baselining
SolarWinds NPM calculates a rolling statistical baseline from 30 days of historical data. The baseline includes:
- Average: The mean value over the rolling window
- 95th percentile: Removes spike outliers — more useful than average for capacity planning
- Standard deviation: Measures how much the metric varies day-to-day
Baseline deviation alerting fires when current value exceeds baseline + (N × standard_deviation). This is more intelligent than a fixed threshold because a core router link running at 70% might be normal business hours traffic, while a 70% spike at 3am is anomalous.
Practical example: An MPLS uplink baseline is 45% utilization with a standard deviation of 8%. A baseline + 2σ alert fires at 61% — but only triggers during off-peak hours when that level represents a genuine anomaly. During business hours the alert may not fire at the same level because the baseline accounts for normal peak patterns by time-of-day segmentation.
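The deviation rule reduces to mean + N·σ over the rolling window. NPM computes this internally; the sketch below only mirrors the math with illustrative helpers:

```python
import statistics

def baseline_threshold(samples: list[float], n_sigma: float = 2.0) -> float:
    """Alert threshold = rolling mean + N standard deviations."""
    return statistics.fmean(samples) + n_sigma * statistics.stdev(samples)

def is_anomalous(current: float, samples: list[float], n_sigma: float = 2.0) -> bool:
    return current > baseline_threshold(samples, n_sigma)

# Off-peak utilization samples averaging ~45% — mirrors the MPLS uplink example
history = [45, 38, 52, 41, 47, 55, 36, 44, 49, 43]
print(round(baseline_threshold(history), 1))  # → 56.9 (mean + 2σ)
print(is_anomalous(70, history))              # → True — a genuine off-peak anomaly
```

With per-time-of-day sample windows, the same function yields different thresholds for business hours and 3am, which is exactly why the MPLS uplink example fires off-peak but not at midday.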
Custom Alerts — The 5 Most Useful for NOC
Navigate to: Alerts & Activity → Manage Alerts → Add New Alert
Each alert has three key components: Trigger condition (when to fire), Reset condition (when to clear), and Actions (what to do when fired/reset).
Alert 1: Node Down
Detects: Device completely unreachable — both ICMP ping and SNMP poll fail.
Trigger Condition:
Node.Status is equal to Down
AND Node.StatusDescription contains "Node Down"
Trigger after condition exists for: 3 consecutive polling cycles
(prevents false alerts from single poll timeout)
Trigger Actions:
1. Send email → noc-team@conduent.com
2. HTTP POST → ServiceNow Table API (create INC)
3. HTTP POST → PagerDuty Events API (page on-call)
Alert message: "CRITICAL: ${NodeName} (${IP_Address}) is DOWN — Site: ${Site}"
Reset Condition:
Node.Status is not equal to Down
Reset Actions:
1. Send email → noc-team@conduent.com (subject: RESOLVED)
2. HTTP POST → ServiceNow (resolve INC — state=6)
3. HTTP POST → PagerDuty (event_action=resolve)
Alert 2: Interface Utilization > 80%
Detects: Sustained bandwidth saturation — not momentary spikes.
Trigger Condition (OR across directions, AND the remaining filters):
(Interface.InPercentUtil is greater than 80
 OR Interface.OutPercentUtil is greater than 80)
AND Interface.OperStatus is equal to Up
AND Node.Criticality is equal to High
Trigger after condition exists for: 10 minutes
Scope: Only apply to WAN/uplink interfaces
Interface.Name contains "GigabitEthernet0/0"
OR Interface.ifType is equal to ethernetCsmacd
Alert message:
"HIGH UTIL: ${NodeName} ${InterfaceName} at ${InPercentUtil}% in / ${OutPercentUtil}% out"
"Baseline avg: ${BaselineInAvg}% — 95th pct: ${Baseline95th}%"
Alert 3: BGP Neighbor Down
Detects: BGP peering session drops — critical for MPLS/SD-WAN environments.
Trigger Condition:
Component.CbgpPeer2State is not equal to 6
(BGP states: 1=Idle 2=Connect 3=Active 4=OpenSent 5=OpenConfirm 6=Established)
AND Component.CbgpPeer2State is not equal to 0 (exclude unknown/unpolled)
Trigger after condition exists for: 2 minutes
Alert message:
"BGP DOWN: ${NodeName} — Peer ${CbgpPeer2RemoteAddr} state: ${CbgpPeer2State}"
"Remote ASN: ${CbgpPeer2RemoteAs} — Expected state: Established (6)"
Severity: Critical — BGP down means traffic blackholing on that peer
Alert 4: CPU > 85% Sustained
Detects: CPU exhaustion — can indicate routing loop, DDoS, or misconfiguration.
Trigger Condition:
Node.CPULoad is greater than 85
Trigger after condition exists for: 10 minutes
(short spikes during BGP convergence are normal — suppress those)
Alert message:
"CPU HIGH: ${NodeName} CPU at ${CPULoad}% for >10 minutes"
"Device: ${Device_Type} — Site: ${Site} — Region: ${Region}"
Reset condition (with hysteresis — prevents flapping):
Node.CPULoad is less than 70
Alert 5: Interface Error Rate
Detects: Hardware-level errors — duplex mismatch, bad SFP, failing cable.
Trigger Condition:
(Interface.InErrorsThisHour + Interface.OutErrorsThisHour) is greater than 100
AND Interface.OperStatus is equal to Up
Trigger after condition exists for: 2 polling cycles
Alert message:
"INTERFACE ERRORS: ${NodeName} ${InterfaceName}"
"Errors this hour: ${InErrorsThisHour} in / ${OutErrorsThisHour} out"
"Likely cause: duplex mismatch, bad SFP, or physical layer fault"
Severity: Warning — usually physical, rarely immediately traffic-impacting
Action: Create ServiceNow problem record for hardware investigation
Alert Suppression and Maintenance Windows
This is the most underutilised feature in SolarWinds — and the most important for a global deployment. When a single WAN uplink goes down, every device behind it becomes unreachable. Without suppression, one WAN failure generates hundreds of Node Down alerts. With parent/child dependency set correctly, it generates one alert for the WAN link and suppresses everything downstream.
Configuring Parent/Child Dependencies
In Orion: Node Details page → Dependencies tab → Add Dependency
Example topology:
CEBU-MPLS-RTR (parent WAN router)
└── CEBU-CORE-SW1 (depends on parent)
├── CEBU-ACC-SW1
├── CEBU-ACC-SW2
└── CEBU-ACC-SW3
Dependency configuration:
Parent Node: CEBU-MPLS-RTR
Parent Interface: GigabitEthernet0/0 (the WAN-facing interface)
Child Nodes: CEBU-CORE-SW1, CEBU-ACC-SW1, CEBU-ACC-SW2, CEBU-ACC-SW3
Result when WAN link fails:
→ CEBU-MPLS-RTR GigabitEthernet0/0: CRITICAL (1 alert fires, pages on-call)
→ CEBU-CORE-SW1: Unreachable — SUPPRESSED (no alert)
→ CEBU-ACC-SW1: Unreachable — SUPPRESSED (no alert)
→ CEBU-ACC-SW2: Unreachable — SUPPRESSED (no alert)
→ CEBU-ACC-SW3: Unreachable — SUPPRESSED (no alert)
Total alerts fired: 1 (not 5 — and not 200 in a large site)
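The suppression rule reduces to: a down node alerts only if no ancestor on its dependency chain is also down. A sketch over the topology above (hypothetical data structures, not the Orion SDK):

```python
def alerts_to_fire(down_nodes: set[str], parent_of: dict[str, str]) -> set[str]:
    """Return only root-cause alerts: a down node is suppressed when any
    ancestor on its dependency chain is also down."""
    fired = set()
    for node in down_nodes:
        ancestor, suppressed = parent_of.get(node), False
        while ancestor is not None:
            if ancestor in down_nodes:
                suppressed = True
                break
            ancestor = parent_of.get(ancestor)
        if not suppressed:
            fired.add(node)
    return fired

parent_of = {
    "CEBU-CORE-SW1": "CEBU-MPLS-RTR",
    "CEBU-ACC-SW1": "CEBU-CORE-SW1",
    "CEBU-ACC-SW2": "CEBU-CORE-SW1",
    "CEBU-ACC-SW3": "CEBU-CORE-SW1",
}
down = {"CEBU-MPLS-RTR", "CEBU-CORE-SW1", "CEBU-ACC-SW1",
        "CEBU-ACC-SW2", "CEBU-ACC-SW3"}
print(alerts_to_fire(down, parent_of))  # only the WAN router pages on-call
```

The walk up the chain is what makes multi-tier sites work: an access switch behind a down core switch behind a down WAN router is still suppressed by the topmost failure.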
Maintenance Windows — Scheduled Alert Suppression
Settings → Manage Maintenance Windows → New Window
Name: CEBU-SITE-POWER-MAINTENANCE
Description: Scheduled UPS replacement — all devices offline
Nodes: All nodes with Site=CEBU-PH (dynamic group filter)
Start: 2026-03-15 02:00 UTC+8
End: 2026-03-15 06:00 UTC+8
Suppress alerts: Yes
Unmanage nodes: Yes (stops polling — reduces Orion server load)
During maintenance window:
- All alerts for selected nodes are suppressed
- No ServiceNow tickets created
- No PagerDuty pages sent
- Nodes show "Unmanaged" status in all dashboards and maps
ServiceNow Integration
SolarWinds connects to ServiceNow via the REST API — specifically the Table API endpoint for the incident table. When an alert fires, Orion executes an HTTP POST action that creates an INC record with all relevant device context pre-populated.
Alert Action: HTTP POST to ServiceNow
In Alert definition → Add Action → HTTP GET/POST
URL: https://conduent.service-now.com/api/now/table/incident
Method: POST
Content-Type: application/json
Auth: Basic (service account credentials)
Request body — Orion substitutes ${variables} at alert fire time:
{
"caller_id": "solarwinds-service-account",
"category": "network",
"subcategory": "connectivity",
"short_description":"NETWORK ALERT: ${AlertName} on ${NodeName}",
"description": "${AlertMessage}\n\nNode: ${NodeName}\nIP: ${IP_Address}\nSite: ${Site}\nRegion: ${Region}\nDevice Type: ${Device_Type}\nSeverity: ${AlertSeverity}\nTime: ${AlertTriggerTime}",
"priority": "${ServiceNow_Priority}",
"assignment_group": "${ServiceNow_AssignmentGroup}",
"cmdb_ci": "${NodeName}",
"u_monitoring_tool":"SolarWinds NPM",
"u_alert_id": "${AlertID}"
}
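For testing the integration outside Orion, the same body can be built and POSTed with the Python standard library. A sketch with placeholder credentials and a subset of the fields — the values Orion substitutes via ${...} macros are passed in explicitly here:

```python
import base64
import json
import urllib.request

def build_incident(alert: dict) -> dict:
    """Mirror the Orion alert-action body with alert context filled in."""
    return {
        "caller_id": "solarwinds-service-account",
        "category": "network",
        "subcategory": "connectivity",
        "short_description": f"NETWORK ALERT: {alert['name']} on {alert['node']}",
        "description": (f"{alert['message']}\n\nNode: {alert['node']}\n"
                        f"IP: {alert['ip']}\nSite: {alert['site']}"),
        "assignment_group": alert["assignment_group"],
        "u_monitoring_tool": "SolarWinds NPM",
        "u_alert_id": alert["id"],
    }

def post_incident(instance: str, user: str, password: str, body: dict):
    """POST to the Table API incident endpoint with Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"https://{instance}/api/now/table/incident",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )
    return urllib.request.urlopen(req)  # raises on non-2xx responses

body = build_incident({"name": "Node Down", "node": "CEBU-MPLS-RTR",
                       "ip": "10.100.1.1", "site": "CEBU-PH", "message": "Node is DOWN",
                       "assignment_group": "Network-APAC-WAN", "id": "12345"})
print(body["short_description"])
```

Running the builder locally before wiring up the alert action catches field-name typos (u_alert_id vs alert_id) without generating test incidents.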
Custom Properties → ServiceNow Routing
The assignment_group is dynamically set using the node's custom properties. This routes alerts to the correct regional team automatically — no manual triage required:
| Region (Custom Property) | Device_Type | ServiceNow Assignment Group | Priority |
|---|---|---|---|
| APAC | Router | Network-APAC-WAN | P2 — High |
| EMEA | Router | Network-EMEA-WAN | P2 — High |
| AMER | Firewall | Network-AMER-Security | P1 — Critical |
| Any | Switch (Criticality=High) | Network-Core-Infra | P2 — High |
| Any | Switch (Criticality=Low) | Network-NOC-L1 | P3 — Moderate |
Auto-Resolve on Alert Reset
When SolarWinds clears an alert (reset condition met), a second HTTP POST closes the ServiceNow incident:
Reset action — update the existing incident by alert ID. The Table API only accepts PATCH against a specific record, so the reset is two steps:
Step 1 — GET to look up the sys_id of the open incident:
https://conduent.service-now.com/api/now/table/incident?sysparm_query=u_alert_id=${AlertID}^state!=7&sysparm_fields=sys_id
Step 2 — PATCH https://conduent.service-now.com/api/now/table/incident/{sys_id}
Body:
{
"state": "6",
"close_code": "Solved (Permanently)",
"close_notes": "Alert auto-resolved by SolarWinds NPM at ${AlertResetTime}. Node returned to normal state."
}
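Because the Table API addresses updates by sys_id, the resolve action amounts to a query followed by a PATCH per matching record. A stdlib-only Python sketch — instance name and custom field names as used above, nothing here is an official SolarWinds action:

```python
import base64
import json
import urllib.parse
import urllib.request

INSTANCE = "conduent.service-now.com"  # placeholder instance

def open_incident_query(alert_id: str) -> str:
    """Encoded sysparm_query matching open (state != 7) incidents for this alert."""
    return urllib.parse.quote(f"u_alert_id={alert_id}^state!=7")

def _request(method: str, path: str, user: str, password: str, body=None):
    """Minimal authenticated Table API call."""
    req = urllib.request.Request(
        f"https://{INSTANCE}{path}",
        data=json.dumps(body).encode() if body else None,
        headers={"Content-Type": "application/json",
                 "Authorization": "Basic " + base64.b64encode(
                     f"{user}:{password}".encode()).decode()},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def resolve_incident(alert_id: str, user: str, password: str, reset_time: str):
    # Step 1 — look up sys_id values of open incidents carrying this alert ID
    found = _request("GET",
                     f"/api/now/table/incident?sysparm_query="
                     f"{open_incident_query(alert_id)}&sysparm_fields=sys_id",
                     user, password)
    for record in found["result"]:
        # Step 2 — PATCH each matching incident to Resolved (state 6)
        _request("PATCH", f"/api/now/table/incident/{record['sys_id']}",
                 user, password,
                 {"state": "6",
                  "close_code": "Solved (Permanently)",
                  "close_notes": f"Alert auto-resolved by SolarWinds NPM at {reset_time}."})

print(open_incident_query("12345"))
```

The state != 7 filter keeps the PATCH from reopening incidents a human already closed.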
PagerDuty Integration
PagerDuty handles the human side of the alerting chain: who gets paged, in what order, and what happens if they don't respond. SolarWinds fires the webhook; PagerDuty manages the escalation from there.
Alert Action: HTTP POST to PagerDuty Events API v2
PagerDuty Events API v2 — trigger payload:
URL: https://events.pagerduty.com/v2/enqueue
Method: POST
Content-Type: application/json
{
"routing_key": "R3a7b2c9d4e5f6a7b8c9d0e1f2a3b4c5d",
"event_action": "trigger",
"dedup_key": "${NodeName}-${AlertName}",
"payload": {
"summary": "${AlertName}: ${NodeName} — ${Site}",
"source": "${IP_Address}",
"severity": "critical",
"timestamp": "${AlertTriggerTime}",
"component": "${Device_Type}",
"group": "${Region}",
"class": "network",
"custom_details": {
"node": "${NodeName}",
"ip_address": "${IP_Address}",
"site": "${Site}",
"region": "${Region}",
"alert_message": "${AlertMessage}",
"orion_url": "${NodeDetailsURL}"
}
}
}
Auto-resolve payload (fires from alert Reset action):
{
"routing_key": "R3a7b2c9d4e5f6a7b8c9d0e1f2a3b4c5d",
"event_action": "resolve",
"dedup_key": "${NodeName}-${AlertName}"
}
The dedup_key is critical. If SolarWinds fires a Node Down alert and then the interface flaps and fires again before the engineer acknowledges, PagerDuty uses the dedup_key to recognise it as the same incident — no duplicate pages. Without it, an unstable link could generate dozens of pages per hour.
Escalation Policy
The on-call schedule at Conduent uses a 24/5 weekday rotation with weekend coverage:
| Level | Who | Wait Time | Contact Method |
|---|---|---|---|
| L1 | NOC Analyst (on-call) | 0 min — immediate | SMS + Push notification |
| L2 | Network Engineer | 5 min — no ack from L1 | SMS + Phone call |
| L3 | Senior Network Engineer | 15 min — no ack from L2 | Phone call (repeated) |
| Manager | Network Team Lead | 30 min — no ack from L3 | Phone call + Email |
| Secondary | Backup on-call engineer | 30 min — no ack from Manager | SMS + Phone call |
On-Call Rotation Configuration
PagerDuty Schedule: Network-APAC-OnCall
Layer 1 (Primary): Weekly rotation — Mon 08:00 local → Mon 08:00 local
Layer 2 (Secondary): Weekly rotation — offset 1 week from Primary
Layer 3 (Manager): Always-on override during business hours
Per-user notification rules:
Immediately: Push notification (PagerDuty mobile app)
After 1 min: SMS to registered mobile
After 3 min: Phone call (auto-dialer)
After 5 min: Escalate to next policy level
NOC Dashboard Best Practices
A well-structured NOC dashboard answers one question immediately: Is anything on fire right now?
Dashboard Layout (Orion Web Console)
Top-of-screen: Alert summary bar
Critical: 2 | Major: 7 | Minor: 14 | Warning: 31
Left panel: World map resource
- IP geolocation of all 2,100+ nodes mapped by country
- Color-coded by status: green=up, red=down, yellow=warning
- Click country to drill into regional node list
- Shows at a glance which of the 42 countries has active issues
Center panel: Active alert table + key metrics
- Node availability % (target: 99.9%)
- Top 10 interfaces by current utilization
- BGP neighbor count: expected vs currently established
Right panel: Recent events feed
- Last 50 SNMP trap messages
- Last 10 config changes (NCM integration)
- Last 5 ServiceNow incidents created
Custom Views by Team
- NOC View: All critical and major alerts, unacknowledged only. Auto-refresh every 60 seconds.
- APAC Regional View: Nodes filtered Region=APAC. Shows local time, alerts, and regional map.
- EMEA Regional View: Nodes filtered Region=EMEA. Scoped to EMEA business hours context.
- Management View: 30-day SLA trend, availability % per site, monthly incident counts.
Monthly SLA Reporting
Orion Reports → New Report → Node Availability Summary
Report type: Node Availability
Time range: Last 30 days
Group by: Site, Region
Columns: Node Name, IP, Availability %, Downtime (min), Alert Count
SLA reference thresholds:
99.5% monthly = max 3.6 hours downtime per node per month
99.9% monthly = max 43.8 minutes downtime per node per month
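Those budgets follow from a 730-hour (43,800-minute) average month; the arithmetic, for reference:

```python
MINUTES_PER_MONTH = 730 * 60  # 43,800 — average month length

def downtime_budget_minutes(sla_pct: float) -> float:
    """Maximum allowed downtime per node per month at a given SLA."""
    return MINUTES_PER_MONTH * (1 - sla_pct / 100)

print(round(downtime_budget_minutes(99.5)))     # → 219 minutes ≈ 3.6 hours
print(round(downtime_budget_minutes(99.9), 1))  # → 43.8 minutes
```

Comparing each node's measured downtime against this budget is what turns the availability report into a pass/fail SLA statement.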
Export formats:
PDF — monthly stakeholder report (auto-email to management)
CSV — raw data for trending and capacity planning analysis
Troubleshooting SolarWinds
Node Shows "Unknown" Status
Symptom: Node icon is grey/yellow, status "Unknown" — polling failing
Check 1: ICMP reachability from Orion server
Orion-Server> ping 10.100.1.1
If ping fails: routing or firewall issue between Orion and device
Check 2: SNMP reachability from Orion server
Orion-Server> snmpwalk -v3 -u orion-monitor -l authPriv \
    -a SHA -A 'Auth$ecret2024' -x AES -X 'Priv$ecret2024' \
    10.100.1.1 sysDescr
(secrets single-quoted so the shell does not expand the $)
Timeout = firewall blocking UDP 161, wrong credentials, or SNMP ACL mismatch
Check 3: Test credential from Orion UI
Settings → Manage Credentials → Select credential → Test against node IP
"Failed" result = credential mismatch (community, v3 auth/priv wrong)
Check 4: Verify firewall permits
UDP 161: Orion IP → Device (SNMP queries, outbound from Orion)
UDP 162: Device → Orion IP (SNMP traps, inbound to Orion)
Alert Not Firing
Use Simulate Alert to test without waiting for a real fault condition:
Alerts → Manage Alerts → Select alert → Simulate Alert
Specify test node → "Trigger Alert Now"
Verify: email arrived? ServiceNow ticket created? PagerDuty fired?
Most common root causes:
1. Alert scope too narrow — node may not match all filter conditions
2. Node is in an active maintenance window — alerts are suppressed
3. Alert is already active — same alert firing for this node already exists
4. Trigger duration not elapsed — condition exists but timer not reached
ServiceNow Integration Failing
Test REST endpoint directly from Orion server CLI:
Orion-Server> curl -u 'solarwinds-svc:P@ssword!' \
-H "Content-Type: application/json" \
-X POST \
-d '{"short_description":"Test from SolarWinds","category":"network"}' \
https://conduent.service-now.com/api/now/table/incident
200 OK = endpoint works — issue is in Orion alert action configuration
401 = wrong credentials stored in Orion alert action
000/Conn = proxy required — add proxy settings in Orion HTTP action config
Proxy configuration in Orion alert HTTP action:
Proxy server: 10.0.0.254:8080
Proxy auth: [if required by proxy policy]
High Orion Server Load / Polling Queue Backup
Symptoms: nodes showing stale data, polling queue growing, Orion CPU high
Check polling queue depth:
Settings → Polling Settings → View Current Polling Queue
Queue > 1,000 entries = poller is overloaded
Resolution options (in order of preference):
1. Add Additional Polling Engine (APE)
— each APE handles 800-1,000 nodes at 2-min interval independently
2. Increase polling intervals for non-critical access layer devices
— access switches: 5-min node poll instead of 2-min
3. Reduce polled resources per node
— exclude loopback, null0, management VRF interfaces
4. Enable SNMP Bulk Walk (fewer packets per poll cycle)
Settings → Polling Settings → Use Bulk Walk: Enabled
The SolarWinds + ServiceNow + PagerDuty stack, when configured correctly, transforms a reactive NOC into a proactive one. Alerts fire before users notice outages, tickets route to the right team automatically, and engineers get paged with full context — not just "something is down." The key is investing time in the alert logic: parent/child dependencies, baseline deviation thresholds, and dedup keys in PagerDuty. Get those right and the 3am wake-up calls become targeted, actionable, and rare.