The Scale Problem
Managing network infrastructure across 42 countries sounds daunting, but it's not magic: it's systems, process, and tooling working together. The engineers in our NOC in Cebu handle devices in the US, UK, Germany, South Africa, India, Japan, and everywhere in between, on a 24/5 on-call rotation.
Here's what actually makes it work.
The Monitoring Foundation
You can't manage what you can't see. Every device in our network — routers, switches, firewalls, wireless controllers — is registered in SolarWinds NPM with SNMP v3 credentials.
What we poll:
- Interface up/down status (every 60 seconds)
- Interface utilization — input/output in bps, error/discard counters
- CPU and memory on all critical devices
- BGP neighbor state changes (via SNMP traps)
- Hardware health — PSU, fan, temperature sensors
What we don't rely on polling for:
- Security events (go to Splunk via syslog)
- SD-WAN path quality (VeloCloud Orchestrator handles this natively)
- Wireless client issues (Meraki Dashboard + SNS alerts)
The key principle: every alert must be actionable. If an alert fires and the on-call engineer can't do anything about it, the alert is wrong: it should be suppressed, correlated with a parent event, or have its threshold tuned. Alert fatigue kills NOC effectiveness faster than anything else.
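Correlating child alerts with a parent event can be sketched in a few lines. This is a minimal illustration, not how SolarWinds implements dependencies: the `Alert` class, its `parent` field, the check names, and the device names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    device: str
    check: str                    # e.g. "node_down", "interface_down" (hypothetical names)
    parent: Optional[str] = None  # upstream device this one depends on (hypothetical field)


def actionable_alerts(alerts: list[Alert]) -> list[Alert]:
    """Drop alerts whose upstream parent is itself reported down."""
    down_nodes = {a.device for a in alerts if a.check == "node_down"}
    return [a for a in alerts if a.parent not in down_nodes]


# Illustrative data: one core router failure takes two access switches with it.
alerts = [
    Alert("core-rtr-01", "node_down"),
    Alert("access-sw-11", "node_down", parent="core-rtr-01"),
    Alert("access-sw-12", "interface_down", parent="core-rtr-01"),
    Alert("edge-fw-02", "cpu_high"),
]

for alert in actionable_alerts(alerts):
    print(alert.device, alert.check)
# core-rtr-01 node_down
# edge-fw-02 cpu_high
```

The on-call engineer sees one alert for the core router instead of three, plus the unrelated firewall alert.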
Severity Framework
All alerts map to one of four severities. This is consistent across SolarWinds, PagerDuty, and ServiceNow.
The hardest discipline is not escalating everything to P1. The instinct is to call everything critical. Resist it. If you cry wolf on P1, your stakeholders stop responding with urgency.
The On-Call Rotation
We run 24/5 on-call coverage: the rotation covers P1 and P2 incidents on weekdays, and a separate escalation tier handles weekends. Each engineer carries an on-call week every 5–6 weeks.
What makes on-call sustainable:
- Runbooks for every common incident type. No one should be debugging MPLS core failures at 3am from memory. The runbook lives in Confluence, linked from the PagerDuty alert.
- Shift handoff notes. Every on-call shift ends with a written handoff: what was worked on, what's still open, any devices in degraded state.
- Post-incident reviews (PIRs) for all P1s. Within 48 hours. Not blame-focused — focused on what the process missed.
- Suppression of known maintenance windows. Every scheduled maintenance registers in our change calendar, which SolarWinds reads to suppress expected alerts. Nothing wakes up the on-call engineer for a planned router reload.
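The maintenance-window check reduces to an interval lookup. The sketch below assumes the change calendar can be exported as a list of per-device windows; `MaintenanceWindow`, `is_suppressed`, the device name, and the timestamps are illustrative and not part of any SolarWinds API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    device: str
    start: datetime
    end: datetime


def is_suppressed(device: str, fired_at: datetime,
                  windows: list[MaintenanceWindow]) -> bool:
    """True if the alert fired inside a scheduled window for that device."""
    return any(
        w.device == device and w.start <= fired_at < w.end
        for w in windows
    )


# Illustrative window: a planned router reload between 01:00 and 03:00 UTC.
windows = [
    MaintenanceWindow(
        device="branch-rtr-07",
        start=datetime(2024, 3, 5, 1, 0, tzinfo=timezone.utc),
        end=datetime(2024, 3, 5, 3, 0, tzinfo=timezone.utc),
    )
]

during = datetime(2024, 3, 5, 1, 30, tzinfo=timezone.utc)
after = datetime(2024, 3, 5, 3, 15, tzinfo=timezone.utc)
print(is_suppressed("branch-rtr-07", during, windows))  # True
print(is_suppressed("branch-rtr-07", after, windows))   # False
```

Keeping every timestamp timezone-aware avoids a window registered in local time being compared against an alert stamped in UTC.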
How I Triage a P1
When PagerDuty fires a P1 at 2am:
# 1. Acknowledge in PagerDuty within 15 minutes (SLA clock starts on ack)
# 2. Open the SolarWinds alert — identify the affected device and its location
# 3. Correlate: is this one device or multiple?
#    Multiple devices in the same region = ISP/transport issue, not device issue
#    Single device = hardware or config
# 4. Check last change: run a quick check in the ServiceNow change calendar
#    "Was there a change in the last 4 hours on or near this device?"
#    If yes: change is likely the cause. Contact the change owner first.
# 5. SSH to the device (via jump server) and run diagnostics
$ ssh -J jumpserver admin@device-ip
Router# show interface GigabitEthernet0/0/0
Router# show log | last 50
Router# show ip bgp summary
Router# show ip route summary
# 6. Determine: is this a network issue (routing, ISP) or device issue?
#    Network: open ISP ticket, notify account team, document in ServiceNow
#    Device: apply fix or workaround; if hardware, dispatch onsite tech
# 7. Update the P1 ticket every 30 minutes until resolved
#    Client-facing update every 30 min is contractual for our largest accounts
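The decision logic in steps 3 and 4 above can be written down as a small function. This is a sketch under assumptions: the function name and return strings are hypothetical, and checking for a recent change before the device count is my reading of "contact the change owner first", not something the checklist states.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CHANGE_LOOKBACK = timedelta(hours=4)  # "last 4 hours" from step 4


def first_hypothesis(devices_alerting_in_region: int,
                     last_change_at: Optional[datetime],
                     now: datetime) -> str:
    """Return the first thing to investigate, following steps 3 and 4."""
    if last_change_at is not None:
        age = now - last_change_at
        if timedelta(0) <= age <= CHANGE_LOOKBACK:
            return "recent change: contact the change owner first"
    if devices_alerting_in_region > 1:
        return "multiple devices: suspect ISP/transport"
    return "single device: suspect hardware or config"


now = datetime(2024, 3, 5, 2, 0, tzinfo=timezone.utc)  # illustrative timestamp
print(first_hypothesis(1, now - timedelta(hours=1), now))
print(first_hypothesis(3, None, now))
print(first_hypothesis(1, now - timedelta(hours=9), now))
# recent change: contact the change owner first
# multiple devices: suspect ISP/transport
# single device: suspect hardware or config
```

The function only picks a starting hypothesis. The diagnostics in step 5 still decide what is actually wrong.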
The 42-Country Reality: What Nobody Tells You
Time zones are the hardest part. When a P1 fires in Japan at 8am JST, it's 7am in Cebu — which is our morning. That's actually fine. The problem is when the US East Coast sites go down at midnight EST — that's 1pm in Cebu, also fine. The painful slot is 3–6am UTC, which covers late night US and early morning Europe simultaneously.
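The conversions above are easy to check with the standard library. The sketch uses fixed UTC offsets (JST +9, Cebu/PHT +8, EST -5, CET +1), which is an assumption that ignores daylight saving in the US and Europe. The dates are arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: an assumption that ignores daylight saving time.
JST = timezone(timedelta(hours=9), "JST")
PHT = timezone(timedelta(hours=8), "PHT")   # Cebu
EST = timezone(timedelta(hours=-5), "EST")
CET = timezone(timedelta(hours=1), "CET")


def in_zone(moment: datetime, zone: timezone) -> str:
    """Format an aware datetime as HH:MM plus the zone name."""
    return moment.astimezone(zone).strftime("%H:%M %Z")


japan_p1 = datetime(2024, 1, 15, 8, 0, tzinfo=JST)
print(in_zone(japan_p1, PHT))    # 07:00 PHT

us_outage = datetime(2024, 1, 15, 0, 0, tzinfo=EST)
print(in_zone(us_outage, PHT))   # 13:00 PHT

# The 03:00-06:00 UTC window, seen from each region.
for hour in (3, 6):
    utc_moment = datetime(2024, 1, 15, hour, 0, tzinfo=timezone.utc)
    print(in_zone(utc_moment, EST), in_zone(utc_moment, CET), in_zone(utc_moment, PHT))
# 22:00 EST 04:00 CET 11:00 PHT
# 01:00 EST 07:00 CET 14:00 PHT
```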
Local contacts matter more than you think. For countries where we can't directly access the physical layer — where a cable is unplugged, a power outlet tripped, or a device needs a reboot — having a trusted local contact (facilities, IT helpdesk, or a dedicated TAC partner) is the difference between a 30-minute resolution and a 6-hour one.
Standardization is your force multiplier. The more your configurations look alike across every site, the faster you can work when things break. We have baseline config templates for every device type. Deviation from baseline requires a change request. This means when I SSH into a Cisco ASA in Germany I haven't touched in 6 months, the output of show run still looks like every other ASA in the network.
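Checking a device against its baseline template is essentially a line diff. The sketch below uses Python's `difflib` and assumes both configs are available as plain text. The `baseline_drift` helper and the three-line sample configs are illustrative only, not our actual templates.

```python
import difflib


def baseline_drift(baseline: str, running: str) -> list[str]:
    """Return lines removed from ("- ") or added to ("+ ") the baseline."""
    diff = difflib.ndiff(baseline.splitlines(), running.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Illustrative configs: the running config has a changed logging level.
baseline = "\n".join([
    "hostname fw-site",
    "ntp server 10.0.0.1",
    "logging trap informational",
])
running = "\n".join([
    "hostname fw-site",
    "ntp server 10.0.0.1",
    "logging trap debugging",
])

for line in baseline_drift(baseline, running):
    print(line)
# - logging trap informational
# + logging trap debugging
```

An empty result means the device matches its template. Anything else should map to an approved change request.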
Documentation decays. What's in Confluence was accurate when it was written. Assume every document is 6 months stale and verify against actual device output before trusting it during an incident.
Lessons Learned
After years of 24/5 global ops:
- Alert on symptoms, not causes. "Interface down" is a symptom. "BGP peer lost" is often also just a symptom. Alert on client impact — the downstream effect — and correlate backward to root cause.
- Never fix under pressure what you don't understand. A P1 with an unknown root cause is better served by a stable workaround and a methodical diagnosis than a frantic config change that might make things worse.
- Know your blast radius before every change. Even in a P1 — especially in a P1 — ask: if this command is wrong, what breaks?
- Build relationships with ISP NOCs. You will call them. Knowing the right team, having account numbers ready, and speaking their escalation language gets you prioritized over generic customers.
- Automate the repetitive, leave the complex to people. Scripts that handle known P4 alerts (resetting a flapping interface, parsing syslog) are fine. Never automate P1 response. You want a human making that call.
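One way to enforce that last rule in tooling is a severity gate in front of any automated remediation. The sketch below is hypothetical: the `handle` function, the alert type names, and the remediation table are invented for illustration.

```python
from typing import Callable

# Only the lowest severity is eligible for automation (assumption for this sketch).
AUTOMATABLE_SEVERITIES = {"P4"}


def handle(alert_type: str, severity: str,
           remediations: dict[str, Callable[[], str]]) -> str:
    """Run a known remediation for P4 alerts; send everything else to a human."""
    action = remediations.get(alert_type)
    if severity in AUTOMATABLE_SEVERITIES and action is not None:
        return action()
    return f"page on-call: {severity} {alert_type}"


# Hypothetical remediation table; a real entry would call a tested script.
remediations = {
    "interface_flap": lambda: "automated: interface flap reset",
}

print(handle("interface_flap", "P4", remediations))
print(handle("interface_flap", "P1", remediations))
print(handle("bgp_peer_lost", "P4", remediations))
# automated: interface flap reset
# page on-call: P1 interface_flap
# page on-call: P4 bgp_peer_lost
```

The gate defaults to a human: an unknown alert type or any severity above P4 never reaches a script.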