Overview
There is a distinction that separates competent network engineers from engineers who are genuinely trusted with critical infrastructure: the ability to manage an incident, not just fix it. Technical skill gets you halfway there. The other half is structured thinking under pressure, communication under fire, and systematic elimination of hypotheses while six people are talking at once on a bridge and your SLA clock is at 40 minutes.
This post documents the exact approach I used across global P1 incidents at Conduent, supporting healthcare, finance, and government clients on 24/5 PagerDuty on-call rotations. The most significant was an Asia-Pacific MPLS core failure that affected 2,100+ clients simultaneously. Total downtime: 42 minutes against a 60-minute SLA. What made that possible was not luck — it was a repeatable system.
That system is what this post is about.
Incident Severity Framework
Not every alert is a P1. Misclassifying severity wastes resources on low-impact events and, more dangerously, under-resources genuine crises. Every team needs a severity matrix everyone understands and agrees on before incidents happen — not during them.
Critical rule: When in doubt, declare up. It is easy to downgrade a P1 to P2 after five minutes of investigation when you have more information. It is much harder to recover from the trust damage caused by treating a P1 as a P3 and missing your SLA because you were moving slowly.
The First 5 Minutes — From PagerDuty Alert to Bridge
The first five minutes of a P1 are the most important. Not because you will solve the incident in five minutes — you will not. But because the posture you establish in the first five minutes sets the tone for everything that follows: how quickly the team organizes, how confident the client is that someone competent is in control, and how clean your audit trail will be when you write the RCA.
What to do in the first 5 minutes — step by step:
1. Acknowledge PagerDuty immediately. The acknowledge action stops the escalation clock. If you are driving, pull over safely. If you are asleep, get to a device. Acknowledge first, assess second. An unacknowledged P1 alert escalates to your manager in most configurations after 5–10 minutes — do not make your first act of the incident be getting your manager paged unnecessarily.
2. Verify the alert is real. Not every PagerDuty alert represents a real outage. SolarWinds can have polling failures, ICMP can be blocked, a monitoring agent can crash. Before you spin up a full war room, take 90 seconds to confirm: open SolarWinds directly, check a second monitoring system if available, or call the NOC desk. A specific question: "Can you ping 10.42.0.1 from the NOC probe?" If the answer is yes, you have a monitoring false alarm, not a P1.
3. Assess blast radius immediately. How many sites are down? How many clients? Which clients? Healthcare clients with patient care systems have a different urgency profile than internal corporate sites. Know which clients are on your service before the incident happens — have this list saved offline.
4. Open the bridge — even without answers. The bridge is the incident. Even if you know nothing yet, opening the bridge within 5 minutes signals to everyone that a human is in control and working it. You can open a bridge, say "I have the bridge, I'm pulling logs, give me two minutes," and that is enough to stop the chaos of multiple people independently trying to investigate without coordination.
5. Make your opening statement. The moment you start talking on the bridge, use this structure:
"I'm Emman, I have the bridge. We have [X] sites confirmed down, [Y] sites suspected down, [Z] clients potentially affected. SLA is 60 minutes from [time of first alert]. I'm looking at [specific system] now. [Name], I need you to pull show interface status on [router]. [Name], please draft the initial client notification and hold for my review."
That statement establishes your role as IC, gives the team their first tasks, and makes clear you have a plan. All before you have actually diagnosed anything.
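The 90-second verification in step 2 can be run straight from the NOC probe or a neighboring router. A minimal sketch (the probe hostname is illustrative; 10.42.0.1 is the example address from step 2):

# Hypothetical false-alarm check: run from the NOC probe before declaring a P1
NOC-Probe-01# ping 10.42.0.1 repeat 5
# Cross-check from a second vantage point, sourcing from a specific interface
BorderRouter-SG# ping 10.42.0.1 source GigabitEthernet0/0
# If both succeed, you likely have a monitoring false alarm, not an outage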
War Room / Bridge Structure
A bridge without structure is just noise. The most common failure mode in P1 incidents is everyone on the call simultaneously troubleshooting, speculating out loud, and reporting conflicting findings to no one in particular. The Incident Commander model exists to prevent exactly this.
Roles on the bridge — defined clearly:
Incident Commander (IC): Controls the bridge. Assigns tasks. Tracks the timeline. Makes decisions on escalation, failover, and client communication. The IC does NOT troubleshoot — this is the hardest discipline to maintain under pressure. The moment the IC starts staring at a routing table, nobody is driving the incident anymore.
Technical Lead: Does the actual troubleshooting. Has full access to devices. Reports findings to the IC — "I'm seeing BGP neighbor down on BorderRouter-SG with the Singapore ISP link." The Technical Lead does not communicate directly with the client and does not make escalation decisions — those go to IC.
Client Liaison / Account Manager: Handles all client-facing communication. Drafts the updates, gets them reviewed by IC, sends them. Never joins the technical troubleshooting conversation — their job is to keep the client's anxiety level manageable while the technical team works.
NOC Team: Runs specific commands as directed. "Please run show ip bgp summary on all APAC border routers and paste output to the ticket." The NOC reports results to the Technical Lead or IC. They do not troubleshoot independently during a P1 — independent NOC investigation during a bridge creates duplicate actions and confusion.
Scribe: Documents everything in the ticket in real time. Every finding. Every decision. Every command run. Every time check. This person is invisible during the incident and invaluable during the RCA and any customer audit afterward.
Communication Cadence
Client Updates — The Golden Rule: No Surprises
The single most damaging thing you can do to a client relationship during an incident is go silent. A client who has not heard from you in 45 minutes during an outage will call their account executive, their manager, and your manager simultaneously. By the time you restore service, you are also managing three angry escalation conversations.
The update cadence:
- T+10–15: Initial notification — you are aware, you are investigating, next update in 30 minutes
- T+30: Status update — what you have identified, what you are doing about it, next update in 30 minutes
- T+60 / T+90: Progress update — even if nothing has changed, you send the update. "We are still working with our upstream provider. No new information, but our team is actively engaged. Next update in 30 minutes."
- Resolution: Restoration confirmation — service restored at [exact time], all sites confirmed, RCA to follow within 24 hours
Template messages:
Initial Notification (T+10):
Subject: [INC-2024-0342] Service Disruption — APAC Region — P1 In Progress
Dear [Client Name],
We are writing to advise that we are currently experiencing a service disruption affecting connectivity to sites in the Asia-Pacific region. Our network operations team is actively investigating. We will provide a status update within 30 minutes or sooner if the situation changes.
Incident Reference: INC-2024-0342
Time of Detection: 02:47 AEST
Affected Region: Asia-Pacific
Impact: Connectivity degradation to multiple sites
We apologize for the inconvenience and will keep you informed.
Status Update (T+30):
Subject: [INC-2024-0342] Update #2 — Root Cause Identified — Remediation In Progress
Dear [Client Name],
Update on INC-2024-0342. We have identified the root cause as a BGP routing disruption at our upstream MPLS provider's Singapore point-of-presence. We are currently implementing a traffic rerouting solution via our alternate provider path. We expect service restoration within the next 20 minutes.
Next update: 30 minutes or upon service restoration.
Resolution Confirmation:
Subject: [INC-2024-0342] RESOLVED — Service Restored — All Sites Confirmed
Dear [Client Name],
We are pleased to confirm that service has been fully restored as of 03:29 AEST. All [47] affected sites have been confirmed operational by our NOC. Total duration of impact: 42 minutes.
A full Root Cause Analysis (RCA) document will be provided within 24 hours.
We sincerely apologize for the disruption and the impact to your operations. Please do not hesitate to contact us if any sites remain affected.
Internal Bridge Discipline
Only the IC speaks to manage the bridge. When six people are talking simultaneously, nobody is listening. The IC sets the rhythm: asks specific people specific questions, receives answers, synthesizes them, makes decisions, announces them. Everyone else is on mute unless directly addressed or has a critical finding to report.
State changes go to IC first. The Technical Lead does not announce to the whole bridge "I think it might be BGP." The Technical Lead tells the IC in a side chat: "I'm seeing BGP session down on BorderRouter-SG." The IC decides when and how to share that with the bridge and the client.
"I don't know" is a valid answer. "The BGP session is down and I don't yet know why" is correct. "I think maybe it could be the ISP or possibly a config change or maybe hardware" is not — it is noise that creates false hypotheses on the bridge and gets into client communications where it causes more harm.
Speculation goes to IC privately. Hypotheses are valuable. Hypotheses shouted on a live bridge where the client might be listening are not. Filter your ideas through the IC.
The Triage Methodology — Structured Elimination
The biggest mistake engineers make in P1 troubleshooting is jumping to the most interesting hypothesis first rather than the most systematic one. You have seen BGP before, so you go straight to checking BGP. But the BGP session might be down because the physical interface is flapping — and if you never checked Layer 1, you spent 20 minutes debugging BGP on a dead link.
The methodology is bottom-up. Always. Every time. You skip layers only when you have definitive proof that the lower layer is not the issue.
Layer-by-Layer Approach
Layer 1 — Physical and Link: Is the interface administratively up? Is it operationally up? Any CRC errors, input errors, output drops? A line protocol down is never a routing issue.
Layer 2 — Data Link: For switched environments — is STP topology correct? Any topology change notifications? Any MAC table anomalies? For MPLS environments — are the LDP or RSVP adjacencies established?
Layer 3 — Network: Are the routes present in the routing table? Is the routing protocol neighbor relationship established? Can you ping the next hop? Can you ping the destination with the correct source interface?
Layer 4 and above: If all lower layers check out — is the issue service-specific? A specific port? A specific application? Is it affecting all users on a site or a subset?
# Systematic L1 through L3 check — run in order, never skip
BorderRouter-SG# show interfaces GigabitEthernet0/0
# Check: line protocol up/down, input errors, CRC, output drops
BorderRouter-SG# show interfaces GigabitEthernet0/0 | include errors|drops
# L2 — if this is a switched segment, check STP
DistSW-SG# show spanning-tree summary
DistSW-SG# show spanning-tree detail | include ieee|occur|from|to
# L3 — routing table check
BorderRouter-SG# show ip route | include 10.42
BorderRouter-SG# show ip bgp summary
# In the State/PfxRcd column an established session shows a prefix count;
# a state name (Idle, Active, Connect) means the session is down
BorderRouter-SG# show ip bgp neighbors 203.0.113.1
# Check: BGP state, last reset reason, notification messages sent/received
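If Layers 1 through 3 all check out, a quick Layer 4 probe can be run from the router itself before pulling in the application team. A sketch, assuming IOS, with the destination port chosen purely as an example:

# L4 — test TCP reachability to a specific service port from the router
BorderRouter-SG# telnet 10.42.0.1 443 /source-interface GigabitEthernet0/0
# "Open" means the TCP handshake completed end to end; a refusal or timeout
# despite working L3 points at a service or filtering problem, not routing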
Real Case: Asia-Pacific MPLS Core Failure
This is the incident I reference most when training engineers on P1 management. The technical resolution was not complex — it was a BGP path manipulation. What made it successful was the speed of the process around the technical steps.
Incident timeline — minute by minute:
T+00:00 — PagerDuty fires at 02:47 AEST. SolarWinds shows 47 nodes transitioning to down state in rapid succession across the APAC region. The pattern — not a single node but a cascade — is the first clue that this is a network path failure, not individual device failures.
T+00:02 — NOC confirms: all APAC sites unreachable, client helpdesks beginning to receive calls. I acknowledge PagerDuty. I call the NOC duty manager directly: "How many sites? Which clients?" Answer: 47 sites, multiple clients including two healthcare accounts.
T+00:05 — Bridge opened. Opening statement: "I'm Emman, I have the bridge. 47 APAC sites are down, two healthcare accounts affected. P1 declared. [Name from NOC], I need show ip bgp summary on BorderRouter-SG and BorderRouter-AU in the next two minutes. [Account team name], draft the initial client notification now — hold for my review."
T+00:08 — First client notification reviewed and sent by account team.
T+00:12 — NOC reports back: the BGP session to the primary ISP (SingTel MPLS cloud) is stuck in Active state on both APAC border routers. The BGP session to the backup ISP path is Established, but our outbound policy applies a 3x AS prepend to it, deliberately making it unattractive. The pattern is now clear: the primary path through the Singapore ISP has failed; a backup path exists but is deprioritized by our own BGP policy.
T+00:18 — I engage ISP TAC directly while Technical Lead continues verification. ISP TAC: "We are investigating a core router issue in our Singapore PoP. BGP sessions to customer edge devices are resetting." Confirmed: the problem is in the ISP's infrastructure, not ours. We cannot fix their core router, but we can redirect our traffic.
T+00:25 — ISP confirms: Singapore PoP core router BGP session reset caused by a hardware fault. Their ETA for full restoration: 45–90 minutes. That puts us past our SLA. Decision point: activate our own multipath failover now rather than wait for ISP restoration.
T+00:28 — Decision made. Technical Lead applies BGP prepend removal on both APAC border routers to make the alternate path (via Tokyo PoP) preferred. Commands applied on BorderRouter-SG and BorderRouter-AU:
# T+00:28 — Apply multipath failover via BGP prepend removal
# Removing 3x AS prepend makes our alternate ISP path attractive again
BorderRouter-SG# configure terminal
BorderRouter-SG(config)# route-map APAC-PRIMARY-OUT permit 10
BorderRouter-SG(config-route-map)# no set as-path prepend 65001 65001 65001
BorderRouter-SG(config-route-map)# end
# Soft reset — push updated outbound policy without dropping the BGP session
BorderRouter-SG# clear ip bgp 203.0.113.1 soft out
# Verify BGP convergence — next-hop should now point to Tokyo PoP
BorderRouter-SG# show ip bgp 10.42.0.0/16
# Confirm route is present and best path is via alternate ISP
BorderRouter-SG# show ip route 10.42.0.0 255.255.0.0
# Verify traffic is actually flowing — check throughput on the alternate uplink
BorderRouter-SG# show interfaces GigabitEthernet0/1 | include rate
# Same commands repeated on AU-BorderRouter-01
BorderRouter-AU# configure terminal
BorderRouter-AU(config)# route-map APAC-PRIMARY-OUT permit 10
BorderRouter-AU(config-route-map)# no set as-path prepend 65001 65001 65001
BorderRouter-AU(config-route-map)# end
BorderRouter-AU# clear ip bgp 203.0.113.5 soft out
T+00:35 — BGP convergence begins. Traffic starts flowing via Tokyo PoP alternate path. NOC watching SolarWinds sees nodes transitioning back to green.
T+00:42 — All 47 sites confirmed restored by NOC. SolarWinds is fully green. NOC verifies by pinging all sites from multiple probe points.
T+00:45 — Restoration notification sent to all affected clients.
T+01:30 — ISP restores Singapore PoP. We restore the original BGP prepend policy to rebalance traffic back to primary path via a second controlled soft reset. Verify traffic distribution across both paths.
Result: 42 minutes total downtime. SLA: 60 minutes. SLA met. The BGP manipulation itself took 7 minutes from decision to convergence confirmed. The other 35 minutes were the systematic process of verifying the alert, diagnosing the layer, confirming with the ISP, and making a good decision.
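The T+01:30 failback is the mirror image of the failover. A sketch of the restore, assuming the same route-map and neighbor addresses as the failover commands:

# T+01:30 — ISP Singapore PoP restored; re-apply the prepend to rebalance
BorderRouter-SG# configure terminal
BorderRouter-SG(config)# route-map APAC-PRIMARY-OUT permit 10
BorderRouter-SG(config-route-map)# set as-path prepend 65001 65001 65001
BorderRouter-SG(config-route-map)# end
# Second controlled soft reset — push the restored outbound policy
BorderRouter-SG# clear ip bgp 203.0.113.1 soft out
# Verify best path has returned to the primary ISP, then repeat on BorderRouter-AU
BorderRouter-SG# show ip bgp 10.42.0.0/16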
Resolution vs Workaround — Know the Difference
This distinction matters enormously, and confusing the two creates follow-on incidents and erodes client trust.
Resolution means the root cause has been fixed and the service is restored on the primary path in its original, intended configuration. The problem is gone. No follow-up technical action required.
Workaround means service has been restored through an alternate means, but the root cause still exists. In the APAC case: removing the BGP prepend and routing traffic via Tokyo was a workaround. Service was restored. But the Singapore PoP was still broken, and our traffic was running on a sub-optimal path on a single link without redundancy.
Always communicate which one you have. The restoration notification must be explicit:
- Resolution: "Service has been fully restored. The root cause has been remediated."
- Workaround: "Service has been restored via alternate routing. A follow-up action is required to implement the permanent fix. Reference ticket INC-2024-0342-FOLLOWUP has been created."
A workaround creates two deliverables: the incident RCA and a separate follow-up ticket (P3 or planned maintenance) to implement the permanent fix. In the APAC case, the follow-up ticket was to implement BFD on all APAC border routers to detect ISP path failure within 1 second instead of the 30-second BGP hold timer — the 30-second detection window was what made the manual failover necessary.
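That follow-up fix can be sketched as IOS configuration. The interface name and neighbor address reuse the examples above, the local AS of 65001 is assumed from the prepend policy, and the timers are chosen for sub-second detection (300 ms tx/rx with multiplier 3 gives 900 ms):

# Follow-up ticket: enable BFD on the ISP-facing uplink
BorderRouter-SG# configure terminal
BorderRouter-SG(config)# interface GigabitEthernet0/0
BorderRouter-SG(config-if)# bfd interval 300 min_rx 300 multiplier 3
BorderRouter-SG(config-if)# exit
# Tie the BGP session to BFD so a path failure tears the session down in under a second
BorderRouter-SG(config)# router bgp 65001
BorderRouter-SG(config-router)# neighbor 203.0.113.1 fall-over bfd
BorderRouter-SG(config-router)# end
# Verify — the BFD session should show state Up
BorderRouter-SG# show bfd neighbors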
The Post-Incident RCA
The RCA (Root Cause Analysis) is not paperwork. It is the mechanism by which the incident generates value beyond the immediate restoration. A good RCA prevents the next incident. A bad RCA documents what happened and files it where nobody reads it.
RCA Structure — Five Sections
1. Incident Summary
One paragraph. What happened, when it started, when it was resolved, total duration, number of users/sites affected, SLA result. This is written for executives and account managers, not engineers.
On 13 March 2024, between 02:47 and 03:29 AEST, a BGP routing failure at the Singapore point-of-presence of our primary MPLS provider caused a connectivity loss affecting 47 sites across the Asia-Pacific region. Service was restored in 42 minutes via activation of an alternate routing path. SLA target of 60 minutes was met.
2. Timeline
Minute-by-minute from first alert to full restoration. Use UTC or a single consistent timezone. Every action, every finding, every communication. This is the scribe's output — which is why the scribe role matters.
3. Root Cause
Specific. Technical. Not "ISP issue." Not "network failure."
Bad: "Root cause: ISP network issue in Singapore."
Good: "Root cause: A hardware fault on a Juniper MX960 core router in SingTel's Singapore Tanjong Pagar PoP caused BGP session resets to all customer edge devices. The fault was triggered by a scheduled firmware update applied at 02:44 AEST that caused an unexpected process restart, resetting all established BGP sessions. ISP case reference: SingTel-INC-20240313-7842."
4. Contributing Factors
What made the impact worse or the detection/response slower? This is where you look at your own systems, not the ISP's.
- BGP hold timer of 30 seconds meant 30 seconds of blackholing before our routers even knew the session was down
- No BFD configured on APAC border routers — BFD would have detected the failure in under 1 second
- Alternate path BGP prepend was in a route-map that required manual removal — no automated failover policy existed
- NOC runbook did not include a specific procedure for ISP path failure — NOC spent 4 minutes looking up the correct escalation contact
5. Action Items
Each action item must be: Specific, Measurable, Assigned to a named person, Time-bound. No exceptions.
What Makes a Good Action Item vs a Bad One
Bad: "Review MPLS configuration."
Who? Review it how? By when? To what end? This action item will still be open in six months because it cannot be completed — it has no measurable outcome.
Good: "Configure BFD on the BGP sessions to primary and secondary ISP peers on all 6 APAC border routers (SG-Border-01, SG-Border-02, AU-Border-01, AU-Border-02, JP-Border-01, JP-Border-02). BFD timers: tx/rx 300ms, multiplier 3 = 900ms detection. Owner: Emman Bracuso. Due: 2024-03-28. Success metric: show bfd neighbors reports all sessions Up; lab failover test confirms BGP reconverges within 2 seconds of forced link failure."
That action item can be completed, verified, and closed. It cannot be forgotten or half-done.
The Blameless Culture
The RCA is a systems analysis, not a performance review. If a human made an error, the question is not "who made the error" but "what process, tooling, or runbook failure allowed that error to go undetected or unmitigated?"
If the NOC engineer typed the wrong BGP neighbor address: what change management process would have caught that? Why is there no config diff review before changes are pushed? Why does the configuration management system not flag unexpected BGP session drops within 30 seconds of a change?
The blameless RCA does not mean accountability-free. The action items have named owners. People are accountable for completing preventive measures. But the root cause analysis focuses on systemic failures, not individual failures. Systemic fixes prevent recurrence. Blaming individuals does not — it just makes people afraid to acknowledge mistakes in future incidents.
P1 Readiness Checklist
The time to build your P1 readiness infrastructure is not during a P1. Everything on this checklist should be verified and tested before you are on call:
- Severity matrix defined and agreed by the whole team
- Client and site list, with urgency profiles, saved offline
- Bridge details, role assignments, and escalation contacts known to everyone on the rotation
- Client notification templates pre-drafted and approved
- ISP TAC contacts and case-opening procedures documented
- Failover paths configured and actually tested under controlled conditions
- NOC runbooks covering likely failure modes, including ISP path failure
- Critical documentation available completely offline
The last item is the one most engineers miss. When the network is down, you cannot reach your wiki. You cannot reach your monitoring system. You cannot reach your change management tool. The documentation you need most in a P1 is the documentation that must exist completely independently of the network you are trying to fix. Keep it on your laptop, in a printed binder if your environment warrants it, and on a USB drive in your on-call bag.
Final Thoughts
The engineers who consistently resolve P1 incidents within SLA are not necessarily the ones with the deepest technical knowledge. They are the ones who have internalized a repeatable process. Technical knowledge tells you what commands to run. Process tells you when to run them, in what order, while simultaneously managing a bridge, communicating with a client, and making escalation decisions under time pressure.
The playbook in this post is what I used across dozens of incidents at Conduent. It is not the only way to manage P1 incidents, but every element of it exists because something went wrong without it. The blameless RCA culture, the IC role separation, the 30-minute client update cadence, the bottom-up triage methodology — all of it comes from real incidents where the absence of that practice extended downtime or damaged a client relationship.
Build your runbooks. Test your failovers. Practice the process before the 3 AM alert fires. The preparation is invisible until the moment it matters — and then it is everything.