Lucent Grid Learning  ·  Defensive Security

Incident Response

A complete practitioner's guide — from building your first IR plan through cloud forensics, post-incident reporting, and programme maturity. Fifteen chapters covering the full lifecycle.

15 chapters
~2.5 hrs total reading
NIST SP 800-61 aligned
SANS PICERL aligned
Chapter 01 ·  ~12 min read  ·  Foundations

What Is Incident Response?

Understanding what IR is, why it exists, and what separates an event from a crisis

Every organisation connected to the internet will eventually face a security incident. The only meaningful variable is whether they are ready when it happens. Incident Response (IR) is the organised, structured approach to preparing for, detecting, containing, and recovering from those moments — and then learning enough from them to be harder to hit next time.

Before we go further, it is worth sitting with one statistic. IBM's 2023 Cost of a Data Breach Report found the global average cost of a data breach was $4.45 million, a figure that climbs to $9.48 million for organisations in the United States and $10.93 million in the healthcare sector. Those numbers are not abstract. They represent legal fees, regulatory fines, customer notification costs, forensic investigations, lost business, and the months of overtime paid to the people cleaning up. Organisations with a mature, regularly tested incident response capability cut those costs by roughly a third. The ROI on IR is not theoretical; it is documented in breach after breach.

Definition

Incident Response is the set of policies, procedures, tools, and trained people that an organisation uses to identify, contain, eradicate, and recover from cybersecurity incidents, while preserving evidence and minimising damage.

Events vs. Incidents vs. Breaches

One of the first things a new SOC analyst learns — sometimes the hard way — is that not everything that fires an alert is an incident, and not every incident is a breach. The distinctions matter because they drive different response actions, different escalation paths, and very different legal obligations.

An event is any observable occurrence in a system or network. Your firewall generates thousands of events per minute. A user logging in is an event. A failed authentication attempt is an event. The word carries no implication of harm.

An adverse event is an event with negative consequences — a file deleted in error, a service crashing, a user being locked out. Still not necessarily a security incident.

A security incident is a violation or imminent threat of violation of computer security policies, acceptable use policies, or standard security practices. Someone brute-forcing your VPN credentials is a security incident. A confirmed malware infection is a security incident. A phishing email whose link a user clicked, or whose attachment a user ran, is a security incident.

A breach is a specific subset of incidents: one in which there is confirmed, unauthorised access to — or acquisition of — sensitive or protected data. Not every incident is a breach, but every breach is an incident. This distinction is critical because breaches trigger mandatory regulatory notification requirements.

Worked Example

A user reports receiving a suspicious email. Your email gateway logs show the message was delivered and opened.

Event: the email was delivered. Adverse event: it bypassed your spam filter. Incident: the embedded link was clicked and the user's browser connected to a known C2 domain. Breach: memory forensics confirms a credential-stealing payload ran and exfiltrated Active Directory credentials to the attacker's infrastructure.

Each step changes your response obligations entirely.
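The escalation in the worked example can be sketched as a small classifier. This is a minimal illustration, not a triage tool: the `Observation` fields and the `classify` function are hypothetical names invented here, and real classification depends on your own policy definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    EVENT = 1
    ADVERSE_EVENT = 2
    INCIDENT = 3
    BREACH = 4


@dataclass
class Observation:
    # Hypothetical evidence flags, chosen only to mirror the worked example.
    negative_consequence: bool = False        # e.g. the spam filter was bypassed
    policy_violation_or_threat: bool = False  # e.g. a connection to a known C2 domain
    confirmed_data_access: bool = False       # e.g. credentials confirmed exfiltrated


def classify(obs: Observation) -> Classification:
    """Return the highest classification the evidence supports."""
    if obs.confirmed_data_access:
        return Classification.BREACH
    if obs.policy_violation_or_threat:
        return Classification.INCIDENT
    if obs.negative_consequence:
        return Classification.ADVERSE_EVENT
    return Classification.EVENT


# The four stages of the phishing example, in order.
delivered = Observation()
bypassed_filter = Observation(negative_consequence=True)
link_clicked = Observation(negative_consequence=True, policy_violation_or_threat=True)
creds_exfiltrated = Observation(True, True, True)

for stage in (delivered, bypassed_filter, link_clicked, creds_exfiltrated):
    print(classify(stage).name)
```

The checks run from most to least severe so that a single confirmed finding always wins over weaker evidence gathered earlier.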

The IR Analyst Role

The incident responder's job exists at the intersection of several disciplines. You need enough network knowledge to read packet captures. Enough endpoint forensics skill to interpret process trees and registry artefacts. Enough log analysis ability to build a timeline from disparate data sources. Enough communication skill to brief an executive under pressure. And enough composure to stay methodical when everyone around you is panicking.

In a mature organisation, incident response sits within — or closely alongside — the Security Operations Centre (SOC). Tier-1 analysts handle initial triage and alert investigation. Tier-2 analysts take ownership of confirmed incidents and begin the containment process. Tier-3 / IR specialists handle complex investigations, threat hunting, and post-incident reporting. In smaller organisations, one person may wear all three hats.

Why IR Programmes Fail

Understanding failure modes is as useful as understanding best practices. The most common reasons IR programmes underperform:

  • Detection gaps — you cannot respond to what you cannot see. Organisations without endpoint visibility, DNS logging, or authentication monitoring are flying blind.
  • Untested plans — an IR plan that has never been exercised will fail under real pressure. People fall back on improvisation, steps get skipped, evidence gets destroyed.
  • Scope creep during an active incident — the tendency to keep investigating rather than containing. Every hour of dwell time while "gathering more information" is an hour the attacker spends establishing persistence.
  • Poor communication — incidents that are declared late because analysts were reluctant to escalate. Executives who are surprised by developments. Legal teams who are not looped in until it's too late to manage a breach notification properly.
  • No post-incident process — closing the ticket and moving on without a lessons-learned meeting. The same intrusion vector gets used again six months later.
Common Mistake

Many organisations treat IR as something to activate when an incident happens. It is actually a continuous programme — preparation, training, tooling maintenance, detection engineering, and tabletop exercises should be running constantly, not dusted off during a crisis.

The IR Lifecycle at a Glance

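Before we go deep on each phase in later chapters, here is the end-to-end flow. This is the NIST model, expanded into the seven steps practitioners commonly work from; NIST itself groups them into four phases, which we will examine in detail in Chapter 2.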

  1. Preparation
  2. Detection
  3. Analysis
  4. Containment
  5. Eradication
  6. Recovery
  7. Lessons Learned

Each phase feeds the next, and the output of Lessons Learned feeds back into Preparation — making every incident an investment in a stronger programme. This cyclical structure is what separates reactive firefighting from a genuine security programme.

Key Takeaways — Chapter 1
  • Incident response is a continuous programme, not an emergency procedure activated during a crisis
  • Events, incidents, and breaches are distinct — each carries different response obligations and escalation requirements
  • The global average breach cost was $4.45M in 2023; mature IR programmes reduce that figure by roughly a third
  • The most common IR failure modes are detection gaps, untested plans, slow escalation, and no post-incident learning process
  • The NIST lifecycle, expanded here into seven steps, provides the organising framework: Preparation through Lessons Learned, cycling continuously
Chapter 02 ·  ~14 min read  ·  Frameworks

Frameworks & Standards

NIST, SANS, ISO 27035, and the regulations that mandate IR programmes

There is no shortage of frameworks claiming to organise incident response. The experienced practitioner learns to treat them as complementary lenses rather than competing dogmas — each was designed for a slightly different context and audience. What matters is that your organisation has internalised some coherent model, practised it until it is instinct, and selected the framework or blend of frameworks that maps cleanly to your regulatory obligations.

NIST SP 800-61: The Industry Standard

NIST Special Publication 800-61, "Computer Security Incident Handling Guide," is the closest thing the industry has to a universal reference. Published by the National Institute of Standards and Technology and currently on Revision 2 (with Revision 3 in draft), it defines the IR lifecycle in four high-level phases. Note that these four phases contain what practitioners often describe as seven steps: Detection & Analysis splits into two steps, and Containment, Eradication & Recovery splits into three.

NIST Phase | What Happens | Key Outputs
Preparation | Building the capability before incidents occur: policies, tools, training, tabletop exercises, detection coverage | IR policy, contact lists, toolkits, runbooks, trained team
Detection & Analysis | Identifying that an incident has occurred and understanding its scope, timeline, and impact | Incident declaration, severity assessment, initial timeline, IOC list
Containment, Eradication & Recovery | Stopping the bleeding, removing the threat, and restoring normal operations | Isolated systems, cleaned endpoints, restored services, monitoring plan
Post-Incident Activity | Learning from the incident to improve the programme | Lessons-learned report, updated playbooks, detection improvements

NIST 800-61 is the framework most commonly referenced in US federal and regulated-industry contexts. If you work in healthcare, finance, or government contracting, NIST is almost certainly the expected baseline.

SANS PICERL

The SANS Institute's model expands NIST's phases into six named stages, making the flow more granular and easier to teach. The acronym is PICERL:

  • Preparation — identical to NIST; building capability before incidents
  • Identification — detecting and confirming the incident (maps to NIST Detection)
  • Containment — limiting the spread and impact (short-term then long-term)
  • Eradication — removing the threat from the environment
  • Recovery — restoring normal operations safely
  • Lessons Learned — post-incident review and programme improvement

PICERL is widely taught in SANS courses (particularly SEC504 and FOR508) and is the framework most commonly referenced in IR job descriptions and SOC playbooks. Its explicit separation of Containment, Eradication, and Recovery makes it particularly useful for writing step-by-step runbooks.

Framework Comparison Note

NIST and SANS PICERL cover the same ground. NIST is the regulatory reference; PICERL is the practitioner's working model. Most mature programmes cite NIST for compliance purposes and use PICERL internally for runbooks and training.
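The correspondence between the two models can be written down as a simple lookup, which is handy when a runbook written in PICERL terms has to be cited against NIST for compliance. This is an illustrative sketch; the dictionary and function names are assumptions made for this example, not part of either framework.

```python
# Each NIST SP 800-61 phase mapped to the PICERL stage(s) it covers.
NIST_TO_PICERL = {
    "Preparation": ["Preparation"],
    "Detection & Analysis": ["Identification"],
    "Containment, Eradication & Recovery": ["Containment", "Eradication", "Recovery"],
    "Post-Incident Activity": ["Lessons Learned"],
}


def nist_phase_for(picerl_stage: str) -> str:
    """Return the NIST phase that contains the given PICERL stage."""
    for nist_phase, stages in NIST_TO_PICERL.items():
        if picerl_stage in stages:
            return nist_phase
    raise KeyError(f"Unknown PICERL stage: {picerl_stage}")


for stage in ("Identification", "Eradication", "Lessons Learned"):
    print(f"{stage} -> {nist_phase_for(stage)}")
```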

ISO/IEC 27035

ISO/IEC 27035 is the international standard for information security incident management, published in three parts. Part 1 covers principles and process. Part 2 covers guidelines for planning and preparation. Part 3 covers operations.

ISO 27035 is relevant primarily to organisations pursuing ISO 27001 certification, multinational organisations operating under European regulatory frameworks, and companies whose customers or partners contractually require ISO certification. Its process model is broadly compatible with NIST but introduces additional governance and documentation requirements — it expects a formal Incident Management Policy, a documented team structure, and evidence of regular testing.

Regulatory Frameworks That Mandate IR

Beyond methodology frameworks, several regulations create legal obligations around how organisations handle incidents. Understanding these is not optional — non-compliance can result in fines that dwarf the cost of the incident itself.

GDPR — General Data Protection Regulation

The European Union's GDPR is arguably the most consequential breach notification regulation in the world, not least because it applies to any organisation that processes the personal data of EU residents — regardless of where the organisation itself is based.

The key IR obligation under GDPR is Article 33: notification to the relevant supervisory authority without undue delay and, where feasible, within 72 hours of becoming aware of a personal data breach, unless the breach is unlikely to result in a risk to individuals' rights and freedoms. A notification made after 72 hours must be accompanied by the reasons for the delay. Notification must include: the nature of the breach, the categories and approximate number of individuals and records affected, a contact point such as the data protection officer, the likely consequences, and the measures taken or proposed to address it.

Article 34 adds a further obligation: if the breach is likely to result in high risk to individuals (identity theft, financial loss, discrimination), those individuals must also be notified "without undue delay." Fines for failing to meet these notification obligations can reach €10 million or 2% of global annual turnover, whichever is higher; the upper GDPR tier of €20 million or 4% applies to infringements of the core processing principles and data subject rights.

Regulatory Requirement — GDPR

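72 hours for notification to the supervisory authority, counted from the moment you become "aware" of the breach. Note that "aware" carries a specific meaning: regulatory guidance (the Article 29 Working Party breach notification guidelines, endorsed by the European Data Protection Board) treats a controller as aware once it has a reasonable degree of certainty that a security incident has compromised personal data, which is not necessarily the moment of initial detection. Your IR plan should specify at exactly what point the clock starts.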

HIPAA — Health Insurance Portability and Accountability Act

HIPAA's Breach Notification Rule requires covered entities (healthcare providers, health plans, healthcare clearinghouses) and their business associates to notify affected individuals, the Department of Health and Human Services (HHS), and in some cases the media, following a breach of unsecured Protected Health Information (PHI).

HIPAA requires notification of affected individuals without unreasonable delay and no later than 60 calendar days from discovery. If the breach affects more than 500 residents of a state or jurisdiction, media notification in that jurisdiction is also required. Breaches affecting 500 or more individuals must be reported to HHS within the same 60-day window; smaller breaches can be logged and reported annually, within 60 days of the end of the calendar year in which they were discovered. The HHS "Wall of Shame" (the public breach portal) is a sobering resource for understanding what actually gets reported.

PCI-DSS — Payment Card Industry Data Security Standard

PCI-DSS is a contractual standard (not a law) imposed by the major card networks. Requirement 12.10 mandates that organisations "implement an incident response plan" that is activated in the event of a system breach, includes defined roles and communication procedures, and is tested at least annually.

The practical IR obligations under PCI-DSS include: immediate notification to your acquiring bank and the relevant card brands upon discovery of a compromise, engagement of a PCI Forensic Investigator (PFI) for breaches involving cardholder data, and detailed forensic evidence preservation. Card brand rules (Visa's CAMS, Mastercard's SDP) impose their own timelines and forensic requirements on top of the standard.

SOC 2

The AICPA's SOC 2 framework is relevant to service providers handling customer data. The Availability and Security trust service criteria both require documented incident management procedures. SOC 2 auditors will expect evidence of: a documented IR policy, defined roles, a log of incidents and how they were handled, and evidence of post-incident review. Unlike GDPR or HIPAA, SOC 2 does not specify a breach notification timeline — but your auditors will scrutinise whether your actual response matched your documented procedures.

Choosing Your Framework

The practical answer for most organisations: adopt NIST SP 800-61 as your compliance reference and SANS PICERL as your operational model. If you are ISO-certified or pursuing certification, layer ISO 27035 governance requirements on top. Then map your specific regulatory obligations (GDPR, HIPAA, PCI-DSS, SOC 2, state breach notification laws) to create a notification timeline document that your legal and comms teams can act on without needing to read the frameworks themselves.
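A notification timeline document ultimately reduces to date arithmetic. The sketch below computes outer deadlines for the two fixed windows discussed in this chapter. It is a simplification under stated assumptions: a single start timestamp is used for both regimes (GDPR counts from awareness, HIPAA from discovery), the names are hypothetical, and the output is no substitute for legal advice on when each clock actually starts.

```python
from datetime import datetime, timedelta, timezone

# Outer limits only; both regimes also require acting without undue delay.
NOTIFICATION_WINDOWS = {
    "GDPR supervisory authority": timedelta(hours=72),
    "HIPAA affected individuals": timedelta(days=60),
}


def notification_deadlines(clock_start: datetime) -> dict:
    """Return the latest permissible notification time for each obligation."""
    return {
        obligation: clock_start + window
        for obligation, window in NOTIFICATION_WINDOWS.items()
    }


# Hypothetical incident: breach confirmed at 09:30 UTC on 1 March 2024.
clock_start = datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)
deadlines = notification_deadlines(clock_start)

for obligation, deadline in deadlines.items():
    print(f"{obligation}: notify by {deadline.isoformat()}")
```

Storing the timestamps in UTC avoids ambiguity when the IR team, legal counsel, and the regulator sit in different time zones.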

Key Takeaways — Chapter 2
  • NIST SP 800-61 (4 phases) and SANS PICERL (6 phases) cover the same lifecycle — NIST for compliance references, PICERL for runbooks
  • ISO/IEC 27035 adds governance structure relevant to ISO 27001 certification and European-regulated contexts
  • GDPR requires supervisory authority notification within 72 hours of becoming aware of a personal data breach
  • HIPAA allows 60 days from discovery — but media notification is required for large breaches
  • PCI-DSS mandates annual IR plan testing and immediate bank/card brand notification upon cardholder data compromise
Chapter 03 ·  ~13 min read  ·  Planning

Building the IR Plan

What belongs in a formal plan, how to define severity levels, and how to declare a major incident

An incident response plan is a living document, not a filing cabinet artefact. Organisations that produce an IR plan, put it in a binder, and leave it on a shelf until the auditor asks for it are not better prepared than organisations with no plan. They are arguably worse, because they have false confidence. A real IR plan is reviewed quarterly, tested twice a year, updated after every significant incident, and short enough to be usable under pressure.

Policy vs. Plan vs. Playbook vs. Runbook

These four terms are used interchangeably by people who haven't thought carefully about what each one is for. The confusion costs time during real incidents when people aren't sure which document governs what.

Document Type | Audience | What It Contains | How Often Updated
IR Policy | Organisation-wide | Why IR exists, who is responsible, what constitutes an incident, high-level obligations | Annually / on major change
IR Plan | IR team and management | Roles and responsibilities, escalation paths, communication procedures, activation criteria, legal/regulatory obligations | Twice a year / after major incidents
Playbook | Analysts and IR leads | Incident-type-specific procedures (phishing playbook, ransomware playbook, etc.): what to do in what order | After every incident of that type
Runbook | Tier-1 / Tier-2 analysts | Step-by-step technical procedures: exact commands, tool invocations, screenshots. Designed to be followed without requiring judgment | On tool changes / quarterly checks

The hierarchy matters: the policy sets the mandate, the plan organises the response, playbooks guide incident-type handling, and runbooks handle the technical execution. An analyst working a ransomware incident at 3 AM should be able to follow the ransomware playbook without needing to re-read the policy or plan.

Stakeholder Roles and Responsibilities

One of the most important things an IR plan establishes is exactly who does what — before the incident occurs. Discovering during an active compromise that nobody is sure who has authority to isolate production systems, or who calls the CEO, costs precious time.

  • Incident Commander (IC) — overall incident coordinator. Owns the decision to declare, escalate, and close the incident. In smaller organisations this is typically the CISO or senior IR lead; in larger ones it may be a dedicated role. The IC does not do technical analysis — they coordinate.
  • Lead Analyst / Technical Lead — owns the technical investigation. Directs the analysis, coordinates with external forensics if needed, and reports findings to the IC.
  • Scribe / Documentation Lead — maintains the real-time incident timeline. Every action taken, every decision made, every finding documented with a timestamp. This person is often undervalued and always critical — their notes become the post-incident report and the legal record.
  • Communications Lead — manages all internal and external communications. Coordinates with PR, legal, and the executive team on external statements. Drafts customer notification letters. The comms lead ensures nothing leaves the organisation without approval.
  • Legal / Compliance — advises on regulatory notification obligations, manages privilege considerations (attorney-client privilege can protect some IR-related communications from discovery), and coordinates with law enforcement if needed.
  • CISO / Executive Sponsor — senior escalation point. Receives regular updates from the IC, approves major containment decisions (taking down production systems, notifying customers), and coordinates with board-level stakeholders if required.
Practice Note

Many organisations make the mistake of assigning IR roles to individuals by name rather than by role. When the named individual is on holiday or leaves the company, the plan is broken. Assign roles to positions, not people, and maintain a secondary contact for every role.
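One way to keep this honest is to check the roster automatically whenever the plan is reviewed. The sketch below is a minimal example under assumptions: the role names follow the list above, while the `roster` structure, the position titles, and the `roster_gaps` function are hypothetical.

```python
REQUIRED_ROLES = [
    "Incident Commander",
    "Lead Analyst",
    "Scribe",
    "Communications Lead",
    "Legal / Compliance",
    "Executive Sponsor",
]

# Hypothetical roster: each IR role maps to positions, never to named people.
roster = {
    "Incident Commander": {"primary": "Head of Security Operations", "secondary": "Senior IR Lead"},
    "Lead Analyst": {"primary": "Tier-3 IR Specialist", "secondary": "Senior Tier-2 Analyst"},
    "Scribe": {"primary": "SOC Shift Coordinator", "secondary": ""},
    "Communications Lead": {"primary": "Head of Communications", "secondary": "PR Manager"},
    "Executive Sponsor": {"primary": "CISO", "secondary": "CIO"},
}


def roster_gaps(roster: dict) -> list:
    """List every required role that lacks a primary or secondary contact."""
    gaps = []
    for role in REQUIRED_ROLES:
        entry = roster.get(role, {})
        for slot in ("primary", "secondary"):
            if not entry.get(slot):
                gaps.append(f"{role}: no {slot} contact")
    return gaps


for gap in roster_gaps(roster):
    print(gap)
```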

Escalation Matrix

The escalation matrix answers a simple but critical question: at what point does this incident involve the next tier of the organisation? A good escalation matrix is triggered by objective criteria — severity level, data classification of affected systems, regulatory exposure, potential for media coverage — rather than by subjective analyst judgment.

A typical escalation matrix maps severity levels to response actions:

Severity | Criteria | Initial Response | Escalation | Response Time
SEV-1 / P1 | Active breach, data exfiltration in progress, ransomware, business-critical system down | Full IR team activation, Incident Commander engaged | CISO, Legal, Executive team within 30 min | Immediate
SEV-2 / P2 | Confirmed compromise of non-critical systems, significant malware infection, account takeover of privileged account | Lead Analyst engaged, investigation opened | CISO within 2 hours | < 1 hour
SEV-3 / P3 | Suspicious activity under investigation, policy violation, failed attack with no confirmed compromise | Tier-2 analyst investigation | IR Lead if confirmed as incident | < 4 hours
SEV-4 / P4 | Low-risk policy violation, informational alert, resolved automatically | Tier-1 triage, close or escalate | Tier-2 if evidence of compromise | < 24 hours
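Encoding the matrix as data lets ticketing or paging tools look up the expected response instead of relying on memory. The sketch below copies the values from the table above; the `EscalationRule` class and `rule_for` function are illustrative names, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    initial_response: str
    escalation: str
    response_time: str


ESCALATION_MATRIX = {
    "SEV-1": EscalationRule(
        "Full IR team activation, Incident Commander engaged",
        "CISO, Legal, Executive team within 30 min",
        "Immediate",
    ),
    "SEV-2": EscalationRule(
        "Lead Analyst engaged, investigation opened",
        "CISO within 2 hours",
        "< 1 hour",
    ),
    "SEV-3": EscalationRule(
        "Tier-2 analyst investigation",
        "IR Lead if confirmed as incident",
        "< 4 hours",
    ),
    "SEV-4": EscalationRule(
        "Tier-1 triage, close or escalate",
        "Tier-2 if evidence of compromise",
        "< 24 hours",
    ),
}


def rule_for(severity: str) -> EscalationRule:
    """Look up the escalation rule for a severity label such as 'sev-2'."""
    return ESCALATION_MATRIX[severity.strip().upper()]


rule = rule_for("sev-2")
print(rule.initial_response)
print(rule.escalation)
print(rule.response_time)
```

An unknown severity raises `KeyError` on purpose: a ticket without a valid severity should fail loudly, not default silently to a low tier.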

Declaring a Major Incident

The major incident declaration is a formal act with real consequences — it activates the full IR team, triggers regulatory notification clocks, and changes how the organisation communicates. It should not happen casually, but it also should not be delayed out of reluctance to escalate. Most IR failures involve declaring too late.

Your plan should define explicit, objective criteria for declaring a major incident. Common triggers include:

  • Confirmed exfiltration of regulated data (PII, PHI, cardholder data)
  • Ransomware deployment on production systems
  • Confirmed compromise of a privileged account (domain admin, cloud root)
  • Evidence of persistent access — backdoor, web shell, or scheduled task installed by an attacker
  • Compromise of any system in scope for PCI-DSS or HIPAA
  • Attacker lateral movement observed across two or more systems
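Because the triggers above are objective, they can be expressed as a checklist that returns the reasons for a declaration and not just a yes or no. The following is a minimal sketch; the finding keys and function names are hypothetical and would need to match however your team records investigation findings.

```python
# Hypothetical finding keys mapped to the declaration triggers listed above.
BOOLEAN_TRIGGERS = {
    "regulated_data_exfiltrated": "Confirmed exfiltration of regulated data",
    "ransomware_on_production": "Ransomware deployment on production systems",
    "privileged_account_compromised": "Confirmed compromise of a privileged account",
    "persistence_found": "Evidence of persistent access",
    "pci_or_hipaa_system_compromised": "Compromise of a system in scope for PCI-DSS or HIPAA",
}


def declaration_reasons(findings: dict, laterally_affected_systems: int = 0) -> list:
    """Return every trigger that the current findings satisfy."""
    reasons = [label for key, label in BOOLEAN_TRIGGERS.items() if findings.get(key)]
    if laterally_affected_systems >= 2:
        reasons.append("Attacker lateral movement across two or more systems")
    return reasons


def is_major_incident(findings: dict, laterally_affected_systems: int = 0) -> bool:
    """A single satisfied trigger is enough to declare."""
    return bool(declaration_reasons(findings, laterally_affected_systems))


findings = {"privileged_account_compromised": True, "persistence_found": False}
for reason in declaration_reasons(findings, laterally_affected_systems=3):
    print(reason)
```

Returning the list of reasons gives the Scribe something concrete to record at the moment of declaration.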

Once declared, the Incident Commander takes formal control. All communications are routed through the IC and the Communications Lead. The Scribe begins the formal timeline. Legal is notified. No external statements — to media, customers, regulators, or even colleagues outside the core IR team — go out without Communications Lead sign-off.

What Goes in the IR Plan Document

A practical IR plan is rarely longer than 20–30 pages. It does not need to be a comprehensive textbook — that's what playbooks and runbooks are for. The plan covers:

  1. Purpose, scope, and policy alignment
  2. Definitions (incident, breach, severity levels)
  3. Roles and responsibilities (by position, not name)
  4. Contact list (internal and external — forensics retainer, legal counsel, law enforcement liaison, cyber insurance carrier, key regulators)
  5. Escalation matrix and major incident declaration criteria
  6. Communication procedures (internal, customer, media, regulatory)
  7. Evidence handling and preservation requirements
  8. Regulatory notification obligations and timelines
  9. Plan review and testing schedule
  10. References to playbooks and runbooks
Real-World Note

During a real incident, your primary IR plan document will likely not be consulted — your team will be working from memory, from playbooks, and from on-call coordination tools. The value of writing the plan is not the document itself: it is the thinking that went into it, the conversations it forced between stakeholders, and the team's familiarity with the structure from having practised it.

Key Takeaways — Chapter 3
  • Policy, plan, playbook, and runbook are four distinct documents serving four distinct audiences — conflating them causes confusion during real incidents
  • Assign IR roles to positions, not individuals — and maintain secondary contacts for every role
  • The escalation matrix should use objective, pre-defined criteria rather than analyst judgment
  • Major incident declaration triggers regulatory notification clocks and must have explicit, documented criteria
  • A usable IR plan is 20–30 pages maximum — the detail belongs in playbooks and runbooks
Chapter 04 ·  ~11 min read  ·  Team Structure

The IR Team

CSIRT vs SOC, internal vs retainer, on-call rotations, and how real teams are structured

The people who respond to incidents are the most critical element of any IR programme — more important than any tool, any platform, or any policy document. A skilled, well-practised team with mediocre tools will outperform a poorly trained team with a world-class SIEM every time. This chapter covers how IR teams are structured, how to decide between building internal capability versus retaining external IR specialists, and how to operate an on-call rotation that doesn't burn people out.

CSIRT vs SOC

These terms are often used interchangeably but they describe different things. A Security Operations Centre (SOC) is the broader function responsible for continuous monitoring, alert triage, and day-to-day security operations. It is an operational team that runs 24/7 and handles the full volume of security events — most of which will never escalate to an incident.

A Computer Security Incident Response Team (CSIRT) is a specialised group that activates when an incident is declared. In some organisations the CSIRT is a dedicated team separate from the SOC. In many organisations — especially those without the headcount for a fully separate function — the senior members of the SOC also serve as the CSIRT. In small organisations, a single team wears all hats.

There is also the CERT (Computer Emergency Response Team), a term with a complicated history. The original CERT/CC was established at Carnegie Mellon in 1988 following the Morris Worm. In common use today, CERT and CSIRT are functionally synonymous, though some national-level bodies retain the CERT name (US-CERT did so until its functions were absorbed into CISA).

Internal vs External IR Capability

Few organisations have the headcount and budget to staff a fully self-sufficient IR capability. Most operate on a hybrid model — maintaining internal analysts for day-to-day operations while retaining an external IR firm for complex investigations, major incident surge capacity, and specialist skills (malware reverse engineering, mobile forensics, critical infrastructure response).

The decision matrix looks roughly like this:

Capability | Internal Team | External Retainer
Availability | Always available, knows environment | Activation delay (hours); requires on-boarding
Environment knowledge | Deep: knows the architecture, the users, the quirks | Zero at engagement start; builds during investigation
Specialist skills | Limited by team size and budget | Access to deep specialists (exploit analysis, threat intel)
Surge capacity | Limited: same headcount for a 3-system incident and a 300-system incident | Can deploy teams of 10–30 for large-scale incidents
Independence | May face internal political pressure on findings | Objective: no internal allegiances
Cost | Fixed (salaries, tools) | Variable: retainers typically $50K–$200K/year, time-and-materials for activations

The major external IR firms include Mandiant (now part of Google), CrowdStrike Services, Secureworks CTU, Palo Alto Unit 42, and KPMG Cyber. Most offer a pre-negotiated retainer model where you pay an annual fee in exchange for guaranteed response SLAs and pre-positioned resources. The retainer also typically includes some pre-incident advisory hours that can be used for tabletop exercises and IR plan reviews.

Important

If you plan to use an external IR firm, do not wait until an incident to negotiate the contract. Legal review, MSA signing, NDA execution, and pre-positioning tasks (environment documentation, getting their tools on your jump hosts) take weeks. A retainer that hasn't been executed before you need it is not a retainer.

Team Tiers

In a SOC with IR capability, analysts are typically organised into tiers based on skill level and the complexity of work they handle:

  • Tier 1 — Alert Analyst: monitors the SIEM queue, performs initial triage on alerts, closes false positives, escalates true positives. Entry-level role. Key skills: SIEM navigation, alert classification, basic log reading.
  • Tier 2 — Incident Analyst: takes ownership of escalated incidents, performs deeper investigation, begins containment actions, documents findings. Mid-level role. Key skills: endpoint forensics, log analysis, network traffic analysis, threat intelligence lookup.
  • Tier 3 / IR Specialist: handles the most complex investigations, performs malware analysis, leads major incident responses, develops detection content, runs threat hunts. Senior role. Key skills: advanced forensics, reverse engineering, threat hunting, attacker TTPs.
  • Threat Intelligence Analyst: often separate from the tiered structure. Provides IOCs, TTP context, and adversary profiling to support active investigations and proactive detection improvements.
  • Detection Engineer: builds and maintains the SIEM rules, EDR detections, and correlation logic that the Tier 1 analysts work from. Bridges the gap between the IR team's operational findings and the technical controls that prevent recurrence.

The On-Call Rotation

Security incidents do not respect business hours. A mature IR programme operates with 24/7 coverage — whether through shift-based SOC staffing, an on-call rotation, a managed security service provider (MSSP), or some combination.

Running an on-call rotation sustainably requires careful design. Common mistakes:

  • Single-person on-call — when a SEV-1 fires at 2 AM and requires full team activation, a single on-call analyst cannot manage it alone. Every on-call rotation should have a primary and a secondary.
  • No escalation path — the on-call analyst needs to know exactly who to call and how for decisions above their authority level.
  • Pager fatigue — if your on-call rotation triggers every night, your detection content is miscalibrated and your people will burn out. Alert volume is a detection engineering problem, not an analyst endurance problem.
  • No incident commander on-call: if only a Tier-1 analyst is on call and there is no path to activating an IC for a major incident, SEV-1s get handled as SEV-3s until business hours.
On-Call Structure — Example

Primary on-call: Tier-2 analyst. Handles initial response, can action containment decisions up to SEV-3 without escalation.

Secondary on-call: Tier-3 / IR lead. Escalation point for complex investigations or SEV-2+. Can be reached within 15 minutes.

IC on-call: CISO or senior IR manager. Required for SEV-1 declaration, external communications, major containment decisions. 30-minute escalation window.
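The example structure translates directly into a paging rule. The sketch below assumes numeric severities where 1 is the most severe; the `who_to_page` function is a hypothetical helper, and real paging would go through whatever on-call tool your team uses.

```python
def who_to_page(severity: int) -> list:
    """Return the on-call roles to page for a severity from 1 (worst) to 4."""
    if severity not in (1, 2, 3, 4):
        raise ValueError(f"Unknown severity: {severity}")

    # The primary is always paged and can act alone for SEV-3 and SEV-4.
    page = ["Primary on-call (Tier-2 analyst)"]
    if severity <= 2:
        # SEV-2 and above go to the secondary, reachable within 15 minutes.
        page.append("Secondary on-call (Tier-3 / IR lead)")
    if severity == 1:
        # Only a SEV-1 requires the IC, within a 30-minute window.
        page.append("IC on-call (CISO or senior IR manager)")
    return page


for sev in (1, 2, 3):
    print(f"SEV-{sev}: {', '.join(who_to_page(sev))}")
```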

Key Takeaways — Chapter 4
  • SOC and CSIRT are distinct — the SOC handles ongoing monitoring; the CSIRT activates for declared incidents
  • Hybrid models (internal team + external retainer) are the practical norm; negotiate the retainer before you need it
  • Tier-1/2/3 structures distribute workload by complexity — Tier-3 specialists handle the investigations that would overwhelm alert-queue analysts
  • On-call rotations need primary, secondary, and IC coverage — single-person on-call cannot handle major incidents
  • High on-call alert volume is a detection engineering problem, not an analyst endurance problem
Chapter 05 ·  ~15 min read  ·  Preparation

Preparation Phase

Asset inventory, log coverage, toolkits, and tabletop exercises — building capability before the incident

Preparation is the phase that determines whether every other phase goes smoothly or catastrophically. It is also the phase most frequently underinvested — not because security leaders don't understand its importance, but because it competes for budget against reactive controls that are easier to justify and faster to procure. Preparation is invisible when it works.

A useful mental model: every hour spent on preparation eliminates roughly ten hours of chaos during an active incident. The organisations that contain ransomware in four hours instead of four weeks didn't get lucky — they did the preparation work when there was no crisis to focus the mind.

Asset Inventory

You cannot protect what you don't know you have, and you cannot scope an incident without knowing your environment. A comprehensive, current asset inventory is the foundation of effective IR. It should contain, at minimum:

  • All endpoint devices — workstations, laptops, servers — with OS version, owner, location, and criticality classification
  • Network infrastructure — firewalls, switches, routers, VPN concentrators, wireless access points
  • Cloud assets — AWS accounts, Azure subscriptions, GCP projects, SaaS platforms — with ownership and data classification
  • Third-party integrations — the suppliers and partners with access to your systems or data
  • Data flows — where sensitive data lives, how it moves, and what systems process it

Asset inventory is not a one-time project. It requires continuous maintenance, ideally automated through a CMDB or an asset discovery tool such as runZero (formerly Rumble) or Lansweeper. Shadow IT discovery should run periodically; assets that appear in incident investigations but aren't in your inventory are a persistent source of IR failures.
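
One simple way to surface shadow IT is to compare the hostnames seen in telemetry against the inventory. This is a sketch only, assuming both sources can be reduced to sets of hostnames; `find_unknown_assets` is a hypothetical helper, not a feature of any CMDB.

```python
def _normalise(hostname: str) -> str:
    """Hostnames are compared case-insensitively and without stray whitespace."""
    return hostname.strip().lower()


def find_unknown_assets(inventory: set[str], observed: set[str]) -> set[str]:
    """Return hosts seen in logs or investigations that the inventory does not list."""
    known = {_normalise(h) for h in inventory}
    return {_normalise(h) for h in observed} - known
```

Running a comparison like this on a schedule turns inventory gaps into a routine finding rather than a surprise mid-incident.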

Log Coverage — What You Need to Be Collecting

Log coverage is perhaps the single most important preparatory investment. An incident you can investigate thoroughly is recoverable. An incident you cannot reconstruct because the relevant logs don't exist — or weren't retained long enough — can leave you unable to determine scope, root cause, or dwell time.

Endpoint Logging

Windows endpoints should be generating and forwarding Windows event logs with a tuned audit policy. At minimum: logon events (4624, 4625), process creation (4688, with command-line auditing enabled), scheduled task creation (4698), service installation (7045, which is recorded in the System log rather than the Security log), and PowerShell logging (Script Block Logging, Module Logging, Transcription). Kerberos events (4768, 4769) are recorded on domain controllers and belong with the authentication sources described under Authentication Logging. Sysmon, configured with a community ruleset such as SwiftOnSecurity's Sysmon config, dramatically expands visibility at minimal performance cost.
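
A quick coverage check is to confirm that each baseline event ID has actually arrived in the SIEM from a given host. The sketch below assumes you can export the list of event IDs seen for that host; the `missing_event_ids` helper and the short descriptions are illustrative.

```python
# Endpoint event IDs named in the text. 7045 is written to the System log;
# the others are written to the Security log.
BASELINE_EVENT_IDS = {
    4624: "successful logon",
    4625: "failed logon",
    4688: "process creation",
    4698: "scheduled task created",
    7045: "service installed",
}


def missing_event_ids(observed_ids) -> dict[int, str]:
    """Return baseline event IDs that were never observed for a host."""
    seen = set(observed_ids)
    return {eid: name for eid, name in BASELINE_EVENT_IDS.items() if eid not in seen}
```

An ID that never appears can mean the audit policy is not applied or the forwarder is dropping a channel, so it is worth checking both.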

Linux endpoints should be generating auditd logs covering authentication, privilege escalation, command execution, and file access events. These should be forwarded to the SIEM in real time — local logs on a compromised system cannot be trusted.

Network Logging

NetFlow or sFlow data from your core network infrastructure provides traffic metadata (source, destination, port, protocol, byte counts) without the storage overhead of full packet capture. This is essential for lateral movement detection. DNS query logs (every query from every endpoint) are high-value; DNS tunnelling, C2 beaconing, and data exfiltration via DNS are all invisible without them. Proxy logs for HTTP/HTTPS traffic, including SNI fields from TLS connections, provide another visibility layer.
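
To illustrate why DNS query logs matter, here is a rough heuristic for flagging query names that look like tunnelling: unusually long labels, or long labels with high character entropy. The thresholds are assumptions chosen for illustration and would need tuning against your own traffic; this is a sketch, not a production detector.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def looks_like_tunnelling(qname: str, max_label_len: int = 40,
                          min_len_for_entropy: int = 16,
                          entropy_threshold: float = 3.5) -> bool:
    """Flag a DNS query name whose longest label is very long or looks random."""
    labels = [label for label in qname.rstrip(".").split(".") if label]
    if not labels:
        return False
    longest = max(labels, key=len)
    if len(longest) > max_label_len:
        return True
    return (len(longest) >= min_len_for_entropy
            and shannon_entropy(longest) > entropy_threshold)
```

Content delivery networks and some security products also generate long, random-looking names, so expect to maintain an allowlist alongside any rule like this.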

Authentication Logging

Active Directory domain controller security logs (4768, 4769, 4771, 4776), Azure AD (now Microsoft Entra ID) sign-in logs, VPN authentication logs, and MFA logs should all flow to your SIEM. Authentication anomalies, such as logins from unusual locations or outside business hours, or bursts of Kerberos service ticket requests that suggest Kerberoasting, are high-signal detections that require complete auth log coverage to work.

Cloud Logging

On AWS: CloudTrail for API activity, Amazon GuardDuty for threat detection, and VPC Flow Logs for network traffic. On Azure: Azure Monitor and the Azure Activity Log. On GCP: Cloud Audit Logs. Add S3 access logs and Azure Blob access logs if you store sensitive data there. Cloud logs are frequently the first place evidence of cloud-targeted attacks appears, and they are frequently not enabled by default. Verify your cloud logging configuration against the relevant CIS Benchmark.

Log Retention

Logs you can't query are as useful as logs you never collected. Twelve months of retention is a widely used baseline; PCI-DSS requires 12 months with 3 months immediately available. Many breaches have dwell times exceeding 90 days, and if your retention window is shorter than the attacker's presence, your root cause analysis will be incomplete. Plan the tiering deliberately: hot storage (queryable in the SIEM) for recent logs and cold storage (lower-cost archive) for the remainder.
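
The relationship between retention and dwell time is simple date arithmetic. As a minimal sketch (the function name and parameters are illustrative), the following computes how many days of attacker activity fall outside the retention window:

```python
from datetime import date, timedelta


def retention_gap_days(retention_days: int, first_access: date, detected: date) -> int:
    """Days of attacker activity older than the oldest retained log."""
    oldest_available = detected - timedelta(days=retention_days)
    return max((oldest_available - first_access).days, 0)
```

For example, with 90 days of retention, initial access on 1 January 2024, and detection on 1 May 2024, the first 31 days of activity cannot be reconstructed from logs. With 365 days of retention the gap is zero.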

EDR Deployment

An Endpoint Detection and Response (EDR) platform is the most operationally impactful investment most organisations make. It provides real-time process visibility, network connection monitoring, file activity, persistence mechanism detection, and — critically — remote response capability: the ability to isolate a compromised host, collect forensic artefacts, and kill malicious processes without physical access.

Major EDR platforms include CrowdStrike Falcon, SentinelOne, Microsoft Defender for Endpoint, and Carbon Black. Selection criteria should include: detection coverage against ATT&CK techniques, response capability (isolation, live response shell), integration with your SIEM, and management overhead. EDR without a team to action its alerts is expensive noise.

IR Toolkit

Your IR team should maintain a pre-built response toolkit — tested, documented, and available on a jump host or hardened USB — so that when an incident begins, tooling acquisition is not the first task. Key components:

IR Toolkit — Core Components

Memory acquisition: WinPmem and DumpIt (Windows), LiME (Linux Memory Extractor)

Disk imaging: FTK Imager (Windows), dd, dcfldd (Linux/Mac)

Endpoint triage: KAPE (Kroll Artifact Parser and Extractor) — Windows artefact collection; Velociraptor — enterprise-scale remote triage and hunting

Memory analysis: Volatility 3 — process analysis, network connections, malware detection from RAM dumps

Network analysis: Wireshark, tshark, NetworkMiner, Zeek

Log analysis: Chainsaw (Windows event log hunting), Hayabusa, Eric Zimmerman's tools (Timeline Explorer, MFTECmd, etc.)

Malware analysis: Any.run, Cuckoo Sandbox (self-hosted), FLARE VM for static analysis

IOC enrichment: VirusTotal API, Shodan, AbuseIPDB, MalwareBazaar

Tabletop Exercises

A tabletop exercise is a facilitated discussion-based simulation of an incident scenario. No systems are actually compromised. The IR team, executives, legal, and communications staff walk through a scenario together — who does what, when, and how — and the facilitator probes for gaps, conflicts, and assumptions.

Tabletops should be run at least twice a year. Effective scenarios for most organisations: ransomware outbreak (the highest-urgency scenario for most IR teams), BEC and wire fraud (high financial impact, often not handled by the technical IR team alone), data exfiltration by an insider, third-party supplier breach, and cloud account compromise. The scenario should be based on actual threat intelligence relevant to your industry sector.

Purple team exercises take this further — they involve a red team actually executing attack techniques in a controlled environment while the blue team detects and responds in real time. Purple team exercises reveal detection and response gaps that tabletops cannot surface because they require real log generation and real tool usage.

Key Takeaways — Chapter 5
  • A current, complete asset inventory is a prerequisite for scoping any incident — shadow IT is a persistent IR blind spot
  • Log coverage across endpoints (Sysmon + Security Events), network (DNS, NetFlow, proxy), authentication (AD, Azure AD), and cloud (CloudTrail, Azure Monitor) is essential — missing any layer creates investigation gaps
  • Log retention should be a minimum of 12 months; many breaches have dwell times exceeding 90 days
  • EDR provides real-time endpoint visibility and remote response capability — it is the most operationally impactful investment most organisations make
  • Tabletop exercises should run at least twice a year; purple team exercises reveal detection gaps tabletops cannot
Chapter 06 ·  ~14 min read  ·  Detection

Detection & Analysis

Alert triage, IOC vs TTP detection, ATT&CK mapping, OSINT enrichment, and building the incident timeline

Detection is where the gap between theoretical security and operational security becomes most visible. An organisation can have world-class preventive controls and still miss an attacker who has been living in their environment for weeks — because detection requires not just log data, but the queries, correlations, and analytical skills to surface meaningful signals from the noise. This chapter covers the mechanics of detection and the analytical discipline required to turn an alert into an actionable incident.

The Alert Triage Funnel

A mature SOC receives thousands of alerts per day. The vast majority are false positives — legitimate activity that triggered a detection rule. The triage funnel is the process by which those alerts are evaluated, categorised, and either closed or escalated.

  1. Alert generated: the SIEM fires a detection rule, the EDR detects a behavioural pattern, or a threat intelligence platform (TIP) surfaces a new IOC match
  2. Initial classification: a Tier-1 analyst reviews the alert. Is it a known false positive? Is there prior context? How reliable is the data source?
  3. Triage investigation: the analyst gathers additional context, such as parent process, user account, network connections, recent logon history, and endpoint reputation
  4. Disposition: the alert is closed (false positive / benign), escalated (suspected incident, handed to Tier-2), or left pending (requires more data)
  5. Incident declaration: if escalated and confirmed, the incident is formally declared and enters the PICERL lifecycle

The quality of this funnel depends entirely on alert tuning. Analysts who are drowning in false positives cannot maintain the investigative depth required to catch real threats. Detection engineers must continuously measure false positive rates and tune rules to reduce noise without sacrificing coverage.
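
Measuring false positive rates per rule can be as simple as counting dispositions. The sketch below assumes alert dispositions can be exported as (rule name, disposition) pairs; the disposition labels are hypothetical and should be mapped to whatever your case management tool uses.

```python
from collections import defaultdict


def false_positive_rates(dispositions) -> dict[str, float]:
    """Return the false positive rate per rule, ignoring alerts still pending."""
    resolved = defaultdict(int)
    false_positives = defaultdict(int)
    for rule, disposition in dispositions:
        if disposition == "pending":
            continue  # no verdict yet, so it cannot count either way
        resolved[rule] += 1
        if disposition == "false_positive":
            false_positives[rule] += 1
    return {rule: false_positives[rule] / resolved[rule] for rule in resolved}
```

Sorting the result by rate, weighted by alert volume, gives detection engineers a prioritised tuning list.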

IOC-Based vs TTP-Based Detection

There are two fundamentally different approaches to detection, and understanding the tradeoffs between them is essential for a mature IR programme.

IOC-based detection matches specific artefacts: IP addresses, domain names, file hashes, email addresses. It is fast, precise, and produces few false positives when the IOC is reliable. Its critical weakness is that IOCs are ephemeral: attackers rotate IP addresses and domains routinely, and a file hash changes the moment the attacker recompiles their tool. IOC-based detection is reactive by nature; by the time you have an IOC to block, the attacker has already used it.

TTP-based detection — Tactics, Techniques, and Procedures — detects attacker behaviour rather than specific artefacts. A rule that fires when a PowerShell process spawns a network connection to a rare external IP is a TTP-based detection. It catches novel malware using well-known techniques even if the specific tool has never been seen before. TTP-based detections require more tuning (legitimate admin activity can look like attacker behaviour) but are far more durable.
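
The PowerShell example can be sketched as a rule over network connection events. This assumes a hypothetical event schema with `image` (process path) and `dest_ip` keys, and treats an IP as rare if it appears at most `max_sightings` times in the batch; a real rule would use a longer baseline window.

```python
import ipaddress
from collections import Counter

POWERSHELL_IMAGES = {"powershell.exe", "pwsh.exe"}


def rare_powershell_connections(events, max_sightings: int = 2) -> list[dict]:
    """Return events where PowerShell connected to a rarely seen public IP."""
    sightings = Counter(e["dest_ip"] for e in events)
    hits = []
    for event in events:
        # Reduce a full Windows or POSIX path to the lower-cased file name.
        image = event["image"].replace("\\", "/").rsplit("/", 1)[-1].lower()
        dest = ipaddress.ip_address(event["dest_ip"])
        if (image in POWERSHELL_IMAGES
                and dest.is_global  # skip private, loopback, and reserved ranges
                and sightings[event["dest_ip"]] <= max_sightings):
            hits.append(event)
    return hits
```

Nothing in the rule names a specific tool, hash, or domain, which is what makes it durable. The tuning burden falls on the rarity threshold and on excluding known admin automation.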

The Pyramid of Pain

Security researcher David Bianco's Pyramid of Pain ranks indicator types by how painful it is for attackers when defenders detect and respond to them. From the base up: hash values are trivial to change; IP addresses are easy; domain names are only slightly harder; network and host artefacts are annoying; tools are challenging; and TTPs are the hardest of all to change, because they represent how the attacker fundamentally operates. Aim to detect and respond at the TTP level wherever possible.

MITRE ATT&CK in Practice

MITRE ATT&CK is a globally accessible knowledge base of adversary tactics, techniques, and sub-techniques derived from real-world threat intelligence. It organises attacker behaviour into 14 tactics (the why — what the attacker is trying to achieve) and hundreds of techniques and sub-techniques (the how).

In an active incident, ATT&CK serves three functions. First, it provides vocabulary: saying "we observed T1059.001 (PowerShell)" is unambiguous to any IR professional globally. Second, it drives investigation: if you observe a technique, ATT&CK's group, software, and campaign pages show which other techniques have been used alongside it, suggesting what to look for next. Third, it provides coverage assessment: mapping your detections to ATT&CK techniques reveals which techniques you have coverage for and which ones attackers could use against you undetected.

During triage, classify each observed behaviour by ATT&CK technique. This practice has a compounding return — your post-incident data becomes threat intelligence for future detection engineering.
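
Coverage assessment reduces to a set difference once each detection is tagged with technique IDs. A minimal sketch, assuming a hand-maintained mapping of rule names to ATT&CK technique IDs (the rule names here are made up for illustration):

```python
def coverage_gaps(detections: dict[str, set[str]],
                  techniques_of_interest: set[str]) -> set[str]:
    """Return technique IDs that no detection rule claims to cover."""
    covered = set().union(*detections.values())
    return techniques_of_interest - covered


detections = {
    "powershell_rare_external_ip": {"T1059.001"},  # PowerShell
    "new_service_installed": {"T1543.003"},        # Windows Service
}
gaps = coverage_gaps(detections, {"T1059.001", "T1543.003", "T1003.001"})
# gaps == {"T1003.001"}  (LSASS Memory has no mapped detection)
```

A mapping only shows claimed coverage. Whether a rule actually fires is something purple team exercises validate.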

OSINT Enrichment

Every IOC encountered during an investigation should be enriched with open-source intelligence before acting on it. Enrichment provides confidence in whether an indicator is genuinely malicious, attribution context (which threat actor or campaign it is associated with), and tactical intelligence (what the indicator is used for).

  • VirusTotal — hash, URL, and IP reputation across 70+ AV engines and threat feeds. Also provides behaviour reports for file submissions and YARA rule matching.
  • Shodan — internet-wide scanning data. Shows what services an IP is running, certificate history, and reverse DNS. Excellent for understanding C2 infrastructure.
  • AbuseIPDB — community-reported malicious IP database. Useful for rapid triage of connection attempts.
  • AlienVault OTX (Open Threat Exchange) — community threat intelligence pulses. Good for correlating IOCs to known campaigns.
  • urlscan.io — automated web page scanning. Submit a suspicious URL and receive a screenshot, DOM capture, and outbound connection data without visiting the page yourself.
  • WHOIS and passive DNS — registration history and DNS resolution history for domains. Useful for understanding attacker infrastructure age and reuse.
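
Before an indicator can be enriched it has to be routed to the right sources. The sketch below classifies a raw (possibly defanged) indicator and suggests which of the sources above to query. The routing table is an illustrative assumption, and no API calls are made.

```python
import ipaddress
import re

_HASH_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}

# Hypothetical routing of indicator types to the sources listed above.
ENRICHMENT_SOURCES = {
    "ip": ["AbuseIPDB", "Shodan", "VirusTotal"],
    "domain": ["WHOIS / passive DNS", "VirusTotal", "AlienVault OTX"],
    "url": ["urlscan.io", "VirusTotal"],
    "md5": ["VirusTotal"], "sha1": ["VirusTotal"], "sha256": ["VirusTotal"],
}


def classify_ioc(value: str) -> str:
    """Return 'ip', 'md5', 'sha1', 'sha256', 'url', 'domain', or 'unknown'."""
    value = value.strip().replace("[.]", ".").replace("hxxp", "http")  # refang
    try:
        ipaddress.ip_address(value)
        return "ip"
    except ValueError:
        pass
    if re.fullmatch(r"[0-9a-fA-F]+", value) and len(value) in _HASH_LENGTHS:
        return _HASH_LENGTHS[len(value)]
    if re.match(r"https?://", value, re.IGNORECASE):
        return "url"
    if re.fullmatch(r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}", value):
        return "domain"
    return "unknown"
```

Classifying first also avoids accidentally submitting internal hostnames or private IPs to public services.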

Building the Incident Timeline

The incident timeline is the single most important analytical document produced during an investigation. It correlates all available evidence — logs, forensic artefacts, witness accounts — into a chronological record of attacker activity. It answers the questions that matter: when did the attacker first gain access, what did they do, how did they move, and what did they access or exfiltrate?

Good timeline construction requires discipline:

  • Normalise timestamps — all times in UTC. Windows event logs, web server logs, EDR telemetry, and firewall logs are all potentially in different timezones. Mixing them without normalisation creates a misleading picture.
  • Source everything — every entry in the timeline must reference the log source and log line that supports it. Unsourced assertions are not evidence.
  • Distinguish confirmed from suspected — "at 14:32 UTC, the attacker executed Mimikatz on DC01 (confirmed — Event ID 4688 + LSASS access)" is different from "the attacker likely moved laterally to DC01 at approximately 14:35 UTC (suspected — no direct evidence; inferred from subsequent credential use)".
  • Scope the blast radius — which systems were touched? Which accounts were used or compromised? Which data was accessed? The blast radius drives the recovery plan and the breach notification obligation.
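
The first three rules above can be enforced in code. This is a minimal sketch with hypothetical field names: timestamps are converted to UTC on entry, every entry must cite a source, and each entry carries a confirmed/suspected flag.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEntry:
    when_utc: datetime
    description: str
    source: str      # log source and line reference supporting the entry
    confirmed: bool  # False means suspected or inferred


def to_utc(local_text: str, utc_offset_hours: float) -> datetime:
    """Convert 'YYYY-MM-DD HH:MM:SS' recorded at a known UTC offset into UTC."""
    naive = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S")
    local = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)


def build_timeline(entries) -> list[TimelineEntry]:
    """Reject unsourced entries, then sort chronologically."""
    for entry in entries:
        if not entry.source.strip():
            raise ValueError(f"unsourced timeline entry: {entry.description!r}")
    return sorted(entries, key=lambda e: e.when_utc)
```

For example, a web server log line stamped 09:32:00 at UTC-5 becomes 14:32:00 UTC, which places it correctly alongside EDR telemetry that is already in UTC.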
Key Takeaways — Chapter 6
  • Alert triage is a funnel — false positive reduction through detection engineering is as important as detection coverage
  • TTP-based detection is more durable than IOC-based detection; aim to detect attacker behaviour, not specific artefacts
  • MITRE ATT&CK provides vocabulary, investigation direction, and coverage gap analysis during active incidents
  • All IOCs should be OSINT-enriched before action — VirusTotal, Shodan, AbuseIPDB, and passive DNS are the core toolset
  • The incident timeline must be UTC-normalised, sourced to specific log lines, and explicit about what is confirmed vs inferred
Chapter 07 ·  ~12 min read  ·  Response

Containment Strategies

Short-term vs long-term containment, isolation techniques, the containment decision matrix, and evidence preservation

Containment is the most operationally pressured phase of incident response. The attacker is active or may become active. Every minute of delay is a minute of continued access. But hasty containment — pulling network cables and shutting down servers without a plan — can destroy evidence, alert the attacker to take final destructive action, and create recovery problems that dwarf the original incident. The discipline is in moving quickly and systematically.

The Containment Paradox

There is a tension at the heart of every containment decision: the actions that most effectively stop an attacker also tend to disrupt evidence and business operations. Reformatting an infected endpoint eliminates the threat but also eliminates the forensic artefacts that would reveal how the attacker got in, what they accessed, and whether they established persistence elsewhere. Blocking a C2 IP address stops ongoing communication but may prompt the attacker to switch to a backup channel you haven't identified yet.

This is why containment decisions require deliberate judgment, not reflexive action. The goal is to control the attacker's ability to cause further harm while preserving enough visibility and evidence to complete the investigation.

Short-Term vs Long-Term Containment

Short-term containment is immediate action taken to limit damage while the investigation continues. It is designed to be temporary and minimally disruptive:

  • Network isolation of a specific compromised host (through EDR isolation, VLAN change, or firewall ACL)
  • Disabling a compromised user account without yet resetting credentials (blocks new authentication without a visible password change; existing sessions may persist until their tickets or tokens expire)
  • Blocking a specific C2 IP or domain at the perimeter
  • Suspending a cloud service or API key that appears compromised

Long-term containment is the more thorough remediation that follows once the investigation has established scope. It is designed to be permanent:

  • Full network redesign to close the lateral movement path the attacker used
  • Credential rotation across all affected accounts (and potentially all privileged accounts)
  • Re-imaging compromised endpoints from known-good baselines
  • Patching the vulnerability that provided initial access
  • Removing all identified persistence mechanisms across all affected systems
Timing Note

Do not move to long-term containment while the investigation is still incomplete. If you reset credentials or re-image systems before you know the full scope of the compromise, you may destroy the evidence you need to find the remaining compromised systems. Sequence matters: complete detection and analysis, then execute long-term containment comprehensively.

Isolation Techniques

Network Isolation

The cleanest containment for a compromised endpoint is network isolation: blocking all inbound and outbound traffic while preserving the ability to collect forensic data remotely. Modern EDR platforms (CrowdStrike, SentinelOne, Defender for Endpoint) support network isolation from the management console: the EDR agent on the host keeps its connection to the console while all other traffic is blocked. This allows continued forensic collection without giving the attacker network access.

If EDR isolation is not available, options include: VLAN reassignment (moving the port to an isolated VLAN with no routing), ACL-based blocking on the adjacent switch, or physical disconnection (last resort — destroys network forensic evidence).

Account Lockout vs Credential Reset

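Locking or disabling an account prevents further authentication without alerting the attacker via a visible password change. Resetting credentials invalidates the stolen password or hash, but it does not by itself end existing sessions: Kerberos tickets and cloud session tokens can remain valid until they expire or are explicitly revoked. And if the attacker has established persistence via other means (backdoor, web shell, stolen hashes for other accounts usable in pass-the-hash), resetting one account's credentials provides false confidence.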

The correct sequence is: isolate the systems the compromised account accessed, collect forensic evidence, identify all persistence mechanisms, then perform a coordinated credential rotation that covers the original account and any other accounts the attacker may have touched or compromised.

Cloud Containment

Cloud environments introduce unique containment considerations. An IAM role or access key compromise in AWS can grant broad access to resources across multiple services and regions simultaneously. Containment actions for cloud incidents include: deactivating the compromised access key, revoking a role's active sessions (by denying requests made with tokens issued before a cut-off time, since temporary credentials cannot be deleted individually), attaching a deny-all SCP (Service Control Policy) to a compromised AWS member account, removing the compromised IAM user's console password and access keys, and enabling CloudTrail in all regions if it wasn't already.
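
Two of these actions come down to small policy documents. The sketch below generates a deny-all SCP and a policy that denies any session whose token was issued before a cut-off time, using the `aws:TokenIssueTime` condition key. The function names and the `Sid` value are illustrative, and attaching the policies (via the console, CLI, or an SDK) is deliberately left out.

```python
import json
from datetime import datetime, timezone


def deny_all_scp() -> str:
    """SCP body that denies every action in the account it is attached to."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "QuarantineDenyAll", "Effect": "Deny", "Action": "*", "Resource": "*"}
        ],
    }
    return json.dumps(policy, indent=2)


def deny_sessions_issued_before(cutoff: datetime) -> str:
    """IAM policy body that blocks sessions whose tokens predate the cut-off."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
        }],
    }
    return json.dumps(policy, indent=2)
```

A deny-all SCP also blocks responders working inside that account, so plan how forensic access will be granted (for example, an exception for a dedicated IR role) before applying it.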

The Containment Decision Matrix

Every containment decision involves a tradeoff. The decision matrix formalises this judgment:

Scenario | Isolate | Monitor | Rationale
Ransomware execution in progress | ✓ Immediate | - | Encryption spread is the primary damage; speed of isolation determines blast radius
Confirmed data exfiltration in progress | ✓ Immediate | - | Every byte still leaving is additional breach scope
Suspected C2 beacon, no confirmed actions | Consider | ✓ Short-term | Monitoring may reveal additional infrastructure and lateral movement before isolating
Compromised account, attacker not currently active | - | ✓ With lockout | Account lockout + monitoring reveals full scope before credential reset disrupts the picture
Suspicious process, not yet confirmed malicious | - | - | Premature isolation destroys the investigation; gather evidence first

Preserving Evidence During Containment

Evidence preservation and containment are not mutually exclusive — but they require coordination. Before isolating any system:

  • Capture a memory image if possible (volatile data — running processes, network connections, encryption keys — is lost when the system is powered off)
  • Document the system's current state: running processes, active network connections, logged-in users, open files
  • Ensure all relevant logs are flowing to the SIEM and have been preserved (local log files may be modified or deleted by the attacker)
  • Photograph the screen if physical forensics will follow
Evidence Preservation — Live Response Commands

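Before isolating a Windows host, capture volatile data from an elevated PowerShell prompt (the # comments below are PowerShell syntax and will not work in cmd.exe):

netstat -ano                                  # active connections and listening ports, with owning PID
tasklist /v                                   # running processes with user, session, and window title
query user                                    # logged-in users
wevtutil qe Security /c:100 /rd:true /f:text  # 100 most recent security events (newest first)
ipconfig /all                                 # network configuration

On Linux: ss -tunap (all TCP/UDP sockets with owning process), ps auxf, w, last, find /tmp -type f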
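
Command output is only useful as evidence if it is saved and can be shown to be unaltered. The following is a minimal sketch of a collector that runs each command, writes its output to a file, and records a SHA-256 hash for each file. The `capture_commands` helper and the `WINDOWS_COMMANDS` mapping are illustrative; in practice the output directory should be on external or remote storage, not the suspect disk.

```python
import hashlib
import subprocess
from pathlib import Path

# Hypothetical command set mirroring the Windows commands listed above.
WINDOWS_COMMANDS = {
    "netstat": ["netstat", "-ano"],
    "tasklist": ["tasklist", "/v"],
    "query_user": ["query", "user"],
    "ipconfig": ["ipconfig", "/all"],
}


def capture_commands(commands: dict[str, list[str]], out_dir: Path) -> dict[str, str]:
    """Run each command, save its stdout, and return {file name: sha256 hex}."""
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for name, argv in commands.items():
        # check=False: a non-zero exit code should not abort the remaining commands.
        result = subprocess.run(argv, capture_output=True, timeout=120, check=False)
        target = out_dir / f"{name}.txt"
        target.write_bytes(result.stdout)
        manifest[target.name] = hashlib.sha256(result.stdout).hexdigest()
    return manifest
```

The returned manifest can be pasted straight into the evidence record, so that later copies of the files can be checked against the hashes taken at collection time.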

Key Takeaways — Chapter 7
  • Containment involves a fundamental tradeoff between stopping damage and preserving evidence — hasty action can destroy the investigation
  • Short-term containment limits immediate damage; long-term containment executes after scope is fully understood
  • EDR network isolation is the cleanest option — it blocks attacker access while keeping the forensic channel open
  • Account lockout and credential reset are different actions with different consequences — sequence them deliberately
  • Capture volatile data (memory, network connections) before isolating any system
Chapter 08 ·  ~15 min read  ·  Forensics

Evidence Collection & Forensics

Chain of custody, disk imaging, memory acquisition, log preservation, and forensic timeline reconstruction

Digital forensics is the discipline of collecting, preserving, and analysing digital evidence in a manner that maintains its integrity and admissibility. For incident responders, forensics serves two parallel purposes: operational (understanding what happened so you can contain and remediate) and legal (building an evidentiary record that can support disciplinary action, civil litigation, or criminal prosecution). The standards for the two are different — operational forensics can be done quickly and imperfectly; legal forensics must be done carefully and documented exhaustively.

Chain of Custody

Chain of custody is a documented record that tracks who had access to evidence, when, and what they did with it. It is the mechanism by which you can demonstrate to a court, regulator, or HR tribunal that evidence has not been tampered with between collection and presentation.

A chain of custody record for a piece of digital evidence includes:

  • Description of the evidence (device make/model, serial number, hostname, hash value of the image)
  • Date and time of collection
  • Location of collection
  • Name and role of the person who collected it
  • Storage location (encrypted drive, evidence locker, cloud storage with access logging)
  • Every subsequent access: who, when, purpose, and duration

Chain of custody failures are one of the most common reasons digital evidence is challenged in legal proceedings. If you cannot produce a complete custody record, opposing counsel will argue the evidence was contaminated or fabricated. This does not only apply to criminal cases — insider threat investigations and employment tribunals have the same requirement.

Volatile vs Non-Volatile Data

The order in which you collect evidence is determined by volatility — how quickly the data will be lost if not captured. The classic rule from forensics is to collect in order from most volatile to least volatile:

  1. CPU registers and cache: change continuously and are lost immediately on shutdown
  2. RAM / physical memory: lost on shutdown or power loss; also affected by anti-forensics tools that actively wipe memory
  3. Network state (active connections, ARP cache, routing table): lost when connections drop
  4. Running processes and open file handles: lost on shutdown
  5. Disk data: persistent but can be altered; collect before remediation actions that might overwrite data
  6. Remote logging (SIEM): generally persistent, but verify retention period and that logs are flowing
  7. Archived/backup data: least volatile; available long after the incident but may lag the actual event
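
The ordering above can be encoded once and used to sort whatever collection tasks are available on a given system. A minimal sketch; the category names are illustrative labels for the seven tiers in the list.

```python
# Most volatile first, following the list above.
VOLATILITY_ORDER = [
    "cpu_registers_cache",
    "ram",
    "network_state",
    "processes",
    "disk",
    "remote_logs",
    "archives",
]


def collection_plan(available: list[str]) -> list[str]:
    """Sort available evidence sources from most to least volatile."""
    return sorted(available, key=VOLATILITY_ORDER.index)
```

An unrecognised category raises `ValueError`, which is intentional: a source nobody has ranked should be a conscious decision, not a silent default.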

Memory Acquisition

RAM dumps capture the contents of physical memory at a point in time. They are among the most forensically valuable artefacts in modern IR because: malware that runs entirely in memory (fileless malware) leaves no disk artefacts; encryption keys, passwords, and credentials are often present in plaintext in RAM; network connections and process trees captured in memory provide attacker context that logs may not.

On Windows, the primary tools are:

  • WinPmem — open-source, widely used, outputs AFF4 or raw format compatible with Volatility
  • DumpIt — simple single-binary acquisition, outputs a raw memory dump alongside a CSV of running processes
  • Magnet RAM Capture — GUI-based, outputs a .mem file; good for less technical responders

On Linux, LiME (Linux Memory Extractor) is a loadable kernel module that acquires physical memory over a network socket or to a file. The module must be compiled against the exact kernel version running on the target, so pre-compiling LiME for the kernel versions in your fleet as part of IR preparation saves critical time during an incident.

Analysing Memory with Volatility

Volatility 3 is the standard framework for memory forensics. Key plugins for incident response:

Plugin | What It Does | IR Use Case
windows.pslist | Lists running processes from the EPROCESS linked list | Identify suspicious processes; compare to expected baseline
windows.psscan | Scans for EPROCESS structures in memory (finds hidden and terminated processes) | Rootkit detection; compare pslist vs psscan discrepancies
windows.netscan | Shows active and recent network connections | Identify C2 connections, lateral movement
windows.malfind | Finds injected code in memory (RWX regions with PE headers) | Detect process injection, shellcode
windows.dlllist | Lists DLLs loaded by each process | Detect DLL injection, malicious DLL loading
windows.hashdump | Extracts local account password hashes (SAM) from memory | Understand credential exposure; lateral movement risk
windows.cmdline | Shows command-line arguments for running processes | Identify malicious command execution
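
The pslist versus psscan comparison is a set difference on process identifiers. The sketch below assumes the two plugin outputs have already been parsed into (PID, process name) tuples; parsing the Volatility output itself is left out, and `unlinked_processes` is a hypothetical helper.

```python
def unlinked_processes(pslist: list[tuple[int, str]],
                       psscan: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Processes found by memory scanning that are absent from the linked list."""
    return sorted(set(psscan) - set(pslist))


pslist = [(4, "System"), (612, "lsass.exe"), (2380, "explorer.exe")]
psscan = pslist + [(3144, "svch0st.exe")]
suspects = unlinked_processes(pslist, psscan)
# suspects == [(3144, "svch0st.exe")]
```

Not every discrepancy is a rootkit. psscan also recovers processes that have already exited, so each result needs a look at its exit time and parent before it is treated as hidden.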

Disk Imaging

A forensic disk image is a bit-for-bit copy of a storage device — it captures not just allocated files but deleted files, unallocated space, and filesystem metadata. This completeness is what makes it forensically valid; a simple file copy misses data that an attacker tried to delete but that remains recoverable.

FTK Imager (free, Windows) is the most widely used tool for Windows IR. It produces E01 (Expert Witness Format) images with embedded MD5 and SHA-1 hash verification. The hash is computed during acquisition and stored in the image — any subsequent modification to the image will produce a different hash, immediately revealing tampering.

On Linux and macOS, dd with the correct block size and error handling options produces a raw image. dcfldd adds on-the-fly hashing and progress monitoring. For network acquisition (useful when physical access is impractical), piping dd output through netcat is a common technique; because netcat itself is unencrypted, run it over a secure channel such as an SSH tunnel.
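
Raw images produced by dd carry no embedded hash, so the hash has to be computed and recorded separately. A minimal sketch of chunked hashing and later verification using only the standard library (the function names are illustrative):

```python
import hashlib
from pathlib import Path


def hash_image(path: Path, algorithm: str = "sha256",
               chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large images never load fully into memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path: Path, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the current hash with the value recorded at acquisition."""
    return hash_image(path, algorithm) == expected_hex.strip().lower()
```

Record the acquisition hash in the chain of custody record, then re-run `verify_image` each time the image is copied or opened for analysis.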

Write Blockers

A write blocker is a hardware device (or software equivalent) that allows data to be read from a storage device without any writes being made to it. Connecting a suspect drive directly to a standard computer without a write blocker risks modifying the drive's metadata — access timestamps, last-written dates — which can contaminate the forensic record. Any legal investigation requires hardware write blocking. Operational IR can use software write blockers (such as those built into forensic distros like CAINE or DEFT) when hardware blockers are unavailable.

Log Preservation

Logs stored only on compromised systems cannot be trusted — the attacker may have modified or deleted them. For forensically reliable log evidence, you need logs that were forwarded off the system in real time to a SIEM or log aggregator that the attacker did not have access to.

Before containment actions that might overwrite or rotate logs, explicitly export and archive:

  • Windows Event Logs (EVTX format) — Security, System, Application, PowerShell, and any application-specific logs
  • Linux auth.log, syslog, and application logs in /var/log
  • Web server access and error logs (Apache/Nginx/IIS)
  • Firewall and proxy logs for the relevant time period
  • Domain controller event logs (Security and Directory Service)

Forensic Timeline Reconstruction

A super-timeline (a merged, chronologically sorted collection of timestamps from multiple forensic artefact types) is one of the most powerful investigative tools in digital forensics. log2timeline / plaso automates the extraction and merging of timestamps from hundreds of forensic artefact types: filesystem metadata, Windows Event Logs, browser history, registry last-write times, prefetch files, and more. Timeline Explorer (Windows, Eric Zimmerman) is then a convenient way to filter and review the resulting CSV output.

Interpreting a super-timeline requires familiarity with Windows forensic artefacts — knowing, for example, that a prefetch file's creation timestamp indicates the first execution of a program on that system, or that shellbags record the last-accessed time for Windows Explorer folder browsing. This knowledge depth is what separates a Tier-3 forensics specialist from a Tier-1 alert analyst.

Key Takeaways — Chapter 8
  • Chain of custody documentation is mandatory for any evidence that may be used in legal or HR proceedings — every access must be logged
  • Collect in volatility order: RAM first, disk second, archived logs last
  • Memory acquisition (WinPmem, LiME) captures fileless malware, credentials, and process state that disk forensics cannot
  • Forensic disk images must be hash-verified during acquisition — E01 format stores the hash in the image itself
  • Hardware write blockers are required for legally admissible evidence; software write blockers are acceptable for operational IR
Chapter 09 ·  ~11 min read  ·  Response

Eradication

Removing malware, closing initial access vectors, credential rotation, and verifying clean state

Eradication is the phase in which you remove the attacker from your environment entirely — not just the system you first identified them on, but every system they touched, every persistence mechanism they installed, and every credential they compromised. Incomplete eradication is the most common cause of re-infection: the organisation believes it has recovered, lowers its guard, and the attacker returns through a backdoor that was missed.

Re-Imaging vs Surgical Removal

There are two approaches to eradicating a compromised endpoint: surgical removal (identifying and deleting specific malicious files, registry keys, scheduled tasks, and services) or complete re-imaging (wiping the system and reinstalling from a known-good baseline). The choice depends on the nature and severity of the compromise.

Surgical removal is appropriate when: the compromise is well-understood and limited in scope, the forensic investigation is complete, and every persistence mechanism has been identified and confirmed as removed. It is faster and less disruptive than re-imaging but carries higher risk — if any persistence mechanism is missed, the attacker retains access.

Re-imaging is the preferred approach for any complex compromise, any confirmed rootkit, and any incident involving advanced or nation-state threat actors. It provides the strongest assurance: a system rebuilt from a known-good baseline cannot contain operating-system-level attacker persistence, although the rare firmware or UEFI implant can survive a rebuild. The downside is time, disruption, and the need for reliable, malware-free baselines. If your backup images were captured after the attacker established persistence, re-imaging from them will re-infect the system.

Critical Point

Re-imaging is only effective if the image itself is clean. Verify your build images were created before the suspected date of initial compromise. If the attacker has been present for six months and your system images are four months old, re-imaging from those images simply re-infects the system.
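The date comparison above can be expressed as a small check. This is a sketch only: the function name and the `margin_days` buffer are assumptions introduced here to allow for uncertainty in the estimated date of initial access, not a standard value.

```python
from datetime import date, timedelta


def image_predates_compromise(image_built: date, earliest_compromise: date,
                              margin_days: int = 14) -> bool:
    """Return True only if the image was built safely before the compromise.

    margin_days is an illustrative buffer for uncertainty in the estimated
    date of initial access; it is an assumption, not a standard value.
    """
    return image_built < earliest_compromise - timedelta(days=margin_days)


today = date(2024, 7, 1)
compromise = today - timedelta(days=180)    # attacker present roughly six months
old_image = today - timedelta(days=120)     # image built four months ago
older_image = today - timedelta(days=365)   # image built a year ago

print(image_predates_compromise(old_image, compromise))    # False
print(image_predates_compromise(older_image, compromise))  # True
```

A date check like this says nothing about whether the image was tampered with afterwards, so it complements integrity verification rather than replacing it.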

Persistence Mechanisms to Check

Attackers establish persistence through dozens of techniques. During eradication, every confirmed compromised system must be checked for all of the following — not just the mechanism that was initially observed. Sophisticated attackers install multiple persistence mechanisms knowing that defenders will find the obvious one and stop looking.

Windows Persistence

  • Registry Run / RunOnce keys: HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Run and the HKLM equivalent, plus the matching RunOnce keys under both hives
  • Scheduled tasks — check via schtasks /query /fo LIST /v and Get-ScheduledTask
  • Services — particularly any with unusual names, paths in temp directories, or recently created timestamps
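  • WMI event subscriptions, a common persistence mechanism that is often missed: Get-WMIObject -Namespace root\subscription -Class __EventFilter (repeat for __EventConsumer and __FilterToConsumerBinding, since a subscription needs all three)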
  • COM object hijacking — registry modifications that redirect COM object loading
  • DLL search order hijacking — malicious DLLs placed in locations that get loaded by legitimate applications
  • Boot-time execution — MBR infection, bootkit, or UEFI implant (extremely rare but catastrophic)
  • Active Directory persistence: DCSync rights granted to non-admin accounts, Golden Ticket capability (KRBTGT compromise)
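As one way to triage the service and scheduled-task output gathered above, the sketch below flags entries whose binary lives under a temp or user-writable directory. It assumes you have already exported entry names and binary paths into a dictionary; the `SUSPICIOUS_PARTS` set, the function names, and the sample entries are hypothetical, and the directory list will produce false positives (plenty of legitimate software runs from AppData) as well as miss anything placed elsewhere.

```python
from pathlib import PureWindowsPath

# Illustrative directory names only; tune to your environment.
SUSPICIOUS_PARTS = {"temp", "tmp", "appdata", "public", "$recycle.bin"}


def is_suspicious_path(binary_path: str) -> bool:
    """Flag executables that live under temp or user-writable locations."""
    cleaned = binary_path.strip().strip('"')
    parts = {p.lower() for p in PureWindowsPath(cleaned).parts}
    return bool(parts & SUSPICIOUS_PARTS)


def triage(entries: dict[str, str]) -> list[str]:
    """Return names of services or tasks whose binary path looks suspicious."""
    return sorted(name for name, path in entries.items() if is_suspicious_path(path))


# Hypothetical export: entry name -> binary path.
services = {
    "Spooler": r"C:\Windows\System32\spoolsv.exe",
    "WinUpdateSvc": r"C:\Users\Public\svc.exe",
    "HelperTask": r'"C:\Windows\Temp\helper.exe"',
}
print(triage(services))  # ['HelperTask', 'WinUpdateSvc']
```

`PureWindowsPath` is used so the path logic behaves the same on an analyst's Linux or macOS workstation as it would on Windows.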

Linux / Mac Persistence

  • Crontab entries — crontab -l and /etc/cron.* directories
  • SSH authorised keys — ~/.ssh/authorized_keys for every user account
  • Init scripts — /etc/init.d/, /etc/rc.local, systemd unit files in /etc/systemd/system/
  • Profile scripts — ~/.bashrc, ~/.profile, /etc/profile.d/
  • PAM modules — malicious PAM modules can log credentials or bypass authentication
  • SUID/SGID binaries — newly created or modified files with SUID bits set
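The SUID/SGID check in the list above can be sketched with the standard library alone. This is a minimal example rather than a replacement for a proper baseline comparison: the function name and the `modified_within_days` parameter are assumptions made here, and modification times can be forged by an attacker, so a clean result is not proof of absence.

```python
import os
import stat
import time
from typing import Optional


def find_setid_files(root: str, modified_within_days: Optional[float] = None) -> list[str]:
    """Walk root and return regular files with the SUID or SGID bit set.

    If modified_within_days is given, only files whose mtime falls inside
    that window are returned, which helps surface newly planted binaries.
    """
    cutoff = None
    if modified_within_days is not None:
        cutoff = time.time() - modified_within_days * 86400
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)       # lstat: do not follow symlinks
            except OSError:
                continue                  # unreadable or vanished file
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_mode & (stat.S_ISUID | stat.S_ISGID):
                if cutoff is None or st.st_mtime >= cutoff:
                    hits.append(path)
    return sorted(hits)
```

A typical call might be `find_setid_files("/usr", modified_within_days=90)`, with the results compared against the package manager's record of which binaries are expected to carry those bits.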

Credential Rotation

Any account that was used on, logged in to, or could theoretically have been accessed from a compromised system must have its credentials rotated as part of eradication. This includes:

  • The directly compromised accounts
  • Any service accounts running on the compromised system
  • Any accounts whose credentials were stored in memory (Mimikatz can extract all accounts whose credentials are cached in LSASS)
  • The local administrator account on the affected system (it may have been used for lateral movement)
  • API keys, service tokens, and certificates stored on the system or accessible from it

Credential rotation must be coordinated and performed in a single, rapid operation. Rotating credentials piecemeal over several days alerts the attacker to the eradication effort and gives them time to establish additional persistence or exfiltrate remaining data. Set a coordinated rotation window, brief all stakeholders, execute simultaneously.
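Before the rotation window, it helps to compute the full rotation scope in one pass so nothing is handled piecemeal. The sketch below assumes you can export logon records and a per-host list of service accounts; the `Logon` type, the `rotation_scope` function, the `HOST\Administrator` naming, and the sample data are all hypothetical. API keys, tokens, and certificates would need a separate inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Logon:
    account: str
    host: str


def rotation_scope(logons: list[Logon], compromised_hosts: set[str],
                   service_accounts: dict[str, set[str]]) -> set[str]:
    """Accounts to rotate: anything seen on, or configured on, a compromised host."""
    scope = {entry.account for entry in logons if entry.host in compromised_hosts}
    for host in compromised_hosts:
        scope |= service_accounts.get(host, set())
        # Assumes the built-in local admin has not been renamed.
        scope.add(f"{host}\\Administrator")
    return scope


# Hypothetical sample data.
logons = [
    Logon("alice", "WS-014"),
    Logon("bob", "WS-020"),
    Logon("svc_sql", "DB-01"),
    Logon("alice", "DB-01"),
]
scope = rotation_scope(logons, {"WS-014"}, {"WS-014": {"svc_backup"}})
print(sorted(scope))  # ['WS-014\\Administrator', 'alice', 'svc_backup']
```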

Closing the Initial Access Vector

Eradication is incomplete without permanently closing the initial access vector — the way the attacker got in. Common initial access vectors and their eradication actions:

Initial Access Vector | Eradication Action
Phishing (credential theft) | Reset compromised account credentials, enable MFA, review email filtering rules
Phishing (malware delivery) | Identify and eradicate malware on affected endpoints, update email security controls
Exploited public-facing application | Patch the vulnerability, review for other instances of the same software version, check for web shells
Valid account (credential stuffing) | Reset compromised account, require MFA, review for other compromised accounts in the same campaign
Supply chain compromise | Isolate the affected software/update, assess all systems that received the compromised update, contact the supplier
RDP / VPN brute force | Patch, disable direct RDP exposure, implement account lockout, require MFA for VPN
Key Takeaways — Chapter 9
  • Incomplete eradication is the most common cause of re-infection — check for multiple persistence mechanisms on every compromised system
  • Re-imaging is preferable to surgical removal for complex compromises — but verify image integrity and pre-compromise date
  • WMI event subscriptions, COM hijacking, and AD-level persistence are frequently missed in partial investigations
  • Credential rotation must be coordinated and simultaneous — piecemeal rotation alerts the attacker
  • Closing the initial access vector is a required eradication step — containment without patching the entry point invites re-infection
Chapter 10 ·  ~10 min read  ·  Recovery

Recovery

Restoring from backup, phased return to production, enhanced monitoring during recovery, and closing the incident

Recovery is the phase most organisations are impatient about. Executives want systems back online. Users want their workstations back. The business wants to declare victory and move on. The IR team's job in this phase is to ensure that the return to normal operations is safe, monitored, and verifiable — not just fast.

Backup Integrity and Restore Planning

Recovery from a serious incident almost always involves restoring systems from backup. The quality of your recovery depends entirely on the quality of your backups — a fact that sounds obvious but is routinely ignored until it matters.

Before restoring from backup, verify:

  • Backup integrity — are the backup files intact and readable? Have they been tested recently? A backup that has never been restored from has an unknown success rate.
  • Pre-compromise date — are the backups from before the attacker established presence? If the attacker has been in your environment for 60 days and your oldest backup is 45 days old, every backup is potentially infected.
  • Backup storage isolation — ransomware specifically targets backup systems. Were your backups stored in a location the attacker could reach? Were the backup agents running on compromised systems? If so, the backups may contain encrypted or corrupted data.
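One piece of the integrity check above can be automated: comparing each backup file against a hash recorded when the backup was taken. This sketch assumes such a manifest exists and was stored somewhere the attacker could not modify; the manifest format and the function names are hypothetical. A matching hash only shows the file is unchanged since the manifest was written. It does not show that the backup restores successfully or that it predates the compromise.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large backups do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backups(manifest: dict[str, str], backup_dir: Path) -> dict[str, str]:
    """Compare each backup file against the hash recorded at backup time.

    Returns a mapping of file name to "ok", "missing", or "mismatch".
    """
    results = {}
    for name, expected in manifest.items():
        path = backup_dir / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_of(path) == expected.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```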
The 3-2-1 Backup Rule

Three copies of data, on two different media types, with one copy off-site and isolated from production network access. The "isolated" part is the one organisations skip — until ransomware reaches the backup server through the same network segment. An air-gapped or immutable backup (cloud object storage with object lock enabled, or tape off-site) is the minimum viable ransomware-resistant backup strategy.

Phased Return to Production

Returning all systems to production simultaneously after a major incident is a risk management decision, not just an operational one. If the eradication was incomplete — if one system still harbours a persistence mechanism or an attacker account — bringing everything back online simultaneously gives the attacker the connected environment they need to re-spread.

A phased return prioritises:

  1. Systems with the highest business criticality and lowest compromise risk (most isolated, cleanest forensic evidence)
  2. Supporting infrastructure — authentication systems, DNS, core networking — that other systems depend on
  3. Systems with confirmed clean status — re-imaged from verified pre-incident baselines
  4. Systems that were monitored and showed no compromise indicators
  5. Systems that required surgical remediation — return last, with additional monitoring period

Enhanced Monitoring During Recovery

The period immediately after recovery is high-risk. If the eradication was incomplete, the attacker will re-activate during this window. If new credentials were issued and the attacker has found a way to intercept them, you will see this in the authentication logs. If a persistence mechanism was missed, the first thing it will do when the network comes back up is beacon out.

During the recovery monitoring period:

  • Increase SIEM alert sensitivity temporarily — accept higher false positive rates in exchange for not missing re-compromise signals
  • Monitor all outbound network connections from recovered systems, particularly to external IPs — new C2 connections will appear here
  • Watch for authentication events using the newly rotated credentials from unusual locations or at unusual hours
  • Run threat hunting queries specifically targeting the TTPs the attacker used — if any were missed in eradication, they will reappear
  • Review cloud API activity for signs of access using credentials that should have been rotated
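The outbound-connection check in the list above might look like the following sketch, which flags connections from recovered hosts to external addresses that were not seen in a pre-incident baseline. The `Conn` type, the `INTERNAL_NETS` list, and the sample records are hypothetical; the internal ranges assume an RFC 1918 address plan, and the sample destinations use reserved documentation addresses.

```python
import ipaddress
from typing import Iterable, NamedTuple

# Assumed internal ranges (RFC 1918); replace with your own address plan.
INTERNAL_NETS = [ipaddress.ip_network(n)
                 for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


class Conn(NamedTuple):
    src_host: str
    dst_ip: str
    dst_port: int


def is_external(ip: str) -> bool:
    """True if the address falls outside every configured internal range."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in INTERNAL_NETS)


def new_external_destinations(conns: Iterable[Conn], recovered_hosts: set[str],
                              baseline: set[str]) -> list[Conn]:
    """Connections from recovered hosts to external IPs absent from the baseline."""
    return [c for c in conns
            if c.src_host in recovered_hosts
            and is_external(c.dst_ip)
            and c.dst_ip not in baseline]


# Hypothetical flow records.
conns = [
    Conn("fs01", "10.0.0.5", 445),        # internal traffic, ignored
    Conn("fs01", "198.51.100.7", 443),    # external but in the baseline
    Conn("fs01", "203.0.113.50", 8443),   # external and new: review this
    Conn("ws99", "203.0.113.50", 8443),   # host is not in the recovered set
]
alerts = new_external_destinations(conns, {"fs01"}, baseline={"198.51.100.7"})
print(alerts)
```

The baseline is only as trustworthy as the period it was taken from; if it was built while the attacker was already present, it may contain their infrastructure.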

Declaring the Incident Closed

An incident should not be closed on a schedule or under management pressure. It should be closed when specific objective criteria are met:

  • All compromised systems have been eradicated or re-imaged
  • All identified persistence mechanisms have been removed and verified absent
  • All compromised credentials have been rotated
  • The initial access vector has been closed
  • A defined recovery monitoring period (typically 2–4 weeks) has elapsed with no re-compromise indicators
  • Regulatory notification obligations have been assessed and, if triggered, completed
  • The post-incident report has been drafted
Key Takeaways — Chapter 10
  • Backup integrity and pre-compromise dating must be verified before restoration — ransomware frequently targets backup infrastructure
  • Phased return to production reduces re-infection risk — start with the cleanest, most isolated systems
  • Enhanced monitoring during recovery is mandatory — re-compromise most commonly occurs in the first 2–4 weeks after restoration
  • Close incidents on objective criteria, not timelines or management pressure
  • The 3-2-1 backup rule with isolation (air-gap or immutable storage) is the minimum ransomware-resistant backup strategy
Chapter 11 ·  ~18 min read  ·  Playbooks

Common Incident Types & Playbooks

Phishing, ransomware, account takeover, insider threat, data exfiltration, web shell, and supply chain — detection signals, containment steps, and escalation triggers

No incident is identical. But the vast majority of incidents encountered in enterprise environments fall into a small number of well-understood categories, each with characteristic detection signals, typical attacker behaviour, and proven response sequences. This chapter provides condensed playbooks for seven of the most common — not to replace your organisation's full playbooks, but to give you the pattern recognition that experienced IR analysts develop over years of repetition.

🎣
Phishing / Business Email Compromise
High Volume

Detection Signals

  • Email security gateway alerts — blocked or quarantined messages with malicious links or attachments
  • User report — "I think I clicked something I shouldn't have"
  • EDR alert — process spawned from browser or email client attempting network connection to unknown external IP
  • DNS logs — resolution of newly registered domains (low age, high-entropy names)
  • Authentication alert — successful login from a new country or device immediately after a credential-themed email was delivered
  • BEC-specific: email forwarding rule created or modified in victim mailbox (SIEM rule: Office 365 "New-InboxRule" and "Set-InboxRule" audit log events)
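The "high-entropy names" signal in the DNS bullet above can be approximated with Shannon entropy over the left-most domain label. This is a rough heuristic sketch, not a production detection: the `threshold` and `min_length` values are assumptions chosen for the sample data, the domains are made up, and domain age would have to come from a separate registration-data source.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def flag_domains(domains: Iterable[str], threshold: float = 3.5,
                 min_length: int = 10) -> list[str]:
    """Return domains whose left-most label is both long and high-entropy."""
    flagged = []
    for domain in domains:
        label = domain.lower().split(".")[0]
        if len(label) >= min_length and shannon_entropy(label) >= threshold:
            flagged.append(domain)
    return flagged


# Hypothetical DNS query names.
queries = ["mail.example", "accounts-login.example", "x7qk2vz9pl4mw8.example"]
print(flag_domains(queries))  # ['x7qk2vz9pl4mw8.example']
```

Short labels are skipped because entropy is not meaningful on a handful of characters; some legitimate services also use random-looking hostnames, so expect to maintain an allowlist.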

Initial Response

  • Pull the original email: headers, body, attachments, URLs. Do not click any links — use urlscan.io and sandboxes.
  • Identify all recipients — search mail gateway logs for delivery of the same or similar email to other mailboxes
  • Determine if the link was clicked and if the attachment was executed — email client process logs, proxy logs, EDR telemetry
  • For BEC: check for inbox rules, email forwarding configuration, recent mail sends on behalf of the victim

Containment

  • Remove the email from all recipient mailboxes (Exchange: Search-Mailbox; M365: Content Search + Purge)
  • Block the sending domain and all URLs in the email at the email gateway and proxy
  • If credentials were entered: reset the account immediately, terminate all active sessions, require MFA re-enrolment
  • If malware executed: isolate the endpoint, begin malware eradication procedure
  • For BEC: remove malicious inbox rules, reset mail forwarding, notify finance if wire transfer was the objective

Escalation Triggers

  • Credentials confirmed stolen → escalate to account takeover playbook
  • Malware confirmed executed → escalate to endpoint compromise playbook
  • Wire transfer initiated → escalate immediately to financial fraud response, contact bank within hours for potential recall
  • Executive account compromised → SEV-1, immediate IC engagement
🔐
Ransomware
Critical

Detection Signals

  • EDR alert — mass file rename events, high-entropy file writes, known ransomware binary hash
  • SIEM alert — shadow copy deletion (vssadmin delete shadows or equivalent via WMI)
  • User report — files have been renamed with unknown extensions, ransom note appears
  • File server alert — unusual volume of file modifications in a short time window
  • Network alert — lateral movement (unusual SMB, PsExec, WMI) preceding encryption activity
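As an illustration of the mass-rename signal above, the sketch below counts rename events per host inside a sliding time window. The event format, the function name, and the default `window_seconds` and `threshold` values are assumptions for the example; real EDR products implement this kind of logic in their own ways, and sensible thresholds depend on the host's normal workload.

```python
from collections import defaultdict, deque
from typing import Iterable


def detect_mass_rename(events: Iterable[tuple[float, str]],
                       window_seconds: float = 60,
                       threshold: int = 100) -> set[str]:
    """Return hosts that renamed more than `threshold` files within any window.

    events: (epoch_seconds, host) tuples, sorted by time.
    """
    recent = defaultdict(deque)   # host -> timestamps inside the current window
    flagged = set()
    for ts, host in events:
        window = recent[host]
        window.append(ts)
        # Drop timestamps that have fallen out of the window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            flagged.add(host)
    return flagged


# Hypothetical telemetry: fs01 renames 150 files in 30 seconds,
# ws17 renames 150 files spread over 25 minutes.
events = sorted(
    [(i * 0.2, "fs01") for i in range(150)] +
    [(i * 10.0, "ws17") for i in range(150)]
)
print(detect_mass_rename(events))  # {'fs01'}
```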

Initial Response — Speed is Critical

  • Declare SEV-1 immediately. Activate the full IR team and IC.
  • Identify the initial infected host(s) — where did encryption first start?
  • Isolate infected systems immediately — every second of connectivity is additional encryption spread
  • Identify the ransomware family — check the ransom note, file extension, and hash against ID Ransomware (id-ransomware.malwarehunterteam.com)
  • Check for available decryptors — No More Ransom project (nomoreransom.org) lists free decryptors for many families
  • Determine if exfiltration preceded encryption — modern ransomware (double extortion) exfiltrates before encrypting; check outbound traffic in the 24–72 hours before encryption

Backup Assessment

  • Are backup systems online? Were they encrypted?
  • What is the most recent clean backup and when does it date from?
  • Have backup restoration procedures been tested?
  • If cloud backups: check for object lock / immutability status

Escalation Triggers

  • Confirmed exfiltration → breach notification assessment begins
  • All backups encrypted or unavailable → consider cyber insurance engagement for ransom decision
  • Critical infrastructure affected → notify relevant sector authority (CISA for US critical infrastructure)
  • Attackers contact organisation → route all communications through legal counsel
🔑
Account Takeover / Credential Stuffing
Common

Detection Signals

  • Authentication logs — successful login from a new country, ASN, or user-agent string
  • UEBA alert — impossible travel (login from New York then London 20 minutes apart)
  • MFA bypass alert — MFA approved from a device not previously associated with the account
  • Spike in failed authentications followed by success — credential stuffing pattern
  • Unusual API activity post-login — mass data export, permission changes, new OAuth application registered
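The impossible-travel signal above reduces to a distance-over-time calculation. A minimal sketch, assuming each login has already been geolocated: the function names and the 900 km/h ceiling (roughly airliner cruising speed) are assumptions made here, and IP geolocation is imprecise enough, especially with VPNs and mobile carriers, that results need analyst review.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(login_a: tuple[float, float, float],
                         login_b: tuple[float, float, float],
                         max_speed_kmh: float = 900.0) -> bool:
    """Each login is (epoch_seconds, latitude, longitude).

    True if covering the distance between the two logins in the elapsed
    time would require exceeding max_speed_kmh.
    """
    (t1, lat1, lon1), (t2, lat2, lon2) = sorted([login_a, login_b])
    distance = haversine_km(lat1, lon1, lat2, lon2)
    hours = (t2 - t1) / 3600
    if hours == 0:
        return distance > 0
    return distance / hours > max_speed_kmh


new_york = (0.0, 40.7128, -74.0060)
london_20_min_later = (20 * 60.0, 51.5074, -0.1278)
london_10_hours_later = (10 * 3600.0, 51.5074, -0.1278)

print(is_impossible_travel(new_york, london_20_min_later))    # True
print(is_impossible_travel(new_york, london_10_hours_later))  # False
```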

Containment

  • Terminate all active sessions for the compromised account (Azure AD: Revoke-AzureADUserAllRefreshToken; AWS: deactivate the IAM user's access keys and revoke active sessions for any roles the user assumed)
  • Reset password and require MFA re-enrolment from a trusted device
  • Review all actions taken during the compromised session — data accessed, settings changed, emails sent
  • Check for persistence mechanisms created by the attacker: new OAuth apps, API keys, forwarding rules, admin role grants
  • If cloud account: review IAM policy changes, new resources created, data export operations

Escalation Triggers

  • Privileged account (domain admin, global admin, AWS root) → SEV-1 immediately
  • Evidence of data access → breach assessment required
  • Multiple accounts compromised → credential stuffing campaign; review full user population for compromise indicators
🎭
Insider Threat
Complex

Detection Signals

  • DLP alert — bulk download of sensitive files, upload to personal cloud storage (Dropbox, Google Drive, WeTransfer)
  • UEBA alert — unusual access patterns: accessing files outside the user's normal scope, off-hours activity, volume anomalies
  • Endpoint alert — large file compression events, USB insertion events, screenshot capture tools
  • HR-correlated signal — user has submitted resignation or is under disciplinary action (check your HR-IR coordination process)
  • Email — forwarding to personal account, large email with attachments to external address

Special Considerations

  • Insider investigations require legal and HR involvement from the start — do not take containment action without consulting both
  • Preserve evidence silently before taking any action — the insider should not be aware of the investigation until the organisation is ready to act
  • Attorney-client privilege: involve legal counsel early; some communications may be protected from discovery
  • The threshold for action is different for insiders: employment law, labour relations, and the risk of wrongful termination claims all apply

Escalation Triggers

  • Evidence of data exfiltration → breach notification assessment
  • Evidence of sabotage or destructive action → SEV-1, law enforcement referral consideration
  • Privileged insider (sysadmin, DBA) → elevated risk; consider immediate access revocation with legal sign-off
🕸️
Web Shell / Server Compromise
Persistent

Detection Signals

  • WAF alert — known web shell signatures in HTTP request/response
  • Web server access log anomaly — unusual user-agent strings, POST requests to static files (e.g., POST to a .jpg or .css), requests with encoded payloads
  • EDR alert — web server process (w3wp.exe, apache2, nginx) spawning a shell (cmd.exe, /bin/bash) or running unusual commands
  • File integrity monitoring (FIM) alert — new file written to web root or app directory
  • Outbound connection from web server process to external IP
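The access-log anomaly above (POST requests to static files) is straightforward to hunt for. The sketch below assumes logs in the common or combined log format; the regular expression, the extension list, and the sample lines are illustrative and will need adjusting for custom log formats. It catches only this one pattern, so a shell with a .php or .aspx name would need a different query.

```python
import re
from typing import Iterable

# Matches the leading fields of a common/combined log format line.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)

# Illustrative list of extensions that should never receive a POST.
STATIC_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".css", ".ico", ".svg", ".woff", ".woff2")


def posts_to_static(lines: Iterable[str]) -> list[tuple[str, str, str, int]]:
    """Return (client_ip, timestamp, path, status) for POSTs to static files."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m or m["method"] != "POST":
            continue
        path_only = m["path"].split("?", 1)[0].lower()   # ignore query string
        if path_only.endswith(STATIC_EXTENSIONS):
            hits.append((m["ip"], m["ts"], m["path"], int(m["status"])))
    return hits


# Hypothetical log lines using documentation IP ranges.
sample = [
    '203.0.113.9 - - [12/Mar/2024:02:14:07 +0000] "POST /uploads/logo.jpg?c=id HTTP/1.1" 200 512 "-" "curl/8.4.0"',
    '198.51.100.4 - - [12/Mar/2024:02:15:00 +0000] "GET /static/site.css HTTP/1.1" 200 1043 "-" "Mozilla/5.0"',
    '198.51.100.4 - - [12/Mar/2024:02:15:09 +0000] "POST /api/login HTTP/1.1" 302 0 "-" "Mozilla/5.0"',
]
print(posts_to_static(sample))
```

A 200 response to such a request is the more worrying case, since it suggests the server executed or accepted something at a path that should only ever be read.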

Investigation

  • Locate the web shell — search web root for recently modified files, files with PHP/ASPX execution capability in unexpected directories, files with high entropy content
  • Parse web server access logs for requests to the shell — what commands were run? What data was accessed?
  • Determine initial access vector — file upload vulnerability? RCE? Exploitation of a vulnerable component? Patch assessment is required.
  • Check for lateral movement — did the attacker use the web shell as a pivot point to reach internal systems?

Containment

  • Remove the web shell from the filesystem
  • Patch the vulnerability used for initial access — do not restore the server without patching first
  • Review all service accounts associated with the web application — they may have been used for lateral movement
  • Consider temporary WAF block rules or server isolation while investigation continues
Key Takeaways — Chapter 11
  • Pattern recognition across incident types is what separates experienced analysts from junior ones — exposure to multiple incident types is the fastest way to develop it
  • Ransomware response speed determines blast radius — isolate first, investigate simultaneously, not sequentially
  • BEC wire fraud has a narrow recovery window (hours, not days) — financial escalation must happen in parallel with technical IR
  • Insider threat investigations must involve legal and HR from the first moment — containment without their sign-off creates legal liability
  • Web shell infections commonly persist for months before detection — the investigation scope should extend back at least 90 days from discovery
Chapter 12 ·  ~12 min read  ·  Post-Incident

Post-Incident Activity

Lessons-learned meetings, post-incident report structure, metrics, and feeding findings back into detection engineering

The lessons-learned phase is where most organisations fail to extract value from their incidents. The immediate impulse after a difficult incident — particularly a public or embarrassing one — is to close the ticket, brief the board, and move on. Organisations that do this are purchasing the same incident again. The lessons-learned process, done well, is the mechanism by which every incident makes the programme stronger.

The Lessons-Learned Meeting

The lessons-learned meeting should be held within two weeks of incident closure — close enough to the event that memories are fresh, but distant enough that immediate stress has dissipated and the post-incident report is available as a reference. It is not a blame meeting. It is a systems analysis meeting.

Attendees should include all members of the core IR team, plus representation from the stakeholder groups that were involved: legal, communications, IT operations, and relevant business unit leads. Leadership should attend or receive the output — but the meeting is most useful when it is a working session, not a presentation.

The agenda covers:

  1. Timeline review — walk through the incident timeline from detection to closure. Ensure everyone has the same understanding of what happened and when.
  2. What went well — explicitly document what the team did effectively. Detection that worked. Containment that was fast. Communication that was clear. This is not empty praise — it identifies practices worth preserving and scaling.
  3. What did not go well — gaps in detection, delays in containment, communication failures, tooling limitations, process steps that were unclear or skipped. Documented without attribution to individuals; the question is "what in our system or process contributed to this?" not "who is to blame?"
  4. Root cause analysis — why did this incident happen? Not just the proximate technical cause, but the underlying control gap. A five-whys analysis often reveals that the real issue is a training gap, a process gap, or a resource gap — not the specific technical vulnerability that was exploited.
  5. Actionable recommendations — each finding gets a recommendation, an owner, and a deadline. Unassigned recommendations do not get implemented.
Blameless Culture

A blameless post-mortem culture — borrowed from SRE (Site Reliability Engineering) practice — is not the same as a no-accountability culture. People are accountable for following process; the process itself is accountable for being designed well. If an analyst made a mistake, the question is: why did the process allow that mistake to cause harm? Could better tooling, training, or runbook design have prevented it? Attribution to individuals produces fear; attribution to systems produces improvement.

The Post-Incident Report

The post-incident report (PIR) is the formal written record of the incident. It serves multiple purposes: internal documentation, management briefing, regulatory evidence if required, and institutional memory for future incidents of the same type. A well-structured PIR has:

  • Executive Summary (1 page) — non-technical overview. What happened, when, what was affected, what was done, and what the current status is. Written for the CISO, CEO, and board. No jargon.
  • Incident Timeline — the complete chronological record from initial detection to incident closure, with timestamps and source references.
  • Technical Analysis — detailed narrative of attacker activity: initial access, lateral movement, data accessed, persistence mechanisms, tools and techniques observed. ATT&CK technique references should be included.
  • Impact Assessment — systems affected, data potentially compromised, business operations disrupted, estimated cost.
  • Root Cause — the underlying vulnerability, control gap, or process failure that enabled the incident.
  • Containment and Eradication Summary — what was done, by whom, and when.
  • Recommendations — prioritised list of improvements with owners and target dates.
  • Appendices — raw evidence, log excerpts, IOC list, YARA rules or other artefacts useful for future reference.

IR Metrics

You cannot manage what you cannot measure. A mature IR programme tracks operational metrics that reveal how the programme is performing and where it needs investment:

Metric | Definition | What It Reveals
MTTD | Mean Time to Detect: average time between an attacker gaining access and the organisation detecting it | Detection coverage and tuning quality
MTTR | Mean Time to Respond / Remediate: average time from detection to incident closure | Response process efficiency; resourcing
MTTC | Mean Time to Contain: average time from detection to containment of the active threat | Containment playbook effectiveness
False Positive Rate | Percentage of SIEM/EDR alerts that are false positives | Detection content quality; analyst workload burden
Incidents per Month | Volume of confirmed incidents by severity level | Threat landscape trends; preventive control effectiveness
Repeat Incident Rate | Percentage of incidents involving the same root cause or technique as a prior incident | Lessons-learned implementation effectiveness
Recommendations Closed % | Percentage of post-incident recommendations implemented on schedule | Organisational commitment to IR improvement
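The three time-based metrics follow directly from four timestamps per incident. A minimal sketch using the definitions in the table; the `Incident` record and the sample values are hypothetical, and in practice `compromised_at` is an estimate taken from the forensic timeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable


@dataclass
class Incident:
    compromised_at: datetime   # estimated initial access
    detected_at: datetime
    contained_at: datetime
    closed_at: datetime


def _mean_hours(deltas: Iterable[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 3600


def ir_metrics(incidents: list[Incident]) -> dict[str, float]:
    """MTTD, MTTC and MTTR in hours, using the definitions from the table."""
    return {
        "MTTD_hours": _mean_hours(i.detected_at - i.compromised_at for i in incidents),
        "MTTC_hours": _mean_hours(i.contained_at - i.detected_at for i in incidents),
        "MTTR_hours": _mean_hours(i.closed_at - i.detected_at for i in incidents),
    }


# Hypothetical incident records.
incidents = [
    Incident(datetime(2024, 1, 1), datetime(2024, 1, 3),
             datetime(2024, 1, 3, 6), datetime(2024, 1, 10)),
    Incident(datetime(2024, 2, 1), datetime(2024, 2, 2),
             datetime(2024, 2, 2, 2), datetime(2024, 2, 5)),
]
print(ir_metrics(incidents))
# {'MTTD_hours': 36.0, 'MTTC_hours': 4.0, 'MTTR_hours': 120.0}
```

Means are sensitive to a single long-running incident, so it can be worth tracking the median alongside them.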

Feeding Findings Back into Detection Engineering

Every incident is a detection engineering opportunity. The attacker techniques observed in the investigation should be cross-referenced against your existing detection content. For every technique that was used without triggering an alert, ask: why didn't we detect this? Is there a log source gap? A query gap? A tuning problem?

The output of this review — called a detection gap analysis — drives a prioritised queue of new detection content to build. This is how a mature SOC progressively improves its detection coverage over time: not by buying new tools, but by systematically translating operational experience into better detection rules.
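At its simplest, a detection gap analysis is a comparison between three sets: the techniques observed in the incident, the techniques that actually raised an alert, and the techniques your detection catalogue claims to cover. The sketch below is one hypothetical way to structure that comparison; the function name and the sample sets are made up, and the IDs shown are MITRE ATT&CK sub-technique identifiers (T1566.001 Spearphishing Attachment, T1053.005 Scheduled Task, T1546.003 WMI Event Subscription).

```python
def detection_gaps(observed: set[str], alerted: set[str],
                   catalogue: set[str]) -> dict[str, str]:
    """Classify each observed ATT&CK technique by detection outcome."""
    report = {}
    for technique in sorted(observed):
        if technique in alerted:
            report[technique] = "detected"
        elif technique in catalogue:
            # A rule exists on paper but stayed silent during the incident.
            report[technique] = "rule exists, did not fire (log source or tuning gap)"
        else:
            report[technique] = "no detection content (query gap)"
    return report


# Hypothetical findings from one incident.
observed = {"T1566.001", "T1053.005", "T1546.003"}
alerted = {"T1566.001"}
catalogue = {"T1566.001", "T1053.005"}

for technique, outcome in detection_gaps(observed, alerted, catalogue).items():
    print(technique, "->", outcome)
```

The middle category is often the most useful one to chase, because it exposes rules that are assumed to work but depend on a log source that is missing or a threshold that is set too loosely.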

Key Takeaways — Chapter 12
  • Lessons-learned meetings must be blameless-by-design — attribution to systems produces improvement; attribution to individuals produces fear
  • The post-incident report serves internal documentation, management briefing, and regulatory evidence purposes simultaneously
  • MTTD and MTTR are the two primary IR health metrics — track both, trend both, and understand the difference between them
  • High repeat incident rates indicate lessons-learned recommendations are not being implemented
  • Detection gap analysis — mapping incident TTPs to missing alert coverage — is how mature SOCs continuously improve detection
Chapter 13 ·  ~12 min read  ·  Legal & Compliance

Legal, Compliance & Communication

Breach notification, working with law enforcement, legal holds, external communications, and cyber insurance

The legal and communications dimension of incident response is where many technically excellent IR teams lose control of the incident. Technical responders who speak directly to journalists, regulatory notifications that go out with inaccurate scope assessments, forensic reports that are handed to law enforcement without legal counsel involvement — these are avoidable failures with serious consequences. This chapter covers the non-technical aspects of IR that determine outcomes as much as any forensic technique.

Breach Notification Requirements

We covered the major frameworks (GDPR, HIPAA, PCI-DSS) in Chapter 2. In practice, breach notification management requires a structured process:

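  1. Threshold determination: does this incident meet the regulatory definition of a breach? Under GDPR, a personal data breach is a breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorised disclosure of, or access to, personal data. Under HIPAA, it is the impermissible acquisition, access, use, or disclosure of unsecured PHI. The thresholds vary and have specific technical and legal definitions, so do not make this determination without legal counsel.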
  2. Scope assessment — how many individuals are affected? What categories of data were involved? This drives both notification obligation (some regulations only require notification above certain individual counts) and the content of the notification.
  3. Notification drafting: regulatory notification letters have mandatory content requirements. GDPR notifications to supervisory authorities must include the nature of the breach, the categories and approximate numbers of individuals and records affected, the name and contact details of the data protection officer or other contact point, the likely consequences, and the measures taken or planned. Omitting required content results in a non-compliant notification.
  4. Timing management — maintain a notification timeline document tracking each applicable regulation, its notification deadline, the clock start event, and current status. This document is reviewed daily during an active breach notification process.
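The notification timeline document described in step 4 can start life as a very small data structure. The sketch below is illustrative only: the `Obligation` record and helper function are hypothetical, and the two sample windows (72 hours for the GDPR supervisory-authority notification, 60 days as the HIPAA outer limit) are included as examples. The applicable deadlines and the event that starts each clock should always be confirmed with legal counsel.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Obligation:
    regulation: str
    clock_start: datetime      # the event that starts the clock, per legal counsel
    window: timedelta
    status: str = "pending"    # "pending" or "sent"

    @property
    def deadline(self) -> datetime:
        return self.clock_start + self.window

    def hours_remaining(self, now: datetime) -> float:
        return (self.deadline - now).total_seconds() / 3600


def by_urgency(obligations: list[Obligation]) -> list[Obligation]:
    """Unsent obligations, nearest deadline first."""
    pending = [o for o in obligations if o.status != "sent"]
    return sorted(pending, key=lambda o: o.deadline)


# Hypothetical incident: the organisation became aware on 6 May, 09:00 UTC.
aware = datetime(2024, 5, 6, 9, 0, tzinfo=timezone.utc)
tracker = [
    Obligation("HIPAA (individual notice)", aware, timedelta(days=60)),
    Obligation("GDPR (supervisory authority)", aware, timedelta(hours=72)),
]

now = datetime(2024, 5, 7, 9, 0, tzinfo=timezone.utc)
for item in by_urgency(tracker):
    print(f"{item.regulation}: due {item.deadline.isoformat()}, "
          f"{item.hours_remaining(now):.0f} h remaining")
```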

In the United States, in addition to federal sector-specific regulations (HIPAA, GLBA, etc.), all 50 states have their own data breach notification laws. Some impose shorter timelines than HIPAA (Florida requires notification within 30 days; Colorado within 30 days for state residents). For organisations handling consumer data across multiple US states, the notification matrix can be complex.

Attorney-Client Privilege in IR

Attorney-client privilege protects confidential communications between a client and their legal counsel from disclosure in legal proceedings. In the context of IR, this has a specific practical implication: if a forensic investigation is conducted at the direction of legal counsel in anticipation of litigation, the findings and reports of that investigation may be protected from production in discovery.

This is known as the Kovel doctrine (from a 1961 US Second Circuit case) — engaging external IR firms through legal counsel, with the purpose of providing legal advice, can bring those IR activities under the attorney-client privilege umbrella. Many organisations structure their IR retainer engagements this way specifically to preserve this protection.

Practical Note

Privilege protection is not automatic and it can be waived. Sharing a privileged IR report with third parties (vendors, insurance carriers, regulators) can waive the privilege. Do not share forensic reports without legal counsel review and approval.

Working with Law Enforcement

The decision to involve law enforcement is consequential and irreversible. Once you file a complaint with the FBI or a local law enforcement agency, you lose control of aspects of the investigation. (CISA also accepts incident reports, but it is a civilian cybersecurity agency, not a law enforcement body.) Law enforcement may execute search warrants on your systems or your attacker's infrastructure in ways that affect your operations. Evidence you provide may enter a criminal case proceeding on its own timeline.

Law enforcement involvement is most appropriate when:

  • The incident involves known criminal activity with identified perpetrators (ransomware gang attribution, nation-state actors)
  • There is an ongoing threat to critical infrastructure or public safety
  • The organisation seeks prosecution of an insider or external attacker
  • Regulatory requirements mandate reporting to law enforcement (certain HIPAA breaches, SEC-regulated entities)
  • The FBI has indicated they have intelligence relevant to your incident (this happens — they will often reach out to organisations being targeted by campaigns they are tracking)

When engaging law enforcement, bring legal counsel. Provide information in response to specific requests rather than proactively sharing the entire forensic record. Understand that your cooperation is voluntary unless a warrant or subpoena is issued.

External Communications

Every word that leaves the organisation during an active incident carries legal and reputational consequences. Communications management during an incident follows strict rules:

  • Single spokesperson — all external communications go through the Communications Lead only. Technical staff do not speak to media, regulators, or customers without explicit approval.
  • Legal review — every customer notification, press statement, and regulatory filing is reviewed by legal counsel before release.
  • Accuracy over speed — it is better to say "we are investigating and will provide an update by [date]" than to issue a statement with incorrect scope that must later be corrected. Corrected breach notifications are significantly more damaging to trust than late ones.
  • Employee communication — employees will talk. Brief them appropriately on what they can and cannot say. Reminding them of their confidentiality obligations in writing during an active incident creates a record and reduces inadvertent disclosure.

Cyber Insurance

Cyber insurance has become a critical component of enterprise risk management. In the context of IR, the insurance carrier is both a resource and a stakeholder with its own interests.

Most cyber insurance policies require prompt notification to the carrier when an incident occurs, often within 24–72 hours. Failing to notify within the required window can void coverage. Read your policy before you need it and add the notification requirement to your IR plan as an explicit action item.

Carriers typically provide access to pre-vetted IR firms, legal counsel, and crisis communications specialists as part of the policy. These services are often high quality and available faster than retaining firms independently during a crisis. However, note that the carrier's interests and the insured's interests are not always identical — the carrier wants to minimise payout; you want maximum coverage and recovery support. Have your own legal representation reviewing the engagement.

Key Takeaways — Chapter 13
  • Breach notification thresholds are regulatory and legal determinations — do not make them without legal counsel
  • Engaging external IR firms through legal counsel may bring investigation findings under attorney-client privilege
  • Law enforcement involvement is consequential and irreversible — decide deliberately and with counsel
  • All external communications must go through a single spokesperson, be legally reviewed, and prioritise accuracy over speed
  • Notify your cyber insurance carrier within the policy's required window — missed notification can void coverage
Chapter 14 ·  ~14 min read  ·  Cloud IR

IR in the Cloud

Shared responsibility, cloud-native logging, tooling differences, container and serverless IR, and cloud-specific attack patterns

The migration of workloads to cloud environments has fundamentally changed the incident response landscape. Many of the techniques described in earlier chapters — memory acquisition, disk imaging, network capture — either work differently in cloud environments or don't work at all. New attack patterns specific to cloud environments have emerged. And the shared responsibility model has created a category of incidents where the question "who is responsible for the forensics?" is genuinely unclear. This chapter addresses cloud IR for AWS, Azure, and GCP.

The Shared Responsibility Model

Every major cloud provider publishes a shared responsibility model that defines which security controls the provider manages and which the customer is responsible for. The specifics vary by service type:

For IaaS (EC2, Azure VMs, Compute Engine): the cloud provider secures the physical infrastructure, hypervisor, and network fabric. The customer is responsible for the operating system, application stack, data, and network access controls. IR above the hypervisor layer is entirely the customer's responsibility.

For PaaS (RDS, Lambda, App Service, Cloud Functions): the provider manages more of the stack. The customer's responsibility shrinks toward the application and data layers — but forensic access to the underlying infrastructure is correspondingly reduced or eliminated.

For SaaS (Microsoft 365, Salesforce, Google Workspace): the provider manages almost everything. Customer forensic capability is limited to what the provider exposes through APIs and audit logs, which varies dramatically by provider and tier.

Common Misconception

The shared responsibility model does not mean the cloud provider helps you respond to incidents affecting your cloud environment. AWS, Azure, and GCP will not provide forensic assistance, cooperate with your investigators, or share infrastructure-level data without a valid legal process. Your security is your responsibility. The providers offer security tools (GuardDuty, Defender for Cloud, Security Command Center) but not security services.

Cloud-Native Logging

AWS

  • CloudTrail — logs all AWS API calls: who called what API, from where, with what parameters. The single most important log source for AWS IR. Must be enabled in all regions; management events and data events are separate configurations. Logs flow to S3 and optionally to CloudWatch Logs.
  • VPC Flow Logs — network traffic metadata at the VPC, subnet, or ENI level. Contains source/destination IP and port, protocol, and bytes transferred. Does not contain payload data.
  • GuardDuty — managed threat detection service that analyses CloudTrail, VPC Flow Logs, and DNS logs for threats. Generates findings that map to ATT&CK techniques. Strongly recommended as a baseline detection layer.
  • S3 Access Logs / S3 Server Access Logging — records requests made to S3 buckets. Essential for investigating data exfiltration via S3.
  • CloudWatch Logs — aggregation point for logs from EC2 instances, Lambda functions, RDS, and other services. Can be queried with CloudWatch Logs Insights.
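
As a rough illustration of the "who called what API, from where" view that CloudTrail provides, the sketch below summarises a batch of records by caller, API call, and source IP. It assumes the log file has already been downloaded from S3 and decompressed into CloudTrail's {"Records": [...]} JSON layout. The sample records and the function name summarise_cloudtrail are invented for illustration.

```python
import json
from collections import Counter

ALICE = {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/alice"}
WEB_ROLE = {"type": "AssumedRole",
            "arn": "arn:aws:sts::111122223333:assumed-role/web-role/i-0abc1234def567890"}

# Hypothetical records in the CloudTrail log-file layout: {"Records": [...]}.
SAMPLE_LOG = json.dumps({"Records": [
    {"eventTime": "2024-03-01T10:15:00Z", "eventSource": "iam.amazonaws.com",
     "eventName": "CreateUser", "sourceIPAddress": "203.0.113.10",
     "userIdentity": ALICE},
    {"eventTime": "2024-03-01T10:16:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "ListBuckets", "sourceIPAddress": "203.0.113.10",
     "userIdentity": ALICE},
    {"eventTime": "2024-03-01T10:17:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "ListBuckets", "sourceIPAddress": "203.0.113.10",
     "userIdentity": ALICE},
    {"eventTime": "2024-03-01T10:18:00Z", "eventSource": "ec2.amazonaws.com",
     "eventName": "DescribeInstances", "sourceIPAddress": "10.0.1.25",
     "userIdentity": WEB_ROLE},
]})


def summarise_cloudtrail(raw_json):
    """Count (caller ARN, service:API, source IP) triples in one log file."""
    records = json.loads(raw_json).get("Records", [])
    summary = Counter()
    for rec in records:
        caller = rec.get("userIdentity", {}).get("arn", "unknown")
        api = f'{rec.get("eventSource", "?")}:{rec.get("eventName", "?")}'
        summary[(caller, api, rec.get("sourceIPAddress", "?"))] += 1
    return summary


if __name__ == "__main__":
    for (caller, api, ip), n in summarise_cloudtrail(SAMPLE_LOG).most_common():
        print(f"{n:>3}  {caller}  {api}  from {ip}")
```

In practice the same grouping would more likely be run in a SIEM or with CloudWatch Logs Insights; the point here is only which fields carry the investigative value.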

Azure

  • Azure Activity Log — subscription-level operations: resource creation, deletion, modification. The Azure equivalent of CloudTrail for control plane operations.
  • Azure Monitor / Diagnostic Settings — data plane logs for individual resources. Must be configured per-resource; not enabled by default.
  • Microsoft Defender for Cloud — security posture management and threat detection. Generates alerts for common attack patterns across Azure workloads.
  • Azure AD (now Microsoft Entra ID) Sign-in Logs and Audit Logs — authentication events and directory changes. Critical for identity-based IR.
  • Microsoft Sentinel — Azure-native SIEM. Natively ingests all Azure log sources plus Microsoft 365, Defender products, and third-party connectors.

Cloud-Specific Attack Patterns

SSRF → Metadata Service Abuse

Server-Side Request Forgery (SSRF) in a cloud-hosted application can be used to reach the instance metadata service — a local endpoint (169.254.169.254 on AWS) that provides temporary credentials for the IAM role attached to the instance. If an SSRF vulnerability exists, an attacker can retrieve those credentials and use them from anywhere on the internet to access AWS services with the instance's permissions. Detection: CloudTrail API calls originating from unusual source IPs using instance metadata credentials; unusual GetCallerIdentity calls.
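
The detection idea above can be sketched as a filter over parsed CloudTrail records: find calls made with EC2 instance-role credentials whose source IP falls outside the ranges your instances normally use. This is a minimal sketch under stated assumptions. EXPECTED_NETWORKS is a hypothetical list you would replace with your own VPC and NAT egress ranges, and the check relies on instance-role sessions typically being named after the instance ID (i-...), which is worth confirming in your own logs.

```python
import ipaddress

# Hypothetical: ranges from which your instances legitimately call AWS APIs.
EXPECTED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("198.51.100.0/24"),  # e.g. NAT gateway egress range
]


def is_instance_role_session(record):
    """Instance-role credentials show up as AssumedRole sessions whose
    session name (last ARN segment) is typically the instance ID."""
    identity = record.get("userIdentity", {})
    session_name = identity.get("arn", "").rsplit("/", 1)[-1]
    return identity.get("type") == "AssumedRole" and session_name.startswith("i-")


def used_off_instance(record, expected=EXPECTED_NETWORKS):
    """True when the call came from an IP outside every expected range."""
    try:
        source = ipaddress.ip_address(record.get("sourceIPAddress", ""))
    except ValueError:
        # AWS service callers appear as names such as "ec2.amazonaws.com".
        return False
    return not any(source in net for net in expected)


def suspicious_credential_use(records):
    """Records where instance-role credentials were used from elsewhere."""
    return [r for r in records
            if is_instance_role_session(r) and used_off_instance(r)]
```

A hit from this filter is a lead, not a verdict: a newly added egress path would produce the same pattern, so the expected ranges need to be kept current.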

IAM Privilege Escalation

Cloud IAM is complex, and misconfigurations are common. An attacker with limited IAM permissions may be able to escalate to full administrator access through techniques such as: creating a new IAM user with admin policy attachment, passing a privileged IAM role to a Lambda function they control, or exploiting iam:PassRole + ec2:RunInstances to launch an instance with a powerful role. Tools such as Pacu (from Rhino Security Labs) and Cloudsplaining (from Salesforce) help enumerate these paths. Detection: CloudTrail events for IAM policy attachment, role assumption chains, and new administrator user creation.
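
One way to picture the CloudTrail side of that detection is a pass over parsed records that flags attachment of the AWS-managed AdministratorAccess policy and creation of new IAM users. The sketch below assumes records are already loaded as Python dicts; the function name escalation_findings and the choice of events are illustrative, not a complete rule set.

```python
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"
ATTACH_EVENTS = {"AttachUserPolicy", "AttachRolePolicy", "AttachGroupPolicy"}


def escalation_findings(records):
    """Return (eventTime, actor ARN, description) for suspicious IAM changes."""
    findings = []
    for rec in records:
        if rec.get("eventSource") != "iam.amazonaws.com":
            continue
        name = rec.get("eventName")
        # requestParameters can be null in CloudTrail records.
        params = rec.get("requestParameters") or {}
        actor = rec.get("userIdentity", {}).get("arn", "unknown")
        when = rec.get("eventTime")
        if name in ATTACH_EVENTS and params.get("policyArn") == ADMIN_POLICY_ARN:
            target = (params.get("userName") or params.get("roleName")
                      or params.get("groupName"))
            findings.append((when, actor, f"{name}: AdministratorAccess -> {target}"))
        elif name == "CreateUser":
            findings.append((when, actor, f"CreateUser: {params.get('userName')}"))
    return findings
```

Note that iam:PassRole is a permission check rather than an API call of its own, so abuse of it would have to be inferred from the calls that pass a role (for example RunInstances with an instance profile), which this sketch does not cover.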

S3 Data Exfiltration

S3 buckets containing sensitive data are a common target. Exfiltration patterns include bulk GetObject operations from unusual source IPs, presigned URL generation for large numbers of objects, and replication (including cross-account replication) configured to copy data to an attacker-controlled bucket. Detection requires S3 server access logging or CloudTrail data events for S3 to be enabled; neither is on by default.
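
The bulk GetObject pattern lends itself to a simple count per source IP. The sketch below assumes CloudTrail data events for S3 are enabled and already parsed into dicts; known_ips and the min_objects threshold are hypothetical values that would need tuning against normal access patterns.

```python
from collections import defaultdict


def bulk_download_sources(records, known_ips, min_objects=100):
    """Return {source_ip: {bucket: object_count}} for unfamiliar IPs that
    fetched at least min_objects objects in the supplied records."""
    counts = defaultdict(lambda: defaultdict(int))
    for rec in records:
        if (rec.get("eventSource") == "s3.amazonaws.com"
                and rec.get("eventName") == "GetObject"):
            ip = rec.get("sourceIPAddress", "?")
            bucket = (rec.get("requestParameters") or {}).get("bucketName", "?")
            counts[ip][bucket] += 1
    return {
        ip: dict(buckets)
        for ip, buckets in counts.items()
        if ip not in known_ips and sum(buckets.values()) >= min_objects
    }
```

A fixed threshold is the crudest possible baseline; comparing each source against its own history would reduce false positives, at the cost of needing that history.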

Container and Serverless IR

Containers introduce ephemeral workloads — a compromised container may be destroyed and replaced before forensic collection is possible. Key considerations:

  • Container runtime logs (Docker daemon logs, containerd logs) and Kubernetes audit logs are the primary forensic sources
  • Memory forensics of containers requires host-level access to the container runtime; standard memory acquisition tools do not work inside containers
  • Kubernetes RBAC misconfigurations are a significant attack surface — privilege escalation via service account token abuse, etcd exposure, and API server access control failures
  • For Kubernetes: kubectl get events, audit log review, and Falco (runtime security tool) are the core IR toolset
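
To make the audit log review step more concrete, the sketch below scans Kubernetes audit events (one JSON object per line, as written by the log backend) for two patterns tied to the risks above: a service account reading secrets outside its own namespace, and exec into a pod. The two rules and the function names are illustrative assumptions, not a complete detection set.

```python
import json

SA_PREFIX = "system:serviceaccount:"  # usernames look like system:serviceaccount:<ns>:<name>


def classify(event):
    """Return a short reason string if the audit event looks suspicious."""
    user = event.get("user", {}).get("username", "")
    ref = event.get("objectRef") or {}
    if ref.get("resource") == "pods" and ref.get("subresource") == "exec":
        return "exec into pod"
    if (user.startswith(SA_PREFIX) and ref.get("resource") == "secrets"
            and event.get("verb") in {"get", "list", "watch"}):
        parts = user.split(":")
        sa_namespace = parts[2] if len(parts) > 3 else ""
        if ref.get("namespace") != sa_namespace:
            return "service account read secrets outside its namespace"
    return None


def review_audit_log(lines):
    """Yield-style scan: return (timestamp, username, reason) per finding."""
    findings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        reason = classify(event)
        if reason:
            findings.append((event.get("requestReceivedTimestamp"),
                             event.get("user", {}).get("username"), reason))
    return findings
```

Because containers may be gone by the time anyone looks, the value of this kind of review depends on audit logging having been enabled and shipped off-cluster beforehand.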

For serverless functions (Lambda, Azure Functions, Cloud Run), standard forensics do not apply — there is no persistent filesystem or memory to image. The forensic record is entirely in logs: CloudWatch Logs for Lambda output, VPC Flow Logs for network connections, and X-Ray for distributed tracing.
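
Since the network side of that record lives in VPC Flow Logs (for functions attached to a VPC), here is a minimal sketch of working with them: parse lines in the default version 2 format and total the bytes sent from internal addresses to external destinations. The internal range 10.0.0.0/8 and the sample lines are assumptions for illustration; custom flow log formats would need a different field list.

```python
import ipaddress
from collections import Counter

# Field order of the default (version 2) flow log format.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]
INTERNAL = ipaddress.ip_network("10.0.0.0/8")  # hypothetical internal range

SAMPLE_LINES = [
    "2 123456789010 eni-0a1b2c3d4e5f67890 10.0.1.25 203.0.113.50 49152 443 6 120 5242880 1709287200 1709287260 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d4e5f67890 10.0.1.25 10.0.2.8 49153 5432 6 10 840 1709287200 1709287260 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d4e5f67890 10.0.1.25 203.0.113.50 49154 443 6 80 1048576 1709287260 1709287320 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d4e5f67890 - - - - - - - 1709287320 1709287380 - NODATA",
]


def parse_flow_line(line):
    """Return a dict for a data-bearing record, or None for anything else."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    rec = dict(zip(FIELDS, parts))
    if rec["log_status"] != "OK":  # NODATA/SKIPDATA rows carry "-" placeholders
        return None
    for key in ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end"):
        rec[key] = int(rec[key])
    return rec


def egress_bytes(lines, internal=INTERNAL):
    """Total accepted bytes per (internal source, external destination, port)."""
    totals = Counter()
    for line in lines:
        rec = parse_flow_line(line)
        if rec is None or rec["action"] != "ACCEPT":
            continue
        src = ipaddress.ip_address(rec["srcaddr"])
        dst = ipaddress.ip_address(rec["dstaddr"])
        if src in internal and dst not in internal:
            totals[(rec["srcaddr"], rec["dstaddr"], rec["dstport"])] += rec["bytes"]
    return totals
```

Flow logs show that a connection happened and how much data moved, not what was sent, so they are best read alongside the function's own CloudWatch output.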

Key Takeaways — Chapter 14
  • The shared responsibility model does not mean the cloud provider assists with your IR — your cloud security is your responsibility
  • CloudTrail (AWS), Azure Activity Log, and GCP Audit Logs are the control-plane equivalents of on-premises authentication logs — they must be enabled in all regions
  • SSRF → metadata service credential theft is a cloud-specific attack pattern that produces CloudTrail anomalies detectable without payload inspection
  • Container workloads are ephemeral — forensic collection must happen before container termination; Kubernetes audit logs are the primary forensic source
  • Serverless forensics is entirely log-based — there are no filesystems or memory dumps to collect
Chapter 15 ·  ~11 min read  ·  Programme Maturity

Metrics, Maturity & Continuous Improvement

IR maturity models, KPIs, building a lessons-learned database, threat intel integration, and moving from reactive to proactive IR

An incident response programme is not a destination — it is a trajectory. The organisations that perform best during incidents are the ones that have been systematically building, measuring, and improving their IR capability over years. This final chapter addresses the question that separates operational IR from programme management: how do you know if you are getting better?

IR Maturity Models

Maturity models provide a structured framework for assessing where a programme currently sits and what improvements would advance it to the next level. Several models are in common use:

The Capability Maturity Model Integration (CMMI) applied to IR describes five maturity levels that roughly map as follows:

Level | Name | Characteristics
1 | Initial | Ad hoc, unpredictable. IR depends on individual heroics. No documented plan. Response is chaotic but may succeed through effort.
2 | Managed | Documented IR plan exists. Basic escalation paths defined. Some tooling in place. Response is repeatable for common incident types.
3 | Defined | Standardised playbooks for all common incident types. Regular tabletop exercises. Metrics are tracked. Post-incident reviews consistently conducted.
4 | Quantitatively Managed | Metrics drive decisions. MTTD/MTTR targets are set and measured. Detection coverage is mapped to ATT&CK. Trends are identified and addressed proactively.
5 | Optimising | Continuous improvement is embedded in the programme. Threat intelligence drives proactive detection. Purple team exercises regularly test and improve capability. IR feeds into product and architecture decisions.

Most enterprise organisations operate at Level 2–3. The gap between Level 3 and Level 4 — moving from documented to data-driven — is where measurement and accountability become critical. The gap between Level 4 and Level 5 — moving from reactive to genuinely proactive — is where threat intelligence integration and continuous detection engineering deliver their full value.

Building a Lessons-Learned Database

Individual lessons-learned meetings are valuable. A structured database of findings across all incidents over time is transformative. When each incident's root causes, techniques observed, detection gaps, and recommendations are recorded in a searchable format, patterns emerge that would be invisible from any single incident report:

  • Which initial access vectors recur most frequently? (Informs preventive control investment)
  • Which ATT&CK techniques go undetected most often? (Informs detection engineering priorities)
  • Which recommendations are implemented versus stalled? (Reveals organisational friction points)
  • Which incident types have the highest MTTD? (Informs sensor coverage gaps)

Even a simple shared spreadsheet tracking incident type, ATT&CK techniques, detection method, time to detect, time to contain, root cause, and recommendation status — maintained consistently over 12–18 months — generates insight that no security tool produces.
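
To show how little tooling this takes, the sketch below reads a CSV export of such a spreadsheet and answers three of the questions listed above. The column names and the four sample rows are hypothetical; any consistent schema would work the same way.

```python
import csv
import io
from collections import Counter, defaultdict
from statistics import median

# Hypothetical export of the lessons-learned spreadsheet.
SAMPLE_CSV = (
    "incident_type,initial_vector,attack_technique,detected_by,"
    "hours_to_detect,hours_to_contain,recommendation_status\n"
    "ransomware,phishing,T1566,EDR alert,30,12,implemented\n"
    "bec,phishing,T1566,user report,200,4,stalled\n"
    "web compromise,exposed service,T1190,WAF alert,6,10,implemented\n"
    "ransomware,valid accounts,T1078,user report,90,20,stalled\n"
)


def load_incidents(csv_text):
    """Parse the CSV text into a list of row dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def recurring_vectors(incidents):
    """Initial access vectors, most frequent first."""
    return Counter(row["initial_vector"] for row in incidents).most_common()


def median_hours_to_detect(incidents):
    """Median detection time per incident type, in hours."""
    by_type = defaultdict(list)
    for row in incidents:
        by_type[row["incident_type"]].append(float(row["hours_to_detect"]))
    return {kind: median(hours) for kind, hours in by_type.items()}


def stalled_recommendations(incidents):
    """Rows whose recommendation has not been implemented."""
    return [row for row in incidents if row["recommendation_status"] == "stalled"]


if __name__ == "__main__":
    rows = load_incidents(SAMPLE_CSV)
    print(recurring_vectors(rows))
    print(median_hours_to_detect(rows))
    print(len(stalled_recommendations(rows)), "stalled recommendations")
```

Medians are used rather than means because a single long-dwell incident can otherwise dominate a small data set.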

Threat Intelligence Integration

A mature IR programme integrates threat intelligence proactively, not just reactively (using IOCs during active investigations). Proactive intelligence integration means:

  • Sector-specific threat intelligence — subscribing to ISACs (Information Sharing and Analysis Centers) relevant to your industry (FS-ISAC for finance, H-ISAC for healthcare, etc.) to receive early warning of campaigns targeting your sector
  • Threat intelligence platform (TIP) — aggregating and operationalising feeds from commercial providers (Recorded Future, Mandiant Advantage, CrowdStrike Falcon Intelligence), government sources (CISA, NCSC), and open-source feeds (AlienVault OTX, MISP communities)
  • Threat hunting — using threat intelligence to generate hypotheses about adversary techniques that may be present in your environment, and proactively hunting for evidence of them rather than waiting for an alert
  • Adversary profiling — maintaining awareness of the threat actors most likely to target your organisation based on sector, geography, and profile, and mapping their TTPs against your detection coverage
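
Mapping adversary TTPs against detection coverage, as described in the last item, reduces to a set comparison once both sides are expressed as ATT&CK technique IDs. The sketch below uses invented actor names and an invented detection inventory; only the technique IDs are real ATT&CK identifiers.

```python
# Hypothetical actor profiles, expressed as ATT&CK technique IDs.
ACTOR_TTPS = {
    "Hypothetical Actor A": {"T1566", "T1078", "T1059", "T1486"},
    "Hypothetical Actor B": {"T1190", "T1078", "T1021"},
}

# Hypothetical inventory: techniques with at least one tested detection rule.
DETECTED = {"T1566", "T1059", "T1190"}


def coverage_gaps(actor_ttps, detected):
    """Techniques each actor uses that no detection currently covers."""
    return {actor: sorted(ttps - detected) for actor, ttps in actor_ttps.items()}


def coverage_ratio(ttps, detected):
    """Fraction of an actor's techniques that have detection coverage."""
    return len(ttps & detected) / len(ttps) if ttps else 1.0


if __name__ == "__main__":
    for actor, gaps in coverage_gaps(ACTOR_TTPS, DETECTED).items():
        ratio = coverage_ratio(ACTOR_TTPS[actor], DETECTED)
        print(f"{actor}: {ratio:.0%} covered, gaps: {', '.join(gaps) or 'none'}")
```

Technique-level coverage is a coarse measure, since one rule rarely catches every variant of a technique, but gaps that appear for several relevant actors are a reasonable place to start detection engineering work.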

Moving from Reactive to Proactive IR

The reactive IR posture — waiting for alerts, responding to confirmed incidents, closing tickets — is the baseline that every organisation must meet. The proactive posture is what separates organisations that discover intrusions in days from those that discover them after months of dwell time.

The key investments that shift an IR programme from reactive to proactive:

  • Threat hunting programme — dedicated hunt cycles (typically 2-week cadences) where analysts proactively search for adversary TTPs in log data without relying on automated alerts. Hunt hypotheses are drawn from threat intelligence and past incident findings.
  • Detection engineering function — a dedicated role (or team) continuously building, testing, and improving detection content. New techniques observed in incidents become detection rules within days, not quarters.
  • Red team programme — internal or contracted red team operations that test realistic attack scenarios against your real environment and real detection capability. Purple team exercises extend this by involving blue team analysts in real time.
  • Intelligence-driven IR planning — rather than having generic playbooks, developing threat-actor-specific response plans for the adversaries most likely to target your organisation.

The Maturity Gap in Practice

Consider an illustrative comparison. A Level 2 organisation discovers a compromise when a user reports strange behaviour. The dwell time is 47 days. A Level 4 organisation discovers the same attacker technique through a threat hunting operation triggered by new threat intelligence, 8 hours after the initial access. The technique is the same in both cases. The difference lies entirely in programme maturity, not luck or budget.

The IR Programme Review

Beyond individual incident reviews, a mature programme conducts a quarterly programme review covering: metric trends (MTTD, MTTR, false positive rate), recommendations backlog status, detection coverage gaps, upcoming threats relevant to the sector, and resourcing and tooling gaps. This review is the mechanism by which IR programme investment is justified, prioritised, and tracked.
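
The metric trends in that review can be computed from a handful of timestamps per incident. The sketch below assumes one common pair of definitions, MTTD as the mean time from initial compromise to detection and MTTR as the mean time from detection to containment; organisations define these differently, so treat the definitions, field names, and sample incidents as assumptions.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with ISO 8601 timestamps.
INCIDENTS = [
    {"id": "INC-001", "occurred": "2024-01-03T08:00",
     "detected": "2024-01-05T08:00", "contained": "2024-01-05T20:00"},
    {"id": "INC-002", "occurred": "2024-02-10T09:00",
     "detected": "2024-02-10T17:00", "contained": "2024-02-11T01:00"},
    {"id": "INC-003", "occurred": "2024-03-01T00:00",
     "detected": "2024-03-02T16:00", "contained": "2024-03-03T02:00"},
]


def hours_between(start, end):
    """Elapsed hours between two ISO 8601 timestamp strings."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def mttd_hours(incidents):
    """Mean time to detect: compromise -> detection (assumed definition)."""
    return mean(hours_between(i["occurred"], i["detected"]) for i in incidents)


def mttr_hours(incidents):
    """Mean time to respond: detection -> containment (assumed definition)."""
    return mean(hours_between(i["detected"], i["contained"]) for i in incidents)


def false_positive_rate(false_positives, total_alerts):
    """Share of triaged alerts that turned out not to be incidents."""
    return false_positives / total_alerts if total_alerts else 0.0


if __name__ == "__main__":
    print(f"MTTD: {mttd_hours(INCIDENTS):.1f} h")
    print(f"MTTR: {mttr_hours(INCIDENTS):.1f} h")
    print(f"False positive rate: {false_positive_rate(180, 200):.0%}")
```

Whatever definitions are chosen, the quarter-over-quarter trend is only meaningful if they stay fixed between reviews.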

Key Takeaways — Chapter 15
  • Most enterprise organisations operate at CMMI Level 2–3; the move to Level 4 requires data-driven decision making around MTTD and MTTR
  • A cumulative lessons-learned database reveals programme-level patterns invisible from individual incident reports
  • Threat intelligence integration is proactive when it drives hunt hypotheses and detection engineering — not just reactive IOC blocking
  • The shift from reactive to proactive IR requires dedicated threat hunting, detection engineering, and regular adversarial testing
  • Quarterly programme reviews are the governance mechanism that sustains IR programme investment and improvement over time