Lucent Grid Learning  ·  Security Operations

Security Operations

The complete SOC practitioner's guide — from architecture and data collection through detection engineering, threat intelligence, SIEM query writing, EDR, identity monitoring, cloud coverage, SOAR automation, and the future of the SOC. Fifteen chapters spanning the full discipline.

15 chapters
~3.5 hrs reading
MITRE ATT&CK aligned
Platform-agnostic approach
Chapter 01 · ~12 min · Foundations

What Is a Security Operations Centre?

The SOC defined, its three core missions, evolution from NOC, SOC models, build vs buy, and why the SOC exists from an attacker's perspective

The Security Operations Centre is one of the most misunderstood concepts in enterprise security. Popular imagination, helped along by an unfortunate number of vendor marketing decks, pictures it as a darkened room lined with enormous screens displaying scrolling threat maps and live attack visualisations. The reality is simultaneously less cinematic and more consequential: a SOC is a function, not a facility. It is the organisational capability through which a company monitors its digital environment, detects threats, and coordinates response. It can operate from a single analyst's desk, as a shared service with an MSSP, or from a purpose-built operations floor. What defines it is not the architecture but the mission.

Definition

A Security Operations Centre (SOC) is a team — supported by processes and technology — that continuously monitors an organisation's security posture, detects threats, investigates alerts, and coordinates incident response. Its output is not alerts: it is decisions.

The Three Core SOC Missions

Every SOC activity maps to one of three missions. Keeping this framework in mind prevents the common failure mode of optimising a SOC for one mission at the expense of the others.

  1. Monitor — maintain continuous visibility across the environment. Collect logs, aggregate telemetry, ensure that the data required to detect attacks is flowing, complete, and queryable. Monitoring is the foundation. A SOC that cannot see an attacker cannot detect or respond to one.
  2. Detect — identify adversarial activity within the monitored data. Detection is the analytical core of the SOC — it requires detection content (rules, queries, ML models) and human judgment (analysts who can evaluate what the content surfaces). Good detection means finding real threats quickly; great detection means finding them with acceptable false positive rates.
  3. Respond — take action when a threat is confirmed. Response includes containment, escalation to IR, communication to stakeholders, and documentation. The SOC's response actions vary from minor (blocking an IP) to major (activating the full IR programme for a confirmed breach).

SOC vs CSIRT

The SOC and CSIRT (Computer Security Incident Response Team) are complementary but distinct. The SOC operates continuously — it is always running. It handles the full spectrum of security events, most of which will never become incidents. The CSIRT activates when a declared incident exceeds the SOC's routine response capability. In organisations with separate teams, the SOC detects and performs initial triage; the CSIRT takes ownership of declared incidents. In smaller organisations, the same people do both — wearing the SOC hat day-to-day and the CSIRT hat when a major incident occurs.

SOC Models

| Model | Description | Best Suited For | Key Risk |
|---|---|---|---|
| In-house | Fully staffed internal team, proprietary tools | Large enterprises with mature security budgets and compliance requirements demanding data sovereignty | High cost, staffing difficulty, 24/7 coverage challenging |
| Co-managed | Internal team augmented by MSSP for coverage or specialist skills | Mid-market organisations: internal team owns the environment, MSSP provides 24/7 monitoring and overflow | Knowledge gap at handoff; MSSP may lack environment context |
| Fully managed (MSSP) | External provider operates the SOC entirely | Organisations without security headcount or budget for in-house capability | Limited visibility into methodology; may receive lowest-priority analyst attention |
| Virtual / Hybrid | Distributed team across time zones, may include remote-only analysts | Global organisations, cloud-native companies | Communication and coordination overhead; culture challenges |

Build vs Buy Decision Framework

The build vs buy decision for a SOC is not binary — most organisations land somewhere on a spectrum. The key variables:

  • Data residency and compliance — can your logs leave your jurisdiction? Some regulated environments prohibit sending raw log data to third-party managed services. This may force in-house collection regardless of preference.
  • Institutional knowledge requirement — how much does effective detection and response in your environment depend on knowing your specific architecture, custom applications, and normal behaviour patterns? The more idiosyncratic your environment, the more value an internal team that knows it provides.
  • Staffing reality — the global shortage of experienced SOC analysts means that hiring and retaining a high-quality internal team is genuinely hard. MSSPs can sometimes provide access to talent that the organisation could not hire directly.
  • 24/7 coverage cost — three shifts of analysts with appropriate skill levels costs significantly more than the headline headcount suggests when you factor in management, training, tooling, and attrition.

The Attacker's Perspective on the SOC

Understanding why the SOC matters requires understanding what attackers do to avoid it. Sophisticated threat actors spend significant effort on SOC evasion: staying below alert thresholds, mimicking legitimate administrative activity, living off the land to avoid introducing new binaries, and operating during business hours to blend with normal traffic patterns. Nation-state actors conduct reconnaissance on target organisations' security tools and detection capabilities before beginning an operation.

This adversarial reality has two implications. First, the quality of SOC detection content directly determines which threats are caught and which operate undetected. Second, an attacker who has studied your SOC — your SIEM platform, your detection rules, your response times — has a significant advantage. This is why detection content should be treated as sensitive, why SIEM query logic should not be published externally, and why threat hunting matters: hunting uses methodologies that do not depend on pre-written rules that a sophisticated attacker might have anticipated.

Key Takeaways — Chapter 1
  • The SOC is a function, not a facility — defined by its missions (Monitor, Detect, Respond), not its physical infrastructure
  • SOC and CSIRT are complementary — the SOC runs continuously; the CSIRT activates for declared incidents above routine response capability
  • Model selection depends on data residency, institutional knowledge needs, staffing reality, and 24/7 coverage economics
  • Sophisticated attackers study and attempt to evade SOC detection — detection content quality is a direct security control
Chapter 02 · ~15 min · Architecture

SOC Architecture & Technology Stack

Core technology layers, SIEM architecture, cloud-native vs on-premises, common platform combinations, and the analyst console

A SOC's technology stack is the infrastructure through which its three missions are executed. The temptation — aggressively marketed by security vendors — is to evaluate each tool in isolation: which SIEM has the best dashboards, which EDR has the highest detection rate in independent tests. The more useful frame is architectural: how do the tools in your stack connect to each other, where does data flow, what are the handoff points between human and machine decision-making, and which gaps in visibility remain after the stack is fully deployed?

The Core Technology Layers

SOC Technology Stack: Data Flow
[Diagram] Data Sources (Endpoints · Network · Cloud · Identity) → SIEM / Data Lake (Ingest · Parse · Correlate · Store) → Detection Layer (Rules · ML · Threat Intel) → Analyst Console (Triage · Investigate · Respond)
Supporting components: EDR (endpoint telemetry + response) · TIP (threat intel enrichment) · SOAR (playbook automation) · Case Mgmt (ticketing, documentation)

Log Collection Infrastructure

Before any analysis is possible, log data must flow from sources to the SIEM. This is achieved through a collection infrastructure of agents and forwarders. Common patterns:

  • Agent-based collection: a lightweight agent runs on each endpoint and forwards events in near-real-time. Examples: Splunk Universal Forwarder, Elastic Agent, Microsoft Monitoring Agent (MMA, now superseded by the Azure Monitor Agent). Provides the richest data with the lowest latency but requires agent deployment and management across the fleet.
  • Syslog forwarding: network devices, firewalls, and Linux systems send logs via syslog (UDP 514, or TCP with TLS, conventionally on port 6514) to a central syslog aggregator or directly to the SIEM. Simple to configure, but UDP syslog has no delivery guarantee.
  • API-based collection: cloud services (Microsoft 365, AWS CloudTrail, Okta) expose logs through APIs. The SIEM or a collection layer polls the API on a schedule. Latency depends on the polling interval (typically 5–15 minutes).
  • Kafka / message bus: high-volume environments use a message bus between log sources and the SIEM for buffering and fan-out. Provides resilience against SIEM downtime and allows multiple consumers of the same log stream.
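To make the syslog pattern concrete, here is a minimal Python sketch of what a forwarder puts on the wire. It assumes simplified RFC 3164-style framing with the timestamp omitted; the function names and the sample host and tag are illustrative, not taken from any particular product.

```python
import socket

# Facility and severity codes from the syslog protocol (auth = 4, warning = 4).
FACILITY_AUTH = 4
SEVERITY_WARNING = 4


def build_syslog_message(facility: int, severity: int, host: str, tag: str, text: str) -> bytes:
    """Build a simplified RFC 3164-style message (timestamp omitted for brevity)."""
    pri = facility * 8 + severity  # PRI encodes facility and severity in one number
    return f"<{pri}>{host} {tag}: {text}".encode("utf-8")


def send_udp(message: bytes, collector: tuple[str, int]) -> None:
    """Fire-and-forget UDP send: there is no acknowledgement, so loss goes unnoticed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, collector)


# Hypothetical sample event from a firewall named fw01.
sample = build_syslog_message(FACILITY_AUTH, SEVERITY_WARNING, "fw01", "sshd", "Failed password for root")
print(sample.decode())
```

Calling `send_udp(sample, ("192.0.2.10", 514))` would ship the message to a collector at a placeholder address. The sender gets no confirmation of delivery, which is the reason UDP syslog is described as having no delivery guarantee.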

SIEM Architecture

The SIEM (Security Information and Event Management) platform performs four functions: collection, normalisation, correlation, and presentation. Understanding each is essential for deploying and tuning a SIEM effectively.

  • Collection — receiving log data from all sources via agents, syslog, or API
  • Normalisation / parsing — extracting structured fields from raw log text. A Windows event log arriving as XML must be parsed into discrete fields (EventID, Account, Source IP) before it can be queried or correlated. This is done via parsing rules or schemas (Common Information Model in Splunk, ECS in Elastic).
  • Correlation — identifying relationships between events across time and sources. A failed login on a VPN followed by a successful login 30 seconds later from the same IP against a different account is a correlation finding that neither event would trigger alone.
  • Presentation — dashboards, alerts, search interface. The analyst-facing surface through which investigation and triage occur.
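The normalisation step can be illustrated with a short Python sketch that parses a Windows event delivered as XML into discrete fields. The sample event is hand-written for illustration, and the output field names (`event_id`, `account`, `source_ip`) are hypothetical rather than the actual CIM or ECS names.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Windows event log records.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Hand-written sample of a failed logon (4625); values are illustrative.
RAW_EVENT = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>4625</EventID>
    <TimeCreated SystemTime="2024-01-15T09:30:00.000Z"/>
    <Computer>dc01.example.test</Computer>
  </System>
  <EventData>
    <Data Name="TargetUserName">alice</Data>
    <Data Name="IpAddress">203.0.113.10</Data>
  </EventData>
</Event>"""


def normalise(raw_xml: str) -> dict:
    """Extract a flat, queryable record from a raw Windows event XML string."""
    root = ET.fromstring(raw_xml)
    # EventData holds name/value pairs; collect them into a dictionary first.
    data = {d.get("Name"): (d.text or "") for d in root.findall("e:EventData/e:Data", NS)}
    return {
        "event_id": int(root.findtext("e:System/e:EventID", namespaces=NS)),
        "timestamp": root.find("e:System/e:TimeCreated", NS).get("SystemTime"),
        "host": root.findtext("e:System/e:Computer", namespaces=NS),
        "account": data.get("TargetUserName"),
        "source_ip": data.get("IpAddress"),
    }


print(normalise(RAW_EVENT))
```

Once events from every source are reduced to the same field names, a correlation rule can compare a VPN log and a Windows log on `source_ip` without knowing how either was originally formatted.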

SIEM vs Data Lake

A newer architectural pattern separates the storage layer (a data lake — cheap, scalable object storage like S3 or Azure Data Lake) from the query layer (a SIEM or SIEM-like interface that queries the lake). This approach allows retaining larger volumes of data at lower cost, reserving expensive SIEM indexing for high-priority data sources while keeping everything else queryable on demand.

Microsoft Sentinel uses this model — logs are stored in Azure Log Analytics (backed by Azure Data Explorer) with tiered pricing based on query recency. Splunk's SmartStore architecture uses S3 as a remote storage tier. This pattern is increasingly common for organisations that need long retention periods without the cost of full SIEM indexing for all data.

Common Platform Combinations

| Stack | SIEM | EDR | Case Mgmt | Characteristic |
|---|---|---|---|---|
| Enterprise | Splunk Enterprise Security | CrowdStrike Falcon | ServiceNow / Jira | Highest capability, highest cost; dominant in large enterprises |
| Microsoft-native | Microsoft Sentinel | Defender for Endpoint | Sentinel Incidents | Deep M365/Azure integration; strong value for Microsoft-heavy shops |
| Open-source | Elastic SIEM | Elastic Agent (Defend) | TheHive | Lowest licence cost; highest engineering overhead |
| Mid-market managed | Exabeam / LogRhythm | SentinelOne | Built-in | More turnkey than Splunk; strong UEBA in Exabeam |

The Analyst Console — What Analysts Actually Do

A Tier-1 SOC analyst's working day is dominated by the alert queue — a prioritised list of security alerts requiring triage. In a mature SOC, this queue is surfaced through a case management or SOAR interface that presents each alert with pre-populated enrichment data: IOC reputation, asset criticality, user risk score, similar recent alerts. The analyst's task is to evaluate this context, determine whether the alert represents a genuine threat, and document their decision with reasoning.

In immature SOCs, the analyst opens raw SIEM alerts with no enrichment, manually looks up each IOC, and fills in a bare-minimum ticket. The difference in analyst throughput and decision quality between these two scenarios is enormous — and closing that gap is one of the primary returns on investment from SOAR and TIP deployment.
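The difference between a raw alert and an enriched case can be sketched in a few lines of Python. The lookup tables below stand in for a TIP and an asset inventory; their contents, the `Alert` fields, and the priority weighting are all hypothetical and only show the shape of the idea.

```python
from dataclasses import dataclass, field

# Hypothetical lookup tables standing in for a TIP and an asset inventory.
IOC_REPUTATION = {"198.51.100.7": "malicious", "203.0.113.5": "suspicious"}
ASSET_CRITICALITY = {"dc01": "critical", "laptop-042": "low"}


@dataclass
class Alert:
    rule: str
    host: str
    remote_ip: str
    enrichment: dict = field(default_factory=dict)


def enrich(alert: Alert) -> Alert:
    """Attach the context an analyst would otherwise look up by hand."""
    alert.enrichment = {
        "ioc_reputation": IOC_REPUTATION.get(alert.remote_ip, "unknown"),
        "asset_criticality": ASSET_CRITICALITY.get(alert.host, "unknown"),
    }
    return alert


def triage_priority(alert: Alert) -> int:
    """Lower number means look sooner. The weighting is illustrative only."""
    score = 3
    if alert.enrichment.get("ioc_reputation") == "malicious":
        score -= 1
    if alert.enrichment.get("asset_criticality") == "critical":
        score -= 1
    return score


urgent = enrich(Alert("Outbound connection to rare IP", "dc01", "198.51.100.7"))
print(urgent.enrichment, triage_priority(urgent))
```

In practice a SOAR playbook would perform these lookups against live services, but the analyst-facing result is the same: the alert arrives with its context already attached.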

Key Takeaways — Chapter 2
  • Evaluate the technology stack architecturally — how tools connect and data flows matters more than individual tool benchmarks
  • Agent-based collection provides richer, lower-latency data; syslog is simpler but lossy; API collection introduces latency
  • SIEM functions: collection → normalisation → correlation → presentation — each layer requires deliberate configuration
  • Data lake architecture separates cheap storage from expensive indexing — enables long retention at manageable cost
  • Pre-populated alert enrichment is one of the most impactful improvements to analyst throughput — the gap between raw alerts and enriched cases is enormous
Chapter 03 · ~16 min · Data Sources

Log Collection & Data Sources

What to collect, Windows audit policy, Sysmon, Linux auditd, DNS and proxy logs, cloud sources, identity logs, and retention strategy

The most common SIEM failure mode is not a technology problem — it is a data problem. Organisations invest in SIEM platforms, pay for licences, deploy analysts, and then discover that when an incident occurs, the relevant logs were never collected, were collected with insufficient fidelity, or were collected but expired before the investigation needed them. Effective log collection is unglamorous operational work that pays off dramatically when it matters. This chapter covers what to collect, how to configure it, and how to retain it intelligently.

Windows Security Event Log — Audit Policy

Windows generates security-relevant events through its audit subsystem, controlled by Group Policy. The default audit policy is insufficient for SOC use — most high-value events require explicit enablement. Configure via: Computer Configuration → Windows Settings → Security Settings → Advanced Audit Policy Configuration.

| Audit Category | Setting | Why It Matters |
|---|---|---|
| Account Logon | Success + Failure | Generates 4768/4769 (Kerberos) and 4776 (NTLM credential validation) on domain controllers; authentication baseline |
| Logon/Logoff | Success + Failure | Generates 4624 (logon success), 4625 (logon failure), 4634 (logoff), 4647 (user-initiated logoff), 4648 (explicit credentials) |
| Special Logon (under Logon/Logoff) | Success | Generates 4672 (special privileges assigned to new logon); privileged logon tracking |
| Process Creation | Success | Generates 4688; execution evidence. Command-line logging must be enabled separately via GPO |
| Object Access: File System | Selective | Generates 4663 on file access; high volume, so enable only on sensitive directories |
| Account Management | Success + Failure | Generates 4720 (account created), 4728/4732 (group membership changes) |
| Policy Change | Success | Generates 4719 (audit policy changed); detects an attacker disabling logging |
| DS Access | Success | Generates 4662 on AD object access; DCSync detection requires this on DCs |
Critical Setting

Process creation logging (Event ID 4688) without command-line logging enabled is significantly less useful. The event records that a process ran but not what arguments it received. Enable command-line inclusion via: Computer Configuration → Administrative Templates → System → Audit Process Creation → Include command line in process creation events → Enabled. This is what exposes PowerShell invocations and their arguments (including encoded commands) in 4688 events; the content of scripts executed inside PowerShell additionally requires Script Block Logging (Event ID 4104).

Sysmon Deployment

Sysmon (System Monitor) is a Sysinternals tool that generates detailed Windows event log entries for: process creation (with hash, parent, and command line), network connections, file creation, registry modifications, driver loading, and more. It is the single most impactful detection investment for Windows endpoint visibility after basic audit policy configuration.

Sysmon Key Event IDs

Event ID 1 (Process Create): hash, parent process, command line, current directory, user

Event ID 3 (Network Connection): source/dest IP and port, process, protocol

Event ID 7 (Image Loaded): DLL loading; detects DLL injection and side-loading

Event ID 8 (CreateRemoteThread): process injection detection

Event ID 10 (ProcessAccess): LSASS access (Mimikatz), handles to other processes

Event ID 11 (FileCreate): new files created or overwritten; useful for spotting dropped payloads (timestomping is reported separately by Event ID 2, FileCreateTime)

Event ID 12/13 (Registry): key creation and deletion, value modification; persistence detection

Event ID 22 (DNSEvent): DNS queries from each process; C2 beacon detection

Deploy with a community ruleset to avoid generating excessive noise from legitimate activity. SwiftOnSecurity's sysmon-config is the most widely used starting point; olafhartong's sysmon-modular provides a more granular modular approach that allows per-module tuning.

Sysmon Deployment Commands
sysmon64.exe -accepteula -i sysmonconfig.xml
sysmon64.exe -c sysmonconfig.xml   # update config on running install
sysmon64.exe -u                    # uninstall

Deploy via GPO using a startup script or software deployment tool (SCCM, Intune). The configuration XML should be version-controlled and treated as detection content.

Linux Log Collection

Linux log collection for SOC purposes requires two components: a forwarder to ship logs to the SIEM, and appropriate logging configuration to generate useful events.

  • auditd: the Linux audit subsystem. When configured, it generates events for syscalls, file access, command execution, and privilege changes. Configuration lives in /etc/audit/audit.rules (or rule files under /etc/audit/rules.d/). High-value rules: monitor writes to /etc/passwd and /etc/shadow, monitor execve syscalls, monitor SSH authorized_keys files.
  • rsyslog / syslog-ng: forwards syslog output from auth.log, kern.log, and application logs to a central aggregator or directly to the SIEM.
  • Elastic Agent / Filebeat / Splunk UF: agent-based collection that handles both syslog and auditd, providing richer field extraction and lower-latency delivery than syslog forwarding.

DNS Query Logging

DNS query logs (every domain name queried by every host) are among the highest-value and most underutilised log sources in enterprise environments. A host resolving a newly registered, high-entropy domain may be beaconing to C2 infrastructure. A host making 500 DNS queries per minute to the same domain may be tunnelling data. A host resolving a domain associated with a known threat actor is an IOC hit that warrants immediate investigation.

Collection options: Windows DNS Server debug logging, Sysmon Event ID 22 for per-process DNS queries on endpoints, network-level passive DNS capture (Zeek, Suricata), or DNS resolver query logs from Pi-hole or Infoblox deployments. Route all collection into the SIEM and ensure enough retention (90+ days) to support investigations with long dwell times.
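One way to surface the "high-entropy domain" signal from DNS logs is a Shannon entropy score over the leftmost label. The sketch below is a rough heuristic under assumed thresholds (3.5 bits per character, minimum length 12). Both values are illustrative and would need tuning, since CDN and cloud hostnames also look random.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Bits of entropy per character in the given string."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_suspicious(domain: str, threshold: float = 3.5, min_length: int = 12) -> bool:
    """Flag domains whose leftmost label is both long and high-entropy."""
    first_label = domain.split(".")[0].lower()
    return len(first_label) >= min_length and shannon_entropy(first_label) >= threshold


# Hypothetical queries as they might appear in a resolver log.
for name in ("mail.example.com", "x7f3k9q2zl8mw1pd.example.net"):
    print(name, is_suspicious(name))
```

Entropy alone is a weak indicator. It becomes more useful when combined with the other signals described above, such as domain registration age and per-host query rate.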

Cloud Log Sources

| Platform | Log Source | Ingestion Method | Coverage |
|---|---|---|---|
| AWS | CloudTrail | S3 bucket → Lambda or SIEM connector | Management (control plane) API calls; data events if enabled |
| AWS | VPC Flow Logs | CloudWatch Logs or S3 | Network traffic metadata |
| AWS | GuardDuty | Security Hub → SIEM | Pre-correlated threat findings |
| Azure | Activity Log + Diagnostic Settings | Event Hub → SIEM | Control plane + per-resource |
| Azure AD (now Microsoft Entra ID) | Sign-in + Audit Logs | Sentinel native / Event Hub | Authentication + directory changes |
| M365 | Unified Audit Log | Microsoft Graph API / Sentinel connector | User and admin activity across M365 services |
| GCP | Cloud Audit Logs | Pub/Sub → SIEM | Admin, data access, system events |

Identity Provider Logs

Authentication is the highest-value single log source for detecting account compromise. Every identity provider in your environment must feed the SIEM: Active Directory domain controllers (Security event log), Azure AD (Sign-in logs), Okta (System Log), Ping, and any other SSO platform. Authentication anomalies — failed MFA challenges, logins from new countries, impossible travel, password spray patterns — are only detectable with complete authentication log coverage.
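Impossible travel is a useful example of why complete coverage matters: the check needs two consecutive logins for the same account, wherever they were recorded. A minimal sketch follows. It assumes each login has already been geolocated to a latitude and longitude, and uses a hypothetical speed ceiling of 900 km/h (roughly airliner speed).

```python
import math
from datetime import datetime


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points on Earth, in kilometres."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def is_impossible_travel(login_a: dict, login_b: dict, max_kmh: float = 900.0) -> bool:
    """True when the implied speed between two logins exceeds max_kmh."""
    hours = abs((login_b["time"] - login_a["time"]).total_seconds()) / 3600
    distance = haversine_km(login_a["lat"], login_a["lon"], login_b["lat"], login_b["lon"])
    if hours == 0:
        return distance > 0
    return distance / hours > max_kmh


# Hypothetical logins for one account: London, then New York an hour later.
london = {"time": datetime(2024, 1, 15, 9, 0), "lat": 51.5074, "lon": -0.1278}
new_york = {"time": datetime(2024, 1, 15, 10, 0), "lat": 40.7128, "lon": -74.0060}
print(is_impossible_travel(london, new_york))
```

VPN egress points and mobile carrier routing produce false positives with this logic, so real deployments usually add allow-lists for known corporate egress addresses.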

Log Retention Strategy

"Retain everything forever" sounds attractive but is practically impossible. A tiered retention strategy balances cost against investigative need:

  • Hot tier (SIEM indexed, 0–30 days) — all high-priority sources at full fidelity. Instantly queryable. Most expensive per GB. Essential for current incident investigation.
  • Warm tier (searchable archive, 31–90 days) — queryable on demand but with higher latency. Lower cost per GB. Needed for incidents with moderate dwell times.
  • Cold tier (cheap object storage, 91 days – 1 year+) — compressed, retrievable on demand but requiring hours-to-days to restore to searchable state. Lowest cost. Needed for compliance and long-dwell-time incidents.

Minimum recommended retention for SOC use: 90 days hot/warm for authentication and endpoint logs, 1 year cold for all security-relevant logs. Many breaches have dwell times exceeding 90 days — short retention windows guarantee incomplete investigations.
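The tier boundaries above reduce to a simple age check. The Python sketch below uses the 30-day and 90-day boundaries from the tier list; the function and constant names are illustrative.

```python
from datetime import date

# Tier boundaries in days, taken from the tiering described above.
HOT_MAX_DAYS = 30
WARM_MAX_DAYS = 90


def retention_tier(event_date: date, today: date) -> str:
    """Return the storage tier an event belongs in, based on its age in days."""
    age_days = (today - event_date).days
    if age_days < 0:
        raise ValueError("event_date is in the future")
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"


today = date(2024, 6, 30)
for d in (date(2024, 6, 29), date(2024, 5, 1), date(2023, 12, 1)):
    print(d, retention_tier(d, today))
```

A lifecycle job built on a check like this would run daily and move indices or objects between tiers. An investigator can use the same function to estimate how long a query against a given date range will take to return.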

Key Takeaways — Chapter 3
  • Windows process creation logging (4688) requires separate GPO to include command-line arguments — without this it is far less useful
  • Sysmon dramatically expands endpoint visibility — deploy with a community config (SwiftOnSecurity or olafhartong) to avoid noise
  • DNS query logs are high-value and frequently missing — C2 beaconing, tunnelling, and DGA detection all depend on them
  • Cloud logs (CloudTrail, Azure Activity Log, M365 UAL) must all reach the SIEM — and they are not enabled comprehensively by default
  • Tiered retention (hot/warm/cold) balances cost against investigative reach — 90 days minimum queryable for high-priority sources
Chapter 04 · ~18 min · Detection

Detection Engineering

The detection lifecycle, Sigma rules, ATT&CK coverage mapping, false positive management, and detection as code

Detection engineering is the discipline of building, testing, maintaining, and improving the detection content that the SOC relies on. It is distinct from analysis — a detection engineer writes the rules that analysts work from, and the quality of that content determines the ceiling of what the SOC can find. Detection engineering is where security research, data engineering, and threat intelligence intersect.

Definition

Detection engineering is the practice of systematically translating knowledge of adversary tactics and techniques into detection logic — rules, queries, models — that surfaces relevant activity within the SOC's monitoring infrastructure, at an acceptable false positive rate, with documented coverage and known limitations.

The Detection Lifecycle

  1. Hypothesis — what adversary behaviour are we trying to detect? Framed in terms of an ATT&CK technique, a known threat actor TTP, or an observed incident technique. "Detect T1053.005 — Scheduled Task creation via schtasks.exe from a non-administrative parent process."
  2. Data source identification — what log source contains evidence of this behaviour? Is it currently collected? Is it at sufficient fidelity? If not, a data gap must be closed before the detection can be built.
  3. Rule development — write the detection logic in Sigma (vendor-neutral), then transpile to the SIEM's native language. Test against known-good and known-bad samples.
  4. Testing — validate against historical data (does it fire on known incidents in your environment?), against synthetic data (fire the technique in a test environment and confirm detection), and for false positives (run against a baseline period of production data to measure noise).
  5. Deployment — push the rule to production via the deployment pipeline. Assign severity, category, and response guidance metadata.
  6. Tuning — after deployment, monitor false positive rate. Tune exclusions for legitimate activity that matches the rule. Re-test after tuning to verify coverage is maintained.
  7. Maintenance: rules drift out of relevance as the environment changes, so detection content requires periodic review. A rule that has generated zero alerts for six months has a data source problem, has a rule logic problem, or targets a technique that is simply not observed in your environment.

Detection Types

  • Signature-based — exact string or pattern matching. Fast, precise, high confidence when the signature is reliable. Does not detect novel variants. Best for known-bad IOCs (file hashes, C2 domains) and known attacker tools with stable signatures.
  • Threshold-based — fires when a count exceeds a baseline (more than 10 failed authentications in 5 minutes). Effective for detecting brute force, data exfiltration by volume, and scanning. Requires careful baseline calibration per environment.
  • Anomaly-based / ML — detects deviation from learned baseline behaviour. High potential for novel threat detection. High false positive potential during baselining. Most effective for user/entity behaviour analytics (UEBA) — detecting accounts behaving unusually relative to their own baseline.
  • Correlation rules: fire when multiple lower-confidence events occur in combination, for example "failed authentications followed by a successful authentication from the same IP within 60 seconds". No single event warrants an alert; together they indicate that a credential-guessing attempt may have succeeded.
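The threshold example from the list ("more than 10 failed authentications in 5 minutes") can be sketched as a sliding window per source IP. This is a simplified illustration in plain Python, not how any particular SIEM implements it; the function name and the reset-after-alert behaviour are assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def threshold_alerts(events, limit=10, window=timedelta(minutes=5)):
    """Yield (source_ip, timestamp) when a source exceeds `limit` failures inside `window`.

    `events` is an iterable of (timestamp, source_ip) tuples sorted by timestamp.
    """
    recent = defaultdict(deque)
    for ts, source_ip in events:
        q = recent[source_ip]
        q.append(ts)
        # Drop failures that have aged out of the window.
        while q and ts - q[0] > window:
            q.popleft()
        if len(q) > limit:
            yield source_ip, ts
            q.clear()  # reset so one burst produces one alert


start = datetime(2024, 1, 15, 9, 0, 0)
# Hypothetical data: one fast burst, and one slow source that stays under the limit.
burst = [(start + timedelta(seconds=10 * i), "203.0.113.10") for i in range(11)]
slow = [(start + timedelta(minutes=i), "198.51.100.20") for i in range(11)]
alerts = list(threshold_alerts(sorted(burst + slow)))
print(alerts)
```

The slow source generates the same number of failures but never has more than six inside any five-minute window. That is the "staying below alert thresholds" evasion, and the reason thresholds need calibrating per environment.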

Sigma Rules — The Vendor-Neutral Detection Format

Sigma is an open standard for writing detection rules in a platform-agnostic YAML format, which can then be converted (transpiled) to Splunk SPL, Microsoft Sentinel KQL, Elastic ES|QL, and dozens of other query languages. This allows detection content to be written once, version-controlled, shared across the community, and deployed to any supported SIEM.

Sigma Scheduled task created via schtasks — T1053.005
# Detect schtasks.exe creating a new scheduled task
# from a suspicious parent process
title: Scheduled Task Creation via Schtasks
id: a7c3d890-4c1e-4b2f-9d8a-3e5f6c7b2a1d
status: stable
description: Detects scheduled task creation from non-admin parent processes
references:
  - https://attack.mitre.org/techniques/T1053/005/
author: Lucent Grid
date: 2024-01-15
tags:
  - attack.persistence
  - attack.t1053.005
logsource:
  category: process_creation
  product: windows
detection:
  selection:
    Image|endswith: '\schtasks.exe'
    CommandLine|contains:
      - '/create'
      - '-create'
  filter_admin:
    ParentImage|contains:
      - '\taskeng.exe'
      - '\taskhostw.exe'
      - '\services.exe'
  condition: selection and not filter_admin
falsepositives:
  - Legitimate software installers creating scheduled tasks
  - Administrative scripts
level: medium
fields:
  - CommandLine
  - ParentImage
  - User

The Sigma community maintains thousands of rules at github.com/SigmaHQ/sigma. These cover the full ATT&CK matrix and are a starting point for any detection engineering programme — not as rules to deploy wholesale (they require tuning for your environment) but as research assets that document what data source, what logic, and what conditions are needed to detect each technique.
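For testing, it can help to express the schtasks rule's logic directly in code and run it against sample events. The Python sketch below hand-translates the rule's `selection` and `filter_admin` blocks. It is an illustration of the logic, not output from a Sigma converter, and the sample events are invented.

```python
def matches_schtasks_rule(event: dict) -> bool:
    """Evaluate 'selection and not filter_admin' from the schtasks rule."""
    # Sigma string matching is case-insensitive by default, so compare lower-cased.
    image = event.get("Image", "").lower()
    command_line = event.get("CommandLine", "").lower()
    parent = event.get("ParentImage", "").lower()

    selection = image.endswith("\\schtasks.exe") and any(
        flag in command_line for flag in ("/create", "-create")
    )
    filter_admin = any(
        name in parent for name in ("\\taskeng.exe", "\\taskhostw.exe", "\\services.exe")
    )
    return selection and not filter_admin


# Hypothetical process-creation events used as known-bad and known-good samples.
known_bad = {
    "Image": r"C:\Windows\System32\schtasks.exe",
    "CommandLine": r"schtasks /Create /TN updater /TR C:\Users\Public\a.exe",
    "ParentImage": r"C:\Windows\System32\cmd.exe",
}
known_good = dict(known_bad, ParentImage=r"C:\Windows\System32\services.exe")
print(matches_schtasks_rule(known_bad), matches_schtasks_rule(known_good))
```

Keeping a handful of such samples alongside each rule gives the testing stage of the lifecycle something concrete to assert against.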

ATT&CK Coverage Mapping

One of the most valuable detection engineering activities is mapping your current detection content against the MITRE ATT&CK matrix to identify coverage gaps. For each ATT&CK technique relevant to threats your organisation faces:

  • Do you have a data source that could detect this technique?
  • Do you have a rule that queries that data source for this technique?
  • Has that rule been validated against a known sample or red team exercise?

Tools like MITRE ATT&CK Navigator allow you to colour-code the matrix by coverage status — producing a visual representation of what you can and cannot detect. This becomes the input to the detection engineering backlog: techniques in your threat model with no coverage are the highest-priority items to build.
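The three coverage questions map naturally onto a small status function that can drive both the colour-coding and the backlog order. In the sketch below the technique IDs are real ATT&CK identifiers, but the coverage values, status names, and ordering are hypothetical sample data.

```python
from dataclasses import dataclass


@dataclass
class TechniqueCoverage:
    technique_id: str
    has_data_source: bool
    has_rule: bool
    validated: bool


def coverage_status(t: TechniqueCoverage) -> str:
    """Answer the three coverage questions in order and stop at the first 'no'."""
    if not t.has_data_source:
        return "no_data"
    if not t.has_rule:
        return "no_rule"
    if not t.validated:
        return "unvalidated"
    return "validated"


# Worst gaps first: no visibility at all, then no rule, then untested rules.
BACKLOG_ORDER = ["no_data", "no_rule", "unvalidated"]


def backlog(techniques):
    """Return techniques with gaps, ordered by how severe the gap is."""
    gaps = [t for t in techniques if coverage_status(t) != "validated"]
    return sorted(gaps, key=lambda t: BACKLOG_ORDER.index(coverage_status(t)))


# Hypothetical coverage data for four techniques in a threat model.
threat_model = [
    TechniqueCoverage("T1053.005", True, True, True),
    TechniqueCoverage("T1003.006", True, True, False),
    TechniqueCoverage("T1071.004", False, False, False),
    TechniqueCoverage("T1110.003", True, False, False),
]
print([(t.technique_id, coverage_status(t)) for t in backlog(threat_model)])
```

The same status values could be translated into colours for an ATT&CK Navigator layer, so the visual matrix and the engineering backlog are generated from one source of truth.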

Detection as Code

Treating detection content as software — version-controlled, peer-reviewed, tested before deployment — dramatically improves detection quality and operational reliability. The core practices:

  • Git for all detection content — every Sigma rule in a repository with commit history, pull request reviews, and branch protection on the main branch
  • Automated testing pipeline — CI/CD that transpiles Sigma to native SIEM language, runs syntax validation, and optionally tests against sample log data before merging
  • Deployment automation — rules pushed to the SIEM via API on merge to main, eliminating manual upload steps
  • Alert metadata management — severity, category, ATT&CK tags, and analyst response guidance stored alongside the rule and deployed with it
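A CI pipeline for detection content typically starts with cheap structural checks before any transpilation. The sketch below validates a rule that has already been parsed from YAML into a dictionary. The required-field list is a hypothetical team policy rather than the Sigma specification's own minimum, and `validate_rule` is an illustrative name.

```python
import re

# Hypothetical team policy: fields every rule must carry before it can merge.
REQUIRED_FIELDS = ("title", "id", "logsource", "detection", "level")
VALID_LEVELS = {"informational", "low", "medium", "high", "critical"}
ATTACK_TAG = re.compile(r"^attack\.t\d{4}(\.\d{3})?$")


def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems; an empty list means the rule passes."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rule]
    if "level" in rule and rule["level"] not in VALID_LEVELS:
        errors.append(f"invalid level: {rule['level']}")
    if "detection" in rule and "condition" not in rule["detection"]:
        errors.append("detection has no condition")
    if not any(ATTACK_TAG.match(tag) for tag in rule.get("tags", [])):
        errors.append("no ATT&CK technique tag")
    return errors


# A deliberately incomplete rule, as a reviewer might see in a pull request.
draft = {"title": "Draft rule", "detection": {"selection": {}}, "level": "urgent"}
for problem in validate_rule(draft):
    print(problem)
```

Wired into the pipeline, a non-empty result would fail the build, so malformed or untagged rules never reach the deployment step.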
Key Takeaways — Chapter 4
  • Detection engineering is a distinct discipline from analysis — it produces the rules analysts work from and determines the ceiling of what the SOC can find
  • The detection lifecycle: hypothesis → data source → rule → test → deploy → tune → maintain — skipping testing and tuning produces noisy, unmaintained content
  • Sigma provides vendor-neutral detection authoring — write once, transpile to any supported SIEM platform
  • ATT&CK coverage mapping reveals detection gaps — techniques in your threat model with no coverage are the highest-priority build items
  • Detection as code (Git + CI/CD + API deployment) brings software engineering discipline to detection content quality
Chapter 05 · ~14 min · Threat Intel

Threat Intelligence Operations

The intelligence cycle, strategic vs tactical intelligence, sources, TIPs, IOC enrichment, actor profiling, and TLP classification

Threat intelligence is the most over-marketed and under-practised concept in security operations. Vendors sell "threat intelligence" to mean IOC feeds — lists of known-bad IP addresses and domains. That is the lowest form of intelligence: reactive, ephemeral, and of declining value as attackers rotate infrastructure. Real threat intelligence is the product of an analytical process that produces actionable understanding of adversary intent, capability, and infrastructure — and that understanding informs decisions at every level of the SOC, from which detections to build to how the board should think about risk.

The Intelligence Cycle

  1. Direction — defining the intelligence requirements. What questions do we need answered? What decisions will intelligence support? Who are our likely adversaries? Direction prevents the team from drowning in data while missing the intelligence their stakeholders actually need.
  2. Collection — gathering raw data from sources. Includes commercial feeds, OSINT, information sharing communities, internal incident data, and dark web monitoring.
  3. Processing — normalising, de-duplicating, and structuring collected data for analysis. Raw IOC feeds require deduplication, format standardisation, and source confidence weighting before they are analytically useful.
  4. Analysis — converting processed data into intelligence. This is where human analytical judgment is applied — inferring adversary intent from observed behaviour, assessing attribution confidence, estimating likely future activity.
  5. Dissemination — delivering intelligence to consumers in the right format, at the right level of detail, through the right channel. An IOC goes directly to the SIEM. A threat actor profile goes to the detection engineering team. A strategic risk assessment goes to the CISO.
  6. Feedback — consumers tell the intelligence team whether the intelligence was useful, actionable, and timely. This drives direction for the next cycle.
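The processing step of the cycle (deduplication, format standardisation, source confidence weighting) can be sketched as follows. The confidence weights and source labels are hypothetical, and the normalisation handles only two common defanging styles.

```python
# Hypothetical per-source confidence weights (0-100); tune to your own experience of each feed.
SOURCE_CONFIDENCE = {"internal": 90, "isac": 75, "commercial": 70, "osint": 40}


def normalise_indicator(value: str) -> str:
    """Lower-case, trim, and undo common defanging such as hxxp:// and [.]"""
    v = value.strip().lower()
    return v.replace("[.]", ".").replace("hxxp://", "http://").replace("hxxps://", "https://")


def process(raw_indicators):
    """Merge duplicate indicators, keeping all sources and the highest source confidence."""
    merged = {}
    for item in raw_indicators:
        key = normalise_indicator(item["value"])
        confidence = SOURCE_CONFIDENCE.get(item["source"], 0)
        entry = merged.setdefault(key, {"value": key, "sources": set(), "confidence": 0})
        entry["sources"].add(item["source"])
        entry["confidence"] = max(entry["confidence"], confidence)
    return merged


# Hypothetical raw feed entries: the first two are the same domain in different formats.
raw = [
    {"value": "Evil[.]example[.]com", "source": "osint"},
    {"value": "evil.example.com ", "source": "isac"},
    {"value": "hxxp://bad.example.net/a", "source": "commercial"},
]
processed = process(raw)
print(processed)
```

Recording every source that reported an indicator, not only the highest-confidence one, gives analysts corroboration evidence during the analysis step.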

Intelligence Types

| Type | Content | Consumer | Timescale |
|---|---|---|---|
| Strategic | High-level assessment of threat landscape, adversary motivations, sector targeting trends, geopolitical context | CISO, Board, Risk committee | Weeks to months |
| Operational | Specific campaigns, actor TTPs, campaign infrastructure, targeted attack details | SOC leadership, IR team, Detection engineers | Days to weeks |
| Tactical (IOC-level) | Specific indicators: IPs, domains, hashes, email headers, YARA rules | SIEM, EDR, firewall, email gateway, TIP | Hours to days (expires rapidly) |

Intelligence Sources

  • ISACs (Information Sharing and Analysis Centers) — sector-specific threat intelligence sharing communities. FS-ISAC for financial services, H-ISAC for healthcare, E-ISAC for energy. Members share threat data under agreed handling rules. High-quality, sector-contextualised intelligence that is often more relevant than generic commercial feeds.
  • Commercial feeds — Recorded Future, Mandiant Advantage, CrowdStrike Falcon Intelligence, Intel 471. Provide deep actor profiling, infrastructure tracking, and dark web visibility. Premium cost; premium coverage.
  • Government sources — CISA (US), NCSC (UK), ENISA (EU) publish advisories, IOC sets, and joint cybersecurity advisories for significant campaigns. Free, authoritative, relevant to critical infrastructure sectors.
  • Open-source feeds — AlienVault OTX (community pulses), abuse.ch (MalwareBazaar, Feodo Tracker, URLhaus), Emerging Threats, Spamhaus. High volume, variable quality, require automated processing and confidence weighting.
  • Internal data — your own incident history, threat hunting findings, and malware analysis outputs are intelligence. An indicator from your own environment carries the highest confidence of any source and should feed back into your detection stack.

Threat Intelligence Platforms (TIPs)

A TIP aggregates intelligence from multiple sources, deduplicates and normalises it, tracks indicator lifecycle (new → active → expired → revoked), and provides APIs for operationalisation into downstream controls. Key platforms: MISP (open-source, widely used in ISACs and government), OpenCTI (open-source, strong STIX 2.1 support), Anomali ThreatStream, ThreatQ. The SIEM connects to the TIP and automatically matches events against current IOC sets — providing enrichment and alerting without manual lookup.
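
The indicator lifecycle can be illustrated with a small state function. This is a sketch under stated assumptions: the per-type time-to-live values in `TTL`, the 24-hour `NEW_WINDOW`, and the function name are invented for the example and do not reflect the behaviour of any particular TIP.

```python
from datetime import datetime, timedelta, timezone

# Assumed time-to-live per indicator type; IPs rotate faster than file hashes.
TTL = {
    "ip": timedelta(days=14),
    "domain": timedelta(days=60),
    "sha256": timedelta(days=365),
}
NEW_WINDOW = timedelta(hours=24)  # assumed: how long an indicator counts as "new"


def lifecycle_state(kind, first_seen, last_seen, now, revoked=False):
    """Return 'revoked', 'expired', 'new', or 'active' for one indicator."""
    if revoked:  # an explicit withdrawal by the source always wins
        return "revoked"
    if now - last_seen > TTL.get(kind, timedelta(days=30)):
        return "expired"
    if now - first_seen <= NEW_WINDOW:
        return "new"
    return "active"
```

Only indicators in the new or active state would normally be pushed to downstream controls; expired ones are kept for retrospective searches.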

Actor Profiling and TTP-Led Detection

Understanding which threat actors are most likely to target your organisation — based on sector, geography, revenue, and data held — allows the SOC to build targeted detections for those actors' known TTPs rather than generic coverage. A UK financial institution that builds detections specifically for the known techniques of FIN7 and Lazarus Group gets more focused and more effective coverage than one that spreads effort evenly across the roughly 200 ATT&CK techniques without prioritisation.

Actor profiles should include: known initial access vectors, preferred persistence mechanisms, C2 protocols and infrastructure characteristics, target selection criteria, and historical campaign timing patterns. This context guides detection engineering priority, hunting hypotheses, and IR playbook development.

TLP — Traffic Light Protocol

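TLP is a standardised classification scheme for controlling how threat intelligence is shared. It has four labels. TLP 2.0 (2022) renamed TLP:WHITE to TLP:CLEAR and added TLP:AMBER+STRICT, which limits sharing to the recipient's own organisation.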

Colour | Distribution | When to Use
TLP:RED | Named recipients only — not shareable | Information specific to named individuals or organisations; sharing would risk the source or cause harm
TLP:AMBER | Organisation and clients with need-to-know | Sensitive information that may damage the source or ongoing operations if shared broadly
TLP:GREEN | Community-wide (not public) | Information useful to the security community but not intended for public release
TLP:WHITE / CLEAR | Unrestricted | Public information — may be shared freely
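
The sharing rules in the table can be expressed as a simple check before intelligence leaves the team. This is an illustrative sketch: the audience names and the `may_share` helper are assumptions for the example, and real sharing decisions also depend on any caveats the source attached.

```python
# Audiences ordered from narrowest to widest; the names are illustrative.
AUDIENCES = ["named_recipients", "organisation", "community", "public"]

# Widest audience each label permits, following the table above.
MAX_AUDIENCE = {
    "TLP:RED": "named_recipients",
    "TLP:AMBER": "organisation",   # organisation and clients with need-to-know
    "TLP:GREEN": "community",
    "TLP:CLEAR": "public",
    "TLP:WHITE": "public",         # legacy name for TLP:CLEAR
}


def may_share(label: str, audience: str) -> bool:
    """True if `audience` is no wider than the label allows.

    Unknown labels fail closed (not shareable). An unknown audience
    raises ValueError so that typos are not silently permitted.
    """
    allowed = MAX_AUDIENCE.get(label.upper())
    if allowed is None:
        return False
    return AUDIENCES.index(audience) <= AUDIENCES.index(allowed)
```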
Key Takeaways — Chapter 5
  • IOC feeds are the lowest form of threat intelligence — tactical, ephemeral, and decreasing in value as attackers rotate infrastructure
  • The intelligence cycle (Direction → Collection → Processing → Analysis → Dissemination → Feedback) transforms data into actionable understanding
  • Strategic intelligence informs risk decisions; operational intelligence drives detection engineering; tactical IOCs feed automated controls
  • Actor-specific detection — building detections for the TTPs of your sector's most likely adversaries — is more focused and effective than generic ATT&CK coverage
  • TLP classification governs sharing rights — respecting TLP markings is a prerequisite for participation in intelligence sharing communities
Chapter 06 · ~15 min · Triage

Alert Triage & Investigation

The alert queue, classification types, triage methodology, investigation workflow, pivoting, ATT&CK mapping, and documentation standards

Alert triage is the core daily work of a SOC analyst. Done well, it surfaces real threats quickly and disposes of noise efficiently. Done poorly, it produces a mounting queue of unreviewed alerts, analyst fatigue, and the inevitable consequence: real incidents buried under false positives. Triage is not just a technical skill — it is an analytical discipline that requires systematic methodology, pattern recognition built through repetition, and the composure to make decisions under uncertainty.

Alert Classification

Every alert must be classified before it can be acted on:

Classification | Definition | Action
True Positive (TP) | Alert fired correctly on genuine malicious or policy-violating activity | Investigate fully; escalate if confirmed incident
False Positive (FP) | Alert fired on legitimate activity that matched the rule | Close; consider tuning if recurring
True Negative (TN) | No alert fired on legitimate activity | N/A — desired state
False Negative (FN) | No alert on genuine malicious activity (missed detection) | Ideally discovered through hunting; drives detection improvement

The false positive rate is the most operationally significant metric in daily SOC work. An alert that is a false positive 95% of the time trains analysts to dismiss it reflexively — eventually including the 5% of the time when it fires on a real threat. Tuning alerts below a 10–20% false positive rate per rule is a meaningful goal for mature SOCs.
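
A per-rule false positive rate is easy to compute from closed tickets. The sketch below assumes a hypothetical export of `(rule_name, verdict)` pairs and treats the rate as the share of a rule's closed alerts that were dispositioned FP; the 20% default threshold simply mirrors the upper end of the range mentioned above.

```python
from collections import Counter


def false_positive_rates(dispositions):
    """dispositions: iterable of (rule_name, 'TP' | 'FP') for closed alerts.

    Returns {rule_name: fraction of that rule's closed alerts that were FP}.
    """
    totals, fps = Counter(), Counter()
    for rule, verdict in dispositions:
        totals[rule] += 1
        if verdict == "FP":
            fps[rule] += 1
    return {rule: fps[rule] / totals[rule] for rule in totals}


def rules_needing_tuning(dispositions, threshold=0.20):
    """Rule names whose FP share exceeds the threshold, sorted alphabetically."""
    rates = false_positive_rates(dispositions)
    return sorted(rule for rule, rate in rates.items() if rate > threshold)
```

Reviewing this list on a regular cadence turns tuning into a routine task rather than a reaction to analyst fatigue.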

The Five-Minute Triage

For each alert, a Tier-1 analyst should be able to reach a disposition within approximately five minutes using a consistent mental framework. The questions, in order:

  1. What fired? — understand the rule that generated the alert without immediately reading into the specific data. What technique or behaviour is this rule designed to detect?
  2. Who and what is involved? — identify the account, the endpoint, and the network context. Is this a critical asset? A privileged account? A known-bad external IP?
  3. Is this normal for this entity? — compare to baseline. Has this account, process, or source IP done this before? Is the timing consistent with normal working patterns?
  4. What context is available? — check enrichment: IOC reputation, recent alerts on the same entity, open incidents, asset classification.
  5. Is there a corroborating signal? — does any other alert, log event, or IOC match in the same time window support a malicious interpretation?

If after this process the analyst cannot determine a disposition, the correct action is to escalate — not to close tentatively and move on. Uncertainty is escalation-worthy.

Investigation Workflow — Pivoting

When an alert warrants deeper investigation beyond initial triage, the analyst moves into an investigation workflow based on pivoting — using each finding to surface the next relevant data point.

  • IP address → domains — reverse DNS, passive DNS history, certificate transparency — what other infrastructure is hosted at this IP or associated with the same registration?
  • Domain → IP history — what IPs has this domain resolved to over time? Are other endpoints in your environment connecting to any of them?
  • File hash → behaviour — VirusTotal dynamic analysis, sandbox execution, YARA matches — what does this file do?
  • Process → parent process — legitimate processes have expected parent-child relationships. An unusual parent suggests either a LOLBin execution chain or process injection.
  • User account → authentication history — when and where has this account authenticated? Are there other logons in the investigation window from unexpected sources?
  • Source IP → all activity in the window — what else did this IP do before and after the alert fired? Is there a reconnaissance pattern? A lateral movement pattern?
Investigation Pivot Example

Alert: SIEM detects PowerShell execution from winword.exe on a user's workstation.

Pivot 1: Process tree → PowerShell spawned by winword.exe → Word macro execution. Check the document that was open.

Pivot 2: PowerShell script block log (Event ID 4104) → base64-encoded download cradle → decode → URL being fetched.

Pivot 3: DNS logs → the domain in the URL was resolved 4 seconds before the PowerShell event → proxy logs show an HTTP GET to the resolved IP.

Pivot 4: Proxy log → response was 47KB executable → file saved to %TEMP% → Prefetch confirms execution.

Verdict: True positive. Macro-delivered malware. Escalate to Tier-2 for containment and malware analysis.
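
The decode step in Pivot 2 can be reproduced offline. PowerShell's -EncodedCommand argument is Base64 over UTF-16LE text, so a minimal sketch looks like the following; the helper names and the URL pattern are assumptions for illustration, and the regular expression is deliberately simple rather than a complete URL parser.

```python
import base64
import re

# Simple pattern for http/https URLs; stops at whitespace, quotes, or ")".
URL_RE = re.compile(r"https?://[^\s'\"\)]+", re.IGNORECASE)


def decode_encoded_command(b64: str) -> str:
    """Decode a PowerShell -EncodedCommand argument (Base64 of UTF-16LE text)."""
    return base64.b64decode(b64).decode("utf-16-le")


def extract_urls(script: str) -> list[str]:
    """Pull candidate URLs out of decoded script text for further pivoting."""
    return URL_RE.findall(script)


if __name__ == "__main__":
    # "dwBoAG8AYQBtAGkA" is the encoded form of the harmless command "whoami".
    print(decode_encoded_command("dwBoAG8AYQBtAGkA"))
```

Decode on an analysis workstation, never by running the command; the extracted URLs then feed the DNS and proxy pivots.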

Documentation Standards

Every alert that receives more than 30 seconds of analyst attention should be documented. The minimum a SOC ticket should contain:

  • Alert name and SIEM link to the firing event
  • Entities involved: account, hostname, source and destination IPs
  • Timeline of events in the investigation window (UTC)
  • IOCs identified: hashes, domains, IPs, file paths
  • Analyst findings: what was observed and what it indicates
  • Disposition: TP / FP / escalated, with explicit reasoning
  • Actions taken: blocked IP, isolated host, reset account, notified IR team
  • ATT&CK technique if identified (e.g., T1059.001 — PowerShell)

Documentation that is too sparse to reconstruct the investigation from is not documentation — it is a liability. When the original analyst is on leave and the incident recurs, or when a post-incident review needs to establish what was done, sparse tickets leave gaps in institutional knowledge.

Key Takeaways — Chapter 6
  • Alert classification (TP/FP/TN/FN) frames every triage decision — high false positive rates train analysts to dismiss rules, including the real fires
  • The five-minute triage framework — what fired, who is involved, is this normal, what context exists, is there corroboration — produces consistent decisions under time pressure
  • Investigation pivoting chains evidence: IP → domain → infrastructure → scope; process → parent → execution chain → payload
  • Uncertainty is escalation-worthy — closing a tentative FP that is actually a TP is the most dangerous mistake in triage
  • Documentation must be sufficient to reconstruct the investigation from the ticket alone — sparse tickets are an institutional knowledge liability
Chapter 07 · ~20 min · SIEM

SIEM Use Cases & Query Writing

SPL, KQL, and ESQL in depth — high-value use cases for brute force, lateral movement, Kerberoasting, PowerShell abuse, DNS tunnelling, and impossible travel

Query writing is the most concrete, most transferable technical skill in security operations. Every SIEM has its own query language — Splunk uses SPL, Microsoft Sentinel uses KQL, Elastic uses ESQL — but the analytical patterns are identical. Learn to express a detection hypothesis in one language and translating it to another becomes mechanical. This chapter builds the core detection queries every SOC analyst should be able to write, in all three major languages.

Platform Tags

Each query block in this chapter is tagged with the platform it applies to:

Splunk SPL · Sentinel KQL · Elastic ESQL

Use Case 1 — Brute Force Detection

Detect a high volume of failed authentication attempts against one or more accounts from a single source in a short time window.

Splunk SPL · Brute force — 10+ failures in 5 min
index=windows EventCode=4625
| bin _time span=5m
| stats count as failures, values(TargetUserName) as accounts by _time, IpAddress
| where failures >= 10
| sort -failures

Sentinel KQL · Brute force — 10+ failures in 5 min
SecurityEvent
| where EventID == 4625
| summarize failures=count(), accounts=make_set(TargetUserName) by IpAddress, bin(TimeGenerated, 5m)
| where failures >= 10
| order by failures desc
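
The same logic is easy to prototype outside a SIEM, for example against an exported log sample. The sketch below assumes failed-logon events have already been parsed into `(timestamp, source_ip, account)` tuples with naive UTC timestamps; the function name and defaults are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def brute_force_sources(events, window=timedelta(minutes=5), threshold=10):
    """events: iterable of (timestamp, source_ip, account) for failed logons.

    Returns {(bucket_start, source_ip): failure_count} for every fixed
    window in which a source produced at least `threshold` failures.
    """
    epoch = datetime(1970, 1, 1)
    counts = defaultdict(int)
    for ts, src, _account in events:
        # Floor the timestamp to the start of its window, like bin() in a SIEM
        bucket = ts - ((ts - epoch) % window)
        counts[(bucket, src)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}
```

Fixed windows can split a burst across two buckets; a sliding window avoids that at the cost of more computation.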

Use Case 2 — Password Spray Detection

Password spray differs from brute force: instead of many failures against one account, the attacker tries one password against many accounts. Detect by looking for a source with failures spread across many distinct accounts.

Sentinel KQL · Password spray — 1 source, many accounts
SecurityEvent
| where EventID == 4625 and LogonType in (3, 10)
| summarize failures=count(), unique_accounts=dcount(TargetUserName), accounts=make_set(TargetUserName) by IpAddress, bin(TimeGenerated, 30m)
// High unique accounts, low total failures = spray
| where unique_accounts > 10 and failures < 50
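
To see why the two thresholds separate spray from brute force, the condition can be sketched in Python. The input format (pairs of source IP and account, already restricted to one time window) and the function name are assumptions for the example.

```python
from collections import defaultdict


def spray_sources(failures, min_accounts=10, max_failures=50):
    """failures: iterable of (source_ip, account) failed logons in one window.

    A source is flagged when it touches more than `min_accounts` distinct
    accounts while staying under `max_failures` total attempts.
    """
    accounts = defaultdict(set)
    totals = defaultdict(int)
    for src, account in failures:
        accounts[src].add(account.lower())  # account names are case-insensitive
        totals[src] += 1
    return sorted(
        src for src in totals
        if len(accounts[src]) > min_accounts and totals[src] < max_failures
    )
```

A source hammering a single account fails the distinct-account test, and a very noisy source fails the total-failure test, so only the low-and-wide pattern remains.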

Use Case 3 — Lateral Movement via Pass-the-Hash

Pass-the-Hash generates Event ID 4624 with LogonType 3 (network) and AuthenticationPackage = NTLM. When a non-service account authenticates with NTLM from a workstation to another workstation, it is a lateral movement indicator. The query below excludes machine accounts (names ending in $), keeps only internal (RFC 1918) source addresses, and flags accounts that reach more than two destinations.

Splunk SPL · NTLM lateral movement — workstation-to-workstation
index=windows EventCode=4624 Logon_Type=3 Authentication_Package=NTLM
| where NOT match(TargetUserName, "\\$$")
| where cidrmatch("10.0.0.0/8", IpAddress) OR cidrmatch("172.16.0.0/12", IpAddress) OR cidrmatch("192.168.0.0/16", IpAddress)
| stats count as auth_count, dc(ComputerName) as dest_count, values(ComputerName) as destinations by TargetUserName, IpAddress
| where dest_count > 2
| sort -dest_count

Use Case 4 — Kerberoasting Detection

Kerberoasting generates Event ID 4769 (Kerberos Service Ticket Request) with TicketEncryptionType 0x17 (RC4-HMAC). Legitimate requests typically use AES (0x12 or 0x11). A burst of RC4 service ticket requests is a strong Kerberoasting indicator.

Sentinel KQL · Kerberoasting — RC4 service ticket requests
SecurityEvent
| where EventID == 4769
    and TicketEncryptionType == "0x17"    // RC4-HMAC
    and TicketOptions == "0x40810000"
    and ServiceName !endswith "$"          // exclude machine accounts
| summarize request_count=count(), services=make_set(ServiceName) by Account, IpAddress, bin(TimeGenerated, 10m)
| where request_count > 3
| extend severity="High"

Use Case 5 — PowerShell Execution from Unusual Parents

PowerShell spawned by Word, Excel, or a web browser indicates macro or browser-based code execution. Detect by looking for powershell.exe whose parent is in a set of known-suspicious processes.

Sentinel KQL · PowerShell from Office / browser parent
DeviceProcessEvents
| where FileName =~ "powershell.exe"
    and InitiatingProcessFileName in~ (
        "winword.exe", "excel.exe", "outlook.exe", "powerpnt.exe", "mspub.exe", "visio.exe",
        "chrome.exe", "msedge.exe", "firefox.exe", "iexplore.exe", "mshta.exe"
    )
| project Timestamp, DeviceName, AccountName, InitiatingProcessFileName, ProcessCommandLine, InitiatingProcessCommandLine

Use Case 6 — Impossible Travel

Two authentications from the same account from geographically distant locations within a time window that is physically impossible to travel between.

Sentinel KQL · Impossible travel via AAD Sign-in logs
SigninLogs
| where ResultType == "0"    // successful sign-in
| project UserPrincipalName, TimeGenerated, IPAddress, Location, City=tostring(LocationDetails.city), Country=tostring(LocationDetails.countryOrRegion)
| sort by UserPrincipalName asc, TimeGenerated asc
| extend prev_user=prev(UserPrincipalName), prev_time=prev(TimeGenerated), prev_country=prev(Country), prev_ip=prev(IPAddress)
| where UserPrincipalName == prev_user
    and Country != prev_country
    and datetime_diff('minute', TimeGenerated, prev_time) < 180
| project UserPrincipalName, TimeGenerated, Country, prev_country, IPAddress, prev_ip
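
The query above uses a change of country inside a three-hour window as a proxy for impossible travel. Where sign-in records carry coordinates, the implied travel speed can be calculated directly. The following is a sketch under assumptions: the 900 km/h ceiling (roughly airliner cruise speed), the 100 km minimum distance used to absorb geolocation error, and the tuple layout are all illustrative choices.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_impossible_travel(a, b, max_kmh=900.0, min_km=100.0):
    """a, b: (epoch_seconds, lat, lon) for two sign-ins by the same account."""
    distance = haversine_km(a[1], a[2], b[1], b[2])
    if distance < min_km:      # too close to distinguish from geo-IP noise
        return False
    hours = abs(b[0] - a[0]) / 3600
    if hours == 0:             # simultaneous sign-ins far apart
        return True
    return distance / hours > max_kmh
```

Either approach still needs an allowlist for corporate VPN and proxy egress points, which otherwise make a user appear to jump between locations.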

Use Case 7 — DNS Tunnelling Indicators

DNS tunnelling encodes data in subdomain labels, so it shows up as many long, unique queries from one host to a single parent domain.

Splunk SPL · High DNS query volume per host to single domain
index=dns
| rex field=query "(?<apex_domain>[^.]+\.[^.]+)$"
| eval query_len=len(query)
| bin _time span=5m
| stats count as query_count, dc(query) as unique_subdomains, avg(query_len) as avg_query_len by _time, src_ip, apex_domain
| where query_count > 100 OR unique_subdomains > 50 OR avg_query_len > 40
| sort -query_count
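
The features this query relies on (parent domain, query length, number of distinct subdomains) can be complemented with label entropy, since encoded payloads look more random than human-chosen hostnames. The helpers below are an illustrative sketch; note that taking the last two labels as the apex is a simplification that is wrong for suffixes such as co.uk, where a public-suffix list would be needed.

```python
import math
from collections import Counter


def apex_domain(qname: str) -> str:
    """Naive apex: the last two labels of the query name."""
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def subdomain_part(qname: str) -> str:
    """Everything to the left of the naive apex (may be empty)."""
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[:-2])


def shannon_entropy(text: str) -> float:
    """Bits per character; higher values suggest encoded or random data."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())
```

Entropy alone is not a verdict: CDN and telemetry hostnames are also high-entropy, so it works best combined with the volume and length thresholds.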

Use Case 8 — New Local Administrator Account

Sentinel KQL · Member added to a privileged group
SecurityEvent
| where EventID in (4728, 4732, 4756)
    and TargetUserName in ("Administrators", "Domain Admins", "Enterprise Admins")
| project TimeGenerated, EventID, Activity, Computer, SubjectUserName, MemberName, MemberSid, TargetUserName
| extend alert=strcat("Privileged group membership change: ", MemberName, " added to ", TargetUserName)
Key Takeaways — Chapter 7
  • Detection patterns are platform-agnostic — learn the analytical logic and translating between SPL, KQL, and ESQL becomes mechanical
  • Brute force is many failures from one source, typically against one account; password spray is a few failures each against many accounts; each pattern needs its own rule
  • Kerberoasting produces EventID 4769 with TicketEncryptionType 0x17 (RC4) — legitimate modern environments should use AES only
  • PowerShell from Office or browser parent processes is a high-confidence malicious execution indicator with very low legitimate use
  • Impossible travel and new admin group membership are high-fidelity, low-noise detections worth deploying with minimal tuning
Chapter 08 · ~14 min · EDR

Endpoint Detection & Response

EDR architecture, what EDR captures that SIEM cannot, remote response capabilities, MDR, tuning, platform comparison, and EDR evasion

EDR is the most operationally impactful security investment most organisations make. Unlike a SIEM, which depends on what log sources are configured to send it, EDR has direct visibility into process execution, memory activity, file operations, network connections, and registry changes at the kernel level — continuously, on every enrolled endpoint. And unlike passive monitoring, EDR enables active response: isolating a compromised host, killing a malicious process, or running live forensic commands without physical access to the device.

EDR Architecture

An EDR deployment has three components:

  • Endpoint agent — a kernel-level driver and user-space service running on each endpoint. Intercepts system calls, monitors process creation, network connections, file I/O, and registry operations. Designed to be tamper-resistant and low-performance-impact (typically <1% CPU overhead in steady state).
  • Cloud platform / backend — receives telemetry from all agents, stores it, runs detection logic (AI/ML models plus signature-based rules), correlates events across endpoints, and surfaces findings to analysts.
  • Management console — the analyst-facing interface. Provides alert investigation, endpoint search, live response capability, policy management, and threat hunting.

What EDR Captures That the SIEM Cannot

The SIEM receives logs — structured records of events that were configured to be forwarded. EDR captures telemetry — a continuous, high-fidelity record of system activity at a level the OS logging subsystem does not expose. Key differences:

  • Parent-child process relationships — Windows event logs show process creation but do not reliably capture the parent process path, hash, and command line in a single queryable record. EDR does.
  • File hash — EDR hashes every executed file and DLL load in real time. SIEM-based logging does not include file hashes natively (Sysmon adds hashes to Windows events, but this is an EDR-adjacent addition).
  • Memory events — process injection, process hollowing, and reflective DLL loading are visible in EDR telemetry through kernel hook interception. They do not appear in Windows event logs.
  • Pre-execution analysis — modern EDR platforms run machine learning models on executables at the point of launch, potentially blocking malicious files before they execute. This is a preventive capability the SIEM cannot provide.

Remote Response Capabilities

The response half of EDR is what transforms the platform from a passive monitor into an operational tool:

  • Network isolation — instantly blocks all network traffic to and from the endpoint except the EDR management channel. The endpoint can still be remotely forensically examined while completely isolated from the rest of the network. Deployable in seconds from the console — no ticket to the networking team, no physical access.
  • Live response shell — a remote interactive shell on the endpoint through the EDR channel. Run commands, collect files, examine running processes, and examine network state — all forensically documented in the EDR platform's activity log.
  • Process kill — terminate a specific process by PID from the console. Stops malware execution without requiring full isolation when the risk profile allows it.
  • File retrieval — download specific files from the endpoint to the analyst's workstation via the EDR channel. Retrieve a suspicious executable, a dropped payload, or a log file without needing to take a full disk image.
  • Custom script execution — deploy PowerShell, Python, or bash scripts to endpoints. Used for rapid bulk forensic data collection across a fleet during an incident.

EDR Tuning

A freshly deployed EDR with default policies will generate significant alert noise from legitimate activity — software installers, IT management tools, and security software often trigger behavioural rules designed to catch malware. Tuning reduces noise while maintaining coverage:

  • Exclusion management — path, hash, or process exclusions for known-legitimate activity. Must be documented, scoped narrowly, and reviewed periodically. Broad exclusions are a security risk — attackers who know your exclusions can exploit them.
  • Alert severity tuning — downgrade recurring benign-but-suspicious detections from high to informational. They remain visible for investigation context but do not queue for urgent triage.
  • Custom detection rules — all major EDR platforms allow custom IOA (Indicator of Attack) rules that fire on specific behavioural patterns. These are detection engineering applied to the endpoint telemetry stream.

Platform Comparison

Platform | Strength | Notable Feature | Ecosystem
CrowdStrike Falcon | Market-leading detection rates, cloud-native from inception | Threat Graph — cross-customer correlation | Strong SIEM/SOAR integrations; XDR via Falcon Insight XDR
Microsoft Defender for Endpoint | Native M365/Azure integration; included in E5 | Automatic attack disruption | Native Sentinel integration; best value for Microsoft-centric shops
SentinelOne Singularity | Autonomous response; strong Linux/Mac coverage | Storyline — automatically correlates related events into attack story | Active SIEM/SOAR marketplace
Carbon Black (Broadcom, formerly VMware) | Strong in VMware/virtualised environments | Continuous recording — full EDR telemetry for post-incident replay | VMware ecosystem integration

EDR Evasion — What Attackers Do

Understanding EDR evasion techniques improves both detection engineering and incident investigation. Common approaches:

  • Direct syscalls — bypasses EDR user-space hooks by invoking Windows kernel system calls directly, avoiding the hooked API functions the EDR monitors. Detect through kernel-level monitoring (driver-based EDR) or by the absence of expected API call sequences.
  • Process injection into allowlisted processes — injecting into EDR-excluded processes (backup agents, AV engines, IT management tools). This is why broad exclusions are a security risk.
  • Bring-your-own-vulnerable-driver (BYOVD) — attackers load a legitimately signed but vulnerable driver to disable or bypass the EDR driver. Detected through driver load monitoring (Sysmon Event ID 6) and kernel-level EDR that does not depend on user-space hooks.
  • Evasion-aware timing — some malware detects EDR analysis (sandbox detection patterns: no user interaction, short execution time, high entropy process names) and behaves benignly until it detects a real user environment.
Key Takeaways — Chapter 8
  • EDR captures parent-child relationships, file hashes, memory events, and pre-execution analysis that SIEM-based logging cannot
  • Network isolation via EDR (deployable in seconds) is the fastest containment action available — no network team ticket required
  • Live response shell enables remote forensic collection through the EDR channel — preserving evidence while containing the host
  • EDR exclusions are a security risk — narrow, documented, and periodically reviewed exclusions only
  • Direct syscall evasion bypasses user-space hooks — kernel-level EDR drivers are more resilient to this evasion class
Chapter 09 · ~14 min · Network

Network Detection & Monitoring

Zeek, Suricata, PCAP strategy, NetFlow, DNS as a security control, IDS vs IPS architecture, segmentation, beaconing, and east-west visibility

Network security monitoring (NSM) provides a visibility layer that is fundamentally different from endpoint monitoring. While EDR sees what happens on a device, NSM sees what crosses the wire — including traffic between devices where no agent is installed, external communications from IoT and OT devices, and lateral movement across network segments. The two disciplines are complementary: endpoint blindspots are often network-visible, and vice versa.

Zeek — Protocol Analysis and Log Generation

Zeek (formerly Bro) is an open-source network analysis framework. Rather than signature matching, Zeek passively analyses network traffic and produces structured, tab-separated log files for each protocol — conn.log (connections), dns.log (DNS queries), http.log (HTTP transactions), ssl.log (TLS metadata), files.log (transferred files), x509.log (certificates), and more. These logs are designed to be ingested directly into a SIEM for correlation and alerting.

Key Zeek Log Files

conn.log — every TCP/UDP connection: src/dst IP and port, protocol, duration, bytes in/out, connection state. The network equivalent of firewall logs but with application context.

dns.log — every DNS query and response: querying IP, query type, query name, TTL, answer IPs. Essential for C2 detection and DGA identification.

ssl.log — TLS connection metadata: server name (SNI), certificate subject, JA3/JA3S hashes (when the JA3 package is installed), cipher suite, version. Enables encrypted traffic analysis without decryption.

http.log — HTTP transactions: URI, user agent, referrer, response code, response body mime type, response content length. Reveals web shell activity, malware downloads, and data exfiltration via HTTP.

files.log + file content — files transferred over monitored protocols can be extracted and saved for analysis. Zeek can hash extracted files for VirusTotal lookup.

Suricata — Signature Detection and Inline Prevention

Suricata is a high-performance open-source IDS/IPS that supports both passive detection (IDS mode, listening on a mirror/span port) and inline prevention (IPS mode, deployed in-line on a network bridge). It uses a rule format compatible with Snort rules and the Emerging Threats rule set, and also supports Lua scripting for complex behavioural detections.

In IDS mode, Suricata generates alerts on matching traffic without blocking. In IPS mode, it can drop or reject matching packets in real time. The operational consideration: in IPS mode, false positives block legitimate traffic. Most organisations deploy in IDS mode with alerting and use firewall automation for actual blocking, preserving human oversight of block decisions.

PCAP Strategy

Full packet capture (PCAP) is the most complete network evidence available — but also the most storage-intensive. A 1Gbps link at full utilisation generates ~450GB/hour of PCAP. Strategic PCAP requires choosing capture points, retention periods, and filtering carefully.
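
The storage figure follows from simple arithmetic: 1 Gbps is 125 MB per second, which is 450 GB per hour. A small helper makes it easy to size retention for other link speeds and utilisation levels. This is a rough estimate that assumes traffic in one direction only and ignores capture-file overhead; the function name is illustrative.

```python
GB = 1e9
TB = 1e12


def pcap_bytes(link_gbps: float, utilisation: float, hours: float) -> float:
    """Approximate bytes of PCAP for a link at a given average utilisation.

    utilisation is a fraction between 0 and 1. Per-packet capture headers
    are ignored, so real files will be slightly larger.
    """
    bytes_per_second = link_gbps * 1e9 / 8 * utilisation
    return bytes_per_second * 3600 * hours


if __name__ == "__main__":
    print(pcap_bytes(1, 1.0, 1) / GB)    # one hour at full utilisation, in GB
    print(pcap_bytes(1, 0.3, 72) / TB)   # 72 hours at 30% utilisation, in TB
```

At 30% average utilisation, 72 hours of perimeter capture on a 1 Gbps link comes to just under 10 TB, which is why short retention plus flow metadata is the usual compromise.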

  • Perimeter capture — full PCAP at the internet boundary captures all inbound and outbound traffic. Even 24–72 hours of perimeter PCAP can be forensically valuable during incident investigation if the window includes the attacker's initial access or exfiltration.
  • Selective capture — rather than retaining all traffic, keep full PCAP only for traffic associated with an alert. Suricata's pcap-log output supports conditional logging in recent versions, which can restrict capture to flows that triggered an alert and so provides context without retaining everything.
  • Flow metadata first — NetFlow/IPFIX provides traffic metadata at a tiny fraction of PCAP storage cost. Retain full NetFlow for 90+ days; retain PCAP for 24–72 hours at key points with longer retention for flagged traffic.

DNS as a Security Control

DNS is both a critical visibility source (Chapter 3) and an enforcement point. DNS-layer security can block connections to known-malicious domains before any application-layer connection is established:

  • RPZ (Response Policy Zones) — a DNS server extension that returns NXDOMAIN or a sinkhole IP for domains on a blocklist. Can be populated with threat intelligence feeds. Transparent to end users and applications.
  • DNS sinkholes — resolving C2 domains to an internal IP that logs the connection attempts. Instead of just blocking, the sinkhole reveals which internal hosts are attempting to reach C2 infrastructure — invaluable during an incident to scope compromised endpoints.
  • Recursive resolver logging — deploying a centralised recursive resolver (Pi-hole, Unbound, BIND with query logging, Infoblox) for all internal clients and routing all DNS queries through it provides a complete DNS query log for every host.

Detecting Beaconing in Network Data

A C2 beacon makes periodic connections to its command and control server — typically on a schedule with small jitter added to avoid precise-interval detection. In network data, beaconing appears as:

  • Regular connections to the same external IP with consistent inter-connection intervals (e.g. every 300 seconds ±30s)
  • Small, consistent byte counts per connection (heartbeat with no active tasking)
  • Connections at times inconsistent with user activity (e.g. 3 AM on a workstation)
  • Connections to IPs with no associated hostnames (direct IP C2 or freshly registered domain)
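Splunk SPL · Beaconing detection — regular interval connections
index=zeek sourcetype=zeek_conn
| where NOT (cidrmatch("10.0.0.0/8", 'id.resp_h') OR cidrmatch("172.16.0.0/12", 'id.resp_h') OR cidrmatch("192.168.0.0/16", 'id.resp_h'))
| sort 0 id.orig_h, id.resp_h, id.resp_p, _time
| streamstats current=f last(_time) as prev_time by id.orig_h, id.resp_h, id.resp_p
| eval interval=_time - prev_time
| stats count as conn_count, avg(interval) as avg_interval, stdev(interval) as interval_stdev, avg(orig_bytes) as avg_bytes, stdev(orig_bytes) as bytes_stdev by id.orig_h, id.resp_h, id.resp_p
| where conn_count > 5 AND interval_stdev < 30 AND bytes_stdev < 100
| sort interval_stdev
A low interval_stdev means low jitter between connections; a low bytes_stdev means a consistent payload size. Both together suggest a beacon.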
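
The core of beacon detection is the spread of the gaps between consecutive connections, not the spread of the timestamps themselves. A minimal Python sketch of that calculation follows; it assumes connection start times for one source and destination pair have been extracted from conn.log as epoch seconds, and the 20% jitter ratio is an assumed threshold chosen for illustration.

```python
from statistics import mean, pstdev


def beacon_score(timestamps, min_connections=6):
    """timestamps: connection start times (epoch seconds) for one src/dst pair.

    Returns (mean_interval, interval_stdev), or None if there are too few
    connections to say anything useful.
    """
    if len(timestamps) < min_connections:
        return None
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(intervals), pstdev(intervals)


def looks_like_beacon(timestamps, max_jitter_ratio=0.2):
    """True when the interval spread is small relative to the mean interval."""
    score = beacon_score(timestamps)
    if score is None:
        return False
    avg, spread = score
    return avg > 0 and spread / avg <= max_jitter_ratio
```

Using a ratio rather than an absolute number of seconds lets the same check cover both fast beacons and slow ones that call home once an hour.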

East-West Traffic Monitoring

The majority of enterprise network monitoring is focused on north-south traffic (between internal and external networks). East-west traffic (between internal hosts) is where lateral movement occurs — and it is where most organisations have the least visibility.

Achieving east-west visibility requires: deploying NSM sensors at internal network aggregation points (not just the perimeter), enabling NetFlow on internal switches, deploying host-based network monitoring through EDR, and leveraging network segmentation to make unexpected cross-segment traffic a high-confidence anomaly.

Key Takeaways — Chapter 9
  • Zeek generates structured logs (conn, dns, ssl, http, files) from raw traffic — designed for SIEM ingestion and correlation
  • Suricata provides signature-based IDS/IPS — IDS mode preferred operationally to avoid false-positive blocking of legitimate traffic
  • DNS sinkholes reveal compromised hosts attempting C2 connections rather than just blocking them — scope identification during incidents
  • Beaconing detection looks for regular connection intervals, consistent byte counts, and off-hours activity — statistical analysis of conn.log is the primary technique
  • East-west monitoring is the largest network visibility gap in most enterprises — lateral movement is invisible without internal sensor placement
Chapter 10 · ~17 min · Identity

Identity & Access Monitoring

Active Directory attacks and detection, Azure AD, PAM, service account monitoring, and insider threat via UEBA

Identity has become the most consequential attack surface in enterprise security. Compromising a privileged credential achieves more for an attacker than exploiting most vulnerabilities — it provides legitimate access through legitimate channels that blend with normal administrative activity. This chapter covers the identity monitoring techniques that detect credential abuse, privilege escalation, and Active Directory attacks that account for the majority of serious enterprise breaches.

Active Directory — The Crown Jewel

Active Directory is the identity and access foundation of the vast majority of enterprise Windows environments. Compromising it — or specific high-value accounts within it — means compromising everything. The SOC must monitor AD with the same vigilance applied to the most critical production systems, because AD is more critical than any of them.

Tier-0 / Tier-1 / Tier-2 Asset Model

Microsoft's AD tiering model separates assets by the level of control they provide over the AD environment:

  • Tier-0 — assets that can control the entire AD forest: Domain Controllers, Active Directory itself, accounts in Domain Admins / Enterprise Admins / Schema Admins, ADFS servers, AAD Connect servers, privilege management systems. Compromise of any Tier-0 asset means full domain compromise.
  • Tier-1 — servers and applications that host business-critical services. Compromise can lead to Tier-0 via credential harvesting from memory of Tier-1 admin sessions.
  • Tier-2 — workstations and end-user devices. The entry point for most attacks; lateral movement proceeds upward through tiers.

Every login of a Tier-0 privileged account should generate an alert for review. Tier-0 accounts should only log in to Tier-0 systems — a Domain Admin logging into a workstation is an anomaly that warrants immediate investigation.
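
The Tier-0 logon rule above can be expressed as a simple check. The sketch below is illustrative: the account and host sets, the `user` and `host` field names, and the function name are assumptions rather than any product's schema.

```python
# Hypothetical inventories; in practice these would come from AD group
# membership and an asset database.
TIER0_ACCOUNTS = {"da-alice", "ea-bob"}
TIER0_HOSTS = {"dc01", "dc02", "aadconnect01"}


def tier0_logon_violations(logons):
    """Return logon records where a Tier-0 account touched a non-Tier-0 host."""
    return [
        event
        for event in logons
        if event["user"].lower() in TIER0_ACCOUNTS
        and event["host"].lower() not in TIER0_HOSTS
    ]


sample_logons = [
    {"user": "da-alice", "host": "DC01"},       # Tier-0 account on a Tier-0 host
    {"user": "da-alice", "host": "WKSTN-042"},  # Tier-0 account on a workstation
    {"user": "jsmith", "host": "WKSTN-042"},    # ordinary user, ignored
]
violations = tier0_logon_violations(sample_logons)
```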

High-Value AD Attack Detections

Attack | Detection Signal | Event ID / Log
Kerberoasting | Multiple Event ID 4769 with RC4 (0x17) encryption type from a single source | Security log on DCs — 4769
AS-REP Roasting | Event ID 4768 with Pre-Authentication Type 0 (the account does not require Kerberos pre-authentication) | Security log on DCs — 4768
DCSync | Event ID 4662 with Object Access on domain object with 1131f6aa (DS-Replication-Get-Changes) GUID | DC Security log — 4662 (requires DS Access auditing)
Golden Ticket | Event ID 4769 with unusual service ticket attributes (anomalous RC4, ticket lifetime >10h, non-standard user fields) | Security log on DCs
BloodHound collection | Unusual LDAP queries enumerating group memberships, ACLs, and trusts; high-volume LDAP traffic from unexpected sources | DC LDAP log, network monitoring
AdminSDHolder abuse | Changes to AdminSDHolder object ACL: Event ID 5136 with ObjectDN ending in CN=AdminSDHolder | DC Security log — 5136
Pass-the-Ticket | Kerberos service ticket used from a host that did not request it; requires correlation of 4768 (TGT request) and 4769 (service ticket) source IPs | DC Security log — 4768, 4769
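
As a rough illustration of the Kerberoasting row, the sketch below counts RC4 service-ticket requests per source. The field names (`event_id`, `ticket_encryption_type`, `source_ip`) and the threshold of 10 are assumptions; the events are assumed to be already limited to one short time window.

```python
from collections import Counter

RC4_HMAC = "0x17"  # encryption type named in the Kerberoasting row above


def kerberoast_suspects(events, threshold=10):
    """Return {source_ip: count} for sources with many RC4 4769 events."""
    counts = Counter(
        e["source_ip"]
        for e in events
        if e["event_id"] == 4769
        and e["ticket_encryption_type"].lower() == RC4_HMAC
    )
    return {ip: n for ip, n in counts.items() if n >= threshold}


window = (
    [{"event_id": 4769, "ticket_encryption_type": "0x17", "source_ip": "10.0.5.23"}] * 12
    + [{"event_id": 4769, "ticket_encryption_type": "0x12", "source_ip": "10.0.5.40"}] * 50
    + [{"event_id": 4769, "ticket_encryption_type": "0x17", "source_ip": "10.0.5.41"}] * 2
)
suspects = kerberoast_suspects(window)
```

A production rule would more likely count distinct service names per source than raw events, since one busy legacy application can generate many RC4 requests for a single service.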

Azure AD and Hybrid Identity

Modern enterprises operate hybrid identity environments: on-premises AD synchronised with Azure AD (since renamed Microsoft Entra ID) via AAD Connect. This creates an attack surface that spans both environments: an attacker who compromises Azure AD can pivot to on-premises AD through the sync account, and vice versa. The AAD Connect sync account typically holds directory replication rights on the on-premises domain, which is enough to perform a DCSync when password hash synchronisation is in use, so it must be treated as a Tier-0 asset.

Key Azure AD monitoring signals:

  • Conditional Access policy bypass — a sign-in that succeeded where Conditional Access should have blocked it. Azure AD's Conditional Access Insights workbook in Sentinel surfaces these.
  • Risky Sign-in alerts — Azure AD Identity Protection scores each sign-in for risk (leaked credentials, impossible travel, anonymous IP, etc.). All high-risk sign-ins should queue for immediate analyst review.
  • MFA registration changes — an attacker who has compromised an account may register their own authenticator device, enabling persistent MFA-bypassed access. Monitor for MFA method registration on accounts that are not new.
  • Service principal secret / certificate creation — attackers frequently create new credentials on existing service principals to maintain persistent access. Monitor Azure AD audit logs for Add service principal credentials events.

Privileged Access Management (PAM)

PAM platforms (CyberArk, BeyondTrust, Delinea) manage privileged credentials and sessions — storing passwords in a vault, checking them out for use, and rotating them after each session. As a monitoring tool, PAM session recording provides video-quality records of every privileged session that can be reviewed during investigations. PAM also provides analytics on privileged account usage patterns — baselining normal administrative activity and alerting on deviations.

Service Account Monitoring

Service accounts are a chronic security weakness. They often have high privileges, static passwords (sometimes never rotated), and are used by multiple systems. Attackers target them specifically because compromising a service account provides persistent, stable access that is harder to detect than compromising a human account (no user to notice unusual activity or report suspicious emails).

Detection patterns for service account abuse:

  • Interactive login (Logon Type 2) from a service account — service accounts should never have interactive logons
  • Service account authentication from an unexpected source IP (not the expected application server)
  • Service account used at an unusual time (service accounts typically operate on schedules)
  • Service account added to a new group — privilege creep via group membership
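
The first three patterns above lend themselves to a per-event check against a baseline. The following is a minimal sketch; the event fields (`logon_type`, `source_ip`, `hour`) and the baseline values are hypothetical.

```python
def service_account_anomalies(event, expected_sources, active_hours):
    """Return the reasons a service-account logon looks abnormal.

    event: dict with assumed keys logon_type (int), source_ip (str), hour (0-23).
    expected_sources: set of IPs the account normally authenticates from.
    active_hours: set of hours (0-23) in which the account normally runs.
    """
    reasons = []
    if event["logon_type"] == 2:
        reasons.append("interactive logon")
    if event["source_ip"] not in expected_sources:
        reasons.append("unexpected source IP")
    if event["hour"] not in active_hours:
        reasons.append("outside normal schedule")
    return reasons


baseline_sources = {"10.1.20.15"}
baseline_hours = {1, 2, 3}  # hypothetical nightly batch window

normal = service_account_anomalies(
    {"logon_type": 5, "source_ip": "10.1.20.15", "hour": 2},
    baseline_sources, baseline_hours,
)
suspicious = service_account_anomalies(
    {"logon_type": 2, "source_ip": "10.9.9.9", "hour": 14},
    baseline_sources, baseline_hours,
)
```

Returning the list of reasons rather than a boolean lets the alert carry its own explanation into the case notes.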

UEBA — Insider Threat Detection

User and Entity Behaviour Analytics (UEBA) establishes a baseline of normal behaviour for each user and entity, then alerts when behaviour deviates significantly from that baseline. Unlike signature-based detection, UEBA can detect novel threats and insider activity that does not match any known-bad pattern.

UEBA is most effective for insider threat detection because insiders use legitimate credentials and legitimate access — there is no malicious file to detect, no known C2 domain to block. What is detectable is the pattern: unusual access hours, access to data outside the user's normal scope, unusual data volume, access from a new device, or bulk download followed by resignation notice.
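
To make the baseline-and-deviation idea concrete, here is a deliberately simple sketch using a z-score on one feature (daily download volume). Commercial UEBA products use far richer models; the data, the feature choice, and any alerting threshold are assumptions for illustration.

```python
from statistics import mean, stdev


def deviation_score(history, observed):
    """Return how many standard deviations `observed` sits from the baseline mean."""
    if len(history) < 2:
        raise ValueError("need at least two baseline observations")
    spread = stdev(history)
    if spread == 0:
        return 0.0 if observed == mean(history) else float("inf")
    return (observed - mean(history)) / spread


# Hypothetical daily download volumes in MB for one user over two weeks.
baseline_mb = [120, 95, 130, 110, 105, 125, 115, 100, 135, 90, 110, 120, 105, 115]
typical_day = deviation_score(baseline_mb, 118)
bulk_download = deviation_score(baseline_mb, 4200)
```

A score near zero is ordinary behaviour; a score many standard deviations out, such as the bulk download here, is the kind of deviation that would be queued for review.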

Key Takeaways — Chapter 10
  • Tier-0 AD assets (DCs, privileged accounts, AAD Connect) must be monitored with the highest vigilance — their compromise means full domain compromise
  • DCSync detection requires DS Access auditing enabled on DCs — a configuration often missed; verify it is enabled
  • Azure AD hybrid identity means the AAD Connect sync account is a critical Tier-0 attack target in both environments
  • Service accounts authenticate on schedules and from known sources — any deviation is a high-confidence indicator
  • UEBA detects insider threats through behavioural deviation, not signature matching — the only viable approach for legitimate-credential abuse
Chapter 11 · ~14 min · Cloud

Cloud Security Monitoring

AWS GuardDuty findings, Security Hub, Azure Defender, cloud attack patterns, cloud-native vs SIEM-integrated monitoring, and multi-cloud SOC challenges

Extending SOC coverage to cloud environments is one of the most pressing operational challenges facing security teams today. The migration of workloads to AWS, Azure, and GCP has outpaced the adaptation of SOC tooling, processes, and analyst skills in most organisations. Cloud environments generate large volumes of security-relevant telemetry — but it arrives in different formats, through different APIs, with different data models than the on-premises logs analysts are accustomed to. This chapter covers the monitoring approach for major cloud platforms and the cloud-specific attack patterns the SOC must be able to detect.

AWS Security Monitoring

GuardDuty Findings in Depth

AWS GuardDuty is a managed threat detection service that analyses CloudTrail, VPC Flow Logs, DNS query logs, and EKS audit logs for threat indicators. It produces findings whose threat-purpose prefixes (Recon, PrivilegeEscalation, Exfiltration, and so on) align with ATT&CK tactics and which arrive pre-prioritised by severity. Understanding what each finding type means is essential for triage.

Finding Type | What It Means | Triage Priority
UnauthorizedAccess:IAMUser/ConsoleLoginSuccess.B | Multiple successful console logins for the same user from different parts of the world in a short period | High — investigate immediately
Recon:IAMUser/UserPermissions | IAM permission enumeration — attacker mapping what they can do | High — indicates post-compromise reconnaissance
PrivilegeEscalation:IAMUser/AdministrativePermissions | Attacker granted themselves admin-level IAM permissions | Critical — full account compromise possible
Persistence:IAMUser/UserPermissions | Unexpected IAM policy or role change — establishing persistence | High
CryptoCurrency:EC2/BitcoinTool.B | EC2 instance communicating with known cryptocurrency mining pool | Medium — likely resource theft, not targeted attack
Backdoor:EC2/C&CActivity.B | EC2 instance communicating with known C2 infrastructure | Critical — active compromise
Exfiltration:S3/ObjectRead.Unusual | Unusual S3 read volume or pattern — potential data exfiltration | High — scope the bucket and data classification
Stealth:IAMUser/CloudTrailLoggingDisabled | CloudTrail logging was disabled — attacker covering tracks | Critical — re-enable immediately; all activity during gap is blind

AWS Security Hub

Security Hub aggregates findings from GuardDuty, Inspector (vulnerability scanning), Macie (sensitive data discovery), IAM Access Analyzer, Firewall Manager, and third-party tools, normalises them to the AWS Security Finding Format (ASFF), and provides a centralised console for prioritisation. Security Hub findings can be forwarded to the SIEM via EventBridge → Lambda or direct SIEM connectors, providing a single aggregation point for all AWS security findings.

Azure Security Monitoring

Microsoft Defender for Cloud (formerly Azure Security Center + Azure Defender) provides security posture management and threat detection across Azure workloads. It integrates natively with Microsoft Sentinel, making it the preferred monitoring approach for Azure-heavy environments.

  • Defender for Servers — extends Microsoft Defender for Endpoint capabilities to Azure VMs and Arc-enabled servers, providing EDR coverage for cloud workloads
  • Defender for Containers — runtime threat detection for Azure Kubernetes Service (AKS) and container registries
  • Defender for Databases — threat detection for Azure SQL, Cosmos DB, and open-source databases, including SQL injection detection and unusual access patterns
  • Defender CSPM — cloud security posture management; continuously assesses resource configuration against security benchmarks

Cloud Attack Patterns the SOC Must Detect

IAM Privilege Escalation

The most common high-impact cloud attack pattern. An attacker with limited IAM permissions exploits policy misconfigurations to gain administrative access. Detection: CloudTrail events showing IAM policy attachment (AttachUserPolicy, AttachRolePolicy), new role creation (CreateRole), and iam:PassRole combined with EC2 or Lambda execution. Any IAM permission grant to a user or role should alert when performed by an account that did not previously have that right.
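
A minimal sketch of the "who is granting IAM rights" check follows. It assumes CloudTrail records have already been parsed into dictionaries; the `approved_granters` baseline and the sample ARNs are hypothetical.

```python
# CloudTrail event names from the paragraph above that grant or create IAM rights.
IAM_GRANT_EVENTS = {"AttachUserPolicy", "AttachRolePolicy", "CreateRole"}


def unexpected_iam_grants(events, approved_granters):
    """Return IAM grant events performed by principals outside the approved set.

    events: CloudTrail-style dicts with eventSource, eventName, userIdentity.arn.
    approved_granters: ARNs that have legitimately made IAM changes before
                       (a hypothetical baseline built from historical logs).
    """
    return [
        e
        for e in events
        if e["eventSource"] == "iam.amazonaws.com"
        and e["eventName"] in IAM_GRANT_EVENTS
        and e["userIdentity"]["arn"] not in approved_granters
    ]


approved = {"arn:aws:iam::111122223333:user/iam-admin"}
trail = [
    {"eventSource": "iam.amazonaws.com", "eventName": "AttachUserPolicy",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/iam-admin"}},
    {"eventSource": "iam.amazonaws.com", "eventName": "AttachRolePolicy",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/dev-intern"}},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/dev-intern"}},
]
alerts = unexpected_iam_grants(trail, approved)
```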

SSRF to EC2 Metadata Service

Server-Side Request Forgery in a cloud-hosted application allows an attacker to reach the EC2 instance metadata service (http://169.254.169.254/), retrieving temporary IAM credentials for the instance's role. Detection: CloudTrail API calls made with an EC2 instance role's temporary credentials where the source IP is not the expected EC2 instance IP. GuardDuty's UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS finding covers use of those credentials from outside AWS.
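
The source-IP comparison described above can be sketched as follows. The sketch assumes that, for EC2 instance role sessions, the CloudTrail `userIdentity.arn` ends with the instance ID as the session name, and that an inventory mapping instance IDs to their known IPs is available; the inventory, ARNs, and IPs shown are hypothetical.

```python
def instance_credential_misuse(events, instance_ips):
    """Flag API calls made with an instance role session from an unexpected IP.

    events: CloudTrail-style dicts with userIdentity.arn and sourceIPAddress.
    instance_ips: mapping of instance ID -> set of IPs that instance uses
                  (hypothetical inventory, e.g. from an asset database).
    """
    flagged = []
    for event in events:
        session_name = event["userIdentity"]["arn"].rsplit("/", 1)[-1]
        if not session_name.startswith("i-"):
            continue  # not an EC2 instance role session
        allowed = instance_ips.get(session_name, set())
        if event["sourceIPAddress"] not in allowed:
            flagged.append(event)
    return flagged


inventory = {"i-0abc1234def567890": {"10.20.1.15", "203.0.113.10"}}
role_arn = "arn:aws:sts::111122223333:assumed-role/app-role/i-0abc1234def567890"
trail = [
    {"userIdentity": {"arn": role_arn}, "sourceIPAddress": "10.20.1.15"},
    {"userIdentity": {"arn": role_arn}, "sourceIPAddress": "198.51.100.77"},
    {"userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
     "sourceIPAddress": "198.51.100.77"},
]
flagged = instance_credential_misuse(trail, inventory)
```

In real logs, calls made by AWS services on the instance's behalf can carry a service hostname in `sourceIPAddress`, so a production version would need an exclusion for those values.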

S3 Data Exfiltration

Bulk GetObject operations from S3 buckets containing sensitive data. Detection: S3 access logs (must be explicitly enabled) showing unusual GetObject volume, presigned URL generation for large numbers of objects, cross-account replication configuration changes. Macie findings on buckets containing PII or sensitive data should be treated as high-priority.

CloudTrail Logging Disabled

The first action many attackers take after establishing cloud access is disabling CloudTrail to eliminate the audit trail. This must be detected and remediated immediately. AWS Config rule CLOUD_TRAIL_ENABLED can trigger automatically when CloudTrail is disabled; GuardDuty produces a Stealth finding; and an EventBridge rule on the StopLogging API call can send an immediate notification regardless of GuardDuty configuration.
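
The EventBridge rule mentioned above is driven by an event pattern. The sketch below builds a pattern for the StopLogging call and includes a tiny local matcher so the pattern can be unit-tested; the matcher handles only exact-value matching, a small subset of EventBridge's real matching rules, and the sample events are hypothetical.

```python
import json

# Event pattern for an EventBridge rule that matches the CloudTrail
# StopLogging API call as delivered via CloudTrail.
STOP_LOGGING_PATTERN = {
    "source": ["aws.cloudtrail"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["cloudtrail.amazonaws.com"],
        "eventName": ["StopLogging"],
    },
}


def matches(pattern, event):
    """Exact-value matcher for local testing (not the full EventBridge syntax)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


stop_event = {
    "source": "aws.cloudtrail",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging"},
}
describe_event = {
    "source": "aws.cloudtrail",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "cloudtrail.amazonaws.com", "eventName": "DescribeTrails"},
}
pattern_json = json.dumps(STOP_LOGGING_PATTERN, indent=2)  # paste into the rule definition
```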

Cloud-Native vs SIEM-Integrated Monitoring

A recurring architectural decision: should cloud security findings stay in cloud-native tools (GuardDuty console, Defender for Cloud) or be ingested into the central SIEM?

  • Cloud-native: lowest latency, deepest context, no ingestion cost. Best when cloud specialists investigate cloud findings and the SOC is cloud-platform-trained.
  • SIEM-integrated: enables correlation with on-premises and endpoint data, allows a single analyst console, and enables cross-cloud correlation. Essential when cloud incidents involve lateral movement to or from on-premises environments.

Most mature SOCs use both: cloud-native tools for initial alert investigation, with findings forwarded to the SIEM for correlation, case management, and long-term retention.

Key Takeaways — Chapter 11
  • GuardDuty provides pre-correlated, pre-prioritised findings from CloudTrail, VPC Flow Logs, and DNS — enable it in all accounts and all regions
  • CloudTrail logging disabled is a critical finding requiring immediate response — an EventBridge rule on StopLogging provides faster alerting than GuardDuty
  • IAM privilege escalation is the highest-impact cloud attack pattern — any IAM permission grant by a non-break-glass account should alert
  • SSRF to the metadata service can be detected through CloudTrail source-IP anomalies on instance-role credentials even without payload inspection
  • SIEM integration enables cross-environment correlation — essential for incidents spanning cloud and on-premises environments
Chapter 12 · ~13 min · Vuln Mgmt

Vulnerability Management

Scanning tools, CVSS vs EPSS vs KEV, patch SLAs, attack surface management, and threat-informed prioritisation

Vulnerability management is the SOC-adjacent function responsible for identifying, prioritising, and tracking the remediation of security vulnerabilities across the organisation's attack surface. "SOC-adjacent" is deliberate — in many organisations vulnerability management sits in a separate team (often under risk or IT operations). But the relationship between the SOC and vulnerability management is critical: new vulnerabilities drive urgent detection engineering (can we detect exploitation attempts?), and threat intelligence from the SOC informs which vulnerabilities are being actively exploited against organisations like yours.

Scanning Foundations

Vulnerability scanners assess hosts and applications for known vulnerabilities by: probing service banners to determine software versions, comparing identified versions to vulnerability databases (NVD, vendor advisories), running authenticated checks against OS configuration and installed software lists, and in some cases actively testing for exploitability.

Leading platforms:

  • Tenable Nessus / Tenable.io — market-leading vulnerability scanner with the largest plugin library. Agent-based and agentless scanning. Strong compliance scanning alongside vulnerability detection.
  • Qualys VMDR — cloud-native vulnerability management with strong asset inventory integration. TruRisk score incorporating EPSS and asset criticality.
  • Rapid7 InsightVM — strong remediation workflow integration with IT ticketing systems. Live dashboards showing risk reduction in real time.
  • Microsoft Defender Vulnerability Management — integrated into the MDE console; provides vulnerability data on enrolled endpoints without a separate scanner deployment.

Vulnerability Prioritisation — Beyond CVSS

CVSS (Common Vulnerability Scoring System) scores vulnerabilities on a 0–10 scale based on intrinsic characteristics: attack vector, attack complexity, privileges required, and impact. It is the standard severity reference — but it is a poor prioritisation tool because it describes a vulnerability's theoretical severity, not the actual risk it poses in your environment.

Two more useful prioritisation signals:

EPSS — Exploit Prediction Scoring System

EPSS (maintained by FIRST) predicts the probability that a vulnerability will be exploited in the wild within 30 days, based on machine learning models trained on historical exploitation data. A CVE with CVSS 9.8 and EPSS 0.1% is theoretically severe but unlikely to be exploited in the near term, while a CVE with CVSS 7.2 and EPSS 85% is very likely to see exploitation within the month. EPSS therefore reorders much of a CVSS-ranked queue: most critical-CVSS vulnerabilities have low EPSS scores, and some medium-CVSS vulnerabilities have high EPSS scores because exploits are actively circulating.

CISA KEV — Known Exploited Vulnerabilities Catalogue

The CISA KEV catalogue is a curated list of vulnerabilities confirmed to have been exploited in real-world attacks. Under Binding Operational Directive 22-01, US federal civilian agencies must remediate each KEV entry by the due date CISA assigns to it (initially two weeks for CVEs assigned from 2021 onward and six months for older CVEs). The KEV is the highest-confidence signal for urgent patching: if a vulnerability is in the KEV, exploitation against real organisations has been confirmed. Every organisation should treat KEV additions as triggering immediate patching review regardless of their internal CVSS-based SLAs.
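
One way to combine the three signals discussed so far is a simple sort: KEV-listed findings first, then EPSS, then CVSS. This is a sketch of one possible ordering, not a standard; the finding IDs and scores are made up, with the first two mirroring the CVSS 9.8 / EPSS 0.1% and CVSS 7.2 / EPSS 85% examples above.

```python
def prioritise(vulns):
    """Order findings: KEV-listed first, then by EPSS, then by CVSS (all descending)."""
    return sorted(
        vulns,
        key=lambda v: (v["in_kev"], v["epss"], v["cvss"]),
        reverse=True,
    )


findings = [
    {"id": "VULN-A", "cvss": 9.8, "epss": 0.001, "in_kev": False},
    {"id": "VULN-B", "cvss": 7.2, "epss": 0.85, "in_kev": False},
    {"id": "VULN-C", "cvss": 6.5, "epss": 0.30, "in_kev": True},
]
ordered_ids = [v["id"] for v in prioritise(findings)]
```

Note that the critical-CVSS finding ends up last: confirmed exploitation and predicted exploitation both outrank theoretical severity in this scheme.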

Patch Management SLAs

Severity | CVSS Range | Typical SLA | KEV Override
Critical | 9.0–10.0 | 72 hours – 7 days | Immediate (24–48 hours)
High | 7.0–8.9 | 14–30 days | 7 days
Medium | 4.0–6.9 | 60–90 days | 30 days
Low | 0.1–3.9 | Next patch cycle / best effort | As above

SLAs must be realistic — an organisation that sets a 72-hour SLA for critical vulnerabilities but has no automated patching capability and a change control process requiring a two-week CAB review will miss every SLA. The right SLA is the most aggressive timeline that the organisation can actually meet, with emergency procedures for KEV-class vulnerabilities.

Attack Surface Management (ASM)

Attack Surface Management is the continuous discovery and assessment of internet-exposed assets. The foundational problem it addresses: organisations do not have a complete, current inventory of their internet-facing assets. Shadow IT, forgotten development servers, acquired-company infrastructure, and cloud sprawl create unknown exposure that scanners can only assess if they know the assets exist.

ASM platforms (Censys, runZero, Mandiant ASM, CrowdStrike Falcon Surface) continuously scan the internet from an attacker's perspective, discovering assets associated with your organisation by tracking IP ranges, ASN ownership, certificate subjects, and domain registration patterns. They surface new exposures, such as an RDP port suddenly open on a public IP, an expired certificate, or a staging server running a vulnerable version, before attackers find them.

Threat-Informed Prioritisation

The highest maturity in vulnerability prioritisation combines EPSS and KEV data with threat intelligence specific to your sector and the threat actors most likely to target you. A vulnerability being actively exploited by an APT that has historically targeted your industry sector warrants emergency treatment even if its CVSS score is moderate. Intelligence from ISACs, CISA advisories, and commercial threat intel feeds provides this context.

Key Takeaways — Chapter 12
  • CVSS describes theoretical severity — EPSS predicts actual exploitation probability; use EPSS for operational prioritisation
  • CISA KEV is the highest-confidence signal for urgent patching — treat every KEV addition as triggering immediate patching review
  • Patch SLAs must match organisational patching capability — unrealistic SLAs produce systematic non-compliance and false assurance
  • ASM discovers internet-exposed assets you don't know you have — attacker-perspective scanning surfaces unknown exposures
  • Threat-informed prioritisation combines KEV, EPSS, and sector-specific actor intelligence for highest-risk accuracy
Chapter 13 · ~15 min · SOAR

SOAR & Automation

What SOAR is and isn't, automation use cases, playbook building, platform comparison, API-based automation, and measuring ROI

SOAR — Security Orchestration, Automation, and Response — is one of the most impactful technologies in the modern SOC when deployed thoughtfully, and one of the most expensive ways to automate yourself into false confidence when deployed carelessly. The fundamental value proposition is real: automating repetitive, well-defined tasks frees analysts for the complex work that requires human judgment. The risk is equally real: automated response actions that fire on false positives cause outages, and over-automated SOCs lose the analytical depth that makes human analysts effective.

Definition

SOAR platforms provide three capabilities: orchestration (connecting security tools through APIs and workflows), automation (executing defined response actions without human intervention), and response (structured playbooks that guide human analysts through complex decisions). The ratio of automation to human guidance is a design choice — not a product feature.

What to Automate — and What Not to

The automation decision matrix is simple: automate tasks that are high-volume, low-risk, well-defined, and repetitive. Do not automate decisions that are low-volume, high-stakes, context-dependent, or likely to affect production systems on false positives.

Automate | Do Not Automate
IOC reputation lookups (VirusTotal, Shodan, AbuseIPDB) | Host isolation on a production database server
Asset enrichment (who owns this IP? what team owns this host?) | Firewall blocks on critical infrastructure paths
User context (what department? manager? recent HR flags?) | Account lockout for C-suite accounts without human review
Ticket creation and routing based on alert type | Any action on a poorly-tuned, high-FP rule
Notification to affected users (phishing reports) | Decisions requiring environmental context the playbook cannot know
Passive data collection (screenshots, log export) | Irreversible actions (file deletion, data purge)

Building an Enrichment Playbook

Enrichment automation is the highest-ROI SOAR use case: automatically pulling context for every alert so the analyst receives a pre-enriched case rather than a raw alert. A typical enrichment playbook for an alert containing an external IP address:

  1. Extract IP from alert
  2. Check internal asset database — is this IP one of ours?
  3. Query VirusTotal API — reputation, AV hits, historical passive DNS
  4. Query Shodan — what services is this IP running? ASN? Country?
  5. Query AbuseIPDB — community abuse reports
  6. Check internal threat intel platform — has this IP appeared in previous incidents?
  7. Query SIEM — has this IP appeared in other alerts in the last 30 days?
  8. Write enrichment summary to case notes
  9. Update alert priority based on enrichment (confirmed malicious → escalate; known clean → close)

This workflow, executed manually, takes 10–20 minutes per alert. Automated, it completes in under 30 seconds and produces a richer result. If a SOC handles 200 alerts per day and 60% of them involve external IP enrichment, this single playbook saves 20–40 analyst-hours daily.
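
The arithmetic behind that estimate can be checked with a short calculation. The function name and the 30-second automated runtime parameter are illustrative; the results come out slightly under 20 and 40 hours because the automated runtime is subtracted from each alert's manual time.

```python
def daily_hours_saved(alerts_per_day, enrichable_share, minutes_manual,
                      seconds_automated=30):
    """Analyst-hours saved per day by automating enrichment."""
    enrichable_alerts = alerts_per_day * enrichable_share
    saved_minutes = enrichable_alerts * (minutes_manual - seconds_automated / 60)
    return saved_minutes / 60


low_estimate = daily_hours_saved(200, 0.60, 10)   # 120 alerts x 9.5 min
high_estimate = daily_hours_saved(200, 0.60, 20)  # 120 alerts x 19.5 min
```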

Tines — Enrichment Story (pseudocode)
Automatic IOC enrichment on new SIEM alert

# Trigger: New high-severity alert from SIEM

# Step 1: IP reputation from VirusTotal
action: HTTP_Request
  url: "https://www.virustotal.com/api/v3/ip_addresses/{{alert.src_ip}}"
  headers: {"x-apikey": "{{VT_API_KEY}}"}
  on_success: store_as "vt_result"

# Step 2: exposed services and ownership from Shodan
action: HTTP_Request
  url: "https://api.shodan.io/shodan/host/{{alert.src_ip}}"
  params: {"key": "{{SHODAN_KEY}}"}
  on_success: store_as "shodan_result"

# Step 3: write the enrichment summary to the case
# (last_analysis_stats holds per-verdict counts, so only the malicious count is reported)
action: Update_Case
  case_id: "{{alert.case_id}}"
  notes: |
    "VT: {{vt_result.data.attributes.last_analysis_stats.malicious}} engines flagged malicious
    Shodan: {{shodan_result.org}} | {{shodan_result.country_name}}
    Ports: {{shodan_result.ports}}"

# Step 4: route on the reputation result
action: Route_Case
  if: vt_result.data.attributes.last_analysis_stats.malicious > 5
  then: escalate_to_tier2, set_priority "Critical"
  else: route_to_normal_queue

Containment Automation

Automated containment — host isolation on confirmed malware, account lockout on confirmed credential compromise — is high-impact but requires rigorous false-positive management and human oversight mechanisms. Best practices:

  • Only automate on high-confidence, low-FP rules — a rule with a 50% false positive rate should never trigger automated containment
  • Mandatory human notification on every action — even if the isolation happens automatically, a notification goes to the analyst immediately and an easy "undo" is one click away
  • Exclusion for critical assets — production database servers, AD domain controllers, and network infrastructure should be in an exclusion list that prevents automatic isolation regardless of the alert; these require manual review
  • Time-bounded actions — automated isolation that expires after 4 hours unless actively extended forces human review of every automated action
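
The four guardrails above can be encoded as a single decision function that a playbook calls before isolating anything. This is a minimal sketch: the exclusion list, the 5% false-positive ceiling, and the `Decision` structure are assumptions; only the 4-hour expiry comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

CRITICAL_ASSETS = {"dc01", "proddb01", "core-fw01"}  # hypothetical exclusion list
MAX_FP_RATE = 0.05        # assumed ceiling for rules allowed to auto-contain
ISOLATION_TTL_HOURS = 4   # time-bounded isolation, as in the example above


@dataclass
class Decision:
    isolate: bool
    notify_analyst: bool
    expires_after_hours: Optional[int]
    reason: str


def containment_decision(host, rule_fp_rate):
    """Apply the guardrails in order; the analyst is notified in every case."""
    if host.lower() in CRITICAL_ASSETS:
        return Decision(False, True, None, "critical asset: manual review required")
    if rule_fp_rate > MAX_FP_RATE:
        return Decision(False, True, None, "rule false-positive rate too high")
    return Decision(True, True, ISOLATION_TTL_HOURS, "auto-isolated pending review")


dc_alert = containment_decision("DC01", 0.01)
noisy_rule = containment_decision("wkstn-17", 0.50)
clean_hit = containment_decision("wkstn-17", 0.02)
```

Checking the exclusion list before the rule quality means a critical asset is never isolated automatically, however confident the detection.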

SOAR Platforms

Platform | Strength | Best For
Splunk SOAR (Phantom) | Largest app marketplace, Python-native playbooks, deep Splunk integration | Splunk SIEM shops; teams with Python development capability
Palo Alto XSOAR | Enterprise-grade, broad integration library, strong incident management | Large enterprises with complex multi-tool environments
Tines | No-code/low-code story builder, fast to build, easy to maintain | SOCs without dedicated development resources; getting value quickly
Torq | Modern no-code interface, strong AI integration, fast deployment | Cloud-native SOCs; automation-first programmes

Key Takeaways — Chapter 13
  • Automate high-volume, low-risk, well-defined tasks; keep human judgment for high-stakes, context-dependent decisions
  • Enrichment automation (IOC lookup, asset context, user risk) is the highest-ROI SOAR use case — measurable time savings on every alert
  • Automated containment requires: high-confidence rules, mandatory human notification, critical asset exclusions, and time-bounded actions
  • SOAR playbooks are code — version-control them, review them, test them before deployment
  • Measure SOAR ROI: analyst hours saved, MTTD/MTTR reduction, alert volume handled per analyst — without measurement, automation investment is unjustified
Chapter 14 · ~13 min · Maturity

SOC Metrics, Maturity & Shift Left

MTTD, MTTR, coverage metrics, maturity models, staffing, DevSecOps integration, and building a learning culture

A SOC without measurement is a SOC that cannot improve. The metrics discussed in this chapter serve two purposes: operational (understanding how well the SOC is performing right now) and strategic (understanding what investment would most improve performance). Getting both right requires choosing metrics that reflect genuine security outcomes rather than activity proxies — alert count is activity, not security; detection coverage percentage is a meaningful security metric.

Core SOC Metrics

Metric | Formula / Definition | Target / Benchmark | Improves With
MTTD | Mean Time to Detect — average time from attacker access to SOC alert | <24 hours (mature SOC); industry averages for identifying a breach are often cited at ~200 days | Detection coverage, log completeness, threat hunting
MTTR | Mean Time to Respond/Remediate — detection to incident closure | <4 hours for SEV-1; <24 hours for SEV-2 | Playbook quality, SOAR automation, team coordination
MTTC | Mean Time to Contain — detection to threat contained | <1 hour for active threats | EDR isolation capability, playbook automation
False Positive Rate | FP alerts / total alerts (per rule and overall) | <10% per rule; <30% overall queue | Detection engineering tuning, SOAR pre-filtering
Detection Coverage % | ATT&CK techniques with validated detection / techniques in threat model | 70%+ for high-priority techniques | Detection engineering backlog execution
Alerts per Analyst per Day | Total triaged alerts / analyst headcount | 20–50 for Tier-1 (high quality); >100 indicates process problems | SOAR automation, detection tuning, staffing
Repeat Incident Rate | Incidents sharing root cause with prior incidents / total incidents | <10% | Post-incident learning, root cause remediation
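
Several of these metrics reduce to simple arithmetic over incident timestamps and counts. The sketch below shows the calculations with made-up data; the incident tuples and the alert and technique counts are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_hours(intervals):
    """Mean elapsed hours across (start, end) datetime pairs."""
    return mean((end - start).total_seconds() / 3600 for start, end in intervals)


t0 = datetime(2024, 1, 1)
# Hypothetical incidents: (attacker access, SOC alert, incident closure).
incidents = [
    (t0, t0 + timedelta(hours=6), t0 + timedelta(hours=9)),
    (t0, t0 + timedelta(hours=18), t0 + timedelta(hours=23)),
]

mttd = mean_hours((access, alert) for access, alert, _ in incidents)
mttr = mean_hours((alert, closed) for _, alert, closed in incidents)
false_positive_rate = 30 / 200   # FP alerts / total alerts
detection_coverage = 42 / 60     # validated techniques / techniques in threat model
```

In practice the hard part is not the arithmetic but establishing the "attacker access" timestamp, which usually only becomes known after investigation.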

SOC Maturity Models

The SANS SOC Survey maturity framework describes five levels:

  1. Level 1 — Reactive: respond to alerts from SIEM and security tools. No proactive hunting. Incident response is ad hoc.
  2. Level 2 — Managed: documented processes for common incident types. Metrics tracked. Tuning of detection content occurs reactively.
  3. Level 3 — Defined: detection engineering programme. Regular tabletop and purple team exercises. SOAR deployed for enrichment automation. ATT&CK coverage mapped.
  4. Level 4 — Measured: all core metrics tracked and trended. Detection coverage targets set and measured. Threat hunting conducted on regular cadence. SOAR covers containment automation for high-confidence rules.
  5. Level 5 — Optimising: continuous improvement embedded. Intelligence-led detection. Red team and purple team findings drive detection backlog. SOC feeds into product security and architecture decisions.

SOC Staffing Models and Burnout Prevention

Analyst burnout is the most significant operational risk to most SOCs — more so than any technical gap. The causes are well-understood: high alert volume, repetitive work, lack of learning opportunities, 24/7 on-call pressure, and the absence of visible impact. The mitigations:

  • Alert volume management — a Tier-1 analyst handling 200+ alerts per day is not performing triage; they are rubber-stamping. This is both a detection engineering problem and a management failure.
  • Rotation across Tier-1/2/3 work — analysts who only ever work the alert queue do not develop skills. Structured rotation that includes detection engineering projects, hunting exercises, and incident response exposure develops analysts and reduces monotony.
  • Learning culture — internal CTF competitions, "detection of the week" discussions, post-incident learning meetings, conference attendance, and certification support all signal that skill development is valued.
  • On-call load management — if on-call pages regularly at 3 AM, the detection content is miscalibrated. On-call should be genuinely exceptional, not routine.

Shift Left — Security in the Development Lifecycle

"Shift left" means moving security activities earlier in the development and deployment lifecycle — addressing security issues when they are cheapest to fix (in design and development) rather than when they are most expensive (after deployment, or after breach). For the SOC, shift left means:

  • Security requirements in development — working with developers to ensure new applications generate the logs the SOC needs to monitor them before they deploy
  • Supply chain monitoring — tracking the security posture of dependencies, container base images, and third-party components used by internal software. The SolarWinds and Log4Shell incidents demonstrated that the SOC must be able to detect exploitation of supply chain components.
  • SAST/DAST integration — Static and Dynamic Application Security Testing findings feeding the same vulnerability management workflow as infrastructure vulnerabilities, with SOC visibility into application-layer exposure

Key Takeaways — Chapter 14
  • MTTD and MTTR are the primary SOC health indicators — industry averages for identifying a breach are often cited at ~200 days; a mature SOC targets under 24 hours
  • Detection coverage percentage (ATT&CK techniques with validated detection) is a meaningful security metric; alert count is an activity proxy
  • SOC maturity progresses from reactive → managed → defined → measured → optimising; the gap between Level 3 and Level 4 is measurement and accountability
  • Analyst burnout is the highest operational risk to most SOCs — alert volume management, skill development, and learning culture are the primary mitigations
  • Shift left extends SOC responsibilities into the development lifecycle — the SOC must be able to monitor new applications from day one of deployment
Chapter 15 · ~13 min · Future

The Future of the SOC

AI/ML in the SOC, autonomous response, LLM-assisted investigation, XDR, Zero Trust implications, Purple Team as continuous improvement, and next-generation threats

The SOC of 2030 will be meaningfully different from the SOC of today — not because artificial intelligence will replace analysts, but because the analyst role will shift further from alert processing toward detection engineering, threat hunting, and adversarial simulation. The technology layer will handle more of the repetitive analytical work. The human layer will be responsible for the strategic and creative work that technology cannot reliably perform: understanding attacker intent, designing robust detection architectures, and anticipating novel threats. This final chapter addresses the realistic trajectory of the discipline.

AI and Machine Learning in the SOC — What Is Real

The gap between the AI capabilities that vendors market and the AI capabilities that actually improve SOC operations is significant. The following distinctions matter for practitioners evaluating current-generation tools:

What ML Does Well Today

  • Anomaly detection at scale — identifying unusual behaviour in high-volume telemetry streams (network traffic patterns, authentication timing, data access volumes) where rule-based approaches would require thousands of individual rules. UEBA platforms are the most mature application.
  • Alert triage scoring — ranking alerts by predicted importance based on historical analyst dispositions and contextual features. Reduces cognitive load on Tier-1 analysts by surfacing the most likely true positives first.
  • Malware classification — ML models classifying PE files as malicious or benign based on static features (PE structure, imported functions, section entropy) at high accuracy for known malware families. Less reliable for novel malware.
  • Natural language processing for phishing detection — identifying phishing emails through content analysis, domain similarity detection (typosquatting), and brand impersonation at scale.
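
To make the anomaly detection item concrete, the sketch below scores one account's daily data access volume against its own history using a robust z-score (median and median absolute deviation). It is a deliberately simple statistical baseline, not a description of how any UEBA product works; the data, the function names, and the 3.5 threshold are illustrative assumptions.

```python
from statistics import median


def robust_z_scores(values: list[float]) -> list[float]:
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median, so the MAD cannot scale the scores.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values: list[float], threshold: float = 3.5) -> list[int]:
    """Return the indexes whose absolute robust z-score exceeds the threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]


# Hypothetical daily megabytes read from a file share by one account.
daily_mb = [120, 135, 110, 128, 140, 125, 131, 4200, 119, 133]
print(flag_anomalies(daily_mb))
```

The median-based score is used here because a single extreme day would inflate a mean and standard deviation enough to hide itself; one function covers every account, where a rule-based approach would need a hand-set threshold per account.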

What LLMs Add (and Their Limitations)

Large language models, integrated into security platforms such as Microsoft Security Copilot, CrowdStrike Charlotte AI, and Google's Security AI Workbench, offer a new category of capability: natural language interaction with security data. An analyst can ask "summarise the last 24 hours of activity for this endpoint and highlight anything unusual" and receive a coherent, contextualised response in seconds. These tools can also generate draft incident reports from case notes, translate alert logic between query languages, and explain unfamiliar malware techniques in plain language.

Limitations that practitioners must internalise: LLMs hallucinate — they produce confident, plausible-sounding but factually incorrect answers. In a security context, this is dangerous. An LLM that incorrectly summarises an alert, misattributes a technique, or generates an incorrect remediation step can cause serious harm if the analyst treats its output as authoritative without verification. LLMs are productivity tools for the skilled analyst — not replacements for analytical judgment.
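
One inexpensive guardrail is to check that the specifics an LLM summary cites actually appear in the underlying evidence. The sketch below is a hedged illustration: it extracts ATT&CK technique IDs and IPv4 addresses from a summary and reports any that are absent from the raw alert text. The function name and sample strings are hypothetical, and the check catches only one narrow class of hallucination; a summary that passes can still be wrong.

```python
import re

# MITRE ATT&CK technique IDs look like T1059 or T1059.001.
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")
# Loose IPv4 pattern; good enough to compare strings, not to validate addresses.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def unsupported_claims(llm_summary: str, evidence: str) -> dict[str, set[str]]:
    """List technique IDs and IPv4 addresses cited in the summary but absent from the evidence."""
    result = {}
    for label, pattern in (("techniques", TECHNIQUE_ID), ("ip_addresses", IPV4)):
        claimed = set(pattern.findall(llm_summary))
        supported = set(pattern.findall(evidence))
        result[label] = claimed - supported
    return result


# Hypothetical LLM output and the raw alert it was asked to summarise.
summary = (
    "The host 10.0.0.5 ran PowerShell (T1059.001) and contacted "
    "203.0.113.9, consistent with T1071."
)
evidence = "alert: powershell.exe encoded command; technique=T1059.001; src=10.0.0.5"

print(unsupported_claims(summary, evidence))
```

Anything the function returns is a prompt for the analyst to verify before the summary goes into a case record.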

Autonomous SOC — Realistic Timeline

Fully autonomous SOC operation, where AI detects, investigates, and remediates incidents without human involvement, is technically possible for a narrow class of high-confidence, well-defined incidents (known malware on an isolated endpoint, confirmed phishing with no credential compromise). In the near-to-medium term it is not realistic for complex, multi-stage incidents involving novel techniques, business context, and regulatory considerations. The realistic trajectory is increasing automation of routine work, with human analysts focused on complex investigations, detection engineering, and threat hunting. The "alert monkey" role (a Tier-1 analyst processing a high-volume alert queue) will shrink; the detection engineer and threat hunter roles will grow.
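
The boundary between what can be automated and what needs an analyst can be written down as an explicit gate. The sketch below is one possible encoding under stated assumptions: the Incident fields, the category names, and the 0.95 confidence floor are all hypothetical and would be set by each organisation's own risk appetite.

```python
from dataclasses import dataclass

# Hypothetical incident categories considered well defined enough to automate.
AUTOMATABLE_CATEGORIES = {"known_malware", "confirmed_phishing"}


@dataclass
class Incident:
    category: str
    detection_confidence: float  # 0.0 to 1.0, as reported by the detection source
    hosts_affected: int
    credentials_compromised: bool
    asset_is_critical: bool


def can_auto_remediate(incident: Incident, min_confidence: float = 0.95) -> bool:
    """Allow autonomous response only for narrow, high-confidence incidents."""
    return (
        incident.category in AUTOMATABLE_CATEGORIES
        and incident.detection_confidence >= min_confidence
        and incident.hosts_affected == 1
        and not incident.credentials_compromised
        and not incident.asset_is_critical
    )


routine = Incident("known_malware", 0.99, 1, False, False)
complex_case = Incident("lateral_movement", 0.99, 4, True, True)

print(can_auto_remediate(routine))       # eligible for automated containment
print(can_auto_remediate(complex_case))  # routed to a human analyst
```

Every condition must hold before automation acts, so any doubt defaults to human review.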

Extended Detection and Response (XDR)

XDR is a platform architecture that integrates telemetry across endpoint, network, identity, cloud, and email into a single detection and investigation interface, breaking down the silos between EDR, SIEM, and network monitoring. Major XDR platforms include Microsoft Defender XDR (endpoint + email + identity + cloud apps), CrowdStrike Falcon Insight XDR (endpoint + network + cloud), and Palo Alto Networks Cortex XDR.

XDR delivers on its promise when it reduces the analyst's context-switching overhead across multiple tools and when its native telemetry integration performs cross-domain correlation more cleanly than SIEM correlation rules, which can only approximate it. It does not eliminate the need for a SIEM for long-retention data storage, compliance reporting, and integration of non-XDR data sources.
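
Cross-domain correlation is easier to reason about with a small example. The sketch below groups already-normalised events by user and reports users whose activity spans several telemetry domains within a short window. The event tuples, the 30-minute window, and the three-domain threshold are illustrative assumptions; it does not represent any vendor's correlation engine.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical normalised events: (timestamp, domain, user, description).
events = [
    (datetime(2025, 1, 6, 9, 2), "email", "alice", "clicked link in external email"),
    (datetime(2025, 1, 6, 9, 5), "endpoint", "alice", "office app spawned powershell"),
    (datetime(2025, 1, 6, 9, 11), "identity", "alice", "sign-in from new country"),
    (datetime(2025, 1, 6, 9, 30), "endpoint", "bob", "usb device mounted"),
    (datetime(2025, 1, 6, 14, 0), "identity", "bob", "password changed"),
]


def correlate(events, window=timedelta(minutes=30), min_domains=3):
    """Return users with events from at least min_domains domains inside one window."""
    by_user = defaultdict(list)
    for event in sorted(events, key=lambda e: e[0]):
        by_user[event[2]].append(event)

    hits = {}
    for user, user_events in by_user.items():
        for i, first in enumerate(user_events):
            # Events are time-ordered, so this keeps those within the window of `first`.
            in_window = [e for e in user_events[i:] if e[0] - first[0] <= window]
            if len({e[1] for e in in_window}) >= min_domains:
                hits[user] = in_window
                break
    return hits


for user, chain in correlate(events).items():
    print(user, [domain for _, domain, _, _ in chain])
```

The hard part in practice is the normalisation this sketch assumes has already happened: a shared user identifier and a common timestamp across email, endpoint, and identity telemetry.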

Zero Trust and SOC Implications

Zero Trust architecture — never trust, always verify; least-privilege access; assume breach — changes what the SOC monitors. In a Zero Trust network, every access request generates an authentication and authorisation decision that is logged, and lateral movement is constrained by microsegmentation. This produces different monitoring priorities:

  • Identity becomes the primary perimeter — authentication logs become the highest-value data source
  • East-west lateral movement becomes harder and more detectable — microsegmentation means unexpected cross-segment traffic is a high-fidelity anomaly
  • The monitoring surface shifts from network perimeter to identity plane and application layer
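
The second point, that unexpected cross-segment traffic becomes a high-fidelity anomaly, can be sketched as a comparison of observed flows against the segmentation policy. The segment names, address ranges, and allowed flows below are hypothetical; in a real deployment they would be derived from the microsegmentation platform's own policy export.

```python
import ipaddress
from typing import Optional

# Hypothetical segments and the cross-segment flows that policy permits.
SEGMENTS = {
    "web": ipaddress.ip_network("10.10.1.0/24"),
    "app": ipaddress.ip_network("10.10.2.0/24"),
    "db": ipaddress.ip_network("10.10.3.0/24"),
}
ALLOWED_FLOWS = {("web", "app", 8443), ("app", "db", 5432)}


def segment_of(ip: str) -> Optional[str]:
    """Map an IP address to its segment name, or None if it is outside all segments."""
    addr = ipaddress.ip_address(ip)
    for name, network in SEGMENTS.items():
        if addr in network:
            return name
    return None


def is_unexpected(src_ip: str, dst_ip: str, dst_port: int) -> bool:
    """True when a flow crosses modelled segments and policy does not allow it."""
    src, dst = segment_of(src_ip), segment_of(dst_ip)
    if src is None or dst is None or src == dst:
        return False  # Outside the modelled segments, or intra-segment traffic.
    return (src, dst, dst_port) not in ALLOWED_FLOWS


print(is_unexpected("10.10.1.5", "10.10.3.7", 5432))  # web talking directly to db
print(is_unexpected("10.10.2.4", "10.10.3.7", 5432))  # app to db, permitted by policy
```

Because the policy enumerates what is allowed, anything outside it needs no statistical baseline to be worth an alert.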

The Purple Team as Continuous Improvement Engine

The Purple Team model — combining offensive red team techniques with defensive blue team participation in real time — is the most effective mechanism for continuously improving SOC detection capability. Rather than a quarterly red team exercise that produces a report the SOC reads and sets aside, purple team exercises actively measure detection coverage and tuning in your real environment, produce specific detection improvements, and build collaboration between offensive and defensive teams. Organisations that run regular purple team exercises progressively close detection gaps that tabletop exercises and theoretical coverage mapping cannot reveal.
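
Measured coverage from an exercise reduces to simple arithmetic over the techniques that were actually executed. The sketch below assumes a hypothetical results mapping from ATT&CK technique ID to whether a detection fired; the outcomes are invented for illustration.

```python
# Hypothetical results from one purple team exercise:
# technique ID -> whether the SOC's detections fired when it was executed.
exercise_results = {
    "T1059.001": True,   # PowerShell
    "T1003.001": False,  # LSASS memory
    "T1021.002": True,   # SMB/Windows admin shares
    "T1053.005": False,  # Scheduled task
    "T1566.001": True,   # Spearphishing attachment
}


def coverage_report(results: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (percentage of executed techniques detected, sorted list of gaps)."""
    if not results:
        return 0.0, []
    detected = sum(results.values())
    gaps = sorted(t for t, fired in results.items() if not fired)
    return 100.0 * detected / len(results), gaps


pct, gaps = coverage_report(exercise_results)
print(f"Detected {pct:.0f}% of executed techniques; gaps: {gaps}")
```

The list of gaps is the more useful output: each entry becomes a specific detection engineering task, and the percentage tracks progress between exercises.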

Next-Generation Threats

The threat landscape the SOC of 2030 will face is already emerging:

  • AI-generated phishing at scale — LLMs can generate highly personalised, grammatically perfect phishing emails at zero marginal cost. Volume and quality will both increase. Detection must shift from linguistic analysis to authentication-based and behavioural signals.
  • AI-assisted malware — LLMs can help attackers write custom exploits, generate polymorphic malware variants, and optimise evasion techniques. The barrier to creating novel, detection-resistant malware decreases.
  • Deepfake social engineering — audio and video deepfakes used to impersonate executives in BEC-style attacks. Several significant wire fraud cases have already involved deepfake audio. Detection requires out-of-band verification procedures for high-value transactions.
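  • Supply chain as primary vector: the SolarWinds compromise showed that a single trusted software update can give an attacker access to thousands of organisations simultaneously, and the XZ Utils backdoor, caught before it reached most stable Linux distribution releases, showed how close a widely deployed open-source component came to the same outcome. SOC monitoring of software update activity and third-party component behaviour will become a standard control.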
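
Monitoring software update activity can start from a behavioural baseline. The sketch below flags network destinations that an updater process has not contacted before. Process names and domains are hypothetical placeholders (using the reserved .example suffix), and a baseline of this kind is only one signal, since a compromised update may arrive through the legitimate channel.

```python
# Hypothetical baseline of domains each updater process normally contacts.
# The .example names are placeholders, not real vendor infrastructure.
UPDATE_BASELINE = {
    "vendor_updater.exe": {"updates.vendor.example", "cdn.vendor.example"},
    "pkg-manager": {"mirror.distro.example"},
}


def unexpected_destinations(process: str, observed: set[str]) -> set[str]:
    """Return observed domains that are absent from the process's baseline."""
    return observed - UPDATE_BASELINE.get(process, set())


observed = {"updates.vendor.example", "telemetry.unknown.example"}
print(unexpected_destinations("vendor_updater.exe", observed))
```

A process with no baseline entry returns every observed destination, which errs towards review instead of silence.
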
Key Takeaways — Chapter 15
  • ML excels at anomaly detection, alert scoring, and malware classification; LLMs add productivity but hallucinate — treat their output as a draft, not a verdict
  • Autonomous SOC for routine, high-confidence incidents is realistic; full automation of complex investigations is not near-term
  • XDR reduces analyst context-switching by integrating cross-domain telemetry; it complements rather than replaces the SIEM
  • Zero Trust shifts the primary monitoring surface to identity — authentication logs become the highest-value data source in a Zero Trust architecture
  • Purple Team exercises continuously improve detection coverage more effectively than any theoretical mapping — the gap between what you think you can detect and what you actually detect is revealed only by real testing