OSINT Investigation — Learn

Red Team · Medium

Open Source Intelligence

Master the full OSINT methodology — from passive DNS enumeration and Google dorking through breach data correlation, Shodan infrastructure mapping, job posting analysis, and the defensive hardening strategies that limit what attackers can discover before touching a single system.

What is it?

OSINT — Open Source Intelligence

CEH Domain: Module 02, Footprinting & Reconnaissance · Passive vs Active Reconnaissance

Open Source Intelligence is the collection and analysis of information from publicly available sources. In penetration testing, OSINT is the first phase of every engagement — gathering maximum information about the target without making a single connection to their systems. The goal is to build a complete picture of the attack surface before active scanning begins, so that scanning and exploitation can be directed with precision rather than noise.

Effective OSINT can reveal employee names and emails (phishing targets), technology stack (exploitable software versions), infrastructure (IP ranges and hosting providers), vendor relationships (supply chain pivot paths), and sometimes credentials from past breaches — all without the target knowing you are looking. For the CEH exam, OSINT sits squarely within Module 02: Footprinting and Reconnaissance, and distinguishing between passive and active reconnaissance is a direct exam objective.

💡 Passive vs Active Reconnaissance: OSINT is passive. You query public third-party sources, never the target's own systems, so it generally does not need the target's authorisation, although data protection law, site terms of service, and rules on handling breached credentials still apply. Active reconnaissance (connecting to target systems to probe them) requires written permission. The CEH tests this distinction explicitly, so knowing which category a given technique falls into is exam-critical.

Why OSINT Comes First — Every Time

The reconnaissance phase determines the quality of everything that follows. An attacker who skips OSINT and jumps straight to scanning will generate more noise, miss more attack vectors, and look less credible in a professional engagement than one who arrives knowing the organisation's internal tool names, their key personnel, and which of their subdomains is still running a three-year-old CMS.

OSINT-first is not just a methodology preference — it is operationally significant. Network intrusion detection systems log active scans. Firewalls record connection attempts. But no alarm fires when someone reads a company's LinkedIn page, searches their GitHub repositories, or queries a third-party breach database. The asymmetry is striking: an attacker can spend days building a detailed picture of a target's infrastructure while the target has no awareness whatsoever that reconnaissance is underway.

📌 Non-Technical Analogy

Before a professional burglar cases a target building, they do not walk up and rattle the front door handle. They observe from a distance. They read the building's planning permission records at the council office. They check the company website for office hours. They watch which delivery services come and go, noting the entry points used. They read reviews left by employees on job sites to understand the internal layout and security culture. None of this requires them to set foot on the property — and none of it triggers an alarm. OSINT reconnaissance is exactly this systematic pre-approach observation, conducted entirely through publicly accessible records that the target has no ability to monitor or restrict.

How it works

The OSINT Methodology

CEH Objective: Footprinting methodology (competitive intelligence, website footprinting, email footprinting, DNS footprinting, network footprinting)

Effective OSINT follows a structured expansion: start with the single seed entity you know (usually a domain name or company name) and systematically branch outward through every data source that accepts that entity as input. Each discovery becomes a new seed for the next query. The graph of connections grows until you reach a point of diminishing returns — typically when new queries produce only data you've already seen from other paths.

OSINT Investigation Framework — Expansion Tree

Domain     → DNS records, WHOIS, subdomains, cert logs, ASN/IP ranges
Employees  → LinkedIn, job postings, email pattern, GitHub profiles
Tech Stack → Job ads, Wappalyzer, Shodan banners, cert SAN entries
Breaches   → HaveIBeenPwned, DeHashed, paste sites, dark web monitors
Code       → GitHub/GitLab public repos, secrets in commits, internal hostnames
Shodan     → Internet-facing infrastructure, open ports, software versions
Documents  → Google-indexed PDFs, XLSX, DOCX with metadata
Social     → Twitter/X, Instagram, press releases, conference talks
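
The seed-and-expand process described above can be sketched as a breadth-first traversal. This is a minimal illustration, not a real collection tool: the source functions (stub_ct_logs, stub_whois, stub_resolve) are hypothetical stubs that return canned data, and the entity kinds are assumptions chosen for the example.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple

# An entity is a (kind, value) pair such as ("domain", "example-corp.com").
Entity = Tuple[str, str]
# A source takes one entity value and returns the entities it discovers.
Source = Callable[[str], Iterable[Entity]]


def expand(seed: Entity, sources: Dict[str, List[Source]],
           max_entities: int = 1000) -> Set[Entity]:
    """Breadth-first expansion: every discovery becomes a seed for later queries."""
    seen: Set[Entity] = {seed}
    queue = deque([seed])
    # max_entities is a soft cap, checked between entities rather than per result.
    while queue and len(seen) < max_entities:
        kind, value = queue.popleft()
        for source in sources.get(kind, []):
            for found in source(value):
                if found not in seen:  # already-seen data triggers no new queries
                    seen.add(found)
                    queue.append(found)
    return seen


# Hypothetical stubs standing in for real lookups (no network access).
def stub_ct_logs(domain: str) -> List[Entity]:
    return [("subdomain", f"vpn.{domain}"), ("subdomain", f"mail.{domain}")]


def stub_whois(domain: str) -> List[Entity]:
    return [("email", f"hostmaster@{domain}")]


def stub_resolve(subdomain: str) -> List[Entity]:
    return [("ip", "203.0.113.45")]


SOURCES = {"domain": [stub_ct_logs, stub_whois], "subdomain": [stub_resolve]}
found = expand(("domain", "example-corp.com"), SOURCES)
for entity in sorted(found):
    print(entity)
```

The loop stops by itself once every source returns only entities already in the seen set, which is the point of diminishing returns the methodology describes.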

The CEH Footprinting Categories

The CEH organises footprinting techniques into specific categories. Each maps to a distinct set of tools and data sources, and the exam tests which category a given technique belongs to:

Website footprinting: Crawling public web content, analysing page source for comments and metadata, checking robots.txt and sitemap.xml for hidden paths, extracting document metadata with ExifTool.
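DNS footprinting: MX, NS, A, AAAA, TXT, and CNAME record enumeration; zone transfer testing (AXFR); reverse DNS lookup of IP ranges; passive DNS history via SecurityTrails or DNSDB. Zone transfer attempts query the target's own name servers, so they count as active and need authorisation.
Network footprinting: WHOIS for registrant data; BGP/ASN lookup for IP ranges; traceroute for network path mapping (active, because the probes reach the target's network); geolocation of IP addresses.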
Email footprinting: MX record analysis; SPF/DMARC policy review; email header analysis from sample messages; email harvesting from public sources.
Competitive intelligence: Job postings revealing technology choices and team structure; press releases announcing acquisitions and partnerships; financial filings disclosing infrastructure and vendors.

Examples

OSINT Techniques in Practice

Example 01: Google dorking

Google advanced operators narrow searches to find sensitive files, login pages, and exposed data indexed by search engines. These queries work because Google has already crawled and indexed the content — the attacker is simply filtering its results with precision.

# Find login pages
site:example-corp.com inurl:login
# Find exposed documents
site:example-corp.com filetype:pdf OR filetype:xlsx
# Find config files accidentally indexed
site:example-corp.com ext:env OR ext:config OR ext:sql
# Find employee info
site:linkedin.com "example corp" "security engineer"
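
When the same dorks are run against many targets, it can help to compose them programmatically. The sketch below is one possible helper; build_dork and its parameters are illustrative names, and it only assembles the query string (it does not contact any search engine).

```python
from typing import Optional, Sequence


def build_dork(site: str,
               filetypes: Sequence[str] = (),
               inurl: Optional[str] = None,
               phrases: Sequence[str] = ()) -> str:
    """Compose a search query string from common advanced operators."""
    parts = [f"site:{site}"]
    if inurl:
        parts.append(f"inurl:{inurl}")
    if filetypes:
        # Alternatives are joined with OR, as in the manual queries above.
        parts.append(" OR ".join(f"filetype:{ft}" for ft in filetypes))
    # Exact-match phrases are wrapped in double quotes.
    parts.extend(f'"{phrase}"' for phrase in phrases)
    return " ".join(parts)


print(build_dork("example-corp.com", inurl="login"))
print(build_dork("example-corp.com", filetypes=["pdf", "xlsx"]))
print(build_dork("linkedin.com", phrases=["example corp", "security engineer"]))
```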

Example 02: Subdomain enumeration

Subdomains reveal internal systems, staging environments, APIs, and admin panels that may be less secured than the main site. Combining multiple passive sources produces more complete results than any single tool.

subfinder -d example-corp.com -silent
mail.example-corp.com
staging.example-corp.com
api.example-corp.com
jira.example-corp.com
vpn.example-corp.com
amass enum -passive -d example-corp.com
dev-internal.example-corp.com  ← internal system exposed
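
Combining tool outputs usually means normalising and de-duplicating hostnames first. A minimal sketch, assuming each tool's output has already been read into a list of lines; merge_subdomains is a hypothetical helper, and the sample lists are made up for the example.

```python
from typing import Iterable, List


def merge_subdomains(domain: str, *tool_outputs: Iterable[str]) -> List[str]:
    """Normalise, filter, and de-duplicate hostnames from several tools."""
    domain = domain.lower()
    suffix = "." + domain
    merged = set()
    for output in tool_outputs:
        for line in output:
            host = line.strip().lower().rstrip(".")  # drop case and trailing dot
            if host.startswith("*."):                # drop wildcard prefix
                host = host[2:]
            # Keep only names that really belong to the target domain.
            if host == domain or host.endswith(suffix):
                merged.add(host)
    return sorted(merged)


subfinder_out = ["mail.example-corp.com", "staging.example-corp.com",
                 "API.example-corp.com"]
amass_out = ["api.example-corp.com.", "dev-internal.example-corp.com",
             "example-corp.com.evil.test"]
for host in merge_subdomains("example-corp.com", subfinder_out, amass_out):
    print(host)
```

The suffix check matters: a lookalike such as example-corp.com.evil.test contains the target name but is not part of the target's zone.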

Example 03: GitHub secret scanning

Developers often accidentally commit API keys, passwords, and private keys to public repositories. Even if the secret is deleted later, it remains in git history — permanently accessible to anyone who clones the repo.

trufflehog github --org=example-corp
Reason: High Entropy String
Path: config/database.yml
Branch: main (commit: a3f92b1)
  password: "Pr0ductionDB!2023"
# Search GitHub manually:
site:github.com "example-corp" "API_KEY"
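
The "High Entropy String" finding above rests on a simple idea: random secrets use their alphabet more evenly than ordinary words do. The sketch below shows that idea with Shannon entropy. It is not how any particular scanner is implemented; the 16-character minimum and the 4.0 threshold are assumptions picked for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character, from character frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Candidate tokens: runs of 16 or more characters from a base64-like alphabet.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_\-]{16,}")


def find_high_entropy(lines: List[str], threshold: float = 4.0) -> List[Tuple[int, str]]:
    """Return (line_number, token) pairs whose entropy exceeds the threshold."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for token in TOKEN_RE.findall(line):
            if shannon_entropy(token) > threshold:
                hits.append((number, token))
    return hits


sample = [
    'host: db.internal',
    'api_key: "q7Zp2LxV9tRb4KdW8mYc"',   # made-up value, 20 distinct characters
    'comment: "aaaaaaaaaaaaaaaaaaaa"',   # long but repetitive, so low entropy
]
for number, token in find_high_entropy(sample):
    print(f"line {number}: {token}")
```

Entropy alone produces false positives (hashes, UUIDs) and misses weak human-chosen passwords, which is why scanners typically pair it with pattern matching for known key formats.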

Example 04: Breach database lookup

If employees use corporate email for external services that were breached, those credentials may still work — especially if passwords are reused across personal and corporate accounts.

haveibeenpwned.com → domain search for example-corp.com (requires verified control of the domain)
Found in 3 breaches:
- LinkedIn 2012: 47 accounts
- Adobe 2013: 12 accounts
- Collection #1 2019: 8 accounts
# Credential stuffing risk:
# If these employees reuse passwords → VPN / O365 at risk

Key Concepts

What You Need to Know

🌐

Google Dorking

Advanced search operators (site:, filetype:, inurl:, intitle:) narrow searches to find sensitive indexed content — a surprisingly powerful passive technique that requires zero tools beyond a browser.

📜

Certificate Transparency

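Publicly trusted TLS certificates are recorded in certificate transparency logs, which crt.sh makes searchable. Searching by domain reveals every hostname named in a logged certificate, including internal and staging systems the organisation never intended to expose publicly. Hosts covered only by a wildcard certificate do not appear by name.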

💾

Breach Data

Past data breaches expose employee credentials. Even old passwords reveal patterns — and many users reuse passwords across systems, making historical breaches current threats.

🔑

Credential Stuffing Risk

Using breached credentials against corporate login portals. One reused password can provide VPN or email access to the entire organisation — no exploitation required.

Advanced Sources

The Full OSINT Source Landscape

CEH Objective: Module 02, Footprinting Tools (WHOIS, DNS interrogation, Shodan, social engineering through OSINT)

A professional OSINT investigation draws from dozens of specialised data sources simultaneously. Understanding what each source reveals — and crucially what it does not — prevents both gaps in coverage and wasted time querying sources that won't yield useful data for the current target type.

🔍

Shodan

Internet-facing device index. Shows open ports, software banners, TLS certificates, and geolocation for any IP. The attacker's view of your perimeter.

🌐

crt.sh

Certificate transparency log search. Lists the publicly trusted TLS certificates logged for a domain, revealing the subdomains named in them, including long-forgotten ones.

📡

SecurityTrails

Historical DNS records. Shows what IPs a domain pointed to in the past — surfacing origin IPs hidden behind CDNs.

🔓

HaveIBeenPwned

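Breach notification. Domain-level search shows which accounts on a domain appeared in known data breaches, and in which breaches. It requires proof of control over the domain, so it serves defenders and authorised testers rather than anonymous lookups.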

🐙

GitHub / GitLab

Public repos and commit history. Source of hardcoded secrets, internal hostnames, infrastructure-as-code revealing network topology.

💼

LinkedIn / Job Boards

Employee enumeration, org chart inference, technology stack disclosure ("experience with AWS, Kubernetes, FortiGate required").

📋

WHOIS / RDAP

Domain registrant data, registration dates, name servers, registrar. Privacy-protected domains reduce yield but NS records always remain visible.

🗄️

Wayback Machine

Historical web snapshots. Finds pages that no longer exist — old login portals, deprecated APIs, removed documentation with internal details.

📊

BuiltWith / Wappalyzer

Technology fingerprinting from HTTP headers, page source, and cookies. Identifies CMS, analytics, CDN, framework, and hosting provider.
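
Of the sources above, crt.sh is among the easiest to process programmatically because it can return results as JSON. The sketch below parses such a response offline. It assumes each entry carries a name_value field holding one or more newline-separated hostnames, which matches crt.sh output as commonly observed but should be checked against a live response; the sample data is invented.

```python
import json
from typing import List


def subdomains_from_crtsh(raw_json: str, domain: str) -> List[str]:
    """Extract unique hostnames for a domain from crt.sh-style JSON."""
    domain = domain.lower()
    names = set()
    for entry in json.loads(raw_json):
        # One certificate can list several names, separated by newlines.
        for name in entry.get("name_value", "").split("\n"):
            name = name.strip().lower()
            if name.startswith("*."):   # a wildcard reveals only the parent name
                name = name[2:]
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


# Invented sample shaped like a crt.sh JSON response.
SAMPLE = json.dumps([
    {"name_value": "vpn.example-corp.com\nmail.example-corp.com"},
    {"name_value": "*.example-corp.com"},
    {"name_value": "staging.example-corp.com"},
    {"name_value": "VPN.example-corp.com"},
])

for host in subdomains_from_crtsh(SAMPLE, "example-corp.com"):
    print(host)
```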

Shodan — The Search Engine for the Internet of Things

Shodan deserves special attention because it is the most operationally significant OSINT source for infrastructure mapping. Unlike Google, which indexes web page content, Shodan indexes the responses that internet-connected devices give to direct connection probes: port banners, TLS certificates, and protocol handshakes. Any device that answers on one of the ports Shodan probes gets catalogued.

From a defender's perspective, what Shodan reveals about your organisation is exactly what an attacker sees before they have done anything active. A Shodan search for your company's ASN or IP ranges produces an inventory, current as of Shodan's most recent crawl, of the publicly reachable services on the devices you operate, including devices your IT team may not know are public-facing.

Example 05: Shodan infrastructure mapping

Query Shodan for an organisation's infrastructure using their ASN, IP range, or organisation name. The results reveal open ports, software versions, and devices the organisation may not realise are internet-facing.

# Search by organisation name:
org:"Example Corp Technologies"
Results: 47 hosts
Ports found: 22, 80, 443, 3389, 8080, 8443, 9200
9200/tcp open — Elasticsearch (no auth required!)

# Search for specific vulnerable software across an IP range:
net:104.21.44.0/22 product:"Apache httpd" version:"2.4.49"
# Apache 2.4.49 is vulnerable to CVE-2021-41773 (path traversal / RCE)
# Shodan finds every host in the range running this exact version

# Find exposed RDP (common misconfiguration):
org:"Example Corp" port:3389
3 hosts with RDP exposed directly to internet — high risk
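
Once results like these are exported, triage is mostly a grouping exercise. The sketch below works on a list of banner dictionaries and makes no API calls. The ip_str and port field names follow Shodan's banner format as commonly documented, but treat them as an assumption to verify; RISKY_PORTS is an illustrative list, not a standard.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Services that rarely belong on the public internet (illustrative selection).
RISKY_PORTS = {23: "Telnet", 445: "SMB", 3389: "RDP", 9200: "Elasticsearch"}


def risky_exposures(banners: Iterable[dict]) -> Dict[str, List[str]]:
    """Group risky services by IP address from Shodan-style banner records."""
    findings = defaultdict(list)
    for banner in banners:
        port = banner.get("port")
        if port in RISKY_PORTS:
            findings[banner["ip_str"]].append(f"{port}/{RISKY_PORTS[port]}")
    return dict(findings)


# Invented records using documentation-reserved IP addresses.
sample = [
    {"ip_str": "203.0.113.10", "port": 443, "product": "nginx"},
    {"ip_str": "203.0.113.11", "port": 9200, "product": "Elastic"},
    {"ip_str": "203.0.113.12", "port": 3389},
    {"ip_str": "203.0.113.11", "port": 3389},
]
report = risky_exposures(sample)
for ip, services in sorted(report.items()):
    print(ip, ", ".join(services))
```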

Job Posting Intelligence — Technology Stack Disclosure

Job postings are one of the most underutilised OSINT sources, yet they reliably reveal more about an organisation's internal technology stack than almost any other public source. When a company posts a role for a "Senior DevOps Engineer — experience with Terraform, AWS EKS, Datadog, and HashiCorp Vault required," they have just disclosed their cloud provider, container orchestration platform, monitoring tool, and secrets management system to every attacker who searches for it.

This disclosure is not accidental — organisations need to attract qualified candidates. But it creates a detailed map of the attack surface that maps directly to CVE databases. An attacker who identifies the specific versions in use (often disclosed in more detailed job requirements or in conference talks by the organisation's engineers) can pre-research relevant exploits before touching the target at all.

Example 06: Job posting technology stack extraction

Systematically mine job postings for technology disclosures. A single senior engineering role posting can reveal the entire infrastructure stack.

# Search for target's engineering job postings:
site:linkedin.com OR site:indeed.com "example corp" "senior engineer"

# Extracted technology intelligence from a single posting:
Cloud:      AWS (EC2, RDS, S3, CloudFront)
Container:  Docker, Kubernetes (EKS)
IaC:        Terraform, Ansible
Monitoring: Datadog, PagerDuty
Auth:       Okta SSO, HashiCorp Vault
DB:         PostgreSQL 14, Redis 7
WAF:        AWS WAF + Cloudflare

# Cross-reference with CVE databases:
# AWS WAF bypass techniques, known Cloudflare misconfigs,
# PostgreSQL 14.x vulnerabilities, Redis auth bypass patterns
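
The extraction step above can be approximated with keyword matching against a vocabulary of technology names. This is a rough sketch: TECH_VOCAB is a small hand-picked dictionary for illustration, and real postings would need a much larger vocabulary and handling for abbreviations.

```python
import re
from typing import Dict, List

# Hypothetical vocabulary: category -> technology names to look for.
TECH_VOCAB: Dict[str, List[str]] = {
    "Cloud": ["AWS", "Azure", "GCP"],
    "Container": ["Docker", "Kubernetes", "EKS"],
    "IaC": ["Terraform", "Ansible"],
    "Monitoring": ["Datadog", "PagerDuty"],
    "Auth": ["Okta", "HashiCorp Vault"],
}


def extract_stack(posting: str, vocab: Dict[str, List[str]] = TECH_VOCAB) -> Dict[str, List[str]]:
    """Return the vocabulary terms that appear in a job posting, by category."""
    found: Dict[str, List[str]] = {}
    for category, terms in vocab.items():
        hits = [
            term for term in terms
            # Lookarounds stop a term matching inside a longer word.
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", posting, re.IGNORECASE)
        ]
        if hits:
            found[category] = hits
    return found


posting = ("Senior DevOps Engineer: experience with Terraform, AWS EKS, "
           "Datadog, and HashiCorp Vault required.")
for category, terms in extract_stack(posting).items():
    print(f"{category}: {', '.join(terms)}")
```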

DNS Footprinting

DNS and WHOIS — The Infrastructure Map

CEH Objective: Module 02, DNS footprinting (record types, zone transfers, reverse lookup, passive DNS history)

DNS is the telephone directory of the internet — it translates human-readable names into IP addresses. Because DNS is a public lookup system by design, it exposes a significant amount of an organisation's infrastructure to anyone willing to query it. The CEH tests knowledge of specific DNS record types and what each reveals about a target.

A / AAAA records: Map hostnames to IPv4/IPv6 addresses. The primary address mapping — reveals hosting IP, which can be correlated with Shodan for port data.
MX records: Mail exchange servers. Reveals the email provider (Google Workspace, Microsoft 365, or self-hosted), which informs phishing infrastructure choices.
NS records: Name servers authoritative for the domain. Reveals the DNS provider — sometimes also the hosting provider. Multiple NS records from the same provider can link related domains.
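TXT records: Arbitrary text. The apex TXT record set usually holds the SPF policy; the DMARC policy is published in a TXT record at _dmarc.<domain>, and DKIM keys at <selector>._domainkey.<domain>. A missing or permissive SPF record makes the domain easier to spoof for phishing.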
CNAME records: Aliases — one hostname points to another. Dangling CNAMEs (pointing to deprovisioned third-party services) are targets for subdomain takeover attacks.
SOA records: Start of Authority — contains the primary nameserver and admin email. Often reveals internal naming conventions and technical contact details.

Example 07: Comprehensive DNS footprinting

Systematic DNS enumeration extracts every record type from a target domain, building a complete picture of their mail infrastructure, CDN usage, and any dangling records that might be takeover candidates.

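# Enumerate common record types one at a time (many servers no longer answer ANY queries, see RFC 8482):
dig +noall +answer example-corp.com A
example-corp.com.  A      104.21.45.67      (behind Cloudflare)
dig +noall +answer example-corp.com MX
example-corp.com.  MX     mail.example-corp.com (self-hosted SMTP)
dig +noall +answer example-corp.com TXT
example-corp.com.  TXT    "v=spf1 include:_spf.google.com ~all"
# DMARC lives at the _dmarc subdomain, not the apex:
dig +noall +answer _dmarc.example-corp.com TXT
_dmarc.example-corp.com.  TXT  "v=DMARC1; p=none; rua=..."
                         ↑ p=none = no enforcement, domain can be spoofed!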

# Check for zone transfer (active: this queries the target's own name server, so it needs authorisation; a misconfigured server reveals ALL records):
dig axfr example-corp.com @ns1.example-corp.com
; Transfer failed. (Correctly configured — zone transfers disabled)

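# Passive DNS history, to find the origin IP behind a CDN (the SecurityTrails API needs a key; the variable name here is illustrative):
curl -H "APIKEY: $SECURITYTRAILS_API_KEY" "https://api.securitytrails.com/v1/history/example-corp.com/dns/a"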
Historical A record: 203.0.113.45 (pre-Cloudflare, direct server IP)
# Direct IP bypasses CDN-level WAF — direct attack surface now known

⚠️ DMARC Policy Matters: A DMARC policy of p=none means the domain has monitoring but no enforcement: receivers are not asked to block phishing emails that spoof the domain. p=quarantine routes them to spam. p=reject blocks them entirely. Any organisation with p=none or no DMARC record at all is far easier to spoof in phishing attacks targeting its customers and employees.
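
The three policy levels can be checked mechanically once the TXT record has been retrieved. A minimal sketch that parses a DMARC record string and reports the posture; it performs no DNS lookup, and the function names are illustrative.

```python
from typing import Dict, Optional


def parse_dmarc(txt: str) -> Dict[str, str]:
    """Split a DMARC record into its tag=value pairs."""
    tags = {}
    for part in txt.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_posture(txt: Optional[str]) -> str:
    """Summarise what a DMARC record asks receiving mail servers to do."""
    if not txt:
        return "no record: spoofed mail is not blocked by DMARC"
    tags = parse_dmarc(txt)
    if tags.get("v", "").upper() != "DMARC1":
        return "invalid record"
    outcomes = {
        "none": "monitoring only: spoofed mail is not blocked",
        "quarantine": "failing mail is routed to spam",
        "reject": "failing mail is rejected",
    }
    return outcomes.get(tags.get("p", "").lower(), "invalid record")


for record in ("v=DMARC1; p=none", "v=DMARC1; p=quarantine", "v=DMARC1; p=reject", None):
    print(repr(record), "->", dmarc_posture(record))
```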

Attack Chain

The Full OSINT-to-Access Chain

CEH Objective: Module 02, combining footprinting results to plan targeted attacks

Individual OSINT findings have limited value in isolation. The real power of OSINT is in combining discoveries across sources to construct a chain — where each piece of information enables the next, and the chain ends in actionable access or a credible attack path.

Attack Chain: From Company Name to VPN Access (Pure OSINT)

Step 1: Domain enumeration. Starting from the company name, WHOIS reveals the registered domain and the registrant email. Subdomain enumeration surfaces vpn.example-corp.com from certificate transparency logs.

Step 2: Technology identification. The VPN subdomain's TLS certificate and Shodan banner identify the VPN product as Cisco AnyConnect 4.9.x. A job posting confirms "Cisco AnyConnect administration experience required."

Step 3: Employee enumeration. LinkedIn enumeration produces 47 employees at the company. The email address pattern is confirmed from a press release signatory. A full list of candidate addresses is constructed from LinkedIn names.

Step 4: Breach correlation. HaveIBeenPwned shows the domain appeared in the LinkedIn breach (data stolen in 2012, published in 2016). DeHashed returns 12 matching email/hash pairs. Three hashes crack to weak passwords: Welcome2012!, Summer12!, Corp2012!.

Step 5: Access. The cracked passwords are tested against the VPN portal. One account, belonging to a senior network engineer still employed, reused their LinkedIn password. VPN access granted. No exploit was used. No scan was performed. Everything up to the login attempt was built from public data; the login itself is the only active step and requires authorisation.

Hidden Data

Document Metadata — The Overlooked Intelligence Source

CEH Objective: Module 02, Website footprinting (metadata extraction from public documents)

When organisations publish documents — PDFs, Word files, spreadsheets, presentations — those files carry embedded metadata that was generated automatically by the software that created them. This metadata is invisible to a casual reader but trivially extractable with free tools. It frequently reveals internal usernames, file paths exposing server naming conventions, software versions, and GPS coordinates from photos embedded in documents.

Example 08: ExifTool metadata extraction from public documents

Every document an organisation publishes publicly is a potential intelligence source. ExifTool extracts all embedded metadata in seconds, revealing information the document authors never intended to share.

# Download a publicly indexed PDF from the target and extract metadata:
exiftool annual_report_2024.pdf
Creator        : Microsoft Word 2019
Author         : j.harrington
Last Modified  : 2024-03-14 09:22:11
Company        : Example Corp Technologies Inc.
Template       : \\FILESERVER01\templates\corp_template.dotx
                  ↑ Internal file server hostname revealed!
Software       : Adobe Acrobat 23.6.20320.6

# From a single PDF we now know:
# - An employee username (j.harrington) → likely the local part of their example-corp.com email address
# - An internal file server hostname (FILESERVER01)
# - The internal UNC path format used for templates
# - Exact software versions for CVE lookup
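
When many documents are collected, the same review can be automated by parsing ExifTool's "Tag : Value" text output. The sketch below does this offline on a captured sample; the helper names are hypothetical, and the regular expression only recognises UNC paths of the form \\HOST\share.

```python
import re
from typing import Dict, List


def parse_exiftool(output: str) -> Dict[str, str]:
    """Turn 'Tag : Value' lines into a dictionary, skipping other lines."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Two backslashes, a host name, then another backslash (start of a UNC path).
UNC_HOST_RE = re.compile(r"\\\\([A-Za-z0-9_.-]+)\\")


def internal_hosts(fields: Dict[str, str]) -> List[str]:
    """Collect host names that appear in UNC paths in any field value."""
    hosts = set()
    for value in fields.values():
        hosts.update(UNC_HOST_RE.findall(value))
    return sorted(hosts)


# Sample text copied from the example output above.
SAMPLE = r"""Creator        : Microsoft Word 2019
Author         : j.harrington
Last Modified  : 2024-03-14 09:22:11
Template       : \\FILESERVER01\templates\corp_template.dotx
"""

fields = parse_exiftool(SAMPLE)
print("Author:", fields["Author"])
print("Internal hosts:", internal_hosts(fields))
```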

Defence

Reducing Your OSINT Footprint

CEH Objective: Module 02, Countermeasures (footprinting prevention techniques)

The most important defensive insight from OSINT methodology is that the attacker's information gathering happens entirely outside your visibility. You cannot detect it with a firewall, an IDS, or endpoint monitoring. The only defence is reducing the volume and quality of information that is publicly available in the first place — and accepting that some exposure is inevitable, then building resilience to compensate.

What to Reduce

WHOIS privacy: Enable domain privacy protection to prevent registrant details appearing in public WHOIS records.

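Document metadata: Strip metadata before publishing any document. Adobe Acrobat, LibreOffice, and dedicated tools (ExifTool, mat2) can remove embedded metadata before release. ExifTool's edits to PDFs are reversible unless the file is rewritten afterwards, so verify the published copy rather than the working file.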

DMARC enforcement: Move from p=none to p=reject to prevent domain spoofing in phishing attacks targeting your customers and staff.

GitHub hygiene: Enforce pre-commit hooks scanning for secrets. Audit all public repos. Remove sensitive data from git history using git-filter-repo.

What to Monitor

Your own CT logs: Subscribe to certificate transparency monitoring (Certspotter, Facebook CT Monitor) to be alerted when new certs are issued for your domains — including ones you didn't create.

Breach databases: Monitor HaveIBeenPwned and similar services for your domain. Alert immediately when employee emails appear in new breach data.

Shodan alerts: Set up Shodan monitors for your IP ranges. Receive alerts when new ports open or software versions change — often the first sign of a misconfiguration.

Paste sites: Monitor Pastebin and similar sites for your domain name, IP ranges, and employee usernames appearing in credential dumps.

✅ Defensive Mindset Shift: The most effective organisations treat their own OSINT footprint as a metric, tracking the number of exposed subdomains, credentials in breach databases, and public GitHub secrets over time, with targets for reduction. The question is never "are we exposed?" (you always are); it is "how much, where, and are we improving?" Running OSINT against yourself before an attacker does is one of the highest-ROI defensive activities available, at little cost beyond staff time.
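
Treating the footprint as a metric can start as simply as recording periodic snapshots and comparing them. A minimal sketch; the Snapshot fields mirror the three measures named above, and the numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snapshot:
    date: str                  # ISO 8601 date, e.g. "2024-01-01"
    exposed_subdomains: int
    breached_accounts: int
    public_secrets: int


METRICS = ("exposed_subdomains", "breached_accounts", "public_secrets")


def trend(snapshots: List[Snapshot]) -> Dict[str, int]:
    """Change in each metric from the earliest to the latest snapshot.

    Negative numbers mean the footprint is shrinking.
    """
    ordered = sorted(snapshots, key=lambda s: s.date)  # ISO dates sort lexically
    first, last = ordered[0], ordered[-1]
    return {name: getattr(last, name) - getattr(first, name) for name in METRICS}


# Invented quarterly measurements.
history = [
    Snapshot("2024-04-01", exposed_subdomains=38, breached_accounts=67, public_secrets=1),
    Snapshot("2024-01-01", exposed_subdomains=47, breached_accounts=67, public_secrets=4),
]
for name, delta in trend(history).items():
    print(f"{name}: {delta:+d}")
```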

Reference

Core Concepts Summary

🌐

Google Dorking

site:, filetype:, inurl:, intitle:, ext: operators narrow Google to find sensitive indexed content. The Google Hacking Database (GHDB) catalogues thousands of proven dorks for specific vulnerability types.

📜

Certificate Transparency

Publicly trusted TLS certs are recorded in certificate transparency logs, searchable at crt.sh. Reveals every subdomain named in a logged certificate. Dangling CNAMEs on old subdomains = subdomain takeover candidates. Monitor your own CT logs for unauthorised cert issuance.

💾

Breach Data

HaveIBeenPwned for domain-level exposure. DeHashed for specific hash/password recovery. Even 10-year-old breaches are current threats because password reuse remains widespread.

🔑

Credential Stuffing

Automated testing of breached credentials against login portals. Defended by MFA (which blocks most stuffing attempts, because a password alone no longer grants access), impossible travel detection, and rate limiting on authentication endpoints.

🛰️

Shodan / Censys

Index internet-facing devices by scanning the public IPv4 address space. Shows open ports, software banners, TLS certs for any org. Reveals what attackers see before touching your systems. Run it against yourself regularly.

🐙

GitHub Secret Scanning

TruffleHog and Gitleaks scan commit history for high-entropy strings and known secret patterns. Secrets survive deletion in git history. Pre-commit hooks catch secrets before they are committed; pair them with server-side scanning, because local hooks can be skipped.

📋

Document Metadata

PDFs, DOCX, images embed usernames, file server paths, software versions, and GPS coordinates. ExifTool extracts all of it. Strip metadata before publishing any document externally.

💼

Job Posting Intelligence

Technology stack disclosure in job requirements enables targeted CVE research before any active probing. Conference talks, blog posts, and press releases supplement — employees routinely disclose more than they realise.
