
April 8, 2025

Data Engineering for Cybersecurity, Part 5: Enrichment


Security breaches rarely occur because teams didn't have enough alerts or log data. Instead, incidents often happen because someone (or some tool) overlooked a crucial sequence of events or alerts hidden in a sea of noise. Handling telemetry from tens to hundreds of sources can quickly become overwhelming.

Think of a SIEM continuously flooded with massive volumes of raw logs from different data sources, but with no enrichment to tie them together. Each atomic event on its own fails to paint a comprehensive picture of what that activity may mean to your organization. You might notice a Lambda function running a bit more than usual, but miss the attacker who compromised a developer’s IAM keys and set up a sneaky Lambda function with a generic name mimicking your production services to quietly copy sensitive data from an S3 bucket to their external server using scheduled GetObject API calls during normal business hours. The S3 requests look like legitimate, authorized traffic from expected IP ranges with proper authentication, so without proper correlation and context, they're just routine logs buried among millions of others… until you realize something’s off down the line.

The data to detect threats is often already in your tools; it's just scattered across disconnected logs, waiting for someone (or something) to connect the dots into a clear attack story.

When security teams spend hours investigating false positives and chasing down context for each “critical” severity alert, they miss actual threats and risk burning out. The solution isn't more data; it’s contextualized data and automation that ultimately helps security teams make better decisions faster and stop breaches.

In this post, we’ll cover: 

  • What is security data enrichment
  • Common enrichment sources and categories
  • AWS CloudTrail event enrichment example (Before and After)
  • How, when, and where enrichment is applied  

What Is Security Data Enrichment? 

Simply put, data enrichment adds relevant context to raw security events, findings, and alerts. It helps answer the big questions like: 

  • Who accessed what?
  • What actions were performed? 
  • What’s the importance of the asset impacted? 
  • Is this behavior normal? 
  • What's the potential impact?
  • Is this tied to a known threat campaign?

Why it matters: Without enrichment, analysts spend hours manually investigating alerts, jumping between tons of tools, copying and pasting IPs, cross-checking usernames, and trying to piece together behavior patterns. It’s a massive time sink, and it leads to alert fatigue (false positives) and missed threats (false negatives). Enrichment flips the script by bringing key context to the log data rather than analysts having to hunt for it. 

Note: Throughout the post, we refer to different enrichment types including enrichments for alerts, logs, findings and other telemetry. In reality, you can enrich just about any type of telemetry so we tried to cover the many different enrichment methods, use cases, sources, and limitations.

Common Enrichment Sources

External Context:

  • Threat Intelligence: Maps IPs, domains, and hashes to known bad actors, campaigns, or C2 servers (e.g. VirusTotal, Recorded Future, AbuseIPDB).
  • Vulnerability Data: Ties CVEs to exploitability, patch status, and real-world severity (e.g. GitHub Advisory DB, VulnCheck, CISA KEV).
  • Geolocation: Resolves IP addresses to physical locations, ASNs, and network reputation data (e.g. MaxMind GeoIP, Shodan, Tor Exit Lists).
  • MITRE ATT&CK Mapping: Classifies events based on known attacker tactics and techniques (e.g. ATT&CK framework, vendor mappings).

Internal Context:

  • Asset Context: Links IPs and hostnames to device type, owner, and how critical it is to the business (e.g. CMDB, AWS Config, cloud inventory).
  • Identity & Session Context: Connects users to roles, permissions, departments, and typical behavior patterns. Also includes authentication details such as sign-in method, MFA usage, and session IDs (e.g. Okta, Entra ID, Workday).
  • Business Context: Maps systems and data to business processes, compliance requirements, risk levels, and more (e.g. internal data+resource tagging, GRC tools).
  • Behavioral Baselines: Highlights anomalies by comparing current behavior to historical user or entity activity (e.g. UEBA tools, SIEM logs, behavior analytics).
  • Application & API Context: Metadata from WAFs, API gateways, or observability platforms to identify anomalies or abuse patterns (e.g. WAF logs, API gateway telemetry, observability data).
  • Risk Scoring: Assigns severity or priority based on real-time correlation and attack context (e.g. Splunk RBA, Microsoft Sentinel, Exabeam).
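As a concrete sketch of how both kinds of context get attached, the snippet below enriches a raw event with an external threat-intel lookup and an internal asset-inventory lookup. All of the lookup data here is made up for illustration; in practice it would come from sources like a Tor exit list feed and a CMDB or cloud inventory.

```python
# Minimal sketch: attach external (threat intel) and internal (asset)
# context to a raw event via simple lookups. The lookup data below is
# made up; real sources would be threat-intel feeds and a CMDB.

TOR_EXIT_NODES = {"45.92.157.198"}           # external: Tor exit list
ASSET_INVENTORY = {                          # internal: asset inventory
    "financial-records": {"contains-pii": "true", "department": "finance"},
}

def enrich(event: dict) -> dict:
    """Add source and resource context blocks to an event in place."""
    ip = event.get("sourceIPAddress")
    bucket = event.get("requestParameters", {}).get("bucketName")
    event["source_context"] = {"ip": ip, "tor_exit_node": ip in TOR_EXIT_NODES}
    event["resource_context"] = {"tags": ASSET_INVENTORY.get(bucket, {})}
    return event

e = enrich({
    "sourceIPAddress": "45.92.157.198",
    "requestParameters": {"bucketName": "financial-records"},
})
print(e["source_context"]["tor_exit_node"])  # True
```

The same pattern scales from in-memory sets to database-backed or API-backed lookups without changing the shape of the enriched event.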

CloudTrail Example

Scenario: A raw AWS CloudTrail event shows someone tweaked an S3 bucket policy. It might be worth digging into, depending on several variables, including:

  • Data type in the bucket (sensitive or not?)
  • Importance of the resource (is it a crown jewel?)
  • Who made the change (admin or unauthorized user?)
  • The nature of the change (more secure or less?)
  • Does this deviate from “normal” activity? (time of day, MFA, userAgent etc.) 

Enriching our sample CloudTrail event uncovers a user authenticated without MFA, a connection from a Tor exit node, and activity touching critical financial records. With this additional context, it’s no longer just a random S3 policy change; it’s something worth investigating.

Let’s take a look at how we would enrich this event.

Before

Raw CloudTrail logs provide a snapshot of what happened, where it happened, who did it, and other metadata, but without context, they can be difficult to interpret at scale. Many modern enterprise environments see millions of CloudTrail events per day.

Below is an example of an unprocessed CloudTrail event:

{
   "eventVersion": "1.09",
   "userIdentity": {
       "type": "IAMUser",
       "principalId": "AIDA4EAEXAMPLE",
       "arn": "arn:aws:iam::123456789012:user/david.miller",
       "accountId": "123456789012",
       "accessKeyId": "AKIA4EAEXAMPLEKEY",
       "userName": "david.miller",
       "sessionContext": {
           "attributes": {
               "creationDate": "2025-02-12T02:15:00Z",
               "mfaAuthenticated": "false"
           }
       }
   },
   "eventTime": "2025-02-12T02:17:32Z",
   "eventSource": "s3.amazonaws.com",
   "eventName": "PutBucketPolicy",
   "awsRegion": "us-east-1",
   "sourceIPAddress": "45.92.157.198",
   "userAgent": "aws-cli/2.15.0 Python/3.9.7",
   "requestParameters": {
       "bucketName": "financial-records",
       "Policy": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":\"*\"},\"Action\":[\"s3:GetObject\"],\"Resource\":\"arn:aws:s3:::financial-records/*\"}]}"
   },
   "responseElements": null,
   "requestID": "9d478fc1-4f10-490f-a26b-0e9327d54c5a",
   "eventID": "eae87c48-d421-4626-94f5-1ac994d3e932",
   "readOnly": false,
   "eventType": "AwsApiCall",
   "managementEvent": true,
   "recipientAccountId": "123456789012",
   "eventCategory": "Management",
   "tlsDetails": {
       "tlsVersion": "TLSv1.2",
       "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
       "clientProvidedHostHeader": "s3.amazonaws.com"
   },
   "sessionCredentialFromConsole": "false"
}

At first glance, this log tells us that an IAM user, David Miller, performed an action against an S3 bucket, but to the naked eye, it lacks context on what exactly he did and what the implications could be. 

In reality, David Miller modified the policy of an S3 bucket (financial-records) to allow public access, meaning anyone on the internet can now read files from the bucket. This was done without using MFA.
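That "in reality" interpretation can itself be automated. The sketch below parses the policy document carried in the event's requestParameters and flags it when any Allow statement grants access to the wildcard principal. The field names follow the CloudTrail event above; the detection logic is a simplified illustration, not a complete public-access analyzer.

```python
# Sketch: flag a PutBucketPolicy event as public exposure by parsing
# the policy JSON embedded in requestParameters. Simplified logic --
# a real analyzer would also handle Deny statements, conditions, etc.
import json

def is_public_policy(cloudtrail_event: dict) -> bool:
    """Return True if any Allow statement grants access to principal '*'."""
    raw = cloudtrail_event["requestParameters"]["Policy"]
    policy = json.loads(raw)
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal")
        if stmt.get("Effect") == "Allow" and principal in ("*", {"AWS": "*"}):
            return True
    return False

event = {
    "requestParameters": {
        "bucketName": "financial-records",
        "Policy": '{"Version":"2012-10-17","Statement":[{"Effect":"Allow",'
                  '"Principal":{"AWS":"*"},"Action":["s3:GetObject"],'
                  '"Resource":"arn:aws:s3:::financial-records/*"}]}'
    }
}
print(is_public_policy(event))  # True
```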

After

By enriching the log record, we can transform this basic log into a high-context, actionable event that helps paint a more holistic picture of the user action. As you’ll see, the enriched record screams to the user that this event is very likely nefarious activity:

{
 "timestamp": "2025-02-12T02:17:32Z",
 "account_context": {
   "account_id": "123456789012",
   "account_name": "prod-finance"
 },
 "actor": {
   "user": "david.miller",
   "user_type": "IAMUser",
   "authentication_context": {
     "mfa_status": "NOT_USED",
     "mfa_required": true
   }
 },
 "action": {
   "event_name": "PutBucketPolicy",
   "classification": "CRITICAL",
   "impact": "PUBLIC_DATA_EXPOSURE",
   "policy_analysis": {
     "changes": [
       "granted_public_access",
       "removed_encryption"
     ],
     "affected_permissions": ["s3:GetObject"],
     "public_access": true
   }
 },
 "resource_context": {
   "resource_type": "S3 Bucket",
   "resource_name": "financial-records",
   "resource_arn": "arn:aws:s3:::financial-records",
   "region": "us-east-1",
   "tags": {
     "contains-pii": "true",
     "compliance-scope": "pci-dss,sox",
     "department": "finance"
   }
 },
 "source_context": {
   "ip": "45.92.157.198",
   "geo_location": { "country": "Russia", "city": "Moscow" },
   "network_context": {
     "tor_exit_node": true
   },
   "threat_intel": {
     "known_malicious": true,
     "active_c2_node": true
   }
 }
}

Quick note: The enriched CloudTrail event shown here is a representative example. The actual structure and fields will vary based on the enrichment sources, tools, and pipelines in use within your environment. This is meant to illustrate what enriched context can look like, not a canonical format.

The enriched event gathers context from threat intel feeds, GeoIP databases, asset inventory, and AWS IAM. To gather this much context without enrichment, an analyst would need to run lengthy, time-consuming correlation queries and jump between multiple tools. This could take anywhere from 20 to 60 minutes per event, and often more. The saddest part is that these manual goose chases often lead to dead ends (a.k.a. false positives), which is resoundingly one of the toughest parts of being a SecOps operator.

By leveraging automated enrichment, security teams regain that time and energy to focus on higher priority tasks like building out investigation playbooks or better detections.
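One common automation pattern (a sketch, not a prescribed design) is to run each raw event through a pipeline of small, independent enrichment steps, each responsible for a single context source. The lookup data inside each step below is a placeholder for a real backend such as a GeoIP database or identity provider.

```python
# Sketch: compose independent enrichment steps into one pipeline.
# Each step's inline lookup stands in for a real source (GeoIP
# database, identity provider, etc.).

def add_geo(event):
    # placeholder for a GeoIP database lookup
    geo_db = {"45.92.157.198": {"country": "Russia", "city": "Moscow"}}
    event["geo_location"] = geo_db.get(event.get("sourceIPAddress"), {})
    return event

def add_mfa_context(event):
    # derive MFA status from the CloudTrail session attributes
    attrs = (event.get("userIdentity", {})
                  .get("sessionContext", {})
                  .get("attributes", {}))
    event["mfa_status"] = "USED" if attrs.get("mfaAuthenticated") == "true" else "NOT_USED"
    return event

PIPELINE = [add_geo, add_mfa_context]

def enrich(event: dict) -> dict:
    for step in PIPELINE:
        event = step(event)
    return event

record = enrich({
    "sourceIPAddress": "45.92.157.198",
    "userIdentity": {"sessionContext": {"attributes": {"mfaAuthenticated": "false"}}},
})
```

Because each step is independent, new context sources can be added (or slow ones removed) without touching the rest of the pipeline.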

Applying Enrichments

Now that we've showcased the value of enrichment and how it can contextualize raw log records, it’s important to step back and look at the broader picture: how and when enrichment is applied.

Enrichment isn’t confined to a single stage. It can (and often does) happen at multiple points throughout an investigation or automated detection flow. Regardless of the where or the how, the goal is always the same: to provide additional context or information that enables better-informed decision-making.

There are a few factors that complicate enrichment at scale:

  • Access to additional information – Some data may not be readily available in security tools or retrievable via API, hence the reliance on CSV files and lookup functions.
  • Lack of common identifiers – Logs from different sources may not share common, normalized fields for correlation, which is why data transformation is crucial. We covered this in Part 4 of the series.
  • Data volume – Some environments generate millions of log events per day. 
  • Time and effort – Manual enrichment by analysts doesn't scale with massive data volumes.
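The first of these factors is why CSV lookup files remain so common: when context isn't reachable via API, a flat file keyed on a shared identifier still works. The sketch below shows the pattern; the file layout (ip, owner, criticality) and its contents are assumptions for illustration.

```python
# Sketch: CSV lookup-table enrichment for context that isn't
# reachable via API. The column layout is an assumption.
import csv
import io

# Stand-in for a lookup file you'd normally load with open("assets.csv")
ASSET_CSV = """ip,owner,criticality
10.0.1.5,finance-team,high
10.0.2.9,dev-sandbox,low
"""

def load_lookup(text: str) -> dict:
    """Index the CSV rows by their shared identifier (ip)."""
    return {row["ip"]: row for row in csv.DictReader(io.StringIO(text))}

def enrich_with_lookup(event: dict, lookup: dict) -> dict:
    match = lookup.get(event.get("sourceIPAddress"))
    event["asset_context"] = match or {"criticality": "unknown"}
    return event

lookup = load_lookup(ASSET_CSV)
e = enrich_with_lookup({"sourceIPAddress": "10.0.1.5"}, lookup)
print(e["asset_context"]["criticality"])  # high
```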

It probably comes as no surprise that what seems like a simple problem at a small scale becomes impossible to do at the scale of modern enterprise environments.

In security operations, time and energy are the scarcest resources. Teams must make tradeoffs in where they invest these resources to maximize impact and stop threats.

Manual enrichment is simply not scalable, so the next question becomes: How do we enrich events efficiently? And at what stage in the given workflow?

Traditionally, enrichment occurred after an analyst reviewed an event:

  1. An event occurs.
  2. A human investigates it.
  3. The analyst manually queries other systems for additional context.
  4. Once they feel that they have enough information, they make a decision.

At scale, this approach fails, hence the need for automated enrichment at different stages of the security workflow.

In this next section, we look at the how, when (use cases), and where (stage).

Methods and Use Cases

Enrichment isn’t a one-size-fits-all process. It happens at different stages of the pipeline and serves different goals depending on the use case. Whether you're enriching logs at the point of collection, calling APIs in-flight, or correlating events after the fact, each method has trade-offs in speed, cost, and context.

Early enrichment at ingestion time. Courtesy: Salesforce

Some teams rely on lookup tables for known bad IPs. Others embed enrichment logic directly in pipelines using tools like Cribl or NiFi. More advanced teams might stream real-time context from threat intelligence APIs or use SOAR playbooks to automate correlation after ingestion, especially for important alerts. AI-for-SecOps tools have also emerged as promising enrichment engines, with the end goal of automating triage.

Late enrichment at reporting time. Courtesy: Salesforce

Then there is also post-processing, where scheduled queries stitch together signals over time to catch slow-burn threats or support threat hunting by surfacing patterns not obvious in isolated events.
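A late-enrichment pass like this can be as simple as a scheduled job that aggregates already-collected events per actor and flags anyone whose count of sensitive actions crosses a threshold over the window. The action list, field names, and threshold below are illustrative assumptions.

```python
# Sketch of a post-processing (late enrichment) correlation pass:
# group events by actor over a window and flag users with repeated
# sensitive actions. Action names and threshold are illustrative.
from collections import Counter

SENSITIVE_ACTIONS = {"PutBucketPolicy", "DeleteTrail", "CreateAccessKey"}

def flag_slow_burn(events, threshold=2):
    """Return the set of users with >= threshold sensitive actions."""
    counts = Counter(
        e["userName"] for e in events if e["eventName"] in SENSITIVE_ACTIONS
    )
    return {user for user, n in counts.items() if n >= threshold}

window = [
    {"userName": "david.miller", "eventName": "PutBucketPolicy"},
    {"userName": "david.miller", "eventName": "CreateAccessKey"},
    {"userName": "jane.doe", "eventName": "GetObject"},
]
print(flag_slow_burn(window))  # {'david.miller'}
```

No single event in the window is alarming on its own; the pattern only appears once the events are stitched together, which is exactly the value of the post-processing stage.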

As you can see, there are numerous ways to enrich data. Understanding the "how" behind each of these enrichment methods helps you choose the right one for the job. It can also help you avoid overloading your SIEM, delaying alerts, or missing key context when it matters most.

Wrapping Up

Enrichment is what provides context and transforms security data from noise into a narrative. By automating context collection and correlation, teams regain precious time and energy previously lost chasing false positives. Ultimately, enriched data shifts SecOps from reactive firefighting to proactive threat detection and prevention, enabling security teams to focus on more critical tasks. 

If you’re executing investigation playbooks, correlation searches, threat hunting, incident response or anything in between, you’re already doing enrichment. However, based on what we covered in this post, you can probably do it in a much more efficient and scalable fashion. Enrichment is crucial for many security functions and we hope that this post inspires you to take a fresh look at your current processes and fully embrace the power of enrichment.  I'm also excited to share that soon, this is something that we at Monad will be able to help you with!

In the next part of this series, we’ll explore data routing strategies and techniques. Let's face it - dumping everything into your SIEM isn't sustainable. Between budget constraints forcing you to drop valuable log sources and still missing threat activity, there's a better way. We'll cover how to route data to meet your unique use cases without breaking the bank or sacrificing on visibility.

Darwin Salazar
