Data Engineering For Cybersecurity, Part 4: Data Transformation

November 21, 2024
Darwin Salazar

Introduction

Welcome back to our Data Engineering for Cybersecurity series! Security teams today are overwhelmed with vast amounts of raw, unstructured data. Without effective data transformation, this data becomes a burden instead of an asset, resulting in missed threats, alert fatigue, and costly SIEM bills.

In this post, we’ll explore Data Transformation—a process for converting and manipulating raw data to meet specific goals that can improve threat detection, vulnerability prioritization, compliance reporting, operational efficiency and so much more.

Detection engineers and SOCs are overwhelmed by an endless stream of logs, low-priority alerts, irrelevant details, false positives, and drifting detection queries, struggling to identify the meaningful patterns that could signal a true threat. Meanwhile, vulnerability management teams are bogged down by thousands of unprioritized high- and critical-severity findings that lack the context needed to address the most critical vulnerabilities first. Many other security domains struggle to keep up with the influx of signals from their tools, which keeps them from being effective at stopping threats and attacks.

Data transformation solves most of these challenges by giving security teams the power to clean, filter, standardize, and structure data. In this post, we’ll cover:

  • Why data transformation is critical for security teams.
  • Methods for applying data transformation.
  • Common transformation techniques like filtering, flattening, and normalization.
  • Choosing between ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) approaches for your workflows.

Let’s dive in!

Why Data Transformation is Needed in Security

Security teams deal with an avalanche of data every day—from logs, alerts, and telemetry to compliance reports and vulnerability scans. Without proper transformation, this data can overwhelm even the most skilled teams, delaying response times and increasing costs. Here’s why transformation is critical:

  • Noise Reduction: Security data often contains low-value, redundant, empty or irrelevant fields. Filtering out noise ensures teams only focus on high-priority findings and events.
  • Data Consistency: Security data comes in diverse formats, including unstructured, semi-structured, and structured data (JSON, XML, CSV), requiring standardization for seamless analysis and correlation.
  • Improved Efficiency: Clean, structured data is easier to work with. Whether it's simplifying queries, unifying data from multiple sources, or preventing detection drift, transformation lets teams focus on extracting the insights they need rather than spending valuable time shaping the data into a workable state.
  • Cost Optimization: Transforming data reduces unnecessary storage and processing, saving significant compute and query costs.

Performing Data Transformations

Data transformation is a complex, resource-intensive process that requires specialized tools, dedicated infrastructure, and ongoing engineering expertise. Building and scaling transformations isn’t a one-time effort—it demands continuous maintenance to keep pace with the evolving nature of data sources and organizational needs. 

For many teams, this complexity makes it impractical to manage transformations in-house, especially outside of a SIEM, and instead, they turn to purpose-built solutions to handle the heavy lifting. Here are some of the main tools and methods used:

  • Security Data Platforms: Platforms like SIEMs (e.g., Splunk, Elastic Security) and security data lakes (e.g., Amazon Security Lake) have built-in transformation capabilities. They handle tasks such as field extraction, normalization, enrichment with threat intelligence, and filtering, enabling data standardization (e.g., transforming data into the Open Cybersecurity Schema Framework) as it’s ingested.
  • ETL and ELT Pipelines: ETL (Extract, Transform, Load) pipelines transform data before storage, making them ideal for in-flight processing. ELT (Extract, Load, Transform) pipelines load raw data into storage first, with transformations applied later for deeper analysis. Data pipeline tools include Monad, Apache NiFi, Kafka, and custom Python scripts.
  • Custom Scripting: Custom scripts (e.g., in Python or PowerShell) automate transformations like filtering logs, deduplicating alerts, or flattening JSON. Regular expressions (regex) are frequently used in these scripts for pattern matching and data extraction, especially in unstructured data. These scripts require infrastructure for hosting and maintenance, as well as engineering oversight to adapt as the data changes over time (a minimal deduplication sketch follows this list).
  • SQL Queries and Database Functions: When data is stored in relational databases or data lakes, transformations are often executed through SQL queries or functions. SQL can be used to flatten JSON data, aggregate login attempts, or filter out routine internal traffic, particularly useful for in-store transformations on historical data. For large-scale queries and complex transformations, database performance must be monitored and maintained to avoid latency and other performance issues.
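To make the custom scripting approach concrete, below is a minimal Python sketch of alert deduplication. It assumes alerts arrive as dictionaries and that the fingerprint fields (rule_id, host, dest_ip) identify duplicates; both are assumptions to adapt to your own sources.

import hashlib
import json

# Fields assumed (hypothetically) to define alert identity; tune per data source.
FINGERPRINT_FIELDS = ("rule_id", "host", "dest_ip")

def fingerprint(alert: dict) -> str:
    """Build a stable hash from the fields that define an alert's identity."""
    key_values = {field: alert.get(field) for field in FINGERPRINT_FIELDS}
    return hashlib.sha256(json.dumps(key_values, sort_keys=True).encode()).hexdigest()

def deduplicate(alerts: list[dict]) -> list[dict]:
    """Keep only the first occurrence of each unique alert fingerprint."""
    seen, unique = set(), []
    for alert in alerts:
        fp = fingerprint(alert)
        if fp not in seen:
            seen.add(fp)
            unique.append(alert)
    return unique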

Key Data Transformations

Now, let’s dive into some of the most important transformations used in security, along with real-world before-and-after examples. These transformations are designed to reduce noise, cost, and manual toil, ensuring that what reaches your security analysts is meaningful and actionable.

To keep this blog post concise, the examples provided here use log snippets that highlight the key aspects of each transformation. For the full log samples and more detailed transformation examples, visit our GitHub repository (also linked underneath each transformation example).

Filtering

Purpose: Remove irrelevant or noisy data to reduce the volume of information your team needs to process. You can filter at the field level or record level. 

Examples:

  • Filtering records: When filtering entire records, you will typically target one or more fields or values to drop. A common place to filter records is vulnerability scan findings, which typically come with a severity rating. While it would be wonderful to imagine we can patch every single vulnerability, the reality is that we need to prioritize the riskiest issues first. By dropping all records where the severity level is info, low, or medium, we can eliminate an enormous amount of noise and focus on the most important vulnerabilities (a minimal sketch of this filter follows this list).
  • Filtering record data: When we talk about filtering data in a record, things get a little more complex. Filtering a record's data usually involves changing the shape of the data and the information contained within a record, which is usually better categorized as a data mutation. We’ll talk more about this in the section on Mutating transforms.
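As a rough illustration of record-level filtering, here is a minimal Python sketch that drops low-severity findings. The severity values and the severity field name are assumptions; adjust them to match your scanner's output.

# Severity levels assumed (hypothetically) to be safe to drop; tune per program.
DROPPED_SEVERITIES = {"info", "low", "medium"}

def filter_findings(findings: list[dict]) -> list[dict]:
    """Keep only findings whose severity is above the drop threshold."""
    return [
        finding
        for finding in findings
        if finding.get("severity", "").lower() not in DROPPED_SEVERITIES
    ]

findings = [
    {"id": "CVE-2024-0001", "severity": "Critical"},
    {"id": "CVE-2024-0002", "severity": "Low"},
]
print(filter_findings(findings))  # only the critical finding remains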

Flattening

Purpose: Flattening nested JSON data converts complex, hierarchical structures into a simplified tabular format where each field exists at a single level, making it dramatically easier to query and analyze the data using standard tools and methods. This transformation eliminates the need to write complex traversal logic to access deeply nested values, while also enabling more efficient filtering, aggregation, and joins across what were previously nested objects within objects.

Example: Given the following identity log (Okta) snippet with several nested objects, we can flatten the data out into dot notation, which can make these fields available without needing to perform additional JSON parsing at query time.

Before flattening

{
  "actor": {
    "alternateId": "jasmine@example",
    "displayName": "Jasmine Smith",
    "id": "00u5syyx5nWoMofVG697"
  },
  "client": {
    "geographicalContext": {
      "city": "Los Angeles",
      "country": "United States"
    },
    "ipAddress": "64.185.13.18"
  },
  "eventType": "user.session.access_admin_app"
}

After flattening

{
  "actor.alternateId": "jasmine@example",
  "actor.displayName": "Jasmine Smith",
  "actor.id": "00u5syyx5nWoMofVG697",
  "client.geographicalContext.city": "Los Angeles",
  "client.geographicalContext.country": "United States",
  "client.ipAddress": "64.185.13.18",
  "eventType": "user.session.access_admin_app"
}

This flattened structure makes it much easier to query specific fields or create analytics based on any combination of fields without having to deal with nested JSON parsing.
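For reference, here is a minimal Python sketch of this kind of flattening. It assumes values are either scalars or nested objects; arrays and key collisions would need explicit handling in a production pipeline.

def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Recursively collapse nested objects into dot-notation keys."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# Applied to the Okta snippet above, this yields keys such as
# "actor.alternateId" and "client.geographicalContext.city".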

👉 Find the full before-and-after JSON examples on our GitHub repo: Monad Community GitHub

Mutating

Purpose: Data mutations modify keys, values, or data types within records to improve consistency, correct errors, or apply specific adjustments based on certain conditions, ensuring the data is accurate and actionable for security analysis. Mutations can also be used to filter out portions of a record that are unneeded or unwanted in the analysis process, saving on query times and storage costs.

Example: AWS CloudTrail logs are an invaluable resource for security practitioners operating in Amazon’s cloud. However, CloudTrail logs can contain an enormous amount of information, only some of which is usually of interest during analysis or investigation. Mutating a CloudTrail log allows you to retain the important portions of the log while removing extraneous data from the record, reducing storage costs and allowing for faster querying. In the following CloudTrail record, we strip off the requestParameters and responseElements, both of which are only used for very specific use cases. After removing both elements, but otherwise keeping the record the same (mutating), we bring the log size from 731 bytes to 186 bytes, a 74.56% reduction. Compounded over time, this can lead to significant cost savings on your SIEM and data platform bills.

Note:  The relevance of these fields is highly dependent on the queries being run against this data and removing it may not be feasible in all situations.

Before mutating

{
  "awsRegion": "us-east-1",
  "eventCategory": "Management",
  "eventID": "0932ccda-a8f6-27ac-2f56-3abcbefae5ed",
  "eventName": "StartInstances",
  "requestParameters": {
    "instancesSet": {
      "items": [
        {
          "instanceId": "i-94e4f915-2a83-523c-48e7-565a5ce1202e"
        },
        {
          "instanceId": "i-998f571e-1e63-26ad-dbef-b2446348a0cf"
        }
      ]
    }
  },
  "responseElements": {
    "instancesSet": {
      "items": [
        {
          "currentState": {
            "code": 0,
            "name": "pending"
          },
          "previousState": {
            "code": 80,
            "name": "stopped"
          }
        }
      ]
    }
  },
  "sourceIPAddress": "56.46.65.135"
}

After mutating

{
  "awsRegion": "us-east-1",
  "eventCategory": "Management",
  "eventID": "0932ccda-a8f6-27ac-2f56-3abcbefae5ed",
  "eventName": "StartInstances",
  "sourceIPAddress": "56.46.65.135"
}
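Here is a minimal Python sketch of this mutation; the list of fields to drop is an assumption to revisit against your own query patterns.

import json

# Fields assumed to be safe to drop for this workload; keep them if your queries need them.
FIELDS_TO_DROP = {"requestParameters", "responseElements"}

def mutate_record(record: dict) -> dict:
    """Return a copy of the CloudTrail record without the extraneous fields."""
    return {key: value for key, value in record.items() if key not in FIELDS_TO_DROP}

def size_reduction(before: dict, after: dict) -> float:
    """Percentage reduction in serialized size after the mutation."""
    before_bytes = len(json.dumps(before).encode())
    after_bytes = len(json.dumps(after).encode())
    return 100 * (before_bytes - after_bytes) / before_bytes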

👉 Find the full before-and-after JSON examples on our GitHub repo: Monad Community GitHub

Anonymization and Masking

  • Purpose: A form of data mutation, anonymization and masking can be used to protect sensitive data by obfuscating personal information, which is essential for compliance with regulations like GDPR or HIPAA. Anonymization is irreversible and makes it impossible to retrieve the original data, while masking is reversible. There are trade-offs to both approaches, and which one you should use depends on the use case.
  • Example: Using our previous example of an Okta identity log, we can identify the potentially sensitive fields and redact them as needed. We’ve gone a little redaction-happy here to illustrate the point. If it is important to maintain the ability to query and correlate the redacted values, we can instead replace them with hashed values, which stay consistent across logs while still anonymizing the data.
{
  "actor": {
    "alternateId": "<redacted for privacy>",
    "displayName": "<redacted for privacy>",
    "id": "00u5syyx5nWoMofVG697"
  },
  "client": {
    "geographicalContext": {
      "city": "<redacted for privacy>",
      "country": "United States"
    },
    "ipAddress": "<redacted for privacy>"
  },
  "eventType": "user.session.access_admin_app"
}

Sensitive fields like email addresses, names, IP addresses, and geographic locations have been redacted for privacy. If necessary, masking can instead replace values with reversible tokens or consistent keyed hashes for secure correlation.
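Here is a minimal Python sketch of the keyed-hash approach, assuming a flattened record and a secret key managed outside the pipeline (for example, in a secrets manager); the field names follow the flattened Okta example above.

import hashlib
import hmac

# Assumption: the key is provisioned and rotated outside this script.
SECRET_KEY = b"replace-with-a-managed-secret"

SENSITIVE_FIELDS = {"actor.alternateId", "actor.displayName", "client.ipAddress"}

def mask_record(record: dict) -> dict:
    """Replace sensitive values with keyed hashes that stay consistent across logs."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS and value is not None:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256).hexdigest()
            masked[key] = f"hashed:{digest[:16]}"
        else:
            masked[key] = value
    return masked

Note that a keyed hash supports correlation but not recovery of the original value; reversible masking would instead require tokenization or encryption.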

👉 Find the full before-and-after JSON examples on our GitHub repo: Monad Community GitHub

Normalization

  • Purpose: Normalization is the process of converting disparate fields which contain the same type of data into a standard schema, allowing data to be correlated across multiple data sources.  This is an incredibly important but often overlooked necessity when dealing with multiple security data sources.
  • Example: Using our previous Okta identity example one more time, we can see in the first line of the log that Okta exposes the field actor.alternateId, but the value holds a user’s email address. Email addresses are a very common way to identify users across many platforms, but the field name used for email addresses differs greatly across platforms. In this case, we’ll imagine that our organization has decided to standardize on the key emailAddress to represent all email addresses in our data.

Before normalization

{
  "actor.alternateId": "jasmine@example",
  "actor.detailEntry": null,
  "actor.displayName": "Jasmine Smith"
}

After normalization

{
  "emailAddress": "jasmine@example",
  "actor.detailEntry": null,
  "actor.displayName": "Jasmine Smith"
}
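A minimal Python sketch of this key renaming is shown below; the mapping entries other than the Okta field are hypothetical examples of how other sources might name the same data.

# Map source-specific keys onto the organization's standard schema.
FIELD_MAP = {
    "actor.alternateId": "emailAddress",  # Okta
    "user_email": "emailAddress",         # hypothetical source
    "userPrincipalName": "emailAddress",  # hypothetical source
}

def normalize(record: dict) -> dict:
    """Rename known source fields to standard keys, leaving everything else as-is."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}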

Enter OCSF 

The Open Cybersecurity Schema Framework (OCSF) is an open-source initiative founded in 2022 by leading companies like Splunk, AWS, and Cisco. The initiative aims to standardize how cybersecurity data is structured. By providing a vendor-agnostic schema, OCSF enables security teams to unify and normalize data from diverse sources like cloud platforms, GRC, IAM, and endpoint tools. This reduces the need for custom mappings and simplifies data ingestion, making security operations, vulnerability management and other security functions much more efficient.

Today, OCSF is supported by many vendors and has begun seeing a fair amount of adoption by security teams. While there are other standards that aim to unify schema normalization (e.g., the Elastic Common Schema), OCSF is the most widely supported by vendors and is well worth a look if you're looking to streamline the data wrangling portion of your team's workflows.

Below is an example of the OCSF schema for User Access Management (UAM) events:

{
  "uid": 5,
  "caption": "User Access Management",
  "description": "User Access Management events report management updates to a user's privileges.",
  "extends": "iam",
  "name": "user_access",
  "attributes": {
    "activity_id": {
      "enum": {
        "1": {
          "caption": "Assign Privileges",
          "description": "Assign privileges to a user."
        },
        "2": {
          "caption": "Revoke Privileges",
          "description": "Revoke privileges from a user."
        }
      }
    },
    "privileges": {
      "description": "List of privileges assigned to a user.",
      "group": "primary",
      "requirement": "required"
    },
    "resource": {
      "description": "Resource that the privileges give access to.",
      "group": "primary",
      "requirement": "recommended"
    },
    "user": {
      "description": "User to which privileges were assigned.",
      "group": "primary",
      "requirement": "required"
    }
  }
}

Wherever OCSF is applied, UAM events are converted to the schema above, making them much easier for security teams to work with. For more details and schema resources, visit the OCSF GitHub repository.
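As a rough sketch of what producing this class might look like in code, the Python function below maps a hypothetical privilege-assignment event onto the attributes from the excerpt above. The class_uid value and the shapes of the user, privileges, and resource objects are assumptions that should be verified against the official OCSF schema.

# Illustrative only; verify identifiers and field shapes against the OCSF schema.
def to_ocsf_user_access(event: dict) -> dict:
    """Map a hypothetical privilege-assignment event onto the user_access attributes above."""
    return {
        "class_uid": 3005,                                 # assumed: User Access Management
        "activity_id": 1,                                  # 1 = Assign Privileges (per the excerpt)
        "user": {"name": event.get("target_user")},
        "privileges": event.get("granted_privileges", []),
        "resource": {"name": event.get("resource_name")},
        "time": event.get("timestamp"),
    }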

In-Flight vs. In-Store Transformation

Now that you understand common transformation types, it's essential to know where and when to apply them.  Broadly, there are two approaches and each aligns with specific security objectives and operational requirements.

In-Flight Transformation

In-flight transformation processes data as it travels through the pipeline, before it reaches its destination. This method is particularly suited for real-time security scenarios that demand immediate action and low-latency processing.

Common Use Cases:

  • Filtering out low-value alerts before they reach your SIEM to reduce storage and compute costs.
  • Redacting sensitive fields like IP addresses or usernames before data is forwarded to external monitoring tools for privacy compliance.
  • Converting log formats (e.g., transforming raw JSON into OCSF) before delivering to an analytics platform.

In-Store Transformation

In-store transformation occurs after data has been ingested and stored in a data lake or database. This approach is better suited for retrospective analysis and complex transformations that don’t require immediate processing.


Common Use Cases:

  • Aggregating metrics over time, like tracking failed login attempts by IP or user to identify trends or anomalies (a minimal sketch follows this list).
  • Normalizing fields across multiple data sources to align with a common schema for easier analysis (e.g., mapping “username” and “user_id” to a standard field).
  • Performing forensic investigations on stored logs to identify patterns in past incidents.
  • Enabling threat hunting by correlating historical data, enriching it with threat intelligence, and identifying anomalies or suspicious activity overlooked in real time.
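To illustrate the aggregation use case mentioned above, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for a data lake or warehouse table; the table and column names are hypothetical.

import sqlite3

# In-memory SQLite standing in for an in-store table of authentication events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_events (src_ip TEXT, outcome TEXT, event_time TEXT)")
conn.executemany(
    "INSERT INTO auth_events VALUES (?, ?, ?)",
    [
        ("64.185.13.18", "FAILURE", "2024-11-20T10:01:00Z"),
        ("64.185.13.18", "FAILURE", "2024-11-20T10:01:30Z"),
        ("10.0.0.7", "SUCCESS", "2024-11-20T10:02:00Z"),
    ],
)

# Count failed logins per source IP to surface brute-force patterns.
rows = conn.execute(
    "SELECT src_ip, COUNT(*) AS failed_attempts "
    "FROM auth_events WHERE outcome = 'FAILURE' "
    "GROUP BY src_ip ORDER BY failed_attempts DESC"
).fetchall()

for src_ip, failed_attempts in rows:
    print(src_ip, failed_attempts)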

Choosing Between ETL and ELT

In the context of security, selecting between ETL and ELT workflows depends on the specific transformation needs and how quickly the data must be actionable:

  • ETL: Transforms data before storage, making it ideal for real-time or in-flight use cases. This is useful in a SIEM pipeline to ensure only relevant, enriched data is ingested.
  • ELT: Stores raw data first, allowing flexibility for complex transformations as needed. ELT is effective for in-store transformations, useful for deep forensic investigations or historical analysis.

Conclusion

Data transformation is crucial for optimizing security data, reducing alert fatigue, and driving operational efficiency. By filtering out irrelevant data, deduplicating alerts, and cleanly formatting data, transformations help security teams focus on critical threats and vulnerabilities while minimizing noise and toil. This reduces alert fatigue, ensuring analysts aren’t overwhelmed by low-value alerts, and leads to significant cost savings by reducing storage and processing demands. Streamlined, consistent data also enables faster, more effective analysis, improving response times and resource allocation.

Choosing the right transformation approach—in-flight or in-store—and understanding when to apply ETL versus ELT workflows can help security teams build efficient, scalable, and cost-effective security functions.

Looking Ahead

In our next post, we’ll explore Data Routing and Integration—how to intelligently and conditionally route security data to specific destinations based on team needs. By directing only relevant data to the right teams—whether SecOps, incident response, or compliance—security operations become more focused and efficient. This targeted, conditional routing reduces noise, introduces cost and operational efficiencies, and ensures that security teams receive security signals most relevant to their domain. Stay tuned by signing up to receive our updates below!