Data Engineering For Cybersecurity, Part I: Understanding Security Data

March 11, 2024
Darwin Salazar

Welcome to the Data Engineering for Cybersecurity blog series. In this first post, we'll dive into the mounting complexities of managing cybersecurity data and the growing importance of data engineering in bolstering security posture, operations, and compliance.

In following posts, we’ll cover: 

  • Part II: Data Collection and Storage
  • Part III: Data Processing and Transformation
  • Part IV: Security Data Analysis
  • Part V: Important Data Engineering Tools, Technologies, and Frameworks
  • Part VI: The SIEM Evolution
  • Part VII: Securing the Data Engineering Lifecycle End-to-End
  • Part VIII: Real-World Applications and Use Cases for Security Data Engineering
  • Part IX: How Monad Can Alleviate Your Security Data Problems

This series is meant to be as educational as it is practical. Whether you're an entry-level security analyst or a CISO, you should come away with plenty of practical insights!

The goal is to take a vendor-agnostic look at the data challenges that security teams face today and how adopting even just a few data engineering principles can help security programs become more efficient, cost-effective, and ultimately, more secure.

Introduction

Over the past several years, the importance of data engineering in supporting cybersecurity functions has increased dramatically. JupiterOne found that the average security team was responsible for 393,419 assets and attributes in 2023, a 137% increase from 2022. Meanwhile, Microsoft reports that the average large enterprise runs about 75 security solutions. Security teams are drowning in alerts, vulnerability findings, and tech debt while contending with increasingly sophisticated attacks (thanks, AI! 😅) and mounting regulatory pressure. These challenges only compound over time, rendering traditional security approaches obsolete and leaving security teams overwhelmed.

The challenge of securing an organization lies not only in establishing strong defenses (e.g., secure code, configurations, firewall rules), but also in the ability to integrate, correlate, and contextualize the vast array of data at hand. To have a fighting chance at defending your organization, you must be able to analyze this data and use it to prioritize where your security efforts are focused. That means all important security data must live somewhere easily accessible, in a state (i.e., normalized) that makes it easy to work with. Without data engineering, none of this is possible.

What is Data Engineering?

Before we go any further, it's important to establish a clear definition of data engineering. For this, let's turn to Joe Reis and Matt Housley's acclaimed book, Fundamentals of Data Engineering:

"Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases… Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering."

Simply put, data engineering builds the bridges that allow us to move and make sense of data, connecting systems, applications, and people across the globe. With the recent rise of AI and the premium it places on data quality, data engineering has only become more important.

Now that we have a firm understanding of data engineering, let's explore the intricacies of security data that underscore its unique challenges and opportunities.

Intricacies of Security Data

Security-relevant data comes from many places, including IoT devices, cloud services, applications, and on-premises systems. Formats vary greatly, from structured and semi-structured types like CSV, JSON, and YAML to unstructured forms like free-text log files.

The sheer amount of data generated every day adds to the challenge. A mid-size organization may generate a few GBs of security data per day, while some Fortune 100s may generate a few TBs or more. Many of these data sources also have different latency and velocity requirements: audit logs for assets supporting critical applications need near-real-time processing, while vulnerability scan results can be processed once a day or even 2-3x a week.

To sum it up, these are the key factors to consider for each data source (sketched in code after the list):

  • Data source
  • Variability in data formats
  • Extraction methods
  • Data latency and velocity requirements
  • Volume of data
  • The "why": the significance of that data source to your security program
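
To make these factors concrete, below is a minimal sketch of how a team might catalog a data source before wiring it into a pipeline. The schema and example values are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class DataSourceProfile:
    """Illustrative catalog entry capturing the factors above."""
    name: str                # the data source
    formats: list[str]       # formats it emits, e.g., ["JSON"]
    extraction: str          # how it's pulled: API, agent, file export...
    latency: str             # "near-real-time" or "batch"
    daily_volume_gb: float   # rough daily volume, for sizing and cost
    significance: str        # the "why": what this source protects or reveals

# Hypothetical entry for a cloud audit log source.
cloudtrail = DataSourceProfile(
    name="AWS CloudTrail",
    formats=["JSON"],
    extraction="S3 delivery + API",
    latency="near-real-time",
    daily_volume_gb=50.0,
    significance="Audit trail of control-plane activity across AWS accounts",
)
```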

As you can see, handling and making the most of your security data to defend against threats is not easy. This is why so many security solutions exist today. Yet, with the evolving nature of the threat landscape and of company environments, a more tailored approach is needed. This is why companies like Brex, Coinbase, Rippling, ScaleAI, and many others are adopting sophisticated data engineering practices to 10x their security programs.  

Deeper Look at Security Data

To provide a structured overview of all the security-relevant data that is out there, we have classified 22 security-relevant data sources into four main categories: 

  • Logs 
  • Findings 
  • Metadata 
  • Other

We’ll be covering examples of each data source, the formats they are output in, and why they are important to security teams. Please note that this is not an exhaustive list of all security-relevant data.

Logs

Logs are records of activities that have occurred within computer systems, applications, or networks. They typically include details such as timestamps, event types, source and destination information, and actions performed. Logs serve as a chronological account, providing visibility into the operational behavior of users, systems, and applications, making them fundamental for security.

Logs serve as the basis for many critical security functions such as detection engineering, threat hunting, incident response, and more. Below are a few of the different flavors of logs:

  • Cloud Control Plane Logs: Records actions performed on the management layer of cloud services, such as resource provisioning, configuration changes, and access control modifications. These logs are essential for auditing and tracking administrative operations within the cloud environment.
  • Cloud Data Plane Logs: Capture operational interactions with cloud resources and services, such as data read/write operations and user transactions with deployed applications. These logs are crucial for identifying what activity has transpired inside a resource.
    • Example: AWS CloudTrail data events, which capture detailed information about object-level activities on S3 buckets, such as GetObject, DeleteObject, and PutObject API calls (JSON format; see the parsing sketch after this list).
  • Domain Name System (DNS) Logs: DNS logs capture DNS server transactions, revealing network activity and potential threats like command and control communications or data exfiltration. Analyzing these logs aids in detecting malicious domains, unauthorized access, and visits to known harmful sites.
  • Identity and Access Management (IAM) Logs: Record user and access management activity, including authentication attempts and changes to user permissions.
  • Network Traffic Logs: Capture the flow of data across a network, helping detect unusual patterns or malicious activity.
  • System Logs: This includes Operating System (OS) and Database logs. These logs provide a detailed account of events such as user activities, system errors, modifications, and potential security incidents. 
  • Telemetry: Captures a spectrum of operational data from applications and systems, including error reports, crash dumps, and usage statistics. This data can reveal abnormal patterns that suggest security breaches or flaws in system integrity. There's an entire Application Performance Monitoring (APM) market segment dedicated to monitoring this telemetry.
    • Example: Application telemetry from a web server, detailing response times, error rates, and traffic sources (JSON or XML for integration with monitoring tools).
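
To make the CloudTrail example above concrete, here's a minimal sketch of pulling out the fields a security team typically cares about. The top-level Records array and field names follow CloudTrail's JSON layout; the file path and the choice to flag DeleteObject calls are illustrative:

```python
import json

# CloudTrail delivers log files as JSON with a top-level "Records" array.
with open("cloudtrail_data_events.json") as f:  # illustrative path
    records = json.load(f)["Records"]

for event in records:
    # Flag S3 object deletions -- a common signal worth reviewing.
    if event.get("eventName") == "DeleteObject":
        params = event.get("requestParameters") or {}
        print(
            f"{event.get('eventTime')} {event.get('sourceIPAddress')} "
            f"deleted s3://{params.get('bucketName')}/{params.get('key')}"
        )
```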

Metadata

Metadata is crucial for providing context that enriches events, findings, and alerts. Without the right metadata, it's very difficult to know what to prioritize. It includes detailed information like file attributes, network packet details, and transaction specifics, which illuminate the who, what, when, and how of security events. Through metadata, security teams gain a deeper understanding of what they are securing and the business context it supports, leading to more informed decision-making and more robust security strategies.

  • Asset Metadata: Information that details the characteristics, configurations, and security settings of assets (e.g., IoT devices, cloud resources, on-prem servers), essential for managing and securing those environments.
  • Configuration Data: Settings and parameters that define the configuration of applications and systems. 
  • User Identity Metadata: Information related to user accounts, roles, and permissions within a system or application, crucial for access control and auditing.
  • Email Metadata: Data extracted from email headers and structure that reveals patterns and characteristics of email communications. This metadata is crucial for identifying abnormal sender behavior, potential phishing attempts, and targeted attacks. 
    • Example: SMTP headers with information such as sender IP, recipient, subject, and sending server, which can be analyzed for anomalies and threat detection (typically available in raw or structured text format; see the parsing sketch after this list).
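
Python's standard library can parse raw headers directly, which makes this metadata easy to extract at scale. A minimal sketch; the header block is a made-up example with a lookalike sender domain:

```python
from email.parser import Parser

# Made-up raw headers illustrating a lookalike sender domain (examp1e vs example).
raw_headers = """\
From: billing@examp1e-corp.com
To: finance@example.com
Subject: Urgent: invoice overdue
Received: from mail.unknown-relay.net ([203.0.113.7])
"""

msg = Parser().parsestr(raw_headers, headersonly=True)

# Pull out the fields most useful for phishing triage.
print("Sender:  ", msg["From"])
print("Subject: ", msg["Subject"])
print("Received:", msg["Received"])
```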

Findings

Findings are insights derived from the analysis of various data sources, including logs, and can reflect the outcome of point-in-time assessments or anomalies detected during continuous monitoring. Trend analysis and historical comparisons are subcategories of continuous monitoring, which we'll cover in the analysis portion of this blog series.

These insights highlight specific issues, vulnerabilities, or anomalous activity within an environment. Findings are actionable, meaning they require attention or remediation to address the identified security threats and vulnerabilities. In short, findings are what you get when you distill raw data into insights. They are generated across all sorts of security tools, including SIEMs, vulnerability scanners, CNAPPs, GRC tools, and more.
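
Because findings arrive from many tools in many shapes, one of the first data engineering tasks is normalizing them into a common schema so they can be queried and prioritized together. Below is a minimal sketch; the tool names and field mappings are hypothetical, and a real pipeline would maintain one mapping per tool:

```python
def normalize_finding(raw: dict, source_tool: str) -> dict:
    """Map a tool-specific finding onto one shared, queryable shape."""
    if source_tool == "vuln_scanner":   # hypothetical scanner output fields
        return {
            "tool": source_tool,
            "title": raw.get("plugin_name"),
            "severity": str(raw.get("severity", "unknown")).lower(),
            "asset": raw.get("host"),
        }
    if source_tool == "cnapp":          # hypothetical CNAPP output fields
        return {
            "tool": source_tool,
            "title": raw.get("findingTitle"),
            "severity": str(raw.get("riskLevel", "unknown")).lower(),
            "asset": raw.get("resourceArn"),
        }
    raise ValueError(f"No normalization mapping for {source_tool}")

print(normalize_finding(
    {"plugin_name": "OpenSSL out of date", "severity": "High", "host": "web-01"},
    "vuln_scanner",
))
```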

Alerts

The first type of finding we'll discuss is the alert. Alerts are crucial in security because they notify teams, often in real time, when a potential threat or vulnerability has been detected. A critical-severity alert in a sensitive environment should elicit a strong response and investigation to mitigate potential impact to the business.

There are many different types of alerts. They can be generated by detection logic or rules that scan for specific anomalous conditions and fire when those conditions appear in an environment. Alerts can also come from off-the-shelf solutions like User and Entity Behavior Analytics (UEBA) tools.

Alerts are arguably the most important security data type because without them, teams would have to rely on manually spotting anomalous behavior to thwart attacks. That said, there are many challenges with security alerts. High alert volumes, false positives, and alerts that lack context often lead to alert fatigue, which can burn out a security team and let anomalous activity slip through the cracks. This is why activities like enrichment, fine-tuning alerts, and cross-pollination across data sources are key: they help triage and filter out noise, enabling teams to prioritize genuine threats and the vulnerabilities that matter most.
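
Enrichment is one of the highest-leverage remedies for alert fatigue: joining an alert with asset metadata tells you immediately whether it fired on a crown-jewel database or a throwaway sandbox. A minimal sketch with a hard-coded, made-up inventory; in practice, this context would come from a CMDB or cloud asset inventory:

```python
# Made-up asset inventory; in practice this comes from a CMDB or
# cloud asset inventory, not a hard-coded dict.
ASSETS = {
    "db-prod-payments": {"environment": "prod", "data_class": "PCI"},
    "vm-dev-sandbox":   {"environment": "dev",  "data_class": "none"},
}

def enrich_alert(alert: dict) -> dict:
    """Attach business context so triage can prioritize on impact."""
    context = ASSETS.get(alert["asset_id"], {})
    alert["context"] = context
    # Escalate anything touching regulated data in production.
    is_crown_jewel = (context.get("environment") == "prod"
                      and context.get("data_class") != "none")
    alert["priority"] = "P1" if is_crown_jewel else "P3"
    return alert

print(enrich_alert({"asset_id": "db-prod-payments", "rule": "brute-force-login"}))
```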

Below are a few other types of findings: 

  • Vulnerability Scan Findings: Provide detailed insights into potential weaknesses and exposures in software or hardware. These findings emerge from scans that assess systems, applications, and networks against databases of known vulnerabilities, such as the Common Vulnerabilities and Exposures (CVE) list.
  • Compliance Findings: Reports and analyses that evaluate adherence to regulatory standards and frameworks such as PCI DSS, SOC 2, HIPAA, and more. These findings pinpoint areas of non-compliance.
  • Threat Intelligence Feeds: External sources of information about emerging threats, providing data on indicators of compromise (IoCs) such as malicious IP addresses and domains. Threat intelligence is useful for identifying whether activity matching IoCs from known threat campaigns is occurring in your environment. Feeds are delivered in a variety of formats (see the matching sketch after this list).
  • Social Engineering Test Results: Outcomes from simulated social engineering attacks, such as phishing, designed to evaluate the human element of security. 
    • Example: KnowBe4 (PDF and CSV formats)
  • Risk Assessment Findings: Evaluations that identify potential vulnerabilities and threats to an organization's assets. They often include a risk register, risk scores and mitigation recommendations. 
    • Example: Automated risk assessment tools (e.g., Vanta) generating detailed reports with identified risks, exploitation likelihood, impact assessments, and recommended remediations (JSON format).
  • Third-Party and Supply Chain Security Assessments: Evaluations of the security postures and practices of third-party vendors and suppliers. This is crucial for managing supply chain risks. These assessments produce findings related to the security and compliance status of external partners.
    • Example: Vendor security assessment platforms that provide comprehensive reports on the security measures, vulnerabilities, and compliance status of third-party service providers (PDF format).
  • Penetration Test Assessments: Simulated attacks conducted to evaluate the security of systems. The resulting reports detail vulnerabilities, exploitation paths, and remediation steps.
    • Example: Penetration testing software or services generating reports that outline discovered vulnerabilities, the methods used to exploit them, and recommendations for strengthening security posture (PDF format).
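
To show what putting a threat intelligence feed to work looks like, here's a minimal sketch that checks parsed DNS log entries against a set of known-bad domains. The feed contents and log records are made up; real feeds arrive as STIX bundles, CSVs, or API responses:

```python
# Made-up IoC set; real feeds arrive as STIX, CSV, or API responses.
MALICIOUS_DOMAINS = {"evil-c2.example.net", "exfil-drop.example.org"}

# Illustrative parsed DNS log entries.
dns_log = [
    {"ts": "2024-03-11T10:02:11Z", "src": "10.0.4.17", "query": "updates.example.com"},
    {"ts": "2024-03-11T10:02:13Z", "src": "10.0.4.17", "query": "evil-c2.example.net"},
]

for entry in dns_log:
    if entry["query"] in MALICIOUS_DOMAINS:
        print(f"IoC hit: {entry['src']} queried {entry['query']} at {entry['ts']}")
```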

Other 

The Other category encompasses various types of security-relevant data that do not neatly fit into the Logs, Findings, or Metadata categories. Below are just a few examples: 

  • Business Transactions: Encompasses records of business activities and operations, offering insights into the flow of transactions that, when monitored, can reveal anomalies or fraud.
    • Example: SAP ERP for tracking business operations (various formats, including CSV and proprietary SAP formats). 
  • Digital Forensics and Incident Response (DFIR) Data: Includes detailed evidence collected during the investigation of security incidents, such as disk images, memory captures, and logs. These are crucial for forensic analysis and investigations.
    • Example: Autopsy and The Sleuth Kit for disk imaging and analysis (various formats, including raw disk image format).
  • Software Bill of Materials (SBOM): A detailed list of all components in a software application, covering open-source and proprietary elements. SBOMs enhance security through vulnerability management, license compliance, and supply chain risk assessment by enabling the detection of vulnerable or risky third-party components.
    • Example: Tools like FOSSA and Ox Security produce SBOMs in formats such as SPDX and CycloneDX, typically serialized as JSON (see the sketch after this list).
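
Because formats like CycloneDX serialize to plain JSON, checking an application's components against a list of known-vulnerable packages takes only a few lines. A minimal sketch; the file path and vulnerable-package set are illustrative:

```python
import json

# Illustrative vulnerable-package set; in practice this comes from a
# vulnerability database or scanner, keyed by package name and version.
VULNERABLE = {("log4j-core", "2.14.1")}

with open("sbom.cyclonedx.json") as f:  # illustrative path
    sbom = json.load(f)

# CycloneDX JSON lists dependencies under a top-level "components" array.
for comp in sbom.get("components", []):
    if (comp.get("name"), comp.get("version")) in VULNERABLE:
        print(f"Vulnerable component: {comp['name']} {comp['version']}")
```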

Exporting and normalizing a wide array of data types in multiple formats is only part of the challenge; equally critical is the speed and latency of data processing required by different organizational needs. For instance, real-time monitoring and alerting on endpoint security data is crucial to thwarting and limiting the blast radius of an ongoing attack, while periodic processing and analysis may suffice for user access review logs, where threats are less immediate but still significant over time. 

The degree of scrutiny that teams place on each data source also depends on the importance of the asset or environment being monitored or scanned. A database storing customer credit card information will be treated differently than a storage account hosting an archive of public marketing materials. 

Each organization must determine the relative importance of each data source, as this will dictate the level of security and monitoring applied to it. Leveraging data engineering practices and building a robust data pipeline architecture can help security teams immensely in building security and compliance data patterns uniquely tailored to their digital footprint, for maximum operational and cost efficiency.

Conclusion

As security teams find themselves overwhelmed by increasing volumes of alerts, vulnerabilities, and other data, the need for sophisticated data engineering practices has never been clearer. It's not just about making sense of the story the data is telling you, but also leveraging it to configure detections and alerts, and to remediate the vulnerabilities that matter most at any given time. Without the capabilities that data engineering affords, most security teams are exposed to blind spots and fall into a reactive posture, depending on their tools to keep their organization safe.

This is at the core of why we started Monad. We're committed to delivering high-performance security data ETL pipelines enhanced with features like custom transforms and in-line enrichments to enable security and compliance teams to precisely tailor data handling to their needs. 

Lastly, a very special thank you to the wonderful folks who reviewed and contributed to part one of this series including Christian Almenar, Jacob Berry, Matt Jane, AJ Yawn, Santiago Fernandez, and Ashley Penney.

In part two of our series, we’ll be going in-depth on data collection and storage methods.

Stay Tuned!

Stay ahead of emerging security challenges with our innovative approach to security data ETL. Subscribe now for our monthly newsletter, sharing valuable insights on building a world-class, data-driven security program and to be notified when our early access program launches!