Next-generation cybersecurity platforms are embracing a new approach to storage: the security data lake. While security data lakes have significant advantages over the monolithic datastores of legacy SIEMs, they also come with additional overhead. Without proper management, they can lead to messy data, slow queries, and high compute costs. To maximize the benefits of security data lakes and minimize their performance limitations, you need a data platform that's optimized for log data.
In this post, you’ll learn about:
- Why cybersecurity platforms are using security data lakes
- The limitations of security data lakes—and how to avoid them
- How Hydrolix can power your security data lake
Why Cybersecurity Platforms Are Using Security Data Lakes
The last few years have seen a great deal of success for cybersecurity platforms built on security data lakes. Companies like Panther and Anvilogic, which combine a security operations center (SOC) with a data lake, are driving down storage costs compared to expensive incumbent solutions with monolithic data structures. And companies building on security data lakes are collectively worth billions.
As a result, the term “security data lake” has generated a lot of buzz, much as the term “data lake” did before it. Both boil down to a single key concept: decoupled, cloud-based object storage. Object storage is cost-effective, highly scalable, and works well for both structured and unstructured data. Cybersecurity platforms generally work with vast amounts of data—whether that’s customer data or their own—so scalable, cost-effective storage is a requirement, and one that expensive SIEM solutions built on legacy storage architecture struggle to meet.
On a fundamental level, a data lake is just decoupled object storage. Because it’s highly scalable and cost-effective, object storage is often the best approach for big data, including security use cases.
The Benefits of Security Data Lakes
Security data lakes (or, put more simply, decoupled object storage) offer several major benefits over legacy data storage, including:
- Cost-effective: Decoupled object storage runs on inexpensive commodity hardware, in contrast to tightly coupled storage, which needs expensive specialty hardware to perform at scale.
- Long-term data retention: Performant “hot” storage is expensive on legacy storage architectures, so data must be moved to “cold” storage that’s time-consuming to query. Many incumbent SIEM solutions built on legacy data architectures have retention windows as short as a few days to a few weeks before data becomes unavailable. With cost-effective object storage, you don’t need to move data to tiered, inaccessible cold storage. Instead, you can offer your customers long-term data retention for threat hunting, forensic analysis, AI training runs, and more.
- Highly scalable: Object storage is easy to scale horizontally. For example, an Amazon S3 bucket grows automatically as you add data, and you pay as you go with no practical limit on capacity. In contrast, legacy data architectures are tightly coupled: storage and compute must scale together, resources must be preallocated, and scaling is vertical, which is far more expensive than horizontal scaling because it requires more specialized hardware.
- Fewer data silos: Traditional SIEM solutions store security-specific data. However, as Ross Haleliuk writes in Venture In Security, “There is no such thing as ‘security data’, ‘marketing data’ or ‘financial data’ – there is business data that needs to be accessed and analyzed from different angles and by different teams; security is just one of many use cases.” Because many traditional SIEM solutions silo security data, it either needs to be duplicated for other use cases (which is costly) or can’t be used in broader business data use cases. On the other hand, security data lake infrastructure can potentially hold many types of data, not just security data—another way in which security data lakes provide more flexibility.
While security data lakes—or rather, the decoupled object storage they’re built on—offer an elegant solution to the problems of cost and scalability, they also have inherent limitations. You can use them to build a high-performance, cost-effective cybersecurity solution with the right data platform. However, with the wrong data platform, or the wrong DIY approach, you can get saddled with poor performance and high query compute costs.
The Limitations of Security Data Lakes—and How to Avoid Them
To successfully build a platform on a security data lake, you need to maximize the performance of decoupled object storage and take a structured approach to data management. Otherwise, you’ll run into issues that can sink your platform.
Maximizing the Performance of Decoupled Object Storage
Data lakes aren’t inherently performant—they are simply repositories for data that can be structured, unstructured, or both. All read and write operations are HTTP requests, and large-scale operations like real-time data ingestion and ad hoc queries can trigger thousands of concurrent requests. A query that retrieves a large volume of data and only then filters it, for example, will perform extremely poorly. To offer near-real-time analytics and queries, you need a data platform that:
- Maximizes parallelism with major cloud providers to ingest data at scale
- Structures, transforms, and enriches data before writing it to storage
- Uses a high-efficiency query engine to filter large datasets for ad hoc queries
This is just the tip of the iceberg—you also need to manage compute clusters (typically using Kubernetes), build efficient partitioning and merge services, and much more.
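To make the retrieve-then-filter problem concrete, here is a minimal sketch in Python. It simulates a bucket of log objects and compares a naive query, which transfers every record before filtering client-side, against one that pushes the predicate down to where the data lives. All names and numbers here are illustrative, not any platform's actual API.

```python
# Toy model: a "bucket" of objects, each holding a chunk of log records.
records = [{"ts": t, "status": 500 if t % 100 == 0 else 200} for t in range(10_000)]
objects = [records[i:i + 1000] for i in range(0, len(records), 1000)]

def naive_query(objects, pred):
    # Fetch every object over the network, then filter on the client.
    fetched = [rec for obj in objects for rec in obj]  # full transfer
    return [r for r in fetched if pred(r)], len(fetched)

def pushdown_query(objects, pred):
    # Apply the predicate where the data lives; only matches cross the wire.
    results, transferred = [], 0
    for obj in objects:
        matches = [r for r in obj if pred(r)]
        transferred += len(matches)
        results.extend(matches)
    return results, transferred

errors_naive, moved_naive = naive_query(objects, lambda r: r["status"] == 500)
errors_push, moved_push = pushdown_query(objects, lambda r: r["status"] == 500)
assert errors_naive == errors_push
print(moved_naive, moved_push)  # 10000 vs 100 records transferred
```

The results are identical, but the naive approach moves 100x more data; at petabyte scale, that difference is the gap between sub-second queries and unusable ones.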
While you can build a data platform in-house, this approach comes with both risks and unknowns. Investing engineering resources in data management and optimization siphons them away from building cybersecurity features, and you’ll be reinventing solutions that already exist. There’s also no guarantee you’ll achieve the performance needed to compete in a red-ocean market. Even cutting-edge threat detection and threat hunting capabilities won’t be effective if end users experience subpar query performance.
Generally, you’re better off using a cloud data platform that optimizes each part of the data lifecycle for you. While Snowflake offers security data lake capabilities, Hydrolix offers a number of advantages over Snowflake for security use cases.
Taking a Structured Approach to Data Management
Poorly managed data lakes turn into data swamps, and even a well-managed data lake can get mucky at the edges. Any issues with data management only become more problematic at scale. Without standardized naming conventions, for example, it becomes difficult to reason about and query data from different sources. Incoming data must be enriched so it doesn’t lose context. And unstructured data is less efficient to query and search, which is critical when working with data at scale. The performance penalty and increased compute cost of querying large amounts of unstructured data can leave end users frustrated and with gaps in their understanding.
The result of poor design is dark data: data that’s simply kept for storage purposes and never analyzed. This is the same issue that cybersecurity teams run into with data kept in cold storage. A security data lake is no better than a legacy SIEM solution if data is too expensive and difficult to query and analyze.
To avoid this issue, transform and structure data before it goes into object storage. With Hydrolix, you can transform your data even when ingesting millions of rows per second. This includes custom transformations (such as enrichment, standardization, and aggregation) as well as built-in optimizations such as high-density compression (reducing your data footprint by up to 50x), per-column indexing, and data partitioning to provide sub-second query latency even on datasets with billions of rows.
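As a rough illustration of ingest-time structuring, the Python sketch below standardizes vendor-specific field names and enriches each event before it would be written to storage. The field mapping, severity rule, and event shape are all hypothetical, chosen only to show the pattern of transforming before the write.

```python
# Hypothetical ingest-time transform: standardize names, then enrich.
from datetime import datetime, timezone

# Map vendor-specific keys to standard names so all sources query alike.
FIELD_MAP = {"src": "source_ip", "dst": "dest_ip", "act": "action"}

def transform(raw: dict) -> dict:
    event = {FIELD_MAP.get(k, k): v for k, v in raw.items()}
    # Enrich: attach an ISO timestamp so downstream queries can partition on time.
    event["timestamp"] = datetime.fromtimestamp(raw["ts"], tz=timezone.utc).isoformat()
    # Enrich: derive a severity so analysts don't re-compute it at query time.
    event["severity"] = "high" if event.get("action") == "deny" else "info"
    return event

raw = {"ts": 1700000000, "src": "10.0.0.5", "dst": "10.0.0.9", "act": "deny"}
event = transform(raw)
print(event["source_ip"], event["severity"])  # 10.0.0.5 high
```

Because every event lands in storage already named, timestamped, and enriched, queries across sources stay uniform and no context is lost after the fact.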
Building a Security Data Lake with Hydrolix
Hydrolix is a cloud data platform that’s optimized for timestamped log data, with several distinct advantages over all-purpose data platforms like Snowflake. With Hydrolix, you get clear, upfront pricing instead of Snowflake’s opaque usage-based pricing. Hydrolix users typically see 20x-50x data compression, compared to around 4x for Snowflake, and a dramatically smaller data footprint means more cost-effective storage. Overall, Hydrolix users typically save 75% or more on total cost of ownership (TCO) compared to their previous solution. You can pass those savings along to customers, offer longer data retention windows, widen your margins, and free up budget to focus on building security features.
With Hydrolix, you can ingest data in near real-time and get sub-second query latency even at terabyte scale, supporting real-time use cases such as alerting, threat detection, and security dashboards. Hydrolix uses advanced query optimization features including partition pruning, predicate pushdown, and micro-indexing (indexing the location of data in storage). Together, these features provide low-latency query performance on petabyte-sized datasets, making them well suited to threat hunting and other exploratory queries.
Hydrolix transforms your data in real time and at scale before writing it to storage—so even if you decide to call it a “security data lake,” it never becomes a “security data swamp.” Hydrolix also takes a zero-egress approach to data and runs entirely in your cloud infrastructure: you can customize and control Hydrolix from within your own VPCs.