The Dark Data Problem in Cybersecurity—and How to Solve It

Learn why dark data poses a serious threat for security platforms—and how to bring that data to light with cost-effective, long-term hot storage.

Franz Knupfer

Published:

Jan 16, 2024

7 minute read

“Hello darkness, my old friend…” A great song opener, but a terrible philosophy for log data, especially when it comes to security logs. Yet according to estimates, the majority of all log data is dark, which means it’s never analyzed. Dark data presents a huge problem for cybersecurity platforms, which need data to be readily accessible for pattern detection, historical analysis, and artificial intelligence applications. However, there’s a reason data is dark in the first place. Highly performant, long-term data storage has traditionally been too expensive to scale.

To provide the most value to their customers—and to differentiate themselves from their competitors—cybersecurity platforms must bring dark data to light in a cost-effective manner. Considering the large volume of incoming log data, offering both long-term, performant storage and affordability may seem impossible. However, cloud data platforms such as Hydrolix, which can handle petabyte-scale volumes of log data, can dramatically lower long-term storage costs while providing high performance queries no matter how large or old the dataset.

This post covers the challenges and opportunities of dark data for cybersecurity platforms, including:

  • Why security log data goes dark
  • The real and hidden costs of dark data
  • Solving the problem of dark data with Hydrolix

Why Security Log Data Goes Dark

In the security industry, long-term data retention isn’t optional—it’s part of the cost of doing business. Enterprises need to retain security data long-term for many reasons, including for potential forensic analysis of security incidents, retaining customer trust, and policy compliance. However, the cost of keeping all that data available for analysis has traditionally been too expensive, so the data goes dark.

Long-Term Retention Is Required for Security Log Data…

In the case of tracking and remediating security incidents, breaches can remain undetected for months or even years, so organizations must keep data long-term for potential forensic analysis. According to a report from IBM, organizations took 207 days on average to identify breaches in 2022; breaches involving compromised credentials took 328 days on average. Once you detect a breach, it’s important to investigate when and how it happened so you can fully understand the scope of the problem. If you don’t have evidence from the original scene of the crime, you’ll be missing information you need to reconstruct what happened and prevent it from happening again. You’ll also lose the trust of your customers if you can’t clearly convey the impact of the breach as well as remediation and prevention steps.

Many industries also have compliance requirements for log data, including both security and audit logs. Examples include international banking (3–7 years) and healthcare institutions (up to 6 years in the US). Many other industries, such as energy and e-commerce, also have compliance requirements, and failure to comply can lead to hefty penalties.

…But Long-Term Hot Storage Is Too Expensive, So Data Goes Dark

Long-term data retention is necessary, but for the majority of platforms, keeping that data readily available in hot storage, which offers performant queries, is prohibitively expensive. This is because hot storage typically uses expensive hardware like solid state drives (SSDs) and compute-intensive software such as in-memory databases, dramatically increasing costs.

The alternative is to store that data for a short period of time in hot storage before moving it to cold storage. Cold storage uses hard disk drives (HDDs) or other relatively inexpensive storage systems, but it’s time-consuming to rehydrate and use the data, so it’s no longer easily available for analysis. Cold storage is where data goes dark.

The Real and Hidden Costs of Dark Data

Dark data often presents security teams and decision makers with a catch-22: it’s too expensive to keep in hot storage for analysis, but letting it go dark carries significant opportunity costs as well as real, tangible costs.

The Real Cost of Dark Data

Let’s start with the real, tangible cost: on average, 52% of an organization’s data storage budget is spent on dark data. As long as the data is dark, it provides no value to the organization; it is simply there for compliance and auditing purposes. If it contains essential information about attempted intrusions, malicious attacks, and breaches, you won’t know until after the breach is detected, the damage is done, and you rehydrate the data for deeper forensic analysis. On a per-GB basis, dark data in cold storage may be significantly cheaper than traditional hot storage, but it still consumes the majority of the average data storage budget.

Dark Data Increases the Risk of Costly Data Breaches

The opportunity costs are harder to quantify, but they become very real when a major security breach surfaces six months, a year, or even longer after the initial intrusion. According to an IBM report, the global average cost of a security breach in 2023 was $4.45 million. At least some of these breaches could be prevented, or their impact minimized, with readily available historical data.

For example, when malicious actors try to penetrate a system, they often take a stealthy, patient approach. If you’re analyzing security logs, you might see a handful of failed login attempts, but not enough to trigger alerts. Now imagine the same malicious actors continuing their attempts over a period of many months. Each attempt triggers more logs showing failed attempts from a specific IP.

If your data is only in hot storage for a short period, such as 30 days, you’ll have an incomplete record. You might see a handful of failed login attempts from the last 30 days, but you won’t see an ongoing, concerted attack that might span months or even years. The irony is that you have all the data you need to correlate those logs, but most of it is in the dark. It will only come to light in a worst-case scenario where the malicious actors have already succeeded.

This scenario is all too common. Cybersecurity professionals often work with limited data, leaving them no choice but to react to breaches after the fact. To level the playing field, they need all of their historical data available for analysis. The typical retention period of 30 days or less just isn’t enough.
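To make the "low and slow" pattern concrete, here is a minimal Python sketch using synthetic log records (the IP addresses and event names are hypothetical, and a real pipeline would query a log store rather than a list). An attacker who stays under a 30-day alert threshold becomes obvious once a full year of data is queryable:

```python
from collections import Counter
from datetime import datetime, timedelta

now = datetime(2024, 1, 16)
logs = []

# Attacker: one failed login every 3 days for a year. Only a handful of
# attempts land inside any single 30-day window, so no threshold alert fires.
for day in range(0, 365, 3):
    logs.append((now - timedelta(days=day), "203.0.113.7", "login_failed"))

# Background noise: occasional failures from an ordinary user.
for day in (2, 12, 22):
    logs.append((now - timedelta(days=day), "198.51.100.2", "login_failed"))

def failed_logins_by_ip(logs, window_days):
    """Count failed logins per source IP within the retention window."""
    cutoff = now - timedelta(days=window_days)
    return Counter(ip for ts, ip, event in logs
                   if event == "login_failed" and ts >= cutoff)

short = failed_logins_by_ip(logs, 30)    # typical hot-storage window
full_year = failed_logins_by_ip(logs, 365)

# In the 30-day window, the attacker (11 failures) barely stands out from
# normal noise (3 failures). Over a year, the pattern is unmistakable:
# 122 failures from a single IP.
print(short)
print(full_year)
```

The only difference between "looks like noise" and "obvious sustained attack" is how far back the query can reach, which is exactly what a short hot-storage retention window takes away.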

Dark Data Limits Decision Making

How many security threats did your organization face in the past year? What was the breakdown by attack vector? Which IPs had the most traffic during that time, and can you correlate any of those logs to DDoS attacks, failed logins, or other security intrusions? Can you use historical data to determine which of your services are most resilient and which are most vulnerable? When CVEs (common vulnerabilities and exposures) are discovered, can you search logs from the last year to determine whether the CVE was exploited?

If you have a short retention period for hot data, then the answer to all of these questions is almost certainly no. Decision makers need to be proactive, but with limited data, they are in the dark.

Dark Data Limits Predictive Analysis

Predictive analysis can help cybersecurity professionals understand when their systems are threatened. Not surprisingly, many security platforms are building and using AI models to detect anomalies and predict potential incidents like DDoS attacks. Security log data is a rich trove of training data that can teach an AI how to identify and stop the next security intrusion before it turns into a major breach. And models, once trained, can process security log data and find patterns much more quickly than humans can. According to a whitepaper on dark data, “Dark data provides an enormous, untapped resource of information that AI can analyze. And AI-powered analytics tools can help make dark data ready for analysis on a scale that would be impossible with current methods.”

However, AI requires readily available hot storage to run—and large datasets are important to ensure models are accurate. Otherwise, AI-powered analytics tools will not be able to illuminate the tremendous potential of dark data, either.
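As a toy illustration of the baseline-driven anomaly detection described above, here is a minimal Python sketch. The daily counts are invented, and a real model would use far richer features, but it shows why a longer historical window matters: the baseline is only as trustworthy as the data behind it.

```python
import statistics

# Hypothetical daily counts of failed logins over a historical window.
# A longer retention period means more history, and a more reliable baseline.
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5, 7, 6, 5, 4, 6]

mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomalous(count, threshold=3.0):
    """Flag a day whose failure count sits far outside the baseline
    (a simple z-score test; real detectors are more sophisticated)."""
    return abs(count - mean) / stdev > threshold

print(is_anomalous(5))   # a typical day: within the baseline
print(is_anomalous(40))  # a possible attack: far outside the baseline
```

With only a few days of hot data, the mean and standard deviation would be too noisy to trust; the same arithmetic over months of history gives the detector a stable notion of "normal."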

Solving the Problem of Dark Data With Hydrolix

To solve the problem of dark data, security platforms must offer their customers long-term hot storage at reasonable prices. However, for many security platforms, the high costs of hot storage make this impossible. Hydrolix solves this problem by writing data to inexpensive cloud storage and using high-density compression to reduce the footprint of incoming data by 20-50x. Generally, Hydrolix users get 75% or more savings on total cost of ownership (TCO) compared to solutions they previously used. Security platforms can use Hydrolix to offer their customers long-term, cost-effective hot storage.

Hydrolix ingests data at terabyte scale and delivers sub-second query latency even on large datasets, regardless of whether the data is an hour or a year old. Hydrolix is optimized for time-series data, making it an especially good fit for log data, including security logs. You can also transform your data on the fly at scale, allowing you to obfuscate PII (personally identifiable information) for compliance purposes, enrich your logs with metadata to make later analysis easier, and much more.
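As an illustration of the kind of PII obfuscation an ingest-time transform can perform, here is a minimal Python sketch. The field names and salted-hash approach are assumptions made for the example, not Hydrolix's actual transform syntax:

```python
import hashlib

# Hypothetical salt; in practice this would be a managed secret.
SALT = b"rotate-me-regularly"

def obfuscate_pii(record: dict) -> dict:
    """Replace PII fields with truncated salted hashes before storage.
    The same input always maps to the same token, so logs can still be
    correlated per user without exposing the underlying identity."""
    out = dict(record)  # leave the original record untouched
    for field in ("email", "username"):  # hypothetical PII field names
        if field in out:
            digest = hashlib.sha256(SALT + out[field].encode()).hexdigest()
            out[field] = digest[:16]
    return out

raw = {"ts": "2024-01-16T12:00:00Z", "email": "alice@example.com", "status": 401}
clean = obfuscate_pii(raw)
print(clean)  # email is now an opaque, stable token
```

Because the token is deterministic, analysts can still group and count events per user over long time ranges, which is what makes obfuscated logs useful rather than merely compliant.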

Ultimately, these benefits only matter if the underlying architecture is secure. Hydrolix features a zero-egress approach and runs entirely on your cloud architecture. Security platforms can customize and control Hydrolix infrastructure from within their own VPCs. For more details, see Using a VPC With Hydrolix for Your Log Data.

Hydrolix is GDPR compliant and SOC 2 certified, enforces user authentication on every query and streaming endpoint, and uses TLS 1.3 to encrypt data in transit.

With Hydrolix, you can ensure that your data never disappears into the dark. By doing so, you can give customers long-term, affordable visibility into their security log data, build powerful predictive models to quickly detect, prevent, and mitigate incidents, and give cybersecurity professionals the tools they need to keep their organizations secure. To build a next-generation security platform, you need next-generation tools like Hydrolix.

Next Steps

Learn more about Hydrolix and contact us for a POC.
