
Are You Ready for AI Observability?

To incorporate generative AI, you need to be ready for the challenges of AI observability, including ingesting, querying, and analyzing data at scale.

Franz Knupfer

Published:

Nov 15, 2023

8 minute read

Next Up: AI Observability

The torrid pace of innovation in artificial intelligence has left regulatory bodies struggling to keep up, but President Biden's recent Executive Order on artificial intelligence aims to change that. The order lays out actions to support, promote, and collaborate on artificial intelligence, and it signals that regulatory guidelines are in the works. According to Bruce Reed, a White House deputy chief of staff, "President Biden is rolling out the strongest set of actions any government in the world has ever taken on A.I. safety, security and trust."

But what exactly will those new actions look like? That's still a major unknown. Considering that model training generally takes a black-box approach, how can you prove your models will comply with future regulations? To start, you need to retain and monitor all of your data, and you need an observability platform that is robust enough to handle data at petabyte scale.

The Challenges of AI Observability

Many organizations are aware of the challenges around adopting AI. According to research from TechTarget’s Enterprise Strategy Group on generative AI adoption, 44% of organizations expect to invest in information management for generative AI while 37% expect the need for investment in data privacy, compliance, and risk. While those numbers are significant, they still represent a minority of organizations. Does that mean most organizations are prepared? Or simply that they aren’t fully aware of the challenges of managing AI-related data?

Unfortunately, there's a significant risk that many organizations aren't prepared. Most organizations haven't adopted generative AI yet. According to the same Enterprise Strategy Group research, only 4% of participating organizations were using mature generative AI, while 14% were in early production. Meanwhile, 65% were in stages ranging from consideration to pilot/proof of concept. Only 15% of participants had no plans to incorporate generative AI. In other words, 85% of organizations are at least considering generative AI, but only 18% are in production or early production. Early in the process, organizations will likely focus on getting the right talent and building effective models. They may use a data lakehouse along with an all-in-one observability platform that's sufficient for their current needs, and they may believe that these tools will scale with them.

However, in order to plan for information management around compliance, security, and model training, you need to think about very big numbers in terms of storage, ingest, and data analysis. Even if your current data management and observability tools are serving you well, will you have the capacity to ingest data at terabyte scale every day? Can you analyze petabyte volumes of data efficiently and store that same data at a reasonable cost? Large language models often have billions of parameters, and once they are in production, they are likely to generate huge volumes of data that you’ll need to analyze for anomalies and store for later compliance.
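To make those "very big numbers" concrete, here is a back-of-envelope sketch of how daily terabyte-scale ingest compounds into petabyte-scale retention costs. The ingest rate, retention window, and per-gigabyte price below are illustrative assumptions, not vendor figures:

```python
# Back-of-envelope sizing: how quickly daily terabyte-scale ingest
# accumulates, and what retaining it costs at a given per-GB price.
# All numbers here are illustrative assumptions, not real pricing.

def projected_storage_tb(daily_ingest_tb: float, retention_days: int) -> float:
    """Total volume retained if nothing is ever deleted or sampled."""
    return daily_ingest_tb * retention_days

def monthly_cost_usd(total_tb: float, price_per_gb_month: float) -> float:
    """Monthly storage bill at a flat per-GB price (1 TB = 1024 GB)."""
    return total_tb * 1024 * price_per_gb_month

# Example: 5 TB/day retained for three years at an assumed $0.02/GB-month.
total = projected_storage_tb(5, 3 * 365)  # 5,475 TB, roughly 5.3 PB
bill = monthly_cost_usd(total, 0.02)      # $112,128 per month, and growing
```

Even this modest scenario ends up in petabyte territory within a few years, which is why the choice of platform matters before the data piles up.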

Models are already expensive to train and launch commercially. It's not just about the compute and storage to generate and analyze that data now; you also have to plan for the future costs of that storage. You will need to keep your data long-term, but just how long is unknown. When it comes to security, you need to prepare for worst-case scenarios, and when it comes to legal and regulatory compliance, you cannot risk gaps in your data. To rein in costs, you may be forced to sample your data, meaning you no longer have true observability. What seem like the right storage and observability platforms now could be the wrong choices in terms of cost and retention just a year or two down the road.

AI Needs Hot Data to Train Models

There are two sides to AI observability. One is observability of AI systems—the other is using AI to improve the observability of other systems, which can help you find anomalies and patterns that you might not otherwise be able to detect. So when your observability systems are generating data, it's not just dark data that's being kept for future compliance and security reasons. It's also a rich trove of information that can be used to train models and enable AI-assisted observability.

However, AI models need data that is always "hot" (immediately available) for training runs. To manage storage costs, enterprises typically use tiered storage, which ranges from "hot" to "cold." Hot storage is optimized for low-latency queries, which comes at an increased cost. Cold storage is much cheaper but can take as much as 48 hours to restore, creating bottlenecks for model training and data science teams. Since AI models use all of the data available in a training run, cold storage slows the whole process down. That means you cannot manage your storage costs by moving model data into cold storage. And if you move other data into cold storage, you limit your ability to use that data for both training and insights.
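The hot/cold trade-off can be sketched as a toy model. The per-GB prices below are assumptions chosen for illustration, and the 48-hour restore delay is the figure from the paragraph above; neither is a quote from any particular provider:

```python
# Toy model of the hot/cold storage trade-off: cold tiers are far
# cheaper per GB, but add a restore delay before a training run can
# read anything. Prices and delays are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class StorageTier:
    name: str
    price_per_gb_month: float  # assumed price, USD
    restore_hours: float       # delay before data is queryable

HOT = StorageTier("hot", 0.10, 0.0)
COLD = StorageTier("cold", 0.004, 48.0)

def monthly_cost(tier: StorageTier, tb: float) -> float:
    """Monthly bill for `tb` terabytes on this tier (1 TB = 1024 GB)."""
    return tier.price_per_gb_month * tb * 1024

def hours_to_first_batch(tier: StorageTier, query_seconds: float = 1.0) -> float:
    """Hours until a training job can read its first batch of data."""
    return tier.restore_hours + query_seconds / 3600
```

The cold tier wins on monthly cost by a wide margin, but every training run pays the restore delay up front—which is exactly why model data can't simply be tiered down.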

To properly use and store this data for observability and compliance purposes, you need a tool that grows with your model data and that’s cost-effective at petabyte scale. Most SaaS observability platforms are too expensive for large volumes of data, leaving you to patch together complex pipelines and cold storage solutions that make it challenging to rehydrate your data. Raw object storage, on the other hand, while inexpensive, isn’t indexed, making queries slow.

If your data isn’t hot, you’ll have a vast amount of dark data. You’ll never know what business questions you might be able to answer, and how you’d be able to further harness the benefits of machine learning, unless you find a way to bring that data to light.

The Basic Necessities of AI Observability

To meet the needs of AI observability, whether that's observability of AI systems or AI-assisted observability of other systems, platforms need to offer high-performance storage at scale while managing costs. Specifically, they must provide the following:

  • Data ingest at terabyte scale: Models generate and use vast amounts of data, which you should be able to monitor in near real time.
  • “Always hot” storage with low-latency queries: Models use all of the data in a dataset. To ensure your models are performant, you need sub-second query latency no matter how large or how old the data, with no data rehydration needed. With always-hot storage, AI platforms can use all of your data for model learning and actionable insights.
  • Cost-effective storage: You’ll need so much storage that a difference of even pennies per gigabyte can lead to an explosion in costs. Anything you can do to reduce the size of your data footprint, whether that’s better compression or more affordable storage, will be necessary for your AI endeavors to be profitable.
  • Integrations with AI and ML platforms: Data should be available in the tools that data scientists use, whether that’s Apache Spark, Databricks, or other machine-learning platforms.
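The "pennies per gigabyte" point in the list above is easy to quantify. In this sketch, the data volume and the one-cent price difference are illustrative assumptions:

```python
# How a "pennies per gigabyte" price difference compounds at
# petabyte scale. The 10 PB volume and one-cent delta are
# illustrative assumptions, not real-world figures.

PETABYTE_GB = 1024 ** 2  # 1 PB = 1,048,576 GB

def annual_delta_usd(petabytes: float, price_delta_per_gb: float) -> float:
    """Extra annual spend from a per-GB price difference, billed monthly."""
    return petabytes * PETABYTE_GB * price_delta_per_gb * 12

# A one-cent-per-GB difference on 10 PB of retained data:
extra = annual_delta_usd(10, 0.01)  # roughly $1.26M per year
```

At these volumes, a price difference that looks like rounding error on a rate card becomes a seven-figure line item.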

And here’s the crux of the issue: most observability platforms weren’t built to handle the compute and storage needed for true AI observability.

Most Observability Platforms Can’t Manage the Scale and Cost of AI Observability Yet

Major players in the observability space, such as Splunk, New Relic, Elastic, and Datadog, were not built to handle the vast amounts of data that AI needs in a cost-effective manner. Many of these larger platforms are now being acquired; examples include Cisco’s acquisition of Splunk and Francisco Partners’ acquisitions of New Relic and Sumo Logic. As Cisco CEO Chuck Robbins said when announcing the acquisition, “Our combined capabilities will drive the next generation of AI-enabled security and observability.”

The goal is clear: lead the next generation of AI observability. However, all of these platforms were originally built with yesterday’s infrastructure in mind. Splunk was first released in 2003; New Relic, Elastic, and Datadog all followed between 2008 and 2012. Over the years, they have adapted and matured, but it is extraordinarily challenging to change the underlying architecture of a system. In 2010, 2 zettabytes of data were generated over the course of the year. (A zettabyte equals a billion terabytes.) In 2023, that number is projected to reach 120 zettabytes, more than 60 times as much data as in 2010. These platforms are still dealing with storage and compute costs that were reasonable in a two-zettabyte world, but those costs are no longer competitive, or even feasible. And this doesn’t even account for the astronomical amount of data that AI will consume and generate in the coming years.

Don’t count these big platforms out—they have gotten where they are today by creating mature, user-friendly solutions. But many challenges lie ahead, and it’s not yet clear how these industry players will make compute and storage more cost-effective for the next generation of big data.

Next Generation Tools for AI Observability

The next generation of tools for AI observability needs to ingest data at terabyte scale in near real time and offer storage that is always hot while also being cost-effective. These tools need to include integrations for data scientists to work with the platforms and languages they use most, such as Apache Spark, Databricks, and Python. And to work effectively, they need to leverage more recent technologies like Kubernetes, which allows for tremendous scalability, decoupled components, and stateless architectures.

Finally, while this may seem like a contrarian opinion in the age of SaaS solutions, it’s time for enterprises to have the option of keeping all their data from the moment it’s ingested—not on bare-metal servers, but on the cloud providers of their choice. By doing so, you get the benefits of cost-effective object storage and you can negotiate volume discounts for your data.

If you are still in the consideration or pilot phase for incorporating AI in your systems, you should start thinking about AI observability before it breaks the bank. By keeping your costs down and ensuring that you’ll be able to ingest, store, and analyze all your observability data at scale, you can focus on acquiring the talent you need, building models, and ensuring that they are effective and provide a positive user experience.

Next Steps

Hydrolix is built to handle your log data at terabyte scale—and give you the data and insights you need without limits. Learn more about Hydrolix.

Photo by Clark Van Der Beken on Unsplash
