Build A Data Lake For Your Security Logs With Scanner
Jan 29, 2024
Learn about Scanner, a fast querying platform for security log data. Discover the challenges of managing data lakes and the benefits of using a search index. Explore the design philosophies of the Scanner platform and its integration into security log analysis workflows. Understand the indexing strategies for variegated data and the importance of regulatory compliance and data security. Also, find out about the need for better visibility and queryability in data management.
Scanner enables fast querying of high scale log data for security auditing.
Scanner leverages AWS S3 for storing log data, allowing for efficient ad hoc searches and cross-correlations.
Scanner focuses on making search on massive datasets affordable and fast for security teams.
Deep dives
Scanner: An Efficient Security Data Lake Platform
Scanner is a security data lake platform that offers fast and cost-effective analysis of security logs. It was created to tackle the challenges faced by security teams in managing and searching through massive amounts of log data. Scanner enables users to build correlations and relationships between different log sources, allowing for in-depth investigations and threat detection. It indexes the content of logs stored in AWS S3, making it easy to search and explore logs that are often in JSON format. The platform's serverless architecture leverages AWS ECS Fargate for indexing compute, providing scalability and agility. Scanner focuses on decoupling storage and compute, ensuring that user data remains under their control in their own S3 buckets. It offers a user-friendly interface that allows for iterative and collaborative investigations, combining search results from multiple log sources into a single view.
Efficient Log Analysis with Scanner for Security Purposes
Scanner addresses the limitations of traditional log analysis tools, such as Elasticsearch and Splunk, by providing a cloud-first solution with a focus on speed and scalability. It leverages the power of AWS S3 for storing log data, enabling efficient ad hoc searches and cross-correlations across large volumes of logs. With Scanner, security teams can easily detect and investigate security incidents by searching through vast amounts of data, including high-volume log sources like VPC flow logs. The platform's indexing capabilities, combined with its fast query execution, make it possible to identify relationships and patterns in log data, enabling better threat detection and response. Scanner's iterative and collaborative user experience allows for seamless exploration and hypothesis testing, making it an invaluable tool for security engineers.
The Future of Scanner: Enhanced Data Acquisition and Advanced Capabilities
Scanner is continuously evolving to support a broader range of data acquisition paths, enabling users to acquire logs from various tools and systems. Currently, it focuses on tools that naturally upload logs to S3, like AWS CloudTrail and CrowdStrike Falcon Data Replicator. However, the platform is expanding to include data connectors that can pull logs from API-based sources, providing a unified data acquisition experience within the Scanner interface. The team behind Scanner is also exploring possibilities for multi-hop queries and automated relationship discovery, allowing users to uncover complex attack patterns that span multiple stages. Future enhancements may include integrating with other security tools for automated response and collaboration, while maintaining its core strengths of fast search, scalability, and user control over data.
Balancing search speed with ingestion speed
One of the main challenges in building Scanner is finding a balance between search speed and ingestion speed. While tools like Elasticsearch prioritize fast ingestion, they suffer from slow query performance. Scanner's goal is to make it quick to query for specific information, such as finding a needle in a haystack of petabytes of logs in under 100 seconds. By using a coarse-grained index and efficient scanning techniques, Scanner aims to provide fast querying while keeping ingestion low-cost and handling high traffic efficiently.
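To make the idea of a coarse-grained index concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Scanner's actual implementation: the chunk layout, the tokenizer, and the names `tokenize`, `build_coarse_index`, and `search` are all hypothetical. The index records only which chunk of logs contains a token, not which line, so it stays small and cheap to build at ingestion time. A query then scans just the chunks that could match.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    token = []
    for ch in text.lower():
        if ch.isalnum():
            token.append(ch)
        elif token:
            yield "".join(token)
            token = []
    if token:
        yield "".join(token)


def build_coarse_index(chunks):
    """Map each token to the set of chunk IDs that contain it.

    The index is coarse: it does not record line numbers or offsets,
    only chunk membership.
    """
    index = defaultdict(set)
    for chunk_id, lines in chunks.items():
        for line in lines:
            for tok in tokenize(line):
                index[tok].add(chunk_id)
    return index


def search(index, chunks, query):
    """Return (chunk_id, line) pairs whose line contains every query token."""
    tokens = list(tokenize(query))
    if not tokens:
        return []
    # Step 1: use the index to narrow the search to candidate chunks.
    candidates = set.intersection(*(index.get(t, set()) for t in tokens))
    # Step 2: scan only those chunks line by line.
    hits = []
    for chunk_id in sorted(candidates):
        for line in chunks[chunk_id]:
            line_tokens = set(tokenize(line))
            if all(t in line_tokens for t in tokens):
                hits.append((chunk_id, line))
    return hits


# Hypothetical sample data: each chunk stands in for a file of JSON logs.
CHUNKS = {
    "chunk-0": [
        '{"eventName": "ConsoleLogin", "user": "alice"}',
        '{"eventName": "GetObject", "user": "alice"}',
    ],
    "chunk-1": ['{"eventName": "PutObject", "user": "bob"}'],
    "chunk-2": ['{"eventName": "ConsoleLogin", "user": "bob"}'],
}

INDEX = build_coarse_index(CHUNKS)
RESULTS = search(INDEX, CHUNKS, "ConsoleLogin bob")
```

In this sketch, the query "ConsoleLogin bob" narrows three chunks down to one before any line is scanned. The trade-off is the one described above: a coarser index is cheaper to write during ingestion, at the cost of some scanning at query time.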
Focusing on affordable and fast search capabilities
Scanner's primary focus is to make search on massive datasets both affordable and fast. The goal is to cut the cost of search tenfold while maintaining exceptional speed. This is especially important for security teams that work with large volumes of logs and regularly need to run investigations over historical data. Scanner aims to provide fast ad hoc searching as well as robust detection queries that help identify threats. Rather than trying to build every feature in the security space, Scanner aims to integrate well with other security tools and provide an excellent search experience, saving time for overworked security teams.
Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. Most available products either require too much effort to structure the logs or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high-scale log data for security auditing. In this episode, he shares the story of how it got started, how it works, and how you can get started with it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Scanner is and the story behind it?
What were the shortcomings of other tools that are available in the ecosystem?
What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)
A query engine is useless without data to analyze. What are the data acquisition paths/sources that you are designed to work with? (e.g. CloudTrail logs, app logs, etc.)
What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?
Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?
Can you describe the architecture of the Scanner platform?
What were the motivating constraints that led you to your current implementation?
How have the design and goals of the product changed since you first started working on it?
Given the security-oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?
What are the personas of the end-users for Scanner?
How has that influenced the way that you think about the query formats, APIs, user experience, etc. for the product?
For teams who are working with Scanner, can you describe how it fits into their workflow?
What are the most interesting, innovative, or unexpected ways that you have seen Scanner used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?
When is Scanner the wrong choice?
What do you have planned for the future of Scanner?
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.