Data Sharing Across Business And Platform Boundaries
Feb 11, 2024
Data sharing across business and platform boundaries is complex due to business rules, regulations, and technical considerations. Andrew Jefferson discusses building a robust system for data sharing, the techno-social considerations, and the Bobsled platform that aims to simplify the process. Topics include challenges of data sharing across cloud platforms, boundaries in data transfer systems, innovative applications of data sharing, shift left and shift right mentality, and the lack of AI and vector database solutions.
Building a unified data sharing solution across different cloud platforms is complex due to their unique abstractions and limitations.
Careful evaluation of the need for data sharing is important, and data clean rooms or native tools can be more appropriate in certain scenarios.
Bobsled's API provides innovative data sharing capabilities, such as real-time updates and seamless integration, while addressing concerns of auditability, governance, and compliance.
Deep dives
Data Sharing Challenges and Abstractions
Building an abstraction over different cloud systems is a major challenge in data sharing. Each platform has its own abstractions and limitations, which makes a unified solution difficult to create. The devil is in the details: the clouds look similar on the surface but differ in ways that matter. For example, AWS has access points, which are absent in other clouds such as Google Cloud Storage. Executing serverless functions also requires building an abstraction that spans the different clouds. These nuanced differences in how shared data is stored and accessed make it hard to build a cohesive solution across platforms.
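To make the abstraction problem concrete, here is a minimal Python sketch of a single sharing interface with one backend per cloud. Every name in it (SharingBackend, Grant, share, the mechanism strings) is hypothetical and invented for illustration; it calls no real cloud SDK and is not Bobsled's implementation. The only detail taken from the summary above is that AWS offers access points while Google Cloud Storage does not, so the second backend assumes a bucket-level IAM binding instead.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Record of how read access was granted on one platform (hypothetical)."""
    platform: str
    resource: str
    recipient: str
    mechanism: str


class SharingBackend(ABC):
    """Common interface; each cloud maps it onto its own primitives."""

    @abstractmethod
    def grant_read(self, resource: str, recipient: str) -> Grant:
        ...


class AwsBackend(SharingBackend):
    def grant_read(self, resource: str, recipient: str) -> Grant:
        # Assumption: a real version would create an S3 access point here.
        return Grant("aws", resource, recipient, "access_point")


class GcsBackend(SharingBackend):
    def grant_read(self, resource: str, recipient: str) -> Grant:
        # Assumption: no access points on GCS, so fall back to an IAM binding.
        return Grant("gcp", resource, recipient, "iam_binding")


# Registry of supported platforms (illustrative, not exhaustive).
BACKENDS = {"aws": AwsBackend(), "gcp": GcsBackend()}


def share(platform: str, resource: str, recipient: str) -> Grant:
    """Dispatch one logical 'share' request to the platform-specific backend."""
    try:
        backend = BACKENDS[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform}") from None
    return backend.grant_read(resource, recipient)


if __name__ == "__main__":
    print(share("aws", "example-bucket/orders/", "recipient-a"))
    print(share("gcp", "example-bucket/orders/", "recipient-b"))
```

The caller only ever sees share(); the per-cloud differences stay inside each backend, which is where most of the real complexity described above would accumulate.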
Considerations for Data Sharing
When considering data sharing, it is important to carefully evaluate the need for it. Sometimes, it may not be necessary and can be avoided altogether. Additionally, if stringent assurances around data visibility, extraction, and usage are required, a data clean room solution might be more appropriate. Data clean rooms provide controlled environments and utilize techniques like differential privacy to perform aggregate queries without revealing underlying data. In migration scenarios where data needs to be moved from one platform to another, using native tools provided by the destination platform might be a better choice.
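The differential privacy idea mentioned above can be illustrated with the Laplace mechanism, a standard technique for releasing a noisy count. This is a minimal sketch, not a description of how any particular clean room product works; the function names, the sample rows, and the epsilon value are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(rows, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of rows matching predicate.

    Adding or removing one row changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Made-up rows; the analyst sees only the noisy aggregate, never the rows.
    rows = [{"age": a} for a in (25, 41, 67, 70, 33)]
    rng = random.Random()
    print(private_count(rows, lambda r: r["age"] > 40, epsilon=1.0, rng=rng))
```

A smaller epsilon adds more noise and gives stronger privacy; a larger epsilon gives more accurate answers at the cost of weaker guarantees.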
Innovation and Unexpected Applications
Interesting and innovative applications of data sharing include automated fulfillment from CRM systems and using Bobsled's API to integrate with platforms like Salesforce. Sharing data and receiving updates in real time improves productivity and streamlines workflows. Bobsled has also been used to help organizations move from non-cloud-native sharing methods, such as exchanging CSV files, to cloud-native sharing. Loading data from sources like CSV into Bobsled and then using that storage as a source for further sharing with other platforms has also proven effective and efficient.
Simplifying Data Sharing Across Multiple Platforms
Bobsled aims to provide a simple and straightforward experience for data sharing across various platforms. With Bobsled, users can share specific views, tables, or data from their storage to recipients on platforms like Databricks, BigQuery, Azure, or Blob storage. Recipients can access and use the shared data on their native platform without relocating their workloads or performing complex data joining and processing tasks. Bobsled addresses the challenge of efficient data sharing in high-cardinality, many-to-many environments, even across different cloud regions or platforms.
Ensuring Auditability and Governance in Data Sharing
Data sharing involves important considerations regarding auditability, governance, and compliance, especially for sensitive data like healthcare information. Bobsled addresses these concerns by providing a protocol that helps establish a clear, shared understanding of agreements between data providers and recipients. It facilitates compliance with data processing rules, such as the right to be forgotten, by standardizing communication and deletion processes. Additionally, Bobsled offers abstractions, API calls, and two-way sharing capabilities to track the access, actions, and receipts of shared data. This improves auditability, allows recipients to demonstrate compliance, and streamlines the management of governance and access controls.
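As a rough illustration of what a standardized deletion process with an audit trail could look like, the Python sketch below tracks a right-to-be-forgotten request until every recipient has acknowledged it. It is a hypothetical model only: ShareLedger, its methods, and the event names are invented here and do not represent Bobsled's protocol or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ShareLedger:
    """Hypothetical append-only log of deletion requests and acknowledgements."""
    recipients: set
    events: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)

    def _log(self, actor: str, action: str, subject_id: str) -> None:
        # Each event records when it happened, who acted, and on which subject.
        timestamp = datetime.now(timezone.utc).isoformat()
        self.events.append((timestamp, actor, action, subject_id))

    def request_deletion(self, subject_id: str) -> None:
        """Provider asks every recipient to delete data about subject_id."""
        self.pending[subject_id] = set(self.recipients)
        self._log("provider", "deletion_requested", subject_id)

    def acknowledge_deletion(self, recipient: str, subject_id: str) -> None:
        """Recipient confirms deletion; unexpected confirmations are rejected."""
        waiting = self.pending.get(subject_id)
        if waiting is None or recipient not in waiting:
            raise ValueError(f"no pending deletion of {subject_id} for {recipient}")
        waiting.remove(recipient)
        self._log(recipient, "deletion_acknowledged", subject_id)

    def is_complete(self, subject_id: str) -> bool:
        """True once a deletion was requested and all recipients confirmed."""
        return subject_id in self.pending and not self.pending[subject_id]


if __name__ == "__main__":
    ledger = ShareLedger(recipients={"recipient-a", "recipient-b"})
    ledger.request_deletion("subject-123")
    ledger.acknowledge_deletion("recipient-a", "subject-123")
    print(ledger.is_complete("subject-123"))
    ledger.acknowledge_deletion("recipient-b", "subject-123")
    print(ledger.is_complete("subject-123"))
```

Because both sides write to the same event log, either party can later show when a request was made and when it was fulfilled, which is the kind of two-way evidence the paragraph above describes.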
Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
Your host is Tobias Macey, and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation?
What is the current state of the ecosystem for data sharing protocols/practices/platforms?
What are some of the main challenges/shortcomings that teams/organizations experience with these options?
What are the technical capabilities that need to be present for an effective data sharing solution?
How does that change as a function of the type of data? (e.g. tabular, image, etc.)
What are the requirements around governance and auditability of data access that need to be addressed when sharing data?
What are the typical boundaries along which data access requires special consideration for how the sharing is managed?
Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform?
What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing?
When is Bobsled the wrong choice?
What do you have planned for the future of data sharing?
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.