Open||Source||Data

Charna Parkey
Jan 4, 2023 • 34min

Workflow Engines and Building a Domain Specific Language for Data Quality with Tom Baeyens

This episode features an interview with Tom Baeyens, Co-founder and CTO of Soda, where he oversees the company's product development, software architecture, and technology strategy. He is passionate about open source and committed to building a community where data engineers can succeed using the Soda Data Monitoring Platform. Tom is the inventor of the widely used open source projects jBPM and Activiti. He also co-founded Effektif, a cloud process automation company.

In this episode, Sam and Tom discuss the evolution of open source workflow engines, data contracts, and why data quality needs a language approach.
-------------------
“Where we're heading is what I think is exactly the same as with software engineering in the testing. Test-driven development was a radical new thing back then. But then it turns out, you can much more reliably release software. And this is exactly the same here. If you don't inject data testing, data observability throughout your data stack, then how are you going to trust the data that you put into your machine learning model? This is something that people are realizing, but we're still figuring out the best practices, the dos, the don'ts. We've come a long way, but there's still a way to go before this is as common and as normal as in the test-driven development software engineering space.” – Tom Baeyens
-------------------
Episode Timestamps:
(01:23): What open source data means to Tom
(04:34): Tom’s motivations for creating jBPM
(09:39): What led Tom to building Soda
(13:57): Why data quality needs a language approach
(19:24): The community of Soda
(22:47): The future of Soda as a technology
(24:59): A question Tom wishes to be asked
(30:24): Tom’s advice for engineers who want to leverage data observability tools
-------------------
Links:
LinkedIn - Connect with Tom
Twitter - Follow Tom
Visit SodaCL
Dec 14, 2022 • 44min

Enabling Edge Workers, AI & ML, and The Future of Data Science with Matthew Rocklin

This episode features an interview with Matthew Rocklin, CEO of Coiled, the scalable Dask-based cloud platform. Prior to founding Coiled, Matthew worked on Dask at Anaconda and then NVIDIA where his teams focused on accelerating Dask through parallel computing and GPUs. Matthew is an industry speaker, author, and founding member of Pangeo, whose mission is to develop open source analysis tools for ocean, atmosphere, and climate science.

In this episode, Sam sits down with Matthew to discuss enabling edge workers, the future of data science, and the revolution of AI and ML.
-------------------
“There's all sorts of fun people using these tools and that's the most fun part of this job. You get to learn so much about so many different applications that are all so different and all so fascinating. You were thinking about all these different tools and technologies and I was talking to someone once, it's like, ‘Oh, it's like you're standing on the shoulders of giants.’ That's not quite right. There's lots of sort of normal size people all standing on each other's shoulders in like a massive pyramid. [...] Dask was designed to scale up an existing ecosystem. There's a legacy Python ecosystem that’ll provide a layer of parallel computing on top of it. You can do that either by rewriting the whole thing, which is not feasible, or you can do it by talking to lots of people and getting them to integrate in interesting, fun ways. That's actually been the fun parts of Dask. I think I've probably talked to every major maintainer group ever. I have worked with them to find out the ways to get everything to work smoothly together. And that's super fun. There's an interesting sort of technical and social hacking that occurs, which I think Python has done pretty well at, historically. Which is why it has success.” – Matthew Rocklin
-------------------
Episode Timestamps:
(00:58): What open source data means to Matthew
(03:29): Matthew’s motivations behind Python
(18:58): How Matthew is enabling edge workers
(34:46): What the future of the Python data space looks like
(39:29): Matthew’s advice for the technical data audience
(41:36): Executive producer, Audra Montenegro's backstage takeaways
-------------------
Links:
LinkedIn - Connect with Matthew
Twitter - Follow Matthew
Visit Matthew’s Website
Visit Dask
Dask Examples
Visit Coiled
SciPy Mission
Dec 7, 2022 • 35min

OSPOs, Measuring Community Success, and Self Knowledge with Nithya Ruff

This episode features an interview with Nithya Ruff, Head of Open Source Program Office at Amazon. At Amazon, she drives open source culture and coordination and engagement with external communities. Prior to Amazon, Nithya spearheaded and grew Open Source Program Offices (OSPOs) for Comcast and Western Digital. She has also served as the Director-At-Large on the Linux Foundation Board since 2016, where she works to advance the mission of building sustainable ecosystems that are built on open collaboration.

In this episode, Sam and Nithya discuss OSPOs, how to measure success, and the evolution of the data ecosystem.
-------------------
“I think if we look at what matters to customers, which is innovation, trust, and being a force for change with open source, then we can really deliver on the metrics that the company cares about.” – Nithya Ruff
-------------------
Episode Timestamps:
(04:02): What open source data means to Nithya
(06:29): What interested Nithya about open source software
(12:34): What Nithya learned at Western Digital and Comcast that she uses now at Amazon
(18:23): What Nithya teaches people in OSPO curriculum
(22:06): How the open source data ecosystem has evolved in the last decade
(27:44): One question Nithya wishes to be asked
(30:37): Nithya’s advice for folks who want to create an OSPO
-------------------
Links:
LinkedIn - Connect with Nithya
Twitter - Follow Nithya
Open Source Law, Policy and Practice
LinkedIn - Connect with Amazon
Twitter - Follow Amazon
Visit Amazon
Nov 23, 2022 • 37min

IoT Databases, Digital Twins, and Real Holodecks with Jonathan Beri

This episode features an interview with Jonathan Beri, Founder & CEO of Golioth, a commercial IoT development platform built for scale. Previously, Jonathan was a Product Manager at Particle, Google/Nest, Magneto, and Myspace where he spent his time building IoT solutions.

In this episode, Sam sits down with Jonathan to discuss the concept of digital twins, the future of IoT databases, and how to build a real holodeck.
-------------------
“I think about IoT when I started at Nest, we had some of the best engineers I've ever worked with. Starting from first principles, defining networking protocols, and introducing new specifications that became parts of the fabric of the internet. And fast forward 10 years later, a lot of that exists now as building blocks. Someone who's not a PhD with a lifetime achievement award from the IETF can go actually design systems that are highly productive, integrated, and enabling. And that's where I get excited. And the through line I think is enabling teams of developers to really create more with their own bare hands. And the technology around it, that is that enabler.” – Jonathan Beri
-------------------
Episode Timestamps:
(01:33): Jonathan’s motivation for starting Golioth
(08:59): The role of data in IoT
(11:01): What is a digital twin and why does it matter?
(17:12): The classes of problems Jonathan is trying to solve
(20:35): The future of IoT databases in the next five years
(31:04): What open source data means to Jonathan
(32:24): Jonathan explains how to build a real holodeck
(33:42): Jonathan’s advice for those excited about industrial data
-------------------
Links:
LinkedIn - Connect with Jonathan
Twitter - Follow Jonathan
Visit Jonathan’s Website
LinkedIn - Connect with Golioth
Twitter - Follow Golioth
Visit Golioth
Nov 9, 2022 • 46min

Healthcare Infrastructure, ALS Research and Reliable Data with Indu Navar

This episode features an interview with Indu Navar, CEO and Founder of EverythingALS, a patient-driven non-profit, bringing technological innovations and data science to support efforts from care to cure, for people with ALS. Indu’s impressive career includes being an original member of the WebMD engineering team, where she was instrumental in using emerging technologies to achieve application scalability and performance.

In this episode, Sam sits down with Indu to discuss healthcare infrastructure applications, her strategies for providing reliable patient data, and the future of ALS research.
-------------------
“We said, ‘Okay, we're going to make this a citizen-driven research.’ That means patients are going to come and enroll because it's their project and it's patient-driven. So, it's a patient-driven, open innovation. So, once you do open patient-driven, open innovation, now we are the custodians of the data. Patients own the data, so all the data is shared with the patient. That was not done before in any of the research. And so, we give all the data back to the patients. And of course, we give them metrics as well. What was the rate of their speed of their speech? And if they don't want to see it, it's fine, at least they have it. And that data, we are the custodians and as custodians we share the data. So, once we did this model, we got almost close to one thousand people enrolled, consented, within 16 months. As opposed to about 25 people in one year or 50 people in one to two years.” – Indu Navar
-------------------
Episode Timestamps:
(01:19): What’s changed for Indu in the last year
(05:46): What data infrastructure was like 25 years ago to solve for health outcomes
(13:00): Indu’s personal experience with healthcare data
(16:47): What Indu is looking forward to in ALS research
(20:43): How regulatory establishments have shifted in healthcare
(30:31): Where Indu wants to see EverythingALS go in the next year
(36:28): One question Indu wishes to be asked
(38:28): Indu’s advice for people inspired by EverythingALS
-------------------
Links:
LinkedIn - Connect with Indu
Twitter - Follow Indu
Twitter - Follow EverythingALS
Visit EverythingALS
Nov 2, 2022 • 3min

Shifting Left on Data with DeVaris Brown, Tomer Shiran, and Erica Brescia

This bonus episode features conversations from season 3 of the Open||Source||Data podcast. In this episode, you’ll hear from DeVaris Brown, CEO & Co-founder of Meroxa; Tomer Shiran, Founder & CPO of Dremio; and Erica Brescia, Managing Director at Redpoint Ventures.

Sam sat down with each guest to discuss how they’re making data more programmable by shifting left.

You can listen to the full episodes from DeVaris Brown, Tomer Shiran, and Erica Brescia by clicking the links below.
-------------------
Episode Timestamps:
(00:12): DeVaris Brown
(00:42): Tomer Shiran
(01:32): Erica Brescia
-------------------
Links:
Listen to DeVaris’ episode
Listen to Tomer’s episode
Listen to Erica’s episode
Oct 26, 2022 • 34min

Serial Entrepreneurship, Metadata Capture Systems, and Osquery with Tony Gauda

This episode features an interview with Tony Gauda, Head of Customer Engineering at Fleet Device Management, an open core company powered by Osquery. Tony is a serial entrepreneur and inventor with a profound history in fraud, security, and SaaS business. He holds several issued patents and his companies have raised over $40 million in venture funding. Tony is also the founder of ThinAir, a Y Combinator-backed SaaS service that tackles the insider threat problem for enterprises and government agencies.

In this episode, Sam and Tony discuss calculating data usage at scale, the creativity of attackers, and how to evolve as threats increase.
-------------------
“The great thing about Osquery is that since it is a sensor-based system that is queryable, it literally gives you the ability to discover new indicators of compromise and then use those when doing security investigations. And Osquery allows you to create these extremely interesting queries that would find things that you would never be able to find with a traditionally static functionality agent. And, that to me, is extremely exciting. The fact that you have this agent that is extendable and it's configurable and it's deployable across multiple different platforms, at the end of the day, it feels like it's almost a superpower for visibility.” – Tony Gauda
-------------------
Episode Timestamps:
(01:17): What Tony is curious about these days
(04:39): What problems Tony is trying to solve
(05:47): How Tony got into the tech world
(11:09): Tony’s inspiration behind ThinAir
(15:25): What open source data means to Tony
(17:06): What led Tony to being an early adopter of Osquery
(20:31): What’s ahead for building next level applications with open and secure data
(25:37): One question Tony’s always wanted to be asked
(29:24): Tony’s advice for inventors
-------------------
Links:
LinkedIn - Connect with Tony
Twitter - Follow Tony
Twitter - Follow Fleetdm
Fleetdm
Fleetdm GitHub Platform
Oct 12, 2022 • 35min

Code Intelligence, GraphQL, and Closing the Remediation Gap with Beyang Liu

This episode features an interview with Beyang Liu, CTO and Co-founder of Sourcegraph, a code intelligence platform. Prior to Sourcegraph, Beyang was a software engineer at Palantir Technologies, where he developed new data analysis software on a customer-facing team working with Fortune 500 companies. Beyang studied Computer Science at Stanford, where he published research in probabilistic graphical models and computer vision at the Stanford AI Lab.

In this episode, Sam sits down with Beyang to discuss the power of intelligence and visualization, GraphQL versus REST API, and how Sourcegraph is drawing inspiration from Google.
-------------------
“When I think about the future of Sourcegraph, it's really the future of this global human knowledge base that we're constructing. Similar to the worldwide web, the internet, where that was an amazing thing that came along. We're starting to see something like that emerge in the world of code. The open source ecosystem is this amazing, decentralized, distributed store of human knowledge that encapsulates all these algorithms and data structures and systems that are then pulled into all these systems that we rely on in our lives. And, so far, no one has really tried to map that web of knowledge in the same way that Google has mapped the internet and we want to do that. [...] You just open up a web browser, open up Google, type a query and you're good to go. We want to make exploring code as easy as that experience.” – Beyang Liu
-------------------
Episode Timestamps:
(01:21): What open source data means to Beyang
(02:59): Beyang’s inspiration to create Sourcegraph
(09:13): What Beyang sees in the future of power of intelligence and visualization
(14:37): How Sourcegraph works
(24:11): GraphQL versus REST API
(27:10): What Sourcegraph’s open source community looks like
(30:29): Beyang’s advice for people wanting to build new companies
-------------------
Links:
LinkedIn - Connect with Beyang
Twitter - Follow Beyang
Twitter - Follow Sourcegraph
Sourcegraph
Sourcegraph Discord Channel
Sep 28, 2022 • 43min

Stream Processing, Observability, and the User Experience with Eric Sammer

This episode features an interview with Eric Sammer, CEO of Decodable. Eric has been in the tech industry for over 20 years, holding various roles as an early Cloudera employee. He also was the co-founder and CTO of Rocana, which was acquired by Splunk in 2017. During his time at Splunk, Eric served as the VP and Senior Distinguished Engineer responsible for cloud platform services.

In this episode, Sam and Eric discuss the gap between operating infrastructure and the analytical world, stream processing innovations, and why it’s important to work with people who are smarter than you.
-------------------
"The thing about Decodable was just like let's connect systems, let's process the data between them. Apache Flink is the right engine and SQL is the language for programming the engine. It doesn't need to be any more complicated. The trick is getting it right, so that people can think about that part of the data infrastructure, the way they think about the network. They don't question whether the packet makes it to the other side because that infrastructure is so burned in and it scales reasonably well these days. You don't even think about it, especially in the cloud." – Eric Sammer
-------------------
Episode Timestamps:
(01:09): What open source data means to Eric
(06:57): What led Eric to Cloudera and Hadoop
(12:48): What inspired Eric to create Rocana
(20:29): The problem Eric is trying to solve at Flink
(29:54): What problems in stream processing we’ll have to solve in the next 5 years
(36:58): Eric’s advice for advancing your career
-------------------
Links:
LinkedIn - Connect with Eric
Twitter - Follow Eric
Twitter - Follow Decodable
Decodable
Jul 20, 2022 • 16min

Season 3 Compressed Edition with Sam and Audra

Join Open||Source||Data executive producer Audra Montenegro as she and Sam discuss his learnings and takeaways from this season and what the future of open source data looks like.
-------------------
“There's such an open conversation about, ‘Yeah, open source,’ we usually think about open source software. How can we cross apply more of what we think about in software in general into data, and then what is it that's totally new about this domain? So, the answers cluster into three groups. It's either about the source of the data itself is open, meaning this is government data or data that's been made public and it's openly accessible. Or it could be that open source data is how the data is actually produced. Is it using open source tooling? Is it on an open source architecture? And finally, how do you trust that open source data? If it's just a whole bunch of data but it hasn't been labeled, if it hasn't been managed and produced, turned into a product. How do you understand its heritage? How do you understand the lineage of the data so that you can produce trustworthy models and trustworthy results based on it? So it's a big open field, but those are the general responses that people have when we explore that topic.” – Sam Ramji
-------------------
Episode Timestamps:
(01:29): What open source data means to our guests
(02:57): Sam discusses the themes of season 3
(10:38): What Sam is looking forward to in the future of open source data
-------------------
Links:
LinkedIn - Connect with Sam
LinkedIn - Connect with Audra
Twitter - Follow Sam
Twitter - Follow Audra
