Open||Source||Data

Charna Parkey
Dec 14, 2022 • 44min

Enabling Edge Workers, AI & ML, and The Future of Data Science with Matthew Rocklin

This episode features an interview with Matthew Rocklin, CEO of Coiled, the scalable Dask-based cloud platform. Prior to founding Coiled, Matthew worked on Dask at Anaconda and then NVIDIA where his teams focused on accelerating Dask through parallel computing and GPUs. Matthew is an industry speaker, author, and founding member of Pangeo, whose mission is to develop open source analysis tools for ocean, atmosphere, and climate science.

In this episode, Sam sits down with Matthew to discuss enabling edge workers, the future of data science, and the revolution of AI and ML.
-------------------
“There's all sorts of fun people using these tools and that's the most fun part of this job. You get to learn so much about so many different applications that are all so different and all so fascinating. You were thinking about all these different tools and technologies and I was talking to someone once, it's like, ‘Oh, it's like you're standing on the shoulders of giants.’ That's not quite right. There's lots of sort of normal size people all standing on each other's shoulders in like a massive pyramid. [...] Dask was designed to scale up an existing ecosystem. There's a legacy Python ecosystem that’ll provide a layer of parallel computing on top of it. You can do that either by rewriting the whole thing, which is not feasible, or you can do it by talking to lots of people and getting them to integrate in interesting, fun ways. That's actually been the fun parts of Dask. I think I've probably talked to every major maintainer group ever. I have worked with them to find out the ways to get everything to work smoothly together. And that's super fun. There's an interesting sort of technical and social hacking that occurs, which I think Python has done pretty well at, historically. Which is why it has success.” – Matthew Rocklin
-------------------
Episode Timestamps:
(00:58): What open source data means to Matthew
(03:29): Matthew’s motivations behind Python
(18:58): How Matthew is enabling edge workers
(34:46): What the future of data Python space looks like
(39:29): Matthew’s advice for the technical data audience
(41:36): Executive producer, Audra Montenegro's backstage takeaways
-------------------
Links:
LinkedIn - Connect with Matthew
Twitter - Follow Matthew
Visit Matthew’s Website
Visit Dask
Dask Examples
Visit Coiled
SciPy Mission
Dec 7, 2022 • 35min

OSPOs, Measuring Community Success, and Self Knowledge with Nithya Ruff

This episode features an interview with Nithya Ruff, Head of Open Source Program Office at Amazon. At Amazon, she drives open source culture and coordination and engagement with external communities. Prior to Amazon, Nithya spearheaded and grew Open Source Program Offices (OSPOs) for Comcast and Western Digital. She has also served as the Director-At-Large on the Linux Foundation Board since 2016, where she works to advance the mission of building sustainable ecosystems that are built on open collaboration.

In this episode, Sam and Nithya discuss OSPOs, how to measure success, and the evolution of the data ecosystem.
-------------------
“I think if we look at what matters to customers, which is innovation, trust, and being a force for change with open source, then we can really deliver on the metrics that the company cares about.” – Nithya Ruff
-------------------
Episode Timestamps:
(04:02): What open source data means to Nithya
(06:29): What interested Nithya about open source software
(12:34): What Nithya learned at Western Digital and Comcast that she uses now at Amazon
(18:23): What Nithya teaches people in OSPO curriculum
(22:06): How the open source data ecosystem has evolved in the last decade
(27:44): One question Nithya wishes to be asked
(30:37): Nithya’s advice for folks who want to create an OSPO
-------------------
Links:
LinkedIn - Connect with Nithya
Twitter - Follow Nithya
Open Source Law, Policy and Practice
LinkedIn - Connect with Amazon
Twitter - Follow Amazon
Visit Amazon
Nov 23, 2022 • 37min

IoT Databases, Digital Twins, and Real Holodecks with Jonathan Beri

This episode features an interview with Jonathan Beri, Founder & CEO of Golioth, a commercial IoT development platform built for scale. Previously, Jonathan was a Product Manager at Particle, Google/Nest, Magneto, and Myspace where he spent his time building IoT solutions.

In this episode, Sam sits down with Jonathan to discuss the concept of digital twins, the future of IoT databases, and how to build a real holodeck.
-------------------
“I think about IoT when I started at Nest, we had some of the best engineers I've ever worked with. Starting from first principles, defining networking protocols, and introducing new specifications that became parts of the fabric of the internet. And fast forward 10 years later, a lot of that exists now as building blocks. Someone who's not a PhD with a lifetime achievement award from the IETF can go actually design systems that are highly productive, integrated, and enabling. And that's where I get excited. And the through line I think is enabling teams of developers to really create more with their own bare hands. And the technology around it, that is that enabler.” – Jonathan Beri
-------------------
Episode Timestamps:
(01:33): Jonathan’s motivation for starting Golioth
(08:59): The role of data in IoT
(11:01): What is a digital twin and why does it matter?
(17:12): The classes of problems Jonathan is trying to solve
(20:35): The future of IoT databases in the next five years
(31:04): What open source data means to Jonathan
(32:24): Jonathan explains how to build a real holodeck
(33:42): Jonathan’s advice for those excited about industrial data
-------------------
Links:
LinkedIn - Connect with Jonathan
Twitter - Follow Jonathan
Visit Jonathan’s Website
LinkedIn - Connect with Golioth
Twitter - Follow Golioth
Visit Golioth
Nov 9, 2022 • 46min

Healthcare Infrastructure, ALS Research and Reliable Data with Indu Navar

This episode features an interview with Indu Navar, CEO and Founder of EverythingALS, a patient-driven non-profit, bringing technological innovations and data science to support efforts from care to cure, for people with ALS. Indu’s impressive career includes being an original member of the WebMD engineering team, where she was instrumental in using emerging technologies to achieve application scalability and performance.

In this episode, Sam sits down with Indu to discuss healthcare infrastructure applications, her strategies for providing reliable patient data, and the future of ALS research.
-------------------
“We said, ‘Okay, we're going to make this a citizen-driven research.’ That means patients are going to come and enroll because it's their project and it's patient-driven. So, it's a patient-driven, open innovation. So, once you do open patient-driven, open innovation, now we are the custodians of the data. Patients own the data, so all the data is shared with the patient. That was not done before in any of the research. And so, we give all the data back to the patients. And of course, we give them metrics as well. What was the rate of their speed of their speech? And if they don't want to see it, it's fine, at least they have it. And that data, we are the custodians and as custodians we share the data. So, once we did this model, we got almost close to one thousand people enrolled, consented, within 16 months. As opposed to about 25 people in one year or 50 people in one to two years.” – Indu Navar
-------------------
Episode Timestamps:
(01:19): What’s changed for Indu in the last year
(05:46): What data infrastructure was like 25 years ago to solve for health outcomes
(13:00): Indu’s personal experience with healthcare data
(16:47): What Indu is looking forward to in ALS research
(20:43): How regulatory establishments have shifted in healthcare
(30:31): Where Indu wants to see EverythingALS go in the next year
(36:28): One question Indu wishes to be asked
(38:28): Indu’s advice for people inspired by EverythingALS
-------------------
Links:
LinkedIn - Connect with Indu
Twitter - Follow Indu
Twitter - Follow EverythingALS
Visit EverythingALS
Nov 2, 2022 • 3min

Shifting Left on Data with DeVaris Brown, Tomer Shiran, and Erica Brescia

This bonus episode features conversations from season 3 of the Open||Source||Data podcast. In this episode, you’ll hear from DeVaris Brown, CEO & Co-founder of Meroxa; Tomer Shiran, Founder & CPO of Dremio; and Erica Brescia, Managing Director at Redpoint Ventures.

Sam sat down with each guest to discuss how they’re making data more programmable by shifting left.

You can listen to the full episodes from DeVaris Brown, Tomer Shiran, and Erica Brescia by clicking the links below.
-------------------
Episode Timestamps:
(00:12): DeVaris Brown
(00:42): Tomer Shiran
(01:32): Erica Brescia
-------------------
Links:
Listen to DeVaris’ episode
Listen to Tomer’s episode
Listen to Erica’s episode
Oct 26, 2022 • 34min

Serial Entrepreneurship, Metadata Capture Systems, and Osquery with Tony Gauda

This episode features an interview with Tony Gauda, Head of Customer Engineering at Fleet Device Management, an open core company powered by Osquery. Tony is a serial entrepreneur and inventor with a profound history in fraud, security, and SaaS business. He holds several issued patents and his companies have raised over $40 million in venture funding. Tony is also the founder of ThinAir, a Y-Combinator backed SaaS service that tackles the insider threat problem for enterprises and government agencies.

In this episode, Sam and Tony discuss calculating data usage at scale, the creativity of attackers, and how to evolve as threats increase.
-------------------
“The great thing about Osquery is that since it is a sensor-based system that is queryable, it literally gives you the ability to discover new indicators of compromise and then use those when doing security investigations. And Osquery allows you to create these extremely interesting queries that would find things that you would never be able to find with a traditionally static functionality agent. And, that to me, is extremely exciting. The fact that you have this agent that is extendable and it's configurable and it's deployable across multiple different platforms, at the end of the day, it feels like it's almost a superpower for visibility.” – Tony Gauda
-------------------
Episode Timestamps:
(01:17): What Tony is curious about these days
(04:39): What problems Tony is trying to solve
(05:47): How Tony got into the tech world
(11:09): Tony’s inspiration behind ThinAir
(15:25): What open source data means to Tony
(17:06): What led Tony to being an early adopter of Osquery
(20:31): What’s ahead for building next level applications with open and secure data
(25:37): One question Tony’s always wanted to be asked
(29:24): Tony’s advice for inventors
-------------------
Links:
LinkedIn - Connect with Tony
Twitter - Follow Tony
Twitter - Follow Fleetdm
Fleetdm
Fleetdm GitHub Platform
Oct 12, 2022 • 35min

Code Intelligence, GraphQL, and Closing the Remediation Gap with Beyang Liu

This episode features an interview with Beyang Liu, CTO and Co-founder of Sourcegraph, a code intelligence platform. Prior to Sourcegraph, Beyang was a software engineer at Palantir Technologies, where he developed new data analysis software on a customer-facing team working with Fortune 500 companies. Beyang studied Computer Science at Stanford, where he published research in probabilistic graphical models and computer vision at the Stanford AI Lab.

In this episode, Sam sits down with Beyang to discuss the power of intelligence and visualization, GraphQL versus REST API, and how Sourcegraph is drawing inspiration from Google.
-------------------
“When I think about the future of Sourcegraph, it's really the future of this global human knowledge base that we're constructing. Similar to the worldwide web, the internet, where that was an amazing thing that came along. We're starting to see something like that emerge in the world of code. The open source ecosystem is this amazing, decentralized, distributed store of human knowledge that encapsulates all these algorithms and data structures and systems that are then pulled into all these systems that we rely on in our lives. And, so far, no one has really tried to map that web of knowledge in the same way that Google has mapped the internet and we want to do that. [...] You just open up a web browser, open up Google, type a query and you're good to go. We want to make exploring code as easy as that experience.” – Beyang Liu
-------------------
Episode Timestamps:
(01:21): What open source data means to Beyang
(02:59): Beyang’s inspiration to create Sourcegraph
(09:13): What Beyang sees in the future of power of intelligence and visualization
(14:37): How Sourcegraph works
(24:11): GraphQL versus REST API
(27:10): What Sourcegraph’s open source community looks like
(30:29): Beyang’s advice for people wanting to build new companies
-------------------
Links:
LinkedIn - Connect with Beyang
Twitter - Follow Beyang
Twitter - Follow Sourcegraph
Sourcegraph
Sourcegraph Discord Channel
Sep 28, 2022 • 43min

Stream Processing, Observability, and the User Experience with Eric Sammer

This episode features an interview with Eric Sammer, CEO of Decodable. Eric has been in the tech industry for over 20 years, holding various roles as an early Cloudera employee. He also was the co-founder and CTO of Rocana, which was acquired by Splunk in 2017. During his time at Splunk, Eric served as the VP and Senior Distinguished Engineer responsible for cloud platform services.

In this episode, Sam and Eric discuss the gap between operating infrastructure and the analytical world, stream processing innovations, and why it’s important to work with people who are smarter than you.
-------------------
"The thing about Decodable was just like let's connect systems, let's process the data between them. Apache Flink is the right engine and SQL is the language for programming the engine. It doesn't need to be any more complicated. The trick is getting it right, so that people can think about that part of the data infrastructure, the way they think about the network. They don't question whether the packet makes it to the other side because that infrastructure is so burned in and it scales reasonably well these days. You don't even think about it, especially in the cloud." – Eric Sammer
-------------------
Episode Timestamps:
(01:09): What open source data means to Eric
(06:57): What led Eric to Cloudera and Hadoop
(12:48): What inspired Eric to create Rocana
(20:29): The problem Eric is trying to solve at Flink
(29:54): What problems in stream processing we’ll have to solve in the next 5 years
(36:58): Eric’s advice for advancing your career
-------------------
Links:
LinkedIn - Connect with Eric
Twitter - Follow Eric
Twitter - Follow Decodable
Decodable
Jul 20, 2022 • 16min

Season 3 Compressed Edition with Sam and Audra

Join Open||Source||Data executive producer Audra Montenegro as she and Sam discuss his learnings and takeaways from this season and what the future of open source data looks like.
-------------------
“There's such an open conversation about, ‘Yeah, open source,’ we usually think about open source software. How can we cross apply more of what we think about in software in general into data, and then what is it that's totally new about this domain? So, the answers cluster into three groups. It's either about the source of the data itself is open, meaning this is government data or data that's been made public and it's openly accessible. Or it could be that open source data is how the data is actually produced. Is it using open source tooling? Is it on an open source architecture? And finally, how do you trust that open source data? If it's just a whole bunch of data but it hasn't been labeled, if it hasn't been managed and produced, turned into a product. How do you understand its heritage? How do you understand the lineage of the data so that you can produce trustworthy models and trustworthy results based on it? So it's a big open field, but those are the general responses that people have when we explore that topic.” – Sam Ramji
-------------------
Episode Timestamps:
(01:29): What open source data means to our guests
(02:57): Sam discusses the themes of season 3
(10:38): What Sam is looking forward to in the future of open source data
-------------------
Links:
LinkedIn - Connect with Sam
LinkedIn - Connect with Audra
Twitter - Follow Sam
Twitter - Follow Audra
Jul 6, 2022 • 39min

Accelerating Computation, Machine Learning, and Data Mesh with Sophie Watson

This episode features an interview with Sophie Watson, Technical Product Marketing Manager at NVIDIA. Previously, Sophie served as a software engineer and principal data scientist at RedHat where she used machine learning to solve business problems in the hybrid cloud. Sophie has a PhD in Bayesian statistics and frequently speaks about machine learning workflows on Kubernetes, recommendation engines, and machine learning for search.

In this episode, Sam and Sophie discuss Principal Component Analysis, computational acceleration, and MLOps.
-------------------
“We all start when we get hold of a data set by visualizing it to try to understand it. So that usually for me involves starting with a simple technique, something like PCA, Principal Component Analysis. It's been around since the eighties, probably longer, maybe the sixties. Don't quote me on that. With Principal Component Analysis, we can map our high dimensional data down to a smaller number of dimensions. Let's map it down to two so that we can visualize it. So we can go ahead and visualize it. But Principal Component Analysis is quite a simple technique in what it's doing and it's just mapping onto key components of our data. We might not be able to see, perhaps, separation of classes if we're working with data that's from a set of classes. Maybe we're looking at transactions, are they fraudulent or are they legitimate? And we might not be able to see that distinction. So that makes us think, ‘Is there something interesting in my data? Am I going to be able to train a machine learning model?’ I don't know. Back in the day, I think the next step would've been, ‘Oh, let's train a model in C’, but now with accelerated compute within a really reasonable amount of time, we can go ahead and use a more sophisticated technique so we can use something like UMAP that's leaning on differential manifolds to do that projection to lower dimensions. And because this technique is slightly more sophisticated, what we find in general is that within the same amount of time, we're able to get more insight into the data. We're able to see the distinction in classes between our data sets. It keeps you in that loop. It keeps you in that productivity state.” – Sophie Watson
-------------------
Episode Timestamps:
(01:22): What open source data means to Sophie
(02:47): How Sophie is spending her time
(07:52): What excites Sophie about the data science community
(10:13): What Sophie is most excited about in data visibility
(16:29): Data on servers versus data in the cloud
(18:09): Accelerated computation on machine learning
(22:27): Sophie breaks down probabilistic programming
(24:21): What problem was Sophie trying to solve in her career
(32:12): Sophie’s dream job of working for Taylor Swift
(34:48): Sophie’s advice for those interested in open source
-------------------
Links:
LinkedIn - Connect with Sophie
Twitter - Follow Sophie
Twitter - Follow NVIDIA
NVIDIA
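-------------------
Illustration: In the quote above, Sophie describes Principal Component Analysis as mapping high dimensional data down to two dimensions so it can be visualized. The following is a minimal sketch of that idea, not anything shown in the episode. It uses plain Python with no third-party libraries, and everything in it is an assumption made for illustration: the function names (`center`, `covariance`, `top_component`, `pca_2d`, `make_demo_data`), the synthetic four-column data set, and the choice of power iteration with deflation to find the two leading components. In practice a library routine would normally be used instead.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_component(cov, iters=500, seed=0):
    """Power iteration: unit eigenvector and eigenvalue for the largest eigenvalue."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: nothing left to find
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue


def pca_2d(rows):
    """Project rows onto their two leading principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for k in range(2):
        v, lam = top_component(cov, seed=k)
        components.append(v)
        # Deflation: remove the component just found so the next pass
        # converges to the next-largest direction of variance.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(a * b for a, b in zip(r, c)) for c in components]
                 for r in centered]
    return projected, components


def make_demo_data(n=200, seed=42):
    """Hypothetical 4-column data: two correlated wide columns, two narrow ones."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.gauss(0, 5)
        b = rng.gauss(0, 2)
        rows.append([a, 0.5 * a + b, rng.gauss(0, 0.5), rng.gauss(0, 0.1)])
    return rows


data = make_demo_data()
projected, components = pca_2d(data)
print(projected[:3])  # first three points, now two coordinates each
```

Each row of `projected` is a pair of coordinates that could be handed to any plotting tool. As the quote notes, this linear projection may not separate classes such as fraudulent and legitimate transactions, which is the motivation Sophie gives for moving on to a technique like UMAP.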
