
Open||Source||Data
What can we learn about AI-native development through stimulating conversations with the developers, regulators, academics, and people like you who drive development forward, seek to understand its impact, and work to mitigate risk in this new world?
Join Charna Parkey and the community shaping the future of open source data, open source software, data in AI, and much more.
Latest episodes

Nov 9, 2022 • 46min
Healthcare Infrastructure, ALS Research and Reliable Data with Indu Navar
This episode features an interview with Indu Navar, CEO and Founder of EverythingALS, a patient-driven non-profit bringing technological innovations and data science to support efforts from care to cure for people with ALS. Indu’s impressive career includes being an original member of the WebMD engineering team, where she was instrumental in using emerging technologies to achieve application scalability and performance.

In this episode, Sam sits down with Indu to discuss healthcare infrastructure applications, her strategies for providing reliable patient data, and the future of ALS research.

-------------------

“We said, ‘Okay, we're going to make this a citizen-driven research.’ That means patients are going to come and enroll because it's their project and it's patient-driven. So, it's a patient-driven, open innovation. So, once you do open patient-driven, open innovation, now we are the custodians of the data. Patients own the data, so all the data is shared with the patient. That was not done before in any of the research. And so, we give all the data back to the patients. And of course, we give them metrics as well. What was the rate of the speed of their speech? And if they don't want to see it, it's fine, at least they have it. And that data, we are the custodians and as custodians we share the data. So, once we did this model, we got almost close to one thousand people enrolled, consented, within 16 months. As opposed to about 25 people in one year or 50 people in one to two years.” – Indu Navar

-------------------

Episode Timestamps:
(01:19): What’s changed for Indu in the last year
(05:46): What data infrastructure was like 25 years ago to solve for health outcomes
(13:00): Indu’s personal experience with healthcare data
(16:47): What Indu is looking forward to in ALS research
(20:43): How regulatory establishments have shifted in healthcare
(30:31): Where Indu wants to see EverythingALS go in the next year
(36:28): One question Indu wishes to be asked
(38:28): Indu’s advice for people inspired by EverythingALS

-------------------

Links:
LinkedIn - Connect with Indu
Twitter - Follow Indu
Twitter - Follow EverythingALS
Visit EverythingALS

Nov 2, 2022 • 3min
Shifting Left on Data with DeVaris Brown, Tomer Shiran, and Erica Brescia
This bonus episode features conversations from season 3 of the Open||Source||Data podcast. In this episode, you’ll hear from DeVaris Brown, CEO & Co-founder of Meroxa; Tomer Shiran, Founder & CPO of Dremio; and Erica Brescia, Managing Director at Redpoint Ventures.

Sam sat down with each guest to discuss how they’re making data more programmable by shifting left.

You can listen to the full episodes from DeVaris Brown, Tomer Shiran, and Erica Brescia by clicking the links below.

-------------------

Episode Timestamps:
(00:12): DeVaris Brown
(00:42): Tomer Shiran
(01:32): Erica Brescia

-------------------

Links:
Listen to DeVaris’ episode
Listen to Tomer’s episode
Listen to Erica’s episode

Oct 26, 2022 • 34min
Serial Entrepreneurship, Metadata Capture Systems, and Osquery with Tony Gauda
This episode features an interview with Tony Gauda, Head of Customer Engineering at Fleet Device Management, an open core company powered by Osquery. Tony is a serial entrepreneur and inventor with a profound history in fraud, security, and SaaS business. He holds several issued patents and his companies have raised over $40 million in venture funding. Tony is also the founder of ThinAir, a Y Combinator-backed SaaS service that tackles the insider threat problem for enterprises and government agencies.

In this episode, Sam and Tony discuss calculating data usage at scale, the creativity of attackers, and how to evolve as threats increase.

-------------------

“The great thing about Osquery is that since it is a sensor-based system that is queryable, it literally gives you the ability to discover new indicators of compromise and then use those when doing security investigations. And Osquery allows you to create these extremely interesting queries that would find things that you would never be able to find with a traditionally static functionality agent. And, that to me, is extremely exciting. The fact that you have this agent that is extendable and it's configurable and it's deployable across multiple different platforms, at the end of the day, it feels like it's almost a superpower for visibility.” – Tony Gauda

-------------------

Episode Timestamps:
(01:17): What Tony is curious about these days
(04:39): What problems Tony is trying to solve
(05:47): How Tony got into the tech world
(11:09): Tony’s inspiration behind ThinAir
(15:25): What open source data means to Tony
(17:06): What led Tony to being an early adopter of Osquery
(20:31): What’s ahead for building next level applications with open and secure data
(25:37): One question Tony’s always wanted to be asked
(29:24): Tony’s advice for inventors

-------------------

Links:
LinkedIn - Connect with Tony
Twitter - Follow Tony
Twitter - Follow Fleetdm
Fleetdm
Fleetdm GitHub Platform

Oct 12, 2022 • 35min
Code Intelligence, GraphQL, and Closing the Remediation Gap with Beyang Liu
This episode features an interview with Beyang Liu, CTO and Co-founder of Sourcegraph, a code intelligence platform. Prior to Sourcegraph, Beyang was a software engineer at Palantir Technologies, where he developed new data analysis software on a customer-facing team working with Fortune 500 companies. Beyang studied Computer Science at Stanford, where he published research in probabilistic graphical models and computer vision at the Stanford AI Lab.

In this episode, Sam sits down with Beyang to discuss the power of intelligence and visualization, GraphQL versus REST APIs, and how Sourcegraph is drawing inspiration from Google.

-------------------

“When I think about the future of Sourcegraph, it's really the future of this global human knowledge base that we're constructing. Similar to the worldwide web, the internet, where that was an amazing thing that came along. We're starting to see something like that emerge in the world of code. The open source ecosystem is this amazing, decentralized, distributed store of human knowledge that encapsulates all these algorithms and data structures and systems that are then pulled into all these systems that we rely on in our lives. And, so far, no one has really tried to map that web of knowledge in the same way that Google has mapped the internet and we want to do that. [...] You just open up a web browser, open up Google, type a query and you're good to go. We want to make exploring code as easy as that experience.” – Beyang Liu

-------------------

Episode Timestamps:
(01:21): What open source data means to Beyang
(02:59): Beyang’s inspiration to create Sourcegraph
(09:13): What Beyang sees in the future of the power of intelligence and visualization
(14:37): How Sourcegraph works
(24:11): GraphQL versus REST APIs
(27:10): What Sourcegraph’s open source community looks like
(30:29): Beyang’s advice for people wanting to build new companies

-------------------

Links:
LinkedIn - Connect with Beyang
Twitter - Follow Beyang
Twitter - Follow Sourcegraph
Sourcegraph
Sourcegraph Discord Channel

Sep 28, 2022 • 43min
Stream Processing, Observability, and the User Experience with Eric Sammer
This episode features an interview with Eric Sammer, CEO of Decodable. Eric has been in the tech industry for over 20 years, holding various roles, including as an early Cloudera employee. He also was the co-founder and CTO of Rocana, which was acquired by Splunk in 2017. During his time at Splunk, Eric served as the VP and Senior Distinguished Engineer responsible for cloud platform services.

In this episode, Sam and Eric discuss the gap between operating infrastructure and the analytical world, stream processing innovations, and why it’s important to work with people who are smarter than you.

-------------------

"The thing about Decodable was just like, let's connect systems, let's process the data between them. Apache Flink is the right engine and SQL is the language for programming the engine. It doesn't need to be any more complicated. The trick is getting it right, so that people can think about that part of the data infrastructure the way they think about the network. They don't question whether the packet makes it to the other side because that infrastructure is so burned in and it scales reasonably well these days. You don't even think about it, especially in the cloud." – Eric Sammer

-------------------

Episode Timestamps:
(01:09): What open source data means to Eric
(06:57): What led Eric to Cloudera and Hadoop
(12:48): What inspired Eric to create Rocana
(20:29): The problem Eric is trying to solve with Flink
(29:54): What problems in stream processing we’ll have to solve in the next 5 years
(36:58): Eric’s advice for advancing your career

-------------------

Links:
LinkedIn - Connect with Eric
Twitter - Follow Eric
Twitter - Follow Decodable
Decodable

Jul 20, 2022 • 16min
Season 3 Compressed Edition with Sam and Audra
Join Open||Source||Data executive producer Audra Montenegro as she and Sam discuss his learnings and takeaways from this season and what the future of open source data looks like.

-------------------

“There's such an open conversation about, ‘Yeah, open source.’ We usually think about open source software. How can we cross-apply more of what we think about in software in general into data, and then what is it that's totally new about this domain? So, the answers cluster into three groups. It's either about the source of the data itself being open, meaning this is government data or data that's been made public and it's openly accessible. Or it could be that open source data is how the data is actually produced. Is it using open source tooling? Is it on an open source architecture? And finally, how do you trust that open source data? If it's just a whole bunch of data but it hasn't been labeled, if it hasn't been managed and produced, turned into a product, how do you understand its heritage? How do you understand the lineage of the data so that you can produce trustworthy models and trustworthy results based on it? So it's a big open field, but those are the general responses that people have when we explore that topic.” – Sam Ramji

-------------------

Episode Timestamps:
(01:29): What open source data means to our guests
(02:57): Sam discusses the themes of season 3
(10:38): What Sam is looking forward to in the future of open source data

-------------------

Links:
LinkedIn - Connect with Sam
LinkedIn - Connect with Audra
Twitter - Follow Sam
Twitter - Follow Audra

Jul 6, 2022 • 39min
Accelerating Computation, Machine Learning, and Data Mesh with Sophie Watson
This episode features an interview with Sophie Watson, Technical Product Marketing Manager at NVIDIA. Previously, Sophie served as a software engineer and principal data scientist at Red Hat, where she used machine learning to solve business problems in the hybrid cloud. Sophie has a PhD in Bayesian statistics and frequently speaks about machine learning workflows on Kubernetes, recommendation engines, and machine learning for search.

In this episode, Sam and Sophie discuss Principal Component Analysis, computational acceleration, and MLOps.

-------------------

“We all start when we get hold of a data set by visualizing it to try to understand it. So that usually for me involves starting with a simple technique, something like PCA, Principal Component Analysis. It's been around since the eighties, probably longer, maybe the sixties. Don't quote me on that. With Principal Component Analysis, we can map our high-dimensional data down to a smaller number of dimensions. Let's map it down to two so that we can visualize it. So we can go ahead and visualize it. But Principal Component Analysis is quite a simple technique in what it's doing, and it's just mapping onto key components of our data. We might not be able to see, perhaps, separation of classes if we're working with data that's from a set of classes. Maybe we're looking at transactions: are they fraudulent or are they legitimate? And we might not be able to see that distinction. So that makes us think, ‘Is there something interesting in my data? Am I going to be able to train a machine learning model?’ I don't know. Back in the day, I think the next step would've been, ‘Oh, let's train a model in C,’ but now with accelerated compute, within a really reasonable amount of time, we can go ahead and use a more sophisticated technique, so we can use something like UMAP that's leaning on differential manifolds to do that projection to lower dimensions. And because this technique is slightly more sophisticated, what we find in general is that within the same amount of time, we're able to get more insight into the data. We're able to see the distinction in classes between our data sets. It keeps you in that loop. It keeps you in that productivity state.” – Sophie Watson

-------------------

Episode Timestamps:
(01:22): What open source data means to Sophie
(02:47): How Sophie is spending her time
(07:52): What excites Sophie about the data science community
(10:13): What Sophie is most excited about in data visibility
(16:29): Data on servers versus data in the cloud
(18:09): Accelerated computation on machine learning
(22:27): Sophie breaks down probabilistic programming
(24:21): What problem Sophie was trying to solve in her career
(32:12): Sophie’s dream job of working for Taylor Swift
(34:48): Sophie’s advice for those interested in open source

-------------------

Links:
LinkedIn - Connect with Sophie
Twitter - Follow Sophie
Twitter - Follow NVIDIA
NVIDIA
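Sophie’s walkthrough above describes the core move: center your high-dimensional data, project it onto its top two principal components, and eyeball the result for class separation. As a rough illustration of that idea (not code from the episode — the synthetic two-class data set here is purely hypothetical), a minimal PCA projection can be sketched with NumPy:

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # SVD of the centered data: rows of Vt are the principal axes,
    # ordered by the variance they explain
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T     # coordinates in the top-k subspace

# Hypothetical data: two classes of 50-dimensional points, where class b
# is offset from class a in only 5 of the 50 dimensions
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(100, 50))
b = rng.normal(0.0, 1.0, size=(100, 50))
b[:, :5] += 4.0

X2 = pca_project(np.vstack([a, b]), k=2)
print(X2.shape)  # (200, 2) -- two columns, ready to scatter-plot
```

If the classes separate along a high-variance direction, as here, the 2-D scatter shows two clear clusters; when they don’t, that is exactly the point where Sophie reaches for a more sophisticated (and, with accelerated compute, still fast) technique like UMAP.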

Jun 29, 2022 • 6min
Democratization and Cognition with Margot Gerritsen, Rachel Chalmers, and Patricia Boswell
This bonus episode features conversations from season 1 of the Open||Source||Data podcast. In this episode, you’ll hear from Margot Gerritsen, Stanford Professor and Co-Founder/Director of WiDS; Rachel Chalmers, Partner at Alchemist Accelerator; and Patricia Boswell, Staff Technical Writer at Google.

Sam sat down with each guest to discuss cognition and democratization in data. You can listen to the full episodes from Margot Gerritsen, Rachel Chalmers, and Patricia Boswell by clicking the links below.

-------------------

Episode Timestamps:
(00:18): Margot Gerritsen
(02:07): Rachel Chalmers
(03:46): Patricia Boswell

-------------------

Links:
Listen to Margot’s episode
Listen to Rachel’s episode
Listen to Patricia's episode

Jun 22, 2022 • 36min
Vector Search, the AI Stack and more with Bob van Luijt
This episode features an interview with Bob van Luijt, CEO and Co-Founder of SeMI Technologies and co-creator of Weaviate, an open source vector search engine. At just 15 years of age, Bob started his own software company in the Netherlands. He went on to study music at ArtEZ University of the Arts and Berklee College of Music, and completed the Harvard Business School Program of Management Excellence. Bob is also a TEDx speaker, discussing the relationship between software and language.

In this episode, Sam sits down with Bob to break down vector search, the AI-first ecosystem, and how music and software relate to one another.

-------------------

“I dare to argue that from the two big waves in database technology that we've seen, so first, in the seventies and eighties with SQL, and then the whole NoSQL wave that we have seen and the big winners that are in there, I dare to argue that we see a third wave coming up. And the third wave, I simply call it AI-first. And what I mean with that is that these models play an important role. So we do it from the perspective of the models first. And in that new segment, you see four niches. So the first niche that we see are what I like to call the embedding providers. The Hugging Faces of this world, the OpenAIs of this world, etc. Those who bring us the embeddings that we need to do the vectorization. Then secondly, we have so-called neural search frameworks. So we see frameworks like Haystack and Jina. Then third, we have the feature stores. So the feature stores take care of storing large chunks of features that we later can use to do vectorization, those kinds of things. And then we have the search engines. And Weaviate is an example of such a search engine that takes care of searching through data on a large scale that is vectorized. It might be a bold statement, but I really believe that we see this third wave of database technology happening.” – Bob van Luijt

-------------------

Episode Timestamps:
(01:45): How Bob defines open source data
(04:09): What is a vector database and why do we need them?
(07:55): How data is different before and after vectorization
(13:58): Orders of magnitude faster or personal
(16:09): How music and software relate to each other for Bob
(19:33): Bob’s inspiration behind Weaviate
(25:02): The AI-first ecosystem
(27:38): The distinction between vector search engines, feature stores, neural search frameworks, and embedding providers
(32:28): Bob’s advice for folks on the OSS startup journey

-------------------

Links:
LinkedIn - Connect with Bob
Twitter - Follow Bob
Twitter - Follow Weaviate
Weaviate
SeMI Technologies
Bob’s TEDx Talk
Bob's Forbes Article on the AI-First Database Ecosystem
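The fourth niche in Bob’s taxonomy, the vector search engine, reduces at its core to one operation: given a query embedding, find the stored documents whose embeddings point in the most similar direction. As a toy sketch of that operation (the 3-dimensional "embeddings" below are made-up illustrative numbers, not output from a real embedding provider, and real engines like Weaviate use approximate nearest-neighbor indexes rather than this brute-force scan):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def vector_search(query, index, top_k=2):
    """Return the top_k document names whose vectors best match the query."""
    scored = sorted(index.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [doc for doc, _ in scored[:top_k]]

# Hypothetical document embeddings -- in practice these would come from an
# embedding provider (the "Hugging Faces and OpenAIs of this world")
index = {
    "wine pairing guide": [0.9, 0.1, 0.0],
    "grape growing tips": [0.8, 0.3, 0.1],
    "motorcycle repair":  [0.0, 0.1, 0.9],
}

print(vector_search([0.85, 0.2, 0.05], index))
# → ['wine pairing guide', 'grape growing tips']
```

Because similarity is computed in the embedding space rather than on keywords, semantically related documents rank together even when they share no words with the query; that is the property the whole AI-first stack Bob describes is built around.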

Jun 8, 2022 • 40min
Open Source Innovation, The GPL for Data, and The Data In to Data Out Ratio with Larry Augustin
This episode features an interview with Larry Augustin, angel investor and advisor to early-stage technology companies. Larry previously served as the Vice President for Applications at AWS, where he was responsible for application services like Pinpoint, Chime, and WorkSpaces. Before joining AWS, Larry was the CEO of SugarCRM, an open source CRM vendor. He also was the founder and CEO of VA Linux, where he launched SourceForge. One of the group who coined the term “open source,” Larry has sat on the boards of several open source and Linux organizations.

In this episode, Sam and Larry discuss who owns the rights to data, the data in to data out ratio, and why Larry is an open source titan.

-------------------

"People are willing to give up so much of their personal information because they get an awful lot back. And privacy experts come along and say, ‘Well, you're taking all this personal information.’ But then most people look at that and say, ‘But I get a lot of value back out of that.’ And it's this data ratio value question, which is: for a little in, I get a lot back. That becomes a key element in this. And I think there has to be some kind of similar thought process around open source data in general, which is: if I contribute some data into this, I'm going to get a lot of value back. So this data in to data out ratio, I think, is an incredibly important one. It's a principle that I drive into application development. If you put a user in front of an app and they start using the app, you're going to ask them for things. And my principle is always, ‘How do you figure out how to never ask them and only give them?’ And you can't get 100% of the way there, but every time it's like, ‘Why did you ask them for that? Couldn't you figure it out?’ And it gets everyone in the mindset of, ‘How do I provide more and more and take less and less?’ It's a principle of application development that I like a lot. And I think there's a similar concept here around open source data. Are there models or structures that we can come up with where people can contribute small amounts of data and, as a result of that, get back a lot of value?” – Larry Augustin

-------------------

Episode Timestamps:
(02:14): How Larry is spending his time after AWS
(06:01): What drove Larry to open source
(18:04): What is the GPL for data?
(23:51): Areas of progress in open source data
(28:37): The data in to data out ratio
(36:02): Larry’s advice for folks in open source

-------------------

Links:
LinkedIn - Connect with Larry
Twitter - Follow Larry