Open||Source||Data

Charna Parkey

What can we learn from AI-native development through stimulating conversations with developers, regulators, academics, and people like you who drive development forward, seek to understand its impact, and work to mitigate risk in this new world?

Join Charna Parkey and the community shaping the future of open source data, open source software, data in AI, and much more.

Episodes

Jul 6, 2022 • 39min

Accelerating Computation, Machine Learning, and Data Mesh with Sophie Watson

This episode features an interview with Sophie Watson, Technical Product Marketing Manager at NVIDIA. Previously, Sophie served as a software engineer and principal data scientist at Red Hat, where she used machine learning to solve business problems in the hybrid cloud. Sophie has a PhD in Bayesian statistics and frequently speaks about machine learning workflows on Kubernetes, recommendation engines, and machine learning for search. In this episode, Sam and Sophie discuss Principal Component Analysis, computational acceleration, and MLOps.

“We all start when we get hold of a data set by visualizing it to try to understand it. So that usually for me involves starting with a simple technique, something like PCA, Principal Component Analysis. It's been around since the eighties, probably longer, maybe the sixties. Don't quote me on that. With Principal Component Analysis, we can map our high dimensional data down to a smaller number of dimensions. Let's map it down to two so that we can visualize it. So we can go ahead and visualize it. But Principal Component Analysis is quite a simple technique in what it's doing and it's just mapping onto key components of our data. We might not be able to see, perhaps, separation of classes if we're working with data that's from a set of classes. Maybe we're looking at transactions, are they fraudulent or are they legitimate? And we might not be able to see that distinction. So that makes us think, ‘Is there something interesting in my data? Am I going to be able to train a machine learning model?’ I don't know. Back in the day, I think the next step would've been, ‘Oh, let's train a model in C’, but now with accelerated compute within a really reasonable amount of time, we can go ahead and use a more sophisticated technique so we can use something like UMAP that's leaning on differential manifolds to do that projection to lower dimensions. And because this technique is slightly more sophisticated, what we find in general is that within the same amount of time, we're able to get more insight into the data. We're able to see the distinction in classes between our data sets. It keeps you in that loop. It keeps you in that productivity state.” – Sophie Watson

Episode Timestamps:
(01:22): What open source data means to Sophie
(02:47): How Sophie is spending her time
(07:52): What excites Sophie about the data science community
(10:13): What Sophie is most excited about in data visibility
(16:29): Data on servers versus data in the cloud
(18:09): Accelerated computation on machine learning
(22:27): Sophie breaks down probabilistic programming
(24:21): What problem was Sophie trying to solve in her career
(32:12): Sophie’s dream job of working for Taylor Swift
(34:48): Sophie’s advice for those interested in open source

Links:
LinkedIn - Connect with Sophie
Twitter - Follow Sophie
Twitter - Follow NVIDIA
NVIDIA
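
As a rough illustration of the PCA step Sophie describes in this episode's quote (mapping high-dimensional data down to two dimensions so it can be plotted), here is a minimal sketch in plain Python. The function names (`center`, `covariance`, `top_eigenvector`, `pca_2d`) and the toy rows are assumptions made for illustration and do not come from the episode. The sketch uses power iteration with deflation in place of a library eigensolver, which is adequate for small examples; real workloads would normally use a library implementation.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean so every feature has mean zero."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows (needs >= 2 rows)."""
    n, dims = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def top_eigenvector(matrix, iterations=500, seed=0):
    """Approximate the dominant eigenpair of a symmetric matrix by power iteration."""
    dims = len(matrix)
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # matrix has no remaining variance in any direction
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue estimate for the unit vector v.
    mv = [sum(matrix[i][j] * v[j] for j in range(dims)) for i in range(dims)]
    eigenvalue = sum(v[i] * mv[i] for i in range(dims))
    return eigenvalue, v


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    dims = len(cov)
    components = []
    for _ in range(2):
        eigenvalue, vec = top_eigenvector(cov)
        components.append(vec)
        # Deflation: remove the component just found so the next pass finds the next one.
        cov = [
            [cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dims)]
            for i in range(dims)
        ]
    return [[sum(r[j] * c[j] for j in range(dims)) for c in components] for r in centered]


# Hypothetical toy data: six rows with three strongly correlated features.
toy_rows = [
    [1.0, 2.0, 3.0],
    [2.0, 4.1, 6.2],
    [3.0, 5.9, 8.8],
    [4.0, 8.2, 12.1],
    [5.0, 9.8, 15.0],
    [6.0, 12.1, 17.9],
]

for point in pca_2d(toy_rows):
    print(point)
```

Each printed pair is one row's coordinates in the two-dimensional projection, ready to hand to a plotting tool. As the quote notes, a linear projection like this may not separate classes, which is when a nonlinear method such as UMAP becomes worth trying.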

Jun 29, 2022 • 6min

Democratization and Cognition with Margot Gerritsen, Rachel Chalmers, and Patricia Boswell

This bonus episode features conversations from season 1 of the Open||Source||Data podcast. In this episode, you’ll hear from Margot Gerritsen, Stanford Professor and Co-Founder/Director of WiDS; Rachel Chalmers, Partner at Alchemist Accelerator; and Patricia Boswell, Staff Technical Writer at Google.

Sam sat down with each guest to discuss cognition and democratization in data. You can listen to the full episodes from Margot Gerritsen, Rachel Chalmers, and Patricia Boswell by clicking the links below.

Episode Timestamps:
(00:18): Margot Gerritsen
(02:07): Rachel Chalmers
(03:46): Patricia Boswell

Links:
Listen to Margot’s episode
Listen to Rachel’s episode
Listen to Patricia's episode

Jun 22, 2022 • 36min

Vector Search, the AI Stack and more with Bob van Luijt

This episode features an interview with Bob van Luijt, CEO and Co-Founder of SeMI Technologies and co-creator of Weaviate, an open source vector search engine. At just 15 years of age, Bob started his own software company in the Netherlands. He went on to study music at ArtEZ University of the Arts and Berklee College of Music, and completed the Harvard Business School Program of Management Excellence. Bob is also a TEDx speaker, discussing the relationship between software and language.

In this episode, Sam sits down with Bob to break down vector search, the AI-first ecosystem, and how music and software relate to one another.

“I dare to argue that from the two big waves in database technology that we've seen, so first, in the seventies and eighties with SQL. And then the whole NoSQL wave that we have seen and the big winners that are in there, I dare to argue that we see a third wave coming up. And the third wave, I simply call it AI-first. And what I mean with that is that these models play an important role. So we do it from the perspective of the models first. And in that new segment, you see four niches. So the first niche that we see are what I like to call the embedding providers. The Hugging Faces of this world, the OpenAIs of this world, etc. Those who bring us the embeddings that we need to do the vectorization. Then secondly, we have so-called neural search frameworks. So we see frameworks like Haystack and Jina. Then third, we have the feature stores. So the feature stores take care of storing large chunks of features that we later can use to do vectorization on those kinds of things. And then we have the search engines. And Weaviate is an example of such a search engine that takes care of searching through data on a large scale that is vectorized. It might be a bold statement, but I really believe that we see this third wave of database technology happening.” – Bob van Luijt

Episode Timestamps:
(01:45): How Bob defines open source data
(04:09): What is a vector database and why do we need them?
(07:55): How data is different before and after vectorization
(13:58): Orders of magnitude faster or personal
(16:09): How music and software relate to each other for Bob
(19:33): Bob’s inspiration behind Weaviate
(25:02): The AI-first ecosystem
(27:38): The distinction between vector search engines, feature stores, neural search frameworks, and embedding
(32:28): Bob’s advice for folks on the OSS startup journey

Links:
LinkedIn - Connect with Bob
Twitter - Follow Bob
Twitter - Follow Weaviate
Weaviate
SeMI Technologies
Bob’s TEDx Talk
Bob's Forbes Article on the AI-First Database Ecosystem
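
To make "searching through data that is vectorized" concrete, here is a minimal sketch of a brute-force vector search in plain Python. It is not Weaviate's API and does not reflect how Weaviate works internally; the function names (`cosine_similarity`, `search`), the document IDs, and the three-dimensional vectors are all hypothetical, standing in for embeddings that an embedding provider would normally produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(index, query_vector, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(vector, query_vector), doc_id)
        for doc_id, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical index: document ID -> made-up embedding vector.
index = {
    "jazz-album-review": [0.9, 0.1, 0.0],
    "database-tutorial": [0.0, 0.2, 0.9],
    "concert-listing": [0.8, 0.3, 0.1],
}

# Hypothetical query embedding, close to the two music-related documents.
query = [1.0, 0.2, 0.0]
print(search(index, query, k=2))
```

This sketch compares the query against every stored vector, which scales linearly with the size of the index. Search engines built for large vector collections typically rely on approximate nearest-neighbor indexes instead of a full scan.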

Jun 8, 2022 • 40min

Open Source Innovation, The GPL for Data, and The Data In to Data Out Ratio with Larry Augustin

This episode features an interview with Larry Augustin, angel investor and advisor to early-stage technology companies. Larry previously served as the Vice President for Applications at AWS, where he was responsible for application services like Pinpoint, Chime, and WorkSpaces.

Before joining AWS, Larry was the CEO of SugarCRM, an open source CRM vendor. He also was the founder and CEO of VA Linux, where he launched SourceForge. Among the group who coined the term “open source”, Larry has sat on the boards of several open source and Linux organizations.

In this episode, Sam and Larry discuss who owns the rights to data, the data in to data out ratio, and why Larry is an open source titan.

“People are willing to give up so much of their personal information because they get an awful lot back. And privacy experts come along and say, ‘Well, you're taking all this personal information’. But then most people look at that and say, ‘But I get a lot of value back out of that.’ And it's this data ratio value question, which is: for a little in, I get a lot back. That becomes a key element in this. And I think there has to be some kind of similar thought process around open source data in general, which is if I contribute some data into this, I'm going to get a lot of value back. So this data in to data out ratio, I think it's an incredibly important one. It's a principle that I drive into application development. If you put a user in front of an app and they start using the app, you're going to ask them for things. And my principle is always, ‘How do you figure out how to never ask them and only give them?’ And you can't get 100% of the way there, but every time it's like, ‘Why did you ask them for that? Couldn't you figure it out?’ And it gets everyone in the mindset of, ‘How do I provide more and more and take less and less?’ It's a principle of application development that I like a lot. And I think there's a similar concept here around open-source data. Are there models or structures that we can come up with where people can contribute small amounts of data and as a result of that, they get back a lot of value.” – Larry Augustin

Episode Timestamps:
(02:14): How Larry is spending his time after AWS
(06:01): What drove Larry to open source
(18:04): What is the GPL for data?
(23:51): Areas of progress in open source data
(28:37): The data in to data out ratio
(36:02): Larry’s advice for folks in open source

Links:
LinkedIn - Connect with Larry
Twitter - Follow Larry

Jun 1, 2022 • 4min

Data Meshes, Fabrics, and Discovery with Zhamak Dehghani, David Thomas, and Shirshanka Das

This bonus episode features conversations from seasons 1 and 2 of the Open||Source||Data podcast. In this episode, you’ll hear from Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks North America; David Thomas, Principal at Deloitte; and Shirshanka Das, Founder of LinkedIn DataHub and Acryl Data.

Sam sat down with each guest to discuss data meshes, fabrics, and discovery. You can listen to the full episodes from Zhamak Dehghani, David Thomas, and Shirshanka Das by clicking the links below.

Episode Timestamps:
(00:36): Zhamak Dehghani
(01:41): David Thomas
(02:43): Shirshanka Das

Links:
Listen to Zhamak’s episode
Listen to David’s episode
Listen to Shirshanka’s episode

Apr 27, 2022 • 35min

Investing in Communities, Differentiating, and Trusting Your Gut with Erica Brescia

This episode features an interview with Erica Brescia, Managing Director of Redpoint Ventures. At Redpoint, Erica focuses her investing on infrastructure, DevOps, and security.

Erica has over 15 years of experience in the open source community and currently serves on the board of directors of the Linux Foundation. Prior to joining Redpoint, Erica was also an angel investor and advisor to companies such as Netlify, Coda, and Xata.

In this episode, Sam and Erica discuss the evolution of open source data, what’s changed for practitioners, and why you should always listen to your gut.

“I think there is just so much good motivation to make the world a better place, especially during my time at GitHub. When you can see what kinds of opportunity open source can bring to people in developing countries, that’s really exciting. You see people whose lives and livelihoods have literally been changed because they were able to participate in a global open source project. And then you can see the way that open source projects, even back when we were packaging things at Bitnami, we’d hear from non-profits in Africa that were never able to use open source until we made it easy to consume. When you feel like you’re really making that kind of a difference and you’re doing it in a community of great people, it’s a really great way to spend your time.” – Erica Brescia

Episode Timestamps:
(03:18): What open source data means to Erica
(11:31): What’s changed in open source data in recent years
(18:01): How the journey has evolved for innovators and practitioners
(24:11): What stands out as a venture capitalist to Erica
(30:03): Don’t discount junior investors
(31:17): Erica’s advice: get quiet and listen to your gut

Links:
LinkedIn - Connect with Erica
LinkedIn - Connect with Redpoint
Twitter - Follow Erica
Twitter - Follow Redpoint
Visit Redpoint
Xata
Dagger

Apr 20, 2022 • 4min

Data on Kubernetes with Kelsey Hightower, Lachlan Evenson, and Patrick McFadin

This bonus episode features conversations from season 1 of the Open||Source||Data podcast. In this episode, you’ll hear from Kelsey Hightower, Principal Engineer at Google Cloud; Lachlan Evenson, Principal Program Manager at Microsoft Azure; and Patrick McFadin, Head of Developer Relations at DataStax.

Sam sat down with each guest to discuss Data on Kubernetes and how they’re making progress on a stateless infrastructure.

You can listen to the full episodes from Kelsey Hightower, Lachlan Evenson, and Patrick McFadin by clicking the links below.

Timestamps:
(00:39): Kelsey Hightower
(01:33): Lachlan Evenson
(02:06): Patrick McFadin

Links:
Listen to Kelsey’s episode
Listen to Lachlan’s episode
Listen to Patrick’s episode
