

The Data Exchange with Ben Lorica
Ben Lorica
A series of informal conversations with thought leaders, researchers, practitioners, and writers on a wide range of topics in technology, science, and of course big data, data science, artificial intelligence, and related applications. Anchored by Ben Lorica (@BigData), the Data Exchange also features a roundup of the most important stories from the worlds of data, machine learning, and AI. Detailed show notes for each episode can be found at https://thedataexchange.media/. The Data Exchange podcast is a production of Gradient Flow [https://gradientflow.com/].
Episodes

Feb 6, 2020 • 33min
Building domain specific natural language applications
In this episode of the Data Exchange I speak with David Talby, co-creator of Spark NLP, an open source, highly scalable, production grade natural language processing (NLP) library. Spark NLP has become one of the more popular NLP libraries and is available on PyPI, Conda, Maven, and Spark Packages. With recent advances in research in large-scale natural language models, there is strong interest in domain specific natural language applications. Besides their work on Spark NLP, David and his collaborators are building natural language models tuned specifically for healthcare applications.
Our conversation spanned many topics, including:
- Spark NLP: its current status and some common and surprising use cases.
- Recent developments in NLP research and their implications for companies.
- Spark NLP for Healthcare.
Detailed show notes can be found on The Data Exchange web site.

Jan 30, 2020 • 42min
The state of privacy-preserving machine learning
In this episode of the Data Exchange I speak with Morten Dahl, research scientist at Dropout Labs, a startup building a platform and tools for privacy-preserving machine learning. He is also behind TF Encrypted, an open source framework for encrypted machine learning in TensorFlow. The rise of privacy regulations like CCPA and GDPR, combined with the growing importance of ML, has led to a strong interest in tools and techniques for privacy-preserving machine learning among researchers and practitioners. Morten brings the unique perspective of being a longtime security researcher who has also worked as a data scientist in industry.
Our conversation spanned many topics, including:
- Morten’s unique background as an experienced security researcher, developer, and data scientist.
- The current state of TF Encrypted.
- Federated learning (FL) and secure aggregation for FL.
- Privacy-preserving ML solutions will employ a variety of techniques, and thus we also discussed related topics such as differential privacy, homomorphic encryption, and RISELab’s stack for coopetitive learning (MC2).
Detailed show notes can be found on The Data Exchange web site.

Jan 23, 2020 • 38min
Taking messaging and data ingestion systems to the next level
Sijie Guo on how Apache Pulsar is able to handle both queuing and streaming, and both online and offline applications.
In this episode of the Data Exchange I speak with Sijie Guo, founder of StreamNative, a new startup focused on making enterprise messaging technologies - specifically Apache Pulsar - easy to use on the cloud. Sijie was previously a cofounder of Streamlio (acquired by Splunk) and prior to that he led the messaging team at Twitter. He is also the main organizer behind the Pulsar Summit (April in San Francisco), a new conference whose Call for Speakers closes on January 31st.
Our conversation spanned many topics, including:
- The role of messaging in modern data applications and platforms.
- The two main types of messaging applications: queuing and streaming.
- Apache Pulsar as a unified messaging platform, able to handle both queuing and streaming, and both online and offline applications.
- A status update on Apache Pulsar.
Detailed show notes can be found on The Data Exchange web site.

Jan 16, 2020 • 41min
Business at the speed of AI: Lessons from Rakuten
The Data Exchange Podcast: Bahman Bahmani on attracting and retaining talent, and the importance of delivery-oriented teams.
In this episode of the Data Exchange I speak with Bahman Bahmani, VP of Data Science and Engineering at Rakuten, a large Japanese ecommerce and online retail company. When I first met Bahman several years ago, he was finishing up his Computer Science PhD at Stanford, and at the time he was giving technical talks on machine learning algorithms and their applications to computer security. Today he leads a large team at Rakuten, and in my opinion he has established an organizational structure, processes, and an AI practice that other companies should study.
Our conversation spanned many topics, including:
- The impact that AI, machine learning, and data have had on Rakuten’s businesses.
- Attracting, nurturing, and retaining talent in an environment where data scientists, data engineers, and analysts all have many other options.
- The trio of strategic options: operational excellence, product leadership, customer intimacy.
- Organization and culture, including key roles within an AI practice.
- The power of delivery-oriented teams with end-to-end responsibility.
Detailed show notes can be found on The Data Exchange web site.

Jan 9, 2020 • 30min
The combination of the right software and commodity hardware will prove capable of handling most machine learning tasks
In this episode of the Data Exchange I speak with Nir Shavit, Professor of EECS at MIT, and cofounder and CEO of Neural Magic, a startup that is creating software to enable deep neural networks to run on commodity CPUs (at GPU speeds or faster). Their initial products are focused on model inference, but they are also working on similar software for model training.
Our conversation spanned many topics, including:
- Neurobiology, in particular the combination of Nir’s research areas of multicore software and connectomics, a branch of neurobiology.
- Why he believes the combination of the right software and CPUs will prove capable of handling many deep learning tasks.
- Speed is not the only factor: the “unlimited memory” of CPUs can unlock larger problems and architectures.
- Neural Magic’s initial offering is in inference; model training using CPUs is also on the horizon.
Detailed show notes can be found on The Data Exchange web site.

Dec 26, 2019 • 36min
Key AI and Data Trends for 2020
In this episode of the Data Exchange, I speak with my podcast co-organizer Mikio Braun, data scientist at GetYourGuide, and a former machine learning researcher and data architect. Mikio and I go out on a limb and speculate about new trends in AI and Data that we think people should pay attention to in 2020.
Our conversation spanned many topics, and we listed trends in:
- Models: reinforcement learning, deep learning, language models, and related topics.
- Applications: including emerging use cases for reinforcement learning.
- Infrastructure and Tools: end-to-end machine learning platforms, the importance of distributed computing, etc.
- Managing risks: privacy, security, safety, fairness, etc.
- Emerging technologies to watch for in 2020.
Detailed show notes can be found on The Data Exchange web site.

Dec 12, 2019 • 36min
The evolution of TensorFlow and of machine learning infrastructure
In this episode of the Data Exchange I speak with Rajat Monga, one of the founding members of the TensorFlow Engineering team. Up until recently Rajat was the engineering manager for TensorFlow at Google.
Our conversation spanned many topics, including:
- TFX, a production scale machine learning platform based on TensorFlow.
- Distributed training.
- MLIR (Multi-Level Intermediate Representation), “a representation format and library of compiler utilities that sits between the model representation and low-level compilers/executors that generate hardware-specific code.”
- Deep learning in the enterprise.
- The state of machine learning infrastructure.
[Full show notes can be found on the Data Exchange web site.]

Nov 26, 2019 • 40min
Building large-scale, real-time computer vision applications
In this episode of the Data Exchange I speak with Reza Zadeh, founder and CEO of Matroid, a startup focused on making computer vision applications easy to build and deploy. Reza is also an adjunct professor at Stanford.
This particular conversation spanned many topics pertaining to computer vision, including:
- Challenges in building large-scale, real-time computer vision applications.
- Robustness of computer vision applications (adversarial attacks, deepfakes).
- Impact of computer vision technologies on society: security, privacy, and surveillance.
We also preview the upcoming 2020 edition of the ScaledML conference: Reza is the main organizer behind one of my favorite conferences in the SF Bay Area.
[Full show notes can be found on the Data Exchange site.]

Nov 12, 2019 • 45min
Taking stock of foundational tools for analytics and machine learning
In this episode of The Data Exchange, I speak with Paco Nathan, author, teacher, and founder of Derwen.ai, a boutique consulting firm specializing in Data, ML, and AI. Paco consults with companies and speaks before audiences all over the world, and I plan to have him as a frequent guest on this podcast to draw on his observations of diverse organizations.
This particular conversation spanned many topics, including:
- Data Governance: Paco’s talk on the topic
- AutoML: Paco’s talk on the topic
- PyTorch and TensorFlow: posts we discussed - [1], [2]
- Reproducibility and feature selection
- The Streamlit open source project for ML app development, and Grus Law (“if you can think up something crazy and/or dangerous to do with notebooks, people are doing it.”)
I want this to be more than just a podcast. I want to create a community to help people make better decisions. A key part of this is getting you involved. I have ideas on how this community will grow, but as a first step, I want to ask a question related to one of the topics that Paco and I discussed: PyTorch and TensorFlow. I'd love to have you weigh in by filling out the survey form. I'll report on results and key insights in a future episode of this podcast.
[Full show notes can be found on the Data Exchange site.]