Hugo Bowne-Anderson and Matthew Rocklin of Coiled are reshaping the data science landscape. They dive into Dask, the open-source library for parallel computing in Python, which makes it easier to handle large datasets. The duo discusses the challenges of scaling data science, navigating cloud complexities, and the vital role of data literacy in organizations. They also share insights on community engagement in open source, the evolution of OSS, and the advantages of Dask over tools like Spark, emphasizing its future in distributed computing.
ANECDOTE
Dask's Origin
Dask was initially designed as a parallel NumPy at Anaconda to scale Python's data science tools.
It evolved to become a general-purpose parallel computing library after other libraries adopted its core engine.
INSIGHT
Data Science's Difficulty
Data science is difficult partly because it isn't a single, unified field.
The tools, methods, and desired outcomes vary greatly between applications, from distributed machine learning to analytics dashboards.
INSIGHT
Tooling and Best Practices
Tooling encodes best practices, implicitly teaching users better approaches.
Data scientists may lack expertise in areas like security, so tools can bridge these gaps by handling these practices automatically.
Dask
What is it?
Parallelism for analytics
What is parallelism?
Doing a lot at once by splitting tasks into smaller subtasks which can be processed in parallel (at the same time)
Distributing work across multiple machines and then combining the results
Helpful for CPU-bound work - doing a bunch of calculations on the CPU, where the rate at which the process progresses is limited by the speed of the CPU (see the sketch below)
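A minimal sketch of that idea using dask.delayed (the process function and the chunking are illustrative placeholders, not anything from the episode): the work is split into independent subtasks that Dask can run in parallel and then combine.

```python
import dask

@dask.delayed
def process(chunk):
    # CPU-bound work on one piece of the data
    return sum(x * x for x in chunk)

# Split one big job into smaller subtasks
chunks = [range(i * 1_000, (i + 1) * 1_000) for i in range(8)]
partials = [process(chunk) for chunk in chunks]
total = dask.delayed(sum)(partials)

# Nothing has run yet; compute() executes the subtasks in parallel
# and combines their results
print(total.compute())
```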
Concurrency?
Similar, but things don’t have to happen at the same time; they can happen asynchronously and overlap.
Shared state
Helpful for I/O-bound work - networking, reading from disk, etc.; the rate at which the process progresses is limited by the speed of the I/O subsystem (see the sketch below)
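A small asyncio sketch of the I/O-bound case (the sleeps stand in for network or disk waits and are purely illustrative): the coroutines overlap rather than running one after another, so the total time is roughly the longest wait, not the sum.

```python
import asyncio

async def fetch(name, delay):
    # Simulated I/O wait (e.g. a network call or a disk read);
    # while one coroutine is waiting, the others make progress
    await asyncio.sleep(delay)
    return f"{name} finished after {delay}s"

async def main():
    # The three "requests" overlap, so this takes ~2s, not ~4.5s
    results = await asyncio.gather(
        fetch("a", 1.5), fetch("b", 2.0), fetch("c", 1.0)
    )
    print(results)

asyncio.run(main())
```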
Multi-core vs distributed
Multi-core is a single processor with 2 or more cores that can cooperate through threads - multithreading
Distributed is across multiple nodes communicating via HTTP or RPC
Why is this hard?
Python has its challenges due to the GIL (Global Interpreter Lock); other languages don't have this problem
Shared state can lead to potential race conditions, deadlocks, etc
Coordinating work across the machines (see the sketch below)
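A rough standard-library illustration of the GIL point (the workload and sizes are arbitrary): the same CPU-bound function run on a thread pool versus a process pool. Only the processes execute the Python code truly in parallel; the threads take turns under the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_task(n):
    # Pure-Python CPU work; the GIL lets only one thread
    # execute this at a time
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    work = [2_000_000] * 8

    start = time.perf_counter()
    with ThreadPoolExecutor() as pool:
        list(pool.map(cpu_task, work))
    print("threads:  ", round(time.perf_counter() - start, 2), "s")

    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        # Separate processes sidestep the GIL, at the cost of
        # serializing arguments and results between workers
        list(pool.map(cpu_task, work))
    print("processes:", round(time.perf_counter() - start, 2), "s")
```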
For analytics?
Calculating some statistics on a large dataset can be tricky if it can’t fit in memory
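A hedged sketch of how that looks with dask.dataframe (the file pattern and column names are hypothetical): the CSVs are read lazily as partitions, and the statistic is computed partition by partition, so the full dataset never has to fit in memory at once.

```python
import dask.dataframe as dd

# Lazily point at a collection of CSVs that together exceed RAM;
# each file becomes one or more partitions (path is a placeholder)
df = dd.read_csv("data/events-*.csv")

# The mean is computed per partition and then combined,
# so nothing requires loading the whole dataset at once
stats = df.groupby("user_id")["duration"].mean()
print(stats.compute().head())
```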
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with David on LinkedIn: https://www.linkedin.com/in/aponteanalytics/
Connect with Matthew on LinkedIn: https://www.linkedin.com/in/matthew-rocklin-461b4323/
Timestamps:
0:00 - Intro to Matthew Rocklin and Hugo Bowne-Anderson
0:37 - Matthew Rocklin's Background
1:17 - Hugo Bowne-Anderson's Background
3:47 - Where did that inspiration come from?
10:04 - Is there a close relationship between Best Practices and Tooling or are these two separate things?
11:27 - Why is Data Literacy important with Coiled?
14:46 - How do you think about the balance between enabling Data Science to have a lot of powerful compute?
17:05 - Machine Learning as a space for tracking best practices experimentation
19:32 - What makes Data Science so difficult?
24:07 - How can a for-profit company complement Open Source Software (OSS)?
29:40 - Amazon becoming a competitor using your own open-source technology
32:50 - How do you encourage more people to contribute and ensure quality?
34:58 - Do you see Coiled operating within the Dask ecosystem?
37:30 - What is Dask?
39:19 - What should people know about parallelism?
41:28 - Why is it so hard to put things back together?
41:34 - Why does Python need a whole new tool to enable that? Or maybe some other tools as well?
44:44 - Dynamic Tasks Scheduling as being useful to Data Scientists
47:15 - Why is reliability in particular important in Data Science?
52:27 - What's in store for Dask?