Narwhals: Expanding DataFrame Compatibility Between Libraries
Oct 18, 2024
Marco Gorelli, a data scientist and creator of the Narwhals project, shares his journey into open source and the mission behind Narwhals, which improves compatibility between DataFrame libraries such as pandas, Polars, and PyArrow. He explains how lazy evaluation defers work until a result is needed and why that helps with large datasets. The conversation also covers community engagement in open source, advice for newcomers, and collaborations with libraries such as Altair and scikit-lego.
Narwhals aims to simplify compatibility across various data frame libraries, streamlining integration for better data processing efficiency.
Marco Gorelli discusses his journey in open source, highlighting how contributions can enhance career prospects and community involvement.
The project emphasizes lazy evaluation to improve memory management, allowing complex operations without memory overload in large datasets.
Deep dives
Narwhals Project Overview
Narwhals is a project designed to create compatibility across DataFrame libraries, and it caters primarily to library maintainers rather than end users. It simplifies handling different DataFrame formats such as pandas, Polars, and PyArrow, so that a library built on Narwhals can accept whichever of these its users already work with. By letting maintainers express DataFrame logic once, without worrying about the specifics of the underlying library, Narwhals aims to minimize dependencies and provide a consistent interface. Marco Gorelli highlights the advantages of supporting modern features, such as lazy evaluation, which improves performance and efficiency in data processing.
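To make the idea of "one piece of logic, several DataFrame backends" concrete, here is a minimal pure-Python sketch. It is a toy illustration only: `Frame`, `ColumnFrame`, `RowFrame`, and `add_doubled` are hypothetical names invented for this example and are not part of the Narwhals API. The two frame classes stand in for libraries that store data differently, and the function at the bottom is the kind of code a library maintainer would write once.

```python
from typing import Protocol


class Frame(Protocol):
    """Hypothetical shared interface that the library logic is written against."""

    def column(self, name: str) -> list: ...
    def with_column(self, name: str, values: list) -> "Frame": ...


class ColumnFrame:
    """Stand-in for a column-oriented library: data stored as {name: [values]}."""

    def __init__(self, data: dict[str, list]):
        self.data = data

    def column(self, name: str) -> list:
        return list(self.data[name])

    def with_column(self, name: str, values: list) -> "ColumnFrame":
        return ColumnFrame({**self.data, name: list(values)})


class RowFrame:
    """Stand-in for a row-oriented library: data stored as a list of dicts."""

    def __init__(self, rows: list[dict]):
        self.rows = rows

    def column(self, name: str) -> list:
        return [row[name] for row in self.rows]

    def with_column(self, name: str, values: list) -> "RowFrame":
        return RowFrame([{**row, name: v} for row, v in zip(self.rows, values)])


def add_doubled(frame: Frame, source: str) -> Frame:
    """Library-author logic, written once against the shared interface."""
    doubled = [value * 2 for value in frame.column(source)]
    return frame.with_column(f"{source}_doubled", doubled)


# The same function works on both backends and returns the caller's own type.
col_result = add_doubled(ColumnFrame({"a": [1, 2, 3]}), "a")
row_result = add_doubled(RowFrame([{"a": 1}, {"a": 2}, {"a": 3}]), "a")
```

In this sketch, the caller gets back the same kind of object they passed in, which mirrors the goal described above: the maintainer's code does not need to know which backend it was handed.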
Contributions and Open Source Involvement
Marco Gorelli shares insights into his journey of contributing to open source, illustrating how an initial encounter with an issue in pandas led to a broader involvement in library development. His experience shows how welcoming open source communities can be, encouraging newcomers to submit fixes and contribute meaningfully. He emphasizes that engaging in projects can enhance job prospects, as employers value independent work and community contributions. Those new to open source are encouraged to find issues related to their own experiences or challenges, rather than searching for curated lists of beginner-friendly tasks.
Resource Efficiency with Lazy Evaluation
Lazy evaluation is presented as a significant aspect of Narwhals: operations are deferred until a result is actually needed, which improves memory management and computational efficiency. Because execution is deferred, users can define complex DataFrame operations without risking memory overload on large datasets. The library sees the whole chain of operations before running any of it and can process data only when demanded, much as a manager who knows the full plan up front can streamline the work instead of issuing instructions one at a time. Ultimately, lazy evaluation helps users avoid pitfalls associated with eager execution, where each step runs immediately and produces its full intermediate result.
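The deferral described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how Narwhals or any real DataFrame library implements lazy evaluation: the `LazyRows` class and its methods are hypothetical, and real engines also optimize the recorded plan, which this sketch does not attempt. It only shows that building a query reads no data, and that `collect()` streams rows through every step without materializing intermediate results.

```python
from typing import Callable, Iterable, Iterator


class LazyRows:
    """Hypothetical toy: records steps and runs them only when collect() is called."""

    def __init__(self, source: Callable[[], Iterable[dict]], steps: tuple = ()):
        self._source = source  # called only at collect() time
        self._steps = steps    # recorded plan: ("filter", fn) or ("with_column", (name, fn))

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyRows":
        # Record the step and return a new plan; no data is touched here.
        return LazyRows(self._source, self._steps + (("filter", predicate),))

    def with_column(self, name: str, func: Callable[[dict], object]) -> "LazyRows":
        return LazyRows(self._source, self._steps + (("with_column", (name, func)),))

    def _iterate(self) -> Iterator[dict]:
        # Push each row through the whole plan before reading the next one.
        for row in self._source():
            keep = True
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:
                    name, func = arg
                    row = {**row, name: func(row)}
            if keep:
                yield row

    def collect(self) -> list[dict]:
        return list(self._iterate())


rows_read = []  # tracks when the source is actually consumed


def source() -> Iterator[dict]:
    for i in range(5):
        rows_read.append(i)
        yield {"x": i}


query = (
    LazyRows(source)
    .filter(lambda r: r["x"] % 2 == 0)
    .with_column("y", lambda r: r["x"] * 10)
)
reads_before_collect = len(rows_read)  # still 0: the query is only a plan so far
result = query.collect()               # the source is read here, one row at a time
```

An eager version of the same pipeline would build the full filtered list first and then a second full list with the new column; the lazy version holds only one row in flight per step until the final result is assembled.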
Community Engagement and Development
The growth of Narwhals is supported by an active community, with collaborative efforts engaging multiple contributors and maintainers from related libraries. Regular community calls and live streams provide avenues for users and developers to interact, ask questions, and contribute ideas to enhance the project. Gorelli notes that involvement from interns and external developers has led to rapid growth and adoption of Narwhals within the ecosystem. This community-driven approach not only fosters dedication but also ensures that the library remains relevant and adaptable to changing needs.
Future Directions and Opportunities
Looking ahead, Narwhals aims to expand its compatibility with an increasing number of libraries, fostering broader adoption within the data science community. Interest from major libraries such as Plotly and Fairlearn indicates a recognition of Narwhals' potential as a solution for compatibility issues. Gorelli emphasizes the importance of maintaining a balance between expanding functionality and ensuring the stability of existing features, underscoring the project's focus on a clean, lightweight implementation. By continuing to attract contributions and partnerships, Narwhals is positioned to influence the data frame landscape positively.
How does a Python tool support all types of DataFrames and their various features? Could a lightweight library be used to add compatibility for newer formats like Polars or PyArrow? This week on the show, we speak with Marco Gorelli about his project, Narwhals.
Narwhals is a project aimed at library maintainers rather than end users. We discuss how the added compatibility benefits users by supporting modern features like lazy evaluation. We cover several projects Marco has been working with to implement Narwhals, including Altair, scikit-lego, and Ibis.
We also discuss how Marco started contributing to open-source projects. Marco has contributed to both pandas and Polars, which helps explain his interest in growing compatibility between libraries. He also offers advice on making your first contribution.
In this video course, you’ll learn how Python’s mutable and immutable data types work internally and how you can take advantage of mutability or immutability to power your code.
Topics:
00:00:00 – Introduction
00:02:02 – EuroSciPy 2024 and sprints
00:04:04 – How did you get involved in open source?
00:07:18 – Finding a good issue to get started
00:09:25 – Discord and open-source projects
00:11:12 – How would you describe Narwhals?
00:16:47 – Working on Polars
00:19:17 – Apache Arrow and a data interchange protocol
00:22:55 – Sponsor: CodeRabbit
00:23:55 – Digging into eager vs lazy
00:27:04 – Ibis DataFrame library
00:28:57 – What do libraries need from Narwhals?
00:34:57 – The scikit-lego library
00:37:15 – Video Course Spotlight
00:38:45 – Other libraries interested in Narwhals
00:41:56 – Compatibility policy
00:45:18 – What should an end user expect?
00:46:32 – Have other projects attempted this?
00:47:54 – Keeping the project light and pure Python
00:49:32 – Contributors and how to get involved
00:54:42 – What are you excited about in the world of Python?
00:57:18 – What do you want to learn next?
00:59:05 – How can people follow your work online?