
Data Engineering Podcast
Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine
Podcast summary created with Snipd AI
Quick takeaways
- Tabnine is an AI assistant for software development that offers features like code suggestions, automated test generation, and documentation generation, aiming to improve productivity, code quality, and discoverability.
- Building an AI assistant for software development presents challenges such as optimizing for low latency, handling contextual awareness, and managing evaluation and feedback loops, while also requiring human expertise for creative and complex algorithm design.
Deep dives
Tabnine: AI Assistant for Software Development
Tabnine is an AI assistant for software development, helping with code generation, code completions, and other software development tasks. It offers features like code suggestions, automated test generation, and documentation generation. Tabnine envisions a future where software development is driven by AI, improving productivity, code quality, and discoverability. Developers can use Tabnine to accelerate their coding process, learn new techniques, and access a wider context of programming solutions. Organizations also benefit from improved productivity, harmonized code, and accelerated knowledge sharing among teams. Tabnine supports various programming languages and is continuously evolving to provide better assistance to software engineers.
Challenges and Customizations of AI Assistance for Developers
Building an AI assistant for software development presents various challenges and customization needs. These include optimizing for low latency in code completions, handling contextual awareness in a codebase, and managing evaluation and feedback loops for accuracy. The granularity of code generation and the ability to express requirements clearly to the model are important considerations. While AI assistance is valuable for code completion, review, and test generation, creative and complex algorithm design may still require human expertise. Education and calibration of users' expectations are crucial for effective utilization of AI assistance. Integration with non-code sources and supporting different languages further adds complexity to the development process.
Unexpected Use Cases and Future Plans for Tabnine
Tabnine has seen unexpected use cases and innovative applications in software development. Some users have leveraged Tabnine for writing emails, meeting summaries, and other content generation tasks beyond code-related activities. The customization and adaptability of Tabnine have allowed users to experiment with unique applications, such as migrating legacy COBOL code to modern languages. Future plans for Tabnine include improving code review capabilities, integrating with non-code sources like Confluence and JIRA, and exploring the generation of programming tutorial videos. These advancements aim to make Tabnine more contextually aware, improve human-like interaction, and extend its usability to various domains in software engineering.
Barriers to Adoption for Machine Learning in Software Development
Two significant barriers to the adoption of machine learning in software development are privacy and security concerns and the interface between humans and AI systems. Organizations may hesitate to share sensitive code or data with external ML systems because of privacy and security risks, so providing private, secure ML solutions is crucial. The interface between humans and AI systems is also a challenge: it remains difficult to find the right level of presentation and communication between the AI system and the user. Discovering effective ways to communicate results, understanding user needs, and tailoring interaction to human preferences are important adoption factors.
Summary
Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI-powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product, the ways that it enhances the ability of humans to get their work done, and the cases where humans have to adapt to the tool.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Your host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what Tabnine is and the story behind it?
- What are the individual and organizational motivations for using AI to generate code?
- What are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.)
- What are the elements of skepticism/oversight that developers need to exercise while using a system like Tabnine?
- What are some of the primary ways that developers interact with Tabnine during their development workflow?
- Are there any particular styles of software for which an AI is more appropriate/capable? (e.g. webapps vs. data pipelines vs. exploratory analysis, etc.)
- For natural languages there is a strong bias toward English in the current generation of LLMs. How does that translate into computer languages? (e.g. Python, Java, C++, etc.)
- Can you describe the structure and implementation of Tabnine?
- Do you rely primarily on a single core model, or do you have multiple models with subspecialization?
- How have the design and goals of the product changed since you first started working on it?
- What are the biggest challenges in building a custom LLM for code?
- What are the opportunities for specialization of the model architecture given the highly structured nature of the problem domain?
- For users of Tabnine, how do you assess/monitor the accuracy of recommendations?
- What are the feedback and reinforcement mechanisms for the model(s)?
- What are the most interesting, innovative, or unexpected ways that you have seen Tabnine's LLM powered coding assistant used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI assisted development at Tabnine?
- When is an AI developer assistant the wrong choice?
- What do you have planned for the future of Tabnine?
Contact Info
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Tabnine
- Technion University
- Program Synthesis
- Context Stuffing
- Elixir
- Dependency Injection
- COBOL
- Verilog
- MidJourney
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Sponsored By:
- Starburst: This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and DoorDash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance, allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- RudderStack: Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize:  You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold:  This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!