
Towards Data Science

Latest episodes

Oct 22, 2019 • 40min

10. Sanyam Bhutani - Data science beyond the classroom

A few years ago, there really wasn’t much of a difference between data science in theory and in practice: a Jupyter notebook and a couple of imports were all you really needed to do meaningful data science work. Today, as the classroom overlaps less and less with the realities of industry, it’s becoming more and more important for data scientists to develop the ability to learn independently and go off the beaten path. Few people have done so as effectively as Sanyam Bhutani, who, among other things, is an incoming ML engineer at H2O.ai, a top-1% Kaggler, a popular blogger, and the host of the Chai Time Data Science Podcast. Sanyam has a unique perspective on the mismatch between what’s taught in the classroom and what’s required in industry: he started doing ML contract work while still in undergrad, and has interviewed some of the world’s top-ranked Kagglers to better understand where the rubber meets the data science road.
Oct 15, 2019 • 53min

9. Ben Lorica - Trends in data science with O'Reilly Media's Chief Data Scientist

The trend towards model deployment, engineering and just generally building “stuff that works” is just the latest step in the evolution of the now-maturing world of data science. It’s almost guaranteed not to be the last one though, and staying ahead of the data science curve means keeping an eye on what trends might be just around the corner. That’s why we asked Ben Lorica, O’Reilly Media’s Chief Data Scientist, to join us on the podcast. Not only does Ben have a mile-high view of the data science world (he advises about a dozen startups and organizes multiple world-class conferences), but he also has a perspective that spans two decades of data science evolution.
Oct 8, 2019 • 51min

8. George Hayward: comedian, lawyer and data scientist

Each week, I have dozens of conversations with people who are trying to break into data science. The main topic of the conversations varies, but it’s rare that I walk away without getting a question like, “Do you think I have a shot in data science given my unusual background in [finance/physics/stats/economics/etc]?”. From now on, my answer to that question will be to point them to today’s guest, George John Jordan Thomas Aquinas Hayward. George [names omitted] Hayward’s data science career is a testament to the power of branding and storytelling. After completing a JD/MBA at Stanford and reaching top-ranked status in Hackerrank’s SQL challenges, he went on to work on contract for a startup at Google, and subsequently for a number of other companies. Now, you might be tempted to ask how comedy and law could possibly lead to a data science career.
Oct 1, 2019 • 39min

7. Serkan Piantino - From Facebook to startups: data science is becoming an engineering problem

For today’s podcast, we spoke with someone who is laser-focused on considering this second possibility: the idea that data science is becoming an engineer’s game. Serkan Piantino served as the Director of Engineering for Facebook AI Research, and now runs the machine learning infrastructure startup Spell. Their goal is to build dev tools for data scientists that make it as easy to train models on the cloud as it is to train them locally. That experience, combined with his time at Facebook, has given him a unique perspective on the engineering best practices data scientists should adopt, and on the future of the field as a whole.
Sep 25, 2019 • 45min

6. Jay Feng - Data science in the startup world

I’ve said it before and I’ll say it again: “data science” is an ambiguous job title. People use the term to refer to data science, data engineering, machine learning engineering and analytics roles, and that’s bad enough. But worse still, being a “data scientist” means completely different things depending on the scale and stage of the company you’re working at. A data scientist at a small startup might have almost nothing in common with a data scientist at a massive enterprise company, for example. So today, we decided to talk to someone who’s seen data science at both scales. Jay Feng started his career working in analytics and data science at Jobr, which was acquired by Monster.com (which was itself acquired by an even bigger company). Among many other things, his story sheds light on a question that you might not have thought about before: what happens to data scientists when their company gets acquired?
Sep 19, 2019 • 45min

5. Rocio Ng - Data science and product management at LinkedIn

Most software development roles are pretty straightforward: someone tells you what to build (usually a product manager), and you build it. What’s interesting about data science is that although it’s a software role, it doesn’t quite follow this rule. That’s because data scientists are often the only people who can understand the practical business consequences of their work. There’s only one person on the team who can answer questions like, “What does the variance in our cluster analysis tell us about user preferences?” and “What are the business consequences of our model’s ROC score?”, and that person is the data scientist. In that sense, data scientists have a very important responsibility not to leave any insights on the table, and to bring business instincts to bear even when they’re dealing with deeply technical problems. For today’s episode, we spoke with Rocio Ng, a data scientist at LinkedIn, about the need for strong partnerships between data scientists and product managers, and about the day-to-day dynamic between those roles at LinkedIn. Along the way, we also talked about one of the most common mistakes that early-career data scientists make: focusing too much on that first role.
Sep 10, 2019 • 49min

4. Akshay Singh - The thin line between data science and data engineering

Akshay Singh, an expert in data science and data engineering, discusses the evolution of the field and the challenges faced in implementing data science in production systems. Topics include author disambiguation, managing feature drift in production systems, and measuring outcomes in data science projects.
Aug 14, 2019 • 51min

Susan Holcomb - Nontechnical career skills for data scientists

It’s easy to think of data science as a technical discipline, but in practice, things don’t really work out that way. If you’re going to be a successful data scientist, people will need to believe that you can add value in order to hire you, people will need to believe in your pet project in order to endorse it within your company, and people will need to make decisions based on the insights you pull out of your data. Although it’s easy to forget about the human element, managing it is one of the most useful skills you can develop if you want to climb the data science ladder, and land that first job, or that promotion you’re after. And that’s exactly why we sat down with Susan Holcomb, the former Head of Data at Pebble, the world’s first smartwatch company. When Pebble first hired her, Susan was fresh out of grad school in physics, and had never led a team, or interacted with startup executives. As the company grew, she had to figure out how to get Pebble’s leadership to support her effort to push the company in a more data-driven direction, at the same time as she managed a team of data scientists for the first time. 
Jul 16, 2019 • 43min

Tan Vachiramon - Choosing the right algorithm for your real-world problem

You import your data. You clean your data. You make your baseline model. Then you tune your hyperparameters. You go back and forth from random forests to XGBoost, add feature selection, and tune some more. Your model’s performance goes up, and up, and up. And eventually, the thought occurs to you: when do I stop?

Most data scientists struggle with this question on a regular basis, and from what I’ve seen working with SharpestMinds, the vast majority of aspiring data scientists get the answer wrong. That’s why we sat down with Tan Vachiramon, a member of the Spatial AI team at Oculus and former data scientist at Airbnb.

Tan has seen data science applied in two very different industry settings: once, as part of a team whose job it was to figure out how to understand their customer base in the middle of the whirlwind of out-of-control user growth (at Airbnb); and again in a context where he’s had the luxury of conducting far more rigorous data science experiments under controlled circumstances (at Oculus).

My biggest take-home from our conversation was this: if you’re interested in working at a company, it’s worth taking some time to think about their business context, because that’s the single most important factor driving the kind of data science you’ll be doing there. Specifically:

Data science at rapidly growing companies comes with a special kind of challenge that’s not immediately obvious: because they’re growing so fast, no matter where you look, everything looks like it’s correlated with growth! New referral campaign? “That definitely made the numbers go up!” New user onboarding strategy? “Wow, that worked so well!” Because the product is taking off, you need special strategies to ensure that you don’t confuse the effectiveness of a company initiative you’re interested in with the inherent viral growth the product was already experiencing.

The amount of time you spend tuning or selecting your model, or doing feature selection, depends entirely on the business context. In some companies (like Airbnb in the early days), super-accurate algorithms aren’t as valuable as algorithms that allow you to understand what the heck is going on in your dataset. As long as business decisions don’t depend on getting second-digit-after-the-decimal levels of accuracy, it’s okay (and even critical) to build a quick model and move on. In these cases, even logistic regression often does the trick! In other contexts, where tens of millions of dollars depend on every decimal point of accuracy you can squeeze out of your model (investment banking, ad optimization), expect to spend more time on tuning and modeling.

At the end of the day, it’s a question of opportunity cost: keep asking yourself whether you could be creating more value for the business if you wrapped up your model tuning now and worked on something else. If you think the answer could be yes, then consider calling model.save() and walking away, as in the sketch below.
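To make that opportunity-cost framing concrete, here is a minimal sketch of what “build a quick model and move on” might look like in a scikit-learn-style workflow (the episode doesn’t prescribe any particular stack). The synthetic dataset, the AUC threshold, and the output filename are purely hypothetical stand-ins:

```python
# A minimal sketch of "quick baseline, then decide": scikit-learn style.
# The synthetic dataset, the AUC threshold, and the output filename are
# illustrative stand-ins, not details from the episode.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for your cleaned feature matrix and labels.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Fast, interpretable baseline: often enough to understand what's going on.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline_auc = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc").mean()
print(f"Baseline ROC AUC: {baseline_auc:.3f}")

# Decision point: is another round of XGBoost tuning and feature selection
# worth more to the business than shipping this and moving on?
GOOD_ENOUGH = 0.85  # purely illustrative threshold
if baseline_auc >= GOOD_ENOUGH:
    baseline.fit(X, y)
    joblib.dump(baseline, "model.joblib")  # roughly "call model.save() and walk away"
else:
    # Only now invest in heavier models, hyperparameter search, and feature selection.
    ...
```

The exact threshold isn’t the point; the point is that the decision to keep tuning is made explicitly, weighed against the value of whatever else you could be working on.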
Jul 16, 2019 • 48min

Joel Grus - The case against the Jupyter notebook

To most data scientists, the Jupyter notebook is a staple tool: it’s where they learned the ropes, it’s where they go to prototype models or explore their data — basically, it’s the default arena for all their data science work.

But Joel Grus isn’t like most data scientists: he’s a former hedge fund manager and former Googler, and the author of Data Science From Scratch. He currently works as a research engineer at the Allen Institute for Artificial Intelligence, and maintains a very active Twitter account. Oh, and he thinks you should stop using Jupyter notebooks. Now.

When you ask him why, he’ll provide many reasons, but a handful really stand out:

Hidden state: let’s say you define a variable like a = 1 in the first cell of your notebook. In a later cell, you assign it a new value, say a = 3. This results in fairly predictable behavior as long as you run your notebook in order, from top to bottom. But if you don’t — or worse still, if you run the a = 3 cell and delete it later — it can be hard, or even impossible, to know from a simple inspection of the notebook what the true state of your variables is.

Replicability: one of the most important things you can do to ensure you’re running repeatable data science experiments is to write robust, modular code. Jupyter notebooks implicitly discourage this, because they’re not designed to be modularized (awkward hacks do allow you to import one notebook into another, but they’re, well, awkward). What’s more, to reproduce another person’s results, you first need to reproduce the environment in which their code was run, and vanilla notebooks don’t give you a good way to do that.

Bad for teaching: Jupyter notebooks make it very easy to write terrible tutorials — you know, the kind where you mindlessly hit “shift-enter” a whole bunch of times and make your computer do a bunch of stuff you don’t actually understand? They lead to a lot of frustrated learners, or even worse, a lot of beginners who think they understand how to code but actually don’t.

Overall, Joel’s objections to Jupyter notebooks seem to come in large part from his somewhat philosophical view that data scientists should follow the same set of best practices that any good software engineer would. For instance, Joel stresses the importance of writing unit tests (even for data science code), and is a strong proponent of using type annotations (if you aren’t familiar with them, they’re well worth learning about).

But even Joel thinks Jupyter notebooks have a place in data science: if you’re poking around at a pandas dataframe to do some basic exploratory data analysis, it’s hard to think of a better way to produce helpful plots on the fly than the trusty ol’ Jupyter notebook.

Whatever side of the Jupyter debate you’re on, it’s hard to deny that Joel makes some compelling points. I’m not personally shutting down my Jupyter kernel just yet, but I’m guessing I’ll be firing up my favorite IDE a bit more often in the future.
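As a purely illustrative example of the engineering habits Joel argues for (this is not his code or anything from the episode), here is a sketch of a small, type-annotated function with a unit test, living in a plain Python module rather than a notebook cell:

```python
# Hypothetical example: a tiny, typed, testable function instead of a notebook cell.
from typing import Sequence


def normalize(values: Sequence[float]) -> list[float]:
    """Scale values to the [0, 1] range; returns zeros if all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize() -> None:
    # Runs under pytest from the command line: no cells, no hidden state.
    assert normalize([2.0, 2.0, 2.0]) == [0.0, 0.0, 0.0]
    assert normalize([0.0, 5.0, 10.0]) == [0.0, 0.5, 1.0]
```

Because the logic lives in an ordinary module, you can still import it into a notebook for exploration, while the type hints and test keep it honest.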
