06-19, 15:00–15:45 (Europe/London), Tower Suite 3
You use pandas every day. You know every keyword argument on every function, even .melt! You even know whether it's .rename, .rename_axis, or .set_axis that you want, and you get it right on the first try! So why bother learning Python? Sure, pandas is written in it, but outside of assembling parts of the pandas API, what's there that has any value in your life?
It's common for data scientists to narrowly focus on the APIs of the tools they use every day (pandas, matplotlib, pymc, dask, &c.) to the detriment of any focus on the surrounding programming language. In the case of tools like matplotlib, the total amount of Python we need to know is limited to what existed when matplotlib was first developed. (Did you know that matplotlib predates @property? That explains a lot…) In the case of newer tools like dask or pymc or even pandas, we may encounter some newer parts of Python (e.g., context managers or descriptors) as part of these tools' API design, but it's very easy to accept these as mere “syntax.”
In this talk, we will discuss where a deeper understanding of pure Python has direct and immediate consequences for your work as a data scientist. We will discuss where the parts of Python you may have skimmed over show up in analytical code, beyond the mere “syntax” of an API.
This talk will be organised around answering the following questions:
- why do generators even matter (and who cares about coroutines)?
- the itertools module is great… if I were writing scripts, but where does it show up in data analysis?
- object orientation seems like a bunch of bureaucracy—can it really simplify my analytical code?
- why should I bother with data types in the builtins and collections; is the pandas.DataFrame not enough?
- knowledge of Python internals would probably be useful, if I were a programmer writing scripts, but why do they matter for a data scientist?
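As a rough illustration of the first two questions, here is a minimal sketch (not taken from the talk itself) of a generator computing a rolling mean over a stream that never fits in memory. The function name rolling_mean and the synthetic feed built from itertools.count are hypothetical stand-ins chosen for this example; real analytical code would read from a file, socket, or database cursor instead.

```python
from collections import deque
from itertools import count, islice


def rolling_mean(values, window):
    """Lazily yield the mean of each full window over any iterable."""
    buf = deque(maxlen=window)  # oldest value is dropped automatically
    for value in values:
        buf.append(value)
        if len(buf) == window:
            yield sum(buf) / window


# Hypothetical endless feed of readings: 0.0, 1.0, 2.0, ...
readings = map(float, count())

# Only three results are ever computed, even though the feed is infinite.
first_three = list(islice(rolling_mean(readings, window=3), 3))
print(first_three)
```

Because the generator is lazy, islice can stop it after three results; nothing upstream is materialised as a list or a DataFrame.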
No previous knowledge expected
James Powell serves on the board of NumFOCUS as co-Chairman and Vice President. NumFOCUS is the 501(c)(3) non-profit that supports all the major tools in the Python data analysis ecosystem (incl. pandas, numpy, jupyter, matplotlib, and others). At NumFOCUS, he helps build global open source communities for data scientists, data engineers, and business analysts. He helps NumFOCUS run the PyData conference series and has sat on speaker selection and organizing committees for over two dozen conferences. James is also a prolific speaker: since 2013, he has given over seventy conference talks at over fifty Python events worldwide.