This past weekend I was happy to attend PyData Chicago 2016, put on by PyData and held at the UIC Student Center at the University of Illinois at Chicago. PyData Chicago 2016 was a conference all about using Python along with data science and open source tools to help data scientists, developers, and academics get their jobs done more efficiently. Before I go any further... if you are unfamiliar with PyData, it is a community of Python developers, data scientists, and open source leaders that holds events around the world each year to help grow the scientific, academic, and development uses of Python in data science. PyData is a great organization, and Chicago is lucky enough to have a monthly meetup that I try to make it to whenever I can. This actually comes as a surprise to many people because of my long history in mobile and web engineering rather than data science, but data science in the context of Python is a topic I find interesting strictly from an educational perspective. As you may know, I am a big fan of Python and mathematics, so, sooner or later, I was bound to find myself at a Python data science event out of pure interest.
On Saturday, I attended a very interesting talk by Tom Augspurger called "Mind the Gap! Bridging the pandas – scikit-learn dtype divide," where he discussed the differences in data types when working on a project in pandas or scikit-learn. One of the key take-aways from this talk for me was that "real-world" data is often very messy and heterogeneous, meaning that data used in industry is rarely of one type and often contains strings, floats, integers, and maybe even lists mixed together. So, when attempting to manipulate data in arrays or dataframes, the data often has to be cleaned up or bridged before a proper model can be built out. Another key take-away from the talk, which I will now attempt to put into my own words, was that pandas and scikit-learn do not need to be thought of as limited in functionality because of the data set you may be working with; instead, there are ways to bridge these problems and still get the job done. This was an excellent talk that impressively included lots of live coding and many real-life examples of working with the pandas library to take data from a starting point with mixed types to an ending point with properly typed data assigned to the appropriate rows and columns. To view the notes from Tom's talk, check out the GitHub repo he put up with his slides here.
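To make the dtype-bridging idea concrete, here is a minimal sketch of my own (a toy example, not taken from Tom's slides): a pandas DataFrame with mixed string and numeric columns gets one-hot encoded so that every column is numeric before it is handed to a scikit-learn estimator.

```python
# My own toy example (not from Tom's slides): converting a mixed-dtype
# pandas DataFrame into the all-numeric input that scikit-learn expects.
import pandas as pd
from sklearn.linear_model import LinearRegression

# "Real-world"-style data: strings, integers, and floats mixed together.
df = pd.DataFrame({
    "city": ["Chicago", "Austin", "Chicago", "Denver"],  # strings
    "visits": [3, 1, 4, 2],                              # integers
    "spend": [20.5, 9.99, 31.0, 15.25],                  # floats
})

# pd.get_dummies() one-hot encodes the string column so every feature
# column is numeric, bridging the pandas -> scikit-learn dtype gap.
X = pd.get_dummies(df[["city", "visits"]], columns=["city"])
y = df["spend"]

model = LinearRegression().fit(X, y)
print(model.predict(X))
```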
On Sunday morning I went to a very interesting talk by Piero Ferrante called "Creating a Contemporary Risk Management System Using Python." This was the best talk of the entire conference for me, and I think that might be because Piero presented the information in a way that was easy for a novice (like myself) or a seasoned veteran to pick up on and understand quickly. Piero talked about the different ways in which his company creates a forecast for a particular scenario. One of the key take-aways from Piero's talk was that every scenario is different when attempting to create a proper forecast. One method he described was to plot out existing or known data, remove a specific time window from that data, and then fit candidate models to the remaining data, checking which one reproduces the removed window as closely as possible. The model that most closely represents the held-out window is the one you would then use to attempt your forecast. I thought that this concept was excellent.
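As I understood it, this is essentially a holdout test. Here is a hedged sketch of my own (toy synthetic data and simple polynomial trend models, not Piero's actual code) showing how you might compare candidate models on a removed time window:

```python
# My own toy sketch of the holdout idea: remove a recent time window,
# fit candidate models on the rest, and keep whichever model reproduces
# the removed window most closely.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(120)
series = 0.5 * t + 10 * np.sin(t / 6) + rng.normal(0, 2, t.size)

holdout = 24                                  # the time window we "remove"
train_t, train_y = t[:-holdout], series[:-holdout]
test_t, test_y = t[-holdout:], series[-holdout:]

# Candidate models: polynomial trends of different degrees (stand-ins
# for whatever models fit your scenario).
candidates = {"linear trend": 1, "cubic trend": 3}

best_name, best_err = None, np.inf
for name, degree in candidates.items():
    coeffs = np.polyfit(train_t, train_y, degree)  # fit on the known data
    pred = np.polyval(coeffs, test_t)              # predict the removed window
    err = np.sqrt(np.mean((pred - test_y) ** 2))   # RMSE against the holdout
    print(f"{name}: RMSE = {err:.2f}")
    if err < best_err:
        best_name, best_err = name, err

print("Model to use for the forecast:", best_name)
```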
Another take-away from Piero's talk that I found very interesting was that he was not using pure Python to plot out his data for forecasting. He mentioned in his talk that he used a lot of R alongside Python to help him build out his models. R, if you are unfamiliar, is a statistical language for computing and graphics and is very prevalent among academics and statisticians. So, the take-away for me was that when attempting to build a forecasting model, there are good use cases for both Python and R. I wish I knew the specific role R played in this equation, but what I do know is that, in this case, both languages were used side-by-side.
Please let me know if you have any questions, comments, or concerns. I would love to hear feedback or corrections on anything that I may have misinterpreted in this post. Thanks!
The idea for the image in this blog post was referenced from the PyData Chicago website.