Recently I jumped in and taught myself how to do medium-sized data exploration and machine learning. (Excel-sized < My Data Set < Big Data)

If you are a real data scientist or expert, skip this. It isn't for you. But if you have a good data set and want to start playing with it and learning some of the tools of modern data analytics, perhaps this can save you some time.

Matlab vs. R vs. Python

If you work at a university or big company, maybe you have access to Matlab, which is apparently great, but expensive. I didn't.

A physicist I was working with knew and used R. It is apparently incredibly powerful and has many of the most cutting-edge algorithms. BUT I found the syntax baffling, and the documentation, while copious, written for mathematicians rather than hackers. Overall it was difficult and frustrating.

Python, on the other hand, is a dream. The language is easy. The documentation is copious and comprehensible. The online community is awesome. And in addition to all the analysis, you can do data munging as well. Python was my pick, and I think it was the right one.

The next step is picking the packages to support Python.

Python Packages for Analysis

I'm sure there are a lot of different choices with pluses and minuses, but this set served me very well, came after reasonable research, and never let me down. So it's a good starting point.

  • Python 2.7 (vs. 3.x) - It feels weird to use an older version, but all the packages work with 2.x, and some might not work with 3.x, so I went with 2.x and never had a problem.
  • NumPy & SciPy - these are the core packages for scientific computation, array manipulation, etc. They are the base.
  • Matplotlib - this allows you to do lots of easy visual analysis, charting, etc. Very powerful, but also easy for easy things. A histogram is a couple of lines.
  • IPython - this is the standard interactive shell for scientific computing in Python. The key is that you can easily run commands from the shell as you experiment. I played with others, but I wish I had switched to IPython sooner.
  • Console2 (windows) - If you use Windows, get the free Console2 and hook iPython (and Git, and cygwin, etc.) up to it.
  • Pandas - Pandas implements R's "dataframe" concept in Python. Basically it wraps NumPy arrays so you can reference rows and columns by labels instead of just numbers. Overall, I was mixed on Pandas. It's undoubtedly powerful, and once you are good at it, you can do things simply and elegantly. But I also found that the vast majority of my debugging and tinkering was in Pandas. So I'd say skip it at the start, but know it's out there in case you feel you need it.
  • Statsmodels - This is a statistics package that did the trick for my limited usage.
  • Scikit-learn - A great package with lots of machine learning algorithms, easy to use, and with great documentation.
    • Orange - Orange is another ML package which is very highly recommended, and it also has a GUI. But it looked very un-Pythonic to my novice eyes. I was trying to figure out how to read in a dict of my features and got lost in a seemingly endless list of Orange classes. When I looked at sklearn, it was a one-liner. So sklearn was great for me, but you might want to check out Orange.
  • NLTK - Natural Language Toolkit - useful for some base machine learning (though sklearn is better), but good if you are doing any NLP.
  • Excel - Don't forget Excel! I would often do my analysis in Python, then dump some results into Excel for quick, easy data exploration. If you are really good with the above tools, you probably don't need it. But Excel is SO easy that I found it a really valuable tool.
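The Matplotlib bullet above isn't exaggerating: a histogram really is a couple of lines. Here's a minimal sketch using made-up random data (the filename and bin count are just for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np

data = np.random.normal(loc=0.0, scale=1.0, size=1000)  # 1000 fake samples
counts, bins, _ = plt.hist(data, bins=30)               # draw the histogram
plt.savefig("histogram.png")                            # or plt.show() interactively
```

In an IPython session you'd skip the backend lines and just call `plt.hist(data)` and `plt.show()`.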
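To make the Pandas bullet concrete, here's what "rows and columns by labels" looks like next to plain NumPy positional indexing (the data and labels are toy examples):

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0],
                [3.0, 4.0]])
# Wrap the array so rows and columns get names instead of just positions
df = pd.DataFrame(arr, index=["alice", "bob"],
                  columns=["height", "weight"])

by_label = df.loc["alice", "weight"]  # label-based lookup: 2.0
by_position = arr[0, 1]               # plain NumPy: same cell, 2.0
```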
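For comparison with the Orange anecdote: reading a dict of features into scikit-learn is essentially a one-liner via `DictVectorizer`. A sketch with invented feature names:

```python
from sklearn.feature_extraction import DictVectorizer

# Each example is a dict of feature-name -> value (hypothetical features)
examples = [
    {"height": 1.8, "weight": 80.0},
    {"height": 1.6, "weight": 55.0},
]
X = DictVectorizer(sparse=False).fit_transform(examples)  # 2x2 float array
```

The resulting array plugs straight into any scikit-learn estimator's `fit` method.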
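The "dump some results into Excel" step can be as simple as writing a CSV file, which Excel opens directly. The column names and values below are made up:

```python
results = [("alpha", 0.91), ("beta", 0.87)]  # e.g. (model, score) pairs

with open("results.csv", "w") as f:
    f.write("name,score\n")                  # header row
    for name, score in results:
        f.write("%s,%.2f\n" % (name, score))
```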


Also, if you need to clean up your data to get it into a usable state, you might try Data Wrangler or Google Refine. I love the concept of both (and Wrangler is wicked-cool), but they were both buggy for me; if you are good with Python, just use it.
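"Just use Python" for cleanup can mean as little as a few lines. A sketch that trims whitespace, normalizes case, and drops blank rows (the raw rows and field layout here are invented):

```python
raw_rows = ["  Alice , NYC ", "BOB,sf", "   ", ",  "]

cleaned = []
for row in raw_rows:
    fields = [f.strip() for f in row.split(",")]  # trim whitespace
    if not any(fields):                           # skip rows with no content
        continue
    cleaned.append([f.lower() for f in fields])   # normalize case

# cleaned -> [['alice', 'nyc'], ['bob', 'sf']]
```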

Happy data exploring!

