Python-based alternatives
That is easier said than done. As so often in the open source world, a whole new ecosystem of tools opens up, with a rich, exotic and international flora of code plants. At first glance it was not easy to tell the individual tools, their abbreviations and their focuses apart. I also realized very soon that the motivation of the developers and the professional backgrounds of the early users differ clearly from those of SPSS or SAS jockeys. They were clearly focused on machine learning. What I mean is, for example:
- Clearly technical orientation: There are more ways to manipulate the data at a low level yourself. The proximity to the underlying C implementation is recognizable, and explicitly choosing data types is supported and pays off.
- High acceptance of black-box procedures (for scikit-learn): The end justifies the means. If scoring with an exotic procedure delivers better performance, that is fine, even if I cannot explain the interaction of the predictors in a simple way.
- Brute force gets you there: Both the built-in support for parallelization and the grid search procedures for parameter optimization quickly tempt the user to just let the CPUs glow :-).
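To make the brute-force point concrete, here is a minimal sketch of a parallel grid search in scikit-learn. The synthetic data set, the choice of logistic regression and the candidate values for C are my own assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic toy data (an assumption for this sketch, not real project data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values for the regularization strength C (chosen arbitrarily).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# n_jobs=-1 asks scikit-learn to use all available CPU cores.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)

best_C = search.best_params_["C"]
best_score = search.best_score_  # mean cross-validated accuracy of the best candidate
print(best_C, round(best_score, 3))
```

Every additional parameter multiplies the number of candidate models, which is exactly why the CPUs start to glow so quickly.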
Machine learning in Python does not have a monopoly on these traits; they can apply to R as well, although R frequently needs extensions for them. It is rather my subjective impression, formed by the documentation, the tutorials and the focus of the functionality. Even a look at the scikit-learn algorithm cheat sheet suggests, at first glance, that there are no conventional regression methods at all. It jumps straight to regularized methods, and for a binary classification problem it suggests SVMs instead of logistic regression.
Even if this whole hotchpotch of Python modules seems somewhat confusing and inaccessible (a basket full of snakes), I see enormous potential in this technology. It enables fast, formally rigorous, highly automatable and easily deployable machine learning applications that scale to large workloads.
Overview of Packages
The following list is designed to give beginners an initial guide through the jungle of abbreviations:
NumPy
NumPy provides high-performance data structures. It excels at processing data in very large vectors and matrices. The core is implemented in C and Fortran, which is one reason for the good performance.
NumPy also includes facilities for mathematical calculations; however, that is not its primary purpose.
NumPy's data types and data structures form the basis for all further Python technologies listed here.
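A small sketch of what the explicit choice of data types and the vectorized processing look like in practice. The array size and the values are arbitrary assumptions.

```python
import numpy as np

# One million values stored as 32-bit floats: 4 bytes per element.
values = np.arange(1_000_000, dtype=np.float32)
print(values.dtype, values.nbytes)

# The same data as 64-bit floats needs twice the memory.
as_double = values.astype(np.float64)
print(as_double.dtype, as_double.nbytes)

# Vectorized arithmetic: no explicit Python loop, the work happens in compiled code.
scaled = values * 2.0 + 1.0

# The first six values viewed as a 2x3 matrix, with sums per column.
matrix = values[:6].reshape(2, 3)
col_sums = matrix.sum(axis=0)
print(col_sums)
```

Choosing the smaller type halves the memory footprint here, which is the kind of low-level control that pays off on large data.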
SciPy
SciPy builds on NumPy and extends it with scientific computing capabilities. SciPy, together with NumPy underneath it, can be seen as a small Python alternative to Matlab.
SciPy is rather uninteresting for everyday analysis or for machine learning.
pandas
pandas also builds on NumPy and makes data transformation and analysis more convenient.
NumPy structures are wrapped in objects that closely resemble R data types: there are Series (R vectors) and DataFrames (R data frames). Subsetting and filtering, computing new variables, aggregating and joining data frames work relatively intuitively and are clearly inspired by R in terms of usability.
pandas objects simplify the work by providing methods for descriptive statistics and visualization. The standard tools for exploratory data analysis are thus included right away. Given the wealth of built-in methods, it is worth browsing the documentation frequently.
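The operations mentioned above can be sketched in a few lines. The table contents and the column names (region, units, price, manager) are made up for illustration.

```python
import pandas as pd

# A small, made-up sales table.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 4, 7, 12],
    "price": [2.5, 3.0, 2.5, 3.0],
})

# Compute a new variable from existing columns.
df["revenue"] = df["units"] * df["price"]

# Subsetting / filtering with a boolean condition.
large = df[df["units"] > 5]

# Aggregation per group.
by_region = df.groupby("region")["revenue"].sum()

# Joining with a second (equally made-up) data frame.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ann", "Bob"]})
merged = df.merge(managers, on="region", how="left")

# Descriptive statistics for all numeric columns in one call.
summary = df.describe()
print(summary)
```

Anyone coming from R will recognize most of these steps; the method names differ more than the underlying ideas.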
scikit-learn
This package is Python's central machine learning library. It offers many modern methods for every area of machine learning. Because every kind of model is addressed through the same interface, you can test various models against each other with generic code much more easily than in R. The included set of standard tools for variable transformations is also remarkable.
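As a rough sketch of what this uniform interface means: three quite different model types can be compared in one generic loop. The synthetic data, the selection of models and their settings are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic toy data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Different model types, all used through the same estimator interface.
# StandardScaler is one of the included variable transformations.
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# The comparison code does not need to know which model it is handling.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Adding a fourth candidate only means adding one more entry to the dictionary; the loop stays untouched.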
The downside is that scikit-learn's modeling only models and does not do much more. There is no model summary and there are no charts; even some of the simplest diagnostics, such as residuals, have to be calculated manually.
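Calculating residuals by hand is not difficult, it just is not done for you. A minimal sketch with simulated data (the coefficients and the noise level are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)

# There is no residuals attribute: compute observed minus predicted yourself.
residuals = y - model.predict(X)

# score() returns R^2 for regressors, one of the few built-in key figures.
r_squared = model.score(X, y)
print(model.coef_, model.intercept_, round(r_squared, 3))
```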
The name scikit originates from "SciPy Toolkit". It was developed on top of SciPy, as SciPy itself does not include the relevant methods.
statsmodels
statsmodels is a very new alternative to scikit-learn that takes much more inspiration from R and focuses more on classical modeling and data mining. After fitting a model you get a nice summary with a useful selection of key figures. The model object also holds many reusable results, as one is used to from R. There is even an interface that accepts model formulas in the same notation as R.
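A short sketch of the formula interface and the summary, assuming a simulated data frame with the hypothetical columns y, x1 and x2:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data; the true coefficients are assumptions of this sketch.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)

# The model is specified with an R-style formula.
result = smf.ols("y ~ x1 + x2", data=df).fit()

# A summary table with coefficients, standard errors, R^2 and more.
print(result.summary())

# Reusable results are stored on the fitted object.
coefficients = result.params
residuals = result.resid
```

Compared with the scikit-learn approach, the residuals and key figures come with the fitted model instead of being computed by hand.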
IPython
First of all, IPython is an interactive command line for Python. Its intelligent auto-completion makes exploring data dramatically easier. However, the console should not be equated with the IPython notebook. The notebook lifts the IPython console into a web-based working environment that includes many usability features, displays charts and output alongside the code while you work, and offers publishing options in the spirit of "reproducible research". So whoever is looking for an alternative to R's knitr inevitably arrives at the IPython notebook.
matplotlib
Machine learning in Python only becomes well-rounded with attractive visualizations, and this is where matplotlib steps in. The graphics library, which was originally developed independently of the technologies above, enables any conceivable visualization, because every object in a chart, however small, remains selectable and modifiable, given sufficient skill on the developer's part. At the same time, it offers enough high-level functions for beginners to quickly create their first publication-quality charts.
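Both sides can be seen in a small sketch: a few high-level calls produce the chart, and the individual objects remain accessible afterwards. The plotted function and the styling choices are arbitrary assumptions; the Agg backend is selected only so that the example runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# High level: a complete chart in a few calls.
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first chart")
ax.legend()

# Low level: every object in the chart can still be selected and modified.
line.set_linewidth(2.5)
line.set_linestyle("--")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# Render to an in-memory buffer; pass a file name instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
```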