As a kid in the early 1990s, I had my first go at a programming language; back then, it was Turbo Pascal. Through professional experience and private interest, I have since written smaller programs in C++, somewhat larger ones in Java, Perl and JavaScript, full-blown applications for end customers in VBA, and professional projects in PHP over a period of five years. However, none of these languages won me over from the first moment the way Python did.
Admittedly, Python in its standard distribution, as a high-level language, cannot hold a candle to languages such as C++ or Java when it comes to performance. The latter are simply faster and more optimized, and they have the upper hand in professional software development.
Why would a data scientist program anything?
That is no problem for us: data scientists are probably seldom required to develop complex applications. So why should we deal with programming languages at all?
I have recently seen a number of job ads for "data scientists" that require only experience with one statistics tool and basic SQL skills. No wonder there is talk of a "data scientist" hype if the term is understood as nothing more than a rebranding of statisticians.
Well, every real data scientist faces the task of first preparing his or her data, then making the analysis results accessible to other users, and in some cases preparing reports or presentations. I, for one, always run into minor or major issues in this process that I can only solve awkwardly or with a lot of manual effort unless I use a little programming to assist me. In addition, programs allow me to automate processes and thus avoid clerical mistakes.
Python - the language of Data Scientists?
This is where Python as a scripting language comes into play. A little script can be cobbled together quickly and without much set-up, solving the issue immediately. Of course, I could have used Perl or abused PHP for all these cases as well. But no competitor of Python has its upward potential. Python should not be dismissed as just one of many scripting languages. It combines many of the major programming paradigms: for example, the language is fully object-oriented but also supports functional programming. One can accuse Python of stealing the best ideas from other languages. Its author, Guido van Rossum, responded to this to the effect that bundling the highlights of other languages had been the very intention.
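To make the point about paradigms concrete, here is a minimal sketch that computes the same result twice, once in an object-oriented and once in a functional style. The class name `Temperatures`, the function name and the sample values are made up purely for illustration.

```python
from functools import reduce


class Temperatures:
    """Object-oriented style: data and behaviour live together (hypothetical example class)."""

    def __init__(self, celsius_values):
        self.celsius_values = list(celsius_values)

    def mean_fahrenheit(self):
        fahrenheit = [c * 9 / 5 + 32 for c in self.celsius_values]
        return sum(fahrenheit) / len(fahrenheit)


def mean_fahrenheit_functional(celsius_values):
    """Functional style: combine map and reduce without any mutable state."""
    fahrenheit = list(map(lambda c: c * 9 / 5 + 32, celsius_values))
    return reduce(lambda acc, f: acc + f, fahrenheit, 0.0) / len(fahrenheit)


readings = [0, 10, 20]  # made-up sample data
oo_result = Temperatures(readings).mean_fahrenheit()
fn_result = mean_fahrenheit_functional(readings)
print(oo_result, fn_result)
```

Both calls yield the same mean; which style reads better is largely a matter of taste and of how much state the task involves.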
Especially over the last five years, I have seen the language spread rapidly. Everyone who uses the internet has almost certainly used services that rely, among other things, on Python, most prominently Google, YouTube and Dropbox.
Data scientists have an ever-increasing range of dedicated modules at their disposal for data preparation (e.g. http://pandas.pydata.org/), analysis (e.g. http://www.scipy.org/) and reporting (e.g. http://matplotlib.org/). Recently, I even found an article that views the language as the future replacement for tools such as R: http://readwrite.com/2013/11/25/python-displacing-r-as-the-programming-language-for-data-science.
I would not take such an extreme view by a long shot. A lot of experience with the language is required in order to replace established solutions such as R.
Advantages of Python
Personally, I am thoroughly convinced by the language because...
- its code is clearly legible and comprehensible.
- it provides a lot of functionality right out of the box.
- it achieves a huge amount with few lines. It gets the job done!
- by now, there is plenty of good literature available.
- one can get started immediately after a short, easy set-up.
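As a small illustration of the "few lines" and "out of the box" points, the following sketch counts word frequencies using nothing but the standard library. The sample text is made up.

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"  # made-up sample text
word_counts = Counter(text.split())  # maps each word to its number of occurrences
top_word, top_count = word_counts.most_common(1)[0]
print(top_word, top_count)
```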
Good experiences with Python
In my work environment, I have used Python for the following things, among others, over the last seven years:
- Parsing files
- Parsing web sources
- ETLs of all kinds with various database systems
- Dynamic web applications with the web2py framework
- Huge volumes of automated reports in Excel or LaTeX
and a lot more...
A little example
But I have raved enough. Let me give you a little example. The message in the first line is supposed to be converted into "IMHO learns Python!". The long way of doing it, which walks through the words and abbreviates only the first four, is shown first.
First, the message is split into a list of words at each space (split command). Then a loop runs through this list. If the current word is among the first four, only its first letter is taken and appended to the new message. The remaining words are appended unchanged, with a leading space. As the new message is still a list of strings at this point, it is joined into one single string for output, using an empty string as the glue between the pieces (join command); this happens inside the print command.
Below that is the advanced option, everything in one line :-D.
message = "In My Humble Opinion learns Python!"

# the detailed variant
message_list = message.split(' ')
new_message = []
for word in message_list:
    # membership test: is this word one of the first four words?
    if word in message_list[0:4]:
        new_message.append(word[0:1])
    else:
        new_message.append(" " + word)
print(''.join(new_message))

# and here all of that in just one line
print(''.join([word[0:1] if word in message.split(' ')[0:4] else ' ' + word for word in message.split(' ')]))
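One caveat with both variants above: `word in message_list[0:4]` tests membership rather than position, so a later word that happens to equal one of the first four would be abbreviated as well. A possible alternative, sketched here under the assumption that only the position should matter, slices the word list instead. The function name `abbreviate` and its parameter `n` are my own choices for illustration and are not part of the script above.

```python
def abbreviate(message, n=4):
    """Replace the first n words by their initials; keep the remaining words unchanged."""
    words = message.split(' ')
    initials = ''.join(word[:1] for word in words[:n])  # first letters of words 1..n
    rest = ''.join(' ' + word for word in words[n:])    # remaining words, space-prefixed
    return initials + rest


print(abbreviate("In My Humble Opinion learns Python!"))
# A word repeated after position n is left alone here:
print(abbreviate("In My Humble Opinion In Python!"))
```

Wrapping the logic in a function also makes it easy to reuse with a different number of leading words.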