Why Python for Data Analysis

The S language and its open source implementation - R - dominate the toolboxes of most data analysts. Whole armies of PhD candidates have implemented their dissertations as R libraries, enriching the R ecosystem in the process. R supports fantastic plotting functionality. The user community provides great technical and moral support. And all of this comes for free! If you need more reliable technical support, Revolution Analytics provides a commercial R distribution. So why would you consider another language for statistics, data analysis or predictive modeling?

I could re-write the previous paragraph and replace the references to either the S language or the R project with Python. In addition, Python is truly object oriented and is known for its clean, simple and readable syntax. This makes learning Python and maintaining Python code much easier. The language is widely used in a variety of software engineering applications and is growing in its popularity for statistical and data analysis applications.

Python has been adopted by MIT and UC Berkeley as the introductory programming language. Many math, statistical and computer science students will have more exposure to Python than R, SAS, Stata, or other statistics specific tools It's used because the syntax is simple, consistent and clean so that students can focus on the computer science ideas rather than worry about cryptic syntax. The result? The recent grads that you hire probably know Python better than the other standard statistical tools. Python is even replacing Matlab as the tool of choice for Scientific and Numeric computation.

OK, you're curious and you're now wondering "But can I fit a GLM with Python?" The answer is yes, but you'll need a few additional Python packages:

  • scipy
  • bumpy
  • pandas
  • matplotplib
  • statsmodels

Fortunately these are all available through the scipy.org website.

Step 1. Download the scipy stack.

Most scientific and statistical computing applications written in Python use the scipy packages. This is actually a collection of packages based that include scipy, numpy, pandas, statsmodels and patsy. The collection can be downloaded at the scipy website.

The scientific and statistical computing libraries are not part of Python's core library. So the first step will be to import numpy, pandas, statsmodels.api, and patsy.

import numpy as np 
import pandas as pd
import statsmodels.api as sm
import patsy as py

Of course we can't fit a predictive model without some data. In this example, I'll construct a simple frequency model using the Python programming language and the Allstate Claim Prediction Challenge data set hosted on Kaggle. Details about the data set are available on the Kaggle competition site. You'll need to create an account to download the data or to participate in a future Kaggle competition.

Importing the data into a DataFrame object can be done in a single line.

claims = pd.read_csv('/.../train_set.csv')

This will automatically read in the data set and infer the data types. If you want more control over the data set types you can specify the data types that the 'read_csv' function should use as it imports the data. We need to use a Python dictionary object to specify the data types. Dictionaries are sets of (key, value) pairs. Each key must be unique. The data type dictionary must be of the form (Variable Name, Data Type).

colTypes = {'Household_ID': int, 'Vehicle':int, 
 'Calendar_Year':int, 'Model_Year':int,
 'Blind_Make': pd.Categorical, 'Blind_Model':str,
 'Blind_Submodel': str, 'Cat1':pd.Categorical, 
 'Cat2': pd.Categorical, 'Cat3':pd.Categorical,
 'Cat4': pd.Categorical, 'Cat5':pd.Categorical,
 'Cat6': pd.Categorical, 'Cat7':pd.Categorical,
 'Cat8': pd.Categorical, 'Cat9':pd.Categorical,
 'Cat10': pd.Categorical, 'Cat11':pd.Categorical,
 'Cat12': pd.Categorical, 'OrdCat':pd.Categorical,
 'Var1': np.float64, 'Var2': np.float64,
 'Var3': np.float64, 'Var4': np.float64,
 'Var5': np.float64, 'Var6': np.float64,
 'Var7': np.float64, 'Var8': np.float64,
 'NVCat': pd.Categorical, 'NVVar1': np.float64,
 'NVVar2': np.float64, 'NVVar3': np.float64,
 'NVVar4': np.float64, 'Claim_Amount': np.float64}

The 'Categorical' data type is defined in the pandas library and the 'float64' data type is defined in the numpy library

I can now read in the data set with the predefined data types.

claims = pd.read_csv('/.../train_set.csv',
 dtype = colTypes)

It's always a good idea to do some visual inspection of the data set to make sure that it was imported correctly. This includes checking the column names and some basic variable statistics.

# Check the column names
claims.columns

# Check the Categroical Variables
makes = claims['Blind_Make'].value_counts()
cat1 = claims['Cat1'].value_counts()

# Check some basic variable statistics
claims.describe()

# Check how many claims are greater than zero
sum(claims['Claim_Amount']

Now that we are satisfied that the data was imported correctly, we want to under stand the data set a little better. The key method is visual analysis. We can produce a simple histogram using the 'hist()' method that is part of the 'DataFrame' object.

claims['Var1'].hist()

Partical Data Analysis with Python published a wonderful guide to using pandas for data visualization.

If you want to save the figure to a file you can save the plot as a Python object then use the 'get_figure()' and 'savefig()' methods to save the plot to a file. The 'savefig()' method uses the file extension to determine what format to use.

ax = claims['Var1'].hist()
fig = ax.get_figure()
fig.savefig('/.../claimVar1Histogram.png')

The data set has a 'Claim_Amount' variable but we need a simple binary indicator variable to fit a Logit model. We need to create the binary indicator.

claims['Claim_Ind'] = (claims['Claim_Amount']).astype(int)

The parenthesis creates a tuple object which contains true or false vales. The '.astype(int)' method coverts the true and false values into 1 and 0 values.

Avoid the temptation to define the indicator variable as a boolean (true/false) variable. The problem with defining the indicator as a boolean variable is that patsy will covert the boolean variable a tuple. The value 'true' will be converted to (1, 0) not 1, and the value 'false' will be converted to (0, 1) not 0.

Next we can define the model and create the necessary design matrices. Of course we double check the design matrices.

f = 'Claim_Ind ~ Cat1 + Cat2 + Var1 + Var2'
y, X = py.dmatrices(f, claims, return_type = 'data frame')

X.describe()
y.describe()

We are now ready to fit the model and review the summary.

model = sm.Logit(y, X).fit()
model.summary()

Like R, there are multiple way to achieve the same outcome. This method avoids having to create the design matrices as a separate step.

mod = sm.formula.logit(f, data=claims)
res = mod.fit(method='bfgs', maxiter=1000)

This is just a simple example of what you can do with Python. The possibilities are endless.