R is an extremely useful statistical tool, extended by thousands of packages contributed by its users. With so much available, R can seem quite difficult to master. Several websites offer introductions to R that can be very helpful in overcoming the steep learning curve. A list of useful websites is provided at the end of this page.
If you have not done so yet, I recommend you download R. It can be found at http://cran.r-project.org and is available for Windows, Mac, and Linux. If you are having difficulty installing R, additional information can be found in the CRAN installation manual. You may also find it useful to download an IDE (integrated development environment) for R. I suggest downloading RStudio, as it has a simpler interface than R's own.
Basic Data Entry
With R a task as simple as data entry can seem difficult. However, with proper help it is very simple. I will cover some of the more basic methods for data entry. More advanced methods can be found in the CRAN manual or at Quick-R. I highly recommend you visit Quick-R as I have found it most helpful in understanding R.
Creating a Data Set: Simple Example
Assume you want to manually enter information about two drivers (a 24-year-old male with a BMW and a 75-year-old female with a Buick) and save it as Driver.
To do this we would write the following code.
# create variable driver
Gender <- c("Male", "Female")
Age <- c(24,75)
VehicleType <- c("BMW", "Buick")
Driver = data.frame(Gender,Age,VehicleType)
# note the = and <- are both acceptable
If you have a larger data set an incredibly useful way to enter the data is through R's editor. Let's redo the same example renaming our variable Driver2 so we can see the new variable we create.
#using the editor is like entering data into a spreadsheet
#before using the editor we need to define Driver2
Driver2 <- data.frame(Gender=character(0), Age=numeric(0), VehicleType=character(0))
Driver2 <- edit(Driver2)
#You can now use the editor to enter your data and even to add more variables
Play around a little bit with the editor. You'll notice you can edit Driver2, add additional variables, and choose their variable type.
Packages exist to help import data from many different statistical software packages. In this introduction I will only cover importing data from comma delimited files. If your data set isn't too large, you could save it as a comma delimited file and then import it as shown below. For information on importing data from other sources I recommend Quick-R.
Here is a sample .csv document to test your importing abilities. The most important thing to note is that file paths in R use / instead of the \ common on Windows, because \ is an escape character inside R strings.
# remember use "/" not "\"
Driver <- read.table("C:/Users/Joel/Desktop/Driver Info.csv", header=TRUE,sep=",")
#header=TRUE denotes that the first line contains variable names
#sep could also be "" to separate by whitespace
#it's a good idea to look at your data immediately after importing
Importing data in this manner is quite simple. For additional arguments or information on the arguments just type ?read.table. As with any function in R this will bring up the R documentation for the function.
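If you would like to try an import without a file on your own disk, a round trip like the following may help. This is only a minimal sketch: the temporary file and the name Driver3 are my assumptions for illustration, and I use read.csv(), a wrapper around read.table() whose defaults include header=TRUE and sep=",".

```r
# Build a small data frame and write it to a temporary .csv file
# (tempfile() and the name Driver3 are illustrative assumptions)
Driver <- data.frame(Gender = c("Male", "Female"),
                     Age = c(24, 75),
                     VehicleType = c("BMW", "Buick"))
csv_path <- tempfile(fileext = ".csv")
write.csv(Driver, csv_path, row.names = FALSE)

# Read it back; read.csv() defaults to header=TRUE and sep=","
Driver3 <- read.csv(csv_path)

# Look at the data immediately after importing
head(Driver3)
str(Driver3)
```

The same call with your own path, for example read.csv("C:/Users/Joel/Desktop/Driver Info.csv"), should behave like the read.table() line above.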
You may also want to change the directory if you commonly load files from the same location. This can be done by selecting File, Change dir.. and then selecting the location. Or you can use the following.
#check the current working directory
#change the working directory
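A minimal sketch of both steps, using base R's getwd() and setwd(). I use tempdir() as a stand-in folder so the example runs anywhere; in practice you would pass your own location.

```r
# Check the current working directory
old_dir <- getwd()
print(old_dir)

# Change the working directory; tempdir() is only a stand-in here,
# in practice use your own folder, e.g. setwd("C:/Users/Joel/Desktop")
setwd(tempdir())
new_dir <- getwd()
print(new_dir)

# Return to where we started
setwd(old_dir)
```

Once the working directory is set, file names without a full path are looked up relative to it.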
It is often necessary to load packages in order to run specific functions.
A package may be installed by using the interface, selecting Packages, selecting a mirror (just choose a location close to you) and then selecting the package that you would like to use. An installed package still has to be loaded into a session in order to be used. If you are using RStudio you can use the GUI to check which packages you have and to activate them. If not you can use library() (R is case sensitive, so Library() will not work).
#show installed packages
#load package for example MASS
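Here is a short sketch of those two steps. It assumes MASS is installed, which is normally the case because it ships with standard R installations; the variable name installed is mine.

```r
# Show installed packages: library() with no arguments opens a listing,
# while installed.packages() returns the details as a matrix
installed <- rownames(installed.packages())
head(installed)

# Load a package, for example MASS
library(MASS)

# Confirm that MASS is now attached to the session
"package:MASS" %in% search()
```

For a package you do not have yet, install.packages("name") downloads it first; library(name) then loads it.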
Before we can do any type of analysis it is necessary that we understand our data. We might want to look at the first few or the last few entries, count how many columns we have, or view some form of graphical demonstration of our data.
Let's create a data set so that we have something to look at.
#First we will generate a few distributions
Beta = rbeta(150,2,5)
Binomial = rbinom(150,7,.3)
Uniform = runif(150)
Normal = rnorm(150)
ChiSqr = rchisq(150,3)
Distributions = data.frame(Beta,Binomial,Uniform,Normal,ChiSqr)
Non-Graphical Methods for Looking at Data
Let's check out the first five entries in Distributions.
# view the first 5 or last 5 values
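One way to do this is with head() and tail(). The sketch below rebuilds a two-column version of Distributions so that it runs on its own; the seed and the names first_five and last_five are my assumptions.

```r
set.seed(2024)  # assumed seed, only so the draws are repeatable
Distributions <- data.frame(Normal = rnorm(150), ChiSqr = rchisq(150, 3))

# View the first 5 and the last 5 rows
first_five <- head(Distributions, 5)
last_five <- tail(Distributions, 5)
print(first_five)
print(last_five)
```

Both functions show 6 rows if you leave out the second argument.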
Now I'll quickly cover some other non-graphical output. We can count the number of rows or columns, check the names of variables, look at the structure of our data set, find the dimensions of our data set, and find the class of our data set.
#count number of rows or columns
# returns the names of the variables
# returns the structure of the dataset
# returns the dimensions of the dataset
# returns the class of the dataset
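The functions I have in mind here are nrow(), ncol(), names(), str(), dim() and class(). A minimal sketch, regenerating Distributions with an assumed seed so the block is self-contained:

```r
set.seed(2024)  # assumed seed for repeatable draws
Distributions <- data.frame(Beta = rbeta(150, 2, 5),
                            Binomial = rbinom(150, 7, 0.3),
                            Uniform = runif(150),
                            Normal = rnorm(150),
                            ChiSqr = rchisq(150, 3))

nrow(Distributions)   # count number of rows
ncol(Distributions)   # count number of columns
names(Distributions)  # names of the variables
str(Distributions)    # structure of the data set
dim(Distributions)    # dimensions (rows, then columns)
class(Distributions)  # class of the data set
```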
Graphical Methods for Reviewing Data
Some simple, commonly used visual aids for looking at data are histograms, boxplots, Q-Q plots and bar graphs.
Histograms are simple enough, but fine tuning them can be difficult.
# Histogram of Beta Distribution
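A basic histogram needs only hist(). The tuning arguments below (breaks, col, main, xlab) are choices I picked for illustration, not required settings, and the seed is assumed.

```r
set.seed(2024)  # assumed seed for repeatable draws
Beta <- rbeta(150, 2, 5)

# Histogram of Beta distribution
# breaks is only a suggestion to R; col, main and xlab are cosmetic
h <- hist(Beta, breaks = 15, col = "lightblue",
          main = "Histogram of Beta Distribution", xlab = "Beta")

# hist() also returns the bin counts it plotted
h$counts
```

Saving the result, as in h above, is optional; it is handy when you want the bin boundaries or counts as numbers.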
For more information on histograms, look at this PowerPoint on histograms or some demonstrations done at Quick-R.
boxplot(Distributions$ChiSqr, range = 1.1, col="yellow", horizontal=TRUE)
#remember, to learn more about any function use ?function
QQPlot to check for normality
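A Q-Q plot can be drawn with qqnorm(), and qqline() adds a reference line. In this sketch (seed, colours and titles are my assumptions) the Normal draws should follow the line closely, while the ChiSqr draws should bend away from it.

```r
set.seed(2024)  # assumed seed for repeatable draws
Normal <- rnorm(150)
ChiSqr <- rchisq(150, 3)

# Q-Q plot of the normal draws against normal quantiles
qq <- qqnorm(Normal, main = "Normal Q-Q Plot: Normal")
qqline(Normal, col = "red")

# The same plot for the skewed chi-square draws, for comparison
qqnorm(ChiSqr, main = "Normal Q-Q Plot: ChiSqr")
qqline(ChiSqr, col = "red")
```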
Bar or Line Graph
The data set we generated isn't well suited to a bar or line graph, so let's create and view the following data set representing snowfall.
#Create data set
#Type in the data to create the appropriate data set.
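As a rough sketch of what this could look like: the months and snowfall amounts below are hypothetical values I made up for illustration, as are the column names Month and Inches, so substitute your own figures.

```r
# HYPOTHETICAL values for illustration only; replace with the real figures
Snowfall <- data.frame(Month = c("Nov", "Dec", "Jan", "Feb", "Mar"),
                       Inches = c(4, 12, 18, 15, 6))

# Bar graph; barplot() returns the midpoints of the bars
mids <- barplot(Snowfall$Inches, names.arg = Snowfall$Month,
                col = "lightblue", main = "Snowfall by Month", ylab = "Inches")

# Line graph; the x axis is drawn by hand so it shows the month names
plot(Snowfall$Inches, type = "b", xaxt = "n",
     main = "Snowfall by Month", xlab = "Month", ylab = "Inches")
axis(1, at = seq_along(Snowfall$Month), labels = Snowfall$Month)
```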
In some cases it might be necessary to count the number of occurrences of each value of a variable. Let's use the data set you previously imported as Driver.
#Bar graph of number of Males and Females
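Counting can be done with table(), and the result can be passed straight to barplot(). This sketch assumes Driver holds the two drivers from the first example; your imported file may of course contain more rows.

```r
# Assumed contents of Driver (the two drivers from the first example)
Driver <- data.frame(Gender = c("Male", "Female"),
                     Age = c(24, 75),
                     VehicleType = c("BMW", "Buick"))

# Count how many times each gender occurs
gender_counts <- table(Driver$Gender)
print(gender_counts)

# Bar graph of number of Males and Females
barplot(gender_counts, col = "lightblue",
        main = "Drivers by Gender", ylab = "Count")
```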
More elegant graphs can be obtained by using various other packages. One of the best available at the moment is ggplot2, whose main function is ggplot(). If you need better graphing capabilities I recommend you learn to use ggplot2.
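To give a flavour of the syntax, here is a minimal sketch of a histogram built with ggplot(). It assumes the ggplot2 package is already installed; the seed, bin count and colours are arbitrary choices of mine.

```r
# Assumes ggplot2 is installed; if not, run install.packages("ggplot2") first
library(ggplot2)

set.seed(2024)  # assumed seed for repeatable draws
Distributions <- data.frame(Beta = rbeta(150, 2, 5))

# Map the Beta column to the x axis, then add a histogram layer and labels
p <- ggplot(Distributions, aes(x = Beta)) +
  geom_histogram(bins = 20, fill = "steelblue", colour = "white") +
  labs(title = "Histogram of Beta Distribution", x = "Beta", y = "Count")

print(p)
```

A plot is built up from layers joined with +, so swapping geom_histogram() for another geom changes the chart type without touching the rest.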
Here are some useful exercises to practice and expand on your graphic abilities.
Jed Frees' book, Regression Modeling with Actuarial and Financial Applications
More from Jed Frees.
From the CAS Webinar Introduction to Predictive Modeling in R given by Ben Escoto on 2012-12-11. Contains an input data set and full code.
Since the webinar, some of the R packages used have been changed. With current versions of R, the statmod package must be loaded (e.g. library(statmod)) to perform tweedie regressions. Also, in the LiftCurve function, instances of m1.fit and m2.fit may have to be replaced by m1_fit and m2_fit respectively.
Useful Webpages for Learning R
The R manuals
The presentation as seen in the webinar is contained in this zip file.