## Convenient data manipulation with dplyr

### Introduction

[dplyr](https://github.com/hadley/dplyr) is a data manipulation package for the [R](http://www.r-project.org) language. Although it was first released this year (2014), it has already become an essential tool for many. In fact, for actuaries learning R, I think it's worth picking up right away and considering a core part of the R language. Three reasons for this are:

1. dplyr's syntax is more intuitive than the basic R data manipulation functions.
2. Actuarial tasks frequently involve simple manipulations of tabular data, and dplyr excels at this.
3. dplyr is fast and scalable. You won't need to switch to a different tool when you start dealing with bigger data sets.

This blog post will introduce dplyr using actuarial examples. We'll start with a small data set to show the basic syntax, and then examine dplyr's speed on a larger data set.

There are several good introductions to dplyr, including the [basic vignette included in the package](http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html). We won't repeat most of that here; instead, we'll stick to illustrating dplyr on a sample actuarial data set. For details on the various dplyr functions, please refer to that vignette or [the reference manual](http://cran.r-project.org/web/packages/dplyr/dplyr.pdf).

### Small toy example

Suppose we have a data frame called `quarter.df` that contains aggregate premium and loss by quarter for two years:

```{r}
library(dplyr)
set.seed(0)
quarter.df <- data.frame(year=c(rep(2011, 4), rep(2012, 4)),
                         quarter=rep(1:4, 2),
                         prem=rnorm(8, mean=5e6, sd=5e6 * .05),
                         loss=rnorm(8, mean=5e6 * .7, sd=5e6 * .7 * .3))
```

```{r, results='asis', echo=FALSE}
knitr::kable(quarter.df)
```

As an actuary, you may wonder: What are the loss ratios by quarter and by year? Can we compare the yearly and quarterly loss ratios? dplyr can help us do this. The dplyr elements we'll use here are:

1. `group_by` - group a table according to a column
2. `summarize` - distill a table that has already been grouped with `group_by`
3. `inner_join` - merge one table with another, like SQL's inner join
4. `%>%` - the pipe operator (really part of the magrittr package, but you'll get it if you load dplyr)

To get the loss ratios by year, we group the quarterly data frame into a yearly one:

```{r}
year.df <- group_by(quarter.df, year)
year.df <- summarize(year.df, yr.prem=sum(prem), yr.loss=sum(loss))
year.df <- transform(year.df, yr.lr=yr.loss / yr.prem)
year.df
```

Note how we kept redefining `year.df`. dplyr often lends itself to chained assignments using the pipe operator `%>%`. Conceptually the operator is very simple: it just lets us write `x %>% f(...)` instead of `f(x, ...)`. It doesn't really do anything by itself; it just lets us write function calls in a different syntactic order. But sometimes that makes a huge difference! The following two blocks of code are both equivalent to the previous one. Which would you rather read?

```{r}
year.df <- transform(summarize(group_by(quarter.df, year),
                               yr.prem=sum(prem),
                               yr.loss=sum(loss)),
                     yr.lr=yr.loss / yr.prem)
```

```{r}
year.df <- (quarter.df %>%
              group_by(year) %>%
              summarize(yr.prem=sum(prem), yr.loss=sum(loss)) %>%
              transform(yr.lr=yr.loss / yr.prem))
```
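If you want to convince yourself that the pipe really is nothing more than a reordering, here is a minimal check (my own illustration, not from the dplyr vignette):

```{r}
# `x %>% f()` is just `f(x)` with the first argument written up front:
sum(1:10)       # 55
1:10 %>% sum()  # also 55
```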
Finally, we can combine our yearly and quarterly data frames to compare loss ratios by quarter:

```{r}
quarter.df <- (quarter.df %>%
                 transform(lr=loss / prem) %>%
                 inner_join(year.df[, c("year", "yr.lr")], by="year") %>%
                 transform(diff.from.year=lr - yr.lr))
```

```{r, results='asis', echo=FALSE}
knitr::kable(quarter.df)
```

Above, the `diff.from.year` column is the difference between the quarterly loss ratio and the loss ratio for that year.

### Larger example and speed comparisons

dplyr is relatively fast. Let's expand the example above to 10 million rows:

```{r, eval=FALSE}
library(tweedie)
library(lubridate)
set.seed(0)
num.rows <- 1e7
date.range <- as.Date("2003-01-01"):as.Date("2012-12-31")
day.numbers <- sample(date.range, num.rows, replace=TRUE)
policy.df <- data.frame(policy.num=seq(length.out=num.rows),
                        eff.date=as.Date(day.numbers, origin=origin),
                        prem=rnorm(num.rows, mean=100, sd=10),
                        loss=rtweedie(num.rows, mu=70, phi=100, power=1.5))
policy.df$acc.year <- year(policy.df$eff.date)
head(policy.df)
```

This time, suppose we just want a listing of the 10 accident years with the total premium and loss in each year. We will test five different ways of computing this answer in R and compare their speeds.

#### dplyr

```{r, eval=FALSE}
system.time(
    year.big.df <- (group_by(policy.df, acc.year) %>%
                      summarize(yr.prem=sum(prem), yr.loss=sum(loss)))
)
```

#### aggregate

aggregate is an older R function for grouping variables in a data frame. It is included in R's core stats package.

```{r, eval=FALSE}
system.time(
    year.big.df <- aggregate(policy.df[, c("loss", "prem")],
                             by=list(policy.df$acc.year), sum)
)
```

#### data.table

The package [data.table](http://cran.r-project.org/web/packages/data.table/index.html) provides data manipulation features similar to dplyr's.

```{r, eval=FALSE}
library(data.table)
system.time({
    policy.dt <- data.table(policy.df)
    year.big.df <- policy.dt[, list(yr.prem=sum(prem), yr.loss=sum(loss)),
                             by=acc.year]
})
```

#### sqldf

[sqldf](https://code.google.com/p/sqldf/) is a neat package written by Gabor Grothendieck. It allows R users to write full SQL queries on R data frames. It can even be used to conveniently read/write data frames from/to a database on disk.

```{r, eval=FALSE}
library(sqldf)
system.time({
    year.big.df <- sqldf("
        select acc_year, sum(prem) as [yr.prem], sum(loss) as [yr.loss]
        from [policy.df]
        group by acc_year")
})
```

#### ddply

ddply is from the [plyr](http://plyr.had.co.nz/) package and is the predecessor of dplyr. Although dplyr has replaced it for most tasks, it can still be useful for some jobs.

```{r, eval=FALSE}
library(plyr)
system.time(
    year.big.df <- ddply(policy.df, .(acc.year), plyr::summarise,
                         yr.prem=sum(prem), yr.loss=sum(loss))
)
```

#### Test results

Here are the results of the above tests (in seconds):

```{r, echo=FALSE}
data.frame(user=c(213.88, 0.89, 3.45, 0.77, 48.70),
           sys=c(1.15, 0.20, 1.60, 0.00, 6.83),
           elapsed=c(217.42, 1.09, 5.05, 0.76, 55.64),
           row.names=c("aggregate", "data.table", "ddply", "dplyr", "sqldf"))
```

dplyr was the fastest in this test, more than 200 times faster than the aggregate function. It was also significantly faster than sqldf and the ddply function from Hadley Wickham's earlier plyr package. data.table was close behind, though, and would have been faster had we been using data.tables instead of data.frames to begin with.
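If you're curious how much of data.table's time above went to that conversion, a rough follow-up (my own sketch, not part of the timed comparison) is to build the data.table once, outside the timing:

```{r, eval=FALSE}
# Sketch: assumes policy.df from above already exists. The
# data.frame-to-data.table conversion happens before system.time,
# so only the grouped aggregation itself is timed.
library(data.table)
policy.dt <- data.table(policy.df)
system.time(
    year.big.dt <- policy.dt[, list(yr.prem=sum(prem), yr.loss=sum(loss)),
                             by=acc.year]
)
```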
Before dplyr came out, I would use data.table when I had to for speed, but I found myself constantly consulting the data.table documentation because I found its syntax counterintuitive.

### Conclusion

Hopefully this blog post has presented a semi-realistic example of how dplyr can help process actuarial data. dplyr combines intuitive syntax, fast operation, and wide applicability into a very useful package.