R Markdown - Dynamic Documents for R
Background
Building pricing guidelines with statistical models usually involves some routine steps in R.
- Data import and cleaning
- Data summary
- Regression (GLM or other supervised/unsupervised models)
- Coefficient estimates and confidence intervals
- Prediction
- Graphs for actual/predicted responses
- Model comparison and selection
When actuaries update models or rerun them on new data, usually the only step they need to worry about is the third one, the regression, while everything else is very likely to remain the same or similar. To make things easy, actuaries like to write functions or loops that take the regression formulas and the input data as arguments. Then they only have to do three things. The first step is to reassign the formulas, for example by adding more rating factors, and to update the input data source. The second step is to rerun the old R script from the coefficient display through model comparison. The last step is to copy and paste the outputs into PowerPoint or PDF and format the pages for presentation.
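As a rough sketch of the "formula and data as arguments" idea, the function below wraps the modeling steps so that only its two arguments change between runs. The helper name run_model, the simulated data and the simulation coefficients are all assumptions made for illustration; only the column names come from the example used in this post.

```r
# Simulated stand-in for the policy data; the values are made up for illustration.
set.seed(1)
n <- 200
policies <- data.frame(
  num_veh = sample(1:4, n, replace = TRUE),
  num_drv = sample(1:4, n, replace = TRUE),
  marital = factor(sample(c("single", "married"), n, replace = TRUE),
                   levels = c("single", "married"))
)
policies$loss_ind <- rbinom(
  n, 1, plogis(0.8 - 0.4 * policies$num_veh - 0.6 * (policies$marital == "married"))
)

# Hypothetical helper: everything after data import depends only on two arguments.
run_model <- function(model_formula, data) {
  fit <- glm(model_formula, data = data, family = binomial(link = "logit"))
  list(
    fit = fit,
    coefficients = coef(summary(fit)),  # estimates, standard errors, z and p values
    conf_int = confint.default(fit),    # intervals based on standard errors
    predicted = fitted(fit)             # predicted loss probabilities
  )
}

report <- run_model(loss_ind ~ num_veh + num_drv + marital, policies)
print(report$coefficients)
```

Updating the model then means calling run_model() again with a different formula or data frame, while the body of the function stays untouched.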
Compared to rewriting the code for every model update, this approach is very effective. However, things become even simpler when we add the formatting step to the “loop”. To make this happen, I am introducing an R package called “rmarkdown”, with which the whole flow from import to formatting is fully reproducible. Instead of three steps, we now need only two: reassigning the arguments and rerunning the old R Markdown code.
Introduction
From the R Markdown v2 official website, “R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document.” Besides integrating all the steps of predictive modeling, R Markdown has the following characteristics:
- The final document can be output in different formats, such as HTML, PDF and MS Word, depending on the audience and presentation scenario.
- The R code is the same as usual, but text such as titles and section names can be formatted with LaTeX, which allows more colors, fonts and sizes for different items.
- Custom templates are available online
- Shiny can be used in interactive R Markdown documents.
R Markdown offers more than a plain R script, so it also works differently. The two compare as follows.
- Where it runs. R Markdown: most conveniently in the RStudio IDE, which provides the “Knit” buttons used below. R Script: in the regular R base system or in RStudio.
- New file type. R Markdown: created as a new R Markdown file. R Script: created as a new R Script file.
- Contents. R Markdown: R code is written in chunks or inline, alongside text. R Script: R code only.
- Running. R Markdown: to run only the R code, highlight it and click the “Run” button (Control+R for shortcut) as usual; to create the final document for presentation, click “Knit HTML”, “Knit PDF” or “Knit Word”. R Script: click the “Run” button (Control+R for shortcut).
- Templates. R Markdown: templates are available in RStudio. R Script: no templates in RStudio or the R base system.
Installation and Markdown Basics
RStudio needs to be installed as the IDE for the workflow described here. Installers for Windows, Mac and other systems are available from the RStudio website.
Then, the add-on package “rmarkdown” should be installed and loaded by
install.packages("rmarkdown")
library("rmarkdown")
This code can be written and run in an R Script (.R) or R Markdown (.Rmd) file in RStudio. To keep things simple, I first open a new .Rmd file in RStudio, write the above R code in an R code chunk, highlight it and click “Run”.
In RStudio's new file menu, the R Markdown entry is listed below R Script.
Select an output format, for example HTML.
The new file opens as a template, but we do not need to worry about its code at first. I just create a chunk for easy package installation and loading.
Now we have everything ready. Let’s run the whole template and see how effective and efficient R Markdown is. Instead of highlighting any R code and clicking “Run”, I click “Knit HTML” directly. It is a dropdown list, from which “Knit PDF” and “Knit Word” can also be selected.
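For readers who prefer the console, knitting can also be triggered from R code with rmarkdown::render(). The sketch below writes a tiny .Rmd file (a YAML header, one line of text and one R chunk) and renders it to HTML. The file contents and the title "Demo report" are made up for illustration, and rendering is skipped when the rmarkdown package or pandoc is not available.

```r
# Build a tiny .Rmd file from R. The fence is assembled with strrep() only so
# that this example can itself sit inside a code block.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  "title: \"Demo report\"",
  "output: html_document",
  "---",
  "",
  "Some narrative text.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Render only when the rmarkdown package and pandoc are both available.
# Other format names include "word_document" and "pdf_document"
# (the latter also needs a LaTeX installation).
html_file <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
}
```

When rendering succeeds, html_file holds the path of the generated HTML document.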
I highly suggest you try the template first. Besides the “summary(cars)” and “plot(pressure)” example, there are also Shiny documents and “Package Vignette” and “Tufte Handout” files to open. Once you are familiar with R Markdown syntax, you can try an insurance predictive modeling example, which can be downloaded from this page and is described in more detail below.
Running an Example
Suppose that we are interested in the factors that influence whether an auto policy incurs any losses in a policy year. The outcome (response) variable is binary (0/1), loss or no loss, so it is called loss_ind. The predictor variables of interest are policy information obtained in underwriting, such as the number of vehicles (num_veh), the number of drivers (num_drv) and marital status (marital).
loss_ind ~ num_veh + num_drv + marital
We want to determine which independent variables are significant and whether the signs of their coefficients make sense.
Please download “UCLA R Data Analysis Examples Logit Regression.zip” in the lower right of this blog page and decompress it. Run “UCLA R Data Analysis Examples Logit Regression.Rmd” the same way we ran the template .Rmd file.
There is a setup chunk for the whole R Markdown document.
The texts will be bold and appear at the beginning of the HTML output.
“echo” defaults to TRUE, which means the R code will appear in the final document alongside the output it produces. If we want only the output to appear, we can either change the default echo in the r setup chunk or reassign the value of echo in each R chunk later.
For example, R code in chunks where echo is set to FALSE will not appear in the final document, but R code in chunks where echo is TRUE will. However, the output of “print(formula)” will appear in the final document either way, because echo controls only whether the code is shown. The R output appears no matter what value echo has.
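The effect of echo can be checked on a small scale by knitting a single chunk from the console. This sketch assumes the knitr package is installed (it is skipped otherwise); the variable name model_formula is chosen here for illustration. In a setup chunk, the document-wide default would be changed with knitr::opts_chunk$set(echo = FALSE).

```r
# Assemble one chunk with echo=FALSE. The fence is built with strrep() only so
# that this example can itself sit inside a code block.
fence <- strrep("`", 3)
chunk_text <- c(
  paste0(fence, "{r, echo=FALSE}"),
  "model_formula <- loss_ind ~ num_veh + num_drv + marital",
  "print(model_formula)",
  fence
)

knitted <- NULL
if (requireNamespace("knitr", quietly = TRUE)) {
  # With text= supplied, knit() returns the resulting markdown as a string.
  knitted <- knitr::knit(text = chunk_text, quiet = TRUE)
  cat(knitted)
}
```

The knitted markdown contains the printed formula but not the two lines of code that produced it.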
There are 200 observations in the imported “binary 2015.csv” file, and I use the whole dataset to train the regression model.
Next, we take a peek at the top rows, the summary and the counts of the dataset.
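A minimal sketch of this peek is shown below. Because the CSV file is not bundled with this text, simulated data with the same column names stand in for it; the values and the object name policies are assumptions for illustration.

```r
# With the real file, the import would look like:
# policies <- read.csv("binary 2015.csv")
# Simulated stand-in with the same column names (values are made up).
set.seed(1)
n <- 200
policies <- data.frame(
  num_veh = sample(1:4, n, replace = TRUE),
  num_drv = sample(1:4, n, replace = TRUE),
  marital = factor(sample(c("single", "married"), n, replace = TRUE),
                   levels = c("single", "married"))
)
policies$loss_ind <- rbinom(
  n, 1, plogis(0.8 - 0.4 * policies$num_veh - 0.6 * (policies$marital == "married"))
)

print(head(policies))     # top rows
print(summary(policies))  # per-column summary

# Two-way counts of the response against a categorical predictor.
counts <- xtabs(~ loss_ind + marital, data = policies)
print(counts)
```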
Since the response variable is binary, we use logit regression; the model outputs are displayed below.
The number of drivers is not a significant predictor, but the number of vehicles is. The more vehicles on a policy, the lower the likelihood of a loss. This makes sense because more vehicles might indicate older, wealthier and better-educated policyholders, who are more mature and careful when driving and so have fewer accidents. Likewise, married policyholders tend to be older and more mature than single policyholders.
Confidence intervals for the coefficient estimates can be obtained in two ways: based on the profiled log-likelihood function, or based on the standard errors by using the default method.
Odds ratios are calculated for each independent variable. In this case, the odds ratios of the significant predictors, together with their lower and upper confidence limits, are consistently below 1, while those of the insignificant predictor, num_drv, are not consistent.
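The two kinds of intervals and the odds ratios can be sketched as follows. The data are simulated under assumed coefficients, so the numbers will not match the post's output; confint() gives the profile-likelihood intervals and confint.default() the intervals based on standard errors.

```r
# Simulated stand-in for the policy data (values are made up).
set.seed(1)
n <- 200
policies <- data.frame(
  num_veh = sample(1:4, n, replace = TRUE),
  num_drv = sample(1:4, n, replace = TRUE),
  marital = factor(sample(c("single", "married"), n, replace = TRUE),
                   levels = c("single", "married"))
)
policies$loss_ind <- rbinom(
  n, 1, plogis(0.8 - 0.4 * policies$num_veh - 0.6 * (policies$marital == "married"))
)

fit <- glm(loss_ind ~ num_veh + num_drv + marital,
           data = policies, family = binomial(link = "logit"))

ci_profile <- suppressMessages(confint(fit))  # profile-likelihood intervals
ci_wald <- confint.default(fit)               # intervals from standard errors

# Exponentiating the coefficients and interval limits gives odds ratios.
odds_ratios <- exp(cbind(OR = coef(fit), ci_profile))
print(round(odds_ratios, 3))
```

An odds ratio whose whole interval lies below 1 points to a predictor that lowers the odds of a loss; an interval that straddles 1 is what an insignificant predictor typically looks like.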
The “ggplot2” package is a good tool that uses colors, shapes and sizes to show the relation between the dependent variable and more than one independent variable. For example, the plot makes it clear that single policyholders consistently have a higher likelihood of incurring losses and that the likelihood decreases as the number of vehicles increases.
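One way such a plot could be built is sketched below: predict the loss probability on a grid of vehicle counts and marital statuses, then map marital status to color. The data are simulated, holding the number of drivers at 2 is an arbitrary choice, and the plot is only created when ggplot2 is installed.

```r
# Simulated stand-in for the policy data (values are made up).
set.seed(1)
n <- 200
policies <- data.frame(
  num_veh = sample(1:4, n, replace = TRUE),
  num_drv = sample(1:4, n, replace = TRUE),
  marital = factor(sample(c("single", "married"), n, replace = TRUE),
                   levels = c("single", "married"))
)
policies$loss_ind <- rbinom(
  n, 1, plogis(0.8 - 0.4 * policies$num_veh - 0.6 * (policies$marital == "married"))
)
fit <- glm(loss_ind ~ num_veh + num_drv + marital,
           data = policies, family = binomial(link = "logit"))

# Prediction grid: every vehicle count for each marital status.
grid <- expand.grid(
  num_veh = 1:4,
  num_drv = 2,  # held fixed; an arbitrary choice for this sketch
  marital = factor(c("single", "married"), levels = levels(policies$marital))
)
grid$pred_prob <- predict(fit, newdata = grid, type = "response")

p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(grid, ggplot2::aes(x = num_veh, y = pred_prob, colour = marital)) +
    ggplot2::geom_line() +
    ggplot2::geom_point() +
    ggplot2::labs(x = "Number of vehicles", y = "Predicted probability of a loss")
  # print(p) would draw the plot.
}
```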
Each policy has a predicted likelihood of incurring losses, but how will we use this information? One option is to divide the policyholders into a few groups and use the group number as a pricing guideline. For example, I form group 1 and group 2, where policyholders in group 1 have predicted likelihoods below the median and those in group 2 have predicted likelihoods at or above it. Group 1 is therefore the better group and should be charged lower premiums. Below is a boxplot of the actual loss percentage in each group. We do see that group 1 has the lower actual loss percentage, so the actual loss experience is consistent with the predicted trend.
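The median split described above might look like the sketch below. The data are simulated, and the actual loss rate per group is shown as a simple mean rather than the boxplot used in the post; the column names pred_prob and group are chosen here for illustration.

```r
# Simulated stand-in for the policy data (values are made up).
set.seed(1)
n <- 200
policies <- data.frame(
  num_veh = sample(1:4, n, replace = TRUE),
  num_drv = sample(1:4, n, replace = TRUE),
  marital = factor(sample(c("single", "married"), n, replace = TRUE),
                   levels = c("single", "married"))
)
policies$loss_ind <- rbinom(
  n, 1, plogis(0.8 - 0.4 * policies$num_veh - 0.6 * (policies$marital == "married"))
)
fit <- glm(loss_ind ~ num_veh + num_drv + marital,
           data = policies, family = binomial(link = "logit"))

# Split policies at the median predicted probability.
policies$pred_prob <- fitted(fit)
policies$group <- ifelse(policies$pred_prob < median(policies$pred_prob),
                         "group 1", "group 2")

# Actual share of policies with a loss in each group.
loss_rate <- tapply(policies$loss_ind, policies$group, mean)
print(loss_rate)
```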
Model fit statistics can be easily calculated in R.
A single deviance difference or degrees-of-freedom difference means little on its own, but when we update the logit model, these values can be compared with the updated ones.
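A minimal sketch of these statistics follows, assuming they are the differences between the null model and the fitted model. The data are simulated, and the chi-square p-value is an addition that is commonly reported alongside them.

```r
# Simulated stand-in for the policy data (values are made up).
set.seed(1)
n <- 200
policies <- data.frame(
  num_veh = sample(1:4, n, replace = TRUE),
  num_drv = sample(1:4, n, replace = TRUE),
  marital = factor(sample(c("single", "married"), n, replace = TRUE),
                   levels = c("single", "married"))
)
policies$loss_ind <- rbinom(
  n, 1, plogis(0.8 - 0.4 * policies$num_veh - 0.6 * (policies$marital == "married"))
)
fit <- glm(loss_ind ~ num_veh + num_drv + marital,
           data = policies, family = binomial(link = "logit"))

# Difference in deviance and in degrees of freedom, null model vs. fitted model.
dev_diff <- with(fit, null.deviance - deviance)
df_diff <- with(fit, df.null - df.residual)

# Chi-square test of the fitted model against the null model.
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
print(c(deviance_diff = dev_diff, df_diff = df_diff, p_value = p_value))
```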
Update the Formula
Let’s come back to the number-of-drivers rating factor. Since it was shown to be insignificant earlier, we remove it and run the regression again. Only marital status and the number of vehicles are left in the regression.
loss_ind ~ num_veh + marital
The formula is redefined.
We just need to click the “Knit HTML” button. In the new HTML file, the coefficient display shows results consistent with the previous one. What is more, we see a smaller deviance difference and a smaller degrees-of-freedom difference. Policy observations are better grouped in terms of actual loss experience.
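Inside the R code, dropping a term does not require retyping the model. The sketch below uses update() on a model fitted to simulated data and compares the two fits with a likelihood-ratio test; the object names fit_full and fit_reduced are chosen for illustration, and the post itself simply redefines the formula and knits again.

```r
# Simulated stand-in for the policy data (values are made up).
set.seed(1)
n <- 200
policies <- data.frame(
  num_veh = sample(1:4, n, replace = TRUE),
  num_drv = sample(1:4, n, replace = TRUE),
  marital = factor(sample(c("single", "married"), n, replace = TRUE),
                   levels = c("single", "married"))
)
policies$loss_ind <- rbinom(
  n, 1, plogis(0.8 - 0.4 * policies$num_veh - 0.6 * (policies$marital == "married"))
)

fit_full <- glm(loss_ind ~ num_veh + num_drv + marital,
                data = policies, family = binomial(link = "logit"))

# Refit without num_drv; ". ~ . - num_drv" keeps everything else unchanged.
fit_reduced <- update(fit_full, . ~ . - num_drv)
print(formula(fit_reduced))

# Likelihood-ratio comparison of the reduced and full models.
print(anova(fit_reduced, fit_full, test = "Chisq"))
```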
Update the Data
Suppose that in the new policy year 2016 we have 200 new observations. We keep using the updated model
loss_ind ~ num_veh + marital
and update the data as well.
I simply rerun the R Markdown file to create a similar dynamic HTML output. The final document shows a consistent trend.
Many duplicated actions are avoided, so time is saved and mistakes are prevented. All of these improvements are thanks to R Markdown. Are you excited about how productive the R Markdown process can be? If yes, enjoy the journey, and don't forget to post your new work online in return!
Bibliography