Modeling ALAE Using Copulas

Introduction:

It is often necessary in reinsurance pricing to work with individual claim data that lists ALAE and indemnity amounts separately for each claim. Reinsurance contracts can cover ALAE either included with the loss or pro-rata in addition to the indemnity for an excess layer. You might also have different trend assumptions for ALAE and indemnity.

For exposure rating, a reinsurer typically has a default assumption for the indemnity severity curve. The traditional method for modeling ALAE has been to assume it is a fixed percentage of the loss for all sizes of loss. This not only forces the distribution of ALAE amounts to be a scaled copy of the distribution of indemnity amounts, but also forces the two variables to be 100% correlated, which are two very strong assumptions.
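In symbols, with ground-up indemnity $X$ and a fixed positive constant $c$, the traditional assumption says:

\begin{equation}
\text{ALAE} = c \cdot X \quad \Longrightarrow \quad F_{\text{ALAE}}(y) = F_X(y/c), \qquad \mathrm{Corr}(X, \text{ALAE}) = 1
\end{equation}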

A quick look at some actual reinsurance claim data shows the correlation assumption to be far from reality:

[Figure (Picture1(1).jpg): actual reinsurance claim data, indemnity versus ALAE]

See the references for more complete papers on the theory, but the key point is that claim data with indemnity and ALAE information is a sample from a bivariate distribution, and R has many tools for working with bivariate distributions.

To start with some theory, Sklar’s theorem states that a bivariate distribution is completely defined by the marginal distributions of the variables and the copula which relates their cumulative distribution functions:

\begin{equation}
F(x,y) = C(u,v), \qquad u = F_X(x), \quad v = F_Y(y)
\end{equation}

A copula is simply a bivariate (or multivariate) distribution on the unit square (or cube, etc.). The practical implication is that we can fit univariate distributions to ALAE and indemnity separately, and then fit a copula to the bivariate data transformed to $[0,1]\times[0,1]$, without any loss of information.

Setting up the Data in R:

The full R code is attached in a file, but I will go through it step by step here. We start with a file containing the loss data, with one column for indemnity and one for ALAE. These are the first handful of rows:

loss alae
4468750 5571.34
3490000 808363.4
3450000 180757
3425000 713552.2
2895000 394624.5
1958925 36074.6
1626762 326741.5
990000 936229.8
980885.4 13156.44

The first step in R is to load our libraries, import the data, and extract the separate ALAE and indemnity univariate data as well as the rank-transformed copula data:

library("actuar")
library("stats")
library("stats4")
library("copula")
library("distr")

copula_data <-read.csv("C:\\directory\\copula_data.csv", header = TRUE)
alae_data <- copula_data[,2]
alae_data <- alae_data[alae_data>0]
loss_data <- copula_data[,1]
loss_threshold <-100000
loss_data<-loss_data-loss_threshold
loss_data <-loss_data[loss_data>0]
set.seed(123)
rankdata <- sapply(copula_data , rank, ties.method = "random") / (nrow(copula_data) + 1)

A few minor tweaks are made here. We look only at indemnity loss above a certain threshold, on the assumption that we are analyzing an excess layer of reinsurance, and we remove any ALAE data points at exactly 0, which allows us to fit curves to the logarithm of the data. An adjustment could be made at the end to add back in the probability of zero ALAE, but I have not done that here.

In the last line, we transform the data to lie in $[0,1]\times[0,1]$ by applying the rank function and then dividing by the number of data points plus one. Some of the copula fitting procedures break down if there are ties in the data, so we apply a tie-breaking procedure which randomly selects one of the equal entries to be ranked ahead of the other.
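Incidentally, the copula package provides a helper that performs the same pseudo-observation transform, so the manual rank line above could equivalently be written as:

# Equivalent pseudo-observation transform using copula::pobs
rankdata <- pobs(copula_data, ties.method = "random")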

Univariate Distribution Fitting:

There are so many resources discussing curve fitting for univariate data that I won’t go into the theory, but the following code takes the ALAE data and uses maximum likelihood to fit Pareto, log-gamma, Weibull and lognormal distributions. It then plots the resulting fits against the empirical CDF:

# Fit candidate severity curves to the ALAE data by maximum likelihood
nLL1 <- function(mu, sigma) -sum(stats::dlnorm(alae_data, mu, sigma, log = TRUE))
alae_fit1 <- mle(nLL1, start = list(mu = 10, sigma = 1), method = "L-BFGS-B", nobs = length(alae_data), lower = c(1, 0.01))

nLL2 <- function(shape, scale) -sum(stats::dweibull(alae_data, shape, scale, log = TRUE))
alae_fit2 <- mle(nLL2, start = list(shape = 1, scale = 50000), method = "L-BFGS-B", nobs = length(alae_data), lower = c(0.1, 100))

nLL3 <- function(shapelog, ratelog) -sum(actuar::dlgamma(alae_data, shapelog, ratelog, log = TRUE))
alae_fit3 <- mle(nLL3, start = list(shapelog = 60, ratelog = 1), method = "L-BFGS-B", nobs = length(alae_data), lower = c(0.01, 0.01))

nLL4 <- function(shape, scale) -sum(actuar::dpareto(alae_data, shape, scale, log = TRUE))
alae_fit4 <- mle(nLL4, start = list(shape = 1.1, scale = 1000), method = "L-BFGS-B", nobs = length(alae_data), lower = c(1, 100))

# Plot the fitted CDFs against the empirical CDF
x <- seq(0, max(alae_data), max(alae_data)/1000)

plot(x, plnorm(x, coef(alae_fit1)[1], coef(alae_fit1)[2]), type = "l", col = "red", main = "ECDF vs Curve Fits")
lines(x, pweibull(x, coef(alae_fit2)[1], coef(alae_fit2)[2]), type = "l", col = "blue")
lines(x, plgamma(x, coef(alae_fit3)[1], coef(alae_fit3)[2]), type = "l", col = "orange")
lines(x, ppareto(x, coef(alae_fit4)[1], coef(alae_fit4)[2]), type = "l", col = "green")
plot(ecdf(alae_data), add = TRUE)

summary(alae_fit4)
alae_fit4@coef
[Figure (Picture2(1).jpg): empirical CDF of the ALAE data versus the fitted curves]

In practice the Pareto seems to be the best fit most often, although the alpha parameter will be fairly high, in the 5 to 10 range. Another consideration could be to fit some type of split distribution, one for lower amounts and one for higher amounts, as has been done in other actuarial applications.
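To compare the candidate fits numerically rather than only visually, one option (my own addition, not in the original post) is to compare the AIC values of the four mle fits; a lower AIC indicates a better trade-off between fit and number of parameters:

# Compare the four fitted distributions by AIC (lower is better)
sapply(list(lognormal = alae_fit1, weibull = alae_fit2,
            loggamma = alae_fit3, pareto = alae_fit4), AIC)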

For indemnity, the corresponding code can be used to run the same fitting procedure. As mentioned above, many reinsurers will have an exposure curve, or a priori distribution of indemnity amounts. As an alternative to fitting the indemnity data, one can also import a .csv file containing the cumulative loss distribution (a sketch of how such a file can be turned into a distribution object follows the excerpt below):

100000 0
100461.579 0.002916338
100925.2886 0.005824171
101391.1386 0.008723524
101859.1388 0.011614422
102329.2992 0.014496888
102801.6298 0.017370949
...
9772372.21 0.999704841
9817479.43 0.999708979
9862794.856 0.999713169
9908319.449 0.999717391
9954054.174 0.999721607
10000000 0.99990522
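The later code uses q(indemnity), the quantile function of a distr object built from this imported curve. Here is a minimal sketch of one way to construct it, assuming a file laid out like the excerpt above (the file name is illustrative and the attached code may do this differently):

# Sketch: build a distr object from the imported cumulative loss distribution
indemnity_cdf <- read.csv("C:\\directory\\indemnity_cdf.csv", header = FALSE)
names(indemnity_cdf) <- c("loss", "cumprob")
probs <- diff(c(0, indemnity_cdf$cumprob))        # cell probabilities from the cumulative curve
keep <- probs > 0
indemnity <- DiscreteDistribution(supp = indemnity_cdf$loss[keep],
                                  prob = probs[keep] / sum(probs[keep]))
q(indemnity)(0.5)                                 # quantile function, used in the simulation step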

Copula Fitting:

There are two aspects to copula fitting. Just as with univariate distributions, there are different families of copulas, and within a family any given dataset will have a best fitting member (according to some goodness-of-fit measure). From my own experimentation I found that the best fitting copula family for various datasets of indemnity and ALAE was the Gumbel copula. This was also chosen as the best family in Frees and Valdez [3], Micocci [9] and Venter [12]. The Gumbel has several desirable properties: it has a single parameter, it is both an extreme-value copula and an Archimedean copula, and it has closed-form expressions for rank correlation (Kendall's tau) and upper tail dependence, two key measures of dependence.
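For reference, the Gumbel copula and the two closed-form dependence measures just mentioned are:

\begin{equation}
C_\theta(u,v) = \exp\left\{ -\left[ (-\ln u)^\theta + (-\ln v)^\theta \right]^{1/\theta} \right\}, \qquad \theta \ge 1
\end{equation}

\begin{equation}
\tau = 1 - \frac{1}{\theta}, \qquad \lambda_U = 2 - 2^{1/\theta}
\end{equation}

where $\tau$ is Kendall's tau and $\lambda_U$ is the upper tail dependence coefficient, the quantity computed as tail_correlation in the code below.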

So for this exercise we will assume the Gumbel to be the correct copula family and fit its single parameter using maximum likelihood:

fitted_copula<-fitCopula(gumbelCopula(1), rankdata, method = "ml")

summary(fitted_copula)

simcopula<-rCopula(nrow(rankdata), fitted_copula@copula)
theta <- fitted_copula@copula@parameters
tail_correlation <- 2 - 2^(1/theta)
tail_correlation
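As a quick sanity check on the fit (my own addition, not in the original code), the model-implied Kendall's tau, $1 - 1/\theta$, can be compared with the empirical Kendall's tau of the data:

# Compare the Gumbel-implied Kendall's tau with the empirical Kendall's tau
c(model = 1 - 1/theta,
  empirical = cor(copula_data[,1], copula_data[,2], method = "kendall"))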

In my trials with different datasets of casualty large loss data, the fitted tail correlations were between 0.2 and 0.4, on a scale of 0 to 1 for the Gumbel. Below is a chart of the raw copula data transformed to the unit square:

[Figure (Picture3.jpg): the raw copula (rank-transformed) data plotted on the unit square]

Simulation and Export:

With both marginal distributions and a copula, we now have a full model. Please note that while I refer to the “fitted” distribution, you may have chosen to import your own indemnity severity curve, in which case it is not fitted to the same data used for the copula; there is code for both cases.

We can plot some simulated data against the actual data to visually assess the model for any blatant errors:

simcopula <- rCopula(nrow(copula_data), fitted_copula@copula)

simloss <- cbind(q(indemnity)(simcopula[,1]),
                 qpareto(simcopula[,2], alae_fit4@coef["shape"], alae_fit4@coef["scale"]) + 10000)

plot(copula_data, main = "input data")
plot(simloss, main = "modeled data")
plot(simloss, xlim = c(0, 10^7), ylim = c(0, 1.5*10^6))
[Figure (Picture4(1).jpg): scatter plots of the input data and the modeled data]

Of course, the above plots are of limited value because we’ve only simulated as many points from the fitted distribution as there are points in the original dataset, so even two such plots from the same fitted distribution may look very different.

The output is based on empirical simulation, not theoretical integration, so the first step is to simulate from our fitted bivariate distribution. The code below is for the case of a preselected and imported indemnity distribution:

simcopula <- rCopula(100000, fitted_copula@copula)
simloss <- cbind(q(indemnity)(simcopula[,1]),
                 qpareto(simcopula[,2], alae_fit4@coef["shape"], alae_fit4@coef["scale"]))
write.csv(simloss, "C:\\directory\\simulated_losses.csv")

Interestingly, we only need to simulate from the copula itself. Those points live on the unit square; we then use the “q” (quantile) function of each fitted distribution to convert the cumulative percentiles into indemnity and ALAE amounts.
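For the other case, where the indemnity severity curve is fitted to the data rather than imported, the quantile step would use the fitted indemnity distribution instead. A sketch, assuming the corresponding indemnity fitting code (not shown above) produced a Pareto fit called loss_fit4, with the threshold added back so the amounts are ground-up:

# Sketch for the fitted-indemnity case; loss_fit4 is assumed to be a Pareto mle fit
# to the thresholded indemnity data from the corresponding (unshown) fitting code
simloss <- cbind(qpareto(simcopula[,1], loss_fit4@coef["shape"], loss_fit4@coef["scale"]) + loss_threshold,
                 qpareto(simcopula[,2], alae_fit4@coef["shape"], alae_fit4@coef["scale"]))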

There are two forms of output I have code for. The first is an “ALAE included” combined severity cumulative distribution. In other words, we add the simulated indemnity and ALAE amounts together for each simulated point and rank them to create a univariate distribution. We export the resulting empirical distribution to a .csv:

loss_alaeinc<-simloss[,1]+simloss[,2]
cumulative_prob<-c(1:1000)/1000
loss_dist_alaeinc <- t(rbind(quantile(loss_alaeinc, cumulative_prob), cumulative_prob))
write.csv(loss_dist_alaeinc, "C:\\directory\\loss_dist_alaeinc.csv")

The other form of output is a summary of loss statistics for a set of excess layers. We start with a file containing the desired limit and attachment of our study layers. There are also columns labeled count, moment1 and moment2. The count column will be the number of simulated claims hitting the layer, moment1 will be the average layered loss per simulated claim (including claims that do not reach the layer), and moment2 will be the mean of the squared layered loss. This will make more sense when we examine the output file.

Here is what the input file looks like:

limit attachment count moment1 moment2
250000 250000 0 0 0
500000 500000 0 0 0
1000000 1000000 0 0 0
3000000 2000000 0 0 0
5000000 5000000 0 0 0

Here is the code in the ALAE pro-rata case:

bc_exhibit <- read.csv("C:\\directory\\bc_exhibit_input.csv", header = TRUE)
layer <- function(x, limit, attachment) pmax(pmin(x - attachment, limit), 0)   # loss to the layer, vectorized over x

for (i in 1:nrow(bc_exhibit)){
  layeredindem <- layer(simloss[,1], bc_exhibit[i,1], bc_exhibit[i,2])   # indemnity loss to the layer
  layeredloss <- layeredindem + simloss[,2] * layeredindem/simloss[,1]   # add ALAE pro-rata to the layered indemnity
  bc_exhibit[i,3] <- length(layeredloss[layeredloss>0])                  # count of claims hitting the layer
  bc_exhibit[i,4] <- mean(layeredloss)                                   # moment1
  bc_exhibit[i,5] <- mean(layeredloss*layeredloss)                       # moment2
}

bc_exhibit
write.csv(bc_exhibit, "C:\\directory\\bc_exhibit_output.csv")

The ALAE pro-rata treatment is handled in the layeredloss assignment: it is the layered indemnity plus the ALAE amount (the second simloss column) multiplied by the ratio of the layered indemnity to the total indemnity.
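For completeness, if the layer were instead applied to indemnity and ALAE combined (the “ALAE included” treatment), the layeredindem and layeredloss lines inside the loop could be replaced with something like the following; this is my own sketch, not taken from the attached code:

# ALAE-included variant (sketch): apply the layer to the combined amount
layeredloss <- layer(simloss[,1] + simloss[,2], bc_exhibit[i,1], bc_exhibit[i,2])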

The output file is identical to the input file, but with the statistics data populated:

limit attachment count moment1 moment2
250000 250000 140 132752.1 29400000000
500000 500000 70 116238.8 50000000000
1000000 1000000 31 87879.97 75500000000
3000000 2000000 12 54638.55 114000000000
5000000 5000000 2 30873.52 90100000000

This is the last use of R, but there is one more step, which was done in Excel: translating the R output into the more common statistics of expected loss, frequency, severity and coefficient of variation of layer losses for the study layers. These simple formulas are functions of the R output and the only other input our model needs, which is the expected claim frequency at the data threshold. Recall that we had to select a threshold for the indemnity data above 0, which can be lower than or equal to our lowest study layer attachment; in this example it is lower.
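The exact spreadsheet formulas are not shown in the post, but one reasonable set, assuming a Poisson claim count with mean $\lambda$ at the data threshold and $n_{sim}$ simulated claims, is:

\begin{equation}
\text{expected layer loss} = \lambda \cdot \text{moment1}, \qquad
\text{layer frequency} = \lambda \cdot \frac{\text{count}}{n_{sim}}, \qquad
\text{layer severity} = \frac{\text{expected layer loss}}{\text{layer frequency}}
\end{equation}

\begin{equation}
\text{CoV of layer loss} \approx \frac{\sqrt{\lambda \cdot \text{moment2}}}{\lambda \cdot \text{moment1}}
\end{equation}

The CoV expression uses the standard compound Poisson result that the aggregate variance equals $\lambda \cdot E[X^2]$, where moment2 estimates $E[X^2]$ for the layered loss per claim.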

Assuming a frequency of 16 at the 100,000 threshold, we can calculate the following statistics:

layer lim attach risk prem freq sev CoV
1 250000 250000 1941451 10.9 178115 0.34
2 500000 500000 1772998 5.874 301864 0.493
3 1000000 1000000 1181740 2.468 478851 0.835
4 3000000 2000000 993437 0.778 1277718 1.507
5 5000000 5000000 239618 0.155 1550813 3.681

Discussion and Further Refinements:

Hopefully the reader can find this helpful, but there are a few weaknesses in the direct application of this procedure to actual claim data. First, we have not mentioned trend or development. Trend can be accounted for by trending the data first before importing to R. Development is a much more difficult problem, but indemnity development can be accounted for by importing a pre-selected indemnity severity distribution which represents ultimate values.

As far as how the results compare to the traditional assumption, I typically found that, given the same total amount of loss, the expected loss to lower layers increased while the expected loss to upper layers decreased under the ALAE-included treatment. This will differ for every application, but the important thing to understand is that there are numerous factors pushing layer losses up and down. For example, the classical assumption was 100% correlation, so any fit at less than 100% correlation will reduce the loss to the very highest layers. Our fitted ALAE severity distribution, however, may be more or less severe than the scalar multiple of the indemnity default assumption.

Examples of all input and output files as well as all R code including some not in the blog text are attached to this page. Click the "Files" link at the bottom of the page.

Greg McNulty, FCAS
SCOR Reinsurance
gmcnulty@scor.com

Bibliography
1. Camphausen, F. et al. “Package ‘distr’”. Cran.org. Version 2.4. February 7, 2013.
2. Dutang, C. et al. “actuar: An R Package for Actuarial Science”. Journal of Statistical Software. March 2008, Volume 25, Issue 7.
3. Frees, E.; Valdez, E. “Understanding Relationships Using Copulas”. North American Actuarial Journal, Volume 2, Number 1. 1998.
4. Genest, C.; MacKay, J. “The Joy of Copulas: Bivariate Distributions with Uniform Marginals”. The American Statistician, Volume 40, Issue 4 (Nov. 1986), 280-283.
5. Geyer, C. “Maximum Likelihood in R”. www.stat.umn.edu/geyer. September 30, 2003.
6. Hofert, M. et al. “Package ‘copula’”. Cran.org. Version 0.999-5, November 2012.
7. Joe, Harry. Multivariate Models and Dependence Concepts. Monographs on Statistics and Applied Probability 73, Chapman & Hall/CRC, 2001.
8. Kojadinovic, I.; Yan, J. “Modeling Multivariate Distributions with Continuous Margins Using the copula R Package”. Journal of Statistical Software. May 2010, Volume 34, Issue 9.
9. Micocci, M.; Masala, G. “Loss-ALAE modeling through a copula dependence structure”. Investment Management and Financial Innovations. Volume 6, Issue 4, 2009.
10. Ricci, V. “Fitting Distributions with R”. Cran.r-project.org. Release 0.4-21, February 2005.
11. Ruckdeschel, P. et al. “S4 Classes for Distributions—a manual for packages "distr", "distrEx", "distrEllipse", "distrMod", "distrSim", "distrTEst", "distrTeach", version 2.4”. r-project.org. February 5, 2013.
12. Venter, G. “Tails of Copulas”. Proceedings of the Casualty Actuarial Society. Arlington, Virginia. 2002: LXXXIX, 68-113.
13. Yan, J. “Enjoy the Joy of Copulas: With a Package copula”. Journal of Statistical Software. October 2007, Volume 21, Issue 4.