An Individual Claims Reserving Model for Reported Claims

We present a claims reserving technique that uses claim-specific feature and past payment information in order to estimate claims reserves for individual reported claims. We design one single neural network allowing us to estimate expected future cash flows for every individual reported claim. We introduce a consistent way of using dropout layers in order to fit the neural network to the incomplete time series of past individual claims payments. A proof of concept is provided by applying this model to a data set for which the true outstanding payments for reported claims are known.


Introduction
Claims reserving is of major importance in general insurance. The main task is to estimate the amount required to cover future payments for claims that have already occurred. Traditionally, these claims reserves are calculated using aggregate data such as upper claims reserving triangles. The most popular method for aggregate data is Mack's chain-ladder model [26]. Working on aggregate data simplifies the modeling task considerably, however, it also leads to a considerable loss of information. Using individual claims data has the potential to improve the accuracy of claims reserving. Therefore, developing new claims reserving techniques that go beyond modeling aggregate data has become increasingly popular in actuarial science. Surveys of recently developed claims reserving techniques can be found in [4] and [33]. The individual claims reserving models introduced in the last couple of years can broadly be separated into models with and without the use of machine learning methods. For semi-parametric individual claims reserving models not using machine learning techniques we refer to [1,6,19,21,23,27,28,34,36]. A disadvantage of these models is that their assumptions are often rather rigid, which makes them difficult to apply in practice. More recently, machine learning has also found its way into claims reserving research, see [2,8,9,10,12,22,25,35] for approaches based on regression trees, gradient boosting machines and neural networks. Claims reserves can be split into an RBNS (reported but not settled) part and an IBNR (incurred but not reported) part. Here, we focus on modeling RBNS reserves, which consist of estimated expected future payments for claims that have already been reported. Usually, the biggest share of the claims reserves is due to such RBNS claims.

A single neural network model
We consider synthetic data generated by the individual claims history simulation machine [16]. For every individual reported claim we are equipped with claim-specific feature information such as age or labor sector of the insured. After a claim gets reported, it can trigger several payments in the following years. The time lags (on a yearly scale) between payments and reporting date are called payment delays. At the end of the last considered accident year, for every reported claim we have claim specific feature information and past payment information. The main goal is to use this information in order to estimate expected future payments for every individual reported claim. This corresponds to estimating RBNS reserves. We design a single neural network for this task. It models the expected payments in all considered payment delay periods on the basis of claim-specific feature and past payment information, separately for every individual reported claim. This single neural network consists of several subnets. Every subnet serves the purpose of modeling the expected payment in the corresponding payment delay period. The separate subnets allow for sufficient flexibility to jointly model all considered payment delay periods in one neural network. However, the subnets are still connected to each other by sharing some of the neural network parameters. This increases stability of the neural network, and allows us to learn from one payment delay period to the other. The procedure of simultaneously learning multiple tasks is called multi-task learning, a concept introduced in [5]; for an overview we refer to [30]. For readers not familiar with neural networks, we refer to [11,13,17,24,31]. Fitting the single neural network to the observed payments then allows us to jointly estimate the expected future payments, for every individual reported claim. This provides us with an estimate of RBNS reserves based on individual claims information.

Incomplete time series and dropout
A key problem of individual claims reserving models is the incomplete time series structure of past payments. Individual claims are equipped with different amounts of past payment information, depending on the reporting years of the claims. In particular, for every observed payment, the corresponding past payment information is also known. Thus, an individual claims reserving model fitted to data might rely too heavily on past payment information. The problem then is that when estimating expected future payments, not all past payment information is known, and the resulting estimation might suffer from a systematic bias. The issue of having different lengths of past payment information is usually solved by using every individual claim multiple times during model fitting. For every replication of a particular individual claim, a different amount of past payment information is assumed to be known. This artificial augmentation of the number of individual claims is superfluous in terms of additional information. Moreover, this technique substantially increases the time required to fit the model. We propose a different approach here. The problem of the incomplete time series structure of past payment information is addressed by combining embedding layers and dropout layers. We assume that expected future payments depend on past payments only through the order of magnitude of these payments. We then introduce the additional label "no information" for past payment variables, and set the neural network embedding weights for this label to zero. This does not result in a loss of generality, as there are no restrictions for the other labels. By applying dropout layers on past payment information during neural network fitting, i.e. by randomly setting some of the corresponding neurons to zero, we mimic the situation of missing parts of the past payment information. This forces the neural network to learn how to cope with the situation of only knowing parts of the past payment information. For implementation we use the statistical software package R, see [29]. The R code can be downloaded from GitHub: https://github.com/gabrielliandrea/neuralnetworkindividualrbnsclaimsreserving.

Outline
In Section 2 we introduce the data used in this manuscript. In Section 3 we set up the model framework for individual claims reserving. In Section 4 we describe the single neural network architecture in detail. In Section 5 we train this neural network by applying the dropout technique on past payment information, and we calculate the individual RBNS claims reserves. In Section 6 we analyse the individual RBNS claims reserves in terms of neural network stability and in terms of sensitivity with respect to claim-specific feature and past payment information.

Data
In this manuscript we use a synthetic insurance portfolio generated from the individual claims history simulation machine [16]. We generate exactly the same insurance portfolio as in [14,15]. This portfolio is divided into six lines of business (LoBs). As we use the same model separately for every LoB, we refrain from using a LoB component in our notation. For all six LoBs, we consider individual claims from I = 12 accident years. For our RBNS claims reserving exercise we assume that we are at the end of the last considered accident year I, that is, we only consider those claims which are reported until the end of calendar year I. We assume that these n reported claims are labeled with v = 1, . . . , n; see Table 1 for the numbers of individual reported claims until time I for all six LoBs. Every individual reported claim v is equipped with corresponding feature information
• x_{v,1}: claims code cc, which is categorical with labels in X_1 ⊂ {1, . . . , 53};
• x_{v,2}: accident year AY in X_2 = {1, . . . , 12};
• x_{v,3}: accident quarter AQ in X_3 = {1, 2, 3, 4};
• x_{v,4}: age of the injured age (in 5-year age buckets) in X_4 = {20, 25, . . . , 70};
• x_{v,5}: injured part inj part, which is categorical with labels in X_5 ⊂ {1, . . . , 99};
• x_{v,6}: reporting delay RepDel in X_6 = {0, . . . , 11}.
The feature claims code cc denotes the labor sector of the insured. We remark that not all the values in the set {1, . . . , 53} are attained. In fact, we only have |X 1 | = 51 different observed values. Originally, the claims have accident years AY in {1994, . . . , 2005}. In order to ease notation we relabel these I = 12 accident years with values {1, . . . , 12}. The accident quarter AQ labels {1, 2, 3, 4} denote the time periods January-March, April-June, July-September and October-December, respectively, of the accident. The individual claims history simulation machine [16] provides age of the injured age from age 15 to age 70 on a yearly scale. Assuming that the approximate age is sufficient, we simplify our model by summarizing the ages in 5 years age buckets {20, 25, . . . , 70}. The feature injured part inj part denotes the part of the body injured. We observe |X 5 | = 46 different values. The reporting delay RepDel denotes the time lag between accident year and reporting year. The individual claims history simulation machine assumes that all claims are fully settled after development period J = 11, without considering a tail development factor. The accident year x v,2 and the reporting delay x v,6 determine the reporting year r v = x v,2 + x v,6 ∈ {1, . . . , I + J} of claim v, for all v = 1, . . . , n. We define X = X 1 × X 2 × · · · × X d ⊂ R d to be the feature space of the d-dimensional features x 1 , . . . , x n . In Figure 1 we provide the distribution of the portfolio of LoB 1 for all six features, only considering individual claims which have been reported until calendar year I. We approximately have uniform distributions for accident year AY and accident quarter AQ. For reporting delay RepDel we observe a quite imbalanced distribution. Most of the claims have reporting delay 0, and the proportions of claims with reporting delay bigger than 1 are very small.
For every individual reported claim v = 1, . . . , n the individual claims history simulation machine [16] additionally provides us with the complete cash flow of total yearly payments. Considering Y_{v,j}, subscript j = 0, . . . , J − x_{v,6} indicates the payment delay after reporting year r_v. If a total yearly payment Y_{v,j} is equal to 0, then there is no net payment in payment delay period j for claim v. A positive total yearly payment indicates a net payment from the insurance company to the policyholder; a negative total yearly payment a net recovery payment from the policyholder to the insurance company. As we only consider claims developments until development period J, the last yearly payment of claim v is Y_{v, J−x_{v,6}}. We remark that the individual claims history simulation machine [16] generates synthetic data with complete knowledge of the cash flows. As we assume to be at the end of calendar year I, all payment information beyond calendar year I is only used for back-testing our predictions. In Figure 2 we provide descriptive statistics for the payments observed in LoB 1 up to calendar year I. In the top row of Figure 2 we present the empirical probabilities of having a (positive) payment as well as the empirical mean of the logarithmic (positive) payments for all payment delay periods j = 0, . . . , 11. We deduce that the probability of a payment decreases with increasing payment delay j, and that the mean logarithmic payment increases for j = 0, 1, 2, 3 and is more or less constant for later payment delays j = 4, . . . , 11. In the bottom row of Figure 2 we show the densities of the logarithmic (positive) payments for all payment delay periods j = 0, . . . , 11. We observe approximate normal distributions on the log-scale, i.e. log-normal distributions on the original scale. For payment delays j = 4, . . . , 11 the densities have a very similar shape, with slightly decreasing variance. For more details on the generated dataset we refer to Appendix A of [15]. In RBNS claims reserving we are interested in the true outstanding payments for reported claims given by

R^{\rm true}_{\rm RBNS} = \sum_{v=1}^{n} \sum_{j=I-r_v+1}^{J-x_{v,6}} Y_{v,j}.

Our main goal is to estimate the expected outstanding payments E[R^{\rm true}_{\rm RBNS} | D_I], conditionally given D_I, where D_I reflects past payment information at time I. In order to tackle this RBNS prediction problem on an individual claims level, we need to define an appropriate model. This is done in the next section.

Model assumptions
In this section we set up the model framework for individual claims reserving. We assume that we work on a sufficiently rich probability space (Ω, F, P), and that the cash flow is a random vector on this probability space depending on the features x_v ∈ R^d, for all individual claims v = 1, . . . , n. In our run-off situation we assume that we are at the end of calendar year I. Thus, the available information at time I is given by

D_I = {x_v : v = 1, . . . , n} ∪ {Y_{v,j} : v = 1, . . . , n and 0 ≤ j ≤ I − r_v}.

Our goal is to predict expected future payments, conditionally given information D_I. We assume that expected future payments do not depend on the exact values of past payments, but only on the simplified information "no payment", "recovery payment", "small payment", "medium payment", "big payment" and "extremely big payment". To this end, we define the function g : R → {0, 1, 2, 3, 4, 5} by

g(y) =
  0, if y = 0,
  1, if y < 0,
  2, if 0 < y ≤ 5'000,
  3, if 5'000 < y ≤ 20'000,
  4, if 20'000 < y ≤ 100'000,
  5, if y > 100'000.        (3.1)

Thus, available past payments are summarized using six labels {0, 1, 2, 3, 4, 5}. We introduce the additional label "6" corresponding to "no information". This label is used for past payments which are unknown at the time of estimation. Consequently, for all individual reported claims v = 1, . . . , n, the past payment information used for estimating the expected payment in payment delay period j is a j-dimensional vector in {0, . . . , 6}^j. We can now introduce the model assumptions used for our individual RBNS claims reserving model.
We remark that in assumption (a1) we assume that individual claims v = 1, . . . , n are independent from each other. This is a common assumption in individual claims reserving. It allows us to treat all individual claims separately. Assumption (a2) has already been mentioned above. It implies that expected future payments only depend on past payments through the simplifying function g(·) defined in (3.1).
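To make the payment bucketing concrete, the following R sketch implements a function in the spirit of g(·) from (3.1), together with the additional label 6 for unknown past payments; the function name and the treatment of missing values via NA are illustrative choices, not part of the manuscript.

```r
# sketch of the payment bucketing g: R -> {0, ..., 5} of (3.1), extended by the
# label 6 for "no information" (here encoded as NA); thresholds as described above
g <- function(y) {
  if (is.na(y)) return(6)          # unknown past payment: "no information"
  if (y == 0)   return(0)          # no payment
  if (y < 0)    return(1)          # recovery payment
  if (y <= 5000)   return(2)       # small payment
  if (y <= 20000)  return(3)       # medium payment
  if (y <= 100000) return(4)       # big payment
  5                                # extremely big payment
}

sapply(c(0, -500, 1200, 15000, 50000, 2e6, NA), g)   # 0 1 2 3 4 5 6
```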
While assumption (a3) is rather natural, assumption (a4) is more delicate, as payment size modeling in general insurance is difficult. However, we emphasize that the model is not used to simulate future payments, and that we are only interested in expected values. In light of the bottom row of Figure 2, the log-normal assumption can be justified here. We remark that in assumptions (a3) and (a4) we assume that the payments are not known, and the past payment information corresponding to payment delay periods k + 1, . . . , j − 1 is set to the value 6, corresponding to the class "no information". This implies that this model can be used regardless of the amount of past payment information.
According to assumption (a5), we use a rather crude model for recoveries. We assume that the probability of a recovery and the mean recovery payment are described by feature-independent values. The proportion of the recovery payments to the positive payments until time I is roughly only -0.06% for all LoBs. Thus, we choose to keep our model as simple as possible, and to focus on the positive payments. For data sets where recovery payments play a more important role, we could use assumptions (a3) and (a4) also for recoveries. Using Model Assumptions 3.1, we have

E[R^{\rm true}_{\rm RBNS} | D_I] = \sum_{v=1}^{n} \sum_{j=I-r_v+1}^{J-x_{v,6}} \Big( p_j(x_v, g(Y_{v,0}), . . . , g(Y_{v,I-r_v}), 6, . . . , 6) \exp\{ \mu_j(x_v, g(Y_{v,0}), . . . , g(Y_{v,I-r_v}), 6, . . . , 6) + \sigma_j^2/2 \} + p_- \mu_- \Big).        (3.2)

In the next section we model the J + 1 probability functions {p_j(·)}_{j=0,...,J} and the J + 1 regression functions {µ_j(·)}_{j=0,...,J} with a single neural network.

Neural network architecture
The neural network presented here jointly models the J + 1 probability functions {p_j(·)}_{j=0,...,J} and the J + 1 regression functions {µ_j(·)}_{j=0,...,J}. The advantage of such a joint model is that we enable learning from one probability/regression function to the other via parameter sharing. That is, we can exploit commonalities and differences between the individual tasks, which can increase model performance. However, setting up a joint model for 2(J + 1) = 24 different functions is a balancing act: on the one hand, the neural network needs to be flexible in order to capture the structure in the data as well as possible for every modeled function; on the other hand, we have to be careful to avoid overfitting, i.e. to avoid learning all the peculiarities of the training data. Therefore, our neural network framework will have characteristics providing flexibility, and characteristics contributing to stability. In a first step, we define the embedding layers used in this neural network, see Section 4.1. In a second step, we create J + 1 = 12 subnets within the neural network, where subnet j models probability function p_j(·) and regression function µ_j(·) used in payment delay period j, see Section 4.2. The basic idea is to create separate feed-forward neural networks for every one of these J + 1 subnets, allowing for flexibility in a joint model of all probability functions {p_j(·)}_{j=0,...,J} and regression functions {µ_j(·)}_{j=0,...,J} in one neural network. However, the probability/regression functions are still connected to each other by shared embedding weights. The shared embedding weights increase model stability and allow us to learn from one payment delay period to the other. For simplicity, apart from the claim-specific feature and past payment information used, these J + 1 subnets have exactly the same design in terms of number of hidden layers and numbers of neurons in these hidden layers.

Embedding layers
For all j = 0, . . . , J, probability function p_j(·) and regression function µ_j(·) are functions defined on X × {0, . . . , 6}^j. As we model all probability functions {p_j(·)}_{j=0,...,J} and regression functions {µ_j(·)}_{j=0,...,J} in one neural network, the input of the neural network is of the form (x, y) ∈ X × {0, . . . , 6}^J, where x = (x_1, . . . , x_d) ∈ X reflects the claim-specific feature information and y = (y_0, . . . , y_{J−1}) ∈ {0, . . . , 6}^J the past payment information. This input (x, y) ∈ X × {0, . . . , 6}^J consists of the categorical features claims code cc, injured part inj part and the J = 11 past payment information, as well as of the continuous (but categorized) features accident year AY, accident quarter AQ, age age and reporting delay RepDel. In neural networks, embedding layers introduced in [3] provide an easy and elegant way to treat categorical input variables. They enable us to learn a low-dimensional representation of the considered categorical variable, as every level of this categorical variable is mapped to a vector in R^p, for some p ∈ N. The entries of these embedding vectors are then simply parameters of the neural network, and consequently they have to be trained. Even though embedding layers have originally been introduced in order to deal with categorical input variables, they can also be used for continuous variables that have been summarized into categories. The four continuous (but categorized) features accident year AY (12 categories), accident quarter AQ (4 categories), age age (11 categories) and reporting delay RepDel (12 categories summarized to 3 categories, see below) do not consist of too many categories. Therefore, believing that the gain in flexibility outweighs the loss of stability, we use embedding layers also for these input variables. We choose two-dimensional embeddings for all inputs (x, y) ∈ X × {0, . . . , 6}^J except for the accident year AY represented by x_2, for which we use a three-dimensional embedding. We come back to the reason for this choice in Section 4.2.4, below. For all l = 1, 3, . . . , d the two-dimensional embedding of the levels u_{l,1}, . . . , u_{l,|X_l|} ∈ X_l is defined by

u_{l,k} ↦ (α_{l,1}(u_{l,k}), α_{l,2}(u_{l,k})) ∈ R^2,

for all k = 1, . . . , |X_l|. The set {α_{l,1}(u_{l,k}), α_{l,2}(u_{l,k})}_{k=1,...,|X_l|} is the set of parameters to be learned for feature l representing the claims code cc (l = 1), the accident quarter AQ (l = 3), the age of the injured age (l = 4), the part of the body injured inj part (l = 5) and the reporting delay RepDel (l = d = 6). For accident year AY represented by x_2 we use the three-dimensional embedding

u_{2,k} ↦ (α_{2,1}(u_{2,k}), α_{2,2}(u_{2,k}), α_{2,3}(u_{2,k})) ∈ R^3,

where {α_{2,1}(u_{2,k}), α_{2,2}(u_{2,k}), α_{2,3}(u_{2,k})}_{k=1,...,|X_2|} is the set of parameters to be learned for the feature accident year AY. In Figure 1 we have seen that in LoB 1 reporting delays bigger than 1 occur very rarely. We have a similar situation for all other LoBs. Therefore, we set α_{d,1}(r) = α_{d,1}(2) and α_{d,2}(r) = α_{d,2}(2) for all r = 3, . . . , 11, i.e. reporting delays bigger than 1 are summarized into one category, and are treated in the same way.
For all elements of past payment information y = (y_0, . . . , y_{J−1}) ∈ {0, . . . , 6}^J we use the two-dimensional embedding

y_k ↦ β(y_k) = (β_1(y_k), β_2(y_k)) ∈ R^2,

where {β_1(k), β_2(k)}_{k=0,...,6} is the set of parameters to be learned for past payment information. We remark that one could define separate embedding layers for every element of past payment information y = (y_0, . . . , y_{J−1}) ∈ {0, . . . , 6}^J. However, for all k = 0, . . . , J − 1, element y_k corresponds to the payment in payment delay period k. As we observe fewer and fewer payments for later payment delays, the model can become unstable with separate embedding layers for the elements of past payment information. Moreover, it may also make sense to use the same embedding weights for all elements of past payment information, as this allows the neural network to transfer model structure from one payment delay period to the other. We recall that y_k ∈ {0, . . . , 5} indicates the order of magnitude of the payment in payment delay period k = 0, . . . , J − 1. If this information is not known at time I, then we have y_k = 6. While the embedding weights β(0), . . . , β(5) have to be trained, we set β(6) = (β_1(6), β_2(6)) = (0, 0), and we declare these two embedding weights to be non-trainable, i.e. they do not change during model fitting. As there are no restrictions for the other embedding weights, this choice does not result in a loss of generality. It only fixes the origin of the embedding. With this choice we can use dropout layers during neural network fitting in order to mimic the situation of not knowing parts of the past payment information, see Section 5.3. Summarizing, the embedding layers map the input variables (x, y) ∈ X × {0, . . . , 6}^J to a (2d + 1 + 2J)-dimensional vector of embedding values, consisting of the two-dimensional feature embeddings (three-dimensional for the accident year AY) and the two-dimensional past payment embeddings.
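The following Keras (R) sketch illustrates how such a shared past payment embedding can be set up so that label 6 is mapped to (0, 0). The layer name pay_embed is a hypothetical choice, and keeping this row fixed during the subsequent training (as done in the manuscript) would require an additional constraint that is not shown here; the sketch only sets its value.

```r
library(keras)

# sketch: shared two-dimensional embedding for the past payment labels 0, ..., 6,
# with the "no information" label 6 mapped to (0, 0); layer name is illustrative
pay_in  <- layer_input(shape = 1, dtype = "int32", name = "past_payment_label")
pay_out <- pay_in %>%
  layer_embedding(input_dim = 7, output_dim = 2, name = "pay_embed") %>%
  layer_flatten()
toy <- keras_model(pay_in, pay_out)

# set the embedding row of label 6 (row 7 of the weight matrix) to zero;
# beta(0), ..., beta(5) remain freely trainable
W <- get_weights(get_layer(toy, "pay_embed"))
W[[1]][7, ] <- 0
set_weights(get_layer(toy, "pay_embed"), W)

predict(toy, matrix(c(2, 6), ncol = 1))   # second row is (0, 0)
```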

Subnet j of the neural network
The single neural network consists of J + 1 subnets, where, for all j = 0, . . . , J, subnet j models probability function p_j(·) and regression function µ_j(·). In Figure 3 we provide the model architecture of subnet j = 3. The five colors reflect the different parts of the model. The input variables (x_1, . . . , x_d, y_0, y_1, y_2) ∈ X × {0, . . . , 6}^3 are given in blue color, the two-dimensional embedding weights from the previous section in green color. The first set of embedding weights of the input variables (x_1, . . . , x_d, y_0, y_1, y_2) ∈ X × {0, . . . , 6}^3 is processed through a classical feed-forward neural network with two hidden layers. This is the orange part of the model in Figure 3. Probability function p_3(·) and regression function µ_3(·) both share this feed-forward neural network. In order to increase the task-specific flexibility, the feed-forward neural network is followed by additional hidden layers. On the one hand, the feed-forward neural network is followed by a third hidden layer which is used to model probability function p_3(·). In addition to this third hidden layer, probability function p_3(·) is directly linked to the second set of embedding weights of the input variables (x_1, . . . , x_d, y_0, y_1, y_2) ∈ X × {0, . . . , 6}^3. This is the red part of the model. On the other hand, the feed-forward neural network is followed by a fourth hidden layer which is used to model regression function µ_3(·). Completely analogously, regression function µ_3(·) is also directly linked to the second set of embedding weights of the input variables (x_1, . . . , x_d, y_0, y_1, y_2) ∈ X × {0, . . . , 6}^3. This is the magenta part of the model. The box labeled "second embedding weights" appears twice in Figure 3 only for graphical reasons.
The individual parts of this neural network model are described in detail below. As all J + 1 subnets have exactly the same neural network design, we present subnet j for a general j = 0, . . . , J. The only difference between the J +1 subnets are the input variables used. For subnet j we use input variables of the form (x, y 0 , . . . , y j−1 ) = (x 1 , . . . , x d , y 0 , . . . , y j−1 ) ∈ X × {0, . . . , 6} j .

Feed-forward neural network
This part of the model corresponds to the orange part in Figure 3.
The vector of first embedding values

z^{(j)}_0 = (α_{1,1}(x_1), . . . , α_{d,1}(x_d), β_1(y_0), . . . , β_1(y_{j−1})) ∈ R^{d+j}

serves as input of a standard feed-forward neural network with two hidden layers. The superscript (j) in z^{(j)}_0 refers to subnet j, and the subscript 0 to the input layer of the feed-forward neural network of this subnet j. We remark that the input vector z^{(j)}_0 ∈ R^{d+j} only contains the corresponding first embedding values α_{1,1}(x_1), . . . , α_{d,1}(x_d), β_1(y_0), . . . , β_1(y_{j−1}) of the inputs x_1, . . . , x_d, y_0, . . . , y_{j−1}. The corresponding second embedding values α_{1,2}(x_1), . . . , α_{d,2}(x_d), β_2(y_0), . . . , β_2(y_{j−1}) are not processed through the feed-forward neural network but are directly linked to the outputs p_j(·) and µ_j(·) of this subnet via a skip connection in order to speed up learning, see below.
As activation function of the feed-forward neural network we choose the hyperbolic tangent activation function φ = tanh. This activation function has the advantage that it allows gradients to be calculated in an efficient manner and, as it is bounded, that the activations in the neurons do not explode. These two properties are useful in neural network fitting as they can considerably accelerate gradient descent methods. The first and second hidden layers of this feed-forward neural network of subnet j consist of q_1 and q_2 neurons, respectively, and are denoted by z^{(j)}_1 and z^{(j)}_2, respectively. By the universal approximation property of neural networks, see [7] and [20], one hidden layer is sufficient to approximate any continuous and compactly supported function arbitrarily well. The drawback of choosing only one hidden layer is that we may require too many neurons, which makes the model more difficult to calibrate. Moreover, additional hidden layers stimulate the learning of interactions between input features. Therefore, we choose two hidden layers for the feed-forward neural network part. We come back to the numbers of neurons q_1 and q_2 below.

Probability function p j (·)
The model for probability function p_j(·) corresponds to the red part in Figure 3. As the second hidden layer z^{(j)}_2 is used by both probability function p_j(·) and regression function µ_j(·), we introduce additional task-dependent hidden layers. The second hidden layer z^{(j)}_2 is processed to a third hidden layer z^{(j)}_3 with q_3 neurons, which is used together with the second embedding values α_{1,2}(x_1), . . . , α_{d,2}(x_d), β_2(y_0), . . . , β_2(y_{j−1}) of the inputs x_1, . . . , x_d, y_0, . . . , y_{j−1} to define probability function p_j(·). As p_j(·) models a probability, its range has to be in the unit interval [0, 1]. Therefore, as output function we use the logistic function, which we denote here by ϕ(·). Probability function p_j(·) is then defined by

p_j(x, y_0, . . . , y_{j−1}) = ϕ( γ^{(j)}_{3,0} + ⟨ η^{(j)}_3, z^{(j)}_3 ⟩ + \sum_{l=1}^{d} γ^{(j)}_{3,l} α_{l,2}(x_l) + \sum_{k=0}^{j−1} γ^{(j)}_{3,d+1+k} β_2(y_k) ),        (4.1)

where γ^{(j)}_{3,0}, . . . , γ^{(j)}_{3,d+j} ∈ R and η^{(j)}_3 ∈ R^{q_3} are parameters of the neural network, and the operation ⟨·, ·⟩ denotes the usual scalar product in Euclidean space. We remark that, as in classical neural network modeling, the output neuron containing the probability function p_j(·) depends on the previous hidden layer z^{(j)}_3. However, in order to speed up learning, we directly link the second embedding values α_{1,2}(x_1), . . . , α_{d,2}(x_d), β_2(y_0), . . . , β_2(y_{j−1}) of the inputs x_1, . . . , x_d, y_0, . . . , y_{j−1} to the output neuron. This step can be interpreted as a skip connection between the embedding layers and the output neuron. Skip connections have originally been introduced in [18], and help to speed up training of deep neural networks by linking non-consecutive layers. The direct dependence of probability function p_j(·) on the second embedding values α_{1,2}(x_1), . . . , α_{d,2}(x_d), β_2(y_0), . . . , β_2(y_{j−1}) of the inputs x_1, . . . , x_d, y_0, . . . , y_{j−1} can also be viewed as a GLM (generalized linear model) step, which is additionally supported by the neural network through the hidden layer z^{(j)}_3.

Regression function µ j (·)
In order to also have a task-dependent layer for regression function µ_j(·), the second hidden layer z^{(j)}_2 is processed to a fourth hidden layer z^{(j)}_4 with q_4 neurons. Thus, the third hidden layer and the fourth hidden layer are not consecutive but parallel layers. Regression function µ_j(·) is defined using the fourth hidden layer z^{(j)}_4 together with the second embedding values α_{1,2}(x_1), . . . , α_{d,2}(x_d), β_2(y_0), . . . , β_2(y_{j−1}) of the inputs x_1, . . . , x_d, y_0, . . . , y_{j−1}, as well as the third embedding value α_{2,3}(x_2) of the input x_2 representing accident year AY. As µ_j(·) models the mean parameter of a log-normal distribution, its range is the whole real line. We define

µ_j(x, y_0, . . . , y_{j−1}) = γ^{(j)}_{4,0} + ⟨ η^{(j)}_4, z^{(j)}_4 ⟩ + \sum_{l=1}^{d} γ^{(j)}_{4,l} α_{l,2}(x_l) + γ^{(j)}_{4,d+1} α_{2,3}(x_2) + \sum_{k=0}^{j−1} γ^{(j)}_{4,d+2+k} β_2(y_k),        (4.2)

where γ^{(j)}_{4,0}, . . . , γ^{(j)}_{4,d+j+1} ∈ R and η^{(j)}_4 ∈ R^{q_4} are parameters of the neural network.
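A minimal Keras (R) sketch of one such subnet is given below. For simplicity the embedded feature and past payment values enter as plain numeric inputs (in the full model they come from the shared embedding layers), the special treatment of the accident year is omitted, the activation of the task-specific layers is assumed to be tanh as well, and all names and layer sizes are illustrative; the neuron counts correspond to the choices made further below.

```r
library(keras)

d <- 6; j <- 3                      # number of features and past payment inputs (subnet j = 3)
q1 <- 40; q2 <- 30; q3 <- 10; q4 <- 10

# first and second embedding values of (x_1, ..., x_d, y_0, ..., y_{j-1})
emb1 <- layer_input(shape = d + j, name = "first_embedding_values")
emb2 <- layer_input(shape = d + j, name = "second_embedding_values")

# shared feed-forward part with two tanh hidden layers (orange part in Figure 3)
z2 <- emb1 %>%
  layer_dense(units = q1, activation = "tanh") %>%
  layer_dense(units = q2, activation = "tanh")

# task-specific third and fourth hidden layers (parallel, not consecutive)
z3 <- z2 %>% layer_dense(units = q3, activation = "tanh")
z4 <- z2 %>% layer_dense(units = q4, activation = "tanh")

# skip connections: the second embedding values enter the output neurons directly (GLM part)
p_j  <- layer_concatenate(list(z3, emb2)) %>%
  layer_dense(units = 1, activation = "sigmoid", name = "p_j")
mu_j <- layer_concatenate(list(z4, emb2)) %>%
  layer_dense(units = 1, activation = "linear", name = "mu_j")

subnet <- keras_model(inputs = list(emb1, emb2), outputs = list(p_j, mu_j))
summary(subnet)
```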

Feature accident year AY
For the feature accident year AY we face a very particular problem. The single neural network has to be trained using individual claims information available at time I. This implies that for payment delay period j = 0 we have payment observations Y_{v,0} for all accident years {1, . . . , I}. But this is not the case for later payment delay periods j = 1, . . . , J. For example, for the last considered payment delay period J we only have observations Y_{v,J} for individual claims v with accident year x_{v,2} = 1. Therefore, in subnet j = J the neural network weights corresponding to the neurons containing the embedding weights of the accident years are only trained for accident year equal to 1. Thus, we cannot control the effect of these weights on later accident years. However, we would like to estimate expected payments in this last payment delay period J also for claims with accident years greater than 1. This problem is circumvented by sharing neural network weights for the feature accident year. For all subnets j = 0, . . . , J we use the same q_1 weights linking the embedding value α_{2,1}(x_2) to the first hidden layer, and the same two weights linking the embedding values α_{2,2}(x_2) and α_{2,3}(x_2) to the probability function and the regression function, respectively. This ensures that these weights can be trained for all accident years, yielding a better control of this feature component. As this reduces the flexibility of the probability/regression models w.r.t. the feature accident year AY, we use as a countermeasure three embedding weights for this feature component. The first embedding weight α_{2,1}(x_2) is used as input of the feed-forward neural networks, the second embedding weight α_{2,2}(x_2) as GLM part for the probability functions {p_j(·)}_{j=0,...,J}, and the third embedding weight α_{2,3}(x_2) as GLM part for the regression functions {µ_j(·)}_{j=0,...,J}.

Numbers of neurons
Usually, one selects the numbers of neurons by training the neural network for a set of possible numbers of neurons and opting for the combination for which one observes the lowest out-of-sample loss. However, here we directly choose small numbers of neurons in order to decrease the risk of over-fitting and to control run times and stability. Following the guideline q_1 > q_2 > q_3 + q_4, reflecting the idea that information processed through the hidden layers should be condensed until it reaches the output, we decide to use q_1 = 40, q_2 = 30, q_3 = q_4 = 10.
We remark that we choose small numbers of neurons and that the neural network is performing multi-task learning, which already acts as a regularization technique. Therefore, apart from early stopping, see Sections 5.4 and 5.5, we refrain from using additional regularization techniques.

Output of the neural network model
Summarizing, for an input (x, y) ∈ X × {0, . . . , 6}^J, the outputs of the single neural network are the J + 1 probability functions {p_j(·)}_{j=0,...,J} and the J + 1 regression functions {µ_j(·)}_{j=0,...,J}. Training this neural network allows us to learn from one probability/regression function to the other. This is done in the next section. We remark that even though all J + 1 probability functions and all J + 1 regression functions are modeled in a single neural network, we cannot train all 2(J + 1) output functions for every individual claim v = 1, . . . , n. For an individual claim v with reporting year r_v ∈ {1, . . . , I}, we can only train the probability functions p_0(·), . . . , p_{I−r_v}(·), and the regression functions µ_j(·) for those payment delay periods j = 0, . . . , I − r_v for which we have observed a payment. This requires a special discussion of the choice of loss function, see Section 5.2.

Individual RBNS claims reserves
It remains to estimate the J + 1 probability functions {p_j(·)}_{j=0,...,J}, the J + 1 regression functions {µ_j(·)}_{j=0,...,J}, the J + 1 variance parameters {σ^2_j}_{j=0,...,J}, the recovery probability p_− and the recovery mean µ_−. Training of the single neural network modeling the J + 1 probability functions {p_j(·)}_{j=0,...,J} and the J + 1 regression functions {µ_j(·)}_{j=0,...,J} is described in Sections 5.1-5.5. In Section 5.6 we estimate the variance parameters {σ^2_j}_{j=0,...,J}. For the recovery probability p_− and the recovery mean µ_− we refer to Section 5.7. The resulting individual RBNS claims reserves are presented in Section 5.8. In this section we use the notation Y_{v,j} instead of Y_{v,j}(x_v) for payments observed at time I, as here we are not interested in modeling the payments as random variables, but rather in using the observed payments in order to fit the model and estimate the expected outstanding payments. The single neural network is trained using the Keras library within R. Neural network modeling with Keras involves some randomness, for example in the initialization of the neural network weights or the choice of the mini-batches. In order to get reproducible results, we set the Keras seed in a random but reproducible way.

Parameter initialization
Before training a neural network, it might make sense to choose particular starting values for some parameters of the neural network. Such an initialization allows us to exploit information which is already available, and it can speed up training. Let a_j denote the empirical probability of having a positive payment in payment delay period j among all claims v for which Y_{v,j} is known at time I, and let b_j denote the empirical mean logarithmic payment size in payment delay period j among all claims v for which Y_{v,j} is known at time I and Y_{v,j} > 0. For all subnets j = 0, . . . , J, in (4.1) and (4.2) we initially set all weights except the intercepts γ^{(j)}_{3,0} and γ^{(j)}_{4,0} equal to 0, i.e.

η^{(j)}_3 = 0 ∈ R^{q_3},  η^{(j)}_4 = 0 ∈ R^{q_4}  and  γ^{(j)}_{3,l} = γ^{(j)}_{4,l} = 0 for all l ≥ 1.        (5.1)

Then, for all subnets j = 0, . . . , J we set

γ^{(j)}_{3,0} = \log( a_j / (1 − a_j) )  and  γ^{(j)}_{4,0} = b_j.        (5.2)

With initializations (5.1) and (5.2), the probability function p_j(·) and the regression function µ_j(·) are initially given by p_j(x, y_0, . . . , y_{j−1}) = a_j and µ_j(x, y_0, . . . , y_{j−1}) = b_j, for all j = 0, . . . , J. This initialization implies that we start from the homogeneous model not considering any covariate information.
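As an illustration, the empirical values a_j and b_j and the corresponding intercept initializations of (5.2) could be computed along the following lines; obs is a hypothetical data frame with one row per payment cell known at time I (columns j for the payment delay and Y for the payment), not an object defined in the manuscript.

```r
library(dplyr)

# empirical frequencies a_j of positive payments and mean log-payments b_j per payment delay j,
# and the corresponding intercept initializations; `obs` is a hypothetical data frame
init <- obs %>%
  group_by(j) %>%
  summarise(
    a = mean(Y > 0),                 # empirical probability of a positive payment
    b = mean(log(Y[Y > 0]))          # empirical mean of the logarithmic positive payments
  ) %>%
  mutate(
    gamma3_0 = log(a / (1 - a)),     # intercept of p_j, so that initially p_j = a_j
    gamma4_0 = b                     # intercept of mu_j, so that initially mu_j = b_j
  )
```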

Loss function
In order to train the neural network using information D_I, we have to define a loss function. For probability function p_j(·) of subnet j, the available response variables are the indicators 1_{{Y_{v,j} > 0}} of those claims v for which Y_{v,j} is known at time I, i.e. for which r_v + j ≤ I. We define the binary cross-entropy loss function

L_{j,1}(p_j(·), D_I) = − \sum_{v : r_v + j ≤ I} \Big( 1_{{Y_{v,j} > 0}} \log p_j(x_v, g(Y_{v,0}), . . . , g(Y_{v,j−1})) + 1_{{Y_{v,j} ≤ 0}} \log\big(1 − p_j(x_v, g(Y_{v,0}), . . . , g(Y_{v,j−1}))\big) \Big).

We remark that those claims v for which r_v + j > I, i.e. for which Y_{v,j} ∉ D_I, do not contribute to the loss function. Thus, we only use information known at time I. For regression function µ_j(·) of subnet j, the available response variables are the logarithmic payments log Y_{v,j} of those claims v with Y_{v,j} > 0 known at time I. We define the mean square loss function

L_{j,2}(µ_j(·), D_I) = \sum_{v : r_v + j ≤ I, Y_{v,j} > 0} \big( \log Y_{v,j} − µ_j(x_v, g(Y_{v,0}), . . . , g(Y_{v,j−1})) \big)^2.

The mean square loss function L_{j,2}(µ_j(·), D_I) only uses information D_I known at time I. Moreover, claims v for which Y_{v,j} ≤ 0 do not contribute to the loss function, as we model the positive payment size conditionally, given that there is a positive payment. Training the single neural network requires simultaneous minimization of the loss functions {L_{j,1}(p_j(·), D_I), L_{j,2}(µ_j(·), D_I)}_{j=0,...,J}. In multi-task learning this is achieved by minimizing a weighted sum of the considered individual loss functions. Therefore, we define the global loss function of the single neural network by

L({p_j(·), µ_j(·)}_{j=0,...,J}, D_I) = \sum_{j=0}^{J} \big( w_{j,1} L_{j,1}(p_j(·), D_I) + w_{j,2} L_{j,2}(µ_j(·), D_I) \big),        (5.3)

where, for all j = 0, . . . , J, the weights w_{j,1} and w_{j,2} are given in (5.4) and (5.5), respectively. The weights defined in (5.4) and (5.5) ensure that all 2(J + 1) individual loss functions {L_{j,1}(p_j(·), D_I), L_{j,2}(µ_j(·), D_I)}_{j=0,...,J} live on the same scale, i.e. that they contribute equally to the overall loss function. This normalization step is important to make sure that all tasks of the neural network can be learned. We train the single neural network by minimizing the global loss function given in (5.3). This can be achieved by iteratively applying (a version of) the stochastic gradient descent method.
Here, we choose mini-batches of size 10'000 and the optimizer nadam within the Keras library; we refer to Sections 8.1.3 and 8.5.2 of [17] for theoretical background.
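In Keras (R) the corresponding fitting call could look roughly as follows; model, the input/response objects, the weight vectors w_1, w_2 and n_epochs are placeholders, the restriction to payments known at time I (e.g. via sample weights) is omitted, and the random 80/20 split described below is replaced by validation_split for brevity.

```r
# sketch of the compile/fit step for the single neural network; `model` is assumed to have
# one sigmoid output per probability function and one linear output per regression function
model %>% compile(
  optimizer    = optimizer_nadam(),
  loss         = c(rep("binary_crossentropy", 12), rep("mse", 12)),  # one loss per output
  loss_weights = c(w_1, w_2)                                         # normalization as in (5.4)-(5.5)
)
model %>% fit(
  x = train_inputs, y = train_responses,
  batch_size = 10000, epochs = n_epochs,
  validation_split = 0.2
)
```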
Dropout on past payment information

During neural network training, all past payments used as input are known, whereas when estimating expected future payments parts of the past payment information are missing and are encoded by the label 6, whose embedding weights are fixed at β_1(6) = β_2(6) = 0. Usually, this consistency problem is circumvented by using the individual claims multiple times, assuming a different amount of past payment information for every replication. This enables the neural network to also learn how to process the embedding weights β_1(6) = β_2(6) = 0. However, this technique leads to a massive increase of data, slowing down the neural network fitting, without any gain of additional information. Therefore, we opt for a different strategy here, exploiting the properties of dropout layers and using every observation only once.

More precisely, during neural network training, for every subnet j = 2, . . . , J and every individual claim v, the embedding values (β_l(g(Y_{v,0})), . . . , β_l(g(Y_{v,j−1}))) ∈ R^j of the past payment information are replaced by

(β_l(g(Y_{v,0})), . . . , β_l(g(Y_{v,m})), 0, . . . , 0) ∈ R^j,  each with probability 1/j, for m = 0, . . . , j − 1.        (5.6)
For l = 1, 2 exactly the same entries of the embedding vector are set to 0. The choices of the embedding vector in (5.6) are independent for different subnets j ∈ {2, . . . , J} and individual claims v = 1, . . . , n. We do not need to apply dropout for subnets j = 0 and j = 1, because for subnet j = 0 no past payment information is used, and for subnet j = 1 the only past payment Y_{v,0} is known for all claims. By using structure (5.6) we mimic the situation we also encounter when estimating expected future payments. For an individual claim v and payment delay j with I − r_v < j ≤ J − x_{v,6}, we rely on past payment information (g(Y_{v,0}), . . . , g(Y_{v,I−r_v}), 6, . . . , 6) ∈ {0, . . . , 6}^j with embedding weights (β_l(g(Y_{v,0})), . . . , β_l(g(Y_{v,I−r_v})), 0, . . . , 0) ∈ R^j, l = 1, 2. Moreover, considering subnet j = 2, . . . , J, among the individual claims v for which we need to predict the expected payment in payment delay period j, the past payment Y_{v,k}, k < j, is known for roughly (j − k)/j of the claims (assuming a roughly uniform distribution among reporting years). Summing up the probabilities given in (5.6) for the cases for which β_l(g(Y_{v,k})) is not set to 0, we exactly get (j − k)/j. We conclude that during neural network training, in subnet j = 2, . . . , J the embedding value β_l(g(Y_{v,k})) is used in roughly (j − k)/j of the cases, exactly as in the estimation procedure.
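The following plain R sketch illustrates the effect of (5.6) at the level of the past payment labels: since β(6) = (0, 0), relabeling an entry to 6 is equivalent to setting its embedding values to 0. It is an illustration of the mechanism only, not of the dropout-layer implementation used for the actual fitting; the function name is hypothetical.

```r
# for subnet j, keep only the first m+1 past payment labels (m uniform on 0, ..., j-1) and
# set the remaining labels to 6 ("no information"); P(label k kept) = (j - k)/j as required
mask_past_payments <- function(y_labels) {
  j <- length(y_labels)
  if (j < 2) return(y_labels)              # subnets j = 0, 1 need no masking
  m <- sample(0:(j - 1), size = 1)         # each value of m has probability 1/j
  if (m < j - 1) y_labels[(m + 2):j] <- 6  # entries after position m + 1 are masked
  y_labels
}

mask_past_payments(c(2, 3, 0, 4))          # e.g. 2 3 6 6, depending on the draw
```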

First training step: embedding weights
We divide neural network training into two steps. In a first step, we train all neural network weights, but we only store the embedding weights introduced in Section 4.1. The embedding weights are a crucial ingredient in a neural network, as they allow us to distinguish between categorical levels of the features. We interpret training of embedding weights as choosing reasonable representations of the feature variables. Having chosen the embedding weights beforehand then allows us to start the actual neural network training. Thus, in a second step, we keep the learned embedding weights fixed, and re-train the remaining neural network weights. We still need to determine the numbers of epochs for both the first and the second training step. An epoch refers to considering all available data exactly once in the gradient descent iteration. As we use the stochastic gradient descent method nadam in Keras, one epoch corresponds to several gradient descent steps. In order to determine the numbers of epochs for the first training step, we randomly allocate 80% of the n individual claims to a training set and the remaining 20% to a validation set. We then train the neural network only using the training set. After every epoch, we calculate the value of the loss function (5.3) on the training set and the validation set.
The results for the six LoBs are given in Figure 4. The blue points indicate the values of the loss function on the training set (in-sample) and the red points the losses on the validation set (out-of-sample). Generally speaking, the neural network is learning (true) model structure of the data as long as the validation loss decreases. Once this validation loss flattens out or starts to increase, we have reached the phase of over-fitting, where we are not learning (true) model structure of the data anymore but rather the noisy part in the training set, which is undesirable. The chosen numbers of epochs for the six LoBs are given on line (i) of Table 2, see also the green lines in Figure 4. Note that for simplicity we only consider numbers of epochs in {k · 10 | k ∈ N}.

Table 2: Chosen numbers of epochs in the first training step and corresponding time required for training.
After this training/validation analysis, we train the neural network using all n individual claims for the chosen number of epochs. The time required for this training on a personal MacBook Pro with CPU @ 2.20GHz (6 CPUs) with 16GB RAM is given on line (ii) of Table 2 for all six LoBs. The resulting values of the embedding weights are stored, and kept constant during the second training step.

Second training step: neural network weights
In the second training step we keep the embedding weights fixed, i.e. we only train the remaining neural network weights, which are initially set to the same values as before the first training step. Apart from this, we proceed as above. We use the same splitting of the n individual claims into a training set and a validation set. The resulting training and validation losses are given in Figure 5, for all six LoBs. If we compare the validation losses of Figure 4 and Figure 5, we observe smaller values for the second training step. This indicates that determining the embedding weights beforehand leads to better out-of-sample performance. The chosen numbers of epochs for the six LoBs in the second training step are given on line (i) of Table 3, see also the green lines in Figure 5. Finally, we train the neural network using all n individual claims for the chosen number of epochs. We remark that we do not use a single number of epochs in order to estimate the probability/regression functions; rather, we average the resulting estimates over several epochs around the chosen number of epochs in order to smooth the results, see also Section 6.1.
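The two training steps can be organized in Keras (R) roughly as follows; the toy model below only serves to make the snippet self-contained, and the layer name embed_pay as well as all sizes, data and epoch numbers are illustrative, not those of the manuscript.

```r
library(keras)

# toy stand-in for the single neural network, just to make the two-step procedure concrete
build_model <- function() {
  inp <- layer_input(shape = 1, dtype = "int32")
  out <- inp %>%
    layer_embedding(input_dim = 7, output_dim = 2, name = "embed_pay") %>%
    layer_flatten() %>%
    layer_dense(units = 10, activation = "tanh") %>%
    layer_dense(units = 1, activation = "sigmoid")
  keras_model(inp, out)
}

x <- matrix(sample(0:6, 1000, replace = TRUE), ncol = 1)
y <- rbinom(1000, 1, 0.5)

# first training step: train all weights, keep only the learned embedding weights
model1 <- build_model()
model1 %>% compile(optimizer = optimizer_nadam(), loss = "binary_crossentropy")
model1 %>% fit(x, y, epochs = 5, batch_size = 100, verbose = 0)
emb_weights <- get_weights(get_layer(model1, "embed_pay"))

# second training step: re-initialize the network, plug in the stored embedding weights,
# freeze them and train the remaining weights only
model2 <- build_model()
set_weights(get_layer(model2, "embed_pay"), emb_weights)
freeze_weights(model2, from = "embed_pay", to = "embed_pay")
model2 %>% compile(optimizer = optimizer_nadam(), loss = "binary_crossentropy")
model2 %>% fit(x, y, epochs = 5, batch_size = 100, verbose = 0)
```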

Variance parameters
In order to estimate the variance parameters {σ^2_j}_{j=0,...,J}, we split the n individual claims into two halves I_1 and I_2. Then, we repeat the second training step using the chosen number of epochs from Table 3 and only considering the individual claims in I_1. Writing {µ̂^{I_1}_j(·)}_{j=0,...,J} for the estimated regression functions {µ_j(·)}_{j=0,...,J} using only individual claims in I_1, we define

σ̂^2_j = \frac{1}{|V_j|} \sum_{v ∈ V_j} \big( \log Y_{v,j} − µ̂^{I_1}_j(x_v, g(Y_{v,0}), . . . , g(Y_{v,j−1})) \big)^2,

for all j = 0, . . . , J, where V_j denotes the set of individual claims v in I_2 with a positive payment Y_{v,j} > 0 known at time I. The time required for estimating the J + 1 variance parameters on a personal MacBook Pro with CPU @ 2.20GHz (6 CPUs) with 16GB RAM is given in Table 4, for all LoBs.

Recoveries
In Model Assumptions 3.1 we assume in (a5) that the probability of a recovery and the mean recovery payment are the same, respectively, for all claims and payment delays.
In order to estimate recovery probability p_−, we define

M_{r,j} = \sum_{v=1, . . . , n : r_v = r} 1_{{Y_{v,j} < 0}},

for all reporting years r = 1, . . . , I and payment delays j = 0, . . . , I − r. In particular, M_{r,j} is the number of individual claims with reporting year r having a recovery payment in payment delay period j. Moreover, we define the vector N = (N_1, . . . , N_I) of the numbers of reported claims per reporting year r = 1, . . . , I. We estimate recovery probability p_− by

p̂_− = \frac{\sum_{r=1}^{I} \sum_{j=0}^{I−r} j \, M_{r,j}}{\sum_{r=1}^{I} \sum_{j=0}^{I−r} j \, N_r}.        (5.7)

Note that in (5.7) we use a weighting with payment delay j. The intuition behind this weighting is that when predicting future recoveries, we need to consider payment delay period j for exactly j reporting years, i.e. recovery predictions are more often needed for higher payment delays. We try to take this into account when estimating recovery probability p_− by using the weighting with the payment delay.
In order to estimate the mean recovery payment µ_−, we define

S_{r,j} = \sum_{v=1, . . . , n : r_v = r} Y_{v,j} \, 1_{{Y_{v,j} < 0}},

for all reporting years r = 1, . . . , I and payment delays j = 0, . . . , I − r. In particular, S_{r,j} is the total recovery payment for individual claims with reporting year r in payment delay period j. We estimate the mean recovery payment µ_− by the ratio of the total observed recovery payments to the total number of observed recovery payments. Finally, we estimate the expected future recovery payments by summing p̂_− µ̂_− over all reported claims v = 1, . . . , n and all outstanding payment delay periods j = I − r_v + 1, . . . , J − x_{v,6}; see Table 5 for the results. For the sake of readability we present all payment-related (absolute) results in this manuscript in units of 1'000. If we compare these values to the true future recovery payments, see line (ii) of Table 5, we observe that, on average, our crude approach works sufficiently well here. Moreover, if we compare the numbers in Table 5 to the estimated RBNS reserves and the true outstanding payments in Table 6 below, we see that the recovery payments are almost negligible for our data set.

Table 5: Estimated expected future recovery payments and true future recovery payments.

Results
The individual RBNS claims reserves R^{IC}_{RBNS} (IC = individual claims) are defined by

R^{IC}_{RBNS} = \sum_{v=1}^{n} \sum_{j=I−r_v+1}^{J−x_{v,6}} \Big( p̂_j(x_v, g(Y_{v,0}), . . . , g(Y_{v,I−r_v}), 6, . . . , 6) \exp\{ µ̂_j(x_v, g(Y_{v,0}), . . . , g(Y_{v,I−r_v}), 6, . . . , 6) + σ̂^2_j/2 \} + p̂_− µ̂_− \Big),

see also (3.2). The results are given on line (i) of Table 6, for all six LoBs. We see that we get quite close to the true outstanding payments for reported claims given on line (ii). On lines (iii) and (iv) we provide the absolute and relative biases of the individual RBNS claims reserves R^{IC}_{RBNS}. We observe that we always stay within 2% of the true outstanding payments. For additional analyses of the individual RBNS claims reserves we refer to the next section.
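As an illustration of how the above reserve formula can be evaluated claim by claim, consider the following R sketch; all object names (fitted probability and mean matrices, variance vector, recovery scalars) are hypothetical placeholders and not objects defined in the manuscript.

```r
# sketch of the reserve of one claim v: sum over its outstanding payment delay periods of the
# estimated probability times the estimated expected log-normal payment, plus the recovery term;
# p_hat, mu_hat (claims x delays), sigma2_hat, p_rec and mu_rec are placeholders
reserve_claim <- function(v, first_open_delay, last_delay) {
  if (first_open_delay > last_delay) return(0)   # claim already fully developed
  delays <- seq(first_open_delay, last_delay)
  sum(p_hat[v, delays + 1] * exp(mu_hat[v, delays + 1] + sigma2_hat[delays + 1] / 2) +
        p_rec * mu_rec)
}
```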

Analysis of the individual RBNS claims reserves
The results of the individual RBNS claims reserves R IC RBNS presented in Table 6 seem convincing in the sense that they are very close to the true outstanding payments for reported claims R true RBNS . In this section we take a closer look at the individual RBNS claims reserves. First, we check the stability of the single neural network. In a second step, we analyse whether the neural network is indeed able to capture the claim-specific feature and past payment information.

Stability of the neural network
We challenge the stability of the single neural network in two ways. On the one hand, we calculate the individual RBNS claims reserves using different numbers of epochs in the second training step. On the other hand, we calculate the individual RBNS claims reserves using different seeds in the second training step. The individual RBNS claims reserves R^{IC}_{RBNS} presented on line (i) of Table 6 are calculated based on E training epochs in the second training step, where E has been chosen according to the training/validation analysis in Figure 5, and is given on line (i) of Table 3. The validation losses (in red) in Figure 5 are rather erratic, and slightly smaller or bigger numbers of epochs can also be justified. In order to analyse the stability of the neural network, we therefore recalculate the individual RBNS claims reserves for a range of numbers of epochs around E; the resulting relative biases with respect to the true outstanding payments are shown in Figure 6, for all six LoBs. The dashed green lines indicate zero bias, the dashed red lines a bias of ±5%. The numbers of epochs chosen beforehand in Section 5.5 are indicated by the dashed vertical lines. We observe that we stay rather close to the true outstanding payments R^{true}_{RBNS}. Therefore, even though the choice of the number of epochs remains crucial, there is some leeway. We remark that one can increase stability of the individual RBNS claims reserves by averaging over a higher number of epochs. As a second stability check we repeat the second training step for 100 different seeds, always using the number of epochs E given on line (i) of Table 3 and averaging over {E − 2, E − 1, E, E + 1, E + 2}. The boxplots of the relative biases of the resulting 100 individual RBNS claims reserves with respect to the true outstanding payments are given in Figure 7, for all six LoBs. The dashed green lines indicate zero bias, the dashed red lines a bias of ±5%. We observe that even though we did not choose seed-dependent numbers of epochs, we predominantly stay within 5% of the true outstanding payments. Moreover, considering the medians of the biases, we can conclude that the individual RBNS claims reserving model does not suffer from a systematic bias.

Individual feature and past payment information
The neural network individual RBNS claims reserving model works very well in terms of small biases with respect to the true outstanding payments R^{true}_{RBNS}. We may wonder if this is really due to the ability of the neural network to correctly process the provided claim-specific feature and past payment information. To answer this question, we analyse the performance of the individual RBNS claims reserving model on smaller subsets of the n individual claims. For every feature l = 1, . . . , d and every level u_{l,k} ∈ X_l = {u_{l,1}, . . . , u_{l,|X_l|}} of this feature l, we can estimate the average reserves for individual claims with level u_{l,k} of feature l. These estimated average reserves can be compared to the true average outstanding payments. For LoB 1 the results for all d = 6 features claims code cc, accident year AY, accident quarter AQ, age age, injured part inj part and reporting delay RepDel are presented in logarithmic scale in the first two rows of Figure 8. The (categorical) levels of the d = 6 features are represented by the blue numbers of their corresponding classes. The green diagonals reflect the identity between the estimated average reserves from the individual RBNS claims reserving model and the true average outstanding payments. We see that in general the observations are close to the diagonals. This implies that the neural network is correctly processing the claim-specific feature information. Moreover, by considering the ranges of the y-axes, we see that the feature inj part is the most important among the d = 6 features (excluding the accident year AY) in terms of (expected) future payments. Similarly, for all payment delay periods j = 0, . . . , J − 1 and past payment information classes c ∈ {0, . . . , 6} we can estimate the average reserves for individual claims v with past payment information g(Y_{v,j}) = c, and compare them to the true average outstanding payments. For LoB 1 the results for payment delay periods j = 0, 2, 9 are presented in logarithmic scale in the third row of Figure 8. We observe that generally the neural network is also able to correctly process the claim-specific past payment information. In particular, on average the neural network knows how to deal with past payment information class "6" corresponding to "no information". We remark that for payment delay period j = 0 the classes "1" ("recovery payment") and "6" are not available, as there are no recovery payments in this payment delay period and as the payments in this payment delay period are known for all individual claims. The ranges of the y-axes indicate that past payment information is the main driver of expected future payments. This closes the analysis of the individual RBNS claims reserves.

Conclusion
We presented a claims reserving technique on the basis of claim-specific feature and past payment information in order to estimate RBNS claims reserves. For every individual reported claim, the expected future cash flow is estimated using a single neural network. This neural network consists of several subnets, where every subnet models the expected payment for a particular payment delay period. Training this single neural network allows us to transfer model structure from one payment delay period to the other. As individual claims are not equipped with the same amount of past payment information, we combined embedding layers and dropout layers in order to consistently fit the neural network to the incomplete time series of past payments for individual reported claims. We applied this individual RBNS claims reserving model to a synthetic insurance data set with complete information consisting of six lines of business generated from the individual claims history simulation machine [16]. We observed that the individual RBNS claims reserves stayed within 2% of the true outstanding payments for all lines of business. Challenging the stability of the individual RBNS claims reserving model by choosing different numbers of epochs and different Keras seeds during neural network fitting, we observed that we predominantly stayed within 5% of the true outstanding payments. An analysis of individual feature and past payment information showed that the neural network on average correctly processes the claim-specific feature and past payment information. In particular, the neural network trained with the dropout technique applied on past payment information knows how to deal with unavailable past payment information. In this paper we did not touch upon uncertainty estimates. Uncertainty of claims reserves is usually measured in terms of (conditional) root-mean square error of prediction, which consists of process variance and parameter estimation error. Within our model the process variance could be estimated using Model Assumptions 3.1. The parameter estimation error is more tricky. One approach would be to bootstrap the observed payments for the individual claims. However, our log-normal payment size model is only appropriate for modeling expected payments. It does not allow reasonable simulations of payments, as it would not generate sufficiently extreme observations. Thus, one would need to choose a more sophisticated payment size model. Finally, a complete picture of the claims reserves would include IBNR reserves. We believe that IBNR reserves are ideally estimated on the basis of policy data. As the individual claims history simulation machine [16] does not provide policy data, we focused only on RBNS reserves.