An interpretable neural network approach to cause-of-death mortality forecasting

Sohei Tanaka; Naoki Matsuyama

doi:10.1017/S1748499524000319

An interpretable neural network approach to cause-of-death mortality forecasting

Published online by Cambridge University Press: 20 January 2025

Sohei Tanaka and

Naoki Matsuyama

Show author details

Sohei Tanaka: Affiliation:
Graduate School of Advanced Mathematical Sciences, Meiji University, Nakano, Nakano-ku, Tokyo 164-8525, Japan
Naoki Matsuyama*: Affiliation:
Graduate School of Advanced Mathematical Sciences, Meiji University, Nakano, Nakano-ku, Tokyo 164-8525, Japan
*: Corresponding author: Naoki Matsuyama; Email: [email protected]

Article contents

Abstract
Introduction
Preliminaries
Proposed model
Numerical analysis
Conclusion
Data availability statement
Funding statement
Competing interests
A
B
References

Rights & Permissions

Abstract

Cause-of-death mortality forecasting, a key topic in public health and actuarial science, is a challenging task due to the difficulty of modeling that accounts for dependencies among causes of death. While several cause-of-death mortality models have been proposed to address this difficulty, little attention has been paid to improving their predictive performance. Recently, purely data-driven approaches using tensor decomposition methods have been introduced to cause-of-death mortality modeling, demonstrating strong out-of-sample predictive performance compared to existing models. However, these methods have difficulties in the interpretability of multi-rank tensor components to achieve strong predictive performance. In response, we propose a novel tensor-based cause-of-death mortality model by replacing the tensor decomposition with a convolutional autoencoder with a one-dimensional latent layer that provides a Lee-Carter-like time-series factor; the model also provides the age sensitivity of cause-specific log mortality to the time-series factor. Due to the representational capability of the neural network, our model achieves better predictive performance compared to the existing tensor decomposition-based models, despite the simplified latent layer for model interpretability.

Keywords

Adaptive penalized tensor decomposition cause-of-death mortality convolutional autoencoder

Type: Original Research Paper
Information: Annals of Actuarial Science , First View , pp. 1 - 20

DOI: https://doi.org/10.1017/S1748499524000319 [Opens in a new window]
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2025. Published by Cambridge University Press on behalf of Institute and Faculty of Actuaries

1. Introduction

Cause-of-death mortality forecasting is a key topic in public health and actuarial practice; however, it is a challenging task due to the difficulty of modeling that accounts for dependencies among the causes of death with different evolving patterns. While early studies on cause-of-death mortality modeling assumed independence among causes of death, such as fitting the Lee-Carter (LC) model (Lee & Carter, Reference Lee and Carter1992) to each cause of death, cause-of-death mortality models with interdependencies have been proposed in the current decade using various techniques, including the vector error correction models or cointegration techniques (Arnold & Sherris, Reference Arnold and Sherris2013, Reference Arnold and Sherris2015; Arnold & Glushko, Reference Arnold and Glushko2021), multinomial logistic regression (MLR) method (Alai et al., Reference Alai, Arnold and Sherris2015), forecast reconciliation techniques to ensure coherence between aggregate and cause-specific mortality projections (Li & Lu, Reference Li and Lu2019), hierarchical Archimedean copula approach (Li et al., Reference Li, Li, Lu and Panagiotelis2019), and stochastic age-period-cohort models (Redondo-Loures and Cairns, Reference Redondo-Loures and Cairns2021). Recently, novel data-driven approaches to cause-of-death mortality modeling have been proposed by Huynh & Ludkowski (Reference Huynh and Ludkowski2024) and Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023), and our study is in the same context. Huynh & Ludkowski (Reference Huynh and Ludkowski2024) introduce Gaussian process regression, a Bayesian kernel-based data-driven regression framework that can provide confidence intervals but does not focus on point estimate accuracy; indeed, they do not compare predictive performance with existing models. As a natural data-driven approach from the perspective of a tensor (i.e., multidimensional array) data format, Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023) introduce tensor decomposition techniques to cause-of-death mortality modeling, with a motivation based on the acknowledgment that “existing cause-of-death mortality approaches have primarily focused on interpretation and inference, rather than necessarily aiming to improve out-of-sample performance,” with which we agree. Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023) propose the adaptive tensor decomposition (ADAPT) method, an improvement of the canonical polyadic decomposition (CPD) method, and jointly model the three dimensions (age, year, and cause of death) in a multi-rank tensor form using the method. The ADAPT method reduces the computational burden of the penalized CPD method proposed by Madrid-Padilla & Scot (Reference Madrid-Padilla and Scot2017). The out-of-sample prediction of the ADAPT model involves a two-stage estimation applying time-series models to the year factors. On the United States male cause-of-death mortality data, Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023) demonstrate the strong out-of-sample predictive performance of their ADAPT model compared to MLR, unpenalized CPD, and some existing models, including the cause-specific LC model and the product-ratio model (Hyndman et al., Reference Hyndman, Booth and Yasmeen2013). The product-ratio model accounts for dependencies among causes of death, but the underlying coherence assumption does not allow for different evolving patterns for each cause of death in the mid- and long-term predictions. While the ADAPT model demonstrates strong predictive performance, the multi-rank tensor decomposition defined by cross-validation loses model interpretability. Only the first factor (age, year, and cause of death) can be given some interpretation; however, it is less desirable for a cause-of-death mortality model because the age factor remains the same regardless of the cause of death.

To address the interpretability issue of the multi-rank tensor decomposition, which is required to improve the predictive performance, we propose a novel tensor-based cause-of-death mortality model by replacing the tensor decomposition with a neural network (NN) with a one-dimensional latent layer, aiming to achieve cause-specific LC-like interpretability (i.e., representing age sensitivities of cause-specific log mortality with respect to a single descending time-series factor) without losing the predictive performance. Our model uses a convolutional autoencoder (CAE) architecture, which is widely used for image anomaly detection; it includes convolutional neural networks (CNN) (LeCun et al., Reference Lecun, Boser, Denker, Henderson, Howard, Hubbard and Jackel1990), suitable for tensor data processing, and an autoencoder architecture for the dimensionality reduction as an alternative to tensor decomposition. Similar to the ADAPT model, the CAE’s out-of-sample prediction involves a two-stage estimation that applies time-series models to the latent variables in the bottleneck layer of the autoencoder,

In comparison with the ADAPT model, we demonstrate the predictive performance and model interpretability of the CAE model on the World Health Organization (WHO) cause-of-death data. From the data, we select five populations (Japanese males and females, British males and females, and German males) and summarize each population’s data into eight causes. These populations are specifically selected to observe the most recent data (up to 2019) and ensure no zero data in the eight causes of death. The eight selected causes of death include increasing mortality to observe the representability of models for the causes with increasing mortality rate.

The remainder of this paper is organized as follows. Section 2 reviews tensor decomposition techniques for cause-of-death mortality modeling and provides an overview of CAE basics; Section 3 explains the proposed CAE model; Section 4 presents the numerical results of the CAE model in comparison with the ADAPT model. Finally, Section 5 concludes the paper.

2. Preliminaries

2.1 Tensor decomposition for cause-of-death mortality modeling

Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023) introduce a regularized tensor decomposition technique to cause-of-death mortality modeling to suppress overlearning and improve prediction performance. Regularized tensor decomposition is based on CPD penalized by generalized lasso penalties. An ADAPT method is proposed to overcome the computational challenge of the penalized CPD. Madrid-Padilla & Scot (Reference Madrid-Padilla and Scot2017) provide the theoretical backgrounds on the tensors and the CPD.

Let the cause-of-death log mortality data be a tensor denoted by $\log \boldsymbol{M}\in \mathbb{R}^{{n_{c}}\times {n_{a}}\times T}$, where $n_{c},n_{a},T$ are the number of segments of cause-of-death, age, and observation year, respectively. A rank-R CPD aims to decompose the cause-of-death log mortality data into the sum of R-rank-1 tensors (i.e., R outer products of vectors) by solving the problem:

(1)

\begin{equation} \underset{\boldsymbol{u}_{r}\in \mathbb{R}^{{n_{c}}},\boldsymbol{v}_{r}\in \mathbb{R}^{{n_{a}}},\boldsymbol{w}_{r}\in \mathbb{R}^{T}}{\mathit{minimize}} \left| \mathit{\log }\; \boldsymbol{M}-\sum _{r=1}^{R}d_{r}\cdot \boldsymbol{u}_{r}\circ \boldsymbol{v}_{r}\circ \boldsymbol{w}_{r}\right| _{T}, \end{equation}

where $0\lt d_{r}\in \mathbb{R}, r=1,\ldots, R;$ the vectors $\boldsymbol{u}_{r},\boldsymbol{v}_{r},\boldsymbol{w}_{r}, r=1,\ldots, R$ are normalized to unit length for identifiability and correspond to the estimated factor of cause, age, and observation year, respectively; the operation $\circ$ denotes the outer product and $| | _{T}$ denotes the tensor norm (a dimensional generalization of the Frobenius norm), throughout the paper. The parameters $\boldsymbol{v}_{r},\boldsymbol{w}_{r}$ ($r=1,\ldots, R$) have some similarity to the age and time components of the generalized LC model of Renshaw & Haberman (Reference Renshaw and Haberman2003) when the model is fitted separately to each cause of death; however, the CPD is applied to all three dimensions of the tensors simultaneously.

Madrid-Padilla & Scot (Reference Madrid-Padilla and Scot2017) propose a penalized CPD to promote smoothness and sparsity in tensor decomposition. A rank-R penalized CPD with generalized lasso penalties on the vectors $\boldsymbol{u}_{r},\boldsymbol{v}_{r},\boldsymbol{w}_{r}$ is given by solving the problem:

(2)

\begin{align} &\underset{\boldsymbol{u}_{r}\in \mathbb{R}^{{n_{c}}},\boldsymbol{v}_{r}\in \mathbb{R}^{{n_{a}}},\boldsymbol{w}_{r}\in \mathbb{R}^{T}}{\mathit{minimize}} \left| \mathit{\log }\; \boldsymbol{M}-\sum _{r=1}^{R}d_{r}\cdot \boldsymbol{u}_{r}\circ \boldsymbol{v}_{r}\circ \boldsymbol{w}_{r}\right| _{T}\nonumber\\ &\quad+\sum _{r=1}^{R}\left(\lambda _{u,r}\left| \boldsymbol{D}_{r}^{u}\boldsymbol{u}_{r}\right| _{1}+\lambda _{v,r}\left| \boldsymbol{D}_{r}^{v}\boldsymbol{v}_{r}\right| _{1} +\lambda _{w,r}\left| \boldsymbol{D}_{r}^{w}\boldsymbol{w}_{r}\right| _{1}\right), \end{align}

where $\lambda _{u,r}$,$\lambda _{u,v}$,$\lambda _{w,r}$,$r=1,\ldots, R$ are the tuning parameters and $\boldsymbol{D}_{r}^{u},\boldsymbol{D}_{r}^{v},\boldsymbol{D}_{r}^{w}, r=1,\ldots, R$ are the corresponding penalty matrices. Here, $| | _{1}$denotes the L1 norm and the unit norm constraints for $\boldsymbol{u}_{r},\boldsymbol{v}_{r},\boldsymbol{w}_{r}, r=1,\ldots, R$ are relaxed to convex constraints (i.e., the norms are less than or equal to one) for optimization without affecting the performance of the decomposition. The larger the $R$, the better the model fit, but the lower the generalization performance for prediction, so it must be determined by grid search (see Section 4.2). However, selecting the tuning parameter values for $R\geq 3$ is computationally infeasible because the number of combinations of possible sets of tuning parameter values is too large; this problem motivates the introduction of the ADAPT technique proposed by Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023).

Replacing the penalty matrices $\boldsymbol{D}_{r}^{u},\boldsymbol{D}_{r}^{v},\boldsymbol{D}_{r}^{w}$ and tuning parameters $\lambda _{u,r}$,$\lambda _{u,v}$,$\lambda _{w,r}$ in Equation (2) with prespecified adaptively weighted penalty matrices $\tilde{\boldsymbol{D}}_{r}^{u},\tilde{\boldsymbol{D}}_{r}^{v},\tilde{\boldsymbol{D}}_{r}^{w}$ and three tuning parameters $\lambda _{u},\lambda _{v},\lambda _{w},$ the rank-R ADAPT is formulated as

(3)

\begin{align} & \underset{\boldsymbol{u}_{r}\in \mathbb{R}^{{n_{c}}},\boldsymbol{v}_{r}\in \mathbb{R}^{{n_{a}}},\boldsymbol{w}_{r}\in \mathbb{R}^{T}}{\mathit{minimize}} \left| \mathit{\log }\; \boldsymbol{M}-\sum _{r=1}^{R}d_{r}\cdot \boldsymbol{u}_{r}\circ \boldsymbol{v}_{r}\circ \boldsymbol{w}_{r}\right| _{T}\nonumber\\ &\quad +\sum _{r=1}^{R}\left(\lambda _{u}\left| \tilde{\boldsymbol{D}}_{r}^{u}\boldsymbol{u}_{r}\right| _{1}+\lambda _{v}\left| \tilde{\boldsymbol{D}}_{r}^{v}\boldsymbol{v}_{r}\right| _{1}+\lambda _{w}\left| \tilde{\boldsymbol{D}}_{r}^{w}\boldsymbol{w}_{r}\right| _{1}\right). \end{align}

In Equation (3), the adaptively weighted penalty matrix $\tilde{\boldsymbol{D}}_{r}^{u}$ and the tuning parameter $\lambda _{u}$ are set to zero, since no penalty is needed to promote smoothness and sparsity on the cause-of-death dimension; the other adaptively weighted penalty matrices $\tilde{\boldsymbol{D}}_{r}^{v},\tilde{\boldsymbol{D}}_{r}^{w}$ are determined using the vector estimates $\tilde{\boldsymbol{v}}_{r},\tilde{\boldsymbol{w}}_{r}$ from the unpenalized CPD, as follows.

(4)

\begin{equation}\tilde{\tau }_{r}^{v}=\boldsymbol{D}^{v}\tilde{\boldsymbol{v}}_{r},\tilde{\tau }_{r}^{w}=\boldsymbol{D}^{w}\tilde{\boldsymbol{w}}_{r};\; \tilde{\boldsymbol{D}}_{r}^{v}=\frac{\boldsymbol{D}^{v}}{\left| \tilde{\boldsymbol{\tau }}_{r}^{v}\right| }, \tilde{\boldsymbol{D}}_{r}^{w}=\frac{\boldsymbol{D}^{w}}{\left| \tilde{\boldsymbol{\tau }}_{r}^{w}\right|},\,\, r=1,\ldots, R,\end{equation}

where the penalized matrices $\boldsymbol{D}^{v}, \boldsymbol{D}^{w}$are

(5)

\begin{equation}\boldsymbol{D}^{v}=\left[\begin{array}{c} -1\\ 0\\ \vdots \\ 0 \end{array}\begin{array}{c} 1\\ -1\\ \vdots \\ 0 \end{array}\begin{array}{c} 0\\ 1\\ \vdots \\ 0 \end{array}\begin{array}{c} \cdots \\ \cdots \\ \ddots \\ \cdots \end{array}\begin{array}{c} 0\\ 0\\ \vdots \\ -1 \end{array}\begin{array}{c} 0\\ 0\\ \vdots \\ 1 \end{array}\right]\in \mathbb{R}^{(n_{a}-\textbf{1})\times {n_{a}}}, \boldsymbol{D}^{w}=\left[\begin{array}{c} -1\\ 0\\ \vdots \\ 0 \end{array}\begin{array}{c} 1\\ -1\\ \vdots \\ 0 \end{array}\begin{array}{c} 0\\ 1\\ \vdots \\ 0 \end{array}\begin{array}{c} \cdots \\ \cdots \\ \ddots \\ \cdots \end{array}\begin{array}{c} 0\\ 0\\ \vdots \\ -1 \end{array}\begin{array}{c} 0\\ 0\\ \vdots \\ 1 \end{array}\right]\in \mathbb{R}^{(T-\textbf{1})\times T}.\end{equation}

Finally, the tuning parameters $\lambda _{v},\lambda _{w}$ are determined by grid search; Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023) apply five-fold rolling origin cross-validation (Liu & Yang, Reference Liu and Yang2022) to find them$.$

2.2 Convolutional autoencoder

To replace the tensor decomposition with a NN, we focus on CAE, which combines an autoencoder for dimensionality reduction and a CNN suitable for tensor data processing.

A convolution operation involves an input tensor $z\in \mathbb{R}^{{q^{(1)}}\times \ldots \times {q^{(K)}}}$ of order $K$, where the first $K-1$ components have a spatial structure and the $K$-th component stands for the channels. The number of channels is a hyperparameter of CNN and is not tied to specific elements of the data, unlike the other components.

For $K=3$, the convolution is a mapping:

(6)

\begin{equation} z^{\left(1\right)}=\left(z_{l_{1},l_{2},l_{3}}^{\left(1\right)}\right)\in \mathbb{R}^{q_{1}^{\left(1\right)}\times q_{1}^{\left(2\right)}\times q_{1}^{\left(3\right)}}\rightarrow z^{\left(2\right)}=\left(z_{i_{1},i_{2},j}^{\left(2\right)}\right)\in \mathbb{R}^{q_{2}^{\left(1\right)}\times q_{2}^{\left(2\right)}\times q_{2}^{\left(3\right)}}. \end{equation}

For example, in the later section, the first and second components of the tensors are tied to the age and cause elements of the cause-of-death mortality data, respectively. Using intercepts $b_{j}\in \mathbb{R}$ and filter weights $W_{j}=(w_{{l_{1}},{l_{2}},{l_{3}};j})\in \mathbb{R}^{{f^{(1)}}\times {f^{(2)}}\times {f^{(3)}}}$ for $j=1,\ldots, q_{2}^{(3)},$ where $f^{(1)}=q_{1}^{(1)}-q_{2}^{(1)}+1, f^{(2)}=q_{1}^{(2)}-q_{2}^{(2)}+1, \textrm{and} f^{(3)}=q_{1}^{(3)},$ the exact formula for the mapping is

(7)

\begin{equation}z_{i_{1},i_{2},j}^{(2)}=\varnothing (b_{j}+{\sum}_{l_{1}=1}^{f^{(1)}}{\sum}_{l_{2}=1}^{f^{(2)}}{\sum}_{l_{3}=1}^{f^{(3)}}w_{{l_{1}},{l_{2}},{l_{3}};j}z_{i_{1}+l_{1}-1,i_{2}+l_{2}-1,l_{3}}^{(1)}),\,\, j=1,\ldots, q_{2}^{(3)},\end{equation}

where $\varnothing$ represents an activation function and the stride (i.e., the distance between two consecutive positions of the filter weight matrices over the input data) is set to one. For simplicity, only the unit stride is used throughout this paper. Convolution operations are typically implemented with a zero-padding operation before and a pooling operation after. Zero padding is used to reshape input tensors without adding any additional parameters, as the convolution operation reduces the size of the output tensors. Pooling is also a mapping that can be expressed by Formula (5), but it uses a $p^{(1)}\times p^{(2)}$ window ($p^{(1)},p^{(2)}\in \mathbb{N}$) that provides partitions of the spatial part of the tensor and applies a pooling operator to the partitioned data; the most common pooling operators are the following.

(8)

\begin{align}&\textrm{Average}-\textrm{pooling:}\,z_{i_1, i_2, j}^{(2)} = \frac{1}{p^{(1)} p^{(2)}} {\sum}_{l_1=1}^{p^{(1)}} {\sum}_{l_2=1}^{p^{(2)}} z_{p^{(1)}(i_1-1) + l_1, p^{(2)}(i_2-1) + l_2, j}^{(1)},\nonumber \\ &\text{Max-pooling: } z_{i_{1},i_{2},j}^{(2)}=\mathop{\max }_{1\leq l_{1}\leq p^{(1)},1\leq l_{2}\leq p^{(2)}} \left\{z_{p^{(1)}(i_{1}-1)+l_{1},p^{(2)}(i_{2}-1)+l_{2},j}^{(1)}\right\}.\end{align}

While pooling layers are used with convolution layers for dimensionality reduction, autoencoder neural networks are used more generally for dimensionality reduction. A typical autoencoder network has a bottleneck architecture where the input and output layers have the same dimension (number of neurons), and a dimensionally reduced latent layer (bottleneck layer) is placed between them. The network from the input layer to the latent layer is called the encoder, and the network from the latent layer to the output layer is called the decoder. In an autoencoder network, the network parameters are determined to minimize the reconstruction error of the output relative to the input.

CAE is a neural network that combines the architectural elements of both CNN and autoencoder and is suitable for dimensionality reduction of tensor data. The encoder repeats convolution and pooling, then connects to affine layers (i.e., fully connected layers) to compress the dimensions. The decoder reconstructs the input data using a network where convolution is replaced by deconvolution (transposed convolution), and pooling is replaced by upsampling interpolation (unpooling). Deconvolution and upsampling interpolation increase the dimensionality of the data, whereas convolution and pooling reduce it. For each channel, deconvolution with $f^{(1)}\times f^{(2)}$ weight over $q_{1}^{(1)}\times q_{1}^{(2)}$ input data is equivalent to convolution with $f^{(1)}\times f^{(2)}$ weight over ($f^{(1)}$+$q_{2}^{(1)}-1$)$\times (f^{(2)}+q_{2}^{(2)}-1)$ data given by adding zeros over the $q_{1}^{(1)}\times q_{1}^{(2)}$ input data, where the output dimension $q_{2}^{(1)}\times q_{2}^{(2)}$ is greater than or equal to $q_{1}^{(1)}\times q_{1}^{(2)}$ for each axis. Upsampling interpolation is defined so that the mapping of its output with the corresponding pooling function is equal to its input. The nearest neighbor or linear interpolation is commonly used; for example, the nearest neighbor interpolation extends each element of the input into the matrix with the same element that has the same dimension as the corresponding pooling window. For more details on deconvolution, refer to Dumoulin & Visin (Reference Dumoulin and Visin2016).

3. Proposed model

To suppress overlearning and improve prediction accuracy, we use a CAE architecture that incorporates an autoencoder architecture for dimensionality reduction and CNNs for learning dependencies among causes of death, where CNNs help reduce the number of parameters through a sparse network structure and convolutional filters shared by all data points.

Figure 1 Proposed CAE architecture.

The cause-of-death log mortality rates for each observation year $t (t=1,\ldots, T)$ form an $n_{a}\times n_{c}$ dimensional matrix representing $n_{a}$ age segments and $n_{c}$ cause-of-death segments. The CAE model encodes cause-of-death mortality data for each observation year$t$ into a one-dimensional latent variable $\mu _{t}$ and decodes this latent variable into the reproduction data, as shown in Fig. 1. The simplest CAE configuration is adopted as long as the data can be reconstructed, taking into account the computational load. The encoder consists of two convolution blocks and an affine layer, and the decoder consists of an affine layer and two deconvolution blocks. Each convolution block involves zero padding, which adds rows of zeros above and below the input data, convolution with a multichannel filter of size $3\times 1,$ and average pooling with a window of size $2\times 1$. By setting the dimension for the cause-of-death element to 1, the convolution filter and the pooling widow are designed to learn only the age neighborhood effect, not the cause neighborhood effect, since the order of the causes is meaningless; however, the filter is shared among the causes to learn their dependency. Each deconvolution block performs transposed convolution and upsampling, which are the inverse operations of convolution and pooling, respectively. CAE network parameters are shared across all observation years, and are optimized to minimize the error between the cause-of-death log mortality data, $\log \boldsymbol{M}_{t}\in \mathbb{R}^{{n_{a}}\times {n_{c}}}$ and its reconstruction, $\widehat{\log \boldsymbol{M}_{t,}}$ for all observation years $t=1,\ldots, T$. Applying L2 regularization with intensity $\lambda$ to the filter weights $\boldsymbol{W}_{m,j}^{(conv)},\boldsymbol{W}_{m,j}^{(\textit{deconv})}$and affine layer weights $\boldsymbol{W}^{(enc)},\boldsymbol{W}^{(dec)}$, the loss function is given by

(9)

\begin{align} LOSS&=\sum _{t=1}^{T}\left| \mathit{\log } \boldsymbol{M}_{t}-\widehat{\mathit{\log } \boldsymbol{M}_{t}}\right| _{F} +\lambda \Bigg\{\sum _{m=1}^{2}\Bigg(\sum _{j=1}^{4m}\left| \boldsymbol{W}_{m,j}^{\left(conv\right)}\right| _{2}+\sum _{j=1}^{1+3\left(2-m\right)}\left| \boldsymbol{W}_{m,j}^{\left(\textit{deconv}\right)}\right| _{2}\Bigg)+\left| \boldsymbol{W}^{\left(enc\right)}\right| _{2}\nonumber\\ &\quad +\left| \boldsymbol{W}^{\left(dec\right)}\right| _{2}\Bigg\}, \end{align}

where $| | _{F}$ denotes the Frobenius norm and $| | _{2}$ denotes the L2 norm. The intensity $\lambda$ is a hyperparameter to be determined by validation.

Table 1 provides the details of each process in the proposed CAE model, including the output dimensions.

Table 1. Processes and their output dimensions for each observation year t (t = 1$,\ldots, $T)

($\lfloor \rfloor$ indicates the floor function).

Table 2. Cause-of-death categories

Table 3. Validated parameters of ADAPT model

Table 4. MSE (left) and MAE (right) comparisons between CAE (boldface) and ADAPT for Japan (male, female), UK (male, female), and Germany (male), from top to bottom

The predictions of the model are obtained by substituting the predicted latent variables $\mu _{t}$ ($t\gt T$) into the CAE decoder function, where the predicted latent variables are obtained by fitting time-series models to the latent variables $\mu _{1},\ldots, \mu _{T}$ derived from the CAE. As a result of the performance comparison, Autoregressive Integrated Moving Average (0, 1, 0) is adopted as the time-series model.

The model can provide age sensitivities by cause, $\beta _{x,c}^{(t)}$ (for age $x,$ cause c, year $t$) using numerical differentiation of the decoder function $f_{dec}^{(CAE)}$ with respect to the latent variable $\mu _{t}$ as follows.

(10)

\begin{equation}\beta _{x,c}^{\left(t\right)}=\frac{\partial }{\partial \mu _{t}}\left(f_{dec}^{\left(CAE\right)}\left(\mu _{t}\right)\right)_{x,c}.\end{equation}

4. Numerical analysis

4.1 Data used

In this study, we use the cause-of-death mortality data sourced from the WHO mortality database, where the causes of death are classified according to the International Classification of Diseases (ICD). Despite the revision to ICD11 in 2019, this study exclusively relies on data classified under ICD10, the most recent ICD observable in the database. Notably, we do not undertake data translation between different ICDs due to the impracticability of full translation. The age groups are segmented into 19 age groups from age 0 to ages 85 and over, as in the segmentation of Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023). While Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023) classify ICD10 causes into five major causes of death and others, all categories exhibit a decreasing mortality trend. In this paper, to include categories of causes of death with increasing mortality rates, we initially compress ICD10 categories into the 22 categories of the ICD10-compliant basic classification table of the Japanese Ministry of Health, Labor and Welfare. Subsequently, causes of death categories with zero deaths in any age category are classified as “other” and finally organized into eight categories, as shown in Table 2. As a result of this processing, only five datasets are observable through 2019 without categories of zero deaths: Japan (male and female, 1995–2019), Germany (male, 1997–2019), and the United Kingdom (male and female, 2001–2019). These five datasets are used for the subsequent analyses.

Figure 2 Box plots of CAE’s MSE for Japan (male, female), UK (male, female), and Germany (male), from left to right. Dotted lines indicate ADAPT’s MSE.

Figure 3 Observed and predicted time-series factors of ADAPT (right) and CAE (left) for Japan, UK, and Germany, from top to bottom. Dotted lines represent females.

Figure 4 ADAPT’s cause-of-death factor $u_{1}$ (left) and age factor $v_{1}$(right) for Japan, UK, and Germany, from top to bottom. Dotted lines represent females.

Figure 5 ADAPT’s $u_{2},v_{2},w_{2}$(top) and $u_{3},v_{3},w_{3}$(bottom) for Japan male data.

Figure 6 Age sensitivities of CAE (left) and observed mortalities (right) of C3 over the training data (up to 2015) from Japan (male, female), Germany (male), UK (male, female), from top to bottom.

Figure 7 Age sensitivities of CAE (left) and observed mortalities (right) of C4 over the training data (up to 2015) from Japan (male, female), Germany (male), UK (male, female), from top to bottom.

Figure 8 Reconstruction (light lines) and 50-year projection (dark lines) of eight causes of death (C1to C8 from top to bottom) mortalities using CAE (left) and ADAPT (right) for Japanese male data.

Figure 9 Reproduction (light lines) and 50-year projection (dark lines) of all-cause mortality using CAE (left) and ADAPT (right) for Japanese male data.

4.2 Model calibration

For each dataset, we use the last five years of data as the test data and the remaining years as the learning data. The ADAPT’s tuning parameters in Equation (3), ($\lambda _{v}$,$\lambda _{w}$) and $R$, are selected through a grid search using the range of 20 parameters evenly spaced on the log scale from $10^{-6}$ to $2\times 10^{-2}$ for ($\lambda _{v}$,$\lambda _{w}$), and the range from 3 to 14 for $R$. While these ranges were the same as in Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023), our selected parameters did not touch the boundaries of the ranges, as shown in Table 3.

The hyperparameters of the CAE model are determined by validation using the last five years of training data; $\lambda$ in Equation (9) is set to 0.0005 from the grids {0.05, 0.01, 0.005, 0.001,…,0.000001}, and the number of channels for each layer is set as in Table 1. The CAE’s network parameters are optimized using ADAM with an initial learning rate of 0.001 and 15000 learning epochs. PyTorch’s linear interpolation function is used for the upsampling. Network training takes about 40 s per seed using Google Colaboratory, and predictive performance is evaluated as the average of the results for 10 seeds.

4.3 Results

We compare the predictive performance of the CAE model with that of the ADAPT model, which exhibits the strongest out-of-sample predictive performance among the existing cause-of-death mortality models as demonstrated by Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023). Table 4 shows a comparison of the CAE and ADAPT models in terms of their mean squared errors (MSEs) and mean absolute errors (MAEs) on the test data, with the average values for 10 random seeds for the CAE. The CAE model shows better predictive accuracy than ADAPT for all test data. The seed robustness of the CAE model is illustrated by the box plots for 10 random seeds in Fig. 2, where the dotted lines indicate the MSEs of the ADAPT model, which uses no random seeds.

Next, we compare the interpretation of the CAE and ADAPT models. The ADAPT model has parameter sets $\boldsymbol{u}_{r},\boldsymbol{v}_{r},\boldsymbol{w}_{r}, r=1,\ldots, {R}$ when the rank is R, but we specifically focus on the first factor components $\boldsymbol{u}_{1},\boldsymbol{v}_{1},\boldsymbol{w}_{1},$ corresponding to the largest $d_{r}$ in Equation (3), following Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023). Figure 3 compares the time-series factor of both models, $\boldsymbol{w}_{1}$ for the ADAPT model and the latent variables $\mu _{1},\ldots, \mu _{T}$ for the CAE model. Both exhibit LC-like decreasing shapes. While increasing the regularization factor $\lambda _{w}$ in Equation (3) simplifies the shape of $\boldsymbol{w}_{1}$, the $\lambda _{w}$ selected by the grid search in Section 4.2 on the data in this paper does not allow $\boldsymbol{w}_{1}$ to have simplified shapes as in Zhang et al. (Reference Zhang, Huang, Hui and Haberman2023). Simplified shapes of $\boldsymbol{w}_{1}$ can be obtained by increasing $\lambda _{w}$ outside of the grid search range, resulting in larger MSEs than those in Table 4. Examples for UK male and female data are provided in the Appendix A.

Unlike the first time-series factor, the other components of the first factor are not directly comparable to those of the CAE model. In Fig. 4, the first cause-of-death factor $\boldsymbol{u}_{1}$ and the first age factor $\boldsymbol{v}_{1}$ of the ADAPT model are depicted, where C1 to C8 denote the causes of death corresponding to the numbers in Table 2. Notably, differences by country and gender are small, and the age factors exhibit a mortality-like shape. However, the first factor of the ADAPT model has limitations because the age factors remain the same regardless of the cause of death and year of observation, which may indicate the average age factor across all causes, but does not fit the purpose of cause-of-death mortality models.

Figure 5 shows the second and third factor vectors of the ADAPT model for the Japanese male data, highlighting challenges in model interpretation. While only the first factor vectors of the ADAPT model are interpretable, the three vectors take only negative values, as shown in Figs. 3 and 4, and cannot represent causes of death with increasing mortality.

The CAE model has the advantage of the model interpretability by representing $\beta _{x,c,t}$, which is the decoder sensitivity to the time-series factor by cause of death and age. In contrast to the ADAPT model, the CAE model can represent age factors by cause of death and illustrate increases in mortality for specific causes. Figures 6 and 7, for C3 and C4, respectively, show the age sensitivities $\beta _{x,c}^{(t)}$ in Equation (10) and the corresponding observed cause-of-death log mortality data over the training period (up to 2015), with the color of the lines darkening as time approaches 2015.

For C3 (diseases of the nervous system), negative values of the age sensitivities of the CAE for the elderly in each population indicate increasing mortality and are consistent with the observed elderly mortality curves for C3 darkening from bottom to top, as shown in Fig. 6. This is also consistent with the fact that C3 includes Alzheimer’s disease, and Alzheimer’s disease and other dementias, the second leading cause of death in high-income countries (including all populations in this paper) as of 2019, has experienced the largest increase in deaths since 2000 among the top ten causes of death according to the WHO’s Global Health Estimates (GHE) 2000–2019. For C4 (diseases of the circulatory system), positive values of the age sensitivities of the CAE for all ages in each population indicate decreasing mortality, and are consistent with the observed mortality curves darkening from top to bottom, as shown in Fig. 7. This is also consistent with the fact that ischemic heart disease, the leading cause of death in high-income countries as of 2019, has experienced the largest decrease in deaths since 2000 among the top ten causes of death according to the GHE 2000–2019.

Unlike the rank-R ADAPT model, which has R-rank tensor forms and requires R-dimensional time-series models for the projection, the CAE model requires only a single time-series model for the projection due to the nonlinear representability of the network. This gives the CAE model advantages in long-term projections, as shown in Figs. 8 and 9: the 50-year projections by CAE and ADAPT for eight causes of mortality (Fig. 8) and all-cause mortality (Fig. 9) from Japanese male data with dark-colored projection curves and light-colored reconstruction curves. Due to the linearity of the model, the ADAPT model shows changes in projected mortality about twice as large as the changes in reconstructed mortality, proportional to the projection period being twice as long as the training period, but the CAE model shows smaller changes in projected mortality due to the nonlinearity of the model and is more consistent with the observed slowdown in mortality improvement (see, e.g., Djeundje et al., Reference Djeundje, Haberman, Bajekal and Lu2022). Unlike the CAE model, the projected cause-of-death mortality curves with the ADAPT model lose the smoothness observed in the reconstructed curves that reflects the similarity of the cause-of-death mortality at close ages in the observations. In particular, for causes C1, C2, and C3, the ADAPT model shows unexplainable differences by age in the projected cause-of-death mortality.

5. Conclusion

While the tensor decomposition-based cause-of-death mortality models, including ADAPT, are a natural data-driven approach from a data format perspective, they have an interpretability problem due to the multi-rank tensor decomposition to achieve strong predictive performance. To address this issue in the context of tensor-based cause-of-death mortality modeling, we propose to replace the tensor decomposition with a CAE with a one-dimensional latent layer to achieve LC-like interpretability of the model; despite the simplified latent layer for interpretability, the loss of model representability is limited due to the nonlinearity of the CAE network, resulting in better out-of-sample predictive performance than the ADAPT model on WHO cause-of-death data from five populations. Expanding the dimensionality of the latent layer of the CAE could improve the predictive performance, but at the expense of the interpretability we focus on. Rather, a deeper configuration of the CAE may improve the predictive performance if computational resources are available. The CAE architecture also contributes to more natural results than ADAPT in long-term projections, in the sense that the smoothness and changes in the projected cause-of-death mortality curves by CAE are more consistent with the smoothness in the reconstructed curves and the observed slowdown in mortality improvement than those projected by ADAPT.

This model produces interesting age sensitivity curves that vary by country, gender, and cause of death. However, the analysis of medical implications remains a topic for future research. Similar to the ADAPT model, our model has a limitation of two-stage estimation, and improving the model for single-stage estimation is another topic for future work.

Data availability statement

The data and codes supporting the results of this study are available in the WHO mortality database at https://www.who.int/data/data-collection-tools/who-mortality-database and in Appendix B.

Funding statement

This study was supported by the Institute of Science and Technology, Meiji University.

Competing interests

None.

Appendix A

Figure A.1 shows ADAPT’s first factor components $\boldsymbol{u}_{1},\boldsymbol{v}_{1},\boldsymbol{w}_{1}$ corresponding to $\lambda _{w}$ = 0.145374 from the UK male and female data, where their MSEs (0.09179 for male, 0.10003 for female) on the test data are larger than those in Table 4 corresponding to the grid-searched $\lambda _{w}$(0.01623 for male, 0.01069 for female).

Figure A.1 ADAPT’s $\boldsymbol{u}_{1},\boldsymbol{v}_{1},\boldsymbol{w}_{1}$ (from left to right) corresponding to $\lambda _{w}$ = 0.145374 from UK male (top) and female (bottom) data.

Appendix B

References

Alai, D. H., Arnold, S., & Sherris, M. (2015). Cause-of-death mortality and the impact of cause-elimination. Annals of Actuarial Science, 9(1), 167–186.CrossRef Google Scholar

Arnold, S., & Glushko, V. (2021). Cause-specific mortality rates: Common trends and differences. Insurance: Mathematics and Economics, 99, 294–308.Google Scholar

Arnold, S., & Sherris, M. (2013). Forecasting mortality trends allowing for cause-of-death mortality dependence. North American Actuarial Journal, 17(4), 273–282.CrossRef Google Scholar

Arnold, S., & Sherris, M. (2015). Cause-of-death mortality: What do we know on their dependence? North American Actuarial Journal, 19(2), 116–218.CrossRef Google Scholar

Djeundje, V. B., Haberman, S., Bajekal, M., & Lu, J. (2022). The slowdown in mortality improvement rates 2011-2017: A multi-country analysis. European Actuarial Journal, 12(2), 839–878.CrossRef Google Scholar

Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv: 1603 07285v2.Google Scholar

Huynh, N., & Ludkowski, M. (2024). Joint models for cause-of death mortality in multiple populations. Annals of Actuarial Science, 18, 51–77.CrossRef Google Scholar

Hyndman, R. J., Booth, H., & Yasmeen, F. (2013). Coherent mortality forecasting: the product ratio method with functional time series models. Demography, 50(1), 261–283.CrossRef Google Scholar PubMed

Lecun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In Handwritten digit recognition with a back-propagation network (pp. 396–404).Google Scholar

Lee, R. D., & Carter, L. (1992). Modeling and forecasting US mortality. Journal of the American statistical association, 87(419), 659–671.Google Scholar

Li, H., Li, H., Lu, Y., & Panagiotelis, A. (2019). A forecast reconciliation approach to cause-of-death mortality modeling. Insurance: Mathematics and Economics, 86, 122–133.Google Scholar

Li, H., & Lu, Y. (2019). Modeling cause-of-death using hierarchical Archimedean copula. Scandinavian Actuarial Journal, 2019(3), 1–26.CrossRef Google Scholar

Liu, Z., & Yang, X. (2022). Cross validation for uncertain autoregressive model. Communications in Statistics. Simulation and Computation, 51(8), 4715–4726.CrossRef Google Scholar

Madrid-Padilla, O. H., & Scot, J. (2017). Tensor decomposition with generalized lasso penalties. Journal of Computational and Graphical Statistics, 26(3), 537–546.CrossRef Google Scholar

Redondo-Loures, C., & Cairns, A. J. (2021). Cause of death specific cohort effect. Insurance: Mathematics and Economics, 99, 190–199.Google Scholar

Renshaw, A., & Haberman, S. (2003). Lee-Carter mortality forecasting with age-specific enhancement. Insurance: Mathematics and Economics, 33(2), 255–272.Google Scholar

WORLD HEALTH ORGANIZATION. (2019). The top 10 cause of death. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death.Google Scholar

Zhang, X., Huang, F., Hui, F. K. C., & Haberman, S. (2023). Cause-of-death mortality forecasting using adaptive penalized tensor decomposition. Insurance: Mathematics and Economics, 111, 193–213.Google Scholar

Figure 1 Proposed CAE architecture.

Table 1. Processes and their output dimensions for each observation year t (t = 1$,\ldots, $T)

Table 2. Cause-of-death categories

Table 3. Validated parameters of ADAPT model

Table 4. MSE (left) and MAE (right) comparisons between CAE (boldface) and ADAPT for Japan (male, female), UK (male, female), and Germany (male), from top to bottom

Figure 2 Box plots of CAE’s MSE for Japan (male, female), UK (male, female), and Germany (male), from left to right. Dotted lines indicate ADAPT’s MSE.

Figure 3 Observed and predicted time-series factors of ADAPT (right) and CAE (left) for Japan, UK, and Germany, from top to bottom. Dotted lines represent females.

Figure 4 ADAPT’s cause-of-death factor $u_{1}$ (left) and age factor $v_{1}$(right) for Japan, UK, and Germany, from top to bottom. Dotted lines represent females.

Figure 5 ADAPT’s $u_{2},v_{2},w_{2}$(top) and $u_{3},v_{3},w_{3}$(bottom) for Japan male data.

Figure 6 Age sensitivities of CAE (left) and observed mortalities (right) of C3 over the training data (up to 2015) from Japan (male, female), Germany (male), UK (male, female), from top to bottom.

Figure 7 Age sensitivities of CAE (left) and observed mortalities (right) of C4 over the training data (up to 2015) from Japan (male, female), Germany (male), UK (male, female), from top to bottom.

Figure 8 Reconstruction (light lines) and 50-year projection (dark lines) of eight causes of death (C1to C8 from top to bottom) mortalities using CAE (left) and ADAPT (right) for Japanese male data.

Figure 9 Reproduction (light lines) and 50-year projection (dark lines) of all-cause mortality using CAE (left) and ADAPT (right) for Japanese male data.

Article contents

An interpretable neural network approach to cause-of-death mortality forecasting

Abstract

Keywords

1. Introduction

2. Preliminaries

2.1 Tensor decomposition for cause-of-death mortality modeling

2.2 Convolutional autoencoder

3. Proposed model

4. Numerical analysis

4.1 Data used

4.2 Model calibration

4.3 Results

5. Conclusion

Data availability statement

Funding statement

Competing interests

Appendix A

Appendix B

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests