Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis

The main focus of this paper is two-fold:

  1. generating short term (real-time) forecasts of the future COVID-19 cases for multiple countries;

  2. risk assessment (in terms of case fatality rate) of the novel COVID-19 for some profoundly affected countries by finding various important demographic characteristics of the countries along with some disease characteristics.

Forecast of COVID-19 cases

For the first problem, we propose a hybridization of stationary ARIMA and non-stationary WBF model to reduce the individual biases of the component models, that can generate short-term (ten days ahead) forecasts of the number of daily confirmed cases for Canada, France, India, South Korea, and the UK. The predictions of the future outbreak for different countries will be useful for the effective allocation of health care resources and will act as an early-warning system for government policymakers.

The $ARIMA(p,d,q)$ model can be mathematically expressed as follows: $$y_t = \theta_0 + \phi_1y_{t-1} + \phi_2y_{t-2} +...+\phi_py_{t-p}+\epsilon_t - \theta_1\epsilon_{t-1}-\theta_2\epsilon_{t-2}-...-\theta_q\epsilon_{t-q}$$

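where $y_t$ is the observed value at time $t$, $\epsilon_t$ is the random error at time $t$, $\phi_i$ and $\theta_j$ are the model coefficients, $p$ and $q$ are the orders of the autoregressive and moving-average parts, and $d$ is the order of differencing applied to make the series stationary.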

For these series, the ARIMA model fails to produce random errors or even a stationary residual series. Thus, we choose a wavelet-based forecasting (WBF) model to model the remaining series. First, an ARIMA model is built to model the linear components of the epidemic time series, and a set of out-of-sample forecasts is generated. In the second phase, the ARIMA residuals (an oscillatory residual series) are remodeled using a mathematically grounded WBF model. Here, WBF models the left-over autocorrelations in the residuals that ARIMA could not model.

The algorithmic presentation of the proposed hybrid model is given in the table below:

The proposed model can be looked upon as an error remodeling approach in which we use ARIMA as the base model and remodel its error series by wavelet-based time series forecasting technique to generate more accurate forecasts.

As the WBF model is fitted on the residual time series, predictions are generated for the next ten time steps (5 April 2020 to 14 April 2020). Further, both the ARIMA forecasts and WBF residual forecasts are added together to get the final out-of-sample forecasts for the next ten days (5 April 2020 to 14 April 2020).
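The error-remodeling idea can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: a least-squares linear trend stands in for the ARIMA base model, an AR(1) fit on the residuals stands in for the WBF model, and the `daily_cases` values are made up. None of the function names come from the paper; the point is only the two-phase structure and the final addition of the two forecasts.

```python
def fit_linear_trend(series):
    """Least-squares line through (t, y_t). ASSUMPTION: stand-in for the ARIMA base model."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def fit_ar1(residuals):
    """AR(1) coefficient by least squares. ASSUMPTION: stand-in for the WBF residual model."""
    num = sum(a * b for a, b in zip(residuals[1:], residuals[:-1]))
    den = sum(b * b for b in residuals[:-1])
    return num / den if den else 0.0


def hybrid_forecast(series, horizon=10):
    # Phase 1: fit the base model and forecast `horizon` steps ahead.
    intercept, slope = fit_linear_trend(series)
    n = len(series)
    fitted = [intercept + slope * t for t in range(n)]
    base = [intercept + slope * t for t in range(n, n + horizon)]

    # Phase 2: remodel the base model's residuals and forecast them too.
    residuals = [y - f for y, f in zip(series, fitted)]
    phi = fit_ar1(residuals)
    resid_forecast, last = [], residuals[-1]
    for _ in range(horizon):
        last = phi * last
        resid_forecast.append(last)

    # Final forecast: base forecast plus residual forecast, step by step.
    final = [b + r for b, r in zip(base, resid_forecast)]
    return base, resid_forecast, final


# Hypothetical daily case counts, for illustration only.
daily_cases = [12, 15, 14, 20, 26, 24, 31, 38, 36, 45, 52, 50]
base, resid_forecast, final = hybrid_forecast(daily_cases, horizon=10)
print([round(v, 1) for v in final])
```

Replacing the two stand-ins with a fitted ARIMA model and a wavelet-based forecaster would leave the last step, the element-wise sum, unchanged.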

Predictions of the number of daily new COVID-19 cases according to the hybrid ARIMA-WBF model.

In the original paper the graphs of the forecast are separated by country. We have simply put them all together to highlight the differences in number of cases in the different countries.

The hybrid model predictions for Canada, France, India, South Korea and the UK made by the authors are displayed below. The plot shows the actual data in Persian green and the predictions in paradise pink, running from 20 January to 4 April 2020.

The predictions made by the authors seem to follow the pattern of the actual values smoothly, with just a short delay.

In order to plot this graph, we downloaded the data from Our World in Data.

Risk assessment of COVID-19 cases

At the outset of the COVID-19 outbreak, data on country-wise case fatality rates due to COVID-19 were obtained for 50 affected countries. The case fatality rate (CFR) can be crudely defined as the number of deaths in persons who tested positive for COVID-19 divided by the confirmed number of COVID-19 cases.
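As a minimal illustration of this crude definition, the ratio can be computed as follows. The function name and the counts are hypothetical and are not taken from the dataset.

```python
def case_fatality_rate(deaths, confirmed_cases):
    """Crude CFR: deaths among confirmed cases divided by confirmed cases."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return deaths / confirmed_cases


# Hypothetical counts, for illustration only.
cfr = case_fatality_rate(deaths=30, confirmed_cases=1200)
print(f"{cfr:.3f}")  # 0.025, i.e. 2.5%
```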

In this section, we identify a list of essential causal variables that have a strong influence on the CFR. The datasets and code for this section are publicly available at https://github.com/indrajitg-r/COVID for the reproducibility of this work.

A key differentiation among the CFR of different countries can be found by determining an exhaustive list of causal variables that significantly affect CFR. In this work, we put an effort to identify critical parameters that may help to assess the risk (in terms of CFR) using an optimal regression tree model.

The regression tree has a built-in variable selection mechanism from high dimensional variable space and can model arbitrary decision boundaries. It combines case estimates, epidemiological characteristics of the disease, and health-care facilities to assess the risks of major outbreaks for profoundly affected countries.

Such assessments will help to anticipate the expected morbidity and mortality due to COVID-19 and provide some critical information for the planning of health care systems in various countries facing this epidemic.

Data

The CFR modeling dataset consists of 50 observations having ten possible causal variables and one numerical output variable.

The possible causal variables considered in this study are the following:

The dataset contains a total of 8 numerical input variables and two categorical input variables.

This table shows the dataset we used for the analysis. The data was available from the authors' repository. We styled the table to highlight the number of cases and added an in-table bar plot to make it easy to compare the number of deaths by country.

The dependent variable of this study is the CFR, the second-last column of this table, marked in bold.

Summary Statistics

Method: Regression Tree

For the risk assessment with the CFR dataset for 50 countries, we apply the Regression Tree (RT), a non-parametric supervised learning method used for regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

The corresponding machine learning algorithm is Classification and Regression Trees (CART).

The basic idea behind the algorithm is to find the point in an independent variable at which to split the dataset into two parts so that the mean squared error is minimized. In other words, it takes a feature and determines which cut-off point minimizes the variance of $y$ within the two resulting subsets, as the variance tells us how much the $y$ values in a node are spread around their mean value $\bar{y}$. As a consequence, the best cut-off point makes the two resulting subsets as different as possible with respect to the target outcome.
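The search for the best cut-off point on a single feature can be sketched as below. This is an illustrative sketch, not the implementation used by R or scikit-learn: the helper names and the toy data are assumptions, and the candidate thresholds are taken as midpoints between consecutive distinct feature values. Minimizing the summed squared error of the two subsets is equivalent to minimizing their size-weighted variance.

```python
def sse(values):
    """Sum of squared deviations from the mean (variance times the node size)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, cost) minimizing SSE(left) + SSE(right) for one feature."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Toy data (hypothetical): low feature values go with low CFR, high with high CFR.
x = [1, 2, 3, 10, 11, 12]
y = [0.01, 0.02, 0.01, 0.08, 0.09, 0.08]
threshold, cost = best_split(x, y)
print(threshold)  # 6.5
```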

The algorithm continues this search-and-split recursively, creating different subsets of the dataset until a stop criterion is reached. Possible criteria are a minimum number of instances that have to be in a node before it is split, or a minimum number of instances that have to be in a terminal node.

The intermediate subsets are called internal nodes (or split nodes) and the final subsets are called terminal nodes (or leaf nodes). To predict the outcome in each leaf node, the average outcome of the training data in this node is used.
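The recursion, the stop criterion and the leaf prediction can be put together in a compact sketch. It is a simplified illustration rather than the CART implementation of any library: the dictionary-based tree, the function names, the `min_samples_leaf` parameter name and the toy data are all assumptions made here, and thresholds are taken at observed feature values.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, min_samples_leaf=2):
    """Recursively split until no admissible split improves the squared error."""
    best = None  # (cost, feature index, threshold)
    for j in range(len(rows[0])):
        for t in sorted(set(r[j] for r in rows))[1:]:
            left = [y for r, y in zip(rows, targets) if r[j] < t]
            right = [y for r, y in zip(rows, targets) if r[j] >= t]
            # Stop criterion: every terminal node needs enough instances.
            if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    if best is None or best[0] >= sse(targets):
        # Leaf node: predict the average outcome of the training data in it.
        return {"leaf": True, "value": mean(targets)}
    _, j, t = best
    left_idx = [i for i, r in enumerate(rows) if r[j] < t]
    right_idx = [i for i, r in enumerate(rows) if r[j] >= t]
    return {
        "leaf": False, "feature": j, "threshold": t,
        "left": build_tree([rows[i] for i in left_idx],
                           [targets[i] for i in left_idx], min_samples_leaf),
        "right": build_tree([rows[i] for i in right_idx],
                            [targets[i] for i in right_idx], min_samples_leaf),
    }


def predict(tree, row):
    """Follow the split rules down to a leaf and return its stored mean."""
    while not tree["leaf"]:
        go_left = row[tree["feature"]] < tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["value"]


# Toy data (hypothetical): two features per observation, CFR-like targets.
rows = [[1, 5], [2, 3], [3, 8], [10, 4], [11, 7], [12, 2]]
targets = [0.01, 0.02, 0.01, 0.08, 0.09, 0.08]
tree = build_tree(rows, targets, min_samples_leaf=2)
print(tree["feature"], tree["threshold"])
print(round(predict(tree, [2, 6]), 4), round(predict(tree, [11, 1]), 4))
```

With only six observations and at least two instances required per leaf, the sketch stops after a single split on the first feature, which mirrors how a small dataset limits the depth of the tree.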

Paper Model

In Python, decision trees can also be applied to regression problems, using scikit-learn's DecisionTreeRegressor class.

Basic regression trees partition a data set into smaller subgroups and then fit a simple constant for each observation in the subgroup. The partitioning is achieved by successive binary partitions (recursive partitioning) based on the different predictors. The constant to predict is based on the average response values for all observations that fall in that subgroup.

Considerations

When we ran the model proposed by the authors, the resulting regression tree (RT) did not correspond to the one published in the paper. Looking at the variable importance graph, we noticed that 9 variables were considered instead of the 10 initially mentioned by the authors.

Therefore, we also excluded the variable 'Climate Zones' (x.x10). Even after this exclusion, our tree does not correspond entirely to the one in the paper. The authors carried out the analysis in R, whereas we used Python. Different software packages, and different versions of the same package, can implement the same algorithm in different ways, for example in their default stopping and pruning settings. We believe the differences stem from this.

The regression tree displayed above shows the relationship between the important causal variables and CFR.

The RT starts with the total number of COVID-19 cases as the most crucial causal variable in the first parent node. In each box, the top most numerical values suggest the CFR estimates based on the tree. One of the key findings of the tree is the following rule:

Different Representation of The Regression Tree

Similarly, one can see all the rules generated by RT to get additional information about the relationships between control parameters and the response CFR variable with this visualization; it shows the distribution of decision feature in each node and the estimated CFR value on the leaf, in the case of regression tasks.

We decided to implement this type of visualization for a better understanding of the technique, since it shows how the CART algorithm splits the data.

Variable Importance Plot

The following plot shows how much each variable affects the model. We can see that the most important variable is the number of cases in each country, followed by the total population, the number of hospital beds and the number of days since shutdown. The variable importance of our model differs a little from the one in the article. Two of the first four variables are different: in the paper the variables "doctors per 1000 people" and "percentage of People > 65" have a strong influence on the model. We attribute this difference mainly to the software used, as explained above.

RT Europe

After fitting an RT on the 50 countries, we considered grouping the countries by continent. The idea was to fit a regression tree for every continent, but the observations per continent were too few. The continent with the most observations is Europe (23), so we fitted an RT only for this continent, because fitting an RT on very few data points is not meaningful.

However, it might be a good idea for future works to analyze the CFR by continent or region, in order to help the governments to take decisions about restrictions and medical improvement.

In this case the variable that most influences the model is population density (population/$km^2$): this suggests that social distancing and wearing masks could be effective in limiting the spread of the virus.

The most important findings:

Different Representation of The Regression Tree

We have proposed again the alternative visualization also for this model; within each node, the decision features are plotted alongside their distribution and the estimated response by the leaf.

Variable Importance Plot

This plot again shows how much the single variables affect our model. We can see that the most important variable is still the number of cases of each country, followed by the population density, the time of arrival and the number of days since shutdown.

Metrics

When assessing how well a model fits a dataset, we use the Root Mean Squared Error (RMSE). The RMSE is a metric computed as the square root of the average squared difference between the predicted values and the actual values in a dataset:

$$RMSE=\sqrt{\sum{(\hat{y_i}-y_i)^2}\over{n}}$$

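where $\hat{y_i}$ is the predicted value for the $i$-th observation, $y_i$ is the corresponding observed value and $n$ is the number of observations.

The Mean Absolute Error (MAE) is a measure of the errors between paired observations expressing the same phenomenon: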

$$MAE = {\sum{|\hat{y_i}-y_i|}\over{n}}$$

$MAE$ is conceptually simpler and also easier to interpret than $RMSE$: it is simply the average absolute difference between $\hat{y_i}$ and $y_i$, that is, the average absolute vertical distance between each point and the identity line in a scatter plot of predicted against observed values.

Furthermore, each error contributes to $MAE$ in proportion to the absolute value of the error. This is in contrast to $RMSE$, which involves squaring the differences, so that a few large differences will increase the $RMSE$ to a greater degree than the $MAE$.
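The contrast between the two error measures can be checked on a small example. The functions below follow the two formulas given above; the observed and predicted values are made up for illustration, chosen so that both prediction sets have the same total absolute error.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    n = len(y_true)
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n


# Hypothetical observed values and two sets of predictions.
y_true = [0.02, 0.03, 0.05, 0.04]
even_errors = [0.03, 0.04, 0.06, 0.05]  # every prediction is off by 0.01
one_large = [0.02, 0.03, 0.05, 0.08]    # a single prediction is off by 0.04

print(round(mae(y_true, even_errors), 4), round(rmse(y_true, even_errors), 4))
print(round(mae(y_true, one_large), 4), round(rmse(y_true, one_large), 4))
```

Both prediction sets have an MAE of about 0.01, but the RMSE of the second set is about twice that of the first, because the single large error is squared before averaging.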

The coefficient of determination, denoted $R^{2}$ or $r^{2}$ and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variables.

The most general definition of the coefficient of determination is

$$R^{2} = 1-{SS_{\rm {res}} \over SS_{\rm {tot}}}$$

where the sum of squares of residuals, also called the residual sum of squares, is:

$${SS_{\text{res}}=\sum _{i}(y_{i}-f_{i})^{2}=\sum _{i}e_{i}^{2}\,}$$

and the total sum of squares (proportional to the variance of the data):

$${SS_{\text{tot}}=\sum _{i}(y_{i}-{\bar {y}})^{2}}$$

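The $R^{2}$ value usually lies between 0 (the model explains none of the variation in the dependent variable, so it is no better than always predicting the mean $\bar{y}$) and 1 (the model predicts the dependent variable perfectly). It can be negative when the model fits worse than the mean.

Here $y_i$ are the observed values, $f_i$ are the predicted values (written $\hat{y_i}$ in the formulas for RMSE and MAE) and $\bar{y}$ is the mean of the observed values.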

The $AdjR^{2}$ (Adj stands for Adjusted) is a modified version of $R^{2}$ that has been adjusted for the number of predictors in the model. It is used to determine how reliable the correlation is and how much it is determined by the addition of independent variables and it is defined as:

$${\bar {R}}^{2}=1-(1-R^{2}){n-1 \over n-p-1}$$

where $n$ is the number of observations and $p$ is the number of predictors. This value is never greater than $R^{2}$ and can be negative, although it is usually positive.
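A short sketch of both coefficients, following the definitions above. The observed and predicted values are invented for illustration, and the number of predictors `p = 2` is an arbitrary assumption.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical observed and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n=len(y_true), p=2)
print(round(r2, 4), round(adj_r2, 4))  # 0.99 0.98

# Always predicting the mean gives R^2 = 0: no variation is explained.
mean_only = [3.0] * 5
print(r_squared(y_true, mean_only))  # 0.0
```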

These metrics are used to evaluate the predictive performance of the tree models used in this study. A good predictive model should have the errors RMSE and MAE low (near zero) and the $R^{2}$ and $AdjR^{2}$ high (near one).

We get different results for the same algorithm on the same machine when it is implemented in different languages, such as R and Python.

We think that small differences in the implementation of the underlying libraries cause differences in the resulting model and in the predictions made by that model.

In fact, the RMSE reported in the paper is $0.013$, whereas the RMSE we computed was $0.012$ using R and $0.010$ using Python, as shown.

We also computed the $R^{2}$ and $AdjR^{2}$, which differ very little from those of the original model. The MAE was not computed by the authors of the paper, but this is not an issue because the RMSE is reported and is enough to evaluate the error. The metrics for the Europe model also indicate good predictive performance.

Results and models comparison

The authors of the original paper suggested that the total number of cases, age distributions, and shutdown period have high impacts on the CFR estimates, but interestingly they found four more essential causal variables:

Therefore, seven of the ten potential input variables were found to have high importance. The fatality rate might be reduced by addressing the factors behind each variable that affects the model.

In our model the variables that most affect the model are:

whereas for the Europe model the two most influential variables are population density and cases in thousands. The population and the number of days since shutdown also play a role.

To prevent the virus from spreading and to avoid deaths, governments should consider basing restrictions on these statistical results.

For instance, if the most influential variable in a model is hospital beds per 1000 people, more hospital beds should be provided; if instead it is population density, the use of masks and frequent hand washing may be preferable, because people come into contact with many others daily.

Limitations of our findings

The assumptions that the authors made for the sake of simplicity are as follows:

  1. the virus mutation rates are comparable for different countries;
  2. the recovered persons will achieve permanent immunity against COVID-19;
  3. we ignore the effect of climate change (also spatial data structures) during the short-term predictions.

Now, a year and a half after this study, we know that recovered persons do not achieve permanent immunity against COVID-19, although at the time this seemed a reasonable assumption, even if a strong one.

However, as mentioned in the paper, there may exist other controllable factors and some disease-based characteristics that can also have an impact on the value of CFR for different countries.

Since the number of data points in both datasets is limited, advanced deep learning techniques would likely over-fit them.

Conclusions

Throughout this work, we have reproduced the analysis cited by the authors of the paper "Applications of machine learning and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: A review", within section 2.3 "ML and AI technology in SARS-CoV-2 prediction and forecasting", as reference 41: Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis.

In the first part we have checked the forecast computed by the authors, by plotting the results against the observed epidemic data, highlighting the differences among the countries.

Secondly, we have computed the regression tree implemented in the original paper to assess the risk associated with COVID-19 data and some demographic features of the selected countries. The variables that most affected the authors' model are:

Our results are similar to those of the original paper; however, there are some slight differences in some nodes. Our most important variables are:

We suppose that these differences are due to implementation differences that arose when switching from R to Python; still, they do not affect the overall interpretation of the model.

Alongside the classical representation of the RT, we proposed an alternative visualization of the regression tree, displaying the distribution of the decision feature in each node and the estimated CFR value on the leaf; this allowed us to present the way the CART algorithm splits the data.

Then, we proposed aggregating the countries by continent in order to fit a regression tree for each continent, although the limited number of observations did not allow us to carry out a robust analysis for any continent other than Europe. The results showed a change in feature importance compared to the original model. For the Europe model, the two most influential variables are population density and cases in thousands. The population and the number of days since shutdown also play a role.

The authors of "Applications of machine learning and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: A review" suggested that this paper, among several others, has provided evidence that AI and ML can significantly improve treatment, forecasting and contact tracing for the Covid-19 pandemic and reduce human intervention in medical practice.

Summing up, we agree with their claim, especially because the model proved quite reliable. We also agree that most of the models have not been deployed enough to show their real-world operation, but that they are still up to the mark to tackle the pandemic.