In this work, we applied three machine learning techniques to forecast the numbers of COVID-19 cases, deaths and recoveries in Algeria for the next six months, using data from February 25th, 2020 to April 26th, 2021. The models are Gaussian process regression (GPR), the support vector machine (SVM) and the decision tree (DT). The response plots and evaluation metrics indicated that GPR has the best performance. Predictions with this model showed that the numbers of cases, deaths and recoveries in Algeria will increase over the coming months, peak in August, and tend to decrease afterwards.
Seventeen months after its emergence, coronavirus disease 2019 (COVID-19) continues to spread, having affected more than 165 million patients and caused more than 3.4 million deaths, surpassing all expectations. Algeria recorded its first case on February 25th, 2020. After several phases of the epidemiological curve, the number of cases has reached 125,896. This number is lower than that reported in neighboring countries such as Morocco (515,758 cases) and Tunisia (329,925 cases). With 3,395 deaths and 87,746 recovered persons, these figures give a fatality rate of 2.7% and a recovery rate of 69.7%, respectively [1]. To understand the epidemiological traits of this disease and to predict its evolution and its probable end point, multiple approaches have been used in Algeria and around the world. These approaches range from epidemiological and mathematical/statistical models to deep learning and machine learning models [2]. Among them, machine learning models are of great importance [3]. These tools, which in recent years have proved their value on complicated problems in fields including health, agriculture, engineering, sport, climate and robotics [4], have been widely used in the current context of COVID-19 [5,6,7,8].
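The two rates quoted above follow directly from the cumulative counts. A minimal arithmetic check, with variable and function names chosen here purely for illustration:

```python
# Cumulative counts for Algeria as quoted in the text [1].
confirmed = 125_896
deaths = 3_395
recovered = 87_746


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Both rates are taken relative to the number of confirmed cases.
fatality_rate = percentage(deaths, confirmed)
recovery_rate = percentage(recovered, confirmed)

print(f"Fatality rate: {fatality_rate}%")
print(f"Recovery rate: {recovery_rate}%")
```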
These models include autoregressive integrated moving average (ARIMA) models [5], Bayesian structural time series (BSTS) [4], simple recurrent neural networks (RNN) [7], artificial neural networks (ANN) [8], long short-term memory (LSTM) [9], linear regression [10], adaptive neuro-fuzzy inference systems (ANFIS) [11], least absolute shrinkage and selection operator (LASSO) regression [12], cubist regression (CUBIST) [13], Gaussian process regression (GPR) [14], exponential smoothing (ES) [15], random forest (RF) [8,13,16], ridge regression (RIDGE) [13], support vector machine (SVM) [8,13], naïve Bayes (NB) [8], decision tree (DT) [8], the Box-Jenkins method [17], variational autoencoders (VAE) [7,10], gated recurrent units (GRU) [7,9] and multi-layer perceptron (MLP) models [18].
After analyzing historical COVID-19 data, Velásquez and Lara [14] forecasted COVID-19 infection with reduced-space Gaussian process regression associated with chaotic dynamical systems, using data from the first two months (January 21 to April 12, 2020). Their work demonstrated the usefulness of Gaussian models in predicting COVID-19 infection.
In their study, Ribeiro et al. [13] evaluated the performance of multiple models, namely autoregressive integrated moving average (ARIMA), cubist regression (CUBIST), random forest (RF), ridge regression (RIDGE), support vector regression (SVR), and stacking-ensemble learning, in short-term projections (1, 3 and 6 days ahead) of COVID-19 cases in the ten most affected states in Brazil. The performance evaluation yielded the following ranking: SVR, stacking-ensemble learning, ARIMA, CUBIST, RIDGE, and RF.
In a comparison made by Ball [14], the support vector machine (SVM) demonstrated higher performance than linear regression, multi-layer perceptron, and random forest models in predicting the COVID-19 trend in the USA, Germany and worldwide. In Mexico, logistic regression, decision tree, support vector machine, naïve Bayes, and artificial neural network models were used by Muhammed et al. [8] to study COVID-19 cases. The researchers observed that the decision tree, support vector machine and naïve Bayes models had the highest accuracy (94.99%), sensitivity (93.34%) and specificity (94.30%), respectively.
Daniyal et al. [19], in Pakistan, compared the performance of three regression models (linear, logarithmic, and quadratic) in modeling COVID-19 deaths using about five months of data. They concluded that the mortality rate would decrease by the end of October, as indicated by the quadratic regression model, which showed the best performance.
Prediction of COVID-19 mortality in Korea was the main objective of the study by An et al. [12]. The study began by testing the least absolute shrinkage and selection operator (LASSO), linear support vector machine (SVM), SVM with a radial basis function kernel, random forest (RF), and k-nearest neighbors. LASSO and linear SVM showed high sensitivities (90.7% and 92.0%, respectively) and specificities (91.4% and 91.8%, respectively). In the same country, Das et al. [20] predicted mortality in 3,524 COVID-19 patients using five machine learning models (logistic regression, support vector machine, k-nearest neighbors, random forest and gradient boosting). The logistic regression model was proposed as an open-source online prediction tool for decision-making because of its high performance.
In this paper, COVID-19 time series data for Algeria available up to April 26th, 2021 were used to project daily cases, deaths and recoveries over the next six months using three machine learning techniques: Gaussian process regression (GPR), support vector machine (SVM) and decision tree (DT). Data on the number of cases reported in Algeria were extracted from Worldometer. The evolution of the COVID-19 curve is shown in Figure 1.
In the current work, three machine learning approaches were applied to predict the numbers of COVID-19 cases, deaths and recovered persons in Algeria. We first evaluated the forecast performance of these models on COVID-19 daily cases by estimating the root mean square error (RMSE), the mean square error (MSE), the mean absolute error (MAE), and the coefficient of determination (R2). Although all three models showed acceptable performance (Table 1), the GPR model was the most accurate, with an RMSE of 31.126 and an R2 of 0.98. These metrics were calculated by comparing actual and predicted cases after a 10-fold cross-validation. Figures 2, 3 and 4 show the response plots of the three models (GPR, SVM, and DT, respectively). Figures 5, 6 and 7 present the predicted versus observed pattern of each model.
Evaluation metric | GPR | Quadratic SVM | DT |
---|---|---|---|
RMSE | 31.126 | 42.485 | 37.93 |
R-squared | 0.98 | 0.97 | 0.97 |
MSE | 968.83 | 1804.9 | 1438.7 |
MAE | 18.334 | 26.268 | 22.996 |
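The four metrics in Table 1 can be computed from paired observed and out-of-fold predicted values. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the least-squares placeholder model, the shuffled fold assignment and the synthetic data are all hypothetical, since the paper does not describe how the folds were drawn or which software was used.

```python
import math
import random


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R-squared for paired observations.

    Assumes `actual` is not constant (otherwise R-squared is undefined).
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_predict(x, y, fit_predict, k=10, seed=0):
    """Collect out-of-fold predictions: every point is predicted by a
    model that did not see it during fitting."""
    predictions = [None] * len(y)
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train_x = [x[i] for i in range(len(y)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        fold_pred = fit_predict(train_x, train_y, [x[i] for i in fold])
        for i, p in zip(fold, fold_pred):
            predictions[i] = p
    return predictions


def linear_fit_predict(train_x, train_y, test_x):
    """Placeholder model (least-squares line), NOT one of the paper's models."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(train_x, train_y))
    sxx = sum((a - mx) ** 2 for a in train_x)
    return [my + (sxy / sxx) * (t - mx) for t in test_x]


# Synthetic series for illustration only (not the Worldometer data).
days = list(range(100))
cases = [50 + 3 * d for d in days]
out_of_fold = cross_val_predict(days, cases, linear_fit_predict, k=10)
scores = regression_metrics(cases, out_of_fold)
print(scores)
```

Any of the three models could be substituted for `linear_fit_predict` as long as it takes training inputs, training targets and test inputs and returns predictions. As a consistency check on the table itself, each RMSE is the square root of the corresponding MSE (for GPR, 31.126² ≈ 968.83).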
We then used the data available up to April 26th, 2021 on daily confirmed, recovered, and deceased COVID-19 cases in Algeria to forecast them with the three models for the next six months. Predicted daily new cases, recoveries and deaths are shown in Figures 8, 9 and 10, respectively. According to the GPR model, confirmed cases will increase in the coming months and start declining from the first week of October. The numbers of recoveries (Figure 9) and deaths (Figure 10) generally follow the same trajectory.
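To illustrate how a GPR forecast of this kind is produced, the sketch below computes the posterior mean of a Gaussian process, k*ᵀ(K + σ²I)⁻¹y, on a synthetic series. It is a sketch under stated assumptions rather than the model used in this study: the squared-exponential kernel, the fixed hyperparameters (`length_scale`, `variance`, `noise`), the function names and the data are hypothetical, and practical toolboxes typically tune the hyperparameters instead of fixing them.

```python
import math


def rbf_kernel(a, b, length_scale=5.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gpr_predict(train_x, train_y, test_x, noise=1e-2, **kernel_args):
    """Posterior mean of a GP whose prior mean is the training average."""
    mean_y = sum(train_y) / len(train_y)
    centered = [v - mean_y for v in train_y]
    # Gram matrix of the training inputs plus observation noise on the diagonal.
    gram = [[rbf_kernel(a, b, **kernel_args) + (noise if i == j else 0.0)
             for j, b in enumerate(train_x)]
            for i, a in enumerate(train_x)]
    alpha = solve_linear_system(gram, centered)
    return [mean_y + sum(rbf_kernel(t, a, **kernel_args) * w
                         for a, w in zip(train_x, alpha))
            for t in test_x]


# Synthetic daily series for illustration only (not the Worldometer data).
days = list(range(30))
cases = [100 + 50 * math.sin(d / 5) for d in days]

fitted = gpr_predict(days, cases, days)            # in-sample fit
forecast = gpr_predict(days, cases, [30, 31, 32])  # short-horizon forecast
far_future = gpr_predict(days, cases, [1000])[0]   # far beyond the data

print(forecast)
print(far_future)
```

One design consequence worth keeping in mind: with a stationary kernel such as this one, predictions far beyond the last observation revert to the prior mean (here the training average), so the shape of a long-horizon forecast depends strongly on the kernel and mean function chosen.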
It should be noted that these projections were made without modeling the effect of preventive measures, which are assumed to remain unchanged in the coming months. Prediction performance could be improved by including their effect. All three models used in this study showed a high coefficient of determination. By comparison, our models perform better in terms of R2 than models such as ARIMA (0.95) [21] and ANFIS (0.956) [22]. Other models, such as MLP-ICA (0.9971) [22], logistic regression (0.996) [23] and lasso regression (1.0) [24], have shown higher performance.