The effect of environmental variable selection in the prediction of Seasonal Influenza cases using machine learning


Edition note
Digitized by the Biblioteca Virtual del Banco de la República (Colombia)
  • Technology; Technology / Medical sciences, Medicine; Technology / Medical sciences, Medicine / Incidence and prevention of disease
  • Influenza; Human; Environment; Forecasting; Machine Learning; ARIMA; Multifactor Dimensionality Reduction
  • Colombia
  • Colfuturo
  • © All rights reserved by the author
  • Abstract: Background: Seasonal influenza is considered a cyclic, ordered sequence of values influenced by external factors, which can be predicted and used for disease outbreak detection and monitoring. In machine learning, the key challenges limiting these analyses are model explainability and ecological bias. Aim: To determine the best environmental variable selection method for predicting seasonal influenza in Norway, using an environmental medicine approach combined with machine learning techniques. Methods: This is a quasi-experimental study that compares three approaches (no variable selection, isolated component, and multipollutant mixture), implemented as five methods (univariable, bivariable, multivariable AME, multivariable PCA, multivariable LDA). For each method, the best covariable combination is determined following that method's internal rules. The best covariable combination is the result of three components: variable selection, validation dataset, and lag. The first involves 13 environmental variables (temperature, relative humidity, specific humidity, air pressure, wind speed, precipitation, CO, NO, NO2, O3, PM10, PM2.5 and SO2); the second compares, in the validation process, a test dataset compiled from the 2019 data, the previous year's data (2018), and a synthetic environmental dataset (2013-2018 average); and the third compares combinations of lags from 0 to 12. All predictions are made using the ARIMA algorithm. Performance is evaluated in terms of MAE, MSE, RMSE, and OR. The training set runs from 2 Jun 2013 (week 22/2013) to 28 May 2018 (week 21/2018), and the test set from week 22/2018 to week 21/2019, with a prediction window of 52 weeks. Results: Increasing the dimensionality of the environmental variable selection introduces different noise levels and optimizes the prediction. Considerations affecting explainability, usability, ecological bias, and performance will be described.
Conclusion: Increasing the dimensionality of the variable selection improves performance more than using complex algorithms.
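As an illustration of two steps named in the abstract, the sketch below shows one plausible way to lag a weekly environmental covariate and to score a forecast with MAE, MSE, and RMSE. It is not the author's implementation: the function names `lag_covariate` and `forecast_errors` and all numbers are hypothetical, and the ARIMA fitting itself is left out.

```python
import math


def lag_covariate(values, lag):
    """Shift a weekly covariate forward by `lag` weeks.

    The first `lag` positions have no earlier observation and are set to None.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    values = list(values)
    return ([None] * lag + values)[:len(values)]


def forecast_errors(actual, predicted):
    """Return MAE, MSE and RMSE for two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Hypothetical weekly values, for illustration only.
temperature = [1.0, 2.0, 3.0, 4.0]
observed_cases = [10, 12, 15, 11]
predicted_cases = [9, 14, 15, 13]

lagged = lag_covariate(temperature, 2)
errors = forecast_errors(observed_cases, predicted_cases)
```

Under this assumed setup, a search over lags 0 to 12 would call `lag_covariate` once per lag, fit the model on the lagged covariate, and keep the lag with the lowest error from `forecast_errors`.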