A COMPARATIVE STUDY OF MACHINE LEARNING MODEL PERFORMANCE FOR PM2.5 FORECASTING WITH AND WITHOUT PRINCIPAL COMPONENT ANALYSIS
Abstract
This study forecasts PM2.5 particulate matter concentrations using the XGBoost machine learning algorithm applied to hourly meteorological and air pollution data collected over a five-year period (2019–2023). The inputs included meteorological variables measured at various altitudes (temperature, relative humidity, wind speed and direction, air pressure, and shortwave and longwave radiation) and pollutant concentrations (CO, NO, NO₂, NOx, PM2.5, and PM10). Missing data were imputed with Multivariate Imputation by Chained Equations (MICE). Two modeling approaches were evaluated: the first used the original variables without Principal Component Analysis (PCA), whereas the second applied PCA. The non-PCA model performed better, particularly in 3-hour forecasts, where it achieved a correlation coefficient of 0.88, with RMSE and MAE values of 7.95 and 5.40, respectively. The PCA-based model yielded a correlation coefficient of 0.85, with RMSE and MAE values of 8.62 and 5.77, respectively. Both models' performance declined as the forecast horizon lengthened. In 24-hour forecasts, the non-PCA model achieved a correlation coefficient of 0.75, with RMSE and MAE values of 10.45 and 7.01, respectively, while the PCA model achieved a correlation coefficient of 0.74, with RMSE and MAE values of 10.71 and 7.22, respectively. In 168-hour (7-day) forecasts, the correlation coefficient of the non-PCA model fell to 0.48, with RMSE and MAE values of 16.87 and 10.01, respectively. The PCA model achieved a marginally higher correlation coefficient of 0.49, with RMSE and MAE values of 16.81 and 10.16, respectively. The findings indicate that combining multi-level altitude measurements, MICE imputation, and the XGBoost algorithm can accurately forecast short-term PM2.5 concentrations. Moreover, retaining the original variables rather than applying PCA preserves essential information in the data and improves forecast accuracy at the 3-hour and 24-hour horizons, whereas the two approaches perform almost identically at 168 hours.
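For readers unfamiliar with the three scores reported above, the following is a minimal sketch of how a correlation coefficient, RMSE, and MAE can be computed from paired observed and forecast values. It assumes the correlation coefficient is Pearson's r, which the abstract does not state explicitly. The function names and the sample values are hypothetical illustrations and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between paired observations and forecasts."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean absolute error between paired observations and forecasts."""
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


def pearson_r(observed, predicted):
    """Pearson correlation coefficient (assumed to be the study's metric)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Hypothetical PM2.5 values for illustration only; not data from the study.
observed = [12.0, 18.0, 25.0, 31.0, 22.0]
predicted = [14.0, 17.0, 22.0, 33.0, 24.0]

print("r    =", round(pearson_r(observed, predicted), 3))
print("RMSE =", round(rmse(observed, predicted), 3))
print("MAE  =", round(mae(observed, predicted), 3))
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, which is consistent with the RMSE values in the abstract being larger than the corresponding MAE values at every horizon.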