Enhancing solar flare forecasting: a multi-class and multi-label classification approach to handle imbalanced time series
Discola Junior, Sérgio Luisir
Solar flares are huge releases of energy from the Sun. They are categorized into five levels according to their potential damage to Earth (A, B, C, M, and X) and may strongly affect communication systems, threatening human activities that depend on satellites and GPS. Predicting them in advance may therefore reduce their negative impacts. However, solar flare forecasting faces significant challenges: (a) the sequence of the data influences the phenomenon and should be tracked; (b) the features and intervals that cause and influence the phenomenon are not defined; (c) the forecast should be produced in an affordable time; (d) the data are highly imbalanced; (e) adjacent classes are sometimes difficult to distinguish; and (f) most approaches perform binary forecasting (aggregating solar flare classes) instead of the multi-class forecasting actually required. This work proposed a method that tackles these challenges simultaneously, unlike previous works, which tend to handle one challenge at a time. First, we aimed to forecast the X-ray levels expected for the next few days. We proposed the SeMiner method, which predicts labels from past observations. SeMiner processes X-ray time series into sequences using the new Series-to-Sequence (SS) algorithm, a sliding-window approach configured by a domain specialist. This allows the sequence of instances to be considered in the mining process, handling challenge (a). Next, feature selection is employed to determine which interval of data in the time series most influences the forecasting process, handling challenge (b). Then, the processed sequences are submitted to a traditional classifier to generate a model that predicts future X-ray levels. SeMiner reached 73% accuracy for a 2-day forecast, with True Positive and True Negative Rates of 71% and 79%, respectively.
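The sliding-window idea behind turning a time series into sequences can be illustrated with a minimal sketch. This is not the SS algorithm itself: the function name `series_to_sequences` and the parameters `window` and `horizon` are hypothetical, chosen only to show how past observations can be paired with a future value so that a traditional classifier can consume them.

```python
from typing import List, Sequence, Tuple


def series_to_sequences(
    series: Sequence[float], window: int, horizon: int
) -> Tuple[List[List[float]], List[float]]:
    """Slide a fixed-size window over `series` (illustrative only).

    Each row holds `window` consecutive observations; its target is the
    value observed `horizon` steps after the last element of the window.
    """
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")

    rows: List[List[float]] = []
    targets: List[float] = []
    # Last start index that still leaves room for the window and the target.
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        end = start + window
        rows.append(list(series[start:end]))
        targets.append(series[end + horizon - 1])
    return rows, targets


if __name__ == "__main__":
    rows, targets = series_to_sequences([1, 2, 3, 4, 5, 6], window=3, horizon=1)
    for row, target in zip(rows, targets):
        print(row, "->", target)
```

In this sketch the window length and forecast horizon play the role of the settings that, in the thesis, are configured by a domain specialist.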
Second, to tackle challenge (c), we parallelized SS on the CUDA platform to improve its performance. This implementation achieved a speedup of 4.36 in processing time by distributing the processing among the GPUs (Graphics Processing Units). Third, we improved SeMiner to tackle the remaining challenges by developing a new method called Ensemble of classifiers for imbalanced datasets (ECID). For each solar flare class, ECID employs stratified random sampling to train binary-class base inducers, strengthening their sensitivity to a given class in a highly imbalanced scenario, which tackles challenge (d). Using a modified bootstrap approach, an aggregation method combines the inducers' results, enabling multi-class and multi-label forecasting and thus handling the issue of adjacent classes (challenge (e)). The results showed that ECID is well suited for forecasting solar flares, achieving a maximum mean True Positive Rate (TPR) of 91% and a Precision of 97% for a time horizon of one day.
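The general shape of a per-class ensemble for imbalanced data can be sketched as follows. This is an illustration under stated assumptions, not ECID itself: the base inducer here is a simple nearest-centroid rule, the aggregation is a plain vote threshold rather than the modified bootstrap approach described above, and all names (`balanced_sample`, `train_ensemble`, `predict_labels`, `n_rounds`, `threshold`) are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Sequence[float]
Member = Tuple[List[float], List[float]]  # (target centroid, rest centroid)


def balanced_sample(X, y, target, rng):
    """Draw equally many rows from the target class and from all other classes."""
    pos = [x for x, label in zip(X, y) if label == target]
    neg = [x for x, label in zip(X, y) if label != target]
    k = min(len(pos), len(neg))
    if k == 0:
        raise ValueError(f"class {target!r} needs rows on both sides")
    return rng.sample(pos, k), rng.sample(neg, k)


def centroid(rows: List[Row]) -> List[float]:
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a: Row, b: Row) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_ensemble(X, y, classes, n_rounds=5, seed=0) -> Dict[str, List[Member]]:
    """Train `n_rounds` one-vs-rest members per class, each on a balanced sample."""
    rng = random.Random(seed)
    ensemble: Dict[str, List[Member]] = {}
    for c in classes:
        members = []
        for _ in range(n_rounds):
            pos, neg = balanced_sample(X, y, c, rng)
            members.append((centroid(pos), centroid(neg)))
        ensemble[c] = members
    return ensemble


def predict_labels(ensemble, x: Row, threshold: float = 0.5) -> List[str]:
    """Return every class whose members vote for it often enough (multi-label)."""
    labels = []
    for c, members in ensemble.items():
        votes = sum(sq_dist(x, pos_c) < sq_dist(x, neg_c) for pos_c, neg_c in members)
        if votes / len(members) >= threshold:
            labels.append(c)
    return labels


if __name__ == "__main__":
    # Toy, made-up one-feature data: many "C" rows, few "M" and "X" rows.
    X = [[1.0], [1.2], [0.8], [1.1], [0.9], [1.0], [5.0], [5.2], [9.0], [9.1]]
    y = ["C"] * 6 + ["M"] * 2 + ["X"] * 2
    model = train_ensemble(X, y, classes=["C", "M", "X"])
    print(predict_labels(model, [1.0]))
    print(predict_labels(model, [9.0]))
```

Because each class is decided independently, one observation may receive more than one label, which is how a one-vs-rest design can express uncertainty between adjacent classes.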