Enhancing solar flare forecasting: a multi-class and multi-label classification approach to handle imbalanced time series
Abstract
Solar flares are huge releases of energy from the Sun. They are categorized into five classes
(A, B, C, M, and X) according to their potential damage to Earth, and may severely affect
communication systems, threatening human activities that depend on satellites
and GPS. Predicting them in advance may therefore reduce their negative impacts. However,
solar flare forecasting faces significant challenges: (a) the order of the data influences the
phenomenon and should be tracked; (b) the features and intervals that cause and influence the
phenomenon are not well defined; (c) the forecast should be produced in a reasonable time;
(d) the data are highly imbalanced; (e) adjacent classes are sometimes difficult to distinguish;
and (f) most approaches perform binary forecasting (aggregating solar flare classes)
instead of the multi-class forecasting actually required. This work proposed a method that
tackles these challenges simultaneously, unlike previous works, which tend to handle one
challenge at a time.
First, we aimed to forecast the X-ray levels expected for the next few days. We proposed
the SeMiner method, which predicts labels from past observations. SeMiner
processes X-ray time series into sequences using the new Series-to-Sequence (SS) algorithm,
a sliding-window approach configured by a domain specialist. This method
takes the order of instances into account in the mining process, handling challenge (a).
Next, feature selection is employed to determine which interval of data in the time
series most influences the forecasting process, handling challenge (b). Then, the processed
sequences are submitted to a traditional classifier to generate a model that predicts future
X-ray levels. SeMiner reached 73% accuracy for a 2-day forecast, with True Positive and
True Negative Rates of 71% and 79%, respectively.
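The sliding-window idea behind SS can be illustrated with a minimal sketch. This is not the SS algorithm itself: the function name `series_to_sequences`, the `window` and `horizon` parameters, and the choice of labeling each window with the single observation `horizon` steps ahead are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def series_to_sequences(
    series: Sequence[str], window: int, horizon: int
) -> List[Tuple[Tuple[str, ...], str]]:
    """Turn a time series into (past window, future label) pairs.

    Hypothetical helper: `window` is the number of past observations kept
    per instance and `horizon` is how many steps ahead the label lies.
    """
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")
    pairs = []
    # Each window must leave `horizon` observations after its last element.
    for start in range(len(series) - window - horizon + 1):
        past = tuple(series[start:start + window])
        label = series[start + window + horizon - 1]
        pairs.append((past, label))
    return pairs


# Toy daily X-ray classes (made-up values, not real observations).
daily_levels = ["B", "B", "C", "M", "C", "B", "X"]
pairs = series_to_sequences(daily_levels, window=3, horizon=2)
for past, label in pairs:
    print(past, "->", label)
```

Each resulting pair keeps the order of the past observations, so a conventional classifier trained on these pairs sees the sequence rather than isolated instances.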
Second, to tackle challenge (c), we parallelized SS by implementing it on the CUDA
platform. This implementation achieved a speedup of 4.36 in processing time
by distributing the processing across the GPUs (Graphics
Processing Units).
Third, we improved SeMiner to tackle the remaining challenges by developing a new method
called Ensemble of Classifiers for Imbalanced Datasets (ECID). For each solar flare class,
ECID employs stratified random sampling to train binary base inducers, strengthening
their sensitivity to that class in a highly imbalanced scenario, which tackles challenge
(d).
Using a modified bootstrap approach, an aggregation method combines the inducers' results,
enabling multi-class and multi-label forecasting and thus handling the issue of adjacent
classes (challenge (e)). The results showed that ECID is well suited to forecasting solar
flares, achieving a maximum mean True Positive Rate (TPR) of 91% and a Precision of
97% for a time horizon of one day.
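The general shape of a per-class ensemble with vote aggregation can be sketched as follows. This is a simplified illustration, not ECID itself. Several parts are assumptions: plain balanced undersampling stands in for the stratified sampling, a nearest-centroid rule stands in for the base inducer, and a vote-fraction threshold stands in for the modified bootstrap aggregation. All function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Set

Vector = Sequence[float]
Inducer = Callable[[Vector], bool]


def balanced_indices(labels: Sequence[str], target: str, rng: random.Random) -> List[int]:
    """Draw equally many positives (target class) and negatives (all others)."""
    pos = [i for i, y in enumerate(labels) if y == target]
    neg = [i for i, y in enumerate(labels) if y != target]
    k = min(len(pos), len(neg))
    return rng.sample(pos, k) + rng.sample(neg, k)


def train_centroid_inducer(X: Sequence[Vector], is_positive: Sequence[bool]) -> Inducer:
    """Placeholder binary inducer: predict positive if closer to the positive centroid."""
    def centroid(rows: List[Vector]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a: Vector, b: Vector) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    pos_c = centroid([x for x, p in zip(X, is_positive) if p])
    neg_c = centroid([x for x, p in zip(X, is_positive) if not p])
    return lambda x: dist2(x, pos_c) < dist2(x, neg_c)


def train_ensemble(X, labels, classes, n_inducers: int, seed: int = 0) -> Dict[str, List[Inducer]]:
    """Train `n_inducers` one-vs-rest inducers per class on balanced samples."""
    rng = random.Random(seed)
    ensemble: Dict[str, List[Inducer]] = {}
    for c in classes:
        ensemble[c] = []
        for _ in range(n_inducers):
            idx = balanced_indices(labels, c, rng)
            ensemble[c].append(
                train_centroid_inducer([X[i] for i in idx], [labels[i] == c for i in idx])
            )
    return ensemble


def predict_labels(ensemble: Dict[str, List[Inducer]], x: Vector, threshold: float = 0.5) -> Set[str]:
    """Multi-label output: every class whose vote fraction reaches the threshold."""
    return {
        c for c, inducers in ensemble.items()
        if sum(1 for ind in inducers if ind(x)) / len(inducers) >= threshold
    }


# Toy imbalanced data (made-up 1-D feature): many B, few M and X.
X = [[v / 10] for v in range(10)] + [[5.0], [5.2], [10.0], [10.2]]
labels = ["B"] * 10 + ["M", "M", "X", "X"]
ensemble = train_ensemble(X, labels, classes=["B", "M", "X"], n_inducers=5)
print(predict_labels(ensemble, [9.8]))
print(predict_labels(ensemble, [0.2]))
```

Because each class is scored independently, a sample near the boundary between two adjacent classes can receive both labels, which is the multi-label behavior the abstract describes.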