Binary classification of financial data in problems with imbalanced classes
Abstract
To mitigate the risks and uncertainties associated with credit granting, financial institutions continually seek ways to improve their credit evaluation systems. Likewise, the growth in credit card transactions has led to an increase in fraud, resulting in billions of dollars in annual losses for financial institutions, so it is crucial for companies to detect fraudulent transactions effectively. One way to minimize losses due to default or fraud is to use statistical methods whose predictions have a low error rate. The major challenge, however, is that such financial data are imbalanced: non-defaulting customers and legitimate transactions (the majority groups) outnumber delinquent customers and fraudulent transactions (the minority groups). This imbalance leads to a classification bias, as learning algorithms tend to classify observations from the majority group better. In this context, this work aims to conduct a comparative study of the performance of support vector machines and logistic regression in classifying new sample units. The study will be carried out using three financial datasets with different degrees of imbalance, considering three settings: (i) applying no technique to handle the imbalance; (ii) applying data preprocessing techniques to handle the imbalance; and (iii) using the cost-sensitive versions of the original classifiers. Classifier performance will be assessed with measures derived from the confusion matrix that have been shown to be less sensitive to data imbalance, such as the G-mean, the Matthews correlation coefficient, and the F-score.
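To illustrate why the abstract favors the G-mean, Matthews correlation coefficient, and F-score over plain accuracy, the sketch below computes all four from the cells of a binary confusion matrix. It is a minimal illustration and not the study's code: the function name `imbalance_metrics` and the example counts are hypothetical, the minority class is assumed to be the positive class, the F-score is taken as F1 (equal weight on precision and recall), and undefined ratios are set to 0 by convention.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Return accuracy, G-mean, F1, and MCC from confusion-matrix counts.

    Hypothetical helper for illustration. The positive class is assumed
    to be the minority class (defaulters or fraudulent transactions).
    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the minority class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on the majority class
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    accuracy = (tp + tn) / total if total else 0.0

    # G-mean: geometric mean of the two per-class recalls.
    g_mean = math.sqrt(sensitivity * specificity)

    # F1: harmonic mean of precision and recall.
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "g_mean": g_mean, "f1": f1, "mcc": mcc}


# Made-up example: 1000 cases, 10 of them positive, and a classifier
# that labels every case as the majority (negative) class.
always_majority = imbalance_metrics(tp=0, fp=0, tn=990, fn=10)
print(always_majority)
```

In this made-up example the classifier reaches 99% accuracy while detecting no minority cases, and the G-mean, F1, and MCC are all 0, which is the kind of majority-class bias the three measures are meant to expose.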