A Dual Sampling Approach for Improved Classifier Performance on Imbalanced Datasets
Abstract
Background: The inability of traditional machine learning models to adequately classify minority instances in imbalanced datasets is a well-known challenge that militates against the successful application of these models in several real-world domains. Data sampling techniques are commonly used to address this problem. Although reducing the imbalance ratio via sampling is reported to improve classifier performance, most approaches do not consider the intra-class distribution of instances while sampling, which often leads to the loss of significant information or, conversely, to data redundancy.

Methods: This study proposes a novel Dual Sampling Technique (DST) that minimises these challenges and enhances classifier performance on imbalanced datasets. The technique first clusters a training set into a number of clusters determined a priori using the elbow method. A sampling ratio is computed for each cluster, and depending on the cluster's imbalance ratio, random undersampling, a novel average oversampling technique, or both are applied within the cluster. The resulting datasets are used to train Random Forest, Decision Tree, and K-Nearest Neighbor classifiers, whose performance is then evaluated.

Findings: Experimental results showed that, in most cases, the classifiers performed significantly better when the training set was sampled with the proposed technique prior to model building than when Random Undersampling (RUS), Random Oversampling (ROS), the Synthetic Minority Oversampling Technique (SMOTE), or Cluster-Based Undersampling (CBU) was used.

Novelty: The novelty of the proposed technique lies in a unique concept that seeks to minimise the imbalance ratio in datasets while maintaining their natural distribution, by performing both undersampling and oversampling on the same dataset.
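Since the abstract only outlines the procedure, the Python sketch below gives one plausible reading of DST rather than the authors' implementation. The binary 0/1 label convention, the midpoint per-cluster target size, and the pair-averaging form of "average oversampling" are all assumptions introduced here for illustration; the paper's exact sampling ratios may differ.

```python
# A minimal sketch of the Dual Sampling Technique (DST) described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def average_oversample(X_min, n_new, rng):
    # Hypothetical "average oversampling": each synthetic minority instance
    # is the element-wise mean of a randomly chosen pair of real minority
    # instances, so new points stay inside the minority region of the cluster.
    idx = rng.integers(0, len(X_min), size=(n_new, 2))
    return (X_min[idx[:, 0]] + X_min[idx[:, 1]]) / 2.0

def dual_sample(X, y, n_clusters, seed=0):
    # Assumes binary labels: 0 = majority class, 1 = minority class.
    rng = np.random.default_rng(seed)
    # Step 1: cluster the whole training set; n_clusters is assumed to have
    # been chosen beforehand with the elbow method, as the abstract states.
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X)
    X_parts, y_parts = [], []
    for c in range(n_clusters):
        Xc, yc = X[clusters == c], y[clusters == c]
        X_min, X_maj = Xc[yc == 1], Xc[yc == 0]
        if len(X_min) == 0 or len(X_maj) == 0:
            # Single-class cluster: nothing to rebalance, keep it unchanged.
            X_parts.append(Xc)
            y_parts.append(yc)
            continue
        # Step 2: per-cluster target size, assumed here to be the midpoint
        # between the two class counts (undersample down / oversample up).
        target = (len(X_min) + len(X_maj)) // 2
        keep = rng.choice(len(X_maj), size=min(target, len(X_maj)),
                          replace=False)
        X_maj = X_maj[keep]  # random undersampling of the majority class
        if len(X_min) < target:  # average oversampling of the minority class
            X_min = np.vstack(
                [X_min, average_oversample(X_min, target - len(X_min), rng)])
        X_parts.append(np.vstack([X_maj, X_min]))
        y_parts.append(np.concatenate([np.zeros(len(X_maj), dtype=int),
                                       np.ones(len(X_min), dtype=int)]))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Example: rebalance a synthetic 90:10 dataset, then train a classifier.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)
X_bal, y_bal = dual_sample(X, y, n_clusters=4)
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
```

Because every cluster is rebalanced from both directions toward a common target, the sketch reproduces the "dual" behaviour the abstract describes: majority instances are removed and minority instances are synthesised within the same dataset, cluster by cluster.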
References
Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. (2022). A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110, 104743. https://doi.org/10.1016/j.engappai.2022.104743
Devi, D., Namasudra, S., & Kadry, S. (2020). A boosting-aided adaptive cluster-based undersampling approach for treatment of class imbalance problem. International Journal of Data Warehousing and Mining, 16(3). https://doi.org/10.4018/IJDWM.2020070104
García-Gil, D., Luque-Sánchez, F., Luengo, J., García, S., & Herrera, F. (2019). From big to smart data: Iterative ensemble filter for noise filtering in big data classification. International Journal of Intelligent Systems, 34(12), 3260-3274. https://doi.org/10.1002/int.22193
Venkateswarlu, B., Poornima, K., Vasavi, R., & Vaishnavi, J. V. (2022). A study on class imbalance problem using genetic algorithm. In Proceedings of the 4th International Conference on Smart Systems and Inventive Technology (ICSSIT) (pp. 1709-1714). https://doi.org/10.1109/ICSSIT53264.2022.9716371
Saleh, M., Shahabadi, E., Tabrizchi, H., & Kuchaki, M. (2021). A combination of clustering-based under-sampling with ensemble methods for solving imbalanced class problem in intelligent systems. Technological Forecasting and Social Change, 169, 120796. https://doi.org/10.1016/j.techfore.2021.120796
Amin, A., Anwar, S., Adnan, A., Nawaz, M., Howard, N., Qadir, J., Hawalah, A., & Hussain, A. (2016). Comparing oversampling techniques to handle the class imbalance problem: A customer churn prediction case study. IEEE Access, 4, 7940-7957. https://doi.org/10.1109/ACCESS.2016.2619719
Weiss, G. M. (2013). Foundations of imbalanced learning. In Imbalanced Learning: Foundations, Algorithms, and Applications (pp. 13-41). Wiley. https://doi.org/10.1002/9781118646106.ch2
Chen, W., Yang, K., Yu, Z., Shi, Y., & Chen, C. L. P. (2024). A survey on imbalanced learning: Latest research, applications and future directions. Artificial Intelligence Review, 57, 137. https://doi.org/10.1007/s10462-024-10759-6
Lin, C., Tsai, C.-F., & Lin, W.-C. (2023). Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: An experimental study. Artificial Intelligence Review, 56(2), 845-863. https://doi.org/10.1007/s10462-022-10186-5
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357. https://doi.org/10.1613/jair.953
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. https://doi.org/10.1109/TKDE.2008.239
Chen, Z., Lin, T., Xia, X., Xu, H., & Ding, S. (2018). A synthetic neighborhood generation-based ensemble learning for the imbalanced data classification. Applied Intelligence, 48, 2441-2457. https://doi.org/10.1007/s10489-017-1088-8
Datta, S., & Arputharaj, A. (2018). An analysis of several machine learning algorithms for imbalanced classes. In Proceedings of the 5th International Conference on Soft Computing & Machine Intelligence (pp. 22-27). https://doi.org/10.1109/ISCMI.2018.8703244
Sowah, R. A., Kuditchar, B., Mills, G. A., Acakpovi, A., Twum, R. A., Buah, G., & Agboyi, R. (2021). HCBST: An efficient hybrid sampling technique for class imbalance problems. ACM Transactions on Knowledge Discovery from Data, 16(3), 1-37. https://doi.org/10.1145/3488280
Tsai, C.-F., Lin, W.-C., Hu, Y.-H., & Yao, G.-T. (2019). Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences, 477, 47-54.
Lin, W.-C., Tsai, C.-F., Hu, Y.-H., & Jhang, J.-S. (2017). Clustering-based undersampling in class-imbalanced data. Information Sciences, 409, 17-26. https://doi.org/10.1016/j.ins.2017.05.008
Kang, Q., Shi, L., Zhou, M., Wang, X., Wu, Q., & Wei, Z. (2017). A distance-based weighted undersampling scheme for support vector machines and its application to imbalanced classification. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/TNNLS.2017.2755595
Nugraha, W., Maulana, M. S., & Sasongko, A. (2020). Clustering-based undersampling for handling class imbalance in C4.5 classification algorithm. Journal of Physics: Conference Series, 1641, 012014. https://doi.org/10.1088/1742-6596/1641/1/012014
Rodríguez-Torres, F., Martínez-Trinidad, J. F., & Carrasco-Ochoa, J. A. (2022). An oversampling method for class imbalance problems on large datasets. Applied Sciences, 12(7), 3424. https://doi.org/10.3390/app12073424
Hamad, R. A., Kimura, M., & Lundström, J. (2020). Efficacy of imbalanced data handling methods on deep learning for smart homes environments. SN Computer Science, 1(4), 204. https://doi.org/10.1007/s42979-020-00211-1
Qian, M., & Li, Y.-F. (2022). A weakly supervised learning-based oversampling framework for class-imbalanced fault diagnosis. IEEE Transactions on Reliability, 71(1), 429-442. https://doi.org/10.1109/TR.2021.3138448
Johnson, J. M., & Khoshgoftaar, T. M. (2019). Survey on deep learning with class imbalance. Journal of Big Data, 6(1), 1-54. https://doi.org/10.1186/s40537-019-0192-5
Wongvorachan, T., He, S., & Bulut, O. (2023). A comparison of undersampling, oversampling, and SMOTE methods for dealing with imbalanced classification in educational data mining. Information, 14(1), 54. https://doi.org/10.3390/info14010054
Kou, G., Chen, H., & Hefni, M. A. (2022). Improved hybrid resampling and ensemble model for imbalance learning and credit evaluation. Journal of Management Science and Engineering, 7(4), 511-529. https://doi.org/10.1016/j.jmse.2022.06.002
Kaggle. (2009). Kaggle: Your home for data science. https://www.kaggle.com
Acuña, E., & Rodríguez, C. (2005). An empirical study of the effect of outliers on the misclassification error rate. Manuscript submitted for publication.
Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 179-186). Morgan Kaufmann.

This work is licensed under a Creative Commons Attribution 4.0 International License.
