A methodological framework for evaluating ADASYN and borderline-SMOTE oversampling techniques in imbalanced epidemiological data: a proof-of-concept study for Lassa fever detection

Osowomuabe Njama-Abang; Denis U. Ashishie; Paul T. Bukie; Ahena I. Bassey

doi:10.61298/rans.2026.4.1.238

Authors

Osowomuabe Njama-Abang
Department of Computer Science, University of Calabar PMB 1115, Etta Agbo Rd, Calabar, Nigeria
Denis U. Ashishie
Department of Computer Science, University of Calabar PMB 1115, Etta Agbo Rd, Calabar, Nigeria
Paul T. Bukie
Department of Computer Science, University of Calabar PMB 1115, Etta Agbo Rd, Calabar, Nigeria
Ahena I. Bassey
Department of Computer Science, University of Calabar PMB 1115, Etta Agbo Rd, Calabar, Nigeria

Keywords:

Lassa fever, Machine learning, Class imbalance, Oversampling, Synthetic data

Abstract

Class imbalance in epidemiological datasets poses a fundamental challenge to developing accurate predictive models, particularly for rare but critical outcomes. This proof-of-concept study presents a methodological framework for evaluating two advanced oversampling techniques, Adaptive Synthetic (ADASYN) sampling and Borderline-SMOTE, in imbalanced medical classification tasks. Using a controlled synthetic dataset that mimics the class distribution of Lassa fever epidemiological data, we systematically compare how effectively these techniques prepare imbalanced datasets for machine learning. Our methodology emphasizes rigorous experimental design, including strict train-test separation before oversampling is applied, comprehensive ablation studies, and transparent statistical analysis. Individual machine learning models (Random Forest, XGBoost, LightGBM, and Neural Networks) and a weighted ensemble model were evaluated using metrics appropriate for imbalanced classification. The study employs synthetic data to establish a controlled experimental environment for algorithmic comparison. Although the results demonstrate the technical capabilities of ADASYN and Borderline-SMOTE under ideal conditions, the reported performance metrics should not be interpreted as clinically validated or representative of real-world performance. The primary contribution is a reusable methodological framework and a comparative analysis of oversampling strategies, both of which require validation on authentic clinical datasets before any deployment is considered. This work provides computational epidemiologists with evidence-based guidance for technique selection while clearly delineating the boundary between methodological demonstration and clinical applicability.
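
The requirement that train-test separation precede oversampling can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the function names (`make_toy_data`, `stratified_split`, `nearest`, `adasyn_like`) and all parameter values are hypothetical, and the oversampler is a simplified ADASYN-style variant that draws seed points with probability proportional to their share of majority-class neighbours rather than allocating exact per-point counts. Only the standard library is assumed.

```python
import math
import random


def make_toy_data(n_major=180, n_minor=20, seed=0):
    """Hypothetical 2-D toy data: label 0 = majority, label 1 = minority."""
    rng = random.Random(seed)
    X = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_major)]
    X += [(rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)) for _ in range(n_minor)]
    return X, [0] * n_major + [1] * n_minor


def stratified_split(y, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_frac))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def nearest(X, i, candidates, k):
    """The k candidates closest to point i (Euclidean), excluding i itself."""
    pool = [j for j in candidates if j != i]
    return sorted(pool, key=lambda j: math.dist(X[i], X[j]))[:k]


def adasyn_like(X, y, k=5, seed=0):
    """Balance the classes with simplified ADASYN-style interpolation."""
    rng = random.Random(seed)
    min_idx = [i for i, lab in enumerate(y) if lab == 1]
    n_needed = (len(y) - len(min_idx)) - len(min_idx)
    if n_needed <= 0 or len(min_idx) < 2:
        return list(X), list(y)
    # Difficulty of each minority point: share of majority labels among
    # its k nearest neighbours in the whole training set.
    ratios = [sum(y[j] == 0 for j in nearest(X, i, range(len(X)), k)) / k
              for i in min_idx]
    weights = ratios if sum(ratios) > 0 else [1.0] * len(min_idx)
    synthetic = []
    # Harder points are drawn more often as seeds (sampling assumption).
    for i in rng.choices(min_idx, weights=weights, k=n_needed):
        j = rng.choice(nearest(X, i, min_idx, k))  # a minority neighbour
        lam = rng.random()
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(X[i], X[j])))
    return list(X) + synthetic, list(y) + [1] * len(synthetic)


# 1) Split first, 2) oversample the training split only, 3) leave test as-is.
X, y = make_toy_data()
train_idx, test_idx = stratified_split(y)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_bal, y_bal = adasyn_like(X_train, y_train)

print("train before:", len(y_train) - sum(y_train), "majority /", sum(y_train), "minority")
print("train after: ", len(y_bal) - sum(y_bal), "majority /", sum(y_bal), "minority")
print("test (untouched):", len(test_idx), "samples,", sum(y[i] for i in test_idx), "minority")
```

With these toy settings the training split goes from 135 majority and 15 minority samples to 135 of each, while the 50 held-out samples keep their original 45:5 ratio. Because synthetic points are interpolated only between training samples, no information from the test split can leak into the model, which is the property the experimental design depends on.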
