Optimizing the performance of the K-Nearest Neighbors algorithm using grid search and feature scaling to improve data classification accuracy
DOI: https://doi.org/10.35335/mandiri.v14i2.466
Keywords: Feature Scaling, Grid Search, K-Nearest Neighbors, Medical Classification, Parameter Optimization
Abstract
The performance of distance-based classification algorithms such as K-Nearest Neighbors (KNN) is highly dependent on proper feature scaling and optimal parameter selection. Without systematic optimization, KNN may experience decreased accuracy due to feature scale disparities and suboptimal k-values. This study aims to enhance the performance of the KNN algorithm through the integration of Feature Scaling and Grid Search Cross-Validation as a parameter optimization strategy. The research employs the Breast Cancer Wisconsin Dataset, divided into 80% training and 20% testing data. Feature normalization was performed using StandardScaler, while Grid Search was applied to determine the optimal combination of parameters, including the number of neighbors (k), weighting function (weights), and distance metric (metric). The optimized KNN configuration with k = 9, weights = distance, and metric = manhattan achieved an average accuracy of 97.19%, outperforming the baseline accuracy of 93.86%. A paired t-test confirmed that the improvement was statistically significant (p < 0.05). These findings demonstrate that the synergy between feature scaling and parameter tuning can substantially improve both the accuracy and stability of KNN models. The scientific novelty of this study lies in the systematic integration of normalization and parameter optimization through Grid Search, providing an empirical framework that enhances KNN’s robustness across datasets with heterogeneous feature distributions. The proposed approach is recommended for medical data classification and can be adapted to other domains with heterogeneous numerical feature distributions.
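The workflow described in the abstract can be sketched with scikit-learn roughly as follows. This is an illustrative sketch, not the authors' code: it assumes that scikit-learn's `load_breast_cancer` corresponds to the dataset used in the study, and the function name `tune_knn`, the random seed, the stratified split, the 5-fold setting, and the candidate values in `param_grid` are all assumptions chosen for illustration. Placing `StandardScaler` inside a `Pipeline` means the scaling statistics are re-estimated on each training fold during the grid search, so no information from the validation folds leaks into the scaler.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_knn(random_state=42):
    """Compare an unscaled default KNN with a scaled, grid-searched KNN."""
    X, y = load_breast_cancer(return_X_y=True)

    # 80% training / 20% testing, as stated in the abstract.
    # The stratified split and the seed are assumptions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )

    # Baseline: default KNN on the raw, unscaled features.
    baseline = KNeighborsClassifier().fit(X_train, y_train)

    # Scaling and classification chained so that the scaler is fitted
    # only on the training portion of each cross-validation fold.
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ])

    # Hypothetical search space; the grid used in the study is not given.
    param_grid = {
        "knn__n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15],
        "knn__weights": ["uniform", "distance"],
        "knn__metric": ["euclidean", "manhattan"],
    }

    search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)

    return {
        "baseline_accuracy": baseline.score(X_test, y_test),
        "tuned_accuracy": search.score(X_test, y_test),
        "best_params": search.best_params_,
    }


if __name__ == "__main__":
    for name, value in tune_knn().items():
        print(name, value)
```

The selected parameters and the resulting accuracies depend on the split, the seed, and the search space, so a run of this sketch should not be expected to reproduce the figures reported in the abstract.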
License
Copyright (c) 2025 Jonson Manurung, Hondor Saragih, Muhammad Azhar Prabukusumo, Eryan Ahmad Firdaus

This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.




