This paper examines class imbalance in machine learning models that automatically classify the helpfulness of user reviews on the Steam platform. An analysis of more than 6 million reviews shows that only about 15% are helpful. This produces an "accuracy paradox" when standard classifiers are trained: a model that labels every review as unhelpful reaches roughly 85% accuracy while identifying no helpful reviews. The paper compares two sample balancing methods: class weighting and majority-class undersampling.
Class imbalance, machine learning, review analysis, undersampling, Steam, TF-IDF, text classification
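The accuracy paradox and the two balancing methods named in the abstract can be illustrated with a minimal sketch. It runs on synthetic labels with a 15% positive share rather than the Steam data, and it is not the authors' implementation. The function names are hypothetical, and the weight formula is the common "balanced" heuristic (total samples divided by the number of classes times the class count), which is an assumption about how class weighting might be set up.

```python
import random
from collections import Counter


def majority_baseline(labels, positive=1):
    """Accuracy and positive-class recall of a model that always predicts the majority class."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    accuracy = counts[majority] / len(labels)
    # Recall is 1.0 only if the majority class happens to be the positive one.
    recall = 1.0 if majority == positive else 0.0
    return accuracy, recall


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def undersample_majority(labels, seed=0):
    """Return sorted indices that keep every class at the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        # Randomly drop surplus examples from the larger classes.
        kept.extend(rng.sample(idxs, target))
    return sorted(kept)


# Synthetic labels (assumption): 1 = helpful (15%), 0 = not helpful (85%).
labels = [1] * 1500 + [0] * 8500

accuracy, recall = majority_baseline(labels)
weights = balanced_class_weights(labels)
kept = undersample_majority(labels)
balanced_counts = Counter(labels[i] for i in kept)

print(f"majority baseline: accuracy={accuracy:.2f}, helpful recall={recall:.2f}")
print(f"class weights: {weights}")
print(f"after undersampling: {dict(balanced_counts)}")
```

Class weighting keeps all examples and makes errors on the minority class more costly during training. Undersampling discards majority examples instead, which shortens training but loses data.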