Abstract and keywords
Abstract:
This article examines the infrastructure layer for optimizing neural network inference: model serving platforms, scaling mechanisms, and distributed computing. It compares four leading systems: NVIDIA Triton Inference Server, Ray Serve, vLLM, and BentoML. It discusses mechanisms that improve efficiency, including dynamic batching, continuous batching of requests, memory management with PagedAttention, and autoscaling strategies. The article also touches on integrating these infrastructure solutions with Kubernetes.

Keywords:
neural network inference, model serving, Triton Inference Server, Ray Serve, vLLM, dynamic batching, PagedAttention, autoscaling, Kubernetes, distributed computing
