Abstract and keywords
Abstract:
This article examines how computer vision and natural language processing (NLP) are applied to build intelligent document processing (IDP) systems for electronic document management. It analyzes multimodal transformer-based architectures (LayoutLMv3, DocLLM, and UDOP) that jointly encode a document's text, image, and spatial layout. Quantitative benchmark results are presented for data extraction from forms and receipts, document classification, and visual question answering. The economic feasibility of adopting IDP solutions is substantiated using market statistics for 2023–2025. Key limitations and promising directions for further development are identified.

Keywords:
intelligent document processing, electronic document management, computer vision, natural language processing, optical character recognition, LayoutLMv3, multimodal models, transformers
References

1. Intelligent Document Processing (IDP) Market Size to Hit USD 43.92 Billion by 2034 // Precedence Research. – Updated: November 2025. – URL: https://www.precedenceresearch.com/intelligent-document-processing-market (accessed: 12.02.2026).

2. Rossiyskiy rynok SED uderzhivaet tempy rosta v 15–20% ezhegodno [The Russian EDMS market maintains growth rates of 15–20% per year] // CNews Analytics. – 2024. – URL: https://corp.cnews.ru/reviews/rynok_sed_2024 (accessed: 12.02.2026).

3. 2025 OCR Accuracy Benchmark Results: A Deep Dive Analysis // Sparkco AI. – 2025. – URL: https://sparkco.ai/blog/2025-ocr-accuracy-benchmark-results-a-deep-dive-analysis (accessed: 12.02.2026).

4. Xu Y. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding / Y. Xu, Y. Xu, T. Lv [et al.] // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL). – 2021. – P. 2579–2591. DOI: https://doi.org/10.18653/v1/2021.acl-long.201

5. Huang Y. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking / Y. Huang, T. Lv, L. Cui [et al.] // Proceedings of the 30th ACM International Conference on Multimedia (MM '22). – 2022. – P. 4083–4091. DOI: https://doi.org/10.1145/3503161.3548112

6. Tang Z. Unifying Vision, Text, and Layout for Universal Document Processing / Z. Tang, Z. Yang, G. Wang [et al.] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2023. – P. 19254–19264. DOI: https://doi.org/10.1109/CVPR52729.2023.01845

7. Wang D. DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding / D. Wang, N. Raman, M. Sibue [et al.] // Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). – 2024. – P. 8442–8468. DOI: https://doi.org/10.18653/v1/2024.acl-long.463

8. Yang X. Clinical Concept Extraction Using Transformers / X. Yang, J. Bian, W. R. Hogan, Y. Wu // Journal of the American Medical Informatics Association. – 2020. – Vol. 27, No. 12. – P. 1935–1942. DOI: https://doi.org/10.1093/jamia/ocaa189

9. Abilio R. Evaluating Named Entity Recognition: A comparative analysis of mono- and multilingual transformer models on a novel Brazilian corporate earnings call transcripts dataset / R. Abilio, L. A. F. Pereira, R. M. Marcacini // Expert Systems with Applications. – 2024. – Vol. 255. – Art. 124647. DOI: https://doi.org/10.1016/j.asoc.2024.112158

10. Lai H. Language models for data extraction and risk of bias assessment in complementary medicine / H. Lai, J. Tang, G. Liu [et al.] // npj Digital Medicine. – 2025. – Vol. 8, No. 1. – Art. 74. DOI: https://doi.org/10.1038/s41746-025-01457-w
