ABSTRACTS OF ARTICLES OF THE JOURNAL "INFORMATION TECHNOLOGIES".
No. 6. Vol. 29. 2023

DOI: 10.17587/it.29.307-315

A. V. Berezhkov, Master's Degree, Lecturer, G. S. Larionova, Master Student, V. I. Martsinkevich, Master Student, V. V. Tereshchenko, Master Student,
ITMO University, St. Petersburg, 197101, Russian Federation

Implementation of an Algorithm for Extracting Information about Structural Elements of Text Documents in ODT Format

The article examines how the XML markup of a digital document in ODT format depends on the tool used to create it. The comparison covers not only specialized tools but also tools that do not work with the ODT format directly, in order to identify the most vulnerable spots. The features of extracting data from structural elements of the document, such as tables, lists, and images, are also described. An implementation of an algorithm for obtaining style attributes, used to build a system for automated normative control of digital documents, is proposed and described. It is shown that the non-strict standard of the ODT format makes the XML markup dependent on the text editor used to create the document and, as a result, limits the number of tags that can be relied upon when developing document parsing algorithms. Nevertheless, the task is feasible, as demonstrated in the article. The default values and the description of the algorithm for traversing the document by blocks and structural elements form the basis for preparing data for the subsequent creation of a classifier and for automating normative control. Thus, the proposed algorithm and the analysis of the XML markup are effective tools for building an automated normative control system for documents, and the algorithm has the potential for further improvement.
Keywords: parsing, electronic document, ODT, XML, Python, normative control, document styles

P. 307-315
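
The abstract names Python, XML, and the extraction of style attributes from ODT documents but does not reproduce the algorithm itself. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes the standard ODT layout (a ZIP archive containing `content.xml`) and the OpenDocument namespace URIs, and it resolves only automatic styles declared in `content.xml`. The function names `extract_blocks`, `read_automatic_styles`, and `make_sample_odt`, as well as the sample document, are hypothetical and introduced only for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}


def qname(prefix, name):
    """Build the '{namespace}name' form that ElementTree uses internally."""
    return f"{{{NS[prefix]}}}{name}"


def local(name):
    """Strip the '{namespace}' part from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def read_automatic_styles(root):
    """Map each automatic style name to its parent style and attributes."""
    styles = {}
    for style in root.iterfind("office:automatic-styles/style:style", NS):
        props = {}
        # Children are property groups such as style:paragraph-properties.
        for child in style:
            for key, value in child.attrib.items():
                props[local(key)] = value
        styles[style.get(qname("style", "name"))] = {
            "parent": style.get(qname("style", "parent-style-name")),
            "props": props,
        }
    return styles


def extract_blocks(odt_file):
    """Return one record per paragraph or heading, in document order."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    styles = read_automatic_styles(root)
    body = root.find("office:body/office:text", NS)
    if body is None:
        return []
    wanted = {qname("text", "p"), qname("text", "h")}
    blocks = []
    # iter() also descends into tables and lists, so nested paragraphs appear too.
    for elem in body.iter():
        if elem.tag not in wanted:
            continue
        style_name = elem.get(qname("text", "style-name"))
        style = styles.get(style_name, {"parent": None, "props": {}})
        blocks.append({
            "kind": local(elem.tag),
            "style": style_name,
            "parent": style["parent"],
            "props": style["props"],
            "text": "".join(elem.itertext()),
        })
    return blocks


def make_sample_odt():
    """Hypothetical in-memory document, used only to exercise the sketch."""
    xmlns = " ".join(f'xmlns:{p}="{uri}"' for p, uri in NS.items())
    content = (
        f"<office:document-content {xmlns}>"
        "<office:automatic-styles>"
        '<style:style style:name="P1" style:family="paragraph"'
        ' style:parent-style-name="Standard">'
        '<style:paragraph-properties fo:text-align="center"/>'
        '<style:text-properties fo:font-size="14pt"/>'
        "</style:style>"
        "</office:automatic-styles>"
        "<office:body><office:text>"
        '<text:h text:style-name="Heading_20_1">Intro</text:h>'
        '<text:p text:style-name="P1">Hello</text:p>'
        "</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content.encode("utf-8"))
    buffer.seek(0)
    return buffer


for block in extract_blocks(make_sample_odt()):
    print(block["kind"], block["style"], block["props"])
```

In this sketch a style that is not declared among the automatic styles (here the heading style) yields an empty attribute set. Named styles and defaults normally live in `styles.xml`, so a fuller version would presumably also read that file and follow the `parent` chain, which is one place where the editor-dependent differences described in the abstract would need handling.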

