Curated Multilingual Language Resources for CEF.AT
Department of Computational Linguistics
Period: 01.06.2020 – 31.05.2022
Type of Project: collective, international
Partners: Section of Linguistics and Literary Scholarship, Hungarian Academy of Sciences; Faculty of Humanities and Social Sciences, University of Zagreb, Croatia; Institute of Computer Science, Polish Academy of Sciences; Research Institute for Artificial Intelligence, Romanian Academy; Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences; Jožef Stefan Institute, Slovenia
Funding: Innovation and Networks Executive Agency (INEA), Connecting Europe Facility (CEF), Telecom sector
Principal Investigator: Prof. Svetla Koeva, PhD
Participants: Prof. Svetla Koeva, PhD (from 01.07.2021), Prof. Tinko Tinchev, PhD, Assist. Prof. Tsetana Dimitrova, PhD, Assist. Prof. Valentina Stefanova, PhD, Martin Yalamov, Nikola Obreshkov
Abstract:
The project will provide resources in seven languages, selected and processed by appropriate methods: Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovene. The resources will come from the following thematic fields: finance, health care, scientific research, cultural heritage, education, economics and politics, under the Innovation and Networks Executive Agency (INEA), Connecting Europe Facility (CEF).
The specific goals of the project are: creating a multilingual resource with documents from different thematic fields; automatic linguistic annotation of the multilingual resource; and automatic linguistic processing and enrichment of the multilingual resource.
The project will deliver at least 140 million words (20 million per language). The results will be used to train the machine translation systems of the Connecting Europe Facility's Automated Translation platform (CEF.AT). The quality of automatic translation depends on training the translation systems on a large volume of translatable documents from a given thematic field. The value of automatic translation keeps growing as the economic, political and cultural ties between the different (European) states strengthen.
The project is carried out as part of the priority research area of the Institute for Bulgarian Language, "Electronic Language Resources and Tools for their Processing".
Results: linguistically processed and annotated language data from different thematic fields (at least 20 million words), software, and infrastructure for the extraction and semantic processing of documents.