Description of the system
The system automates the clusterization of multilingual corpora in CoNLL-U Plus format. It operates in four main steps, namely resources preprocessing, occurrences extraction, corpora representation and clusterization. All steps are controlled via a single configuration file, which is described below.
The system supports four types of resources: Eurovoc, IATE, a terms dictionary and a lemmas dictionary. During the first step these resources are converted into lemma sequences, which are then used in the occurrences extraction step. When a lemma sequence occurs in a document, the corresponding dictionary entry is marked as present in that document. If occurrences from different resource types overlap, the following precedence is applied: Eurovoc, IATE, terms dictionary, lemmas dictionary. If overlapping occurrences belong to the same resource type, the longest match principle is applied.
In the next step a representation is generated for every document, containing only the occurrences selected in the previous step. Every occurrence is repeated in the document representation as many times as it appears in the original document. The last step performs clusterization over the corpora representation. It is based on TF-IDF vectorization and the k-Means clustering algorithm from the Python scikit-learn library. To improve scalability for a large number of clusters, the MiniBatchKMeans variant is used. After the clusterization completes, the Silhouette Index, Davies–Bouldin Index and Calinski–Harabasz Index are calculated for evaluation.

Resources preprocessing
php update_eurovoc.php config.txt
php update_iate.php config.txt
php update_terms.php config.txt
php update_lemmas.php config.txt

Occurrences extraction
php update_multi_corpus.php config.txt

Representation generation
php update_representation.php config.txt

Clusterization
python3 cluster_tfidf_kmeans.py config.txt

Description of the config file
LANGUAGES - language codes list separated by comma
ANNOTATION_LANGUAGE - language for the annotated document representation. Must be among LANGUAGES.
USE_EUROVOC - whether to use Eurovoc or not
USE_IATE - whether to use IATE or not
USE_TERM - whether to use the terms dictionary or not
USE_LEMMA - whether to use the lemmas dictionary or not
USE_INTERSECT - whether to use only dictionary entries present in all languages
EUROVOC_FOLDER - folder where to save and read Eurovoc data
IATE_FOLDER - folder where to save and read IATE data
IATE_XML - list of IATE TBX files ordered as in LANGUAGES
TERM_FOLDER - folder where to save and read Term dictionary data
TERMS_FILE - path to the input TSV file with terms
LEMMA_FOLDER - folder where to save and read Lemma dictionary data
LEMMAS_FILE - path to the input TSV file with lemmas
MULTI_CORPUS_FOLDER - folder where to save and read the resource-related data extracted from the corpora
CLUSTERS_COUNT - number of clusters
DATA_EUROVOC_2 - document representation with Eurovoc MTs
DATA_FOLDER - path to the document representation
DATA_ANNOTATED_FOLDER - path to the annotated document representation
RESULTS_FOLDER - folder to save the clusterization results. For every experiment a new subfolder is created.
CORPORA_FOLDER - path to the corpora. For every language from LANGUAGES a subfolder with the language code has to be present containing a conllup folder for the documents.
IATE_FILTER_DUPLICATES - whether to filter IATE terms that have multiple meanings
FIRST_MATCH - in case of overlapping matches with the same length, whether to take only the first one or all of them
DATA_EUROVOC_2 - whether to convert IATE and Eurovoc annotations to Eurovoc MTs or not
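To make the options above more concrete, here is a minimal Python sketch of how such a file could be read. It assumes one KEY = value pair per line with comma-separated lists, which is the form of the IATE_XML example in the section on updating IATE. The helper names parse_config and as_list, the lowercase language codes and the sample values are illustrative assumptions, not part of the system.

```python
def parse_config(text):
    """Parse 'KEY = value' lines into a dict (file format is an assumption)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and lines without a separator
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


def as_list(value):
    """Split a comma-separated value such as LANGUAGES into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Hypothetical sample values, chosen only for illustration.
sample = """
LANGUAGES = bg, hr, hu
ANNOTATION_LANGUAGE = bg
USE_EUROVOC = 1
CLUSTERS_COUNT = 50
"""

config = parse_config(sample)
languages = as_list(config["LANGUAGES"])

# ANNOTATION_LANGUAGE must be among LANGUAGES.
if config["ANNOTATION_LANGUAGE"] not in languages:
    raise ValueError("ANNOTATION_LANGUAGE must be one of LANGUAGES")
```

All values are kept as strings here; converting flags such as USE_EUROVOC or CLUSTERS_COUNT to integers is left to the caller.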


Update of the resources
The system supports four types of resources: Eurovoc, IATE, term dictionary and lemma dictionary. The instructions below explain how each of these resources can be updated.
Update of Eurovoc
After downloading a new version of Eurovoc, the language-specific files for descriptors and used-for terms have to be extracted into the Eurovoc folder CONFIG->EUROVOC_FOLDER. For every language in CONFIG->LANGUAGES two files have to be present:
desc_<language_code>.xml
uf_<language_code>.xml
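A quick pre-flight check of this requirement could look like the following Python sketch. The function name missing_eurovoc_files and the lowercase language codes are assumptions made for the example; only the desc_ and uf_ file name patterns come from the description above.

```python
import os
import tempfile


def missing_eurovoc_files(eurovoc_folder, languages):
    """Return the names of expected Eurovoc files that are not in the folder."""
    missing = []
    for code in languages:
        for prefix in ("desc", "uf"):
            name = f"{prefix}_{code}.xml"
            if not os.path.isfile(os.path.join(eurovoc_folder, name)):
                missing.append(name)
    return missing


# Demo on a temporary folder that holds only one of the four expected files.
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "desc_bg.xml"), "w").close()
    demo_missing = missing_eurovoc_files(folder, ["bg", "hr"])
```

An empty result means that both files are present for every language.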
The script update_eurovoc.php automates the import of the new Eurovoc version into the system:
php update_eurovoc.php config.txt
Internally the script executes the following steps:
eurovoc_forms_search.php <language>
eurovoc_forms_lemmas.php <language>
eurovoc_convert_lemmas.php <language>
eurovoc_lemmas_search.php <language>
If CONFIG->USE_INTERSECT is set to 1, two additional steps are performed:
eurovoc_lemmas_corpus.php
eurovoc_intersected_ids.php
Update of IATE
After downloading a new version of IATE, the TBX files for the languages in CONFIG->LANGUAGES have to be extracted from the main archive into CONFIG->IATE_FOLDER. The names of these files have to be added to the config in the same order as the languages in CONFIG->LANGUAGES. For example:
IATE_XML = export_BG_2021-01-08_All_Langs.tbx, export_HR_2021-01-08_All_Langs.tbx, export_HU_2021-01-08_All_Langs.tbx, export_PL_2021-01-08_All_Langs.tbx, export_RO_2021-01-08_All_Langs.tbx, export_SK_2021-01-08_All_Langs.tbx, export_SL_2021-01-08_All_Langs.tbx
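Because the two lists are matched by position, a length or ordering mistake silently assigns a file to the wrong language. The following Python sketch shows one way to pair them and catch a length mismatch. The function name pair_languages_with_files and the lowercase language codes are assumptions; the file names are taken from the example above.

```python
def pair_languages_with_files(languages, files):
    """Map each language to the TBX file at the same position."""
    if len(languages) != len(files):
        raise ValueError(
            f"LANGUAGES has {len(languages)} entries "
            f"but IATE_XML has {len(files)}"
        )
    return dict(zip(languages, files))


languages = ["bg", "hr", "hu"]  # assumed codes, in config order
iate_xml = [
    "export_BG_2021-01-08_All_Langs.tbx",
    "export_HR_2021-01-08_All_Langs.tbx",
    "export_HU_2021-01-08_All_Langs.tbx",
]

tbx_by_language = pair_languages_with_files(languages, iate_xml)
```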
The script update_iate.php automates the import of the new IATE version into the system:
php update_iate.php config.txt
Internally the script executes the following steps:
iate_forms_search.php <language>
iate_forms_lemmas.php <language>
iate_convert_lemmas.php <language>
iate_lemmas_search.php <language>
If CONFIG->USE_INTERSECT is set to 1, two additional steps are performed:
iate_lemmas_corpus.php
iate_intersected_ids.php
Update of terms
The source file for this resource has to be in TSV (tab-separated values) format, where each column corresponds to the language at the same position in CONFIG->LANGUAGES. Its location is given in CONFIG->TERMS_FILE. The terms are updated by invoking the script update_terms.php:
php update_terms.php config.txt
The generated term specific data will be saved in CONFIG->TERM_FOLDER.
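The column-to-language correspondence can be illustrated with a short Python sketch. It assumes a file without a header row in which every row has exactly one cell per language; whether empty cells or a header are allowed is not stated here, so treat these as assumptions. The function name read_multilingual_tsv and the sample rows are illustrative only.

```python
import csv
import io


def read_multilingual_tsv(handle, languages):
    """Read rows of a TSV file into dicts keyed by language code."""
    entries = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # ignore empty lines
        if len(row) != len(languages):
            raise ValueError(f"expected {len(languages)} columns, got {len(row)}")
        entries.append(dict(zip(languages, row)))
    return entries


languages = ["bg", "hr", "hu"]  # assumed codes, in config order
sample = "закон\tzakon\ttörvény\nрешение\todluka\thatározat\n"
entries = read_multilingual_tsv(io.StringIO(sample), languages)
```

The same reading applies to the lemmas file, which uses the same layout.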
Update of lemmas
As with terms, the source file for this resource has to be in TSV (tab-separated values) format, where each column corresponds to the language at the same position in CONFIG->LANGUAGES. Its location is given in CONFIG->LEMMAS_FILE. The lemmas are updated by invoking the script update_lemmas.php:
php update_lemmas.php config.txt
The generated lemma specific data will be saved in CONFIG->LEMMA_FOLDER.
Corpora processing
The script update_multi_corpus.php automates the process of extracting occurrences of dictionary entries from the four resources described above.
php update_multi_corpus.php config.txt
For every language in CONFIG->LANGUAGES, occurrences are searched for in the corresponding corpus folder CONFIG->CORPORA_FOLDER/<language_code>/conllup.
Overlapping occurrences are not allowed. If occurrences from different resource types overlap, the one from the resource type with the higher priority is selected. The resource type precedence is Eurovoc, IATE, Terms and Lemmas. If overlapping occurrences belong to the same resource type, the one with more words is selected.
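The selection rule can be sketched in Python as a greedy pass over the candidate occurrences. This is one possible reading of the rule, not the implementation used by update_multi_corpus.php. Occurrences are assumed to be token spans (start, end, resource, entry_id) with an exclusive end; the LEMMA_ identifier prefix is hypothetical; ties between equally long matches are broken by the earlier start, which does not model the FIRST_MATCH option.

```python
# Lower rank means higher priority.
PRIORITY = {"Eurovoc": 0, "IATE": 1, "TERM": 2, "LEMMA": 3}


def overlaps(a, b):
    """True if two (start, end, ...) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def select_occurrences(occurrences):
    """Keep non-overlapping spans, preferring priority, then length."""
    ranked = sorted(
        occurrences,
        key=lambda occ: (PRIORITY[occ[2]], -(occ[1] - occ[0]), occ[0]),
    )
    selected = []
    for occ in ranked:
        if not any(overlaps(occ, kept) for kept in selected):
            selected.append(occ)
    return sorted(selected)  # back to document order


candidates = [
    (0, 2, "IATE", "IATE_2232927"),     # overlaps the Eurovoc span at token 1
    (1, 2, "Eurovoc", "Eurovoc_1697"),
    (3, 4, "Eurovoc", "Eurovoc_148"),   # shorter than the span below
    (3, 6, "Eurovoc", "Eurovoc_8414"),
    (7, 8, "LEMMA", "LEMMA_12"),
]
selected = select_occurrences(candidates)
```

In this example the IATE span loses to the overlapping Eurovoc span, and the one-token Eurovoc span loses to the three-token one.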
Corpora representation
For every document a document representation is constructed containing only the selected occurrences from the previous step. Every occurrence is repeated as many times in the document representation as it is present in the original document. The corpora representation is automated by invoking the following script:
php update_representation.php config.txt
Internally for every language in CONFIG->LANGUAGES it executes:
php generate_representation.php <language-code> config.txt
php generate_annotated_representation.php <language-code> config.txt
The annotated representation is a human readable version of the representation. The annotation language is set in CONFIG->ANNOTATION_LANGUAGE and must be among the languages in CONFIG->LANGUAGES. For example the normal representation could be:
Eurovoc_1697 Eurovoc_1697 Eurovoc_1697 Eurovoc_1697 Eurovoc_1697
Eurovoc_8414
IATE_2232927
Eurovoc_148
IATE_1891378
TERM_16268
Eurovoc_348
The annotated version of this representation is:
закон    Eurovoc_1697    5
официален печат | гербова марка | клеймо | щемпел    Eurovoc_8414    1
група    IATE_2232927    1
конституция    Eurovoc_148    1
София    IATE_1891378    1
изм    TERM_16268    1
решение    Eurovoc_348    1
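The relation between the two representations can be sketched in Python as follows. The sketch assumes that each distinct entry occupies one line, in order of first occurrence, and that the annotated columns are tab-separated; both are guesses based on the example above. The function names and the labels dictionary are illustrative.

```python
from collections import Counter


def build_representation(entry_ids):
    """One line per distinct entry, repeated as often as it occurs."""
    counts = Counter(entry_ids)  # keeps first-occurrence order
    return [" ".join([entry_id] * n) for entry_id, n in counts.items()]


def build_annotated(entry_ids, labels):
    """One line per distinct entry: label, identifier, count."""
    counts = Counter(entry_ids)
    return [f"{labels[e]}\t{e}\t{n}" for e, n in counts.items()]


# Selected occurrences of a hypothetical document, in document order.
entry_ids = [
    "Eurovoc_1697", "Eurovoc_8414", "Eurovoc_1697",
    "IATE_1891378", "Eurovoc_1697",
]
labels = {
    "Eurovoc_1697": "закон",
    "Eurovoc_8414": "официален печат | гербова марка | клеймо | щемпел",
    "IATE_1891378": "София",
}

representation = build_representation(entry_ids)
annotated = build_annotated(entry_ids, labels)
```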
Document clusterization
The clusterization operates over the corpora representation documents. It is based on TF-IDF vectorization and the k-Means clustering algorithm from the Python scikit-learn library. To improve scalability for a large number of clusters, the MiniBatchKMeans variant is used:
https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods
The clusterization is performed by executing the script:
python3 cluster_tfidf_kmeans.py config.txt
The result is saved in CONFIG->RESULTS_FOLDER. The number of clusters is specified by CONFIG->CLUSTERS_COUNT. The data representation is read from CONFIG->DATA_FOLDER. For every experiment a separate subfolder containing the experiment results is created. The Silhouette Index, Davies–Bouldin Index and Calinski–Harabasz Index are calculated for every experiment.
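The following Python sketch shows the general shape of such a pipeline with scikit-learn. It is not the code of cluster_tfidf_kmeans.py. The whitespace tokenizer, the fixed random seed, the n_init value and the toy documents are assumptions chosen for the example. The feature matrix is converted to a dense array before scoring, because the Davies–Bouldin and Calinski–Harabasz functions may not accept sparse input.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def cluster_documents(documents, clusters_count, seed=0):
    """Vectorize with TF-IDF, cluster with MiniBatchKMeans, score the result."""
    # Treat every whitespace-separated identifier as one token, case preserved.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
    matrix = vectorizer.fit_transform(documents)

    model = MiniBatchKMeans(n_clusters=clusters_count, random_state=seed, n_init=3)
    labels = model.fit_predict(matrix)

    dense = matrix.toarray()
    scores = {
        "silhouette": silhouette_score(dense, labels),
        "davies_bouldin": davies_bouldin_score(dense, labels),
        "calinski_harabasz": calinski_harabasz_score(dense, labels),
    }
    return labels, scores


# Toy document representations; real input would come from CONFIG->DATA_FOLDER.
documents = [
    "Eurovoc_1697 Eurovoc_1697 Eurovoc_148",
    "Eurovoc_1697 Eurovoc_148 Eurovoc_348",
    "Eurovoc_148 Eurovoc_1697",
    "IATE_2232927 TERM_16268",
    "IATE_2232927 IATE_1891378 TERM_16268",
    "TERM_16268 IATE_2232927 IATE_2232927",
]
labels, scores = cluster_documents(documents, clusters_count=2)
```

Higher Silhouette and Calinski–Harabasz values and lower Davies–Bouldin values indicate better separated clusters. For large corpora the dense conversion may need too much memory, in which case the scores could be computed on a sample of the documents.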
