Corpus Linguistics

The deeply annotated corpus of Russian texts SynTagRus (Syntactically Tagged Russian corpus), which has been under development at the Laboratory for a number of years, is an important autonomous part of the Russian National Corpus. As of the beginning of 2020, it contains over 1.1 million words (around 77 thousand sentences). The corpus is a collection of texts by different authors and of different genres, in which each sentence is assigned a detailed syntactic structure in the form of a dependency tree. The corpus also contains other types of annotation: lexical-semantic annotation (for ambiguous words, their actual meaning in the text is specified), lexical-functional (expressions are identified that can be interpreted in terms of lexical functions), anaphoric (antecedents of pronouns are marked), microsyntactic (syntactically sensitive phraseological units are identified), temporal (words and expressions with temporal meaning are marked). The last three types of annotation are experimental and are only present in part of the corpus.
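As an illustration of what "a syntactic structure in the form of a dependency tree" means, here is a minimal Python sketch. It assumes a simple representation in which each word stores the position of its governing word. The `Token` class, the `is_valid_tree` helper, the English example sentence, and the relation labels are hypothetical; they are not taken from SynTagRus or its tag set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    index: int      # 1-based position of the word in the sentence
    form: str       # the word as it appears in the text
    head: int       # index of the governing word, 0 for the root
    relation: str   # label of the dependency relation (illustrative only)


def is_valid_tree(tokens: List[Token]) -> bool:
    """Check that the head links form a tree: one root, no cycles."""
    heads = {t.index: t.head for t in tokens}
    roots = [i for i, h in heads.items() if h == 0]
    if len(roots) != 1:
        return False
    for start in heads:
        seen = set()
        node = start
        # Walk up towards the root; revisiting a node means a cycle.
        while node != 0:
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True


# Hypothetical example sentence with made-up relation labels.
sentence = [
    Token(1, "The", 2, "det"),
    Token(2, "cat", 3, "subj"),
    Token(3, "sleeps", 0, "root"),
]
print(is_valid_tree(sentence))
```

In such a representation, the other annotation layers mentioned above could be stored as additional fields on each word, although how the corpus actually stores them is not specified here.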

A new text is annotated semi-automatically in several stages. First, the text is processed by the parser of the linguistic processor ETAP-4, which automatically builds a syntactic structure for each sentence and adds lexical-semantic and lexical-functional annotation. Then the output of the processor is checked and corrected by specially trained linguists. After that, ETAP-4 uses the syntactic structures of the text to produce anaphoric and temporal annotation. Finally, the linguists check these two types of annotation and manually perform microsyntactic annotation.
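The order of the stages can be summarized in a small Python sketch. It only models which annotation layers are created or checked at each stage and by whom; it does not call ETAP-4, whose interface is not described here. The `Stage` class, the `PIPELINE` list, and the `verified_layers` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    actor: str                # "ETAP-4" or "linguists"
    action: str               # "create" or "check"
    layers: Tuple[str, ...]   # annotation layers handled at this stage


# Stages in the order described in the text (names are illustrative).
PIPELINE = [
    Stage("ETAP-4", "create", ("syntactic", "lexical-semantic", "lexical-functional")),
    Stage("linguists", "check", ("syntactic", "lexical-semantic", "lexical-functional")),
    Stage("ETAP-4", "create", ("anaphoric", "temporal")),
    Stage("linguists", "check", ("anaphoric", "temporal")),
    Stage("linguists", "create", ("microsyntactic",)),
]


def verified_layers(pipeline: List[Stage]) -> List[str]:
    """Return the layers confirmed by linguists, in the order they are confirmed."""
    created = set()
    verified = []
    for stage in pipeline:
        if stage.action == "create":
            created.update(stage.layers)
            # Layers produced manually by linguists count as verified.
            if stage.actor == "linguists":
                verified.extend(stage.layers)
        elif stage.action == "check":
            for layer in stage.layers:
                if layer not in created:
                    raise ValueError(f"cannot check missing layer: {layer}")
                verified.append(layer)
    return verified


print(verified_layers(PIPELINE))
```

The point of the sketch is that every automatically produced layer passes through a manual check, and that the syntactic layer is verified before the layers that depend on it are generated.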

At present, text corpora with annotation reaching the syntactic level are being developed for all major world languages, and their importance is widely recognized. On the one hand, they are a valuable source of well-structured and systematized knowledge about the syntax of a language and can be used by linguists engaged in fundamental linguistic research. On the other hand, corpus statistics may help optimize decision making in various automatic text processing procedures, including those used in modern machine learning systems.
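One simple kind of corpus statistic that could inform such decisions is the relative frequency of each dependency relation for a given pair of parts of speech. The following Python sketch shows the computation under that assumption. The function name `relation_probabilities`, the part-of-speech tags, the relation labels, and the toy data are all invented for illustration and do not come from SynTagRus.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str, str]  # (head part of speech, dependent part of speech, relation)


def relation_probabilities(edges: Iterable[Edge]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Estimate the relative frequency of each relation per part-of-speech pair."""
    counts = defaultdict(Counter)
    for head_pos, dep_pos, relation in edges:
        counts[(head_pos, dep_pos)][relation] += 1
    probabilities = {}
    for pair, counter in counts.items():
        total = sum(counter.values())
        probabilities[pair] = {rel: n / total for rel, n in counter.items()}
    return probabilities


# Invented toy data; a real study would read the edges from annotated sentences.
toy_edges = [
    ("VERB", "NOUN", "subj"),
    ("VERB", "NOUN", "obj"),
    ("VERB", "NOUN", "subj"),
    ("NOUN", "ADJ", "attr"),
]
probs = relation_probabilities(toy_edges)
print(probs[("VERB", "NOUN")])
```

A parser or a disambiguation procedure could consult such estimates when choosing among competing analyses, although real systems typically condition on much richer context.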

February 11, 2020