Etap-4 shell

Features and main functions

The etapbare tool is a shell of the linguistic processor ETAP, which offers:

  • automatic translation from Russian to English and from English to Russian,
  • automatic conversion of Russian or English text into UNL and the reverse,
  • syntactic parsing,
  • semantic parsing.

The application has a command line interface that is convenient for batch processing. ETAP processes plain text files in UTF-8 encoding with a byte order mark (BOM).
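If an input file was saved as UTF-8 without a BOM, it can be re-encoded before it is passed to etapbare. The following is a minimal Python sketch, assuming the file is already valid UTF-8; the helper name add_utf8_bom is illustrative and not part of ETAP.

```python
import codecs
from pathlib import Path


def add_utf8_bom(src: str, dst: str) -> None:
    """Copy src to dst, making sure the result starts with a UTF-8 BOM."""
    data = Path(src).read_bytes()
    # Prepend the BOM (bytes EF BB BF) only if it is not already there,
    # so running the helper twice never produces a double BOM.
    if not data.startswith(codecs.BOM_UTF8):
        data = codecs.BOM_UTF8 + data
    Path(dst).write_bytes(data)
```

The helper works on raw bytes, so line endings and the text itself are left unchanged.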

The main operation modes of ETAP supported by etapbare are:

  • translation of text from Russian to English and from English to Russian,
  • unl-conversion and unl-deconversion, i.e., translation from a natural language (English or Russian) to the Universal Networking Language (UNL) and reverse translation from UNL to a natural language (English or Russian),
  • syntactic parsing of Russian and English texts,
  • semantic parsing of Russian texts.

The easiest way to run this application is to use batch command files (.cmd, .bat).

Supported languages and their codes:

  • ru - Russian
  • en - English
  • unl - UNL
  • sem - Semantic representation

There are two input and output formats: plain text and tagged text (TGT). TGT is an XML format developed for ETAP. In addition to the text of each sentence, it contains various data on sentence structure. In command line mode TGT files are processed by etapbare; in graphical mode the same is done by Stred (structure editor).

Dictionaries, rules, ontology and other data used by the ETAP system together constitute a set called the knowledge base. It is possible to control the output of ETAP by changing dictionary entries and parsing rules. The linguistic processor uses one knowledge base at a time, but there can be several registered KBs in the system. Each knowledge base has a name which is set at each run of ETAP. The basic installation package contains only one knowledge base called <LOCAL> and ETAP uses it by default.

Command line arguments:

etapbare <command> [-inp <path1> [-out <path2>] [-err <path3>]] [-inplang <lang1>] [-outlang <lang2>] [-opt <optpath>] [-kbname <kbname>] [-kbcfg <kb.cfg>] [-loadkb true/false] [-maxtime <time>]

Which combinations of command line arguments are valid depends on the <command>. The synopsis above fixes only the order of the arguments; each argument is described below.

  • -inp <path1> - sets the path to the input document path1. Path1 can be specified with a template that contains * wildcards. A template specifies a list of all input documents with matching names.
  • -out <path2> - sets the path to the output document path2. If this option is not given, the input document will be overwritten.
  • -err <path3> - sets the path path3 to the document that collects the sentences that caused errors. If it is not set, these sentences are not output.
  • -inplang <lang1> - sets the code of the input language lang1. This argument is optional. If it is not present, etapbare will autodetect what to do depending on the input and output formats.
  • -outlang <lang2> - sets the code of the output language lang2. It is optional for unl-deconversion.
  • -opt <optpath> - specifies the path optpath to the document with translation options. If this path is not set, the default settings are used. You can view and edit the default settings with winetap (press the Advanced button). Setting options for a task and saving them creates an .ini file inside the data\auxiliar subfolder of the main ETAP folder. This file can be used as an options file by etapbare and other ETAP tools. You can store .ini files under different names and keep a custom options file for each of your tasks.
  • -kbname <kbname> - specifies the name of the knowledge base to use. If a knowledge base is not set, the system uses the default one.
  • -maxtime
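When etapbare is called from a script, the argument order of the synopsis can be reproduced with a small helper. The following Python sketch is only an illustration: the function build_etapbare_args and its parameter names are hypothetical, the executable is assumed to be reachable as "etapbare", and -kbcfg, -loadkb and -maxtime are left out because their values are not described here.

```python
from typing import List, Optional


def build_etapbare_args(
    command: str,
    inp: str,
    out: Optional[str] = None,
    err: Optional[str] = None,
    inplang: Optional[str] = None,
    outlang: Optional[str] = None,
    opt: Optional[str] = None,
    kbname: Optional[str] = None,
    exe: str = "etapbare",  # assumption: the tool is on PATH under this name
) -> List[str]:
    """Assemble an argument list in the order given by the synopsis."""
    args = [exe, command, "-inp", inp]
    # Optional arguments are appended only when a value was supplied.
    for flag, value in (
        ("-out", out),
        ("-err", err),
        ("-inplang", inplang),
        ("-outlang", outlang),
        ("-opt", opt),
        ("-kbname", kbname),
    ):
        if value is not None:
            args += [flag, value]
    return args
```

For example, build_etapbare_args("txt2txt", r".\source\en.txt", out=r".\result\to_ru.txt", inplang="en", outlang="ru") yields a list that can be handed to subprocess.run. Passing -out explicitly is a sensible habit, since the input document is overwritten when it is omitted.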

Commands (<command>) that follow the template <input format>2<output format> instruct the application to process the input documents: <input format> sets the format of the input document, and <output format> sets the format of the output document.
There are currently three supported formats:

  • txt (natural language text, possible languages are: ru, en),
  • tgt (XML document with syntactically tagged text; tagged sentences can contain a translation into ru, en, unl or sem, but not multiple translations),
  • unl (in the textual form).

The supported combinations of these formats are:

  • txt2txt (translation of plain text from one language into the other),
  • txt2tgt (syntactic or semantic parsing of the text),
  • tgt2txt (translation of a text into a natural language different from the original language),
  • txt2unl (building a UNL-document),
  • unl2txt (UNL-deconversion into the output natural language, supported languages are: ru, en),
  • tgt2tgt (edit syntactic tags).
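A script that wraps etapbare can check a command name against the combinations listed above before starting the tool. This is a minimal Python sketch; the names COMMANDS and split_command are illustrative, and the set contains only the six commands named in this document.

```python
from typing import Tuple

# The six <input format>2<output format> commands listed in this document.
COMMANDS = {"txt2txt", "txt2tgt", "tgt2txt", "txt2unl", "unl2txt", "tgt2tgt"}


def split_command(command: str) -> Tuple[str, str]:
    """Split '<input format>2<output format>' into its two format names."""
    if command not in COMMANDS:
        raise ValueError(f"unsupported command: {command!r}")
    # None of the format names contains the digit 2, so the first "2"
    # is always the separator.
    src, _, dst = command.partition("2")
    return src, dst
```

Rejecting an unknown command early avoids starting a batch run with a mistyped format name.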

Example etapbare commands:
1) Translate an English text en.txt into Russian with default settings. The result of the translation is written to to_ru.txt.

  • etapbare txt2txt -inp .\source\en.txt -out .\result\to_ru.txt -inplang en -outlang ru

2) Convert an English text en.txt to UNL with default settings. The result of the translation is written to unl.txt.

  • etapbare txt2unl -inp .\source\en.txt -out .\result\unl.txt -inplang en

3) Convert an English text en_unl.txt to UNL, using the test_kb.cfg knowledge base configuration and the settings from the opt_etap.dat file. The result of the conversion is written to to_unl.txt, and sentences that cause errors are written to err_en_unl.txt.

  • etapbare txt2unl -inp .\source\en_unl.txt -out .\result\to_unl.txt -inplang en -outlang unl -loadkb false -kbcfg test_kb.cfg -err .\err\err_en_unl.txt -opt .\opt_etap.dat

4) Syntactically parse an English text en.txt with default settings. The result of parsing is written to en.tgt.

  • etapbare txt2tgt -inp .\source\en.txt -out .\result\en.tgt -inplang en

5) Perform semantic parsing of a Russian text ru.txt with default settings. The result is written to ru_sem.tgt. The syntactic trees built during the process are saved as well.

  • etapbare txt2tgt -inp .\source\ru.txt -out .\result\ru_sem.tgt -inplang ru -outlang sem

6) Translate all English texts with filenames starting with en in the source folder into Russian with default settings. The translation is written to to_ru.txt.

  • etapbare txt2txt -inp .\source\en*.txt -out .\result\to_ru.txt -inplang en -outlang ru
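Before running a command with a wildcard template, it can be useful to preview which file names the template would select. The Python sketch below uses the standard fnmatch module as an approximation; whether etapbare matches names in exactly the same way (for example with respect to letter case) is an assumption, and preview_matches is a hypothetical helper name.

```python
import fnmatch
from typing import Iterable, List


def preview_matches(names: Iterable[str], template: str) -> List[str]:
    """Return the names that match a *-wildcard template, sorted."""
    # fnmatchcase compares case-sensitively on every platform; note that
    # fnmatch also gives special meaning to "?" and "[...]".
    return sorted(n for n in names if fnmatch.fnmatchcase(n, template))
```

With the names en1.txt, en_news.txt, ru1.txt and en.tgt and the template en*.txt, the helper returns en1.txt and en_news.txt.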

7) Perform semantic parsing of all Russian texts in tgt-format inside the given folder, using default settings. The resulting semantic structures are stored in all_ru_sem.tgt.

  • etapbare txt2tgt -inp .\source\russian\*.tgt -out .\result\all_ru_sem.tgt -inplang ru -outlang sem