~runtime/python/examples/README
(c) Prasanth Kolachina, 22 April 2015

======================
TRANSLATION PIPELINE
======================

The module translation_pipeline.py is a Python replica of the translation
pipeline used in the Wide-coverage Translation demo. The pipeline allows for
  1. simultaneous batch translation from one language into multiple languages
  2. K-best translations
  3. translation of both text files and sgm files.

The module defines functions for the standard lexer used in the pipeline and
for the callbacks used in robust parsing to partially deal with unknown words,
proper nouns, etc.

Basic example usage (INPUT is the input file, EXP_DIRECTORY the experiment
directory to which translation files are written):
> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -i INPUT -e EXP_DIRECTORY
> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -K 20 -i INPUT -e EXP_DIRECTORY
> python translation_pipeline.py -g Translate11.pgf -s Eng -t Fin Swe Ger -i INPUT -e EXP_DIRECTORY
> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -f sgm -i INPUT -e EXP_DIRECTORY

The full list and description of options accepted by the translation_pipeline
module can be seen using the -h option.
> python translation_pipeline.py -h
---
usage: translation_pipeline.py [-h] -g PGFFILE [-s SRCLANG]
                               [-t [TGTLANGS [TGTLANGS ...]]] [-i INPUT]
                               [-e EXP_DIRECTORY] [-f {txt,sgm}]
                               [-p PROPSFILE] [-K BESTK]

Run the GF translation pipeline on standard test-sets

optional arguments:
  -h, --help            show this help message and exit
  -g PGFFILE, --pgf PGFFILE
                        PGF grammar file to run the pipeline
  -s SRCLANG, --source SRCLANG
                        Source language of input sentences
  -t [TGTLANGS [TGTLANGS ...]], --target [TGTLANGS [TGTLANGS ...]]
                        Target languages to linearize (default is all other
                        languages)
  -i INPUT, --input INPUT
                        input file (default will accept STDIN)
  -e EXP_DIRECTORY, --exp EXP_DIRECTORY
                        experiement directory to write translation files
  -f {txt,sgm}, --format {txt,sgm}
                        input file format (output files will be written in the
                        same format)
  -p PROPSFILE, --props PROPSFILE
                        properties file for the translation pipeline (specify
                        the above arguments in a file)
  -K BESTK              K value for K-best translation

======================
PREREQUISITES
======================

In order to use the examples in this directory, the following components are
required:
  1. GF C runtime (~runtime/c/)
  2. Python bindings to the C runtime (~runtime/python/)
  3. The path to the Python library added to the PYTHONPATH environment
     variable. (Note: by default, setuptools installs the bindings to a
     location available to everyone, so this step is only required if you
     have done a custom installation of the Python bindings and you know
     what you are doing.)
     > export PYTHONPATH="$GF/src/runtime/python/build/lib.*:$PYTHONPATH"

======================
WEB GF PARSING
======================

NEW!!! Currently, we parse large web texts using GF grammars. The same
functions described in gf_utils.py are used, but we speed them up by running
them in parallel. The multiprocessing module in Python allows for trivial
parallelization, where each batch of sentences is parsed by a different worker
in the pool.
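
The batch-per-worker scheme described above can be sketched as follows. This
is only a minimal illustration, not the code used in the experiments:
parse_batch is a hypothetical placeholder that merely counts tokens, standing
in for a call to the parsing functions in gf_utils.py, and the batch size and
number of workers are arbitrary values chosen for the example.

```python
from multiprocessing import Pool


def batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parse_batch(batch):
    """Hypothetical stand-in for parsing one batch of sentences.

    A real worker would load the grammar and parse each sentence here;
    this placeholder only pairs each sentence with its token count.
    """
    return [(sentence, len(sentence.split())) for sentence in batch]


def parse_corpus(sentences, batch_size=2, workers=2):
    """Hand each batch to a worker in the pool and flatten the results."""
    with Pool(processes=workers) as pool:
        # Pool.map preserves the order of the batches in its result.
        results = pool.map(parse_batch, batches(sentences, batch_size))
    return [item for batch in results for item in batch]


if __name__ == "__main__":
    corpus = ["this is a test", "hello world", "one more sentence"]
    for sentence, n_tokens in parse_corpus(corpus):
        print(n_tokens, sentence)
```

Working in batches rather than single sentences keeps the overhead of passing
data between workers small relative to the parsing work itself.
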
We noticed one thing during these experiments: the GF parser can take an
unusually long time on long and ambiguous sentences. Therefore, to avoid
resource starvation, we use a `timeout' setting that raises a PgfParseError if
parsing a single sentence takes more than 5 minutes. With this simple trick,
we manage to parse large corpora (Europarl texts) in both English and Swedish.

Please contact prasanth.kolachina@cse.gu.se if you have any questions about
this.

======================
GENERIC GF UTILITIES
======================

The module gf_utils.py contains functions to carry out four basic tasks:
  1. 1-best parsing (parse)
  2. K-best parsing (kparse)
  3. 1-best linearization (linearize)
  4. K-best linearization (klinearize)

> usage: gf_utils.py [-h] {parse,kparse,linearize,klinearize} ...

Detailed arguments for each function can be found using the "-h" option. An
exhaustive list of options for all the functions is given below. Options
marked with (*) are required; the others are optional.

(*) -g/--pgf         PGF grammar file.
(*) -s/--src-lang    Source language name, i.e. the code used in GF (e.g.
                     TranslateEng, TranslateFin). For parsing, this option
                     specifies the language of the input sentences.
(*) -t/--tgt-lang    Target language name. For linearization, this option
                     specifies the language into which the trees are
                     linearized.
(*) -K               Prespecified K value for K-best parsing and
                     linearization.
    -i/--input       Input file name, containing either raw text sentences
                     or abstract trees for linearization.
    -o/--output      Output file name.
    -p/--start-sym   Start symbol used for parsing.

Basic example usage (INPUT and OUTPUT stand for the input and output file
names):
> python gf_utils.py parse -g TranslateEng.pgf -s TranslateEng -i INPUT -o OUTPUT [-p Phr]
> python gf_utils.py kparse -g TranslateEng.pgf -s TranslateEng -K 20 -i INPUT -o OUTPUT [-p Phr]
> python gf_utils.py linearize -g TranslateFin.pgf -t TranslateFin -i INPUT -o OUTPUT
> python gf_utils.py klinearize -g TranslateFin.pgf -t TranslateFin -i INPUT -o OUTPUT

======================
FILE I/O FORMATS
======================

1. One-sentence-per-line
   The input to the parser/kparser is written one sentence per line. This is
   also the standard format used in the translation pipeline.

2. SGM format
   The translation pipeline accepts SGM files as both input and output. This
   format is expected by the automatic evaluation metrics used to measure the
   quality of MT systems, and is used primarily by the NIST evaluation and the
   WMT Shared Task evaluations.

3. Parser output format
   The parser writes one line per sentence, with four columns separated by the
   -character: the sentence index, the time taken by the parser, the tree
   probability value and the abstract syntax tree.

4. K-best parser output format
   The k-best parser uses a representation that has come to be called CJ
   (Charniak-Johnson) format in the parsing community.
   a. The output consists of one block of parses for each sentence.
      Consecutive blocks are separated by an empty line.
   b. The first line in the block contains two numbers: the number of parses
      in that block, and an identifier for that sentence.
   c. Each subsequent pair of lines contains the log probability of the
      abstract tree on one line, followed by the actual parse tree on the
      next line.

5. K-best translations output format
   The k-best linearizer and k-best translation use the same format as Moses
   and other SMT toolkits to write K-best translation lists.
   a. The output consists of one block of translations for each sentence.
   b. Each block consists of several translations, one per line.
   c. Each translation (or line) consists of four columns separated by the
      string '|||'. The first column contains a sentence identifier, the
      second column is the actual translation, the third is word-alignment
      information between the input sentence and the translation, and the
      fourth holds the scores from the statistical models used in parsing.
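
As an illustration of the two K-best formats described above, here is a small
sketch of how they could be read back in Python. It is not part of gf_utils.py:
the function names read_kbest_parses and parse_nbest_line are hypothetical, and
the trees, alignment string and score in the sample data are made-up
placeholders (the exact encoding of the alignment and score columns is not
assumed, so both are kept as plain strings).

```python
from itertools import chain


def read_kbest_parses(lines):
    """Yield (sentence_id, [(log_prob, tree), ...]) for each CJ-style block."""
    block = []
    # The extra "" at the end flushes the last block if the input does not
    # end with an empty line.
    for line in chain(lines, [""]):
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # Header line: number of parses, then the sentence identifier.
            _count, sent_id = block[0].split()[:2]
            # Remaining lines alternate: log probability, then parse tree.
            pairs = zip(block[1::2], block[2::2])
            yield sent_id, [(float(logp), tree) for logp, tree in pairs]
            block = []


def parse_nbest_line(line):
    """Split one K-best translation line into its four '|||' columns."""
    fields = [field.strip() for field in line.split("|||")]
    sent_id, translation, alignment, scores = fields
    return sent_id, translation, alignment, scores


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = ["2 7", "-1.5", "(tree_a)", "-2.0", "(tree_b)", ""]
    for sent_id, parses in read_kbest_parses(sample):
        print(sent_id, parses)
    print(parse_nbest_line("7 ||| hei ||| 0-0 ||| -1.5"))
```

Because read_kbest_parses accepts any iterable of lines, an open file object
can be passed to it directly.
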