Page 1
The GF Cloud: Web Apps and APIs
GF Summer School 2017
Thomas Hallgren
The wide-coverage translation system
Take a look:
Wide Coverage Translation Demo
Page 2
The wide-coverage translation system
How does it work?
The App grammar: RGL + Phrasebook + Chunking
Things that aren't handled by the grammar:
Segmentation
Tokenization
Capitalization
Page 3
Segmentation
Done in JavaScript
function split_punct(s) { return s.split(/([.!?]+[ \t\n]+|\n\n+|[ \t\n]*[-•*+#]+[ \t\n]+)/) }
Multiple segments can be sent to the server and translated in parallel
But it seems ad-hoc: it should be part of the grammar...
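A minimal sketch of how the segments might be used, assuming nothing beyond the split_punct function shown above. Because the regular expression has one capture group, split returns text and separators alternately, so the separators can be put back after translation. The names segments, translateAll and translateSegment are hypothetical; translateSegment stands for whatever function sends one segment to the server and returns a promise.

```javascript
// From the slide: split on sentence-final punctuation, blank lines and list bullets.
function split_punct(s) {
  return s.split(/([.!?]+[ \t\n]+|\n\n+|[ \t\n]*[-•*+#]+[ \t\n]+)/)
}

// Hypothetical helper: pair each text segment with the separator that follows it.
// Even indices of the split result are text, odd indices are captured separators.
function segments(s) {
  const parts = split_punct(s)
  const out = []
  for (let i = 0; i < parts.length; i += 2) {
    out.push({ text: parts[i], sep: parts[i + 1] || "" })
  }
  return out
}

// Hypothetical helper: translate all segments in parallel and reassemble the text.
// translateSegment is supplied by the caller (assumption: it returns a promise).
async function translateAll(text, translateSegment) {
  const segs = segments(text)
  const translated = await Promise.all(
    // Empty segments can occur between adjacent separators; skip them.
    segs.map(seg => (seg.text ? translateSegment(seg.text) : ""))
  )
  return translated.map((t, i) => t + segs[i].sep).join("")
}
```

Keeping the original separators means the punctuation and layout of the input survive the round trip, but the segment boundaries are still decided by a regular expression rather than by the grammar.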
Page 4
Tokenization
Done in the server
.../App14.pgf?command=c-translate&lexer=text&unlexer=text
...
Separates punctuation from words
Parentheses and quotes?
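To illustrate what "separates punctuation from words" means, here is a rough sketch in JavaScript. It is not the server's lexer=text implementation; lexText and unlexText are hypothetical names, and the rules are an assumption chosen to keep the example small.

```javascript
// Hypothetical approximation of a text lexer: words (with internal
// apostrophes or hyphens) become tokens, and every other non-space
// character becomes a token of its own.
function lexText(s) {
  return s.match(/[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*|[^\s\p{L}\p{N}]/gu) || []
}

// Hypothetical approximation of the inverse: join tokens with spaces,
// but attach closing punctuation to the previous token and opening
// brackets to the next one.
function unlexText(tokens) {
  let out = ""
  let glueNext = true // no space before the first token
  for (const t of tokens) {
    const closing = /^[.,;:!?)\]]$/.test(t)
    if (!glueNext && !closing) out += " "
    out += t
    glueNext = /^[(\[]$/.test(t)
  }
  return out
}
```

Brackets work because their direction is visible in the character itself. A straight double quote is not: this sketch cannot tell an opening quote from a closing one, so unlexText(lexText('"hi"')) comes back with spaces inside the quotes.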
Page 5
Tokenization
Was necessary before
Nowadays, grammars could be rewritten with the BIND, SOFT_BIND and SOFT_SPACE tokens instead.
Page 6
Capitalization
Done in the server, as part of the tokenization by lexer=text
Need to change the first word of a sentence to lower case
Causes problems if the first word is a name, or in English "I", "I'm", etc.
Keep upper case if the word is all caps
Keep upper case if it is a valid word in the grammar (lookupMorpho)
Misses multi-word expressions?
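The rules above can be sketched as follows. This is an illustration in JavaScript, not the server code: decapitalizeFirst is a hypothetical name, and isKnownWord is a hypothetical predicate standing in for a morphological lookup such as lookupMorpho. Treating single letters as not "all caps" is also an assumption, made so that a word like "I" is decided by the lexicon rule.

```javascript
// Lower-case the first token of a sentence unless one of the
// exceptions from the slide applies.
// isKnownWord: hypothetical predicate, true if the token as written
// is a valid word in the grammar.
function decapitalizeFirst(tokens, isKnownWord) {
  if (tokens.length === 0) return tokens
  const first = tokens[0]
  // All caps: more than one character, contains letters, no lower case.
  const isAllCaps =
    first.length > 1 &&
    first === first.toUpperCase() &&
    first !== first.toLowerCase()
  if (isAllCaps || isKnownWord(first)) return tokens
  return [first.charAt(0).toLowerCase() + first.slice(1), ...tokens.slice(1)]
}
```

With a lexicon that contains "I" and "London" but not "The", the first word of "The cat" is lower-cased while "I sleep", "London calling" and "NASA launched" are left alone. The check looks at one token only, so a multi-word name such as "New York" is still lower-cased to "new York" unless "New" happens to be a word on its own.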
Page 7
Capitalization
Was necessary before
Nowadays, grammars could be rewritten to use the CAPIT token instead.
For robustness, maybe there should be a SOFT_CAPIT token too?