AR 28/3/2013

26/3 Morphology from Kotus.

27/3 Senses from Princeton.

27/3 Designed new paradigms. Filtered out problematic/illegal things (PLURNOUN, ILLEGALVERB, POSTPONE, TODO). Just 9035 lemmas missing now.

28/3 Set up an experiment with 3220 complete trees from Penn prepared by Krasimir. First results:
  561 no linearization
  960 lin with unknowns
  around 20 missing syntax constructions, 230 missing words
Tests generated by gf -run ~/GF/lib/src/ParseEngFin.pgf (results in 4-eng-fin-wsj.txt), with
  l -treebank -bind PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul)

29/3 Added most missing syntax constructions. Some new opers in ParadigmsFin, and 230 more words in DictEngFin: out of the 3220 Penn trees, 2721 are now completely translated (but mostly not so well...):
  317 no lin
  182 lin with unknowns
After implementing GerundN and GerundNP, only 40 lin with unknowns. But the implementations are bad:
  - applying them to a run-time V prevents correct vowel harmony
  - composite forms with "minen" should use "mis", e.g. hinnoitteleminendetaljit
Counting funs: gf ../GF/lib/src/ParseEng.pgf (results in funs-wsj.txt), with
  pt -funs PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) ...
From this, with some ghci commands, created freq-wsj.txt, showing
  AdvVP 1174
  AdvNP 1075
  UsePron 749
  PossNP 749
  UseV 675
  in_Prep 671
  and_Conj 659
  UseComp 651
  IIDig 620
and a total of 4512 funs used in the 3220 trees.
Then created a list of missing funs in ParseFin: there are 8820 of them. However, only 80 missing funs appear in the corpus!
  some_Quant 72
  anyPl_Det 44
  part_of_N2 34
  both_Det 32
  most_Det 28
  ComplN2 21
  several_Num 19
  another_Quant 19
  UseN2 16
  neither_Det 11
  CNNumNP 8
  draw_V2 7
  aware_of_A2 7
The next thing is to find out why ComplN2 and UseN2 are missing - they should be there. It turned out that this happens simply because there was no N2 in the lexicon. Strange... Adding just "part of" and "idea of" (as well as "familiar with") changes 35 sentences. Now only 9 with unknown constants, 314 without lin.
Attacked the first ten missing constructs, down to those with 4 occurrences. Now 13 with unknowns, 167 missing. Thus almost 95% complete. Defined some more, down to the 34th with 2 occurrences. Now 32 missing, 18 with unknowns (version 7, 7-eng-fin-wsj.txt). Thus over 98% complete.
Soon time to fix errors in the things covered! Fixed obvious errors in "date" (taateli -> päivämäärä) and "force" (polttaminen -> voima). Effect on 24 examples.

30/3 Version 9: Changed the subcats of 170 of the 230 V2s (the ones with 3 occurrences or more). One hour's work. Changes in 1124 translations.
Also changed the default genitive of symbols from +n to +in, to be uniform with the other cases. Works for words ending in a consonant: Inteln -> Intelin. But what would really be needed is a proper morphological analysis with dynamic lexicon extension.
Fixed NounFin.IndefArt, which erroneously added "yksi" to the substantival form of numeral determiners. This changed 125 linearizations - but there are some mistaken parses of numbers in the treebank, in particular years.
Also fixed the passive VP in the infinitive form, with better results in 95 sentences - but this structure should be different in Finnish.
Fixing passive past tenses improved 250 sentences! Incredibly, they had been missing in the RGL, as had the correct form of the compound tenses: "minut ollaan nähty" -> "minut on nähty" ("I have been seen").
Fixed the form for NPossNom and NPossGen. It had mistakenly been the Nom form, which gave "rakkausnsa" ("his love"). The proper form is the tk-2 prefix of the essive case: "rakkautensa"; the tk-1 genitive won't do ("rakkaudensa"). This changed 81 sentences for the better.
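To make the stem rule concrete, here is a minimal Haskell sketch of the idea (the poss3 function is hypothetical, not the RGL code, and it ignores vowel harmony and other stem types):

  -- Sketch only: the 3rd person possessive attaches to the essive singular
  -- minus its last two characters (the "tk-2 prefix"), not to the nominative
  -- and not to the genitive minus one character.
  poss3 :: String -> String   -- essive sg -> 3rd person possessive (back-vowel words)
  poss3 essive = take (length essive - 2) essive ++ "nsa"

  -- poss3 "rakkautena"          == "rakkautensa"   (correct)
  -- "rakkaus" ++ "nsa"          == "rakkausnsa"    (the old bug: Nom stem)
  -- take 8 "rakkauden" ++ "nsa" == "rakkaudensa"   (tk-1 genitive, also wrong)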
Added NCompound, or form nr 10, to nouns. This may differ from Nom Sg, e.g. käteinenvirtaus -> käteisvirtaus. 107 errors corrected by this.
Spent 30 minutes going through the 230 words with 10+ occurrences. Changed about half of them. Some choices are arbitrary; tried to avoid biasing towards business language (e.g. "officer": "upseeri"/"vastaava"?). This changed 1362 sentences.
Now, after 2 days of work, we have version 10-, where 2700 translations have changed from version 1-. Spent approximately
  - 3 hours manually correcting the lexicon
  - 5 hours analysing the problems and designing analysis tools
  - 8 hours fixing RGL bugs

31/3 Dictionary revision: 250 words with 6--10 occurrences; 107 changed in 30 minutes. Changes in 557 sentences.
RGL fix: more infinitive forms (näyttää nukkuvan, toivoo nukkuvansa); the forms are available through mkVV. Checked 50 VVs in Dict. 145 changes in translations.
RGL fix: weakForm (projekdin -> projektin etc). Also noted that "olemaan"/"oleva" require one more VForm, separate from "ole" ("ottamaan", not "otamaan") and from "ovat" (*"omaan").
Received a corrected corpus from Krasimir, with weekdays and months recognized. This changes 100 translations. Now at version 13-eng-fin-wsj.txt, working with penn/wsj-3220/corr-wsj.full.
Dictionary revision: 368 words with 4--5 occurrences; 150 changed in 30 minutes. Effect on 425 translations. It feels that FiWN - or maybe the way we have used it - is not the optimal source, as the translations we get are often unusual, sometimes even strange words. For instance, pay_N = "liksa", a slang word.
Now at version 14. Work done:
  - 5 hours correcting the lexicon
  - 7 hours analysing
  - 10 hours fixing RGL

1/4 Calculation of returns:
  - 22403 lemma tokens
  - 4333 lemma types
  - 390 types with 10 occurrences or more
  - 61% of tokens covered by these
  - going down from 10 (k = occurrences, n = lemmas with k occurrences, k*n):
    (9,58,522), (8,52,416), (7,87,609), (6,118,708), (5,169,845),
    (4,200,800), (3,388,1164), (2,745,1490), (1,2126,2126)
Thus by covering the lemmas with >3 occurrences we cover 79% of the tokens; >2 gives 84%, and >1 gives 91%. >1 means 51% of the lemmas. That is, we need to revise 2100 words to achieve 90% accuracy. Revision at 1h/600 words (with 50% OK) means 3.5h of work. Maybe 8h of work for all 4333 lemmas.
Analysed the whole log4.txt. Statistics of the types of metas:
  NP 25369
  A 12837
  N 11191
  S 3961
  Quant -> N -> NP 3609
  N -> NP 3193
  Prep -> S -> Adv 2581
  NP -> VP -> S 2184
  AP 2176
  NP -> VPSlash -> NP 1680
  S -> NP -> VP -> S 1635
etc.; 14,718 different types in all. Many of them could be dealt with by padding with nullables and coercions, e.g.
  Quant -> N -> NP  ===>  \quant,n -> DetCN (DetQuant quant NumSg) (UseN n)
Also tried linearization by chunks, defined as maximal fun-headed subtrees - quite similar to smoothing with shorter n-grams, one could say. Long-distance agreement is lost, but the chunks make sense.

2/4 Implemented an elementary chunking translator, located in svn://molto-project.eu/wp5/engfin/ - for the first time, able to "translate everything" from English to Finnish. The quality is of course horrible.
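To make the chunking idea concrete, here is a minimal Haskell sketch (the Tree type and chunkTranslate are hypothetical, not the actual wp5/engfin code), under the assumption that the untranslatable pieces show up as metavariables in the tree:

  -- A chunk is a maximal subtree containing no metas; such subtrees can be
  -- linearized independently and the results concatenated, at the price of
  -- losing long-distance agreement across chunk boundaries.
  data Tree = Fun String [Tree] | Meta

  complete :: Tree -> Bool
  complete (Fun _ ts) = all complete ts
  complete Meta       = False

  -- Collect the maximal complete (fun-headed) subtrees, left to right.
  chunks :: Tree -> [Tree]
  chunks t@(Fun _ ts)
    | complete t = [t]
    | otherwise  = concatMap chunks ts
  chunks Meta    = []

  -- Given any linearization function for complete trees, translate by chunks.
  chunkTranslate :: (Tree -> String) -> Tree -> String
  chunkTranslate lin = unwords . map lin . chunks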
3/4 Worked with the analysis tools, completed most of the first 300 full-Penn words (>3 occurrences) still missing in Dict. Changes in 250 sentences in wsj-3220.
Rough estimate: DictEngFin has 60k words, of which 57.5k come from WN and 2.5k are manual (based on grep mkW DictEngFin.gf).

4/4 Sent the missing Penn words to Inari. Added missing syntactic constructs, including a very nice one:
  VPSlashVS : VS -> VPSlash -> VPSlash
which permits "Mr. Bronner vaikuttaa uskovan että kuulustelut voivat tulla kummaksi tahansa tieksi".
Having fixed the negativity of neither_Det yesterday, only 22 sentences changed. Now at version 16.
Checked around 500 old-generation words, changing around 200; 300 changes in the test corpus.
Annotations in the grammar: MAN for manual, CHECKED for checked. MAN is not applied everywhere, so there is a heuristic to count manual words instead - those that don't come from WN/Kotus:
  grep -v mkW DictEngFin.gf | grep -v "K " | wc
gives 1845. Checked words:
  grep CHECKED DictEngFin.gf | wc
gives 2120, as of version 17-.
Prepared a list ToCheckFin.gf (in the MOLTO svn) for checking the top-3000 Penn words, those occurring >9 times. Went through the ones with >99 occurrences, with 260 changes in the test corpus as a result. Version 18-
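For reference, the kind of counting behind freq-wsj.txt and the >9 cutoff can be reproduced with a few lines of Haskell; the following is only a sketch, and the file name and the whitespace-separated input format are assumptions rather than the actual setup:

  import qualified Data.Map as M
  import Data.List (sortBy)
  import Data.Ord (comparing, Down(..))

  -- Count how often each whitespace-separated token occurs.
  frequencies :: [String] -> M.Map String Int
  frequencies = foldr (\w -> M.insertWith (+) w 1) M.empty

  main :: IO ()
  main = do
    ws <- words <$> readFile "funs-wsj.txt"   -- assumed input
    let freq = sortBy (comparing (Down . snd)) (M.toList (frequencies ws))
        showOne (f, n) = f ++ " " ++ show n
    -- full frequency list, most frequent first (cf. freq-wsj.txt)
    putStr (unlines (map showOne freq))
    -- items occurring more than 9 times, i.e. candidates to check
    putStr (unlines (map showOne [p | p <- freq, snd p > 9]))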