AR 28/3/2013

26/3 Morphology from Kotus.

27/3 Senses from Princeton.

27/3 Designed new paradigms. Filtered out problematic/illegal things (PLURNOUN, ILLEGALVERB, POSTPONE, TODO). Just 9035 lemmas missing now.

28/3 Set up an experiment with 3220 complete trees from Penn prepared by Krasimir. First results:
  561 no linearization
  960 lin with unknowns
  around 20 missing syntax constructions, 230 missing words
Tests generated by gf -run ~/GF/lib/src/ParseEngFin.pgf (results in 4-eng-fin-wsj.txt), with
  l -treebank -bind PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul)

29/3 Added most missing syntax constructions. Some new opers in ParadigmsFin, and 230 more words in DictEngFin: out of the 3220 Penn trees, 2721 are now completely translated (but mostly not so well...):
  317 no lin
  182 lin with unknowns
After implementing GerundN and GerundNP, only 40 lin with unknowns. But the implementations are bad:
  - applying them to a run-time V prevents correct vowel harmony
  - composite forms with "minen" should use "mis", e.g. hinnoitteleminendetaljit
Counting funs: gf ../GF/lib/src/ParseEng.pgf (results in funs-wsj.txt), with
  pt -funs PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) ...
From this, with some ghci commands, created freq-wsj.txt, showing
  AdvVP 1174
  AdvNP 1075
  UsePron 749
  PossNP 749
  UseV 675
  in_Prep 671
  and_Conj 659
  UseComp 651
  IIDig 620
and a total of 4512 funs used in the 3220 trees.
Then created a list of missing funs in ParseFin: there are 8820 of them. However, only 80 missing funs appear in the corpus!
  some_Quant 72
  anyPl_Det 44
  part_of_N2 34
  both_Det 32
  most_Det 28
  ComplN2 21
  several_Num 19
  another_Quant 19
  UseN2 16
  neither_Det 11
  CNNumNP 8
  draw_V2 7
  aware_of_A2 7
The next thing is to find out why ComplN2 and UseN2 are missing - they should be there. It turned out that this happens simply because there was no N2 in the lexicon. Strange... Adding just "part of" and "idea of" (as well as "familiar with") changes 35 sentences. Now only 9 with unknown constants, 314 without lin.
Attacked the first ten missing constructs, down to those with 4 occurrences. Now 13 with unknowns, 167 missing. Thus almost 95% complete. Defined some more, down to the 34th with 2 occurrences. Now 32 missing, 18 with unknowns (version 7, 7-eng-fin-wsj.txt). Thus over 98% complete.
Soon time to fix errors in the things covered! Fixed obvious errors in "date" (taateli -> päivämäärä) and "force" (polttaminen -> voima). Effect on 24 examples.

30/3 Version 9: Changed the subcats of 170 of the 230 V2s (the ones with 3 occurrences or more). One hour's work. Changes in 1124 translations.
Also changed the default genitive of symbols from +n to +in, to be uniform with the other cases. Works for words ending in a consonant: Inteln -> Intelin. But what would really be needed is a proper morphological analysis with dynamic lexicon extension.
Fixed NounFin.IndefArt, which erroneously added "yksi" to the substantival form of numeral determiners. This changed 125 linearizations - but there are some mistaken parses of numbers in the treebank, in particular years.
Also fixed the passive VP in the infinitive form, with better results in 95 sentences - but this structure should be different in Finnish.
Fixing passive past tenses improved 250 sentences! Incredibly, they had been missing in the RGL, as had the correct form of the compound tenses: "minut ollaan nähty" -> "minut on nähty" ("I have been seen").
Fixed the form for NPossNom and NPossGen. It had mistakenly been the Nom form, which gave "rakkausnsa" ("his love"). The proper form is the tk-2 prefix of the essive case: "rakkautensa"; the tk-1 genitive won't do ("rakkaudensa"). This changed 81 sentences for the better.
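To make the stem rule concrete, here is a minimal Haskell sketch of the idea (the poss3 function is hypothetical, not the RGL code, and it ignores vowel harmony and other stem types):

  -- Sketch only: the 3rd person possessive attaches to the essive singular
  -- minus its last two characters (the "tk-2 prefix"), not to the nominative
  -- and not to the genitive minus one character.
  poss3 :: String -> String   -- essive sg -> 3rd person possessive (back-vowel words)
  poss3 essive = take (length essive - 2) essive ++ "nsa"

  -- poss3 "rakkautena"          == "rakkautensa"   (correct)
  -- "rakkaus" ++ "nsa"          == "rakkausnsa"    (the old bug: Nom stem)
  -- take 8 "rakkauden" ++ "nsa" == "rakkaudensa"   (tk-1 genitive, also wrong)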
Added NCompound, or form nr 10, to nouns. This may differ from Nom Sg, e.g. käteinenvirtaus -> käteisvirtaus. 107 errors corrected by this.
Spent 30 minutes going through the 230 words with 10+ occurrences. Changed about half of them. Some choices are arbitrary; tried to avoid biasing towards business language (e.g. "officer": "upseeri"/"vastaava"?). This changed 1362 sentences.
Now, after 2 days of work, we have version 10-, where 2700 translations have changed from version 1-. Spent approximately
  - 3 hours manually correcting the lexicon
  - 5 hours analysing the problems and designing analysis tools
  - 8 hours fixing RGL bugs

31/3 Dictionary revision: 250 words with 6--10 occurrences; 107 changed in 30 minutes. Changes in 557 sentences.
RGL fix: more infinitive forms (näyttää nukkuvan, toivoo nukkuvansa); the forms are available through mkVV. Checked 50 VVs in Dict. 145 changes in translations.
RGL fix: weakForm (projekdin -> projektin etc). Also noted that "olemaan"/"oleva" require one more VForm, separate from "ole" ("ottamaan", not "otamaan") and from "ovat" (*"omaan").
Received a corrected corpus from Krasimir, with weekdays and months recognized. This changes 100 translations. Now at version 13-eng-fin-wsj.txt, working with penn/wsj-3220/corr-wsj.full.
Dictionary revision: 368 words with 4--5 occurrences; 150 changed in 30 minutes. Effect on 425 translations. It feels that FiWN - or maybe the way we have used it - is not the optimal source, as the translations we get are often unusual, sometimes even strange words. For instance, pay_N = "liksa", a slang word.
Now at version 14. Work done:
  - 5 hours correcting the lexicon
  - 7 hours analysing
  - 10 hours fixing RGL

1/4 Calculation of returns:
  - 22403 lemma tokens
  - 4333 lemma types
  - 390 types with 10 occurrences or more
  - 61% of tokens covered by these
  - going down from 10 (k = occurrences, n = lemmas with k occurrences, k*n):
    (9,58,522), (8,52,416), (7,87,609), (6,118,708), (5,169,845),
    (4,200,800), (3,388,1164), (2,745,1490), (1,2126,2126)
Thus by covering the lemmas with >3 occurrences we cover 79% of the tokens; >2 gives 84%, and >1 gives 91%. >1 means 51% of the lemmas. That is, we need to revise 2100 words to achieve 90% accuracy. Revision at 1h/600 words (with 50% OK) means 3.5h of work. Maybe 8h of work for all 4333 lemmas.
Analysed the whole log4.txt. Statistics of the types of metas:
  NP 25369
  A 12837
  N 11191
  S 3961
  Quant -> N -> NP 3609
  N -> NP 3193
  Prep -> S -> Adv 2581
  NP -> VP -> S 2184
  AP 2176
  NP -> VPSlash -> NP 1680
  S -> NP -> VP -> S 1635
etc.; 14,718 different types in all. Many of them could be dealt with by padding with nullables and coercions, e.g.
  Quant -> N -> NP  ===>  \quant,n -> DetCN (DetQuant quant NumSg) (UseN n)
Also tried linearization by chunks, defined as maximal fun-headed subtrees - quite similar to smoothing with shorter n-grams, one could say. Long-distance agreement is lost, but the chunks make sense.

2/4 Implemented an elementary chunking translator, located in svn://molto-project.eu/wp5/engfin/ - for the first time, able to "translate everything" from English to Finnish. The quality is of course horrible.
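To make the chunking idea concrete, here is a minimal Haskell sketch (the Tree type and chunkTranslate are hypothetical, not the actual wp5/engfin code), under the assumption that the untranslatable pieces show up as metavariables in the tree:

  -- A chunk is a maximal subtree containing no metas; such subtrees can be
  -- linearized independently and the results concatenated, at the price of
  -- losing long-distance agreement across chunk boundaries.
  data Tree = Fun String [Tree] | Meta

  complete :: Tree -> Bool
  complete (Fun _ ts) = all complete ts
  complete Meta       = False

  -- Collect the maximal complete (fun-headed) subtrees, left to right.
  chunks :: Tree -> [Tree]
  chunks t@(Fun _ ts)
    | complete t = [t]
    | otherwise  = concatMap chunks ts
  chunks Meta    = []

  -- Given any linearization function for complete trees, translate by chunks.
  chunkTranslate :: (Tree -> String) -> Tree -> String
  chunkTranslate lin = unwords . map lin . chunks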
3/4 Worked with the analysis tools, completed most of the first 300 full-Penn words (>3 occurrences) still missing in Dict. Changes in 250 sentences in wsj-3220.
Rough estimate: DictEngFin has 60k words, of which 57.5k come from WN and 2.5k are manual (based on grep mkW DictEngFin.gf).

4/4 Sent the missing Penn words to Inari. Added missing syntactic constructs, including a very nice one:
  VPSlashVS : VS -> VPSlash -> VPSlash
which permits "Mr. Bronner vaikuttaa uskovan että kuulustelut voivat tulla kummaksi tahansa tieksi".
Having fixed the negativity of neither_Det yesterday, only 22 sentences changed. Now at version 16.
Checked around 500 old-generation words, changing around 200; 300 changes in the test corpus.
Annotations in the grammar: MAN for manual, CHECKED for checked. MAN is not applied everywhere, so there is a heuristic to count manual words instead - those that don't come from WN/Kotus:
  grep -v mkW DictEngFin.gf | grep -v "K " | wc
gives 1845. Checked words:
  grep CHECKED DictEngFin.gf | wc
gives 2120, as of version 17-.
Prepared a list ToCheckFin.gf (in the MOLTO svn) for checking the top-3000 Penn words, those occurring >9 times. Went through the ones with >99 occurrences, with 260 changes in the test corpus as a result. Version 18-
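For reference, the kind of counting behind freq-wsj.txt and the >9 cutoff can be reproduced with a few lines of Haskell; the following is only a sketch, and the file name and the whitespace-separated input format are assumptions rather than the actual setup:

  import qualified Data.Map as M
  import Data.List (sortBy)
  import Data.Ord (comparing, Down(..))

  -- Count how often each whitespace-separated token occurs.
  frequencies :: [String] -> M.Map String Int
  frequencies = foldr (\w -> M.insertWith (+) w 1) M.empty

  main :: IO ()
  main = do
    ws <- words <$> readFile "funs-wsj.txt"   -- assumed input
    let freq = sortBy (comparing (Down . snd)) (M.toList (frequencies ws))
        showOne (f, n) = f ++ " " ++ show n
    -- full frequency list, most frequent first (cf. freq-wsj.txt)
    putStr (unlines (map showOne freq))
    -- items occurring more than 9 times, i.e. candidates to check
    putStr (unlines (map showOne [p | p <- freq, snd p > 9]))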