[02:38:52] *** Joins: spectre (~fran@115.114.202.84.customer.cdi.no)
[04:23:18] *** Joins: vinit-ivar (~vinit@122.179.146.7)
[04:25:35] *** Quits: vin-ivar (~vinit@122.170.43.81) (Ping timeout: 246 seconds)
[05:28:29] *** Quits: vinit-ivar (~vinit@122.179.146.7) (Quit: WeeChat 1.0.1)
[05:29:52] *** Joins: vin-ivar (uid74719@gateway/web/irccloud.com/x-pxgpmgjgrsgrzjtv)
[07:14:25] vin-ivar: GF doesn't have galician
[07:14:31] or are you making a mini resource?
[07:15:01] hi
[07:15:41] I know, I'm trying/was trying to demonstrate how a quick lexicon could be bootstrapped using apertium
[07:15:54] ok yeah :)
[07:16:36] problem is paradigms are pretty unique for each language.. There isn't a standard set, is there?
[07:16:43] for simple stuff, you can just import Vs, Ns and As, and leave V2, VV, VS etc. out
[07:16:53] aaah right
[07:16:57] sorry I just woke up
[07:17:20] yeah so with lexicon, it assumes that there is already a morphology for the language in GF
[07:17:27] mkV, mkN etc.
[07:17:30] Isn't it like 6 in the morning in Sweden?
[07:17:36] 7 in the morning :-D
[07:17:43] okay
[07:17:50] Haha, marginally better :p
[07:18:07] yesterday I just got home and read a book until I fell asleep, it must've been before 21:30
[07:18:59] there aren't too many languages with morphologies yet no lexicons (lexica?), are there?
[07:19:11] oh, phew. I can't imagine waking up at 7 :D
[07:19:14] hmm yeah, I think turkish might be
[07:20:25] oh hey btw, spectre, aarne said he has a student who wants to work on basque
[07:20:40] and I told him yay, just use our stuff
[07:21:44] vin-ivar: ok nevermind, turkish has a large monolingual lexicon
[07:23:59] oh, damn
[07:24:00] Hm
[07:24:13] I'm wondering what the best way to use apertium resources would be
[07:24:57] We have genders and word forms and stuff
[07:25:18] So mkN can be done easy enough and generalised
[07:25:27] lexicon sharing is definitely good, and even if many languages have monolingual dictionaries of >10k words, only 14 have bilingual
[07:26:28] hmm there are some fragments for telugu
[07:26:36] just checking if it's enough
[07:26:42] (do you have telugu in apertium?)
[07:27:29] ok not really
[07:28:04] how about danish?
[07:30:20] I mean, GF has danish complete syntax and morphology, but no big mono- nor bilingual dictionary
[07:33:31] yep, apertium has dan-nor and dan-swe iirc
[07:33:52] I'm in a seminar right now, may be a bit slow to reply
[07:38:03] Danish seems to have all the words from the abstract lexicon, anyway
[07:40:20] ooh, no large monolingual dictionary though yeah
[07:45:26] looking at German, for nouns anyway, I think apertium's analysis could give us genders/plurals and stuff.. I don't speak danish but I'm assuming it isn't radically different from German, so perhaps all nouns in the danish monodix could be scraped/initialised with mkN, for starters..
[07:45:50] yeah, that's a start
[07:46:45] also you could see what's the worst-case constructor for danish, and generate all needed forms from the apertium lexicon
[07:48:13] yeah
[07:54:34] en and het are Swedish, right?
[07:56:01] en and ett are swedish
[07:56:03] de and het are dutch
[07:56:58] ah yeah, whoops
[08:00:53] okay, worst case takes all four noun forms, this sounds pretty doable
[08:05:49] *** Quits: Flammie (~flammie@80.111.32.7) (Ping timeout: 250 seconds)
[08:18:28] could use lt-expand and grep out the good stuff
[09:26:42] inariksit, cool!
[09:27:00] vin-ivar, the danish lexicon in apertium is pretty reasonably sized (see apertium-dan)
[09:27:15] inariksit, will you have time to work on the paper before the deadline or should we just submit as-is ?
[09:30:18] -->u
[09:31:59] aye, I'll work on it when I get home
[09:32:42] thankfully danish declines nouns, not just articles
[09:35:48] *** Quits: spectre (~fran@115.114.202.84.customer.cdi.no) (Ping timeout: 244 seconds)
[10:00:18] *** Joins: spectre (~fran@dhcp890-ans.wifi.uit.no)
[10:03:25] spectre: hmm, do you mean like run more experiments or just polish the writing/add references/..?
[10:03:40] fix the reviewers comments
[10:03:45] we don't have time for new experiments :)
[10:03:55] yeah, that I thought too
[10:03:57] and no logs
[10:04:06] well, that's my fault, I was just leaving it to eero
[10:04:07] aye
[10:04:13] (the no logs part)
[10:04:20] but yeah, I can do something on weekend
[10:04:23] cool
[10:04:27] let's see the comments
[10:04:30] i will probably be around too... working on CG paper :s
[10:04:34] my CG paper is going ok now
[10:04:36] aww
[10:04:37] yeah
[10:05:42] "A point of terminology: The method termed "unsupervised" is potentially misleading in this context, as this is clearly not an unsupervised method in the machine learning sense of the term. The term "rule-based" is perhaps more appropriate."
[10:05:46] :D
[10:05:55] no buzzword bingo
[10:06:04] we can put non-supervised
[10:06:05] :D
[10:06:08] hahah
[10:06:13] or just rule-based ^_^
[10:06:18] so old-fashioned
[10:06:22] but it's not entirely rule-based
[10:06:24] hmm
[10:06:50] or just describe what it is, a method requiring a small ruleset and target language model
[10:07:03] yeah
[10:07:21] the point is that it doesn't require annotated or parallel corpora
[10:07:35] so is applicable to any language, not just those with massive resources (like finnish) ^^
[10:07:46] yeah, that's true
[10:07:54] aww "Additionally, given the results, the method is not likely to be of practical interest to the community as proposed."
[10:08:33] we need to stress the "this doesn't require a parallel corpus" --- and "this improves MT"
[10:08:44] so yeah, maybe not use either of the terms "unsupervised" or "rule based", just describe those strong points
[10:08:45] (i mean, i kind of agree you could do something much better in a couple of weeks)
[10:09:00] :-P
[10:09:04] but on the other hand, it's not without merit (see previous points)
[10:09:05] :D
[10:09:09] yeah
[10:09:58] I can also look at laura's comments
[10:10:05] or did you do something with them already?
[10:11:05] i haven't done anything
[10:11:08] ok
[10:11:08] apart from uncomment our names
[10:11:10] and joonas'
[10:11:15] ok
[10:11:27] I'll spend today working on my cg paper, hopefully koen has some final comments
[10:11:49] same here
[10:11:56] we can look at the other one during the w/end
[10:11:59] yeah
[10:12:02] i have eamt reviews to do too :/
[10:12:05] aww
[10:12:13] yeah you are the busier one here now
[10:12:13] but my sister is coming on monday ^^
[10:12:15] cool!
[10:12:49] also that basque-interested student should come on irc
[10:13:18] yeah, so far I just got an email address, aarne forwarded my mail to them
[10:13:19] aitor_osasuna@hotmail.com
[10:13:32] (my email was in finnish and basque :-D)
[10:14:20] :)
[10:14:27] osasuna
[10:14:30] heh
[10:14:51] (I mean finnish text and basque examples; I was thinking if I should just write in english to aarne in the first place in case he wants to forward it, but then I thought that damn, this is what lang tech is for, that I can write finnish mails to finnish people)
[10:15:45] aye ^__^
[10:16:02] they can always use GF
[10:16:03] to translate
[10:16:09] yeah!
[10:16:10] fin->spa or something ^^
[10:17:28] I suppose *->eng is still better quality than *->spa :-P
[10:17:43] although "yarmulke an experience it slopes anglicized xmas"
[10:17:53] is this the EAMT paper?
[10:18:09] inariksit, :D
[10:18:12] this paper is colloquial finnish -> standard finnish paper for nodalida
[10:18:25] vin-ivar, eamt notifications aren't out yet
[10:18:53] oh
[10:18:58] ok now I should get back to work
[10:19:16] I should start following research more ._.
[10:19:40] you're on several irc channels with academics ;-D
[10:20:06] that's a start, yeah :)
[10:21:27] spectre: what's the point of the subreadings here?
[10:21:28] ELECT:este_2 Det IF (0 ("este") + $$MascSg) (-1 Verb_Prep) (1 A + $$MascSg) (2 N + $$MascSg) ;
[10:22:01] there is no subreading
[10:22:03] MascSg is just a collection of masc-like and sg-like values
[10:22:04] ah
[10:22:09] I thought $$ meant subreading
[10:22:13] so what does it mean? :-D
[10:22:16] that's for unification
[10:22:19] hmm okay
[10:22:21] e.g.
[10:22:29] all of the instances $$MascSg must agree
[10:22:33] it's a shorthand
[10:22:36] instead of writing:
[10:22:37] ah right now I understand
[10:22:48] :)
[10:23:07] "my CG paper is going well" ... "I don't know the difference between subreading and unification"
[10:23:16] (in my implementation I've just ignored the unification)
[10:23:22] haha :D
[10:23:32] neither of those things are in cg2
[10:23:36] so you're probably safe :D
[10:23:40] yeah
[10:23:44] subreadings is an awful apertium hack
[10:23:50] :)
[10:24:18] is that the thing where you get like ^del/de + el$
[10:24:24] yeah
[10:24:25] I've also ignored those
[10:24:37] it's to deal with ambiguous tokenisation
[10:24:43] and contractions
[10:24:45] yeah
[10:24:48] ok
[10:25:46] thanks!
[10:25:53] *now* hopefully back to work :-D
[10:26:01] o/
[10:32:55] *** Joins: worldsayshi (~per.frede@h-5-150-197-146.na.cust.bahnhof.se)
[11:28:29] they have "of went to" as a gloss >_>
[11:38:02] is it the correct gloss for whatever language they're glossing?
[11:38:15] not really
[11:38:17] or is it meant to be as an idiomatic english translation?
[11:38:19] (kazakh)
[11:38:19] yeah
[11:38:29] it's supposed to be idiomatic
[11:38:34] haha ok
[11:38:38] i'm going to suggest they use ILG
[11:38:44] ILG?
[11:38:48] interlinear glosses
[11:38:54] now i'm looking at photos of contrabass saxophone
[11:38:55] and giggling
[11:39:10] paper reviewing sounds like fun
[11:39:14] hahaha
[11:39:22] I'm doing an experiment where I flip the rules of the apertium spanish CG
[11:39:24] yeah, paper reviewing is usually combined with that
[11:39:28] http://i.ytimg.com/vi/DtWqOyFcMwU/hqdefault.jpg <-- inariksit
[11:39:42] # SELECT:pr_cnjadv_3 CnjAdv + PrepDe (1C Inf OR Enc) ; --> SELECT:pr_cnjadv_3 Inf OR Enc IF (-1 CnjAdv + PrepDe) ;
[11:39:50] hahah
[11:39:59] i wonder what sound that makes
[11:40:08] whommmmmmm
[11:40:17] you could use it in orchestral dubstep
[11:40:22] so many CG rules are written like SELECT Det IF (0 ("este")) (1 N), instead of SELECT (det "este") IF (1 N)
[11:40:36] http://www.chooi.co.uk/resources/saxophones/Contrabass.jpg <-- this is also cool, "hi i'm playing a massive saxophone" whommmm
[11:40:49] and it's obvious that they mean that the SELECT Foo IF (0 Bar) is supposed to be Foo Bar in the tags
[11:40:50] inariksit, yeah
[11:41:46] is that some unofficial style guide, or just people don't know you can write SELECT (det "este")?
[11:42:20] well ok I've just looked at two CGs in the apertium repo
[11:42:20] neither i think
[11:42:25] one of which was the english one
[11:42:26] i don't like inline sets
[11:42:28] oh dear
[11:42:31] the english one ._.
[11:42:38] where SELECT A if (1 "Party")
[11:42:52] haHAHAHA
[11:42:53] yeah
[11:42:55] :(((((
[11:43:06] let's have a party!
[11:43:14] well, you can have LIST Este = "Este" "este" and then SELECT Det + Este IF (1 N)
[11:43:22] or LIST Este = "este"i ^^
[11:43:24] yeah
[11:43:29] ah that's what the i means :-D
[11:43:34] i = case insensitive
[11:43:38] yeah
[11:43:39] r = tag contains regex
[11:44:09] but you don't need it so much if you use the -w option
[11:44:13] to get dictionary case on lemmas
[11:44:50] the spanish grammar is also full of these REMOVE:r_como_1 Vblex (0 ("como") OR ("Como")) (*-1 Vblex BARRIER Cnj_Rel)
[11:44:59] :/
[11:45:06] that was probably hèctor
[11:45:55] i think it would be fair to say that most people writing CG do not have a firm grasp of the syntax
[11:46:12] the spanish grammar also doesn't use IF
[11:46:13] :
[11:46:14] :/
[11:47:32] maybe the javascript-based syntax will save the situation 8D
[11:47:42] ahahahaha
[11:47:50] yeah, definitely
[11:48:49] we had an idea about XML
[11:48:55] rule formalism :D
[11:49:05] for CG
[11:49:49] hahah
[11:49:58] that will make linguists love it
[11:50:01] yeah
[11:50:13] it will make people who like angle brackets very happy
[11:51:56] does that set intersect with people who write CG?
[11:51:58] SELECT:que_2 CnjSub (0 ("que")) (-1C Vblex LINK NOT 0 PP) ;
[11:52:02] does this make any sense?
[11:52:08] does that set intersect with people who write CG?
[11:52:16] the intersection is non-null
[11:52:27] but i don't know if the cardinality is >1
[11:52:29] :P
[11:52:31] -1 is a word that is unambiguously Vblex and at the same position there is nothing tagged as PP
[11:52:40] yeah
[11:52:48] it means "a verb that is not a past participle"
[11:52:51] e.g. a finite verb
[11:53:01] ahh I thought PP was a preposition phrase
[11:53:16] that would be SP
[11:53:17] and was like "that's stupid, if it's unambiguously a verb, it can't be a phrase"
[11:53:19] right
[11:53:21] sintagma preposicional :D
[11:53:22] ok, nevermind :-D
[11:53:23] haha
[11:54:02] so I'm flipping it as
[11:54:03] SELECT:que_2 Vblex - PP IF (1 CnjSub + ("que")) ;
[11:54:25] looks plausible :)
[11:55:43] so far I don't have a good way to test if it actually retains right results; I'm just "ok I parse don quijote, and with the original grammar, my implementation agrees with vislcg 96% of the words, and with flipped grammar 92%"
[11:55:50] did you say you had gold standard only for russian?
[11:56:09] russian and kazakh
[11:56:22] there was some other work on tagging too
[11:56:23] (well, this flipping is hardly the most important thing, but in general, all my testing is just comparing same text + same grammar + vislcg3)
[11:56:29] oh cool
[11:56:37] and how big was the kazakh?
[11:56:44] it should be around 20k
[11:56:46] words
[11:56:50] cool
[11:56:51] aida is working on it
[11:56:54] and the cg?
[11:57:00] CG is ~190 rules
[11:57:05] but it is as good as the russian one
[11:57:08] ok, that's nice size!
[11:57:11] ilnar wrote it
[11:58:19] https://svn.code.sf.net/p/apertium/svn/branches/apertium-swpost
[11:58:33] if you look in:
[11:58:33] $ find . | grep tagger-data
[11:58:38] then you'll find some more tagged corpora
[11:59:06] ah right it's also in cyrillic >__> I should still find how to parse stuff with unicode
[11:59:25] they should be in the same tagset as the morphological analyser / CG for spanish
[11:59:27] and catalan and english
[11:59:28] cool!
[11:59:30] that sounds nice
[12:00:25] ok now lunch
[12:00:36] o/
[12:09:31] https://www.youtube.com/watch?v=DtWqOyFcMwU
[12:09:33] uahahuahuahua
[12:09:41] Rob is like "that's just the sound i've been looking for"
[12:16:05] *** Quits: spectre (~fran@dhcp890-ans.wifi.uit.no) (Ping timeout: 252 seconds)
[12:24:16] *** Joins: spectre (~fran@dhcp890-ans.wifi.uit.no)
[12:35:22] *** Quits: spectre (~fran@dhcp890-ans.wifi.uit.no) (Ping timeout: 255 seconds)
[13:01:33] wow, using the tagged spanish data, SAT-CG agrees 78% of the words and VISLCG3 77%
[13:01:48] (totally GF related :-D and spectre isn't even here)
[13:34:59] *** Joins: spectre (~fran@c9ECC00C3.dhcp.as2116.net)
[14:27:44] *** Quits: vin-ivar (uid74719@gateway/web/irccloud.com/x-pxgpmgjgrsgrzjtv) (Quit: Connection closed for inactivity)
[14:32:29] *** Joins: Flammie (~flammie@ilazki.thinkgeek.co.uk)
[15:36:28] *** Joins: vin-ivar (~vinit@122.179.146.7)
[17:54:17] *** Quits: worldsayshi (~per.frede@h-5-150-197-146.na.cust.bahnhof.se) (Quit: worldsayshi)
[19:15:24] *** Joins: jstar (~jstar@unaffiliated/jstar)
[21:19:35] inariksit, spectre:
[21:20:37] i scraped the danish monodix and got two huge lists of utrum/neutrum gendered nouns
[21:23:03] now there's no eng-dan pair
[21:23:25] so I can't directly get what paradigm is what word
[21:23:53] but I was thinking I could use an intermediary language, maybe Swedish
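A rough sketch of the "lt-expand and grep out the good stuff" idea from earlier in the log, written as a note rather than as anything the participants ran. It assumes lt-expand's usual `surface:lemma<tags>` line format and assumes the Danish monodix marks nouns with `<n>`, gender with `<ut>`/`<nt>`, number with `<sg>`/`<pl>` and definiteness with `<ind>`/`<def>`; those tag names are not checked against apertium-dan. The sample lines, the function names `noun_forms` and `gf_entry`, and the `lemma_N` naming are all hypothetical, and the four-argument `mkN` order (sg indef, sg def, pl indef, pl def) is an assumption about the worst-case constructor mentioned above.

```python
import re
from collections import defaultdict

# Hypothetical sample of lt-expand output; tag names are assumed, not verified.
SAMPLE = """\
bil:bil<n><ut><sg><ind>
bilen:bil<n><ut><sg><def>
biler:bil<n><ut><pl><ind>
bilerne:bil<n><ut><pl><def>
hus:hus<n><nt><sg><ind>
huset:hus<n><nt><sg><def>
huse:hus<n><nt><pl><ind>
husene:hus<n><nt><pl><def>
løbe:løbe<vblex><inf>
"""

# surface:lemma<tag><tag>... with an optional :>: or :<: direction marker.
LINE = re.compile(r"^(?P<surface>[^:]+):(?:[<>]:)?(?P<lemma>[^<]+)(?P<tags>(?:<[^>]+>)+)$")
GENDERS = {"ut": "utrum", "nt": "neutrum"}
SLOTS = [("sg", "ind"), ("sg", "def"), ("pl", "ind"), ("pl", "def")]


def noun_forms(lines):
    """Group noun surface forms by (lemma, gender), keyed by (number, definiteness)."""
    table = defaultdict(dict)
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        tags = re.findall(r"<([^>]+)>", m["tags"])
        if tags[:1] != ["n"] or "gen" in tags:  # nouns only, skip genitives
            continue
        gender = next((t for t in tags if t in GENDERS), None)
        number = next((t for t in tags if t in ("sg", "pl")), None)
        defin = next((t for t in tags if t in ("ind", "def")), None)
        if None in (gender, number, defin):
            continue
        table[(m["lemma"], gender)][(number, defin)] = m["surface"]
    return table


def gf_entry(lemma, gender, forms):
    """Render one GF-style lexicon line, or None if any of the four forms is missing."""
    if any(slot not in forms for slot in SLOTS):
        return None
    args = " ".join(f'"{forms[slot]}"' for slot in SLOTS)
    return f"lin {lemma}_N = mkN {args} ; -- {GENDERS[gender]}"


if __name__ == "__main__":
    for (lemma, gender), forms in noun_forms(SAMPLE.splitlines()).items():
        entry = gf_entry(lemma, gender, forms)
        if entry:
            print(entry)
```

This only produces the concrete `lin` side; matching each Danish lemma to an abstract lexicon function is the open problem raised in the last messages, and the sketch does not attempt it. Lemmas with non-ASCII characters or spaces would also need sanitising before being used as GF identifiers.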