[06:35:13] *** Joins: vin-ivar (~vinit@122.169.12.162)
[07:14:32] *** Quits: vin-ivar (~vinit@122.169.12.162) (Read error: Connection reset by peer)
[07:16:16] *** Joins: vin-ivar (~vinit@122.170.43.81)
[09:50:03] *** Joins: DREAM (d5506a18@gateway/web/freenode/ip.213.80.106.24)
[09:50:38] *** Quits: DREAM (d5506a18@gateway/web/freenode/ip.213.80.106.24) (Client Quit)
[18:31:51] *** Quits: crazydiamond (~crazydiam@178.141.71.146) (Remote host closed the connection)
[19:34:27] spectre, got a minute?
[20:31:54] vin-ivar, hey
[20:31:56] sure
[20:41:46] spectre, got a minute?
[20:41:48] whoops
[20:41:50] okay
[20:41:52] so
[20:41:57] I was going through the paper
[20:42:01] wait, should I PM?
[20:42:23] whichever
[20:43:08] right
[20:43:30] how exactly are the lexical-selection paths related to the source language samples?
[20:43:32] I mean
[20:44:06] T seems to hold the translations for all elements in S
[20:44:19] and each path corresponds to a set T
[20:44:59] is it like every ambiguous word belongs in S, each mapped to a single word in T, via G?
[20:45:08] G is a set of samples
[20:45:37] ok wait
[20:45:39] we have two G :D
[20:45:48] isn't that G
[20:45:52] so
[20:45:52] the fancy one, I mean
[20:45:54] yeah
[20:46:29] so S is like
[20:47:18] el estiu ser un estació humit
[20:47:19]
[20:47:33] right
[20:47:45] G could contain four paths
[20:47:48] in which case:
[20:47:48]
[20:48:00] the summer be a season wet
[20:48:01] aah, station/season and long/lengthy?
[20:48:02] the summer be a season rainy
[20:48:08] wet*
[20:48:14] the summer be a station wet
[20:48:16] the summer be a station rainy
[20:48:46] oh fuck, I was reading S as a set of unrelated words
[20:48:50] this makes a lot more sense
[20:48:52] thanks :D
[20:49:05] :)
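A minimal Python sketch of the path enumeration described above, assuming a hypothetical per-word table of translation alternatives (illustrative only, not Apertium's actual bilingual dictionary format):

    from itertools import product

    # Hypothetical translation alternatives for the words of
    # "el estiu ser un estació humit"; the ambiguous words have
    # more than one alternative.
    alternatives = [
        ["the"], ["summer"], ["be"], ["a"],
        ["season", "station"],   # estació
        ["wet", "rainy"],        # humit
    ]

    # Each lexical-selection path picks one target word per source word,
    # so the set of paths is the Cartesian product of the alternatives.
    for path in product(*alternatives):
        print(" ".join(path))
    # the summer be a season wet
    # the summer be a season rainy
    # the summer be a station wet
    # the summer be a station rainy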
[21:06:31] okay, so I think this makes sense now
[21:06:46] essentially, what's happening is
[21:07:34] *** Joins: crazydiamond (~crazydiam@178.141.72.249)
[21:07:55] the probability of a particular selection path is calculated based on the probability of the translation through that path occurring in the target language corpus?
[21:09:44] yeah
[21:10:22] nice
[21:10:31] and the target-language model is a maximum entropy classifier?
[21:11:02] no
[21:11:07] the target language model provides the counts
[21:11:10] for input into the ME classifier
[21:11:11] s
[21:12:53] ah yeah
[21:12:59] and
[21:13:01] hang on
[21:13:48] so for your above example
[21:14:18] it'd be the counts of the occurrences of "station wet" or "station rainy" in the corpus
[21:14:25] but not the whole sentence, a smaller context than that
[21:14:28] yeah?
[21:14:46] no
[21:14:55] damn it
[21:14:56] it would be the normalised language model scores
[21:14:59] of the whole sentence
[21:15:04] so e.g.
[21:15:09] el estiu ser un estació humit is the input
[21:15:29]
[21:15:31] your paths are:
[21:15:31]
[21:15:36] the summer be a season wet
[21:15:38] the summer be a season rainy
[21:15:40] the summer be a station wet
[21:15:42] the summer be a station rainy
[21:15:46]
[21:15:48] your translations are:
[21:15:50]
[21:15:52] the summer is a wet season
[21:15:55] the summer is a rainy season
[21:15:57] the summer is a wet station
[21:16:00] the summer is a rainy station
[21:16:00]
[21:16:05] your LM scores are:
[21:16:05]
[21:16:22] p(the summer is a wet season) = 0.25
[21:16:35] p(the summer is a rainy season) = 0.45
[21:17:02] p(the summer is a wet station) = 0.19
[21:17:23] p(the summer is a rainy station) = 0.11
[21:17:24]
[21:17:31] (normalised LM scores)
[21:17:36] so your path:
[21:17:36]
[21:17:44] the summer be a season wet ==> gets 0.25
[21:17:50] the summer be a season rainy ==> gets 0.45
[21:17:50] etc.
[21:17:50]
[21:18:26] so when you are counting up the number of times you see a translation given a context
[21:18:30] you use these counts, and not 1
[21:19:04] okay, but what are these scores built on? i thought they were based on the occurrences in the corpus
[21:19:22] language model
[21:19:38] http://en.wikipedia.org/wiki/Language_model
[21:21:05] this is embarrassing, i thought language models worked only for bigrams or trigrams
[21:21:56] well, you can throw a sentence at an LM
[21:22:05] and it calculates the combined probability of the bigrams/trigrams/etc.
[21:22:31] the model itself is bigram/trigram/pentagram*
[21:22:36] * lol pentagram \m/
[21:23:12] yeah, I ought to have known that, I was continually thinking from a small n point of view
[21:23:14] hahaha
[21:24:26] ;D
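A rough sketch of how the normalised LM scores above could be used as fractional counts when gathering statistics for the ME classifier; the context feature and variable names are illustrative, not the paper's actual feature set:

    from collections import defaultdict

    # Normalised language-model scores of the four full-sentence
    # translations, as in the example above.
    path_scores = {
        "the summer be a season wet": 0.25,
        "the summer be a season rainy": 0.45,
        "the summer be a station wet": 0.19,
        "the summer be a station rainy": 0.11,
    }

    # When counting how often each translation of the ambiguous word
    # "estació" appears in a given context, each path contributes its
    # normalised LM score as a fractional count rather than 1.
    counts = defaultdict(float)
    context = ("estiu", "humit")  # illustrative context, not the real features
    for path, score in path_scores.items():
        translation = path.split()[4]  # word chosen for "estació" in this path
        counts[("estació", context, translation)] += score

    for key, count in counts.items():
        print(key, round(count, 2))
    # ('estació', ('estiu', 'humit'), 'season') 0.7
    # ('estació', ('estiu', 'humit'), 'station') 0.3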
[21:24:41] *** Quits: vin-ivar (~vinit@122.170.43.81) (Read error: Connection reset by peer)
[21:27:12] *** Joins: vin-ivar (~vinit@122.170.43.81)
[21:27:30] laptop died, fml
[21:27:48] anyway, thanks, spectre
[21:27:57] this is pretty cool
[21:28:02] :D
[21:28:19] :)
[21:28:30] that was my phd thesis work
[21:28:53] damn, that sounds awfully interesting
[21:29:14] aye :)
[21:29:19] i'd like to do more work on lexical selection
[21:31:23] isn't that more on the statistical side, though? I don't know why I assumed you weren't a stats guy, haha
[21:31:32] i like unsupervised
[21:31:39] but also rules :D
[21:31:45] vin-ivar, that's why it's rules + unsupervised
[21:32:03] basically i like stuff that can be applied to any language
[21:32:18] so anything supervised is out
[21:32:24] I'm getting sort of bored of supervised lately myself
[21:32:25] because most languages don't have massive amounts of parallel text
[21:32:31] or dependency treebanks
[21:32:34] because of the lack of corpora?
[21:32:36] or hand-tagged corpora
[21:32:36] yeah
[21:32:43] and making those things is superboring
[21:32:45] compared to writing rules
[21:33:30] i wrote a rule today
[21:33:37] the GF approach is pretty interesting, modeling grammatical constructions
[21:33:59] I like how it resembles grammars :D
[21:34:02] it says that н /n/ assimilates to ҥ /ŋ/ before ҥ /ŋ/
[21:34:06] yeah, gf is pretty cool
[21:34:20] this isn't Russian
[21:34:24] no it's yakut
[21:34:34] Yakut is turkic?
[21:34:37] yeah
[21:34:52] nice
[21:34:55] interlingua = cool, trees = cool, translation = cool
[21:35:28] the best part about GF is Haskell :p
[21:35:30] but then
[21:35:31] haha
[21:35:44] some people like haskell
[21:35:48] I suppose that also makes it harder for the average contributor to pick up on
[21:35:50] other people write python with curly braces
[21:36:02] :D
[21:36:07] haha
[21:36:22] i'm not good at haskell
[21:36:28] or indeed anything that isn't thinly veiled C
[21:36:49] your bachelor's is comp sci, right?
[21:36:52] yeah
[21:37:22] you should use a good Haskell guide, I was shit and couldn't "get" it for ages
[21:37:36] CIS194 is the shit
[21:38:49] i'd probably start with GF
[21:38:56] because it's directly relevant to stuff i find cool
[21:39:01] and then use that to learn haskell or something
[21:39:21] yeah, that sounds cool
[21:39:28] thinking imperative is fun
[21:39:40] though I guess you could do that in Python too
[21:39:42] i like finite-state stuffs
[21:40:09] yeah, I loved that subject
[21:41:17] I think there's a standard FA implementation in Haskell, maybe the state monad? inariksit would know
[21:42:39] btw, Aarne also mentioned using what I'm working on to put some actual language data to use
[21:42:50] e.g. ?
[21:43:36] I'm thinking of how I could build a very generalised language model using apertium data, but it seems hard because
[21:43:54] GF seems more sentence oriented, apertium is more word oriented. in terms of rules
[21:44:09] yep
[21:44:23] i guess the main thing you could do is bootstrap the morphology and lexicon in GF
[21:44:32] well there's always the lexicon
[21:44:34] yeah
[21:45:07] what does apertium use for lexical selection right now? the same model in your paper?
[21:45:19] most language pairs have nothing
[21:45:25] some language pairs have hand-written rules
[21:45:34] (using the same formalism)
[21:45:52] nothing is directly using the method i describe yet
[21:45:59] but mostly for engineering reasons
[21:46:15] e.g. most language pairs were designed for 1:1 translation
[21:46:17] not many:many
[21:46:19] engineering, you mean fitting it into an already existing pipeline?
[21:46:21] okay
[21:46:25] no, it fits in perfectly
[21:46:29] the problem is in the bilingual lexicons
[21:47:20] I see
[21:47:50] I'll think about this tomorrow, about to crash now
[21:48:00] ok
[21:48:00] nn!
[21:48:06] have to give a seminar tomorrow :s
[21:48:09] night!
[22:26:55] *** Quits: spectre (~fran@115.114.202.84.customer.cdi.no) (Ping timeout: 244 seconds)
[22:50:40] *** Joins: spectre (~fran@115.114.202.84.customer.cdi.no)
[23:01:19] *** Quits: jmvanel (~jmvanel@66.0.88.79.rev.sfr.net) (Ping timeout: 255 seconds)
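The Yakut assimilation rule mentioned above (н /n/ becoming ҥ /ŋ/ before ҥ /ŋ/) can be illustrated as a toy string rewrite; the real rule would live in a finite-state morphological description rather than Python, and the example word below is hypothetical:

    # Toy illustration of the assimilation rule discussed above:
    # н /n/ becomes ҥ /ŋ/ when immediately followed by ҥ /ŋ/.
    def assimilate(word: str) -> str:
        return word.replace("нҥ", "ҥҥ")

    print(assimilate("анҥ"))  # hypothetical string, prints "аҥҥ"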