Aarne Ranta %%date %!Encoding:utf8 %!style(html): revealpopup.css %!postproc(tex) : "#BECE" "begin{center}" %!postproc(html) : "#BECE" "
" %!postproc(tex) : "#ENCE" "end{center}" %!postproc(html) : "#ENCE" "
" %!postproc(tex) : "#HR" "hline" %!postproc(html) : "#HR" "
" Also available for [Chinese gf-chinese.html] [Finnish gf-finnish.html] [French gf-french.html] [German gf-german.html] [Swedish gf-swedish.html] #HR **Digital grammars** are grammars usable by computers, so that they can mechanically perform tasks like interpreting, producing, and translating languages. The **GF Resource Grammar Library** (RGL) is a set of digital grammars which, at the time of writing, covers 28 languages. These grammars are written in GF, **Grammatical Framework**, which is a programming language designed for writing digital grammars. The grammars in the RGL have been written by linguists, computer scientists, and programmers who know the languages thoroughly, both in practice and in theory. Almost 50 persons from around the world have contributed to this work, and ongoing projects are expected to give us many new languages soon. The leading idea of the RGL is that different languages share large parts of their grammars, despite their observed differences. One important thing that is shared are the **categories**, that is, the types of words and expressions. For instance, every language in RGL has a category of **nouns**, but what exactly a noun is varies from language to language. Thus English nouns have four forms (singular and plural, nominative and genitive, as in //house, houses, house's, houses'//) whereas French nouns have just two forms (singular and plural //maison, maisons//, "house"), but they also have a piece of information that English nouns don't have, namely gender (masculine and feminine). Chinese nouns have just one form (房子 //fangzi// "house"), which is used for both singular and plural, but in addition, a little bit like the French gender, they have a **classifier** (间 //jian// for the word "house"). German nouns have 8 forms and a gender, Finnish nouns have 26 forms, and so on. This document provides a tour of the digital grammars in the RGL. It is intended to serve at least three kinds of readers. 
In decreasing order of the number of potential readers, these are
- those who want to learn the grammar of some language in a precise way,
- those who want to use the RGL for a programming task,
- those who want to write an RGL grammar for a new language.


The document has two main parts: **Words** and **Syntax**. Both parts have a **general section**, explaining the RGL structure from a multilingual perspective, followed by a **specific section**, going into the details of the grammar in a particular language. The general sections are the same in all languages. The specific sections differ in length and detail, depending on the complexity of the language and on what aspects are particularly interesting or problematic for the language in question.

+Words: general rules+

Categories of words are called **lexical categories**. The language-specific variation in lexical categories is due to **morphology**, that is, the different forms that one and the same word can have in different contexts. If we look at the 28 languages in the RGL, we can see that the classification of words is common to all the languages, and the differences are in morphology.

In this chapter, we will explain all lexical categories and give an overview of their morphological aspects. Details of morphology for each language are given in the language-specific documents.

++Main parts of speech: content words++

The most important categories of words are given in the following table. More precisely, we will give the categories of **content words**, which, so to say, describe things and events in the real world. Content words are distinguished from **structural words**, whose purpose is to combine words into syntactic structures. Each category of content words may have thousands of words, and new words can be introduced continuously; therefore, these categories are also called **open categories**. In contrast, structural words are very few (maybe some dozens), and new ones are very seldom added.
Each category has a GF name, that is, a short symbolic name, which is the name actually used in the GF program code. In the text we usually use the text names, but we will sometimes find the GF names handy as well, since they give us a short and precise way to state grammatical rules.

===Table: categories of content words===

|| GF name | text name | example | inflectional features | inherent features | semantics ||
| ``N`` | noun | //house// | number, case | gender, classifier | ``n`` = ``e -> t``
| ``PN`` | proper name | //Paris// | case | gender | ``e``
| ``A`` | adjective | //blue// | gender, number, case, degree | position | ``a`` = ``e -> t``
| ``V`` | verb | //sleep// | number, person, tense, aspect, mood | subject case | ``v`` = ``e -> t``
| ``Adv`` | adverb | //here// | (none) | adverb type (place, time, manner) | ``adv`` = ``v -> v``
| ``AdA`` | adadjective | //very// | (none) | (none) | ``a -> a``

In addition to the names and examples, the table lists the **inflectional features** and **inherent features** typical of each category. Inflectional features are those that create different forms of words. For instance, French nouns have forms for number (singular and plural); as one often says, French nouns are //inflected for number//. In contrast to number, gender does not give rise to different forms of French nouns: //maison// ("house") //is// feminine, inherently, and there is no masculine form of //maison//. (Of course, there are some nouns that do have masculine and feminine forms, such as //étudiant, étudiante// "male/female student", but this only applies to a minority of French nouns and shouldn't be taken as an indication of an inflectional gender.)

++Syntactic implications++

The features given in the table are rough indications of what one can expect in different languages. Thus, for instance, some languages have no gender at all, and hence their nouns and adjectives won't have genders either.
But the table is a rather good generalization from the 28 languages of the RGL: we can safely say that, if a language //does// have gender, then nouns have an inherent gender and adjectives have a variable gender. This is not a coincidence but has to do with **syntax**, that is, the combination of words into complex expressions. Thus, for instance, nouns are combined with adjectives that modify them, so that

#BECE
//blue// + //house// = //blue house//
#ENCE

Now, adjectives have to be combinable with all nouns, independently of the gender of the noun: there are no separate classes of masculine and feminine adjectives (again, with some apparent exceptions, such as //pregnant//, but even these adjectives have at least grammatically correct metaphoric uses with nouns of other genders). This means that we must be able to pick the gender of the adjective in agreement with the gender of the noun that it modifies, which means that the gender of adjectives must be inflectional. Thus in French the adjective for "blue" is //bleu//, with the feminine form //bleue//, and works as follows:

#BECE
//bleu// + //maison// = //maison bleue// ("blue house", feminine)

//bleu// + //livre// = //livre bleu// ("blue book", masculine)
#ENCE

French also provides examples of adjectives with different **positions**: //bleu// is put after the noun it modifies, whereas //vieux// ("old") is put before the noun: //vieux livre// ("old book").

We will return to syntax later. At this point, it is sufficient to say that the morphological features of words are not there for nothing: they play an important role in how words are combined in syntax. In particular, they determine to a great extent how **agreement** works, that is, how the features of words depend on each other in combinations.
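
The agreement mechanism described above can be sketched in a few lines of Python. This is only an illustration and not RGL code: the record types ``Noun`` and ``Adjective`` and the function ``modify`` are hypothetical names, and the tiny lexicon ignores details such as the form //vieil// used before vowels. The point of the sketch is that gender is stored once with the noun (inherent) but indexes a table of forms in the adjective (inflectional), while position is inherent to the adjective.

```python
from dataclasses import dataclass

# Hypothetical mini-lexicon for illustration; not the RGL's actual data structures.

@dataclass(frozen=True)
class Noun:
    forms: dict    # inflectional feature: number -> word form
    gender: str    # inherent feature: "masc" or "fem"

@dataclass(frozen=True)
class Adjective:
    forms: dict    # inflectional features: (gender, number) -> word form
    position: str  # inherent feature: "pre" or "post"

def modify(adj: Adjective, noun: Noun, number: str = "sg") -> str:
    """Combine an adjective with a noun; the adjective agrees in gender and number."""
    a = adj.forms[(noun.gender, number)]   # gender is picked from the noun
    n = noun.forms[number]
    return f"{a} {n}" if adj.position == "pre" else f"{n} {a}"

maison = Noun({"sg": "maison", "pl": "maisons"}, gender="fem")
livre = Noun({"sg": "livre", "pl": "livres"}, gender="masc")

bleu = Adjective(
    {("masc", "sg"): "bleu", ("fem", "sg"): "bleue",
     ("masc", "pl"): "bleus", ("fem", "pl"): "bleues"},
    position="post",
)
vieux = Adjective(
    {("masc", "sg"): "vieux", ("fem", "sg"): "vieille",
     ("masc", "pl"): "vieux", ("fem", "pl"): "vieilles"},
    position="pre",
)
```

With these definitions, ``modify(bleu, maison)`` gives //maison bleue// and ``modify(vieux, livre)`` gives //vieux livre//.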
++Semantics of the categories++

//Notice: this section, and all "semantics" columns can be safely skipped, because//
//the semantic types do not belong to the RGL proper, and don't appear anywhere in the code.//
//Their understanding can however be useful, in particular for programmers who want to use the RGL to//
//express logical relations, ontologies, etc.//

The last column in the category table shows the **semantic type** corresponding to each category. This type gives an indication of the kind of meaning that a word of each category has. Starting from the simplest meanings, ``e`` is the type of **entities** that serve as meanings of proper names. Nouns, adjectives, and verbs have the type ``e -> t``, which means **functions from entities to propositions** (where the symbol ``t`` for propositions comes from **truth values**). Such a function can be **applied** to an entity to yield a proposition. The type ``t`` itself is reserved for sentences, which are formed in syntax by putting words together. For example, the sentence //Paris is large// involves an application of the adjective //large// to //Paris//, and yields the value true if //large// applies to //Paris//. //Paris is a capital// works in a similar way with the noun //capital//, and //Paris grows// with the verb //grow//.

The semantic types will be useful in syntax to explain the ways in which expressions are combined. They are also useful in explaining some differences between categories. For example, the categories ``PN`` and ``N`` are different, because a ``PN`` refers to an entity but an ``N`` expresses a property of an entity. Of course, the semantic types alone do not explain all distinctions of categories: nouns, verbs, and adjectives have the same semantic type, but different syntactic properties. We will occasionally use the **type synonyms** ``n``, ``a``, and ``v`` instead of ``e -> t``, to give a clearer structure to some semantic types.
But from the semantic point of view, all these types are one and the same. We should notice that the semantic types given here are quite rough and do not give a full picture of the nuances. For instance, many adjectives work in a different way than straightforwardly yielding truth values from entities. An example is the adjective //large//. Being a //large mouse// is different (in terms of absolute size) from being //a large elephant//, and a logical type for expressing this is ``n -> e -> t``, with an argument ``n`` indicating the domain of comparison (such as mice or elephants). Another problem is that defining verbs as ``e -> t`` suggests that all verbs apply to all kinds of entities. But there are combinations of entities and verbs that make no sense semantically. For example, the verb //sleep// is only meaningful for animate entities, and a sentence like //this book sleeps//, if not senseless, requires some kind of a metaphorical interpretation of //sleep//. The following table summarizes the most important semantic types that will be used. We use more primitive types than most traditional approaches, which reduce everything to ``e`` and ``t``. For instance, we can't see any way to reduce the top-level category ``p`` of phrases to these types. From a type-theoretical perspective, ``p`` is the category of **judgements**, whereas ``e`` and ``t`` operate on the lower level of propositions. Some more types are defined in the category tables. 
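
As a rough illustration of the function types discussed in this section, the following Python sketch models entities as strings and propositions as booleans. The toy world, its weights, and all names are invented for the example and do not come from the RGL. It shows how //large// can take a comparison class of type ``n`` before it yields an ordinary ``e -> t`` property.

```python
from typing import Callable

Entity = str                      # e
Prop = bool                       # t
Pred = Callable[[Entity], Prop]   # e -> t, the type shared by n, a and v

# Toy world: every fact below is made up for illustration.
WEIGHT_KG = {"mickey": 0.02, "jerry": 0.03, "dumbo": 4000.0, "jumbo": 6000.0}

def mouse(x: Entity) -> Prop:
    return x in {"mickey", "jerry"}

def elephant(x: Entity) -> Prop:
    return x in {"dumbo", "jumbo"}

def large(domain: Pred) -> Pred:
    """n -> e -> t: 'large' relative to the mean weight of the comparison class."""
    weights = [w for x, w in WEIGHT_KG.items() if domain(x)]
    mean = sum(weights) / len(weights)
    return lambda x: domain(x) and WEIGHT_KG[x] > mean

large_mouse = large(mouse)        # an ordinary e -> t property again
large_elephant = large(elephant)
```

Here ``large(mouse)("jerry")`` is true while ``large(elephant)("dumbo")`` is false, although the toy elephant is far heavier than the toy mouse.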
===Table: semantic types===

|| name | text name | example | definition ||
| ``e`` | entity | //Paris// | (primitive)
| ``t`` | proposition ("truth value") | //Paris is large// | (primitive)
| ``q`` | question | //is Paris large// | (primitive)
| ``p`` | top-level phrase | //Paris is large.// | (primitive)
| ``n`` | substantive ("noun") | //man// | ``e -> t``
| ``a`` | quality ("adjective") | //large// | ``e -> t``
| ``v`` | action ("verb") | //sleep// | ``e -> t``
| ``np`` | quantifier ("noun phrase") | //every man// | ``(e -> t) -> t``

++Subcategorization++

In addition to the features needed for inflection and agreement, the lexicon must give information about //what// combinations are possible with each word. For most nouns and adjectives, this is simple: a noun can be modified by an adjective, for instance, and there is a uniform syntax rule for this. However, there are some nouns and adjectives that are trickier, because they don't correspond to simple things but to **relations**. For instance, //brother// is a **relational noun**, since its primary usage is not alone but in phrases like //brother of this man//. In the same way, //similar// is a **relational adjective**, since its primary use is in phrases like //similar to this//. The additional term attached to these words is called its **complement**; thus //this// is the complement in //similar to this//.

The categories of words that take complements are called **subcategories**. They are morphologically similar to the main categories, but need extra information for the usage of complements. The RGL has categories for relational nouns and adjectives, and nouns also have a variant with two complements (e.g. //distance from Paris to Munich//). From the semantic point of view, complements are called **places**, and appear as supplementary argument places in semantic types.
Thus the number of places is one plus the number of complements, so that the first place is reserved for the subject of a sentence and the rest of the places for the complements.

The following table shows the categories of relational nouns and adjectives in the RGL. The inflectional and inherent features are the same as for one-place nouns and adjectives, but for each complement, the lexicon must tell what preposition, if any, is needed to attach that complement. For instance, the preposition for //similar// is //to//, whereas the preposition for //different// is //from//. In languages with richer case systems (such as German, Latin, and Finnish), the complement information also determines the case (genitive, dative, ablative, and so on).

===Table: subcategories of nouns and adjectives===

|| GF name | text name | example | inherent complement features | semantics ||
| ``N2`` | two-place noun | //brother// (//of someone//) | case or preposition | ``e -> n``
| ``N3`` | three-place noun | //distance// (//from some place to some place//) | case or preposition | ``e -> e -> n``
| ``A2`` | two-place adjective | //similar// (//to something//) | case or preposition | ``e -> e -> t``

Verbs show a particularly rich variation in subcategorization. The most familiar distinction is the one between **intransitive** and **transitive** verbs: intransitive verbs need only a **subject** (like //she// in //she sleeps//), whereas transitive verbs also need an **object** (like //him// in //she loves him//). Our category ``V`` obviously includes intransitive verbs. But there is no category for transitive verbs in the RGL. Instead, we have a more general category of **two-place verbs**, which includes transitive verbs but also verbs that need a preposition (such as //at// in //she looks at him//). Just as for relational nouns and adjectives, the complement of a two-place verb has variations in cases and prepositions.

The following table shows the subcategories of verbs in the RGL.
The list is long but it may still be incomplete. For example, there are no four-place verbs (//she paid him one million pounds for the house//). Such constructions can be built, as we will see later, by using for instance a ``V3`` verb with an additional adverb. But we can envisage future additions of more subcategories for verbs.

===Table: subcategories of verbs===

|| GF name | text name | example | inherent complement features | semantics ||
| ``V2`` | two-place verb | //love// (//someone//) | case or preposition | ``e -> e -> t``
| ``V3`` | three-place verb | //give// (//something to someone//) | two cases or prepositions | ``e -> e -> e -> t``
| ``VV`` | verb-complement verb | //try// (//to do something//) | infinitive form | ``e -> v -> t``
| ``VS`` | sentence-complement verb | //know// (//that something happens//) | sentence mood | ``e -> t -> t``
| ``VQ`` | question-complement verb | //ask// (//what happens//) | question mood | ``e -> q -> t``
| ``VA`` | adjective-complement verb | //become// (//something, e.g. old//) | adjective case | ``e -> a -> t``
| ``V2V`` | two-place verb-complement verb | //force// (//someone to do something//) | infinitive form, control type | ``e -> e -> v -> t``
| ``V2S`` | two-place sentence-complement verb | //tell// (//someone that something happens//) | object case, sentence mood | ``e -> e -> t -> t``
| ``V2Q`` | two-place question-complement verb | //ask// (//someone what happens//) | object case, question mood | ``e -> e -> q -> t``
| ``V2A`` | two-place adjective-complement verb | //paint// (//something in some colour, e.g. blue//) | object and adjective case | ``e -> e -> a -> t``

Of particular interest here is the infinitive form in ``VV`` and ``V2V``. For instance, English has three such forms: bare infinitive (//I must sleep//), infinitive with //to// (//I try to sleep//), and the //ing// form (//I start sleeping//).
Traditional English grammar makes a distinction between **auxiliary verbs** (such as //must//) and other verb-complement verbs (such as //try// and //start//), but this distinction is very specific to English (and some other Germanic languages) and hard to maintain in a multilingual setting like the RGL. Thus we make the distinction on the level of complement features and not on the level of categories.

The **mood** of complement sentences and questions is relevant in languages like French and Ancient Greek, where some verbs may require sentences in the indicative, some in another mood such as subjunctive, conjunctive, or optative. English has only a few remnants of conjunctives, such as with the verb //suggest// as used in //I suggest that this part be struck out//.

The type of **control** in ``V2V`` is interesting but subtle. It decides whether the verb complement of the verb agrees with the subject or the object. An example of a **subject-control verb** is //promise//: //I promised her to wash myself//. **Object-control verbs** seem to be more common: //I forced her to wash herself//, //I made her wash herself//, etc. Semantically, the type ``e -> e -> v -> t`` works for both of them. However, if you consider the proposition formed by applying them, then the two kinds of verbs apply their argument verb to different arguments:
- ``promise subj obj verb`` is about the proposition ``verb subj``
- ``force subj obj verb`` is about the proposition ``verb obj``


Hence it would make sense to distinguish between subject-control and object-control ``V2V``'s on the category level rather than with a complement feature. The agreement behaviour would then become simpler to describe, and, what is more important, the semantic behaviour would be predictable from the category alone.

As a final thing about subcategorizations, notice that one and the same verb can have different categories. In the above table, //ask// appears in both ``VQ`` and ``V2Q``.
Now, these uses are related, in the sense that to //ask something// means the same as to //ask someone something//. But in some other cases, the meaning can be completely different. For instance, //walk// in ``V2`` (as in //I walk the dog//) is different from //walk// in ``V`` (as in //the dog walks//). The ``V2`` is in this case **causative** with respect to the ``V``: I cause the walking of the dog. From the multilingual perspective, it is just a coincidence that English uses the same verb for the intransitive and the causative meanings. In many other languages, different words would be used. And English does the same for many other verbs: one cannot say //I eat the dog// to express that I make the dog eat; the verb //feed// is used instead.

++Structural words++

We have defined the categories of content words along three criteria:
- **morphological**: words belonging to the same category must have the same types of inflectional and inherent features
- **syntactic**: words belonging to the same category must have the same syntactic combination possibilities
- **semantic**: words belonging to the same category must have the same semantic type


Thus morphological criteria are, in most languages, enough to tell apart ``N``, ``A``, ``V``, and ``Adv``. Syntactic criteria are appealed to when distinguishing the subcategories of nouns, adjectives, and verbs. Semantic criteria are often obeyed as well, although we have noticed that finer distinctions could be useful for subject vs. object control verbs and for different kinds of adjectives.

For structural words, following the same criteria leads to a high number of categories, higher than in many traditional grammars. Thus, for instance, the category of **pronouns** is divided into at least personal pronouns (//he//), determiners (//this//), interrogative pronouns (//who//), and relative pronouns (//that//).
There is no way to see all these classes as subcategories of a uniform class of pronouns, as we did with the verb subcategories: for verbs, there was a uniform set of features, to which only complement feature information had to be added, but the same does not hold for the things traditionally called "pronouns".

Structural words moreover contain many categories that have no morphological variation or morphologically relevant features. For instance, interrogative adverbs (such as //why//) and sentential adverbs (such as //always//) are, in all languages we have encountered, equivalent from the morphological point of view. Yet of course they are syntactically different, as one cannot convert //why are you always late// into //always are you why late//. And semantically, sentential adverbs modify actions whereas interrogative adverbs form questions from sentences.

The following tables give a summary of the structural word categories of the RGL, equipped with morphological and semantic information as we did for content words. The full details will be best explained in the sections on syntax, i.e. on how the structural words are actually used for building structures.

===Table: categories of structural words===

**Building noun phrases**

|| GF name | text name | example | inflectional features | inherent features | semantics ||
| ``Det`` | determiner | //every// | gender, case | number, definiteness | ``det`` = ``n -> (e -> t) -> t``
| ``Quant`` | quantifier | //this// | gender, number, case | definiteness | ``num -> det``
| ``Predet`` | predeterminer | //only// | gender, number, case | (none) | ``np -> np``
| ``Pron`` | personal pronoun | //he// | case, possessives | gender, number, person | ``e``

The most important thing to notice is the distinction between ``Det`` and ``Quant``. The latter covers determiners that have "two forms", one for each number, such as //this-these// and //that-those//. The former covers determiners with a fixed number, such as //every// (singular).
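
The difference between ``Det`` and ``Quant`` can be sketched in Python as follows. The class and function names (``fix_number``, ``noun_phrase``) are hypothetical and chosen only for this illustration; they are not the names used in the RGL. The sketch treats number as an inherent field of a determiner but as an index into the forms of a quantifier, mirroring the semantic type ``num -> det`` in the table.

```python
from dataclasses import dataclass

# Illustrative records only; gender, case and definiteness are left out.

@dataclass(frozen=True)
class Det:
    form: str
    number: str    # inherent: the determiner fixes the number of the noun

@dataclass(frozen=True)
class Quant:
    forms: dict    # inflectional: number -> word form

def fix_number(quant: Quant, number: str) -> Det:
    """Turn a quantifier into a determiner by choosing a number (num -> det)."""
    return Det(quant.forms[number], number)

def noun_phrase(det: Det, noun_forms: dict) -> str:
    """The noun takes whatever number the determiner carries."""
    return f"{det.form} {noun_forms[det.number]}"

every = Det("every", "sg")
this = Quant({"sg": "this", "pl": "these"})
house = {"sg": "house", "pl": "houses"}
```

For example, ``noun_phrase(every, house)`` yields //every house//, and ``noun_phrase(fix_number(this, "pl"), house)`` yields //these houses//.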
**Building number expressions**

|| GF name | text name | example | inflectional features | inherent features | semantics ||
| ``Num`` | number expression | //five// | gender, case | number | ``num`` = ``det``
| ``Card`` | cardinal number | //five// | gender, case | number | ``num`` = ``det``
| ``Ord`` | ordinal number | //fifth// | gender, number, case | (none) | ``e -> t``
| ``Numeral`` | verbal numeral | //five// | gender, case, card/ord | number | ``num``
| ``Digits`` | numeral in digits | //511// | card/ord | number | ``num``
| ``AdN`` | numeral-modifying adverb | //almost// | (none) | (none) | ``num -> num``

//Notice: under// ``Numeral``, //there is a category structure of its own, which is however of a technical//
//nature and usually needs no attention from library users.//

**Building interrogatives and relatives**

|| GF name | text name | example | inflectional features | inherent features | semantics ||
| ``IP`` | interrogative pronoun | //who// | case | gender, number | ``(e -> t) -> q``
| ``IDet`` | interrogative determiner | //how many// | gender, case | number | ``n -> (e -> t) -> q``
| ``IQuant`` | interrogative quantifier | //which// | gender, number, case | (none) | ``num -> n -> (e -> t) -> q``
| ``IAdv`` | interrogative adverb | //why// | (none) | (none) | ``t -> q``
| ``RP`` | relative pronoun | //that// | gender, number, case | gender, number | ``(e -> t) -> rel``

The interrogative pronoun structure replicates a part of the determiner structure. For instance, an ``IQuant`` such as //which// is usable for both singular and plural, whereas an ``IDet`` has a fixed number: //how many// is plural.
**Combining sentences**

|| GF name | text name | example | inflectional features | inherent features | semantics ||
| ``Conj`` | conjunction | //and// | (none) | number; continuity | ``t -> t -> t``
| ``PConj`` | phrasal conjunction | //therefore// | (none) | (none) | ``p -> p``
| ``Subj`` | subjunction | //if// | (none) | mood | ``t -> adv``

**Adverbial expressions**

|| GF name | text name | example | inflectional features | inherent features | semantics ||
| ``AdV`` | sentential adverb | //always// | (none) | (none) | ``v -> v``
| ``CAdv`` | comparative adverb | //as// | (none) | (none) | ``a -> e -> a``
| ``Prep`` | preposition | //through// | (none) | case, position | ``np -> adv``

One more thing to be taken into account is that many of the "structural word categories" also admit of complex expressions and not only words. That is, the RGL has not only words in these categories but also syntactic rules for building more expressions. Thus for instance //these five// is a ``Det`` built from the ``Quant`` //this// and the ``Num`` //five//.

It is also common that a "structural word" in a particular language is realized as a feature of the other words it combines with, rather than as a word of its own. For instance, the determiner //the// in Swedish just selects an inflectional form of the noun that it is applied to: "the" + //bil// = //bilen// ("the car").
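
The last point can be illustrated with a small Python sketch in which the same abstract determiner is realized differently in two languages. The dictionaries and the function names ``the_english`` and ``the_swedish`` are assumptions made for this example only, and the Swedish rule is deliberately simplified to the case described above.

```python
# Hypothetical noun entries: each noun lists an indefinite and a definite form.
# English has no separate definite form, so both entries are the same.
car_en = {"indef": "car", "def": "car"}
bil_sv = {"indef": "bil", "def": "bilen"}

def the_english(noun: dict) -> str:
    # English: the determiner is a word of its own, placed before the noun.
    return "the " + noun["indef"]

def the_swedish(noun: dict) -> str:
    # Swedish (simplified): the determiner adds no word,
    # it only selects the definite form of the noun.
    return noun["def"]
```

Both functions play the same role in the grammar, but only the English one contributes a separate word: ``the_english(car_en)`` gives //the car//, whereas ``the_swedish(bil_sv)`` gives //bilen//.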