These guidelines were used for selecting Swedish lemmas in SMULTRON, the Stockholm MULtilingual TReebank.
Each token gets exactly one lemma.
Each token gets one lemma as suggested by the morphology system Swetwol.
The following tokens do not get a lemma:
Examples: 37 4,9 30'261 +4.3
Examples: CX IV
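The category names for these exempt tokens are not preserved here, so the following is only a rough sketch. It assumes, from the examples alone, that the two groups are digit strings and Roman numerals. Both patterns and the function name `needs_lemma` are illustrative and not part of the guidelines.

```python
import re

# Assumed patterns, inferred only from the examples above.
# Digit groups may be joined by '.', ',' or an apostrophe, with an optional sign.
NUMBER = re.compile(r"[+-]?\d+(?:[.,']\d+)*")
# Upper-case Roman numeral letters only; well-formedness is not checked.
ROMAN = re.compile(r"[IVXLCDM]+")


def needs_lemma(token: str) -> bool:
    """Return False for tokens that look like digit strings or Roman numerals."""
    return not (NUMBER.fullmatch(token) or ROMAN.fullmatch(token))
```

A pattern this loose would also exempt upper-case words such as "MIX", so in practice the decision would still depend on the part-of-speech tag or on the annotator.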
If Swetwol provides more than one lemma for the given token (and the given part of speech), then the lemma most appropriate in the given context is chosen. This applies in particular to the disambiguation between different possible segmentations.
Example:
framtidsutsikter --> framtid\s#utsikt NOT: fram#tid\s#utsikt
bostadsområdena --> bostad\s#område NOT: bo#stad\s#område
arbetsförhållanden --> arbet\s#förhållande NOT: arbetsför#hållande NOT: arbetsför#hål#land
The disambiguation between multiple lemmas is done manually.
Important: We add information to the Swetwol lemma by marking the gap 's' with \s. [This gap morph is sometimes called an interfix.]
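The lemma notation used in the examples (`#` as segment boundary, `\s` for the gap 's') can be read mechanically. The sketch below is a minimal illustration, assuming that `\s` only ever appears at the end of a segment, as it does in the examples. The function names `split_compound` and `surface_stem` are hypothetical.

```python
def split_compound(lemma: str) -> list[tuple[str, bool]]:
    """Split a segmented lemma into (segment, followed_by_gap_s) pairs."""
    parts = []
    for segment in lemma.split("#"):
        has_gap_s = segment.endswith("\\s")  # literal backslash + 's'
        parts.append((segment[:-2] if has_gap_s else segment, has_gap_s))
    return parts


def surface_stem(lemma: str) -> str:
    """Join the segments again, writing each marked gap as a plain 's'."""
    return "".join(seg + ("s" if gap else "") for seg, gap in split_compound(lemma))
```

For instance, `split_compound(r"framtid\s#utsikt")` yields the segments `framtid` (followed by a gap 's') and `utsikt`, and `surface_stem` turns the same lemma back into `framtidsutsikt`.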
If Swetwol does not provide a lemma, then the human annotator chooses the correct lemma.
If the word is a proper name (PoS=NE), then the lemma is identical to the word form, unless the name is in the genitive. In that case the genitive suffix -s is removed.
Example:
IBMs --> IBM
Schröders --> Schröder
If the word is a foreign word (PoS=FM), then the lemma is identical to the word form, unless it is an English word in the plural. In that case the suffix -s is removed.
Example:
Directors --> Director
If a foreign word is identical with a Swedish (loan) word, there might be a Swetwol lemma for it (including segmentations). In this case the Swetwol lemma is not used.
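The rules for proper names and foreign words can be sketched as one small function. This is an illustration only: whether a name is genitive or an English word is plural has to be decided by the annotator, so both are passed in as flags here. The function name and its parameters are hypothetical, and only the plain suffix -s mentioned above is handled.

```python
def lemma_for_name_or_foreign(
    form: str,
    pos: str,
    *,
    genitive: bool = False,
    english_plural: bool = False,
) -> str:
    """Return the word form itself, minus a final -s where the rules call for it."""
    strip = (pos == "NE" and genitive) or (pos == "FM" and english_plural)
    if strip and form.endswith("s"):
        return form[:-1]
    # Any Swetwol lemma is ignored for these tokens; the word form is kept.
    return form
```

Because the flags default to `False`, a name that merely ends in s (for example "Anders") is left unchanged unless the annotator marks it as genitive.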
If the token is an abbreviation, then the full word is taken as the basis for lemmatisation.
Example:
kl --> klocka
% --> procent
Acronyms are not spelled out.
SEB --> SEB
USA --> USA
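One way to picture the abbreviation rule is a lookup table that maps an abbreviation to its full word before lemmatisation, while anything not listed (acronyms included) passes through unchanged. The table below holds only the two examples given above. Its name and the function name are hypothetical.

```python
# Only the two abbreviations from the examples; a real list would be longer.
ABBREVIATIONS = {
    "kl": "klocka",
    "%": "procent",
}


def expand_abbreviation(token: str) -> str:
    """Return the full word for a known abbreviation, else the token itself."""
    return ABBREVIATIONS.get(token, token)
```

The returned full word is then the basis for lemmatisation. Acronyms such as SEB or USA are simply not in the table and are therefore kept as they are.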
If the token is an elliptical compound, then the full compound is taken as the basis for lemmatisation.
Example:
lång- och kortfristiga --> lång#fristig och kort#fristig
kapital- och likviditetsfrågor --> kapital#fråga och likviditet\s#fråga
lågspänningsbrytare och -omkopplare --> låg#spänning\s#brytare och låg#spänning\s#omkopplare
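The completion of elliptical compounds in the examples above can be approximated as follows. This is a heuristic sketch, not the procedure used by the annotators: it assumes that the omitted part is either the last segment of the full compound (its head) or everything before the last segment (its modifiers), and that the elliptical part is already given in lemma form. Both function names are hypothetical.

```python
def borrow_head(partial: str, full_lemma: str) -> str:
    """Complete a truncated first conjunct such as 'lång-' with the head
    (last segment) of the full compound's lemma."""
    head = full_lemma.rsplit("#", 1)[-1]
    return f"{partial}#{head}"


def borrow_modifiers(partial: str, full_lemma: str) -> str:
    """Complete a truncated second conjunct such as '-omkopplare' with the
    modifiers (everything before the last segment) of the full compound."""
    modifiers, _, _head = full_lemma.rpartition("#")
    return f"{modifiers}#{partial}" if modifiers else partial
```

With these definitions, `borrow_head("kapital", r"likviditet\s#fråga")` gives `kapital#fråga`, and `borrow_modifiers("omkopplare", r"låg#spänning\s#brytare")` gives `låg#spänning\s#omkopplare`. Any gap 's' belonging to the elliptical part itself would have to be supplied by the caller.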
For the lemmatisation of determiners we follow the SUC (Stockholm Umeå Corpus) conventions.
Word form | PoS | Lemma |
---|---|---|
de | Determiner DT | den |
de | Pronoun PN | de |
den | Determiner DT | den |
den | Pronoun PN | den |
dem | Pronoun PN | de |
det | Determiner DT | den |
det | Pronoun PN | det |
In order to subclassify foreign words and names we assign the following labels.
Each foreign word (PoS=FM) gets a label specifying its language. The label is the two-character ISO language code.
Example:
Board --> Board EN
Crédit --> Crédit FR
Each noun (PoS=NN) gets a label specifying its grammatical gender. We use the following labels.
Example:
utrustning --> utrustning UTR
antalet --> antal NEU
Foreign words (PoS=UO) and names (PoS=PM) do not get a gender label.
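The labelling constraints can be expressed as a small consistency check. The sketch below is an assumption-laden illustration: it takes UTR and NEU to be the only gender labels because those are the two shown in the examples, treats language labels as two upper-case letters as in EN and FR, and uses the PoS tags exactly as they are written in the rules above. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed label set: the two gender labels that appear in the examples.
GENDER_LABELS = {"UTR", "NEU"}


@dataclass
class Annotation:
    form: str
    pos: str
    lemma: str
    label: Optional[str] = None


def label_is_consistent(a: Annotation) -> bool:
    """Check the label against the rules stated for its PoS tag."""
    if a.pos == "FM":  # foreign word: two-letter language code
        return a.label is not None and len(a.label) == 2 and a.label.isalpha() and a.label.isupper()
    if a.pos == "NN":  # noun: gender label required
        return a.label in GENDER_LABELS
    if a.pos in {"UO", "PM"}:  # no gender label allowed
        return a.label not in GENDER_LABELS
    return True  # no constraint is stated for other tags
```

A check like this only catches formal inconsistencies. It does not verify that the code is a real ISO language code or that the gender is correct for the noun.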
If a word is misspelled, then the word itself is not corrected, in keeping with the principle of faithfulness to the original text. However, the (imagined) corrected word is taken as the basis for lemmatisation.
Example:
avfallshateringsregler --> avfalls#hantering\s#regel