SMULTRON Swedish lemmatisation guidelines

These guidelines were used for selecting Swedish lemmas in SMULTRON, the Stockholm MULtilingual TReebank.

Goal

Each token gets exactly one lemma.

Method

Each token gets one lemma as suggested by the morphology system Swetwol.

Exceptions

The following tokens do not get a lemma:

XML tags
Punctuation symbols
Numbers that consist only of digits (and related symbols)

      Examples: 37  4,9  30'261   +4.3

Roman numerals

      Examples:  CX  IV
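The exception list above can be sketched as a simple filter. This is only a sketch under stated assumptions: the regular expressions and the punctuation inventory are hypothetical and were derived from the examples given here, not from the SMULTRON tools. Note that the Roman numeral pattern overgenerates (a sentence-initial "I", the Swedish preposition, would also match), so the final decision still rests on context and part of speech.

```python
import re

# All patterns below are assumptions derived from the examples in this section.
XML_TAG = re.compile(r"^<[^<>]+>$")
NUMBER = re.compile(r"^[+-]?\d[\d.,']*$")  # 37, 4,9, 30'261, +4.3
ROMAN = re.compile(
    r"^(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
# Hypothetical punctuation inventory; '%' is deliberately absent because the
# guidelines lemmatise it as 'procent'.
PUNCTUATION = set(".,;:!?()[]\"'-/")


def gets_lemma(token: str) -> bool:
    """Return False for tokens that the Exceptions list leaves without a lemma."""
    if XML_TAG.match(token) or NUMBER.match(token) or ROMAN.match(token):
        return False
    if token and all(ch in PUNCTUATION for ch in token):
        return False
    return True
```

For instance, gets_lemma("30'261") and gets_lemma("CX") are False, while gets_lemma("framtidsutsikter") is True.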

Case 1: Multiple Swetwol Lemmas

If Swetwol provides more than one lemma for the given token (and the given part of speech), the lemma most appropriate in the given context is chosen. This applies in particular to the choice between different possible segmentations.

Example:

	framtidsutsikter    -->  framtid\s#utsikt
	                         NOT: fram#tid\s#utsikt
	bostadsområdena     -->  bostad\s#område
	                         NOT: bo#stad\s#område
	arbetsförhållanden  -->  arbet\s#förhållande
	                         NOT: arbetsför#hållande
	                         NOT: arbetsför#hål#land

The disambiguation between multiple lemmas is done manually.

Important: We add information to the Swetwol lemma by marking the linking 's' between compound parts as \s. [This gap morph is sometimes called an interfix.]
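To make the notation concrete, the following minimal sketch reads a segmented lemma, assuming that '#' separates compound parts and that a trailing \s on a part marks the gap 's', as in the examples above. The function names are hypothetical and not part of Swetwol or SMULTRON.

```python
from typing import List, Tuple


def split_lemma(lemma: str) -> List[Tuple[str, bool]]:
    """Split a lemma at '#' into (segment, has_gap_s) pairs."""
    parts = []
    for segment in lemma.split("#"):
        has_gap = segment.endswith("\\s")
        # Strip the two-character marker '\s' from the segment itself.
        parts.append((segment[:-2] if has_gap else segment, has_gap))
    return parts


def unsegmented_form(lemma: str) -> str:
    """Drop the '#' boundaries and turn each '\\s' marker back into a plain 's'."""
    return "".join(seg + ("s" if gap else "") for seg, gap in split_lemma(lemma))
```

With this reading, the lemma framtid\s#utsikt consists of the parts "framtid" (followed by a gap 's') and "utsikt", and its unsegmented form is "framtidsutsikt".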

Case 2: No Swetwol Lemma

If Swetwol does not provide a lemma, then the human annotator chooses the correct lemma.

If the word is a proper name (PoS=NE), then the lemma is identical to the word form, unless the name is in the genitive, in which case the genitive suffix -s is removed.

Example:

	IBMs      -->  IBM
	Schröders -->  Schröder

If the word is a foreign word (PoS=FM), then the lemma is identical to the word form, unless it is an English plural, in which case the plural suffix -s is removed.

Example:

	Directors  -->  Director

If a foreign word is identical with a Swedish (loan) word, there might be a Swetwol lemma for it (including segmentations). In this case the Swetwol lemma is not used.
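The rules for names and foreign words can be sketched as follows. The function and its flags are hypothetical: whether a name is in the genitive, or a foreign word is an English plural, is assumed to have been decided by the human annotator beforehand, since a final -s alone does not settle it (compare a name such as "Lars"). Only the plain suffix -s named in the rules is handled.

```python
def case2_lemma(word_form: str, pos: str,
                is_genitive: bool = False,
                is_english_plural: bool = False) -> str:
    """Lemma for a name (NE) or foreign word (FM) without a usable Swetwol lemma."""
    # Proper names: strip the genitive -s, otherwise keep the word form.
    if pos == "NE" and is_genitive and word_form.endswith("s"):
        return word_form[:-1]
    # Foreign words: strip the English plural -s, otherwise keep the word form.
    if pos == "FM" and is_english_plural and word_form.endswith("s"):
        return word_form[:-1]
    return word_form
```

For example, case2_lemma("IBMs", "NE", is_genitive=True) gives "IBM", and case2_lemma("Directors", "FM", is_english_plural=True) gives "Director".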

If the token is an abbreviation, then the full word is taken as the basis for lemmatisation.

Example:

	kl   -->  klocka
	%    -->  procent

Acronyms are not spelled out.

Example:

	SEB   -->  SEB
	USA   -->  USA
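The difference between abbreviations and acronyms can be pictured as two lookups. This is a sketch only: the two collections are hypothetical and contain nothing beyond the examples given here, and any token found in neither is left to the annotator.

```python
from typing import Optional

# Hypothetical lookup tables, filled only with the examples from this section.
ABBREVIATIONS = {"kl": "klocka", "%": "procent"}
ACRONYMS = {"SEB", "USA"}


def abbreviation_lemma(token: str) -> Optional[str]:
    """Return the lemma for a known abbreviation or acronym, else None."""
    if token in ACRONYMS:
        # Acronyms are not spelled out: the lemma is the token itself.
        return token
    # Abbreviations are lemmatised via the full word they stand for.
    return ABBREVIATIONS.get(token)
```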

Deviations from the Swetwol Suggestions

If the token is an elliptical compound, then the full compound is taken as the basis for lemmatisation.

Example:

	lång- och kortfristiga               -->  lång#fristig och kort#fristig
	kapital- och likviditetsfrågor       -->  kapital#fråga och likviditet\s#fråga
	lågspänningsbrytare och -omkopplare  -->  låg#spänning\s#brytare och låg#spänning\s#omkopplare
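The completion of an elliptical part can be sketched as borrowing the missing segments from the full compound it is coordinated with. The function below is hypothetical and makes two assumptions: the elliptical part is already given in lemma notation with its hyphen, and the shared head (for a trailing hyphen) is the last segment of the full lemma. Cases where the shared material spans several segments still need the annotator.

```python
def complete_elliptical(partial: str, full_lemma: str) -> str:
    """Complete an elliptical compound part from the lemma of the full compound."""
    segments = full_lemma.split("#")
    if partial.endswith("-"):
        # Head is missing ("lång-"): borrow the last segment of the full lemma.
        return partial[:-1] + "#" + segments[-1]
    if partial.startswith("-"):
        # Modifier is missing ("-omkopplare"): borrow everything but the head.
        return "#".join(segments[:-1] + [partial[1:]])
    return partial
```

For example, complete_elliptical("kapital-", "likviditet\\s#fråga") gives kapital#fråga, and complete_elliptical("-omkopplare", "låg#spänning\\s#brytare") gives låg#spänning\s#omkopplare.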

For the lemmatisation of determiners and the corresponding pronouns we follow the SUC (Stockholm Umeå Corpus) conventions.

	Word form   PoS             Lemma
	de          Determiner DT   den
	de          Pronoun PN      de
	den         Determiner DT   den
	den         Pronoun PN      den
	dem         Pronoun PN      de
	det         Determiner DT   den
	det         Pronoun PN      det
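The table amounts to a lookup keyed on word form and part of speech, which can be sketched as below. The dictionary restates the table; the function name and the lower-casing of the word form (to cover sentence-initial tokens such as "De") are assumptions.

```python
from typing import Optional

# (word form, PoS) -> lemma, restating the table above.
SUC_LEMMAS = {
    ("de", "DT"): "den",
    ("de", "PN"): "de",
    ("den", "DT"): "den",
    ("den", "PN"): "den",
    ("dem", "PN"): "de",
    ("det", "DT"): "den",
    ("det", "PN"): "det",
}


def determiner_lemma(word_form: str, pos: str) -> Optional[str]:
    """Look up the lemma for a determiner or pronoun form; None if not listed."""
    # Lower-casing is an assumption, not stated in the guidelines.
    return SUC_LEMMAS.get((word_form.lower(), pos))
```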

Type Information

In order to subclassify foreign words and names we assign the following labels.

Each foreign word (PoS=FM) gets a label specifying its language. The label is the two-letter ISO 639-1 language code, written in upper case.

Example:

	Board  -->  Board   EN
	Crédit -->  Crédit  FR
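A labelled foreign word can be represented as a triple of word form, lemma and language label. The sketch below is hypothetical; it only checks the shape of the label (two upper-case letters, as in the examples) and does not verify that the code is an actual ISO language code.

```python
import re
from typing import Tuple

# Shape check only: two upper-case ASCII letters, as in EN and FR.
LANGUAGE_LABEL = re.compile(r"^[A-Z]{2}$")


def label_foreign_word(word_form: str, lemma: str, language: str) -> Tuple[str, str, str]:
    """Return (word form, lemma, language label) after checking the label shape."""
    if not LANGUAGE_LABEL.match(language):
        raise ValueError("expected a two-letter language code, got %r" % language)
    return (word_form, lemma, language)
```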

Gender Information

Each noun (PoS=NN) gets a label specifying its grammatical gender. We use the following labels.

UTR - utrum (common gender)
NEU - neuter
NONE - no gender

Example:

	utrustning    -->  utrustning   UTR
	antalet       -->  antal        NEU

Foreign words (PoS=UO) and names (PoS=PM) do not get a gender label.
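A consistency check for gender labels might look like the sketch below. The function is hypothetical; it assumes that every noun carries one of the three labels and that all other tokens carry none, which goes slightly beyond what is stated here (only foreign words and names are explicitly excluded).

```python
from typing import Optional

GENDER_LABELS = {"UTR", "NEU", "NONE"}


def gender_label_ok(pos: str, gender: Optional[str]) -> bool:
    """Check that nouns have a valid gender label and other tokens have none."""
    if pos == "NN":
        return gender in GENDER_LABELS
    # Assumption: tokens with any other PoS carry no gender label at all.
    return gender is None
```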

Misspelled Words

If a word is misspelled, the word form itself is not corrected, in keeping with the principle of faithfulness to the original text. The (imagined) corrected word is nevertheless taken as the basis for lemmatisation.

Example:

	avfallshateringsregler  -->  avfall\s#hantering\s#regel

Department of Computational Linguistics Text Technologies
