Nayiri Developers: Corpus of Western Armenian

Corpus of Western Armenian — Nayiri Markup Language

Background

The Nayiri Markup Language (NML) was originally developed as a way to represent complex, hierarchically structured dictionary entries, such as those of the Malkhasiants (Մալխասեանց) dictionary (1944-1945), the "Oxford dictionary" of the Armenian language, and Հայոց Լեզուի Նոր Բառարան (1992), which is the largest modern Western Armenian dictionary.

Complex dictionary entries may have one or more headwords, each with annotations such as tags, pronunciation guides, inflectional rules and etymology. Each dictionary entry may further have one or more definitions, each with an inflection guide, etymology, one or more example sentences, and one or more Derivations such as Phrases, Phrasal Verbs, and Derivatives. Definitions may even contain sub-definitions and notes at various levels.

This only scratches the surface of complex dictionary entries. For some examples, see the entries for the words ընդ, ի, լեզու, and վիշապ.

Usage in the Corpus of Western Armenian

The Nayiri Armenian Text Corpus dataset uses a small subset of the Nayiri Markup Language for the following purposes:

Explicit Tokenization
Explicit Lemmatization
Part of Speech Tagging

Explicit Tokenization

Tokenization refers to the way that a sequence of characters is grouped into a meaningful unit, or Token. The Token is then used for things like indexing, lexical analysis, and AI model training.

In the Nayiri Armenian Text Corpus, a Token is treated as a full Word Form as described in the Nayiri Armenian Lexicon. This means that in the case of verbs, a Token includes any auxiliary words. For example, մատիտները, գրեմ, կը գրեմ and պիտի գրեմ are each distinct tokens.

As such, the terms "token" and "word form" are used interchangeably in this documentation and mean the same thing in the context of the Lexicon and Text Corpus.

This differs from the way Large Language Models perform tokenization. Tokens in Large Language Models are usually not complete human lexemes or word forms, but rather fragments of them. For example, an LLM may tokenize մատիտները not as a single token, but as մատիտ and ները.

Furthermore, other text corpora may not consider auxiliary words to be part of a token. For example, they may treat կը and գրեմ as two distinct tokens.
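The contrast between the two conventions can be illustrated with a toy sketch. This is not the Nayiri Tokenizer: the `AUXILIARIES` set and the `group_word_forms` function are hypothetical names, and the rule (attach կը or պիտի to the word that follows) is a simplification based only on the examples above.

```python
# Auxiliary words taken from the examples in the text; a real tokenizer
# would rely on the full rules of the language, not on this short list.
AUXILIARIES = {"կը", "պիտի"}

def group_word_forms(text: str) -> list[str]:
    """Group each auxiliary word with the word that follows it."""
    tokens = []
    pending = []  # auxiliaries waiting for the word they attach to
    for word in text.split():
        if word in AUXILIARIES:
            pending.append(word)
        else:
            tokens.append(" ".join(pending + [word]))
            pending = []
    tokens.extend(pending)  # trailing auxiliaries remain separate tokens
    return tokens

print("կը գրեմ".split())            # whitespace splitting: two tokens
print(group_word_forms("կը գրեմ"))  # word-form grouping: one token
```

Plain whitespace splitting yields կը and գրեմ separately, while the grouping function keeps կը գրեմ together as one word form.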

The Nayiri Tokenizer

In the backend, the Nayiri Tokenizer normally tokenizes character streams automatically, according to the rules of the Armenian language. This process is called implicit tokenization.

However, on occasion, explicit tokenization is required to manually disambiguate situations in which there may be more than one way to tokenize a group of characters.

Explicit tokenization is also required when explicitly lemmatizing a word form or tagging its part of speech (see below).

Syntax

In the Nayiri Markup Language, any sequence of characters can form a Token by surrounding it with double opening and double closing brackets: [[any word form]]

For example, [[կը գրեմ]] creates a Token with the value "կը գրեմ".
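As a minimal sketch of how a consumer of the corpus might read this syntax, the following extracts the contents of every double-bracketed span from a string. The names `TOKEN_PATTERN` and `explicit_tokens` are hypothetical and not part of any Nayiri API. The sketch assumes that spans are not nested and that surrounding whitespace inside the brackets is insignificant.

```python
import re

# One explicit token: "[[", then any characters other than brackets, then "]]".
TOKEN_PATTERN = re.compile(r"\[\[([^\[\]]+)\]\]")

def explicit_tokens(text: str) -> list[str]:
    """Return the contents of every [[...]] span in text, in order."""
    return [match.group(1).strip() for match in TOKEN_PATTERN.finditer(text)]

print(explicit_tokens("ես [[կը գրեմ]]"))  # ['կը գրեմ']
```

Text outside the brackets (here, ես) is left for implicit tokenization and is not returned by this function.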

Explicit Lemmatization

Lemmatization (sometimes loosely called stemming) is the act of resolving a token (word form) to its corresponding lemma, that is, its canonical form. Like tokenization, lemmatization provides important additional information for the purposes of indexing, lexical analysis, and AI model training.

Like tokenization, lemmatization is often done automatically (implicitly), whenever a token can unambiguously be resolved to a lemma. For example, the word forms գրեմ and կը գրեմ unambiguously resolve to the lemma գրել.

However, on occasion, a given word form may resolve to more than one lemma. In such situations, explicit lemmatization is required to disambiguate the lemma.

For example, the word form գրէ can either resolve to the nominal lemma գիր (as its Singular, Ablative case inflected form) or the verbal lemma գրել (as its Present Tense, Subjunctive Mood, Third Person, Singular inflected form, or its Imperative Mood, Singular inflected form).
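The decision of when explicit lemmatization is needed can be sketched with a small lookup table. The table below is hypothetical and contains only the word forms mentioned in this section. The real Lexicon is far larger, and its data structures are not described here.

```python
# Hypothetical word form -> candidate lemmas table, built from the examples above.
CANDIDATE_LEMMAS = {
    "գրեմ": ["գրել"],
    "կը գրեմ": ["գրել"],
    "գրէ": ["գիր", "գրել"],  # nominal or verbal reading
}

def needs_explicit_lemma(word_form: str) -> bool:
    """True when the word form resolves to more than one candidate lemma."""
    return len(CANDIDATE_LEMMAS.get(word_form, [])) > 1

print(needs_explicit_lemma("կը գրեմ"))  # False
print(needs_explicit_lemma("գրէ"))      # True
```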

Syntax

In the Nayiri Markup Language, any explicitly tokenized word form can be lemmatized using the following syntax: [[word form >>> lemma]].

For example, [[գրէ >>> գիր]] associates the word form գրէ with the nominal lemma գիր.

Similarly, [[գրէ >>> գրել]] associates the word form գրէ with the verbal lemma գրել.

To minimize verbosity, if the lemma is the same as the word form, a dot . may be used in place of the lemma.

For example, the word մասին may resolve to either the lemma մաս ("part, piece, portion") or the postposition մասին ("about, concerning, relating to").

[[մասին >>> .]] specifies մասին as the lemma of մասին (which means the postposition մասին).
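The lemmatization syntax, including the dot shorthand, could be parsed along the following lines. This is a sketch under stated assumptions, not the Nayiri parser: the `Token` type and the `parse_token` function are hypothetical names, and whitespace around `>>>` is assumed to be insignificant.

```python
import re
from typing import NamedTuple, Optional

class Token(NamedTuple):
    word_form: str
    lemma: Optional[str]  # None when the markup gives no explicit lemma

# A single explicit token with no nested brackets.
TOKEN_PATTERN = re.compile(r"\[\[([^\[\]]+)\]\]")

def parse_token(markup: str) -> Token:
    """Parse [[word form]] or [[word form >>> lemma]] into a Token."""
    match = TOKEN_PATTERN.fullmatch(markup.strip())
    if match is None:
        raise ValueError(f"not an explicit token: {markup!r}")
    word_form, separator, lemma = match.group(1).partition(">>>")
    word_form = word_form.strip()
    if not separator:
        return Token(word_form, None)
    lemma = lemma.strip()
    # A dot stands for "the lemma is the same as the word form".
    return Token(word_form, word_form if lemma == "." else lemma)

print(parse_token("[[գրէ >>> գիր]]"))
print(parse_token("[[մասին >>> .]]"))
```

With this sketch, `[[մասին >>> .]]` and `[[մասին >>> մասին]]` parse to the same value.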

Part of Speech Tagging

In some cases, simply providing a lemma is not sufficient to disambiguate a given word form to the underlying Lemma object.

This happens when more than one Lemma object is associated with a given lemma string.

For example, the lemma string համար may refer to either the postposition համար ("for, on account of") or the noun համար ("account, number, count, calculation, enumeration").

To fully disambiguate such lemmas, a part of speech tag must be added explicitly to the lemma.

Syntax

In the Nayiri Markup Language, you can add a part of speech tag to the lemma of any lemmatized word form using the following syntax: [[word form >>> lemma@partOfSpeech]].

For example, [[համար >>> համար@ADP]] specifies that the lemma is the postposition համար.

Similarly, [[համար >>> համար@N]] specifies that the lemma is the noun համար.

Part of Speech Tag Reference

Part of Speech	Tag (Long Form)	Tag (Short Form)
noun	NOUN	N
pronoun	PRONOUN	PRO
verb	VERB	V
adjective	ADJECTIVE	ADJ
adverb	ADVERB	ADV
conjunction	CONJUNCTION	CON
interjection	INTERJECTION	INT
article	ARTICLE	ART
determiner	DETERMINER	DET
adposition	ADPOSITION	ADP

For backwards compatibility with an older version of the NML, the PREP or PREPOSITION tag may also be used in the case of an adposition.

Note that while English rarely uses postpositions, Armenian uses both prepositions and postpositions; in some cases (such as բացի), an adposition can be used either as a preposition or a postposition.

For this reason, the "adposition" classification is used in the Nayiri Lexicon as a superset of prepositions and postpositions.
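A consumer of the corpus may want to reduce the long, short, and legacy tag spellings to a single form. The sketch below does so using the tags listed in the reference table. The names `POS_TAGS`, `TAG_TO_SHORT`, and `split_lemma` are hypothetical. The sketch assumes that tags are written in upper case exactly as shown, since the text does not state whether tags are case-sensitive.

```python
from typing import Optional

# (part of speech, long form, short form), copied from the reference table.
POS_TAGS = [
    ("noun", "NOUN", "N"),
    ("pronoun", "PRONOUN", "PRO"),
    ("verb", "VERB", "V"),
    ("adjective", "ADJECTIVE", "ADJ"),
    ("adverb", "ADVERB", "ADV"),
    ("conjunction", "CONJUNCTION", "CON"),
    ("interjection", "INTERJECTION", "INT"),
    ("article", "ARTICLE", "ART"),
    ("determiner", "DETERMINER", "DET"),
    ("adposition", "ADPOSITION", "ADP"),
]

# Map every accepted spelling to its short form.
TAG_TO_SHORT = {}
for _, long_form, short_form in POS_TAGS:
    TAG_TO_SHORT[long_form] = short_form
    TAG_TO_SHORT[short_form] = short_form
# Legacy tags kept for backwards compatibility.
TAG_TO_SHORT["PREPOSITION"] = "ADP"
TAG_TO_SHORT["PREP"] = "ADP"

def split_lemma(annotated: str) -> tuple[str, Optional[str]]:
    """Split 'lemma@TAG' into (lemma, short tag); the tag is None if absent."""
    lemma, separator, tag = annotated.partition("@")
    if not separator:
        return lemma.strip(), None
    tag = tag.strip()
    if tag not in TAG_TO_SHORT:
        raise ValueError(f"unknown part of speech tag: {tag!r}")
    return lemma.strip(), TAG_TO_SHORT[tag]

print(split_lemma("համար@ADP"))
print(split_lemma("նման@PREP"))
```

Normalizing to the short form means that `համար@ADP`, `համար@ADPOSITION`, and the legacy `համար@PREP` all compare equal after parsing.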

More examples

Some common examples of the subset of the Nayiri Markup Language described above and used in the Corpus are:

[[որ >>> որ@PRO]] versus [[որ >>> որ@CON]] to disambiguate the pronoun որ ("who, which, that") (e.g. "մարդ մը որ կարդալ կը սիրէ", "a man who likes to read") from the conjunction որ ("that") (e.g. "գիտեմ որ պիտի սիրես", "I know that you will like it").

[[այս >>> այս@PRO]] versus [[այս >>> այս@DET]] to disambiguate the pronoun այս ("this thing") (e.g. "ես այս հաւնեցայ", "I liked this") from the determiner այս ("this") (e.g. "այս մարդը", "this person").

[[նման >>> նման@ADJ]] versus [[նման >>> նման@PREP]] to disambiguate the adjective նման ("similar") (e.g. "նման մարդիկ", "similar people") from the adposition նման ("similar to, like") (e.g. "անոր նման աշխատասէր է", "he is hardworking like her").

[[մասին >>> մաս]] versus [[մասին >>> .]] or [[մասին >>> մասին]] to disambiguate the inflected form of մաս ("part, piece, portion") (e.g. "յօդուածի առաջին մասին մէջ", "in the first part of the article") from the adposition մասին ("about, concerning, relating to") (e.g. "անոնց մասին ի՞նչ գիտես", "what do you know about them?").
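When only the running text is needed, the markup can be removed by replacing each explicit token with its bare word form. The following is a minimal sketch of that idea. The `strip_markup` function is a hypothetical helper, not part of any Nayiri tooling, and it assumes that tokens are not nested.

```python
import re

# A single explicit token with no nested brackets.
TOKEN_PATTERN = re.compile(r"\[\[([^\[\]]+)\]\]")

def strip_markup(text: str) -> str:
    """Replace every [[...]] span with its word form, dropping lemma and tag."""
    def word_form(match: re.Match) -> str:
        # Everything before ">>>" is the word form; the rest is annotation.
        return match.group(1).partition(">>>")[0].strip()
    return TOKEN_PATTERN.sub(word_form, text)

print(strip_markup("անոնց [[մասին >>> .]] ի՞նչ գիտես"))
# անոնց մասին ի՞նչ գիտես
```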
