Nayiri Developers: Corpus of Western Armenian

Corpus of Western Armenian — Concepts

Document

A Document object represents any short-form or medium-form content such as a news article or essay. It does not support pagination.

A Document consists of metadata (such as a unique identifier, title, author, publication, and date of publication) and its content, which is text annotated in the Nayiri Markup Language.
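As a rough illustration of this metadata-plus-content shape, the sketch below models a Document as a Python dataclass. The class and all field names (`doc_id`, `author_id`, `published_on`, and so on) are assumptions made for this example and are not taken from the actual Corpus schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Document:
    """Hypothetical in-memory view of a Document; field names are assumptions."""

    doc_id: str                            # unique identifier
    title: str
    content: str                           # text annotated in the Nayiri Markup Language
    author_id: Optional[str] = None        # optional link to an Author
    publication_id: Optional[str] = None   # optional link to a Publication
    published_on: Optional[date] = None    # date of publication, if known


# A Document with only the required fields; the optional metadata stays unset.
doc = Document(
    doc_id="doc-0001",
    title="Sample article",
    content="(annotated text goes here)",
)
```

Keeping the author and publication as optional identifiers, rather than embedded objects, mirrors the idea that those objects live in their own data stores.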

Documents are stored in the folder-based Document Data Store, which is the main data store of the Corpus.

Support for Paginated Documents, intended for books and other long-form content that is naturally divided into pages, is in development.

Author

A Document may have an associated Author, which identifies who wrote it.

Author objects are stored in the Authors Data Store, which is one of the three data stores of the Corpus.

Publication

A Publication is any periodical or published series such as a newspaper or journal.

Each Document may be associated with one Publication object.

Publication objects are stored in the Publications Data Store, which is one of the three data stores of the Corpus.
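Because both the Author and the Publication are optional, code that reads a Document has to handle either link being absent. The sketch below shows one way to do that, using plain dictionaries as hypothetical stand-ins for the Authors and Publications Data Stores. The identifiers, names, and the `describe` helper are all invented for illustration.

```python
from typing import Optional

# Hypothetical in-memory stand-ins for the Authors and Publications Data Stores.
authors = {"a1": "Example Author"}
publications = {"p1": "Example Newspaper"}


def describe(
    title: str,
    author_id: Optional[str] = None,
    publication_id: Optional[str] = None,
) -> str:
    """Build a one-line description, skipping any link that is absent."""
    parts = [title]
    if author_id is not None:
        parts.append("by " + authors[author_id])
    if publication_id is not None:
        parts.append("in " + publications[publication_id])
    return ", ".join(parts)


# A Document with neither link, and one with both.
unlinked = describe("Essay")
linked = describe("Essay", author_id="a1", publication_id="p1")
```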

Token

In the context of the Nayiri Armenian Text Corpus, a Token is equivalent to a Word Form in the Nayiri Armenian Lexicon.

In particular, a Token can be composed of multiple words.

For example, the following inflected forms of the verb վազել ("to run") are all considered full Tokens: վազեմ, կը վազեմ, պիտի վազեմ, and վազած պիտի ըլլայի.

This was an engineering choice: we determined that it is far easier to "downgrade" from a system that supports multi-word tokens to one that supports only single-word tokens than to "upgrade" in the opposite direction.

As such, the Nayiri Armenian Lexicon and Text Corpus are designed to do the hard things first for you; if a lower-resolution system suits your application better, you can "downgrade" to one.
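To make the "downgrade" direction concrete, here is a minimal sketch that flattens multi-word Tokens into single words by splitting on whitespace. The `downgrade` function is a hypothetical helper, not part of any Nayiri tooling, and it assumes that the words inside a Token are separated by ordinary whitespace.

```python
from typing import List


def downgrade(tokens: List[str]) -> List[str]:
    """Split each multi-word token into single-word tokens on whitespace."""
    single_words: List[str] = []
    for token in tokens:
        single_words.extend(token.split())
    return single_words


# The inflected forms of the verb "to run" listed above.
tokens = ["վազեմ", "կը վազեմ", "պիտի վազեմ", "վազած պիտի ըլլայի"]
words = downgrade(tokens)
```

The reverse direction has no equally simple counterpart: given only the flat word list, deciding which adjacent words belong to one inflected form requires knowledge of the grammar, which is the asymmetry the design choice relies on.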

Next: File Structure