Nayiri for Developers

Corpus of Western Armenian — Dataset

Overview

The Corpus of Western Armenian is a partially annotated corpus of selected texts in the Western Armenian language. It was first released in May 2024 and is under continual development. See the Overview page for background.

As part of the Nayiri Institute's stated goal of serving as a catalyst for the development of meaningful software in the Armenian language, we provide the underlying annotated dataset here so that software developers can build and train Armenian language systems, from AI and language models to traditional software.

Current Release

The current release (2026-02-25-v2) focuses on providing a modest but broad, high-quality dataset. In particular, the focus has been on short-form texts that are mostly in the form of essays and news articles, as well as a few songs and poems. This allows for a reasonable breadth of topics, authors, writing styles, and vocabularies.

The corpus dataset is currently limited to Western Armenian. It contains:

  • 396 Documents
  • 21 Publications
  • 100 Authors
  • 170,745 Tokens

Development of the corpus is ongoing, with the dataset expanding over time.

Support for long-form, paginated text (for example, books and magazines) is in development.

Documentation

Before downloading and working with the corpus dataset, we recommend reviewing the documentation in the following order:

  1. Concepts – Start here to understand some general concepts about the dataset.
  2. File Structure – Understand the organization of the various data stores on disk.
  3. Document Data Store – Learn the file format of Document files in the Document Data Store.
  4. Authors Data Store – Review how Authors are stored on disk.
  5. Publications Data Store – Review how Publications are stored on disk.
  6. Nayiri Markup Language – Finally, learn how explicit tokenization, lemmatization, and part-of-speech tagging are done.

Licensing and Attribution

Licensing

This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

While we don't hold a copyright on the content, we reserve rights over the annotation.

You are free to:

  • Use the data for commercial and non-commercial purposes
  • Modify, adapt, and build upon the data
  • Redistribute the original or modified versions

Provided that you:

  • Give appropriate credit to the original source
  • Indicate if changes were made
  • Do not imply endorsement by the Nayiri Institute or Serouj Ourishian

A copy of the full license text is available at:
https://creativecommons.org/licenses/by/4.0/

Attribution

When using or redistributing this dataset, please include attribution in a reasonable and visible manner. The attribution should include the name of the dataset, the authoring organization, and the license.

Recommended attribution format:

Nayiri Armenian Text Corpus © Serouj Ourishian. Licensed under CC BY 4.0.

If you modify the data, please indicate that changes were made, for example:

Nayiri Armenian Text Corpus © Serouj Ourishian. Modified by <Your Name or Organization>. Licensed under CC BY 4.0.
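For projects that generate a README, NOTICE file, or credits screen programmatically, the two formats above can be produced by a small helper. This is only an illustrative sketch: the function name `attribution_line` and its `modified_by` parameter are hypothetical and not part of any Nayiri tooling.

```python
def attribution_line(modified_by=None):
    """Build an attribution string in the recommended format.

    `modified_by` is a hypothetical parameter: pass your name or
    organization if you changed the data, or leave it as None otherwise.
    """
    parts = ["Nayiri Armenian Text Corpus © Serouj Ourishian."]
    if modified_by:
        # CC BY 4.0 asks that modifications be indicated.
        parts.append(f"Modified by {modified_by}.")
    parts.append("Licensed under CC BY 4.0.")
    return " ".join(parts)
```

Calling `attribution_line()` yields the unmodified form, and `attribution_line("Example Lab")` yields the modified form with "Example Lab" as a placeholder name.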

Attribution in Software and Derived Works

For software applications, attribution may be included in:

  • Project documentation or README files
  • An “About” or “Credits” screen
  • License or NOTICE files

For academic or research use, attribution should appear in:

  • Papers, footnotes, or bibliographies
  • Dataset citations

Rationale

The CC BY 4.0 license is intended to encourage broad adoption and reuse of the Nayiri Armenian Text Corpus while ensuring proper attribution to the original work.

Download

The three data stores — one each for the Documents, Authors, and Publications — are provided inside a single ZIP file.

The Documents Data Store is a folder named "data-store" that holds one text file per document.

The Authors and Publications Data Stores are provided as authors.properties and publications.properties files.

Before downloading, consult the Documentation for more information on the file structure, document file format, the individual data stores, and the Nayiri Markup Language.

Download: nayiri-corpus-of-western-armenian-2026-02-25-v2.zip (1.3 MB)

(The 1.3 MB ZIP archive expands to a file structure of about 3.9 MB.)
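As a first sanity check after downloading, the archive can be inspected without extracting it. The following is a minimal sketch under several assumptions, not a reference implementation: it assumes the .properties files are UTF-8 text with simple `key=value` lines, and it locates files by name ("data-store", authors.properties, publications.properties) rather than by an exact path inside the ZIP, since the precise layout is described in the File Structure documentation. The function names `parse_properties` and `summarize_archive` are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath


def parse_properties(text):
    """Parse simple `key=value` lines, skipping blanks and #/! comments.

    Assumption: line continuations and \\uXXXX escapes are not handled.
    """
    entries = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def summarize_archive(zip_path):
    """Return (document member names, authors dict, publications dict)."""
    documents, authors, publications = [], {}, {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = PurePosixPath(info.filename)
            if path.name == "authors.properties":
                # Assumption: the file is encoded as UTF-8.
                authors = parse_properties(archive.read(info).decode("utf-8"))
            elif path.name == "publications.properties":
                publications = parse_properties(archive.read(info).decode("utf-8"))
            elif "data-store" in path.parts:
                # Any file under a "data-store" folder is treated as a document.
                documents.append(info.filename)
    return sorted(documents), authors, publications
```

Calling `summarize_archive("nayiri-corpus-of-western-armenian-2026-02-25-v2.zip")` returns the list of document files plus the raw key/value entries of the two properties files. The number of document files can then be compared against the figures listed under Current Release. The meaning of the individual keys is covered in the Authors Data Store and Publications Data Store documentation.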

Sponsorship

The design, creation, and open-source release of the Corpus of Western Armenian Dataset has been supported by the Calouste Gulbenkian Foundation.

