Transactions of KarRC RAS :: Scientific publications

Karelian Research Centre of RAS
ISSN (print): 1997-3217
ISSN (online): 2312-4504


SCIENTIFIC PUBLICATIONS
T.P. Boyko, N.G. Zaitseva, N.B. Krizhanovskaya, A.A. Krizhanovsky, I.P. Novak, N.A. Pellinen, A.P. Rodionova, E.D. Trubina. The linguistic corpus VepKar is a language refuge for the Baltic-Finnish languages of Karelia // Transactions of Karelian Research Centre of Russian Academy of Science. No 7. Multidisciplinary scientific research at KarRC RAS. 2021. Pp. 100-115
Keywords: Karelian language; Veps language; corpus linguistics; Open corpus of Veps and Karelian languages; corpus manager; inflectional paradigm
Conservation areas are created to protect endangered plant and animal species. In a similar way, large, tagged linguistic corpora covering a wide variety of genres are used to preserve and study both safe and endangered languages. The article describes the history, structure and development of the Open Corpus of the Veps and Karelian languages. The Veps language corpus was created in 2009 under the leadership of Nina Zaitseva. Three Karelian subcorpora (Karelian proper, Livvi and Ludian) were added to the corpus in 2016, and the united linguistic platform was named the Open Corpus of the Veps and Karelian languages (VepKar). The corpus comprises texts and dictionaries stored in a database, together with a computer program (corpus manager) for searching and processing the data. The corpus manager is written in PHP using the Laravel framework, and the data are stored in a MySQL database. The corpus and dictionary data are available online (dictorpus.krc.karelia.ru). The VepKar authors use YouTube and Wikipedia to popularize the corpus. Dictionaries and corpus texts are closely interlinked. The multifunctional dictionaries of the Veps and Karelian languages contain definitions, translations, dialect labels, semantic relations (synonyms, antonyms, etc.), examples of word usage with references to texts, and complete inflectional paradigms. All texts are marked up automatically, and words in a text link to the corresponding meanings in the dictionary entries. The developers continue to add new features to the corpus manager to make the editors' work easier. For example, over the past three years, nominal and verbal inflection rules have been formulated and programmed for all dialects of the Veps language and its newly written variety, as well as for the Livvi-Karelian, North Karelian and Tver newly written varieties of the Karelian language.
As a result, 2.1 million word forms were generated in the VepKar system in a semi-automatic mode. The semantic markup in the corpus consists of 2.1 million links between words in the texts and the meanings of lemmas in the dictionary. Grammatical markup was also added: 1.1 million links between words in the texts and the grammatical features of word forms in the dictionary were established automatically. The multilingual VepKar corpus is divided into subcorpora by language and dialect, and the texts are also classified by style and genre. The corpus has an advanced search system that filters texts by language, style and dialect, by informant, collector or author, and by year of recording or year of publication. Lemmas can be searched by dialect, part of speech, grammatical features, and even lexical-semantic category. These categories became available after the data of the Comparative and Onomasiological Dictionary of the Dialects of the Karelian, Veps and Sami Languages were integrated into the dictionary part of VepKar. In 2021, the Sanahelmi electronic dictionary for Android phones was created on the basis of VepKar. The authors see the development of mobile applications based on corpus data as a promising direction for future work.
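
The text filtering described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the VepKar implementation, which according to the abstract is a PHP/Laravel application backed by MySQL. The record type CorpusText, the function filter_texts, the field names and the sample records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class CorpusText:
    """Hypothetical text record; field names are assumptions, not VepKar's schema."""
    title: str
    language: str
    dialect: str
    genre: str
    year: int


def filter_texts(
    texts: Iterable[CorpusText],
    language: Optional[str] = None,
    dialect: Optional[str] = None,
    genre: Optional[str] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[CorpusText]:
    """Return the texts that match every criterion that is not None."""
    result = []
    for text in texts:
        if language is not None and text.language != language:
            continue
        if dialect is not None and text.dialect != dialect:
            continue
        if genre is not None and text.genre != genre:
            continue
        if year_from is not None and text.year < year_from:
            continue
        if year_to is not None and text.year > year_to:
            continue
        result.append(text)
    return result


# Invented sample records, used only to exercise the filter.
SAMPLE_TEXTS = [
    CorpusText("Sample text A", "Veps", "Central Veps", "folklore", 1965),
    CorpusText("Sample text B", "Karelian", "Livvi", "fiction", 2012),
    CorpusText("Sample text C", "Karelian", "Ludian", "folklore", 1934),
]

if __name__ == "__main__":
    for text in filter_texts(SAMPLE_TEXTS, language="Karelian", year_from=2000):
        print(text.title)
```

In a database-backed corpus manager the same criteria would more likely be combined into a single query rather than applied in memory; the loop here only makes the "all active filters must match" logic explicit.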
Indexed at RSCI, Google Scholar


  Last modified: June 12, 2022