Transactions of KarRC RAS :: Scientific publications
Karelian Research Centre of RAS
ISSN (print): 1997-3217
ISSN (online): 2312-4504
SCIENTIFIC PUBLICATIONS
Bykov F.Yu., Krizhanovsky A.A.
Search for near-duplicate texts in the linguistic corpus VepKar (original Russian title: Поиск почти похожих текстов в лингвистическом корпусе ВепКар)
// Transactions of the Karelian Research Centre of the Russian Academy of Sciences. No. 4. Mathematical Modeling and Information Technologies series. 2023. Pp. 16-23
Keywords: corpus linguistics; near-duplicate texts; Kendall rank correlation
Developers of linguistic corpora need to detect and eliminate duplicate texts. This article presents an overview of approaches to searching for near-duplicate texts in various corpora. An algorithm and a program for finding near-duplicate texts, based on the number of bigrams that two texts have in common, have been developed. Experiments were carried out with texts from the Veps and Karelian Open Corpus VepKar. The program found the 100 most similar pairs of texts and presented them to an expert, who confirmed 42 of the pairs to be duplicates. Three text similarity metrics were considered. Kendall's rank distance was used to identify the metric whose ranking of text pairs was closest to the expert's judgment. The newly developed program will be a useful tool for editors of the VepKar text corpus.
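The two ideas named in the abstract, counting common bigrams and comparing rankings with Kendall's rank distance, can be illustrated with a minimal Python sketch. This is not the authors' program. It assumes word-level bigrams (the abstract does not say whether word or character bigrams are used), it does not reproduce the three metrics studied in the article, and all function names are hypothetical.

```python
from itertools import combinations


def word_bigrams(text):
    """Return the set of adjacent word pairs in a lowercased text.

    Assumption: whitespace tokenization and word-level bigrams.
    """
    words = text.lower().split()
    return set(zip(words, words[1:]))


def common_bigram_count(a, b):
    """Number of distinct word bigrams shared by two texts."""
    return len(word_bigrams(a) & word_bigrams(b))


def most_similar_pairs(texts, top_n=100):
    """Score every pair of texts by shared-bigram count, highest first.

    Returns up to top_n tuples (score, i, j) with i < j.
    """
    scored = [
        (common_bigram_count(texts[i], texts[j]), i, j)
        for i, j in combinations(range(len(texts)), 2)
    ]
    # Sort by descending score; ties are broken by the pair indices.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:top_n]


def kendall_tau_distance(rank_a, rank_b):
    """Count item pairs that the two rankings order differently.

    Both arguments are lists of the same items, best first.
    """
    pos_b = {item: k for k, item in enumerate(rank_b)}
    discordant = 0
    for x, y in combinations(rank_a, 2):  # x precedes y in rank_a
        if pos_b[x] > pos_b[y]:
            discordant += 1
    return discordant
```

In this sketch, a smaller Kendall distance between a metric's ranking of candidate pairs and an expert's ranking would indicate closer agreement. Scoring all pairs is quadratic in the number of texts, which is acceptable for illustration but may need indexing or filtering on a large corpus.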
Indexed at RSCI


  Last modified: June 27, 2023