Preprint, 2024

Utilizing Phonetic Similarity for Cross-source and Cross-language Toponym Matching - a Benchmark and Prototype

Research Square, 10.21203/rs.3.rs-4136375/v1

Contributors

Sagi, Tomer (0000-0002-8916-0128) [1]
Zaga, Moran (0000-0002-2197-116X) [2]
Rusinek, Sinai [2]
Fekete, Marcell Richard (0009-0007-5025-7866) [1]
Bjerva, Johannes (0000-0002-9512-0739) [1]
Hose, Katja Hannelore (0000-0001-7025-8099) [3]

Affiliations

[1] Aalborg University
[2] University of Haifa
[3] TU Wien

Abstract

The writings of one ancient civilization often overlap in time and space with those of others. Many of these sources comprise unstructured text in ancient languages, causing scholars studying these civilizations to be siloed, often relying on sources in a single language. Recent efforts to extract structured information from historical scripts into place (toponym) databases and people databases (prosopographies) have followed this pattern, focusing on one civilization or even one scholar. The path to creating a common database runs through aligning names or toponyms between sources in disparate languages that use different scripts. Existing multilingual orthographic (string-based) comparison often relies on transliteration to a common script (Latin/English). Transliteration often creates multiple options and even more confusion. However, when the sources being integrated overlap in space and time, their languages often share a common phonetic background. This commonality may prove beneficial. In this work, we present a benchmark for comparing toponyms from two linguistically and culturally related languages, namely Hebrew and Arabic. The benchmark comprises a set of dataset pairs created from historical sources written in ancient variants of these languages and from a modern dataset curated from Wikidata. We empirically evaluate several toponym comparison approaches over the benchmark: transliteration to a common script, direct transliteration, and phonetic comparison using a common phonetic representation. We discuss the results and the limitations of the various methods and outline future work.
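As a rough illustration of the third approach named in the abstract, phonetic comparison over a common phonetic representation, the sketch below maps a Hebrew and an Arabic spelling of the same place name to a shared set of symbols and scores them with a normalized edit distance. It is not the authors' prototype: the four-letter mapping tables, the function names, and the choice of Levenshtein distance are all assumptions made for this example, and the tables are far too coarse for real data.

```python
# Toy letter-to-phoneme tables (hypothetical, illustration only).
HEBREW_TO_PHONE = {
    "\u05d7": "ħ",  # het
    "\u05d9": "j",  # yod
    "\u05e4": "f",  # pe
    "\u05d4": "h",  # he
}
ARABIC_TO_PHONE = {
    "\u062d": "ħ",  # hah
    "\u064a": "j",  # yeh
    "\u0641": "f",  # feh
    "\u0627": "a",  # alef
}


def to_phonetic(name, table):
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in name)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit distance normalized to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Haifa as spelled in Hebrew and in Arabic, written with escapes to avoid
# bidirectional-text rendering issues.
hebrew_name = "\u05d7\u05d9\u05e4\u05d4"
arabic_name = "\u062d\u064a\u0641\u0627"

score = similarity(
    to_phonetic(hebrew_name, HEBREW_TO_PHONE),
    to_phonetic(arabic_name, ARABIC_TO_PHONE),
)
print(score)
```

Comparing the raw strings directly would give a similarity of zero, because the two scripts share no characters. After mapping, only the final letter differs in this toy example.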

Keywords

Arabic, Wikidata, approach, background, benchmarks, civilization, commonalities, comparison, comparison approach, confusion, cross-source, culture, database, dataset, dataset pairs, extract structural information, historical scripts, historical sources, information, language, limitations, matching, method, multiple options, names, options, orthographically, overlap, pairs, path, patterns, people, people database, phonetic comparison, phonetic representation, phonetic similarity, prototype, relational language, representation, results, scholars, scripts, similarity, source, space, structural information, time, toponym matching, toponyms, transliteration, variants, writing


Data Provider: Digital Science
