spsim

Description

spsim is a spelling similarity measure for identifying cognates across languages. It takes into account the spelling differences that are characteristic of each language pair, as described in (Gomes and Lopes, 2011).

The program learns orthographic transformations that are particular to a language pair from example cognates provided beforehand. Then it takes those learned transformations into consideration when computing the similarity score between pairs of words.

The distribution package contains four Python (*.py) source files.

Python ≥ 3.1 is required. These programs were tested on Linux and Mac OS X.

Usage

Note: In the following examples the dollar sign ($) represents the prompt of the shell in a *NIX-like environment.

To compute spsim for pairs of words:

$ ./spsim.py EXAMPLES_FILE INPUT_FILE

To compute spsim for pairs of phrases:

$ ./phrase_spsim.py EXAMPLES_FILE INPUT_FILE

To compute EDSim (similarity based on the Edit Distance) for pairs of words:

$ ./stringology.py edsim INPUT_FILE

Important: These programs expect the input files to be encoded in UTF-8. Make sure that the LC_ALL environment variable selects a UTF-8 locale (for example, en_US.UTF-8).
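The encoding that Python will pick up from the environment can be checked before running the programs. The sketch below assumes that the scripts rely on the locale's default encoding, which this page suggests but does not state; the helper name is_utf8 is hypothetical and not part of spsim.

```python
import codecs
import locale
import sys


def is_utf8(encoding_name):
    """Return True when encoding_name is an alias of UTF-8 (e.g. 'UTF8')."""
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except (LookupError, TypeError):
        # Unknown encoding name, or no encoding reported at all (None).
        return False


if __name__ == "__main__":
    checks = [
        ("locale", locale.getpreferredencoding(False)),
        ("stdin", getattr(sys.stdin, "encoding", None)),
        ("stdout", getattr(sys.stdout, "encoding", None)),
    ]
    for label, name in checks:
        status = "OK" if is_utf8(name) else "not UTF-8"
        print("%s: %s (%s)" % (label, name, status))
```

If any line reports "not UTF-8", exporting a UTF-8 locale in LC_ALL before running the programs should fix it.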

The EXAMPLES_FILE should contain one pair of cognates per line, separated by a tab.

The INPUT_FILE must contain two words (or two phrases in the case of phrase_spsim.py) per line, separated by a tab, just like the EXAMPLES_FILE. (You may give the minus character (-) as INPUT_FILE, in which case the programs read from the standard input.)
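A minimal sketch of a reader for this two-column format follows. The names parse_pairs and read_pairs are illustrative and are not taken from spsim; skipping blank lines and rejecting lines with a different number of fields are assumptions about reasonable behaviour, not documented features.

```python
import sys


def parse_pairs(lines):
    """Yield (left, right) tuples from lines holding two tab-separated fields."""
    for number, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("line %d: expected 2 tab-separated fields, got %d"
                             % (number, len(fields)))
        yield fields[0], fields[1]


def read_pairs(path):
    """Read all pairs from a UTF-8 file, or from standard input when path is '-'."""
    if path == "-":
        # Standard input is decoded according to the locale (see LC_ALL above).
        return list(parse_pairs(sys.stdin))
    with open(path, encoding="utf-8") as handle:
        return list(parse_pairs(handle))
```

Keeping the parsing separate from the file handling makes it possible to feed the same function from a file, from standard input, or from a list of strings.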

Important: The words of each language must always appear in the same column (in both the EXAMPLES_FILE and the INPUT_FILE).

Important: All the examples should be pairs of words that you know for sure to be cognates. In other words, don't use cognates that were automatically extracted by some program as examples to be learned by spsim.

The programs write to the standard output. Each line of the output contains one pair of words (or phrases) followed by the respective similarity score in the third column (columns are separated by tabs, as in the input).

The spsim score is a floating point value ranging from 0.0 (the words have completely different orthographies) to 1.0 (the words have very similar orthographies).
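Since the output is plain tab-separated text, it can be post-processed with a few lines of Python. The sketch below parses the three columns and keeps the pairs whose score reaches a cutoff. The function names and the 0.5 cutoff are hypothetical choices for illustration; this page does not recommend any particular threshold.

```python
def parse_scored(lines):
    """Yield (left, right, score) from three-column, tab-separated lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        left, right, score = line.split("\t")
        yield left, right, float(score)


def likely_cognates(lines, threshold):
    """Return the pairs whose score is at least threshold."""
    return [(left, right)
            for left, right, score in parse_scored(lines)
            if score >= threshold]


# Two output lines in the documented format (EDSim scores from the example
# section of this page); 0.5 is an arbitrary cutoff.
output = ["pharmacy\tfarmácia\t0.375\n", "arithmetic\taritmética\t0.7\n"]
print(likely_cognates(output, 0.5))
```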

Example

If you have already downloaded spsim then you may try this yourself:

$ wget http://research.variancia.com/spsim/examples_enpt.txt
$ cat examples_enpt.txt
alcohol     álcool
alpha       alfa
anomaly     anomalia
mathematics matemática
methodology metodologia
metric      métrica
morphine    morfina
photos      fotos

$ wget http://research.variancia.com/spsim/maybe_enpt.txt
$ cat maybe_enpt.txt
pharmacy    farmácia
arithmetic  aritmética

$ ./spsim.py examples_enpt.txt maybe_enpt.txt
pharmacy    farmácia    1.0
arithmetic  aritmética  1.0

Compare the above with the similarity based on the Edit Distance:

$ ./stringology.py edsim maybe_enpt.txt
pharmacy    farmácia    0.375
arithmetic  aritmética  0.7
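
Both scores are consistent with EDSim being one minus the Levenshtein edit distance divided by the length of the longer word: 5 edits over 8 characters for the first pair, 3 edits over 10 characters for the second. That definition is inferred from the numbers above rather than taken from stringology.py, so the following is only a sketch under that assumption, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def edsim(a, b):
    """Assumed definition: 1 - edit_distance / length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return (longest - edit_distance(a, b)) / longest


print(edsim("pharmacy", "farmácia"))      # 0.375
print(edsim("arithmetic", "aritmética"))  # 0.7
```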

Note: You may find it odd that spsim assigns 1.0 to pairs of words that are not spelled exactly the same. You should interpret the score given by spsim as the similarity between the two words or phrases, ignoring any substitutions that spsim has already learned from the examples.
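
To make this interpretation concrete, here is a deliberately simplified toy model. It is not the algorithm from the paper, whose details are not given on this page: it aligns each example pair with a minimum edit distance alignment, remembers the differing substrings as known substitutions, and later charges edit cost only for differences it has not seen. The names edit_spans, learn and toy_spsim are hypothetical.

```python
def edit_spans(a, b):
    """Return the (a_part, b_part) substrings where a and b differ,
    according to a minimum edit distance alignment."""
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between the suffixes a[i:] and b[j:].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n or j == m:
                dist[i][j] = (n - i) + (m - j)
            else:
                dist[i][j] = min(dist[i + 1][j + 1] + (a[i] != b[j]),
                                 dist[i + 1][j] + 1,
                                 dist[i][j + 1] + 1)
    spans, x, y = [], "", ""
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and a[i] == b[j]:
            if x or y:  # a match closes the current differing span
                spans.append((x, y))
                x = y = ""
            i, j = i + 1, j + 1
        elif i < n and j < m and dist[i][j] == dist[i + 1][j + 1] + 1:
            x, y, i, j = x + a[i], y + b[j], i + 1, j + 1  # substitution
        elif i < n and dist[i][j] == dist[i + 1][j] + 1:
            x, i = x + a[i], i + 1                         # deletion
        else:
            y, j = y + b[j], j + 1                         # insertion
    if x or y:
        spans.append((x, y))
    return spans


def learn(examples):
    """Collect the differing substring pairs seen in known cognates."""
    known = set()
    for a, b in examples:
        known.update(edit_spans(a, b))
    return known


def toy_spsim(a, b, known):
    """Like an edit-distance similarity, but learned differences cost nothing."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    cost = sum(max(len(x), len(y))
               for x, y in edit_spans(a, b) if (x, y) not in known)
    return (longest - cost) / longest


EXAMPLES = [("alcohol", "álcool"), ("alpha", "alfa"), ("anomaly", "anomalia"),
            ("mathematics", "matemática"), ("methodology", "metodologia"),
            ("metric", "métrica"), ("morphine", "morfina"), ("photos", "fotos")]
known = learn(EXAMPLES)
print(toy_spsim("pharmacy", "farmácia", known))  # 1.0
print(toy_spsim("pharmacy", "farmácia", set()))  # 0.375
```

In this toy model the pair pharmacy/farmácia differs only by "ph" to "f", "a" to "á" and "y" to "ia", all of which also occur in the examples, so nothing is left to penalise. With no learned substitutions the score falls back to the edit-distance similarity.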

Download

A draft implementation is available (requires Python ≥ 3.1).

New: Also available on PyPI.

Bibliography

If you use this similarity measure in your work, please cite the following paper:

Measuring Spelling Similarity for Cognate Identification
Luís Gomes and José Gabriel Pereira Lopes
in Progress in Artificial Intelligence, 15th Portuguese Conference in Artificial Intelligence, EPIA 2011, Lisboa, Springer Berlin / Heidelberg, LNCS 7026, pp. 624-633, October 2011