spsim is a spelling similarity measure for identifying cognates across languages. It takes into account the spelling differences that are characteristic of each language pair, as described in (Gomes and Lopes, 2011).
The program learns orthographic transformations that are particular to a language pair from example cognates provided beforehand. Then it takes those learned transformations into consideration when computing the similarity score between pairs of words.
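The idea can be illustrated with a minimal sketch. This is not the code shipped in spsim.py: it assumes that a "transformation" is simply a pair of differing segments taken from an edit-distance alignment, and it ignores any contextual information the real implementation may record. The names `diff_segments`, `learn_substitutions` and `spsim_sketch` are hypothetical. The example pairs are the ones from examples_enpt.txt shown later in this document.

```python
import unicodedata


def diff_segments(a, b):
    """Align a and b by edit distance; return (src, dst, cost) for each
    contiguous run of differences, in left-to-right order."""
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    # Trace back from the end, merging adjacent edits into one segment.
    segments, src, dst, cost = [], "", "", 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            if cost:  # a matching character closes the open segment
                segments.append((src, dst, cost))
                src, dst, cost = "", "", 0
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            src, dst, cost = a[i - 1] + src, b[j - 1] + dst, cost + 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            src, cost = a[i - 1] + src, cost + 1
            i -= 1
        else:
            dst, cost = b[j - 1] + dst, cost + 1
            j -= 1
    if cost:
        segments.append((src, dst, cost))
    segments.reverse()
    return segments


def nfc(word):
    """Normalize so that accented letters count as single characters."""
    return unicodedata.normalize("NFC", word)


def learn_substitutions(examples):
    """Collect every (src, dst) difference seen in the example cognates."""
    learned = set()
    for a, b in examples:
        for src, dst, _ in diff_segments(nfc(a), nfc(b)):
            learned.add((src, dst))
    return learned


def spsim_sketch(a, b, learned):
    """Edit-distance similarity that does not penalize learned differences."""
    a, b = nfc(a), nfc(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    penalty = sum(cost for src, dst, cost in diff_segments(a, b)
                  if (src, dst) not in learned)
    return (longest - penalty) / longest


EXAMPLES = [("alcohol", "álcool"), ("alpha", "alfa"), ("anomaly", "anomalia"),
            ("mathematics", "matemática"), ("methodology", "metodologia"),
            ("metric", "métrica"), ("morphine", "morfina"), ("photos", "fotos")]

learned = learn_substitutions(EXAMPLES)
print(spsim_sketch("pharmacy", "farmácia", learned))      # 1.0
print(spsim_sketch("arithmetic", "aritmética", learned))  # 1.0
```

With an empty set of learned substitutions the sketch reduces to a plain edit-distance similarity, which is why pairs such as pharmacy/farmácia score much lower without the examples.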
There are four Python (*.py) source files in the distribution package.
Python ≥ 3.1 is required. These programs were tested on Linux and Mac OS X.
Note: In the following examples the dollar sign ($) represents the prompt of the shell in a *NIX-like environment.
To compute spsim for pairs of words:
$ ./spsim.py EXAMPLES_FILE INPUT_FILE
To compute spsim for pairs of phrases:
$ ./phrase_spsim.py EXAMPLES_FILE INPUT_FILE
To compute EDSim (similarity based on the Edit Distance) for pairs of words:
$ ./stringology.py edsim INPUT_FILE
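For reference, here is a minimal sketch of an edit-distance similarity. It assumes that EDSim is one minus the Levenshtein distance divided by the length of the longer word; this definition is an assumption, not taken from stringology.py, although it reproduces the two EDSim scores shown in the comparison example later in this document. The function names are hypothetical.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def edsim(a, b):
    """Similarity in [0.0, 1.0]; 1.0 means identical spelling."""
    # Normalize so that accented letters count as single characters.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return (longest - edit_distance(a, b)) / longest


print(edsim("pharmacy", "farmácia"))      # 0.375
print(edsim("arithmetic", "aritmética"))  # 0.7
```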
Important: These programs expect the input files to be encoded in UTF-8. Make sure that the LC_ALL environment variable is set to a UTF-8 locale (for example, en_US.UTF-8).
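If you are unsure which encoding Python picks up from your environment, a short check like the following may help. It is only a sketch and is not part of the spsim distribution; the helper name `is_utf8` is hypothetical.

```python
import codecs
import locale
import sys


def is_utf8(encoding_name):
    """Return True if encoding_name is a known alias of UTF-8."""
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except (LookupError, TypeError):
        # Unknown encoding name, or no encoding reported at all.
        return False


# Encoding derived from the locale settings (LC_ALL, LC_CTYPE, LANG).
locale_encoding = locale.getpreferredencoding(False)
# Encodings Python uses for standard input and standard output.
stdin_encoding = getattr(sys.stdin, "encoding", None)
stdout_encoding = getattr(sys.stdout, "encoding", None)

for label, name in [("locale", locale_encoding),
                    ("stdin", stdin_encoding),
                    ("stdout", stdout_encoding)]:
    status = "ok" if is_utf8(name) else "NOT UTF-8"
    print("{}: {} ({})".format(label, name, status))
```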
The EXAMPLES_FILE should contain one pair of cognates per line, separated by a tab.
The INPUT_FILE must contain two words (or two phrases in the case of phrase_spsim.py) per line, separated by a tab, just like the EXAMPLES_FILE. (You may give the minus character (-) as INPUT_FILE, in which case the programs read from the standard input.)
Important: The words of one language must always appear in the same column (in both the EXAMPLES_FILE and the INPUT_FILE).
Important: All the examples should be pairs of words that you know for sure to be cognates. In other words, don't use cognates that were automatically extracted by some program as examples to be learned by spsim.
The programs write to the standard output. Each line of the output contains one pair of words (or phrases) and the respective spsim score in the third column (columns are separated by tabs, as in the input).
The spsim score is a floating point value between 0.0 (meaning that the words have completely different orthographies) and 1.0 (meaning that their orthographies are very similar).
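The input and output formats described above can be sketched as follows. This is not the code used by the spsim programs: the function names are hypothetical, skipping blank lines is an assumption, and `placeholder_score` merely stands in for a real similarity function.

```python
import io
import sys


def open_input(path):
    """Open INPUT_FILE as UTF-8 text; '-' means the standard input."""
    if path == "-":
        return sys.stdin
    return open(path, encoding="utf-8")


def read_pairs(lines):
    """Yield (left, right) from lines holding two tab-separated columns."""
    for number, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError("line {}: expected 2 columns, got {}".format(
                number, len(columns)))
        yield columns[0], columns[1]


def write_scores(pairs, score, out):
    """Write one 'left<TAB>right<TAB>score' line per pair."""
    for left, right in pairs:
        out.write("{}\t{}\t{}\n".format(left, right, score(left, right)))


def placeholder_score(left, right):
    """Stand-in for a real similarity function: 1.0 only if identical."""
    return 1.0 if left == right else 0.0


sample = io.StringIO("fotos\tfotos\npharmacy\tfarmácia\n")
write_scores(read_pairs(sample), placeholder_score, sys.stdout)
```

Keeping the parsing separate from the scoring makes it easy to swap in a different similarity function without touching the file handling.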
If you have already downloaded spsim then you may try this yourself:
$ wget http://research.variancia.com/spsim/examples_enpt.txt
$ cat examples_enpt.txt
alcohol	álcool
alpha	alfa
anomaly	anomalia
mathematics	matemática
methodology	metodologia
metric	métrica
morphine	morfina
photos	fotos
$ wget http://research.variancia.com/spsim/maybe_enpt.txt
$ cat maybe_enpt.txt
pharmacy	farmácia
arithmetic	aritmética
$ ./spsim.py examples_enpt.txt maybe_enpt.txt
pharmacy	farmácia	1.0
arithmetic	aritmética	1.0
Compare the above with the similarity based on the Edit Distance:
$ ./stringology.py edsim maybe_enpt.txt
pharmacy	farmácia	0.375
arithmetic	aritmética	0.7
Note: You may find it odd that spsim assigns 1.0 to pairs of words that are not spelled exactly the same. You should interpret the score given by spsim as the similarity between the two words or phrases, ignoring any substitutions that spsim has already learned from the examples.
A draft implementation is available (requires Python≥3).
New: Also available on PyPI.
If you use this similarity measure in your work, please cite the following paper:
Luís Gomes and José Gabriel Pereira Lopes. Measuring Spelling Similarity for Cognate Identification. In Progress in Artificial Intelligence, 15th Portuguese Conference in Artificial Intelligence, EPIA 2011, Lisboa, Springer Berlin / Heidelberg, LNCS 7026, pp. 624-633, October 2011.
Luís Gomes luismsgomes@gmail.com http://research.variancia.com/ updated 23 June 2010