unl-aligner

Description

This program implements the algorithm described in (Ildefonso and Lopes, 2005) for aligning parallel texts using the Longest Increasing Subsequence. That method has evolved from (Ribeiro, Dias, Lopes, and Mexia, 2001). You can find a brief description of these methods in section 2.2.2 of (Gomes, 2009).

Please note that this is not an implementation of my own method for parallel texts alignment.

This implementation diverges from the paper (Ildefonso and Lopes, 2005) in one minor detail: whenever there are multiple Increasing Subsequences with the same length, this implementation chooses the leftmost, while the paper prescribes selecting the rightmost. I may change this in the future to make the implementation fully compliant.
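
As a rough illustration of the Longest Increasing Subsequence idea mentioned above, here is a minimal sketch. It is not the code shipped in unl-aligner. It assumes that candidate correspondences have already been reduced to a list of token positions in one text, ordered by their position in the other text. The function name is hypothetical, and when several subsequences share the maximal length, the one returned is simply whatever this particular algorithm yields, which does not necessarily match either the paper or the program.

```python
from bisect import bisect_left

def longest_increasing_subsequence(values):
    """Return one strictly increasing subsequence of maximal length.

    Illustrative sketch only: the name and the tie-breaking behaviour
    are assumptions, not taken from unl-aligner or from the paper.
    """
    tail_index = []   # tail_index[k]: index of the tail of the best run of length k+1
    tail_value = []   # tail_value[k]: values[tail_index[k]], kept sorted for bisect
    previous = [None] * len(values)  # back-pointers used to rebuild the result

    for i, v in enumerate(values):
        k = bisect_left(tail_value, v)  # first run whose tail is >= v
        if k == len(tail_value):
            tail_index.append(i)
            tail_value.append(v)
        else:
            tail_index[k] = i
            tail_value[k] = v
        previous[i] = tail_index[k - 1] if k > 0 else None

    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tail_index[-1] if tail_index else None
    while i is not None:
        result.append(values[i])
        i = previous[i]
    result.reverse()
    return result
```

Positions that break the increasing order (crossing correspondences) are left out of the returned subsequence.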

Usage

Command syntax:

./unl-aligner.py TEXT_X TEXT_Y MINCPL MINSIM MAXREC

The parameters TEXT_X and TEXT_Y are the texts to be aligned. Tokenizing the texts is your responsibility. The program assumes that tokens are separated by spaces.
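
Since tokenization is left to the user, a small pre-processing step may help. The sketch below is only one possible approach and is not part of unl-aligner. The function name pretokenize and the set of punctuation marks are assumptions. It will also split marks inside numbers such as 3.6, so adapt it to your texts.

```python
import re

def pretokenize(text):
    """Separate common punctuation marks from words with spaces.

    Hypothetical helper, not part of unl-aligner. Line breaks are kept.
    """
    # Surround each punctuation mark with spaces.
    spaced = re.sub(r"([.,:;!?])", r" \1 ", text)
    # Collapse runs of whitespace within each line to a single space.
    lines = [" ".join(line.split()) for line in spaced.split("\n")]
    return "\n".join(lines)
```

For instance, "year, nearly" becomes "year , nearly", so the comma is seen as a token of its own.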

MINCPL is a positive integer. It is the minimum common prefix length that a pair of words must have to be considered possible cognates.

MINSIM is a decimal value between 0 and 1. It is the minimum spelling similarity that a pair of words must have to be considered possible cognates.

MAXREC is the maximum depth of recursion allowed.
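
To make the MINCPL and MINSIM thresholds more concrete, here is a minimal sketch of such a cognate test. It is not the program's actual code. The function names are hypothetical, it assumes that both thresholds must be met, and it uses difflib.SequenceMatcher.ratio() from the Python standard library as a stand-in for the spelling similarity measure, which is not specified here.

```python
from difflib import SequenceMatcher

def common_prefix_length(word_x, word_y):
    """Count the leading characters shared by both words."""
    length = 0
    for cx, cy in zip(word_x, word_y):
        if cx != cy:
            break
        length += 1
    return length

def is_possible_cognate(word_x, word_y, mincpl, minsim):
    """Hypothetical cognate test (assumes both thresholds must hold)."""
    if common_prefix_length(word_x, word_y) < mincpl:
        return False
    # Stand-in similarity measure; the one used by unl-aligner may differ.
    similarity = SequenceMatcher(None, word_x, word_y).ratio()
    return similarity >= minsim
```

With the values from the example below (MINCPL 3, MINSIM .6), this sketch accepts the pair distance / distância, which shares the prefix "dist", and rejects the pair the / o.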

Example usage:

$ cat English.txt

Our ordinary measures of distance fail us here in the realm of the galaxies .
We need a much larger unit : the light year .
It measures how far light travels in a year, nearly 10 trillion kilometers .

$ cat Portuguese.txt

As nossas habituais medidas de distância falham-nos aqui no reino das galáxias .
Precisamos de uma unidade bastante maior : o ano-luz .
Ele mede a distância que a luz percorre durante um ano , cerca de 10 biliões de quilómetros .

$ ./unl-aligner.py English.txt Portuguese.txt 3 .6 10 | column -s $'\t' -t

Our ordinary measures of              As nossas habituais medidas de            ?
distance                              distância                                 1
fail us here in the realm of the      falham-nos aqui no reino das              ?
galaxies                              galáxias                                  0
.                                     .                                         0
\n                                    \n                                        0
We need a much larger unit            Precisamos de uma unidade bastante maior  ?
:                                     :                                         0
the light year                        o ano-luz                                 ?
.                                     .                                         0
\n                                    \n                                        0
It measures how far light travels in  Ele mede a distância que                  ?
a                                     a                                         0
year, nearly                          luz percorre durante um ano , cerca de    ?
10                                    10                                        0
trillion kilometers                   biliões de quilómetros                    ?
.                                     .                                         0
\n                                    \n                                        0

The third column may contain either an integer value, a question mark (?), or an exclamation mark (!):

Download

unl-aligner-rev6.tar.gz

The program requires Python 3.1.

This code is released under a Creative Commons Attribution 3.0 Unported License.

Bibliography

Longest Sorted Sequence Algorithm for Parallel Text Alignment.
Tiago Ildefonso and José Gabriel Pereira Lopes.
In Computer Aided Systems Theory – EUROCAST 2005, pages 81-90, 2005.
(pdf from Springer)

Cognates Alignment.
António Ribeiro, Gaël Dias, Gabriel Lopes, and João Mexia.
In Machine Translation Summit 2001 (MTS-2001).
(pdf)

Parallel Texts Alignment.
Luís Gomes.
MSc Thesis, Universidade Nova de Lisboa, 2009.
(pdf)