This program implements the LocalMaxs algorithm for extracting multiword units (MWUs) from plain text, described in (Silva and Lopes, 1999).
There are two versions of the algorithm: strict (as described in the paper) and relaxed (slightly higher recall). This program implements the relaxed version, which is usually the better choice because of that recall gain. If you specifically need the strict version, get it here.
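For orientation, the strict selection rule from the paper can be sketched in a few lines of Python. This is only an illustration of the published criterion, not the code of multiwords.py, and the relaxed variant used by this program is not reproduced here. The function name `local_maxs_strict` and the layout of the `glue` dictionary are assumptions made for this example.

```python
def local_maxs_strict(glue):
    """Return the n-grams selected by the strict LocalMaxs rule.

    `glue` (hypothetical layout) maps n-gram tuples of length >= 2
    to their glue value, e.g. {("new", "york"): 0.9, ...}.
    """
    # Highest glue among the (n+1)-grams that contain each n-gram.
    best_super = {}
    for ngram, g in glue.items():
        if len(ngram) > 2:
            for sub in (ngram[:-1], ngram[1:]):
                best_super[sub] = max(best_super.get(sub, float("-inf")), g)

    selected = []
    for ngram, g in glue.items():
        # Must be strictly above every (n+1)-gram that contains it.
        if g <= best_super.get(ngram, float("-inf")):
            continue
        # For n > 2, must also be at least as high as both contained (n-1)-grams.
        if len(ngram) > 2:
            subs = (ngram[:-1], ngram[1:])
            if any(g < glue[s] for s in subs if s in glue):
                continue
        selected.append(ngram)
    return selected


# Toy glue values, invented for illustration only.
toy_glue = {
    ("new", "york"): 0.9,
    ("york", "city"): 0.4,
    ("in", "new"): 0.1,
    ("new", "york", "city"): 0.5,
    ("in", "new", "york"): 0.2,
}
print(local_maxs_strict(toy_glue))  # [('new', 'york')]
```

In this sketch an n-gram with no longer n-gram in `glue` passes the first test trivially, so the longest n-grams in the table are not reliably judged.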
./multiwords.py dice|scp MAXN TEXTFILE OUTPUTDIR
dice|scp selects the "glue" function used to score n-grams: dice (Dice coefficient) or scp (Symmetrical Conditional Probability)
MAXN is an integer ≥ 2
TEXTFILE is the corpus file, previously tokenized and lowercased
OUTPUTDIR is the name of a directory (it will be created if it doesn't exist) where the program writes temporary and output files. The output files will be named OUTPUTDIR/Nmwus.txt, N being 2, 3, ..., MAXN.
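The expected format of TEXTFILE is not specified beyond "tokenized and lowercased". The following is a minimal sketch of preparing such a file, assuming that tokens separated by single spaces are acceptable. The regex tokenizer is a simple stand-in chosen for illustration, and the function names are hypothetical.

```python
import re


def tokenize_lowercase(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def prepare_corpus(raw_path, out_path):
    """Write a tokenized, lowercased copy of raw_path, line by line."""
    with open(raw_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(tokenize_lowercase(line)) + "\n")
```

For instance, `prepare_corpus("raw.txt", "corpus.txt")` would produce a file that could then be passed as TEXTFILE. A real corpus may need a language-specific tokenizer instead.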
For example, this command extracts bigrams and trigrams from the given corpus, using scp as the "glue" function:
./multiwords.py scp 3 corpus.txt results
The output files will be results/2mwus.txt and results/3mwus.txt.
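The two glue options can be sketched as follows, based on the fair dispersion normalization from the cited paper: the glue of an n-gram compares its own frequency with the average over all ways of splitting it into a left and a right part. This is an illustration, not the program's own code. The helper names, the in-memory `Counter`, and the estimate of probabilities as count divided by the number of tokens are assumptions.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp(ngram, counts, total):
    """SCP with fair dispersion: p(W)^2 over the average of p(left) * p(right)."""
    n = len(ngram)
    p = lambda g: counts[g] / total
    avg = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / avg


def dice(ngram, counts):
    """Dice with fair dispersion: 2 * f(W) over the average of f(left) + f(right)."""
    n = len(ngram)
    avg = sum(counts[ngram[:i]] + counts[ngram[i:]] for i in range(1, n)) / (n - 1)
    return 2 * counts[ngram] / avg


tokens = "a b a b c".split()
counts = ngram_counts(tokens, 3)
print(scp(("a", "b"), counts, len(tokens)))
print(dice(("b", "c"), counts))
```

Both functions should only be called on n-grams that occur in the corpus, otherwise the averages in the denominators can be zero.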
This code is released under a Creative Commons Attribution 3.0 Unported License.
A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora.
Joaquim Ferreira da Silva and José Gabriel Pereira Lopes.
In Proceedings of the Sixth Meeting on Mathematics of Language (MOL6), Orlando, Florida, July 23-25, 1999, pp. 369-381.
updated 28 June 2011