multiwords

Description

NOTE: The programs on this page are obsolete. Use multiwords2 instead.

These programs implement the LocalMaxs algorithm for extracting multiword units (MWUs) from plain text, described in (Silva and Lopes, 1999). There are two versions of the algorithm: strict and relaxed. (The paper describes the strict version.)
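
To make the selection rule concrete, here is a minimal sketch of the strict criterion as I read it in the paper: an n-gram is kept when its glue is at least as high as that of every (n-1)-gram it contains and strictly higher than that of every (n+1)-gram that contains it. This is not the code shipped in the archives. The function name strict_localmaxs, the dictionary input, and the choice to treat the longest n-grams in the table only as context (never as candidates) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Ngram = Tuple[str, ...]


def strict_localmaxs(glue: Dict[Ngram, float]) -> List[Ngram]:
    """Return the n-grams whose glue is a local maximum (strict rule).

    `glue` maps every observed n-gram (n >= 2) to its glue value.
    Assumption: n-grams of the longest size present are used only as
    successors of shorter ones and are never selected themselves.
    """
    longest = max(len(gram) for gram in glue)

    # Highest glue among the (n+1)-grams that extend each n-gram
    # by one word on the left or on the right.
    best_successor: Dict[Ngram, float] = {}
    for gram, score in glue.items():
        for sub in (gram[:-1], gram[1:]):
            if score > best_successor.get(sub, float("-inf")):
                best_successor[sub] = score

    selected = []
    for gram, score in glue.items():
        if len(gram) == longest:
            continue
        # Antecedents are the two (n-1)-grams inside the n-gram.
        # Bigrams have none, because single words carry no glue.
        antecedents = [glue[sub] for sub in (gram[:-1], gram[1:]) if sub in glue]
        beats_antecedents = all(score >= a for a in antecedents)
        beats_successors = score > best_successor.get(gram, float("-inf"))
        if beats_antecedents and beats_successors:
            selected.append(gram)
    return selected


# Made-up glue values, only to show the rule at work.
example = {
    ("new", "york"): 0.9,
    ("york", "city"): 0.4,
    ("new", "york", "city"): 0.5,
}
print(strict_localmaxs(example))  # [('new', 'york')]
```

In this toy table "new york" is kept because its glue exceeds that of the only trigram containing it, while "york city" is rejected because the trigram scores higher.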

There are two implementations of strict LocalMaxs:

So far I have only a simple implementation of the relaxed LocalMaxs, and it cannot handle large corpora. You can find it in the archive multiwords-relaxed-rev*.tgz.

Which one should I use?
If your corpus is large, the bigcorpus version may be the only one able to cope with it. Otherwise, choose the relaxed version for its greater recall or the strict version for its greater precision.

Usage

Command syntax:

./multiwords.py dice|scp MAXN

For example, this command extracts bigrams and trigrams from corpus.txt, using scp as the "glue" function:

./multiwords.py scp 3 < corpus.txt > mwus.txt
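
For readers unfamiliar with the two glue options, the following sketch shows how scp and dice can be computed from raw n-gram counts, using the fair dispersion normalization from the paper (averaging over every way of splitting the n-gram into a left and a right part). It is an illustration based on my reading of the paper, not the code in multiwords.py. The helper names count_ngrams, scp and dice are hypothetical, and the scp version assumes all probabilities are estimated with the same corpus size, so that the size cancels out.

```python
from collections import Counter
from typing import Sequence, Tuple

Ngram = Tuple[str, ...]


def count_ngrams(tokens: Sequence[str], max_n: int) -> Counter:
    """Count every contiguous n-gram of length 1..max_n."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp(gram: Ngram, counts: Counter) -> float:
    """SCP with fair dispersion: f(W)^2 / mean(f(left) * f(right)).

    Assumption: every probability is count / N with the same N,
    so N cancels and plain counts can be used.
    """
    n = len(gram)
    avg = sum(counts[gram[:i]] * counts[gram[i:]] for i in range(1, n)) / (n - 1)
    return counts[gram] ** 2 / avg


def dice(gram: Ngram, counts: Counter) -> float:
    """Dice with fair dispersion: 2 * f(W) / mean(f(left) + f(right))."""
    n = len(gram)
    avg = sum(counts[gram[:i]] + counts[gram[i:]] for i in range(1, n)) / (n - 1)
    return 2 * counts[gram] / avg


tokens = "new york is new york".split()
counts = count_ngrams(tokens, 3)
print(scp(("new", "york"), counts))   # 1.0
print(scp(("york", "is"), counts))    # 0.5
print(dice(("new", "york"), counts))  # 1.0
```

Both functions expect an n-gram of at least two words that actually occurs in the counted text. Otherwise the averaged denominator can be zero.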

Download

This code is released under a Creative Commons Attribution 3.0 Unported License.

Bibliography

A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora.
Joaquim Ferreira da Silva and José Gabriel Pereira Lopes.
In Proceedings of the Sixth Meeting on Mathematics of Language (MOL6), Orlando, Florida, July 23-25, 1999, pp. 369-381.