czech_stemmer

Description

A stemmer for Czech implemented in Python.

I ported the algorithm from the Java implementation by Ljiljana Dolamic, University of Neuchâtel.

The file (czech_stemmer.py) may be used as a standalone program or as a module. When used as a program, it reads text from stdin and writes the stemmed text to stdout. Examples:

$ echo "listina základních práv evropské unie" | ./czech_stemmer.py light
list základn práv evropsk uni
$ echo "listina základních práv evropské unie" | ./czech_stemmer.py aggressive
lis základ práv evrops uni

Note: the input should be tokenized, i.e., words should be separated from punctuation by whitespace. Multiple spaces between words are not preserved in the output.

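
One way to prepare untokenized text is sketched below. The `tokenize` helper is hypothetical and is not part of czech_stemmer.py; it simply puts whitespace between runs of word characters and individual punctuation marks, which is the input shape the note above asks for. It is a rough approximation: hyphenated words, abbreviations and the like are split at every punctuation mark.

```python
import re

# Hypothetical pre-processing helper; NOT part of czech_stemmer.py.
# A token is either a run of word characters (Unicode-aware in Python 3,
# so Czech letters such as "á" or "í" are included) or one punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return text with words and punctuation separated by single spaces."""
    return " ".join(_TOKEN.findall(text))


if __name__ == "__main__":
    print(tokenize("Listina základních práv, Evropské unie."))
    # prints: Listina základních práv , Evropské unie .
```

The result can then be piped to the stemmer as in the examples above.
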
Usage

The program reads from stdin and writes to stdout.
Command syntax:

$ ./czech_stemmer.py MODE < input.txt > output.txt

MODE is either light (may understem some words) or aggressive (may overstem some words).
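
The command syntax above can also be driven from another Python script. The following is only a sketch under stated assumptions: the `stem_text` function and the `MODES` tuple are hypothetical names, the default script path is taken from the examples, and UTF-8 is assumed for both input and output (the actual encoding may depend on your locale). It uses nothing from the stemmer beyond the documented command line.

```python
import subprocess

# The two modes documented for the command line.
MODES = ("light", "aggressive")


def stem_text(text, mode="light", script="./czech_stemmer.py"):
    """Run the stemmer as a subprocess and return the stemmed text.

    Hypothetical wrapper: `script` is assumed to be the path to an
    executable czech_stemmer.py, and UTF-8 I/O is assumed.
    """
    if mode not in MODES:
        raise ValueError("mode must be 'light' or 'aggressive', got %r" % (mode,))
    # Equivalent to: ./czech_stemmer.py MODE < input > output
    proc = subprocess.Popen(
        [script, mode],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    out, _ = proc.communicate(text.encode("utf-8"))
    if proc.returncode != 0:
        raise RuntimeError("stemmer exited with status %d" % proc.returncode)
    return out.decode("utf-8")
```

For instance, `stem_text("listina základních práv evropské unie\n", "aggressive")` should return the same stems as the aggressive example earlier in this document, provided the script is present and executable at the given path. The mode is validated before the process is started, so a mistyped mode fails early with a `ValueError`.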

Download

czech_stemmer_rev0.tar.gz

The program requires Python ≥ 3.1. Tested on Linux.

This code is released under a Creative Commons Attribution 3.0 Unported License.