Lucene Analyzer, Tokenizer and TokenFilter

Lucene is a powerful and widely used search framework, and I use it heavily in the Web 2.0 product I am currently building together with netoCiety. In this post I explain what Lucene’s Analyzer, Tokenizer and TokenFilter are and how you can create a custom Analyzer. I also show how to create an Analyzer for HTML documents and discuss Analyzers for different languages.

When you create a search index, you have documents with fields that contain text. For instance, if you create an index of HTML documents, the fields can be the title, description and body of the HTML page. The index does not store the text as one block but as separate tokens. Take the text “This is a test”: when it is added to the index, only the individual words (tokens) are stored. Splitting the text into tokens is the job of the Analyzer. For instance, using the WhitespaceAnalyzer on the test text above results in the following index structure.

luke-1.png

For browsing the search index I use Luke. In the table in the bottom right corner you see the individual tokens.
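
To make the idea of tokens concrete without depending on Lucene, here is a minimal sketch in plain Java that splits the sample text at whitespace and changes nothing else. The class and method names are made up for illustration; this is not the WhitespaceAnalyzer implementation, only an approximation of what it produces for this input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene.
final class WhitespaceSplitSketch {

    // Splits on runs of whitespace; tokens keep their original case.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty()) // empty input yields no tokens
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a test")); // prints [This, is, a, test]
    }
}
```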

Usually the Analyzer first builds a Tokenizer, which breaks the entire text string into raw tokens, and then applies one or more TokenFilters to the output of the Tokenizer. To demonstrate this, let’s create a simple Analyzer that first builds a StandardTokenizer, which splits the text into words at whitespace and most punctuation. It then applies a LengthTokenFilter that removes tokens shorter than 3 characters and a LowerCaseFilter that stores all tokens in lower case. For the previous example, the resulting index looks like this.

luke-2.png

As you can see, the tokens “a” and “is”, which are shorter than 3 characters, are dropped, and the term “This” has been changed to lower case.
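
The same three stages can be sketched in a few lines of plain Java, which may help to see the order of operations before looking at the Lucene classes. This is only an illustration under simplifying assumptions: the names are hypothetical, and splitting at whitespace stands in for the StandardTokenizer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical stand-in for the tokenizer/filter chain, not Lucene code.
final class FilterChainSketch {

    static List<String> analyze(String text, int minLength) {
        return Arrays.stream(text.trim().split("\\s+"))      // stage 1: raw tokens
                .filter(token -> token.length() >= minLength) // stage 2: drop short tokens
                .map(token -> token.toLowerCase(Locale.ROOT)) // stage 3: lower-case the rest
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is a test", 3)); // prints [this, test]
    }
}
```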

Now let’s have a look at the CustomAnalyzer class:

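import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CustomAnalyzer extends Analyzer {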
    public TokenStream reusableTokenStream(
        String fieldName, Reader reader) throws IOException {

        SavedStreams streams =
            (SavedStreams) getPreviousTokenStream();

        if (streams == null) {
            streams = new SavedStreams();
            setPreviousTokenStream(streams);

            streams.tokenizer = new StandardTokenizer(reader);
            streams.stream = new StandardFilter(streams.tokenizer);
            streams.stream = new LengthTokenFilter(streams.stream, 3);
            streams.stream = new LowerCaseFilter(streams.stream);
        } else {
            streams.tokenizer.reset(reader);
        }

        return streams.stream;
    }

    private class SavedStreams {
        Tokenizer tokenizer;
        TokenStream stream;
    }

    public TokenStream tokenStream(
        String fieldName, Reader reader) {

        Tokenizer tokenizer = new StandardTokenizer(reader);
        TokenStream stream = new StandardFilter(tokenizer);
        stream = new LengthTokenFilter(stream, 3);
        stream = new LowerCaseFilter(stream);

        return stream;
    }
}

For performance reasons Lucene tries to re-use as much as possible, which is why it is important to implement the reusableTokenStream method. Lucene programmers usually use a private class to hold the saved TokenStreams. The important part is where the StandardTokenizer is created and then wrapped by the StandardFilter, the LengthTokenFilter and the LowerCaseFilter, in that order.
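
The reuse pattern itself is independent of Lucene. The following sketch shows the idea in plain Java: build the chain once, keep it per thread, and on later calls only point it at the new input. All names are hypothetical, and storing the chain in a ThreadLocal is an assumption made for this sketch rather than a description of the Analyzer base class.

```java
// Hypothetical illustration of "build once, reset on reuse"; not Lucene code.
final class ReusableChainSketch {

    // Stand-in for a tokenizer plus filters that can be pointed at new input.
    static final class Chain {
        private String[] words = new String[0];
        private int pos;

        void reset(String text) {
            String trimmed = text.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            pos = 0;
        }

        // Returns the next token, or null when the input is exhausted.
        String next() {
            return pos < words.length ? words[pos++] : null;
        }
    }

    // One saved chain per thread, created lazily on first use.
    private static final ThreadLocal<Chain> SAVED = new ThreadLocal<>();

    static Chain reusableChain(String text) {
        Chain chain = SAVED.get();
        if (chain == null) {
            chain = new Chain(); // first call on this thread: build the chain
            SAVED.set(chain);
        }
        chain.reset(text);       // every call: attach the new input
        return chain;
    }

    public static void main(String[] args) {
        Chain first = reusableChain("This is");
        Chain second = reusableChain("a test");
        System.out.println(first == second); // prints true
        System.out.println(second.next());   // prints a
    }
}
```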

I don’t show a custom Tokenizer implementation here, because you usually start with the StandardTokenizer and then apply one or more filters.

Now let’s have a look at the LengthTokenFilter:

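import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class LengthTokenFilter extends TokenFilter {
    // Tokens shorter than this are dropped.
    private int minLength;

    // Public so that Analyzers in other packages can use the filter.
    public LengthTokenFilter(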
        TokenStream input, int minLength) {

        super(input);
        this.minLength = minLength;
    }

    public Token next(Token result)
        throws IOException {

        while ((result = input.next(result)) != null) {
            if (result.termLength() >= minLength) {
                return result;
            }
        }

        return null;
    }
}

The method “next” has to return the next token. It therefore checks whether the length of the token is greater than or equal to the minimum length. If so, it returns the token; otherwise it moves on to the next token. When the input is exhausted it returns null.
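
The same skip loop can be written against java.util.Iterator, which makes it easy to try out without an index. This is a sketch under the assumption that tokens are plain strings; the class name is hypothetical and the code does not use the Lucene TokenFilter API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical iterator wrapper that mirrors the filter's skip loop.
final class MinLengthIteratorSketch implements Iterator<String> {
    private final Iterator<String> input;
    private final int minLength;
    private String pending; // next token that passed the check, or null

    MinLengthIteratorSketch(Iterator<String> input, int minLength) {
        this.input = input;
        this.minLength = minLength;
    }

    @Override
    public boolean hasNext() {
        // Pull from the wrapped iterator until a long enough token shows up.
        while (pending == null && input.hasNext()) {
            String candidate = input.next();
            if (candidate.length() >= minLength) {
                pending = candidate;
            }
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = pending;
        pending = null;
        return result;
    }

    public static void main(String[] args) {
        Iterator<String> tokens = new MinLengthIteratorSketch(
                Arrays.asList("This", "is", "a", "test").iterator(), 3);
        while (tokens.hasNext()) {
            System.out.println(tokens.next()); // prints This, then test
        }
    }
}
```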

To test the functionality I created a small test class that builds the index with the CustomAnalyzer:

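import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.LockObtainFailedException;

public class AnalyzerTest {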
    public static void main(String[] args)
        throws CorruptIndexException,
        LockObtainFailedException, IOException {

        File tmpDir = new File(
            System.getProperty("java.io.tmpdir"));
        File indexDir = new File(tmpDir, "idx");

        IndexWriter indexWriter = new IndexWriter(
            indexDir, new CustomAnalyzer());
        indexWriter.addDocument(getDocument());
        indexWriter.close();
    }

    private static Document getDocument() {
        String text = "This is a test";

        Document document = new Document();
        document.add(new Field("text", text,
            Field.Store.YES,
            Field.Index.TOKENIZED));

        return document;
    }
}

So now we have covered Analyzer, Tokenizer and TokenFilter, and I think it is pretty clear how they work and how to create a custom Analyzer. But quite often you don’t want to index normal text but HTML text. Let’s try to index the following simple HTML with the CustomAnalyzer created above.

<html>
<head>
    <title>Lucene Test</title>
</head>
<body>
    <h1>Lucene Test</h1>
    <p>Simple HTML document</p>
</body>
</html>

The index created looks like this.

luke-3.png

As you can see, not only the text is indexed but also the tags. This is not useful, because nobody searches for the “body” tag. Luckily the Solr team created an HTMLStripReader that wraps a Reader and skips HTML tags. With it you can create an HtmlTokenizer that extends StandardTokenizer, and use it in place of the StandardTokenizer in the CustomAnalyzer created above. The HtmlTokenizer looks like this:

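import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.solr.analysis.HTMLStripReader;

public class HtmlTokenizer extends StandardTokenizer {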
    public HtmlTokenizer(Reader input) {
        super(new HTMLStripReader(input));
    }
}

Then you have the output you wanted:

luke-4.png
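
If you want to see the effect of tag stripping without Solr on the classpath, a deliberately naive sketch in plain Java is enough. It simply replaces anything between angle brackets with a space. This is an assumption-laden stand-in with a hypothetical class name, not how HTMLStripReader works, and it ignores entities, comments and script content.

```java
import java.util.regex.Pattern;

// Naive, hypothetical tag stripper for illustration only.
final class NaiveTagStripSketch {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        // Replace each tag with a space so neighbouring words stay separated,
        // then collapse the resulting runs of whitespace.
        return TAG.matcher(html).replaceAll(" ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Lucene Test</title></head>"
                + "<body><h1>Lucene Test</h1><p>Simple HTML document</p></body></html>";
        System.out.println(stripTags(html));
        // prints Lucene Test Lucene Test Simple HTML document
    }
}
```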

The last topic I discuss is Analyzers for different languages. These kinds of Analyzers not only split the text into tokens but also use stemming to return the base form of a word. Wikipedia writes about stemming: “… Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form. … A stemming algorithm reduces the words “fishing”, “fished”, “fish”, and “fisher” to the root word, “fish”. …”

Lucene provides some Analyzers for specific languages, like the GermanAnalyzer or FrenchAnalyzer, but more interesting is the SnowballAnalyzer, which is based on Snowball, a framework for writing stemming algorithms.

The one thing to be aware of when using Analyzers that stem is that the index only stores the base words. For instance, if a text contains the word “fishing”, the stemming process reduces it to “fish”, and only “fish” is stored in the index. When you then search for “fishing”, you first have to convert the word to its base form “fish”, otherwise you won’t find anything. Therefore the search terms are usually tokenized with the same Analyzer that was used to create the index.
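
The following sketch shows why the query has to go through the same processing as the indexed text. It uses a toy stemmer that only strips “ing” and “ed”, which is an assumption made to keep the example short; it is not the Snowball algorithm, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy example; the stemmer here is far cruder than Snowball.
final class StemmedLookupSketch {

    // Lower-cases the word and strips a trailing "ing" or "ed".
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 5 && w.endsWith("ing")) {
            return w.substring(0, w.length() - 3);
        }
        if (w.length() > 4 && w.endsWith("ed")) {
            return w.substring(0, w.length() - 2);
        }
        return w;
    }

    // "Indexes" a text by storing only the stemmed form of each token.
    static Set<String> index(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(stem(token));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> terms = index("Fishing fished fish");
        System.out.println(terms.contains("fishing"));       // prints false
        System.out.println(terms.contains(stem("fishing"))); // prints true
    }
}
```

Searching with the raw term misses because only “fish” was stored; passing the search term through the same stem step first finds it.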

3 Responses to “Lucene Analyzer, Tokenizer and TokenFilter”

  1. DeeDee Says:

    Very good. Thanks a lot for that

  2. Jon Says:

    One mistake here I think with defining stream twice.

    Tokenizer tokenizer = new StandardTokenizer(reader);
    TokenStream stream = new StandardFilter(tokenizer);
    TokenStream stream = new LengthTokenFilter(stream, 3);
    stream = new LowerCaseFilter(stream);

    Other than that awesome…

  3. markus Says:

    @Jon, you are absolutely correct. Our implementation of the Analyzer looks a bit different, it was a copy/paste error.
