Decent multi-lingual stemmer or analyzer for Lucene / ElasticSearch?

Tag: lucene, elasticsearch, multilingual Author: zxz2997 Date: 2013-06-11

I'm curious if there are generic analyzers which do a decent job of stemming / analyzing text which could be in different languages. For certain tasks, doing proper multi-lingual search (e.g. splitting a field name into name.english, name.french, etc.) seems like overkill.

Is there an analyzer which will strip suffixes (e.g. "dogs" --> "dog") and work for more than just English? I don't really care whether it does language detection, etc., and working on just e.g. Romance and Germanic languages would probably be good enough. Or is the loss of quality serious enough that it's always worth just using language-specific analyzers and language-specific queries?

AFAIK this doesn't exist, and it would be immensely tough to implement, given how widely languages differ in their word-formation rules and semantics.

Other Answer1

Your best bet would be to use the ICU analyzers. They are useful for normalizing but less useful for things like stemming, which is inherently language-specific.

Additionally, it is possible to use a separate language field and apply different analyzers based on the value of that field. So you could combine both approaches: support the languages you care about with specialized analyzers and fall back to the ICU tokenizer for everything else: http://www.elasticsearch.org/guide/reference/mapping/analyzer-field/
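To make the combined approach a bit more concrete, here is a rough Python sketch of the selection logic on the client side. It is only an illustration under some assumptions: the field names "lang" and "analyzer", the analyzer name "icu_fallback", and the helper functions are all made up for this example, and the settings assume the ICU analysis plugin is installed so that icu_tokenizer and icu_folding are available. "english", "french" and "german" are the built-in language analyzers.

```python
# Hypothetical index settings: a custom fallback analyzer built from ICU parts.
# "icu_fallback" is a name invented for this sketch; icu_tokenizer and
# icu_folding are provided by the ICU analysis plugin (assumed installed).
INDEX_SETTINGS = {
    "analysis": {
        "analyzer": {
            "icu_fallback": {
                "type": "custom",
                "tokenizer": "icu_tokenizer",
                "filter": ["icu_folding"],
            }
        }
    }
}

# Languages we care about, mapped to built-in language analyzers.
LANGUAGE_ANALYZERS = {"en": "english", "fr": "french", "de": "german"}


def pick_analyzer(doc, language_field="lang"):
    """Return the analyzer name for a document, based on its language field.

    Unknown or missing languages get the ICU-based fallback analyzer.
    """
    code = str(doc.get(language_field) or "").strip().lower()
    return LANGUAGE_ANALYZERS.get(code, "icu_fallback")


def prepare(doc, language_field="lang"):
    """Return a copy of doc with an 'analyzer' field naming the analyzer to use."""
    enriched = dict(doc)
    enriched["analyzer"] = pick_analyzer(doc, language_field)
    return enriched


if __name__ == "__main__":
    print(prepare({"name": "chiens", "lang": "fr"}))
    print(prepare({"name": "cães", "lang": "pt"}))
```

The idea is that the "analyzer" value written by prepare() is what an analyzer-field mapping like the one linked above would read at index time. You would still need to pick the matching analyzer at query time, for example by running pick_analyzer() on the user's language.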

You might want to watch this presentation from the recent Berlin Buzzwords conference about multi-language support: http://www.youtube.com/watch?v=QI0XEshXygo. There's a lot of good stuff in there. Jump to the 27th minute for an example of using different analyzers.