WordNet Support for ElasticSearch with English Stemming. Part 2: Create Index and Mapping from Code with Nest.

Nest is a high-level .NET client for ElasticSearch. Refer to Part 1 for instructions on configuring an ElasticSearch cluster to support WordNet.

To create an ElasticSearch index that supports synonyms via the WordNet provider from code with Nest, we construct three token filters in the same way as we did in the JSON request in Part 1, and then chain them, minding the order, in a new custom analyser. As discussed in Part 1, the synonym filter should always come last, after tokenization and stop word removal.


using System;
using System.Collections.Generic;
using Nest;

public bool CreateIndexWithSynonymSupport(string elasticSearchApiPath, string index)
{
    var elasticSearchClient = new ElasticClient(new ConnectionSettings(new Uri(elasticSearchApiPath), index).ExposeRawResponse());

    // The three token filters, defined as in the JSON request from Part 1.
    var stopFilter = new StopTokenFilter { Stopwords = new List<string> { "a", "an", "the" } };
    var stemmerFilter = new StemmerTokenFilter { Language = Language.English.ToString() };
    var synonymFilter = new SynonymTokenFilter
    {
        Format = "wordnet",
        SynonymsPath = "analysis/wn_s.pl"
    };

    // Filter order matters: the synonym filter comes last.
    var analyser = new CustomAnalyzer
    {
        Tokenizer = "standard",
        Filter = new List<string> { "lowercase", "english_stop", "english_stemmer", "synonym" }
    };

    // Register the filters and the analyser under the names referenced above.
    return elasticSearchClient.CreateIndex(index,
        c => c.Analysis(a => a
            .TokenFilters(tf => tf.Add("english_stop", stopFilter)
                                  .Add("english_stemmer", stemmerFilter)
                                  .Add("synonym", synonymFilter))
            .Analyzers(an => an.Add("synonym", analyser)))).Acknowledged;
}

Next, we want to set the analyser to “synonym” for every analyzed field of the classes that are mapped as types in the index. I am using attribute mapping, so in my case the attributes are:


[ElasticProperty(Analyzer = "synonym")]
[DataMember]
public string Name { get; set; }

The Prolog flat file wn_s.pl that we use for synonym detection has the following structure:


s(102727825,1,'apparatus',n,1,22).
s(102727825,2,'setup',n,1,2).
s(102728440,1,'apparel',n,1,0).
s(102728440,2,'wearing apparel',n,1,0).
s(102728440,3,'dress',n,3,0).
s(102728440,4,'clothes',n,1,44).

The file consists of frequently used word senses. The 9-digit number at the beginning of each line is the unique ID of a word sense; the next number is the number of the word within that sense (each word is one way to express the sense); then comes the word itself as a string, followed by other parameters.
If we need to break a synonym association (e.g. we do not want to find “clothes” when searching for “dresses”), we can remove the word sense (all lines with ID 102728440) and re-upload the edited file to the ElasticSearch nodes (see Part 1).
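
Removing a word sense by hand is error-prone in a file of this size, so here is a minimal sketch of how the edit could be scripted. The `WordNetEntry` and `WordNetFile` names are hypothetical helpers made up for illustration (they are not part of Nest), and the sketch assumes that every entry follows the `s(id,number,'word',...)` layout shown above and that apostrophes inside a word are escaped by doubling them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// One parsed entry of wn_s.pl (hypothetical helper type, not part of Nest).
public sealed class WordNetEntry
{
    public string SynsetId { get; set; }
    public int WordNumber { get; set; }
    public string Word { get; set; }
}

public static class WordNetFile
{
    // Matches lines such as: s(102728440,4,'clothes',n,1,44).
    private static readonly Regex EntryPattern =
        new Regex(@"^s\((\d+),(\d+),'(.*)',\w,\d+,\d+\)\.\s*$");

    // Returns null when the line is not an s(...) entry.
    public static WordNetEntry Parse(string line)
    {
        var match = EntryPattern.Match(line);
        if (!match.Success) return null;
        return new WordNetEntry
        {
            SynsetId = match.Groups[1].Value,
            WordNumber = int.Parse(match.Groups[2].Value),
            // Assumption: apostrophes inside a word are escaped by doubling them.
            Word = match.Groups[3].Value.Replace("''", "'")
        };
    }

    // Keeps every line except the entries that belong to the given word sense.
    public static IEnumerable<string> RemoveSense(IEnumerable<string> lines, string synsetId)
    {
        return lines.Where(line =>
        {
            var entry = Parse(line);
            return entry == null || entry.SynsetId != synsetId;
        });
    }
}
```

With these helpers, a call such as `File.WriteAllLines("wn_s.edited.pl", WordNetFile.RemoveSense(File.ReadLines("analysis/wn_s.pl"), "102728440"))` would produce the edited copy (the output file name is only an example). Writing to a separate file avoids overwriting the original while it is still being read lazily.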
