WordNet Support for ElasticSearch with English Stemming. Part 2: Create Index and Mapping from Code with Nest.

Nest is a high-level .NET client for ElasticSearch. Refer to Part 1 for instructions on configuring an ElasticSearch cluster to support WordNet.

To create an ElasticSearch index that supports synonyms via the WordNet provider from code with Nest, we construct three token filters in the same way we did in the JSON request in Part 1, and then chain them, minding the order, in a new custom analyser. As discussed in Part 1, the synonym filter should always come last, after stop-word removal and stemming.


public bool CreateIndexWithSynonymSupport(string elasticSearchApiPath, string index)
{
    var elasticSearchClient = new ElasticClient(new ConnectionSettings(new Uri(elasticSearchApiPath), index).ExposeRawResponse());

    // Remove a few basic English stop words before stemming.
    var stopFilter = new StopTokenFilter { Stopwords = new List<string> { "a", "an", "the" } };

    // Reduce tokens to their English stems so singular and plural forms match.
    var stemmerFilter = new StemmerTokenFilter { Language = Language.English.ToString() };

    // Expand stemmed tokens into their WordNet synonyms.
    var synonymFilter = new SynonymTokenFilter
    {
        Format = "wordnet",
        SynonymsPath = "analysis/wn_s.pl"
    };

    // Chain the filters in the required order: lowercase, stop words, stemmer, synonyms last.
    var analyser = new CustomAnalyzer
    {
        Tokenizer = "standard",
        Filter = new List<string> { "lowercase", "english_stop", "english_stemmer", "synonym" }
    };

    return elasticSearchClient.CreateIndex(index,
        c => c.Analysis(a => a.TokenFilters(tf => tf.Add("english_stop", stopFilter)
                                                    .Add("english_stemmer", stemmerFilter)
                                                    .Add("synonym", synonymFilter))
                              .Analyzers(an => an.Add("synonym", analyser)))).Acknowledged;
}
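
As a quick usage example (just a sketch; the cluster URL and index name are the local defaults used later in this post, so substitute your own):


var created = CreateIndexWithSynonymSupport("http://localhost:9200", "synonym_test");
Console.WriteLine(created ? "Index with synonym support created" : "Index creation failed");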

Next, we want to set the analyser to “synonym” for all analyzed fields of the objects that will become types in the index. I am using attribute-based mapping, so in my case the attributes are:


[ElasticProperty(Analyzer = "synonym")]
[DataMember]
public string Name { get; set; }
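
For context, a minimal document class carrying this attribute, together with the call that pushes the attribute-based mapping to the index, might look like the sketch below. The Project class name is hypothetical, and the Map/MapFromAttributes call is my assumption for NEST 1.x, reusing an ElasticClient instance like the one constructed above:


[DataContract]
public class Project
{
    // Analysed with the custom "synonym" analyser created earlier.
    [ElasticProperty(Analyzer = "synonym")]
    [DataMember]
    public string Name { get; set; }
}

// Assumed NEST 1.x call: build the type mapping for Project from its attributes.
elasticSearchClient.Map<Project>(m => m.Index(index).MapFromAttributes());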

The Prolog flat file wn_s.pl that we use for synonym detection has the following structure:


s(102727825,1,'apparatus',n,1,22).
s(102727825,2,'setup',n,1,2).
s(102728440,1,'apparel',n,1,0).
s(102728440,2,'wearing apparel',n,1,0).
s(102728440,3,'dress',n,3,0).
s(102728440,4,'clothes',n,1,44).

The file consists of frequently used word senses. The nine-digit number at the beginning of each line is the unique ID of a word sense; the next number is the ID of the word that can be used to express that sense; then comes the string representation of the word, followed by other parameters.
If we need to break a synonym association (i.e. we do not want to find “clothes” when searching for “dresses”), we can remove the word sense (all lines with ID 102728440) and re-upload the edited file to the ElasticSearch nodes (see Part 1).
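
A minimal sketch of such an edit, assuming the file is trimmed locally before re-uploading and that filtering on the synset ID prefix is sufficient (the local file path is a placeholder):


using System.IO;
using System.Linq;

// Drop every word-sense line belonging to synset 102728440,
// then write the trimmed file back out for re-upload to the nodes.
var lines = File.ReadAllLines("wn_s.pl")
                .Where(line => !line.StartsWith("s(102728440,"))
                .ToArray();
File.WriteAllLines("wn_s.pl", lines);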

Access VM in Azure Resource Group via SSH client

  • In the Azure portal, go to your resource group
  • Click +Add at the top of the screen
  • Type ‘load balancer’ in the search box that appears, choose Load Balancer by Microsoft and click Create
  • Choose a name for the load balancer and select Create New for the Public IP Address
  • When it is deployed, add a backend pool with the required availability set and VM
  • After the backend pool is saved, add a health probe for TCP port 22
  • When the probe creation is finished, create a load balancer rule for TCP port 22
  • SSH to the machine through the newly created IP with PuTTY or another SSH client

Add WordNet Synonym Support to ElasticSearch with English Stemming

Taking this detailed and helpful post as a base, let’s try to introduce English stemming while preserving synonym token replacement by WordNet.

First, to add the WordNet Prolog file to existing ElasticSearch nodes (Ubuntu in my case), perform the following:

  • sudo su #switch to superuser to access ElasticSearch folder freely
  • wget http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz #download ANSI Prolog version of the WordNet db
  • tar -xvzf WNprolog-3.0.tar.gz #decompress tar
  • cd ../../etc/elasticsearch #go to ElasticSearch config directory
  • mkdir analysis #create analysis subdirectory
  • mv /home/onehydraadmin/prolog/wn_s.pl /etc/elasticsearch/analysis/wn_s.pl #move WordNet file to new directory

Now we are able to create an ElasticSearch index that can access the WordNet db.

What do we need in terms of synonym mapping? We need both the synonyms and the queries to be tokenized with the English stemmer after English stop-word removal. Then the query tokens need to be matched to tokens in the synonym source. After that, the resulting list of synonym tokens needs to act as the search query tokens against the indexed documents.

To achieve this, we create an index with a custom synonym analyser that utilises three filters (the order matters!): english_stop, english_stemmer, synonym.
PUT request to http://localhost:9200/synonym_test/


{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "synonym" : {
            "tokenizer" : "standard",
            "filter" : ["english_stop", "english_stemmer", "synonym"]
          }
        },
        "filter" : {
          "synonym" : {
            "type" : "synonym",
            "format" : "wordnet",
            "synonyms_path" : "analysis/wn_s.pl"
          },
          "english_stop" : {
            "type" : "stop",
            "stopwords" : "_english_"
          },
          "english_stemmer" : {
            "type" : "stemmer",
            "language" : "english"
          }
        }
      }
    }
  },
  "mappings" : {
    "_default_" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "analyzer" : "synonym"
        }
      }
    }
  }
}
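
Before indexing anything, we can sanity-check the analysis chain with the _analyze API (ES 1.x accepts the analyser and text as query parameters). A rough sketch using plain HTTP from C#, assuming the index above:


using System;
using System.Net.Http;

// Run "babies" through the custom "synonym" analyser and print the resulting
// tokens -- expected to be the stemmed term plus its WordNet synonyms.
using (var http = new HttpClient())
{
    var json = http.GetStringAsync(
        "http://localhost:9200/synonym_test/_analyze?analyzer=synonym&text=babies").Result;
    Console.WriteLine(json);
}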

Following the blog post, let’s insert two documents into the index: “baby” and “child”:

POST request to http://localhost:9200/synonym_test/1


{
    "name" : "baby"
}
POST request to http://localhost:9200/synonym_test/2

{
    "name" : "child"
}

Now we can search with singular and plural queries alike and still get all the synonyms in the response.

POST request to http://localhost:9200/synonym_test/_search?pretty=true

{
    "query" : {
        "match" : {
            "name" : {
                "query" : "babies"
            }
        }
    }
}

Response


{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.30685282,
        "hits": [
            {
                "_index": "projects6",
                "_type": "project",
                "_id": "1",
                "_score": 0.30685282,
                "_source": {
                    "name": "baby"
                }
            },
            {
                "_index": "projects6",
                "_type": "project",
                "_id": "2",
                "_score": 0.19178301,
                "_source": {
                    "name": "child"
                }
            }
        ]
    }
}
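
For completeness, the equivalent match query issued through Nest (as in Part 2) might look roughly like this; Project is the hypothetical attribute-mapped type from earlier, and the syntax assumes NEST 1.x:


var searchResponse = elasticSearchClient.Search<Project>(s => s
    .Index("synonym_test")
    .Query(q => q
        .Match(m => m
            .OnField(p => p.Name)   // the "name" field analysed with the synonym analyser
            .Query("babies"))));

// Both the "baby" and "child" documents are expected among the hits,
// just as in the raw response above.
foreach (var doc in searchResponse.Documents)
{
    Console.WriteLine(doc.Name);
}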