Add WordNet Synonym Support to ElasticSearch with English Stemming

Taking this detailed and helpful post as a base, let’s introduce English stemming while preserving WordNet-based synonym token replacement.

First, to add the WordNet Prolog file to existing ElasticSearch nodes (Ubuntu in my case), perform the following:

  • sudo su #switch to superuser to access ElasticSearch folder freely
  • wget http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz #download ANSI Prolog version of the WordNet db
  • tar -xvzf WNprolog-3.0.tar.gz #decompress tar
  • cd /etc/elasticsearch #go to ElasticSearch config directory
  • mkdir analysis #create analysis subdirectory
  • mv /home/onehydraadmin/prolog/wn_s.pl /etc/elasticsearch/analysis/wn_s.pl #move WordNet file to new directory (adjust the source path to wherever you extracted the archive)
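For orientation, each line of wn_s.pl is a Prolog fact of the form s(synset_id,w_num,'word',ss_type,sense_number,tag_count), and words that share a synset_id are treated as synonyms. The Python sketch below groups words by synset to make that structure visible. The helper name group_synsets is hypothetical, and the sample facts use made-up IDs and counts rather than lines copied from the file.

```python
import re
from collections import defaultdict

# One fact per line: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
# A literal apostrophe inside a word is written as two single quotes.
FACT = re.compile(r"^s\((\d+),(\d+),'((?:[^']|'')*)',([a-z]),(\d+),(\d+)\)\.$")

def group_synsets(lines):
    """Return {synset_id: [words...]} for every line that parses as a fact."""
    synsets = defaultdict(list)
    for line in lines:
        match = FACT.match(line.strip())
        if match is None:
            continue  # skip blank or unexpected lines
        synset_id = match.group(1)
        word = match.group(3).replace("''", "'")
        synsets[synset_id].append(word)
    return dict(synsets)

# Illustrative facts only; the IDs and counts are invented for this example.
sample = [
    "s(100000001,1,'baby',n,1,0).",
    "s(100000001,2,'child',n,1,0).",
    "s(100000002,1,'o''clock',r,1,0).",
]
print(group_synsets(sample))
```

Running the same function over the real file (for example with open("/etc/elasticsearch/analysis/wn_s.pl")) is a quick way to check that the file was moved intact.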

Now we can create an ElasticSearch index that can access the WordNet database.

What do we need in terms of synonym mapping? Both the synonyms and the queries need to be tokenized with the English stemmer after English stop words are removed. The query tokens then need to be mapped to tokens in the synonym source. Finally, the resulting list of synonym tokens needs to act as the search query tokens against the indexed documents.
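The intended flow can be sketched in a few lines of Python. This is only an illustration of the ordering described above, not how ElasticSearch implements it: STOP_WORDS is a tiny stand-in for the English stop word list, toy_stem is a deliberately crude stand-in for a real English stemmer, and all names here are hypothetical.

```python
# Tiny stand-in for the English stop word list (assumption, not the real list).
STOP_WORDS = {"a", "an", "the", "of", "and", "for"}

def toy_stem(token):
    """Very rough stand-in for an English stemmer (NOT the Porter algorithm)."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "i"
    if token.endswith("y") and len(token) > 2:
        return token[:-1] + "i"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text, synonym_groups):
    """Apply stop word removal, stemming, then synonym expansion, in that order."""
    # Stem the synonym source too, so lookups happen on stemmed forms.
    stemmed_groups = [{toy_stem(w) for w in group} for group in synonym_groups]
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    tokens = [toy_stem(t) for t in tokens]
    expanded = []
    for token in tokens:
        matches = {token}
        for group in stemmed_groups:
            if token in group:
                matches |= group
        expanded.append(sorted(matches))
    return expanded

groups = [{"baby", "child"}]
print(analyze("the babies", groups))
print(analyze("child", groups))
```

Because both the query and the synonym source pass through the same stemmer, a plural query and a singular synonym entry end up comparing the same stemmed token.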

To achieve this, we create an index with a custom synonym analyser that utilises three filters (the order matters!): english_stop, english_stemmer, synonym.

PUT request to http://localhost:9200/synonym_test/


{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonym": {
                        "tokenizer": "standard",
                        "filter": ["english_stop", "english_stemmer", "synonym"]
                    }
                },
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "format": "wordnet",
                        "synonyms_path": "analysis/wn_s.pl"
                    },
                    "english_stop": {
                        "type": "stop",
                        "stopwords": "_english_"
                    },
                    "english_stemmer": {
                        "type": "stemmer",
                        "language": "english"
                    }
                }
            }
        }
    },
    "mappings": {
        "_default_": {
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "synonym"
                }
            }
        }
    }
}
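Before indexing anything, it can help to check what tokens the new analyser actually produces. The sketch below builds a request for the index's _analyze endpoint using only the Python standard library. It assumes an older ElasticSearch release (the kind that still accepts the string field type), where _analyze takes analyzer and text as query-string parameters; newer releases expect a JSON body instead. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_analyze_request(text, index="synonym_test", analyzer="synonym",
                          host="http://localhost:9200"):
    """Build (but do not send) a GET request for the index's _analyze endpoint."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return Request(f"{host}/{index}/_analyze?{query}", method="GET")

def analyzed_tokens(text):
    """Send the request and return the token strings; needs a running node."""
    with urlopen(build_analyze_request(text)) as response:
        body = json.load(response)
    return [t["token"] for t in body.get("tokens", [])]

# Only builds and prints the request, so it runs without a node.
request = build_analyze_request("babies")
print(request.get_method(), request.full_url)
```

Calling analyzed_tokens("babies") against a live node should list the stemmed token together with any synonyms the filter chain emits, which makes it easy to see whether the filter order behaves as intended.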

Following the blog post, let’s insert two values into the index: “baby” and “child”:

POST request to http://localhost:9200/synonym_test/project/1

{
    "name" : "baby"
}

POST request to http://localhost:9200/synonym_test/project/2

{
    "name" : "child"
}

Now we can search with singular and plural queries alike and still get all synonyms in the response.

POST request to http://localhost:9200/synonym_test/_search?pretty=true

{
    "query" : {
        "match": {
            "name": {
                "query": "babies"
            }
        }
    }
}

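Response (the _index value shown, projects6, differs from synonym_test, presumably because the output was captured on a differently named index):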


{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.30685282,
        "hits": [
            {
                "_index": "projects6",
                "_type": "project",
                "_id": "1",
                "_score": 0.30685282,
                "_source": {
                    "name": "baby"
                }
            },
            {
                "_index": "projects6",
                "_type": "project",
                "_id": "2",
                "_score": 0.19178301,
                "_source": {
                    "name": "child"
                }
            }
        ]
    }
}
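To confirm programmatically that both documents come back, the response can be reduced to the fields that matter. The following Python sketch is one way to do that; summarize_hits is a hypothetical helper, and it assumes hits.total is a plain integer, as in the response above.

```python
import json

def summarize_hits(response_text):
    """Return (total, [(id, name, score), ...]) from a search response body."""
    body = json.loads(response_text)
    hits = body["hits"]
    # Assumes hits.total is an integer, as in the response shown above.
    rows = [(h["_id"], h["_source"]["name"], h["_score"]) for h in hits["hits"]]
    return hits["total"], rows

# Trimmed copy of the response shown above.
sample = """
{"hits": {"total": 2, "max_score": 0.30685282, "hits": [
  {"_id": "1", "_score": 0.30685282, "_source": {"name": "baby"}},
  {"_id": "2", "_score": 0.19178301, "_source": {"name": "child"}}
]}}
"""
total, rows = summarize_hits(sample)
for doc_id, name, score in rows:
    print(doc_id, name, score)
```

The exact match on “baby” scores higher than the synonym match on “child”, which is the ordering one would usually want.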