WordNet Support for ElasticSearch with English Stemming. Part 2: Create Index and Mapping from Code with Nest.

Nest is a high-level .NET client for ElasticSearch. Refer to Part 1 for instructions on configuring an ElasticSearch cluster to support WordNet.

To create an ElasticSearch index that supports synonyms via the WordNet provider from code with Nest, we construct the same three token filters as in the JSON request in Part 1 and then chain them, minding the order, in a new custom analyser. As discussed in Part 1, the synonym filter should always come last, after stop-word removal and stemming.


public bool CreateIndexWithSynonymSupport(string elasticSearchApiPath, string index)
{
    var elasticSearchClient = new ElasticClient(new ConnectionSettings(new Uri(elasticSearchApiPath), index).ExposeRawResponse());

    // Stop-word removal comes first, stemming second, synonym expansion last.
    var stopFilter = new StopTokenFilter { Stopwords = new List<string> { "a", "an", "the" } };
    var stemmerFilter = new StemmerTokenFilter { Language = Language.English.ToString() };
    var synonymFilter = new SynonymTokenFilter
    {
        Format = "wordnet",
        SynonymsPath = "analysis/wn_s.pl"
    };

    // Custom analyser referencing the filters by the names they are registered under below.
    var analyser = new CustomAnalyzer
    {
        Tokenizer = "standard",
        Filter = new List<string> { "lowercase", "english_stop", "english_stemmer", "synonym" }
    };

    return elasticSearchClient.CreateIndex(index,
        c => c.Analysis(a => a.TokenFilters(tf => tf.Add("english_stop", stopFilter)
                                                    .Add("english_stemmer", stemmerFilter)
                                                    .Add("synonym", synonymFilter))
        .Analyzers(an => an.Add("synonym", analyser)))).Acknowledged;
}

Next, we want to set the analyser to “synonym” for all analysed fields of the objects that are to be stored as types in the index. I am using attribute mapping, so in my case the attributes are:


[ElasticProperty(Analyzer = "synonym")]
[DataMember]
public string Name { get; set; }
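
Setting the attribute alone does not change the index; the mapping still has to be pushed to it. A minimal sketch, assuming NEST 1.x and a hypothetical Project document class:

// Hypothetical document class; the attribute binds the field to the "synonym" analyser.
public class Project
{
    [ElasticProperty(Analyzer = "synonym")]
    [DataMember]
    public string Name { get; set; }
}

// Push the attribute-based mapping while creating the index...
elasticSearchClient.CreateIndex(index, c => c.AddMapping<Project>(m => m.MapFromAttributes()));

// ...or map an existing index from the attributes.
elasticSearchClient.Map<Project>(m => m.MapFromAttributes());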

The Prolog flat file wn_s.pl that we are using for synonym detection has the following structure:


s(102727825,1,'apparatus',n,1,22).
s(102727825,2,'setup',n,1,2).
s(102728440,1,'apparel',n,1,0).
s(102728440,2,'wearing apparel',n,1,0).
s(102728440,3,'dress',n,3,0).
s(102728440,4,'clothes',n,1,44).

The file comprises frequently used word senses. The nine-digit number at the beginning of each line is the unique ID of a word sense; the next number is the number of a word that can be used to express the sense; next come the string representation of the word and other parameters.
If we need to break a synonym association (i.e. we do not want to find “clothes” when searching for “dresses”), we can remove the word sense (all lines with ID 102728440) and re-upload the edited file to the ElasticSearch nodes (see Part 1).
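
A quick way to see the expansion at work is the _analyze API; a sketch, assuming an index created as above and named synonym_test (the exact token list depends on the WordNet version):

GET request to http://localhost:9200/synonym_test/_analyze?analyzer=synonym&text=dresses

The returned tokens should include stemmed entries from synset 102728440, such as “dress”, “apparel” and “clothes”.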

Access VM in Azure Resource Group via SSH client

  • In the Azure portal go to your resource group
  • Click +Add at the top of the screen
  • Type ‘load balancer’ in the search box that appears, choose Load Balancer by Microsoft and click Create
  • Choose a name for the load balancer and Create New for the Public IP Address
  • When deployed, add a backend pool with the needed availability set and VM
  • After the backend pool is saved, add a health probe for TCP port 22
  • When probe creation is finished, create a load balancer rule for TCP port 22
  • SSH to the machine through the newly created IP with PuTTY or another SSH client
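
The same setup can be scripted. A rough sketch with the cross-platform Azure CLI, assuming a resource group myRG and hypothetical resource names (flags may differ between CLI versions; attaching the VM’s NIC to the backend pool remains a separate step):

# Load balancer with a new public IP and a backend pool
az network lb create --resource-group myRG --name myLB --public-ip-address myLBIP --backend-pool-name myBackendPool

# Health probe on TCP port 22
az network lb probe create --resource-group myRG --lb-name myLB --name sshProbe --protocol tcp --port 22

# Rule forwarding TCP 22 to the backend pool
az network lb rule create --resource-group myRG --lb-name myLB --name sshRule --protocol tcp --frontend-port 22 --backend-port 22 --backend-pool-name myBackendPool --probe-name sshProbe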

Add WordNet Synonym Support to ElasticSearch with English Stemming

Taking this detailed and helpful post as a base, let’s try to introduce English stemming while preserving synonym token replacement by WordNet.

First, to add the WordNet Prolog file to existing ElasticSearch nodes (in my case Ubuntu), perform the following:

  • sudo su #switch to superuser to access ElasticSearch folder freely
  • wget http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz #download ANSI Prolog version of the WordNet db
  • tar -xvzf WNprolog-3.0.tar.gz #decompress tar
  • cd ../../etc/elasticsearch #go to ElasticSearch config directory
  • mkdir analysis #create analysis subdirectory
  • mv /home/onehydraadmin/prolog/wn_s.pl /etc/elasticsearch/analysis/wn_s.pl #move WordNet file to new directory
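
To double-check that the file landed where the synonyms_path setting below expects it (the path is resolved relative to the ElasticSearch config directory):

ls -l /etc/elasticsearch/analysis/wn_s.pl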

Now we are able to create an ElasticSearch index that can access the WordNet db.

What do we need in terms of synonym mapping? We need both synonyms and queries to be tokenised with the English stemmer after English stop-word removal. Query tokens then need to be mapped to tokens from the synonym source. After that, the list of synonym tokens obtained needs to act as the search query tokens against indexed documents.

To achieve this, we create an index with a custom synonym analyser that utilises three filters (the order matters!): english_stop, english_stemmer, synonym.
PUT request to http://localhost:9200/synonym_test/


{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonym": {
                        "tokenizer": "standard",
                        "filter": ["english_stop", "english_stemmer", "synonym"]
                    }
                },
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "format": "wordnet",
                        "synonyms_path": "analysis/wn_s.pl"
                    },
                    "english_stop": {
                        "type": "stop",
                        "stopwords": "_english_"
                    },
                    "english_stemmer": {
                        "type": "stemmer",
                        "language": "english"
                    }
                }
            }
        }
    },
    "mappings": {
        "_default_": {
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "synonym"
                }
            }
        }
    }
}

Following the blog post, let’s insert two values into the index: “baby” and “child”:

POST request to http://localhost:9200/synonym_test/project/1


{
    "name" : "baby"
}
POST request to http://localhost:9200/synonym_test/project/2

{
    "name" : "child"
}

Now we can search with singular and plural queries alike and still get all the synonyms in the response.

POST request to http://localhost:9200/synonym_test/_search?pretty=true

{
    "query": {
        "match": {
            "name": {
                "query": "babies"
            }
        }
    }
}

Response


{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.30685282,
        "hits": [
            {
                "_index": "projects6",
                "_type": "project",
                "_id": "1",
                "_score": 0.30685282,
                "_source": {
                    "name": "baby"
                }
            },
            {
                "_index": "projects6",
                "_type": "project",
                "_id": "2",
                "_score": 0.19178301,
                "_source": {
                    "name": "child"
                }
            }
        ]
    }
}

Lab: TensorFlow Neural Network on Windows 7

Get Docker Toolbox here. After the installation completes, run the Docker Quickstart Terminal. If you see an error saying VT-x/AMD-V is required, turn on virtualisation in BIOS. After the shell starts successfully, git clone the lab repository in the terminal and run the Jupyter server as suggested in the Udacity description for Windows. To access the notebook, substitute localhost with the default virtual machine IP (obtained by running docker-machine ip default in cmd); a sketch of the session follows below.
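
The terminal session looks roughly like this (a sketch; the repository URL and image name are placeholders to be taken from the Udacity lab instructions):

git clone <lab repository URL>           # clone the lab materials
cd <lab folder>
docker run -it -p 8888:8888 <lab image>  # start the Jupyter server in a container
# in cmd: docker-machine ip default, then browse to http://<that IP>:8888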

If you get out-of-memory exceptions when running cells, power off the VM from Oracle VirtualBox, increase its memory to 4 GB, start the VM and run Docker again.

Jupyter Notebook on Windows Tips

Change Startup Folder

In cmd enter jupyter notebook --generate-config
The config will be generated here: C:\Users\username\.jupyter\jupyter_notebook_config.py
Edit the config; change the line
#c.NotebookApp.notebook_dir = ''
to
c.NotebookApp.notebook_dir = 'new startup folder path'
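
Note that the config file is Python, so a Windows path should use forward slashes or a raw string to avoid backslash escaping; for example (D:\notebooks is a hypothetical folder):

c.NotebookApp.notebook_dir = 'D:/notebooks'   # forward slashes work on Windows
c.NotebookApp.notebook_dir = r'D:\notebooks'  # or a raw string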

Display Multiple Plots in a Jupyter Cell

After the first plot, insert plt.figure() and follow it with plt.imshow(your_image).
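
A minimal sketch of such a cell (your_image and another_image are hypothetical arrays loaded earlier):

import matplotlib.pyplot as plt

plt.imshow(your_image)     # first plot renders as usual
plt.figure()               # open a new figure before the next plot
plt.imshow(another_image)  # second plot gets its own output
plt.show()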

Udacity Deep Learning Course: Setting Up Environment for Assignments on AWS

In the EC2 Dashboard press Launch Instance. Choose the Amazon Linux AMI and t2.micro, then press Next: Configure Instance Details. Leave everything default until you are on the Configure Security Group pane. There, create a new SSH/TCP/22 security rule for your public IP in CIDR format (if the range contains just one IP, add /32).

On pressing Launch, the Key Pair popup appears. Choose Create a new key pair, type in a key name specific to your instance and click Download Key Pair. Save the .pem key to your keys folder. Finally, press Launch Instance.

Wait a few minutes until the instance is up and running (or press View Instances and wait for the Status Check column to show a green tick).

Download putty.exe and puttygen.exe. Follow the instructions from the sections Converting Your Private Key Using PuTTYgen and Starting a PuTTY Session of the EC2 Connection Guide.

When connected, use the Docker Basics tutorial to install Docker on your new instance; the commands boil down to the sketch below. Go to the DL assignments Docker repository and follow the instructions to run Docker. Go back to your instance security rules and add an All traffic rule from your IP.
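
For reference, a sketch of the install steps from the Docker Basics tutorial for the Amazon Linux AMI (verify against the current tutorial):

sudo yum update -y                  # update packages
sudo yum install -y docker          # install Docker
sudo service docker start           # start the Docker daemon
sudo usermod -a -G docker ec2-user  # run docker without sudo (re-login required)
docker info                         # verify the daemon is reachable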

You can now access the assignments from your host machine browser by connecting to http://<AWS instance public DNS>:8888