Relevance Optimization

Every customer wants their application not just to show the most relevant results at the top of the results list but also to let them tune the logic responsible for ranking. This is usually called relevance optimization, and we are going to look at what can be adjusted and how it affects the ranking.

Solution

When we talk about relevance, we mean the user should see results as close as possible to what they have in mind. However, users can rarely formulate a request using the proper words and terms, and the search engine has to take this into account. So, each search engine calculates a score that tells how close a result is to the user's request. Relevance optimization in most cases means finding a proper way to calculate this score value.

In most cases, default full-text search algorithms are clever enough to calculate a reasonable score for standard requests. They often increase the score if the requested word is found several times in the original data, decrease the score for a partial match, and add other modifiers.

However, a customer may want to add a multiplier to the equation to affect the score calculation. This multiplier is called search weight. It may be used to promote new pages or articles or, conversely, to lower the position of results that nobody is usually looking for. Each such positive or negative scenario has an associated multiplier passed together with the original data, and the search engine multiplies the initial score by it. For example, if the original score is 10, then for promoted articles with a multiplier of 5, it becomes 10*5=50, and for old articles with a multiplier of 0.3, it becomes 10*0.3=3.
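
The arithmetic above can be sketched in a few lines of Python. This only illustrates the multiplication and is not how a search engine implements it; the function name apply_search_weight is hypothetical.

```python
def apply_search_weight(score: float, weight: float) -> float:
    """Return the base score scaled by a per-document search weight."""
    return score * weight


base_score = 10.0
promoted = apply_search_weight(base_score, 5.0)  # promoted article
demoted = apply_search_weight(base_score, 0.3)   # old article

print(promoted, demoted)
```

A weight above 1 pushes a result up, a weight between 0 and 1 pushes it down, and a weight of exactly 1 leaves the score unchanged.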

Another way to optimize relevance is to boost relevant results based on some conditions. We already have a separate article that describes boost best practices and shows how to make it work.

Finally, there is a way to customize the relevance score by manually setting the formula used to calculate it. The formula may include fields from the search index, external values, the original full-text search score, and other parameters. It tells the search index exactly how the score has to be calculated, and the search index then sorts results by this computed score.

There is one more trick you may want to use in some cases. If you have lots of false-positive results or want to limit the number of results shown to the client, you may set a minimum score, so the search index will automatically skip all results with a score below that threshold.
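
As a rough illustration, the sketch below builds a search request body that uses the top-level min_score parameter of the Elasticsearch search API and then mimics its effect on a list of scored hits. The threshold of 0.18, the field name text, the sample hits, and the helper name filter_by_min_score are all assumptions made for this example; in a real cluster the filtering happens on the server.

```python
import json

# Request body with a minimum score; 0.18 is an arbitrary example threshold
# and "text" is a hypothetical field name.
search_body = {
    "min_score": 0.18,
    "query": {"match": {"text": "apple"}},
}


def filter_by_min_score(hits, min_score):
    """Keep only hits whose score reaches the threshold, as min_score would."""
    return [hit for hit in hits if hit["_score"] >= min_score]


# Hypothetical hits, shaped like the "hits" array of a search response.
sample_hits = [
    {"_id": "a", "_score": 0.25},
    {"_id": "b", "_score": 0.12},
]

kept = filter_by_min_score(sample_hits, search_body["min_score"])
print(json.dumps(search_body, indent=2))
print([hit["_id"] for hit in kept])
```

Because scores are not normalized and depend on the query and the index contents, a fixed threshold usually has to be tuned per use case.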

Implementation

Now let us put all this knowledge to good use. We will check only a couple of common cases to demonstrate how to affect the relevance score and change it based on internal and external data.

Let us create a simple index with two fields: data to perform a full-text search query and weight used as a search weight. In our case, the weight field represents the document's importance for customers. Obviously, in real life, you may have more than one parameter that affects the score, and so you will need to use all of them.

Here are the queries that create our test index with the appropriate mapping for the two fields described above and add two sample documents:

curl -X PUT "localhost:9200/relevance-optimization"
curl -X PUT "localhost:9200/relevance-optimization/_mapping" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "data": {
      "type": "text",
      "analyzer": "english"
    },
    "weight": {
      "type": "float"
    }
  }
}
'
curl -X PUT "localhost:9200/relevance-optimization/_doc/1" -H 'Content-Type: application/json' -d'
{
  "data": "Apple An apple is an edible sweet fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus.",
  "weight": 0.5
}
'
curl -X PUT "localhost:9200/relevance-optimization/_doc/2" -H 'Content-Type: application/json' -d'
{
  "data": "Orange The orange is the fruit of various citrus species in the family Rutaceae (see list of plants known as orange); it primarily refers to Citrus × sinensis,[1] which is also called sweet orange, to distinguish it from the related Citrus × aurantium, referred to as bitter orange.",
  "weight": 3
}
'

Before we try to optimize relevance and change the score, let us perform a query without any modification to get good reference data.

curl -X POST "localhost:9200/relevance-optimization/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : { 
      "data" : "fruit" 
    }
  },
  "_source": false
}
'

Both documents from our search index appear in the result set with reasonable scores of 0.199 and 0.167.

{
  "took" : 32,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.1999656,
    "hits" : [
      {
        "_index" : "relevance-optimization",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.1999656
      },
      {
        "_index" : "relevance-optimization",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.16753875
      }
    ]
  }
}

Now let us try to use the weight field as a multiplier for the relevance score. We will use the function score query with the field value factor function. This function accepts several parameters, but in our case, we will use it just as a multiplier function.

curl -X POST "localhost:9200/relevance-optimization/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "function_score": {
      "query": {
        "match" : { 
          "data" : "fruit" 
        }
      },
      "field_value_factor": {
        "field": "weight"
      }
    }
  },
  "_source": false
}
'

And here are the results:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.5026162,
    "hits" : [
      {
        "_index" : "relevance-optimization",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5026162
      },
      {
        "_index" : "relevance-optimization",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0999828
      }
    ]
  }
}

Pay attention to the changes in the score. The first document had an original score of 0.199, but after applying a search weight multiplier of 0.5, it decreased to 0.099. Conversely, the original score of the second document, 0.167, was multiplied by three and became 0.502, so the second document now ranks first.
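
These numbers can be double-checked by hand. The short Python sketch below multiplies the reference scores by the stored weights; it only reproduces the arithmetic outside Elasticsearch, and the variable names are chosen for illustration.

```python
# Reference scores from the plain match query and the stored weights.
original_scores = {"1": 0.1999656, "2": 0.16753875}
weights = {"1": 0.5, "2": 3.0}

# With the settings used above, field_value_factor acts as a plain multiplier.
weighted = {
    doc_id: original_scores[doc_id] * weights[doc_id]
    for doc_id in original_scores
}

# Sort document IDs by weighted score, highest first, like the search response.
ranking = sorted(weighted, key=weighted.get, reverse=True)
print(weighted, ranking)
```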

In our second example, we will set a custom formula for the score calculation, pass some external parameters, and use all this data to calculate the score. Again, we will use the function score query, but this time with the script score function.

curl -X POST "localhost:9200/relevance-optimization/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "function_score": {
      "query": {
        "match" : { 
          "data" : "fruit" 
        }
      },
      "script_score": {
        "script": {
          "params": {
            "threshold": 2
          },
          "source": "doc[\u0027weight\u0027].value >= params.threshold ? Math.sqrt(_score) : doc[\u0027weight\u0027].value * _score"
        }
      }
    }
  },
  "_source": false
}
'

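Let us check the formula first. If the document's weight is greater than or equal to the threshold, passed as an external parameter with a value of 2, the script returns the square root of the original score; otherwise, the weight still works as a multiplier for the score. The \u0027 sequences in the script are JSON-escaped single quotes, needed because the whole request body is wrapped in single quotes for the shell. Keep in mind that the function score query by default multiplies the script's output by the original query score (boost_mode is multiply), which is why the scores below are lower than the formula alone would give. Here are the results of this query: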

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.06857612,
    "hits" : [
      {
        "_index" : "relevance-optimization",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.06857612
      },
      {
        "_index" : "relevance-optimization",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.019993119
      }
    ]
  }
}
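
The reported scores can be reproduced with a small Python sketch. It assumes that function_score is left at its default boost_mode of multiply, so the script's output is multiplied by the original query score. The function names are hypothetical, and the code mirrors the Painless expression rather than running inside Elasticsearch.

```python
import math


def script_value(score: float, weight: float, threshold: float = 2.0) -> float:
    """Mirror of the Painless expression used in the query."""
    return math.sqrt(score) if weight >= threshold else weight * score


def final_score(score: float, weight: float) -> float:
    """Combine the query score and the script output with boost_mode multiply."""
    return score * script_value(score, weight)


doc1 = final_score(0.1999656, 0.5)   # weight below the threshold
doc2 = final_score(0.16753875, 3.0)  # weight at or above the threshold
print(doc1, doc2)
```

If the formula's output should be used as the final score directly, setting boost_mode to replace in the function score query drops the extra multiplication.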

There is one more interesting Elasticsearch feature you can use. If you need to debug how exactly the search engine calculates the score, you may use the explain API to check the actual formula used to calculate it.
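
A minimal sketch of such a request body, assuming the explain flag of the search API is combined with the match query from the first example. The Python code only builds and prints the JSON, which can then be sent with the same curl command as the earlier searches.

```python
import json

# Same match query as before, with per-hit score explanations switched on.
explain_body = {
    "explain": True,
    "query": {"match": {"data": "fruit"}},
    "_source": False,
}

print(json.dumps(explain_body, indent=2))
```

With this flag set, each hit in the response carries an _explanation section that breaks the score down into its components.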