Multi-language Search

If an application supports multiple languages, the customer may want to customize search behavior for each language, or even run a search across all supported languages. Here is how to do it.

Solution

The typical solution for such a task is to have a separate field for each language, each with language-specific search behavior. You may put these fields in the same document, in different documents, or even in different indices. Whichever option you prefer, you have to set the proper language configuration for each field. This way the search index will automatically apply the right behavior.
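As a rough illustration of "one field per language, each with its own configuration", the sketch below generates the field definitions from a language table. The helper name build_language_properties and the data_ prefix are assumptions made for this example; the analyzer names are the ones used in the implementation later in this article.

```python
# Illustrative table: language code -> analyzer name used for that field.
LANGUAGE_ANALYZERS = {"en": "english", "fr": "french", "ru": "russian"}


def build_language_properties(analyzers, prefix="data_"):
    """Return a 'properties' dict with one text field per language (hypothetical helper)."""
    return {
        f"{prefix}{code}": {"type": "text", "analyzer": analyzer}
        for code, analyzer in analyzers.items()
    }


properties = build_language_properties(LANGUAGE_ANALYZERS)
print(properties["data_fr"])
```

Adding a language then becomes a one-line change to the table instead of a hand-edited mapping.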

However, separate fields only cover the case where you need to search in one language. What should you do if the customer wants to search across the data in all languages?

You have two options: manually build an expression that iterates over all language fields, or build a separate global field for global search.

Building an expression is simple: you can use the OR operator to run a search against all language fields, e.g., value found in English data OR value found in French data OR value found in Russian data. It is a simple approach that does the trick. However, it has a disadvantage: you have to run N search iterations instead of one, which is not ideal from a performance point of view.
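A minimal sketch of how such an OR expression could be assembled programmatically is shown here. The function name build_or_query is hypothetical; the resulting structure is a bool query with one match clause per language field, which is the same shape used in the implementation part of this article.

```python
import json


def build_or_query(text, fields):
    """Build a bool query whose 'should' clauses match `text` in any of `fields`."""
    return {
        "query": {
            "bool": {
                # One clause per language field; a hit in any of them is enough.
                "should": [{"match": {field: text}} for field in fields]
            }
        }
    }


query = build_or_query("diverse", ["data_en", "data_fr", "data_ru"])
print(json.dumps(query, indent=2))
```

The number of clauses grows with the number of languages, which is where the N iterations mentioned above come from.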

Using a separate global field is the alternative. You add another field that contains the information from all languages. The main advantage of this approach is that you have to perform only one iteration. However, because all languages share a single field, the search may not be as accurate as with separate fields. That is why I recommend customizing the behavior and splitting the string into search tokens using a custom algorithm. One common approach is dividing the data into words and then building N-gram tokens for each word.

Implementation

Here we will create an index that contains documents with four fields: one for English, one for French, one for Russian, and a final one for the global search. The language-specific fields will use standard language analyzers, and the global field will use a custom analyzer that produces N-gram tokens.

The first request creates an index and defines additional settings. These settings contain two important things:

a custom search analyzer, global_search_analyzer, that splits data into words using the whitespace tokenizer and then applies a custom N-gram token filter;
a custom edge N-gram filter, word_edge_ngram, that splits each word into tokens anchored at the beginning of the word, starting from two characters, e.g. the word first will be split into the tokens fi, fir, firs, first.

curl -X PUT "localhost:9200/multi-language-search" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "global_search_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "word_edge_ngram" ]
        }
      },
      "filter": {
        "word_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  }
}
'
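To get a feel for the tokens this analyzer produces, here is a small Python approximation of the whitespace split followed by the edge N-gram expansion. It is only a sketch of the idea: the function names are made up for this example, and the real filter also tracks token positions and offsets, which are ignored here.

```python
def edge_ngrams(word, min_gram=2, max_gram=25):
    """Return the prefixes of `word` whose length is between min_gram and max_gram."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


def analyze(text, min_gram=2, max_gram=25):
    """Split on whitespace, then expand every word into its edge n-grams."""
    tokens = []
    for word in text.split():
        tokens.extend(edge_ngrams(word, min_gram, max_gram))
    return tokens


print(edge_ngrams("first"))  # ['fi', 'fir', 'firs', 'first']
print(analyze("An apple"))   # ['An', 'ap', 'app', 'appl', 'apple']
```

Note that in this sketch a word shorter than min_gram yields no tokens at all, and that nothing is lowercased, so An and an are different tokens.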

The second request defines the structure of the index, called the mapping. It consists of four fields: three for standard language-based search and one for global search using the custom analyzer defined above in the index settings.

curl -X PUT "localhost:9200/multi-language-search/_mapping" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "data_en": {
      "type": "text",
      "analyzer": "english"
    },
    "data_fr": {
      "type": "text",
      "analyzer": "french"
    },
    "data_ru": {
      "type": "text",
      "analyzer": "russian"
    },
    "data_global": {
      "type": "text",
      "analyzer": "global_search_analyzer"
    }
  }
}
'

Now we need to fill this index with data. We will index two sentences in three languages to create two documents: the first one about an apple (id=1) and the second one about an orange (id=2). Note that the data_global field consists of all three language values combined.

curl -X PUT "localhost:9200/multi-language-search/_doc/1" -H 'Content-Type: application/json' -d'
{
  "data_en": "Apple An apple is an edible fruit produced by an apple tree",
  "data_fr": "Pomme Une pomme est un fruit comestible produit par un pommier",
  "data_ru": "Яблоко Яблоко это съедобный плод яблони",
  "data_global": "Apple An apple is an edible fruit produced by an apple tree Pomme Une pomme est un fruit comestible produit par un pommier Яблоко Яблоко это съедобный плод яблони"
}
'
curl -X PUT "localhost:9200/multi-language-search/_doc/2" -H 'Content-Type: application/json' -d'
{
  "data_en": "Orange The orange is the fruit of various citrus species in the family Rutaceae",
  "data_fr": "Orange Lorange est le fruit de diverses espèces dagrumes de la famille des Rutacées",
  "data_ru": "Апельсин Апельсин - это плод различных видов цитрусовых из семейства Рутовые",
  "data_global": "Orange The orange is the fruit of various citrus species in the family Rutaceae Orange Lorange est le fruit de diverses espèces dagrumes de la famille des Rutacées Апельсин Апельсин - это плод различных видов цитрусовых из семейства Рутовые"
}
'
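Writing the combined value by hand is error-prone, so in practice the global field would usually be derived from the language fields before indexing. The following sketch shows one way to do that in application code; with_global_field is a hypothetical helper, and joining the values with a single space is an assumption that matches the documents above.

```python
def with_global_field(doc, language_fields, global_field="data_global"):
    """Return a copy of `doc` with the language fields joined into one global field."""
    combined = " ".join(doc[name] for name in language_fields if doc.get(name))
    return {**doc, global_field: combined}


doc = {
    "data_en": "Apple An apple is an edible fruit produced by an apple tree",
    "data_fr": "Pomme Une pomme est un fruit comestible produit par un pommier",
    "data_ru": "Яблоко Яблоко это съедобный плод яблони",
}

indexed = with_global_field(doc, ["data_en", "data_fr", "data_ru"])
print(indexed["data_global"])
```

The original document is left untouched, and missing or empty language values are simply skipped.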

Finally, we have to run a search against these documents to find what we need.

First, let us run a standard search to find a document using a specific language, French.

curl -X GET "localhost:9200/multi-language-search/_search?q=data_fr:diverse&pretty"

And here is the result:

{
  "took" : 30,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6548753,
    "hits" : [
      {
        "_index" : "multi-language-search",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6548753,
        "_source" : {
          "data_en" : "Orange The orange is the fruit of various citrus species in the family Rutaceae",
          "data_fr" : "Orange Lorange est le fruit de diverses espèces dagrumes de la famille des Rutacées",
          "data_ru" : "Апельсин Апельсин - это плод различных видов цитрусовых из семейства Рутовые",
          "data_global" : "Orange The orange is the fruit of various citrus species in the family Rutaceae Orange Lorange est le fruit de diverses espèces dagrumes de la famille des Rutacées Апельсин Апельсин - это плод различных видов цитрусовых из семейства Рутовые"
        }
      }
    ]
  }
}

Now let us run a global search by building an OR expression that iterates over all language fields. We will combine the match query with the bool query to implement this behavior. Note that we are not using the data_global field here.

curl -X POST "localhost:9200/multi-language-search/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool" : {
      "should" : [
        { "match" : { "data_en" : "diverse" } },
        { "match" : { "data_fr" : "diverse" } },
        { "match" : { "data_ru" : "diverse" } }
      ]
    }
  }
}
'

And the result is the same:

{
  "took" : 69,
  "timed_out" : false,
  "_shards" : {
    "total" : 192,
    "successful" : 192,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6548753,
    "hits" : [
      {
        "_index" : "multi-language-search",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6548753,
        "_source" : {
          "data_en" : "Orange The orange is the fruit of various citrus species in the family Rutaceae",
          "data_fr" : "Orange Lorange est le fruit de diverses espèces dagrumes de la famille des Rutacées",
          "data_ru" : "Апельсин Апельсин - это плод различных видов цитрусовых из семейства Рутовые",
          "data_global" : "Orange The orange is the fruit of various citrus species in the family Rutaceae Orange Lorange est le fruit de diverses espèces dagrumes de la famille des Rutacées Апельсин Апельсин - это плод различных видов цитрусовых из семейства Рутовые"
        }
      }
    ]
  }
}

Now let us search for the same word in the global search field. We will use the data_global field for that, and not the language-specific fields. The query looks like this:

curl -X GET "localhost:9200/multi-language-search/_search?q=data_global:diverse&pretty"

And the same document is returned, although with a different score:

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.3957636,
    "hits" : [
      {
        "_index" : "multi-language-search",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.3957636,
        "_source" : {
          "data_en" : "Orange The orange is the fruit of various citrus species in the family Rutaceae",
          "data_fr" : "Orange Lorange est le fruit de diverses espèces dagrumes de la famille des Rutacées",
          "data_ru" : "Апельсин Апельсин - это плод различных видов цитрусовых из семейства Рутовые",
          "data_global" : "Orange The orange is the fruit of various citrus species in the family Rutaceae Orange Lorange est le fruit de diverses espèces dagrumes de la famille des Rutacées Апельсин Апельсин - это плод различных видов цитрусовых из семейства Рутовые"
        }
      }
    ]
  }
}

Note that although all three searches above returned the same document, results may vary in general. For example, the global search field with its simplified analyzer may match stopwords, and may miss some language-specific data that only an appropriate language analyzer can catch (e.g., because of stemming).
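The difference can be made concrete with a small simulation. The sketch below assumes that the global field matches whenever the query and an indexed word share at least one edge N-gram token (an OR-style match); the helper names are hypothetical and the real scoring is not modeled.

```python
def edge_ngrams(word, min_gram=2, max_gram=25):
    """Set of prefixes of `word` with lengths from min_gram up to max_gram."""
    return {word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)}


def shared_tokens(query_word, indexed_word):
    """Tokens that the query and the indexed word have in common."""
    return edge_ngrams(query_word) & edge_ngrams(indexed_word)


# The query "diverse" overlaps heavily with the indexed word "diverses".
print(sorted(shared_tokens("diverse", "diverses")))

# An unrelated query can still overlap on a short prefix.
print(sorted(shared_tokens("dinner", "diverses")))  # ['di']
```

Under that assumption, the global field finds diverses through shared prefixes rather than through French stemming, and an unrelated word such as dinner would still overlap on the two-letter prefix di, which is one reason the two approaches can return different results.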

When choosing between these two approaches, always keep two things in mind: business requirements and technical limitations. Check which approach implements the business requirements better and start from it. You should adapt your search engine to fit these requirements, not vice versa. Technical limitations come into play only once you already know how to implement the requirements.

However, sometimes you may be able to optimize the data structure and save some resources. For example, if you are OK with using an OR expression, you may skip the global search field and save CPU time and disk space.