Improving search relevancy for Japanese (and for Chinese and Korean, collectively known as CJK) poses unique challenges, but Senior Web Engineer Ivan Kristianto is on the case. Read on to find out what techniques he’s using to overcome difficulties with context, multiple scripts, and segmentation.
Search relevancy for Japanese-language content can be quite complex because of several characteristics of the language and its use, particularly the fact that words are not separated by whitespace.
Full-text search is a common way to improve search relevancy, and we are going to talk about how to implement it for Japanese with morphological analysis and n-gram analysis.
The search for meaning
Japanese writing uses three scripts (Katakana, Hiragana and Kanji), and all three can be mixed within the same content.
Let’s take a look at an example. These two words have very different meanings, but both words have almost the same characters:
- インフルエンサー translated to Influencer
- インフルエンザ translated to Influenza
When a user searches for 「インフルエンザ 夏の沖縄」, which translates to “Okinawa in the summer of influenza”, the search results come back with far more content about “Influencer in Okinawa”, because that content has a higher relevancy score based on what is in the database.
The result is confusing and makes for a poor user experience.
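To see how close the two words are at the character level, here is a small Python sketch that counts the character bigrams they share. It is only an illustration of the overlap; it does not reproduce how the relevancy score in the scenario above was actually calculated, and the helper name `char_ngrams` is our own.

```python
import unicodedata


def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Normalise to NFC so that ザ is one code point rather than サ plus a combining mark.
influencer = unicodedata.normalize("NFC", "インフルエンサー")  # influencer
influenza = unicodedata.normalize("NFC", "インフルエンザ")  # influenza

influencer_bigrams = set(char_ngrams(influencer))
influenza_bigrams = set(char_ngrams(influenza))
shared = influencer_bigrams & influenza_bigrams

print(sorted(shared))
print(f"{len(shared)} of {len(influenza_bigrams)} bigrams in the query word also occur in the other word")
```

Five of the six bigrams in インフルエンザ also appear in インフルエンサー, so any matching that works on character fragments alone will struggle to tell the two apart.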
The good news is that we have successfully improved the search result relevance using Elasticsearch on top of the Altis platform.
Precision and recall
Precision and recall stand as fundamental metrics for evaluating the efficacy of a full-text search system. Precision quantifies the accuracy of the search results by measuring the ratio of relevant results to the total number of retrieved results, reflecting the system’s ability to avoid irrelevant information. Conversely, Recall gauges the system’s capacity to retrieve all relevant documents, indicating the extent to which it avoids omitting pertinent information.
For instance, a search demonstrating high precision would yield results exclusively pertinent to the query, such as returning only items categorised as “phone” when prompted with the term. Conversely, a search emphasising high recall would retrieve all items containing the term “phone,” encompassing diverse contexts beyond mere telecommunications devices, including saxophones.
It’s imperative to recognise the inherent trade-off between precision and recall. Balancing these metrics requires aligning them with the specific requirements of your use case. Consequently, a meticulous approach involving iterative testing and adjustment becomes essential in optimising search performance to meet the unique demands of your application.
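As a rough illustration of the two metrics, the Python sketch below computes precision and recall for a strict and a broad result set. The item names are made up for the example and do not come from a real index.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical catalogue: the items a user searching for "phone" actually wants.
relevant = {"smartphone", "feature phone", "landline phone"}

# A strict search returns only exact category matches but misses one item.
strict_results = {"smartphone", "feature phone"}

# A broad search returns everything containing "phone", saxophones included.
broad_results = {"smartphone", "feature phone", "landline phone",
                 "saxophone", "xylophone", "headphones"}

strict_precision, strict_recall = precision_recall(strict_results, relevant)
broad_precision, broad_recall = precision_recall(broad_results, relevant)

print(f"strict: precision={strict_precision:.2f} recall={strict_recall:.2f}")
print(f"broad:  precision={broad_precision:.2f} recall={broad_recall:.2f}")
```

The strict search scores perfect precision but misses a relevant item, while the broad search finds everything at the cost of half its results being noise.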
Technical implementation summary
Most European languages rely on whitespace to separate words, making it straightforward to analyze sentences. With Elasticsearch on Altis, the analyzer does a good job of breaking sentences into the words that exist in the Kuromoji dictionary.
However, languages like Japanese (CJK – Chinese, Japanese, Korean) lack inherent word separation. Two primary methods address this challenge in search engines:
- N-gram analysis: Splits text into sequences of N characters.
  - Drawbacks: Large indexes due to redundancy, inability to analyse by part of speech, and creation of meaningless fragments. (Low search omissions, high search noise.)
- Morphological analysis: Segments text into meaningful words based on a dictionary.
  - Drawbacks: Difficulty handling new or unknown words not present in the dictionary. (Low search noise, high search omissions.)
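The difference between the two methods can be sketched in a few lines of Python. The n-gram function is a straightforward sliding window. The dictionary-based function is a deliberately simplified greedy longest-match segmenter with a tiny hand-made dictionary; it is a toy stand-in for morphological analysis, not the algorithm that Kuromoji uses.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """N-gram analysis: slide a window of n characters over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def longest_match_segment(text: str, dictionary: set[str]) -> list[str]:
    """Toy dictionary segmentation: greedily take the longest known word.

    Characters that start no dictionary word are emitted one at a time.
    """
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini dictionary; a real one holds hundreds of thousands of entries.
dictionary = {"夏", "の", "沖縄", "インフルエンザ"}

phrase = "夏の沖縄"
bigram_tokens = char_ngrams(phrase)
word_tokens = longest_match_segment(phrase, dictionary)
print(bigram_tokens)  # includes fragments such as の沖 that mean nothing on their own
print(word_tokens)    # only dictionary words

# A word missing from the dictionary falls apart into single characters.
unknown_tokens = longest_match_segment("インフルエンサー", dictionary)
print(unknown_tokens)
```

The bigram output contains the meaningless fragment の沖 (search noise), while the dictionary approach cannot keep インフルエンサー together because the word is not in its dictionary (search omission).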
Combining Techniques for Optimal Search Relevancy
To achieve optimal search relevance for Japanese text, we leverage both analysis methods:
- N-gram analysis: Reduces search omissions by capturing various word combinations.
- Morphological analysis: Minimises search noise by ensuring results are meaningful words.
To make this work, we required two Elasticsearch extensions, which were already installed on Altis. See this supporting document for more details.
- ICU Analysis Plugin: Integrates the Lucene ICU module into Elasticsearch, adding extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalisation, Unicode-aware case folding, collation support, and transliteration.
- Japanese (Kuromoji) Analysis Plugin: Integrates the Lucene Kuromoji analysis module into Elasticsearch.
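To give a feel for how the two plugins can be combined, here is a sketch of an index definition and a query, written as Python dictionaries that serialise to the JSON Elasticsearch expects. It is not the actual Altis configuration: the analyzer names (`ja_morph`, `ja_bigram`), the `content` field, and the boost values are assumptions made for illustration. The tokenizer, filter and char filter types are the ones provided by Elasticsearch and the two plugins listed above.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                # ICU plugin: Unicode normalisation (e.g. full-width to half-width).
                "ja_normalize": {"type": "icu_normalizer", "name": "nfkc", "mode": "compose"}
            },
            "tokenizer": {
                # Kuromoji plugin: dictionary-based morphological tokenizer.
                "ja_morph_tokenizer": {"type": "kuromoji_tokenizer", "mode": "search"},
                # Built-in n-gram tokenizer, configured here for character bigrams.
                "ja_bigram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 2,
                    "token_chars": ["letter", "digit"],
                },
            },
            "analyzer": {
                "ja_morph": {
                    "type": "custom",
                    "char_filter": ["ja_normalize"],
                    "tokenizer": "ja_morph_tokenizer",
                    "filter": ["kuromoji_baseform", "kuromoji_part_of_speech",
                               "ja_stop", "kuromoji_stemmer", "lowercase"],
                },
                "ja_bigram": {
                    "type": "custom",
                    "char_filter": ["ja_normalize"],
                    "tokenizer": "ja_bigram_tokenizer",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical field: indexed once per analyzer via a multi-field.
            "content": {
                "type": "text",
                "analyzer": "ja_morph",
                "fields": {"ngram": {"type": "text", "analyzer": "ja_bigram"}},
            }
        }
    },
}


def build_query(text: str) -> dict:
    """Match on both sub-fields; weight whole-word matches above bigram matches."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"content": {"query": text, "boost": 3.0}}},
                    {"match": {"content.ngram": {"query": text, "boost": 1.0}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


query = build_query("インフルエンザ 夏の沖縄")
print(json.dumps(index_body, ensure_ascii=False, indent=2))
print(json.dumps(query, ensure_ascii=False, indent=2))
```

In this arrangement the n-gram sub-field keeps recall high, while the higher boost on the morphologically analysed field pushes documents containing the whole word towards the top. The right boost values depend on the content and need the iterative testing described earlier.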
By leveraging a combination of n-gram and morphological analyses, we’ve effectively enhanced the search relevancy for the Japanese language within Elasticsearch on Altis.
Through the strategic implementation of the ICU Analysis Plugin and the Japanese (Kuromoji) Analysis Plugin, we’ve achieved a delicate balance between minimising search omissions and reducing search noise.
This meticulous approach ensures that our search system delivers precise and comprehensive results, even in the intricate linguistic landscape of Japanese. With these enhancements in place, users can expect a significantly improved search experience, characterised by increased accuracy and relevance.