Index Punjabi, Hindi, Bengali and other non-English scripts in Koha


Multilingual support is an important feature of a library catalog, and Koha offers it. It helps users search the library records in their preferred language. Since Koha supports Unicode (UTF-8, UTF-16), librarians can make catalog entries for regional-language books in their own scripts, such as Punjabi, Hindi, Bengali, and Tamil.

The problem arises when it comes to indexing and searching records in these vernacular languages.

Koha uses Zebra as its default search engine for indexing and retrieving the records. “Zebra is a high-performance, general-purpose structured text indexing and retrieval engine. It reads records in a variety of input formats (eg. email, XML, MARC) and provides access to them through a powerful combination of Boolean search expressions and relevance-ranked free-text queries.

Zebra supports large databases (tens of millions of records, tens of gigabytes of data). It allows safe, incremental database updates on live systems. Because Zebra supports the industry-standard information retrieval protocol, Z39.50, you can search Zebra databases using an enormous variety of programs and toolkits, both commercial and free, which understands this protocol…” (Zebra – User’s Guide and Reference, p. 1, http://www.indexdata.dk/zebra/doc/zebra.pdf)

By default, however, the Zebra search engine does not properly index languages other than English. The solution is to install and enable ICU chains. To do this, first install the yaz-icu package:

 sudo apt-get install yaz-icu
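
Once the package is installed, the yaz-icu command-line tool can be used to preview how an ICU chain splits a piece of text into tokens. The snippet below is a minimal sketch, not part of the original procedure: show_tokens is a hypothetical helper name, and the path in the example assumes that words-icu.xml sits in the same /etc/koha/zebradb/etc directory used later in this post. As far as I know, yaz-icu reads its input from standard input when no file is given.

```shell
# Print the tokens an ICU chain produces for a piece of text.
#   $1 = path to an ICU chain file (for example words-icu.xml)
#   $2 = sample text in any script
show_tokens() {
    if ! command -v yaz-icu >/dev/null 2>&1; then
        echo "yaz-icu not found; install the yaz-icu package first" >&2
        return 1
    fi
    # Feed the sample text to yaz-icu on standard input.
    printf '%s\n' "$2" | yaz-icu -c "$1"
}

# Example (the path is an assumption, see above):
# show_tokens /etc/koha/zebradb/etc/words-icu.xml 'ਪੰਜਾਬੀ ਕਿਤਾਬ'
```

If the Punjabi words come back as separate tokens, the chain is handling the script sensibly.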

Then, in the staff interface, go to More > Administration > Global system preferences > Searching.

  • In this tab, change the UseICUStyleQuotes system preference to Using.
  • Change the QueryFuzzy system preference to Don’t try.
  • Change the QueryStemming system preference to Don’t try.

Then edit /etc/koha/zebradb/etc/default.idx with the following command:

sudo nano /etc/koha/zebradb/etc/default.idx

Change or add the icuchain line in each of the two sections below (the word index and the phrase index):

 # Traditional word index
 # Used if completenss is 'incomplete field' (@attr 6=1) and
 # structure is word/phrase/word-list/free-form-text/document-text
 index w
 completeness 0
 position 1
 alwaysmatches 1
 firstinfield 1
 icuchain words-icu.xml
 
 # Phrase index
 # Used if completeness is 'complete {sub}field' (@attr 6=2, @attr 6=1)
 # and structure is word/phrase/word-list/free-form-text/document-text
 index p
 completeness 1
 firstinfield 1
 icuchain phrases-icu.xml 
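
Before restarting anything, it may be worth confirming that both sections picked up the change. The following is a small sketch under the assumption that the file follows the layout shown above, with each section opened by an "index" line; icuchain_for is a hypothetical helper name, not a Koha or Zebra command.

```shell
# Print the icuchain file configured for one index section of a Zebra .idx file.
#   $1 = index name (w = word index, p = phrase index)
#   $2 = path to the .idx file
icuchain_for() {
    awk -v name="$1" '
        $1 == "index"                       { current = $2 }  # a new section starts
        $1 == "icuchain" && current == name { print $2 }      # directive inside it
    ' "$2"
}

IDX=/etc/koha/zebradb/etc/default.idx
if [ -r "$IDX" ]; then
    echo "word index:   $(icuchain_for w "$IDX")"
    echo "phrase index: $(icuchain_for p "$IDX")"
fi
```

With both edits in place, the two lines should name words-icu.xml and phrases-icu.xml; an empty value means that section still lacks its icuchain line.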

Restart Zebra and rebuild the search index with the following commands, one at a time, replacing {yourinstancename} with the name of your Koha instance:

sudo koha-zebra --restart {yourinstancename}
sudo koha-rebuild-zebra -f -v {yourinstancename}

This may take some time, depending on the size of your catalog. Once the indexing has finished, you will be able to search and browse regional-language records in the Koha catalog.
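
If the server hosts more than one Koha instance, or the instance name is not at hand, the koha-list script that ships with the Koha packages prints the instance names. The sketch below loops over every enabled instance and runs the same two commands for each one. rebuild_all is a hypothetical helper name, and the function assumes it is run as root (for example from a root shell), since the commands above are run with sudo.

```shell
# Restart Zebra and fully rebuild the index for every enabled Koha instance.
# Assumes root privileges; koha-list --enabled prints one instance name per line.
rebuild_all() {
    for instance in $(koha-list --enabled); do
        koha-zebra --restart "$instance"
        koha-rebuild-zebra -f -v "$instance"
    done
}

# Example:
# rebuild_all
```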

Author: Rupinder Singh

I am a tireless intelligence seeker and, coincidentally, a computer guy who is passionate about information tools and open-source software. I read books, play computer games, and climb mountains when I am not changing the code.
