Archive for January, 2009

Simple “full text” search with CouchDB

Monday, January 19th, 2009

To have a shiny application you need domain specific search. E.g. if our call center wants to enter a new order, they might not have the customer number ready. So they need a snappy way to get the customer number based on name, city or whatever.

We did experiment with lot’s of LIKE queries to our In our legacy ERP database system. This didn’t feel good, had some SQL injection vulnerabilities and required lot’s of full table scans.

Looking for alternatives we decided to use CouchDB. There is some work on full text indexing for CouchDB you can build something much more simple yourself.

Once a day we copy all customer data from the legacy system into CouchDB. Then we use the map function to emit a line for each word in each data field of each document. It looks like this:

function(doc) {
    function output(value) {
        // Split into search terms
        if(value && (value != "-") && (value.length > 2)) {
            emit(value, 1);
            for(var word in value.split(" ")) {
                if(word && (word != "-") && (word.length > 2)) {
                    emit(word, 1);
                }
            }
        }
    }
    output(doc.kundennr);
    output(doc.name1);
    output(doc.name2);
    output(doc.ort);
    output(doc.land + "-" + doc.plz);
}

This basically generates a view (index) containing every word and the document it occurs in.

You now can use that for a prefix based search in a function like this:

from couchdb.client import *

def finde_kundendaten(searchstring):
    server = Server('http://couchdb.local.hudora.biz:5984/')
    db = server['kunden']
    rows = []
    while len(rows) < 1 and len(searchstring) > 2:
        rows = db.view('suche/alle_felder', startkey=searchstring, limit=25)
        if rows:
            break
        searchsting = searchstring[:-1]
    return [(x.id, x.key) for x in rows]
>>> finde_kundendaten("Sport Dornseif") # no exact match in the DB
[(u'51320', u'Sport Alm SysIntersport'),
 (u'27094', u'Sport Freizeit'),
 (u'31071', u'SPORT FREIZEIT  TREFF'),
...]

Nifty!