Background
In my compay we use a logistics application called huLOG. Part of huLOGs (award winning) functionality is to aggregate track and trace events from about two dozen sources. The events are schema free in there nature:
some might contain a ZIP code, some may geographic coordinates attached, some relate to a certain packet or pallet (called “movable unit” in huLOG-spreak) some relate to a certain shipment (which is a group of movable units). Some have file attachments, e.g. Images of the packages, signatures proofing delivery or pictures of the number plates of the trucks.
My first approach was an SQL database. Later I coupled it with a self-designed Document store called DoDoStorage. For background on that project see here, here and here.
In summer 2007 we experienced severe performance problems with DoDoStorage. I started a rewrite in Erlang and came across Damien Katz and his then obscure CouchDB project. It was just in the transition from XML to JSON and Damion assured me that “it will not be production ready for an other two years”.
So we decided to solve the DoDoStorage speed issues with more hardware for the time being.
In Fall 2008 the landscape for CouchDB hat changed: it was now the hot thing in database technology and everybody’s darling.
CouchDB 0.9 came around and it started to look usable for serious use. We hired Jan Lenhardt, one of the CouchDB core team to work with us on moving huLOG from PostgreSQL and DoDoStorage to CouchDB.
In December we started migrating services and data over to a CouchDB based system. While CouchDB has its wards we are very happy with it so far.
Data Model
As stated above tracking data comes in many different flavors ad colors. It might reference an MUI (”movable unit ID”), a shipment or both. Let’s concentrate on events referencing a MUI. Some typical tracking events might look like this:
{
"_id": "01420000000378-20061005T064500.000000",
"mui": "01420000000378",
"shipment": 572,
"message": "Processing, 0132-Vlotho, Route 0132, Code 101",
"code": "410",
"timestamp": "20061005T064500.000000",
"facility": "DPD Depot 0132",
"ort": "Vlotho (DE)",
"plz": "32657"
},
{
"_id": "01420000000378-20061005T085400.000000",
"mui": "01420000000378",
"shipment": 572,
"message": "proof of delivery",
"code": "421",
"timestamp": "20061005T085400.000000",
"plz": "32634",
"_attachments": {
"POD.pdf": {
"stub": true,
"content_type": "application/pdf",
"length": 136446
}
}
}
One thing about my choice of keys (the _id field). I have choosen a “meaningful” ID over traditional “random” UUIDs. The IDs we use consist of MUI-Timestamp The Timestamp is the time the event was generated (e.g. the pallet was loaded). The good news is that this automatically keeps my database free of duplicates.
If I import the same file with events twice, the events will have the same IDs during both imports and thus the second import will overwrite the first one: exactly what I want.
The bad news is that while physically there can’t really happen two events at once due to clock drift etc. I might get two different events for a MUI with exactly the same timestamp. This would result in the same key being generated for both events and thus the second event would overwrite the first one. With the relatively sparse populated space of timestamps I expect this to happen once in 10.000 events or so.
For huLOG it is acceptable to use one in 10.000 events – especially if it is the earlier one. We only get about 95% of the events we should get due to problems in the track and trace infrastructure of the freight companies. As long as the all important “has been handed to the customer” messages arrive, our system has to be able to handle missing messages.
Data Access by MUI
With that we just need a few view functions to work with the data. Most obvious we want to get all documents for a certain MUI.
This is easy since our document IDs already contain the MUI. We only have to get a list of all IDs starting with MUI. The CouchDB Document API already provides that functionality:
$ curl -s 'http://localhost:5984/hulog_events/_all_docs
?startkey=%22094147562251-0%22
&endkey=%2209445147562253-9%22
&include_docs=true'
{"total_rows":184708,"offset":184705,"rows":[
{"id":"09445147562251-20081114T165600.000000",
"key":"09445147562251-20081114T165600.000000",
"value":{"rev":"463392510"},
"doc": {..., "message":"Einrollung, 0147, Route 0280, Code 461 105",
"mui":"09445147562251", "shipment":124524}},
{"id":"09445147562253-20081114T165700.000000",
"key":"09445147562253-20081114T165700.000000",
"value":{"rev":"2436256439"},
"doc": {..., "message":"Einrollung, 0280, Route 0234, Code 461 105",
"mui":"09445147562253", "shipment":124524}}
]}
Some things to note: startkey and endkey must be set to valid JSON objects. We want a string containing the mui so we have to put “quotes” arround it.
After URL-encoding we end up with %22094147562251%22.
The returned documents are within the interval [startkey:endkey]. If we choose a startkey which is guaranteed same or smaller than the key we want to retrieve. That’s easy: 09445147562251 is smaller than any string which has more characters and starts with 09445147562251. Endkey is somewhat more tricky. What is the biggest value? 09445147562252? This might catch one record to much, because there actually might be a key 09445147562252.
A common idiom used in the Erlang community is to use something like 09445147562251Z as the endkey. But then what is if there actually is a key 09445147562251Zabc? So use 09445147562251ZZZZZ, but …
An other suggestion is using a “high unicode character” like \u9999. But at least there is no highest unicode code point and you are well of in considering the sorting rules for obscure unicode characters as indetermistic. See the CouchDB Wiki for further information.
In our case all this is no problem. The MUI is followed by a ISO 8601 timestamp. So
startkey=094147562251-0, endkey=09445147562253-9 should work fine until the end of year 8999.
Data Access by Shipment
That was easy. Now we want to access data based on something which is not encoded in the document ID. Say the “shipment” number. For that we need a map function which gives us shipment numbers.
map: function(doc) {
if(doc.shipment) {
emit(doc.shipment, null);
}
}
We permanently save this function in the server to allow CouchDB to play its clever optimization tricks. See the CouchDB wiki for more information on the HTTP View API. You could use the Futon GUI at http://localhost:5984/_utils/ for that or curl:
curl -X PUT -H 'Content-Type: applicatioe": "javascript",
"views": {"all":{"map":"function(doc){if(doc.shipment)
{emit(doc.shipment, null);}}"}}}
' http://localhost:5984/hulog_events/_design%2fsendung
{"ok":true,"id":"_design/sendung","rev":"23237918"}
Now we can query the view and with the startkey and endkey parameters. Here we get again into trouble for choosing the endkey. The hack we used for the IDs did work because there was a timestamp of well known format in the ID so we construct a “slightly bigger” value. Shipments are numeric. So what is bigger than 128996? 128997 is obviously bigger but would get us documents for two shipments (128996 and 128997). Since the shipments are numeric, a trick like “128996Z” does not work out.
But numeric comparing of numeric values in JavaScript work nicely with different numeric types. Shipments are integers. If we use a float key we nicely can place it to be bigger than 128996 but smaller than 128997. We use 128996.1 as our endkey.
$ curl 'http://localhost:5984/hulog_events/_view/shipment/all
?startkey=128996&endkey=128996.1'
{"total_rows":184707,"offset":184674,"rows":[
{"id":"09445122481336-20081223T105212.491567","key":128996,"value":null},
{"id":"09445122481336-20081223T155200.000000","key":128996,"value":null},
{"id":"09445122481336-20081223T202600.000000","key":128996,"value":null},
{"id":"09445122481337-20081223T105216.377305","key":128996,"value":null},
{"id":"09445122481337-20081223T155300.000000","key":128996,"value":null},
{"id":"09445122481337-20081223T202500.000000","key":128996,"value":null}
]}
You can also add &include_docs=true to the query – this will get you the complete documents, not only the IDs.
At this point we have a scalable, elegant datastore for our Track & trace related events ad files.
Usage
Currently we are running CouchDB with subset of our tracking archives. This subset is about a quarter million documents of wich 20% or so have attachments resulting in a database size of about 5 GB. No complains so far.
Posted in couchdb, write code | 4 Comments »