Archive for the ‘write code’ Category

Building a Track and Trace Application with CouchDB

Sunday, December 28th, 2008

Background

In my compay we use a logistics application called huLOG. Part of huLOGs (award winning) functionality is to aggregate track and trace events from about two dozen sources. The events are schema free in there nature: some might contain a ZIP code, some may geographic coordinates attached, some relate to a certain packet or pallet (called “movable unit” in huLOG-spreak) some relate to a certain shipment (which is a group of movable units). Some have file attachments, e.g. Images of the packages, signatures proofing delivery or pictures of the number plates of the trucks.

My first approach was an SQL database. Later I coupled it with a self-designed Document store called DoDoStorage. For background on that project see here, here and here.

In summer 2007 we experienced severe performance problems with DoDoStorage. I started a rewrite in Erlang and came across Damien Katz and his then obscure CouchDB project. It was just in the transition from XML to JSON and Damion assured me that “it will not be production ready for an other two years”.

So we decided to solve the DoDoStorage speed issues with more hardware for the time being.

In Fall 2008 the landscape for CouchDB hat changed: it was now the hot thing in database technology and everybody’s darling. CouchDB 0.9 came around and it started to look usable for serious use. We hired Jan Lenhardt, one of the CouchDB core team to work with us on moving huLOG from PostgreSQL and DoDoStorage to CouchDB.

In December we started migrating services and data over to a CouchDB based system. While CouchDB has its wards we are very happy with it so far.

Data Model

As stated above tracking data comes in many different flavors ad colors. It might reference an MUI (”movable unit ID”), a shipment or both. Let’s concentrate on events referencing a MUI. Some typical tracking events might look like this:

{
   "_id": "01420000000378-20061005T064500.000000",
   "mui": "01420000000378",
   "shipment": 572,
   "message": "Processing, 0132-Vlotho, Route 0132, Code 101",
   "code": "410",
   "timestamp": "20061005T064500.000000",
   "facility": "DPD Depot 0132",
   "ort": "Vlotho (DE)",
   "plz": "32657"
},
{
   "_id": "01420000000378-20061005T085400.000000",
   "mui": "01420000000378",
   "shipment": 572,
   "message": "proof of delivery",
   "code": "421",
   "timestamp": "20061005T085400.000000",
   "plz": "32634",
   "_attachments": {
       "POD.pdf": {
           "stub": true,
           "content_type": "application/pdf",
           "length": 136446
       }
   }
}

One thing about my choice of keys (the _id field). I have choosen a “meaningful” ID over traditional “random” UUIDs. The IDs we use consist of MUI-Timestamp The Timestamp is the time the event was generated (e.g. the pallet was loaded). The good news is that this automatically keeps my database free of duplicates. If I import the same file with events twice, the events will have the same IDs during both imports and thus the second import will overwrite the first one: exactly what I want.

The bad news is that while physically there can’t really happen two events at once due to clock drift etc. I might get two different events for a MUI with exactly the same timestamp. This would result in the same key being generated for both events and thus the second event would overwrite the first one. With the relatively sparse populated space of timestamps I expect this to happen once in 10.000 events or so.

For huLOG it is acceptable to use one in 10.000 events – especially if it is the earlier one. We only get about 95% of the events we should get due to problems in the track and trace infrastructure of the freight companies. As long as the all important “has been handed to the customer” messages arrive, our system has to be able to handle missing messages.

Data Access by MUI

With that we just need a few view functions to work with the data. Most obvious we want to get all documents for a certain MUI.

This is easy since our document IDs already contain the MUI. We only have to get a list of all IDs starting with MUI. The CouchDB Document API already provides that functionality:

$ curl -s 'http://localhost:5984/hulog_events/_all_docs
  ?startkey=%22094147562251-0%22
  &endkey=%2209445147562253-9%22
  &include_docs=true' 

{"total_rows":184708,"offset":184705,"rows":[
{"id":"09445147562251-20081114T165600.000000",
 "key":"09445147562251-20081114T165600.000000",
 "value":{"rev":"463392510"},
 "doc": {..., "message":"Einrollung, 0147, Route 0280, Code 461 105",
        "mui":"09445147562251", "shipment":124524}},
{"id":"09445147562253-20081114T165700.000000",
 "key":"09445147562253-20081114T165700.000000",
 "value":{"rev":"2436256439"},
 "doc":  {..., "message":"Einrollung, 0280, Route 0234, Code 461 105",
          "mui":"09445147562253", "shipment":124524}}
]}

Some things to note: startkey and endkey must be set to valid JSON objects. We want a string containing the mui so we have to put “quotes” arround it. After URL-encoding we end up with %22094147562251%22.

The returned documents are within the interval [startkey:endkey]. If we choose a startkey which is guaranteed same or smaller than the key we want to retrieve. That’s easy: 09445147562251 is smaller than any string which has more characters and starts with 09445147562251. Endkey is somewhat more tricky. What is the biggest value? 09445147562252? This might catch one record to much, because there actually might be a key 09445147562252.

A common idiom used in the Erlang community is to use something like 09445147562251Z as the endkey. But then what is if there actually is a key 09445147562251Zabc? So use 09445147562251ZZZZZ, but …

An other suggestion is using a “high unicode character” like \u9999. But at least there is no highest unicode code point and you are well of in considering the sorting rules for obscure unicode characters as indetermistic. See the CouchDB Wiki for further information.

In our case all this is no problem. The MUI is followed by a ISO 8601 timestamp. So
startkey=094147562251-0, endkey=09445147562253-9 should work fine until the end of year 8999.

Data Access by Shipment

That was easy. Now we want to access data based on something which is not encoded in the document ID. Say the “shipment” number. For that we need a map function which gives us shipment numbers.

map: function(doc) {
    if(doc.shipment) {
        emit(doc.shipment, null);
    }
}

We permanently save this function in the server to allow CouchDB to play its clever optimization tricks. See the CouchDB wiki for more information on the HTTP View API. You could use the Futon GUI at http://localhost:5984/_utils/ for that or curl:

 curl -X PUT -H 'Content-Type: applicatioe": "javascript",
  "views": {"all":{"map":"function(doc){if(doc.shipment)
                          {emit(doc.shipment, null);}}"}}}
  ' http://localhost:5984/hulog_events/_design%2fsendung

{"ok":true,"id":"_design/sendung","rev":"23237918"}

Now we can query the view and with the startkey and endkey parameters. Here we get again into trouble for choosing the endkey. The hack we used for the IDs did work because there was a timestamp of well known format in the ID so we construct a “slightly bigger” value. Shipments are numeric. So what is bigger than 128996? 128997 is obviously bigger but would get us documents for two shipments (128996 and 128997). Since the shipments are numeric, a trick like “128996Z” does not work out.

But numeric comparing of numeric values in JavaScript work nicely with different numeric types. Shipments are integers. If we use a float key we nicely can place it to be bigger than 128996 but smaller than 128997. We use 128996.1 as our endkey.

$ curl 'http://localhost:5984/hulog_events/_view/shipment/all
  ?startkey=128996&endkey=128996.1'

{"total_rows":184707,"offset":184674,"rows":[
{"id":"09445122481336-20081223T105212.491567","key":128996,"value":null},
{"id":"09445122481336-20081223T155200.000000","key":128996,"value":null},
{"id":"09445122481336-20081223T202600.000000","key":128996,"value":null},
{"id":"09445122481337-20081223T105216.377305","key":128996,"value":null},
{"id":"09445122481337-20081223T155300.000000","key":128996,"value":null},
{"id":"09445122481337-20081223T202500.000000","key":128996,"value":null}
]}

You can also add &include_docs=true to the query – this will get you the complete documents, not only the IDs.

At this point we have a scalable, elegant datastore for our Track & trace related events ad files.

Usage

Currently we are running CouchDB with subset of our tracking archives. This subset is about a quarter million documents of wich 20% or so have attachments resulting in a database size of about 5 GB. No complains so far.

departicularifier for NRG MP 161 Scanfaxprinter

Saturday, November 1st, 2008

If you own a NRG/Rex-Rotary/Aficio/Ricoh/Gestetner MP 161 Fax/Scan/Print combination you might have noticed the annoying habit of that device to chop E-Mail messages of scanned Images into multiple parts. This is implemented by using MIME message/partial Semantics.

While there are obscure mail applications which are able to read message/partial modern software ignores it, because it is a huge security issue.

This in turn means, I can’t open longer documents mailed by the MP 161 to me.

Because I like to scan large documents, I hate this quirk of the MP 161. So I wrote a Python tool called departicularifier which connects to an IMAP server, downloads the mails send by the fax, reassembles and decodes the messages and dumps the files on your harddisk.

$ python departicularifier.py --help
Usage: departicularifier.py --user=you@example.com [options].
Try departicularifier.py --help for details.

Extract files from message/partial MIME messages generated by NRG MP 161
scanfax appliances.

Options:
  --version            show program's version number and exit
  -h, --help           show this help message and exit
  --server=SERVER      hostname of the IMAPS server where the messages are
                       stored (default: "mail.hudora.biz")
  --user=USER          User name for logging into the server (default: "none")
  --password=PASSWORD  Pasword used to login (default: ask for password)
  --folder=FOLDER      Folder to scann for messages (default: "INBOX")
  --sender=SENDER      Sender whose messages to process (default: "scanner@")
  --dir=DIR            Destination directory (default: ".")

For you it need probably some fiddeling with the parameters. A run shoul look like this:

$ python decoder.py --user=m.dornseif@example.com
connecting to 'mail.hudora.biz' as 'm.dornseif@example.com'.
Password:
[long wait]
writing ./20080528132721094.pdf
writing ./20080528130042306.pdf
writing ./20080528131755217.pdf

You can download departicularifier.py here.

Reliable Software Design

Thursday, September 18th, 2008

Avoiding Race conditions

A typical situation in ERP systems is, that you have to use SQL tables as an “interface” between applications.
Let’s say you want to submit an order. You write an “order header” containing the customer number etc and orderlines how much to ship of what.

order = {'customer_no': 12345,
         'delivery_date': '2001-01-01',
         'orderlines': [{'item_no': '10105', 'quantity': 80},
                        {'item_no': '14600', 'quantity': 12}]
        }

In SQL you use two tables and add some primary keys:

order_head:

id   | customer_no | delivery_date | processed
-----+-------------+---------------------------
5123 | 12345       | 2001-01-01    | False

orderline:

orderid | item_no | quantity
--------+---------+----------
5123    | 10105   | 80
5123    | 14600   | 12

Since oldschool databases usually don’t have auto incrementing keys, you usually have to write something like this to insert an order:

next_id = query('SELECT MAX(id) FROM order_head')
query('INSERT INTO order_head (id, customer_no, delivery_date, processed)
       VALUES(%d, '12345', '2001-01-01', False)" % next_id)
query('INSERT INTO orderline (orderid | item_no | quantity)
       VALUES(%d, '10105', 80)" % next_id)
query('INSERT INTO orderline (orderid | item_no | quantity)
       VALUES(%d, '14600', 12)" % next_id)

This has some issues: what if some programm starts reading after you have written order_head and the first orderline, but not the second one? The ustomer would not get the goods from the second orderline. Transaction in modern SQL databases can solve this, but if you don’t have transactions you can solfe this problem in otehr ways. Let’s assume the code in the ERP for processing this interface looks like this:

for order in query('SELECT * FROM order_head'):
    for orderlines in query("SELECT * FROM orderline
                             WHERE orderid='%d'" % order.id):
        # do something with order and orderline
    query("UPDATE orderline SET processed=True
           WHERE orderid='%d'" % order.id)

This means that orderlines are never read unless there is also an header with the respective ID. So if we write the orderlines first we are save from them beeing read if tey are not ready yet. So by reordering our code we can make it already much more robust:

next_id = query('SELECT MAX(id) FROM order_head')
query('INSERT INTO orderline (orderid | item_no | quantity)
       VALUES(%d, '10105', 80)" % next_id)
query('INSERT INTO orderline (orderid | item_no | quantity)
       VALUES(%d, '14600', 12)" % next_id)
query('INSERT INTO order_head (id, customer_no, delivery_date, processed)
       VALUES(%d, '12345', '2001-01-01', False)" % next_id)

This already avoids the race condition “reading while writing”. But what if two processes try to write orders at once? Then next_id = query('SELECT MAX(id) FROM order_head') would result in both processes getting the same next_id and this mixing up two orders into a single one.

The easiest solution is to ensure that only one process can ever write orders. This avoids lot’s of trouble. But it is suprisingly hard to enforce that. So by having code which reduces the risk of dual write race conditions we can make our program much more robust.

One approach would be to use SQL functionality like this to insert the head?

query('INSERT INTO order_head (id, customer_no, delivery_date, processed)
       VALUES(SELECT MAX(id) FROM order_head, '12345', '2001-01-01', False)")

This would ensure that no id is used twice but how would we get the ID we generated for our order? SELECT MAX(id) FROM order_head would not help since some other program might have inserted another record in the meantime.

We are here in an situation where we can’t make the code reliable without resorting to some tricks. In this case the processed field is our solution: Teh application reading the database ignores all records where processed == True. So as long as we set processed to True we can use the other fields for whatever we like. In this case we use the date field to temporary store an unique id identifying the record.

So let’s generate a nice 10 character random value:

token = hex((int(time.time() * 10000)
            ^ (os.getpid() << 16)
            ^ thread.get_ident() << 8)
           % 0xFFFFFFFFFF).rstrip('L')[2:]

If this random value is guaranteed to be unique it’s easy to write a robust insertation code.

def insert_order():
    token = hex((int(time.time() * 10000)
        ^ (os.getpid() << 16)
        ^ thread.get_ident() < 1:
    next_id = query('SELECT MAX(id) FROM order_head')
    query('INSERT INTO order_head (id, customer_no, delivery_date, processed)
       VALUES(%d, '12345', %r, False)" % (next_id, token))
    rowcount = query('SELECT COUNT(*) FROM order_head WHERE id = %d' % next_id)
    if rowcount > 1:
       # the race condition has hit - remove our entry and retry
        query('DELETE FROM order_head WHERE delivery_date = %r' % token)
        time.sleep(random.randint()/100.0)
        insert_order()
    else:
        query('INSERT INTO orderline (orderid | item_no | quantity)
               VALUES(%d, '10105', 80)" % next_id)
        query('INSERT INTO orderline (orderid | item_no | quantity)
               VALUES(%d, '14600', 12)" % next_id)
        # fix up the header
        query('UPDATE order_head SET delivery_date='2001-01-01',
               processed=False WHERE delivery_date=%r' % token)

While this code is much more complex than the original, it can guarantee that no race conditions occur as long as token is unique.

Parsing wi-scan data

Wednesday, November 26th, 2003

wiscanparse.py is a Python module for parsing data in wiscan format as created by many popular WiLDing/Wardriving tools.

Formating ISBNs

Monday, October 27th, 2003

I found out there are strict rules for formatting ISBNs.

There are Implementations in C, Emacs elisp, php and check-digit checking perl.

There are also articles about the structure and country prefixes. Finally there is the official Handbook.

I have ported the implementation from bibclean over to Python done some pythonisations to speed it up and added a check sum validation routine inspired by the perl version.

Download isbn.py.

PyAmazon and international Amazon Webservices

Sunday, October 19th, 2003

This patch to the PyAmazon Module allows you to access th uk, de and jp based amazon webservices. It also allows you to sumbmitt your associate ID to the web services.

Howl, a Rendezvous toolkit.

Wednesday, July 16th, 2003

Howl is a free, cross platform rendezvous/zeorconf implementation which looks reasonably well.

I’ll try to glue python on it to see how it fits.

iChat Proxy

Tuesday, July 15th, 2003

I allways wanted some library which would allow scripts to communicate with a locally running iChat via rendezvous/zeorconf. Think of all the cool applications one could build! Get an iChat message for every new posting in your favorite Weblog, build an iChat-IRC proxy, post to your Weblog via iChat get a message if your Server reboots, your phone rings, etc.

Now I read in a blog posting that there is already some hackery named jChat which implements bits and pices of such a tool. Nifty.