Builder’s Guide to Shipping Containers

February 14th, 2010

The ISO shipping container is a terrific piece of hardware: cheap (500-3000 €), sturdy (the structure can carry 200-350 tonnes), corrosion resistant (usually made of Corten steel) and good looking.

Containers are terrific for building things. It is an urban legend that “it’s cheaper to scrap them than to send them back to Asia” (hint: the ships have to sail back to Asia anyway), and new containers are produced in China or Korea. Still, used containers are cheap and easy to get in Europe and the US.

Things you can’t do with shipping Containers

All these great features come with a bunch of restrictions. Containers come with 8 corner casings (usually called corner castings), which are the only points where a shipping container is meant to be anchored. This means a container can only rest on its 4 bottom corners and can only support loads from above on its 4 top corners.

Everything else on a container is not meant to carry any significant load.

This means you can only stack containers in configurations where all four bottom corner casings of the upper container are supported by four top corner casings of lower containers.

This makes many desirable stacking configurations impossible from a stability standpoint. See the examples on the right.

Nearly all configurations where the upper container is rotated by 90 degrees are impossible. Putting 20′ containers on 40′ containers is in most cases impossible.

Putting a 40′ container on a single 20′ container is impossible. Most overhang configurations are impossible.

Now you know the reason why there is so little creativity in professional Container Stacking.

What works is putting a 40′ container on two 20′ containers, but there is little reason to do that.
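The corner rule can be written down as a small check. The following is only a rough sketch and no engineering tool: it models each container as a rectangle on the ground plan, using the nominal ISO outer dimensions (6058 mm and 12192 mm long, 2438 mm wide). The function names and the 10 mm tolerance are assumptions made up for illustration.

```python
# Sketch: does every bottom corner of the upper container sit on a top corner
# of a lower container? Dimensions are nominal ISO outer sizes in millimetres.
LENGTH_MM = {20: 6058, 40: 12192}
WIDTH_MM = 2438
TOLERANCE_MM = 10  # assumed tolerance, purely illustrative


def corners(size_ft, x=0, y=0, rotated=False):
    """Return the four corner positions of a container placed at (x, y)."""
    length, width = LENGTH_MM[size_ft], WIDTH_MM
    if rotated:  # turned by 90 degrees
        length, width = width, length
    return [(x, y), (x + length, y), (x, y + width), (x + length, y + width)]


def can_stack(upper, lowers):
    """True if all four corners of `upper` rest on corners of `lowers`."""
    supports = [corner for lower in lowers for corner in lower]

    def supported(corner):
        return any(abs(corner[0] - sx) <= TOLERANCE_MM and
                   abs(corner[1] - sy) <= TOLERANCE_MM
                   for sx, sy in supports)

    return all(supported(corner) for corner in upper)


# Two 20-footers end to end with a 76 mm gap (12192 - 2 * 6058) carry a 40-footer.
two_twenties = [corners(20), corners(20, x=6058 + 76)]
print(can_stack(corners(40), two_twenties))                 # True
print(can_stack(corners(40), [corners(20)]))                # False
print(can_stack(corners(20, rotated=True), [corners(20)]))  # False
print(can_stack(corners(20), [corners(40)]))                # False
```

The check only looks at corner positions. It says nothing about loads, wind or securing, which is what the rest of this guide is about.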

The Roof Problem

While the floor of a container is meant to carry loads of more than a ton per square meter, the roof is little more than weather protection and is not meant to carry more than a few dozen kg per square meter. See this video of me seesawing my toes and the effect on the roof.

Walking on the roof is slightly uncomfortable for a single person, and for a group it is somewhere between scary and dangerous. The high-quality steel sheeting of the roof will not break, but the whole thing is very shaky.

Also, the roofs of pre-owned containers are usually not flat anymore. Being the most fragile part of the structure, they have a tendency to bend and get into a wobbly state.

So if you plan to walk on them regularly, you need to put additional flooring on top of the containers to distribute the load and stiffen the roof. We used two layers of 1 cm plywood, which is fine, but the walking experience is still not totally solid.

If you plan to put heavier loads on the roofs, you probably need to discuss this with a structural engineer. An approach we were contemplating was to add an additional layer of containers on top and remove roof and walls from them to keep only the ultra-rigid container floor. Puma City did this for their balcony.

Steel on Steel

You might or might not be aware that steel on steel usually makes an excellent plain bearing. This means that corner casings on top of each other result in an astonishingly easy movability of the upper container. As in “slam the door and it nearly falls down”. Even a piece of cardboard between the corner casings helps somewhat. But in the long run you need to fix your containers onto each other.

So it’s imperative that you fix the upper container to the lower containers. The best way is to use twistlocks as the pros do. Marine-grade twistlocks can secure a stack of 4 fully loaded containers on a rolling ship without additional help. (Usually containers are stacked 6-8 levels high, which results in the need for additional securing.) We bought our twistlocks at Willi Wader Group.

You might be tempted to save the expense of twistlocks (15-30 € per lock) and instead screw or weld the containers together. But keep in mind that twistlocks are mass produced to rigid quality standards, while it is hard to assess the integrity of your home-grown connection approach.

Doors

Doors of shipping containers are hard to open by design. Even on a brand new container you have to operate two handles at once, with considerable strength and at different speeds, to open and close a container. And you cannot open the left door unless the right door is already open (although you might be able to change that with a welding torch, a knife and 20 minutes).

On a pre-owned container that is 10 years or so old it is usually much harder to operate the doors, because the frame of the door and the container itself are always bent to a certain degree. These doors cannot replace a normal door if you are planning to pass through them several times a day.

If you receive more than one container, sort them by door quality. If you are able to choose the containers before purchase, always check the doors.

Structural integrity

Shipping Containers are unibody constructions. This means the whole hull is used to contribute to load distribution.

In case you didn’t suspect: this means bad things™ happen if you cut substantial holes in the hull.

If you spot something with big openings, it is not a shipping container. Or it is a shipping container with extensively upgraded structural components, which is possible but defeats the purpose of “standardized, cheap, reusable”.

Even if you leave the door open, the container is less sturdy than one with a closed door. Most containers seem to be able to withstand removing the back wall opposite the door, but I wouldn’t like to have any load on such a container.

We opted for leaving about 30 cm on the top and about 15 cm at the sides to keep some reinforcement. But in our application we are quite sure that there are only minimal shear forces. I would not feel comfortable in removing the back walls in any container which would have to carry substantial load.

If you remove one of the longer walls your container is toast. Or at least it has as much integrity as toast. We removed a complete wall of a container and found out that the container lost all stability.

While all containers conform to rigid standards (ocean carriers hate delays due to broken containers) not all containers are created equal. Some are a little less unibody by having additional reinforcement along the top. Use this type for cutting holes in the sides – they even can keep a somewhat stable roof when a whole wall is removed.

What feels like a relatively comfortable solution is removing about 20% of the side wall at the right and at the left end while leaving the upper 30 cm of the wall intact. The result is still a very stable container.

Where to get containers and what to get

You can get containers at shipping companies and at specialized container providers. The only ISO containers we were able to get at decent prices were 20′, 40′ and 40′ “high cube” containers. Keep in mind that the “modular space” container type is very different from shipping containers and usually much less sturdy.

We bought containers that were still seaworthy, with the next CSC inspection scheduled for 2011 or 2012. The containers were commissioned 1994-1997 and considered “C” quality (on an A-B-C scale) by the shipping company. Because of the doors and possible holes (in C quality containers or in containers that are not seaworthy) you should inspect the containers before buying if possible.

The floors differ a lot between containers and are usually contaminated by pesticides. Plan to remove them if humans will spend serious time inside. Some of our containers were freshly painted on the inside, and it took several weeks of venting to get the smell down to a bearable level.

How to move the containers

Usually the party selling you the containers has the equipment for hauling them to your site. The first issue is getting them off the truck and the second issue is getting them where you want them.

We rented an 8 ton forklift for all the handling, which was adequate but not overpowered for the task. When carrying two containers at once (nominal weight 5 tons) the machine had to work hard. While I consider myself a seasoned forklift driver (30 years of experience), I had to learn that a huge forklift with a huge container is something very different from the kind of forklift you drive in warehouses. A container has lots of momentum and you see absolutely nothing in the direction you are driving. For a 20′ container assume that every corner with less than 8 m of space requires artistic driving capabilities. We had to get around a 6.5 m corner and through a 4 m door. We handled that by carrying the containers between the big 8 t forklift and one of our own 1.5 t warehouse forklifts.

When renting a forklift, be sure to get one which can spread its forks wide enough to fit into the fork pockets at the belly of a container. They are 2.5 m or so apart.

Big forklifts can only operate on relatively even ground. Alternatively you can use a crane. The main problem with a crane, especially with 40′ containers, is that you must not introduce big horizontal forces through the attachment cables running from all four corners to the hook. Since you probably don’t have a spreader available to distribute the load, you either need a very tall crane or two cranes. Your crane operator can probably tell you more.

Placement

Containers are built to stand on their four lower corner casings. If you don’t have a flat, sturdy surface to put them on, you should build a foundation at the four corners. Insert twistlocks directly into the foundation during pouring for perfect stability.

Containers are not insulated against heat, cold, noise and vibrations. If you insulate them on the outside you lose the feature of the weatherproof shell, and if you insulate them on the inside your rooms get very small. It’s also hard to make the connection between two containers waterproof. Therefore it seems very popular to put the containers into another building that provides heating and shelter.

Further Reading

More information on the real world issues of container construction can be found at 10 Things to Consider in Using Shipping Containers for Your Next Project. Preston Koerner has good points, although we were able to get the containers much cheaper than his estimate of 2000-4000 US$. We paid less than 1000 € per 20′ container (including delivery).

Running a CouchDB cluster on Amazon EC2

December 20th, 2009

CouchDB is a nearly zero-configuration, multi-master, document-oriented database. It is an awesome product built by an awesome team.

So far I have been using CouchDB like we would have used any other modern document datastore: in a centralized fashion. One server at our premises. For backup purposes we replicated to a second CouchDB instance running on our backup server.

Hosting about 300 GB of data on a small 2.6 GHz server with consumer-grade disks, we started seeing performance issues. We also see latency issues, since we are hosting some applications at Amazon EC2 “in the cloud”, which results in an additional 40 ms delay for all queries to our locally hosted server.

So this is the right time to use some more of CouchDB’s capabilities and spin up additional instances on demand at Amazon EC2. I assume you have already set up an Amazon EC2 account and are comfortable with the general concepts.

There are some tutorials out there which treat EC2 like a regular hosting provider. This is a seriously misguided approach. If you don’t use EC2 in a way that lets you lose one or two instances at any time, you are using it wrong. If setting up 10 instances takes you longer than setting up one, you are using it wrong.

To use EC2 as it is meant to be used, we need automation. We will use Puppet in this example.

I assume that you have installed a “puppetmaster” on a machine called puppet.example.com. I also assume the Puppet configuration on the puppetmaster is at /etc/puppet. On BSD it might be located at /usr/local/etc/puppet instead. Place the following content at /etc/puppet/files/etc/couchdb/local.ini:

[couchdb]
database_dir = /mnt/couchdb
view_index_dir = /mnt/couchdb

[httpd]
bind_address = 0.0.0.0

[couch_httpd_auth]
require_valid_user=true

[admins]
admin = sekrit

This ensures that only clients which authenticate as user “admin” with the password “sekrit” are allowed to access the server. You might want to change “sekrit” to something more subtle.

Add /etc/puppet/fileserver.conf to make sure the local.ini file can be moved to the clients:

[files]
  path /etc/puppet/files
  allow *

Then add /etc/puppet/manifests/site.pp to allow automatic installation and configuration:

class couchserver {
  package { "couchdb": ensure => latest }
  package { "python-couchdb": ensure => installed }
  group { "couchdb": ensure => present }
  user { "couchdb": ensure => present, groups => "couchdb",
    comment => "CouchDB Administrator",
    home => "/mnt/couchdb" }
  file { "/etc/couchdb": ensure => directory,
    owner  => couchdb, group  => couchdb,
    mode   => 755 }
  file { "/mnt/couchdb": ensure => directory,
    owner  => couchdb, group  => couchdb,
    mode   => 700 }
  file {"local.ini":
    mode => 774,
    owner => couchdb, group => couchdb,
    path => "/etc/couchdb/local.ini",
    source =>
      "puppet://puppet.example.com/files/etc/couchdb/local.ini"
  }
  service { couchdb:
    ensure    => running,
    subscribe => [Package[couchdb],
                  File["local.ini"],
                  File["/mnt/couchdb"]]
  }
}

node "PLACEHOLDER" {
    include couchserver
}

Now we have to create an Amazon “security group” to firewall our CouchDB servers. Since I like the belt and suspenders way of doing things, we will not only use HTTP auth in CouchDB but also firewall rules. You need to have the EC2 command line tools installed. I assume your company has a public IP range at 17.18.19.0/24.

$ ec2-add-group couchserver -d 'couchdb server'
$ ec2-authorize couchserver -P tcp -p 5984 -s 17.18.19.0/24

The next step is starting an EC2 instance. We use a small Ubuntu 9.10 AMI since it comes with a decent version of CouchDB. We then log in and install Puppet.

$ ec2-run-instances ami-a62a01d2 --key YOUR_EC2_SSH_KEY \
  --instance-type m1.small --region eu-west-1 \
  --group default --group couchserver
INSTANCE       i-ec985e9b   ...
$ sleep 120
# get the id from the output of ec2-run-instances
$ ec2-describe-instances i-ec985e9b
INSTANCE       i-ec985e9b      79.125.56.43   10.227.94.80
# get the ip from the output of ec2-describe-instances
$ ssh -i ~/.ssh/YOUR_EC2_SSH_KEY ubuntu@79.125.56.43
# on the EC2 instance:
$ sudo apt-get update -y
$ sudo apt-get install -y puppet
$ puppetd --test --server puppet.example.com

This will result in an error message about certificates. The Puppet client requested a certificate and you have to sign this certificate at the Puppet server. There is still some room for automation here. Log into the puppetmaster and list the signature requests with puppetca -l. You’ll see the name of your newly created instance. Sign that name by using puppetca -s:

root@puppet:~# puppetca -l
ip-10-20-30-40.eu-west-1.compute.internal
root@puppet:~# puppetca -s ip-10-20-30-40.eu-west-1.compute.internal
Signed ip-10-20-30-40.eu-west-1.compute.internal
root@puppet:~# perl -npe 's/PLACEHOLDER/ip-10-20-30-40.eu-west-1.compute.internal/;' \
   -i.bak /etc/puppet/manifests/site.pp

The last line automatically edits /etc/puppet/manifests/site.pp to contain the configuration information for the new instance. That’s all there is to do on the puppetmaster.

Now back on the new instance you can make puppet configure your CouchDB by typing puppetd --test --server puppet.example.com. This should install CouchDB and configure it to use the “big” 140 GB disk of your instance and to require password authentication.

You can test if CouchDB is up, running and secured by using cURL:

$ curl http://127.0.0.1:5984
{"error":"unauthorized","reason":"Authentication required."}

This is the point in time where we can start replication from our internal, behind-the-firewall CouchDB to the new box running at Amazon. Since there are some issues regarding command line tools and authentication, I created a patched version of python-couchdb at GitHub. Download it from here to a machine in your internal network, untar it and change into the couchdb-python directory. Then initiate replication:

$ PYTHONPATH=. python ./couchdb/tools/manual_replication.py \
  --source=http://couchdb.internal.example.com:5984 \
  --target=http://admin:sekrit@79.125.56.43:5984/ --push

Once this has finished, set up permanent two-way replication between the two servers:

$ PYTHONPATH=. python ./couchdb/tools/manual_replication.py \
  --source=http://couchdb.internal.example.com:5984 \
  --target=http://admin:sekrit@79.125.56.43:5984/ --push \
  --continuous
$ PYTHONPATH=. python ./couchdb/tools/manual_replication.py \
  --source=http://admin:sekrit@79.125.56.43:5984/ \
  --target=http://couchdb.internal.example.com:5984 \
  --continuous
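
If you would rather not depend on the patched tool, replication can also be triggered through CouchDB’s own _replicate HTTP resource, one database at a time. The following is a minimal sketch, assuming a CouchDB version that accepts the continuous flag there. The helper names and the database name “mydb” are placeholders made up for this example.

```python
import json
import urllib.request


def replication_body(source, target, continuous=False):
    """Build the JSON document that is POSTed to /_replicate."""
    body = {"source": source, "target": target}
    if continuous:
        body["continuous"] = True
    return json.dumps(body, sort_keys=True)


def start_replication(server, source, target, continuous=False):
    """POST a replication request to `server` (this performs network I/O)."""
    request = urllib.request.Request(
        server.rstrip("/") + "/_replicate",
        data=replication_body(source, target, continuous).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# "mydb" is a placeholder database name.
internal = "http://couchdb.internal.example.com:5984/mydb"
amazon = "http://admin:sekrit@79.125.56.43:5984/mydb"

# The two request bodies for permanent two-way replication:
print(replication_body(internal, amazon, continuous=True))
print(replication_body(amazon, internal, continuous=True))
```

Calling start_replication("http://couchdb.internal.example.com:5984", internal, amazon, continuous=True) from inside your network would then start the push direction.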

Basically that’s it. We are still missing a few bits and pieces to get full automation, but we are nearly there. And for a cluster you probably want more than one CouchDB instance running at Amazon.

Hoptoad with Django

December 19th, 2009

Hoptoad is a hosted web application for filtering and sorting errors from web applications. It is very much integrated with the Ruby on Rails community.

For Django there are similar applications like django-db-log, but to my knowledge there is no hosted solution for Django. Hosted solutions are nice, because some types of errors might prevent you from saving information in the local database.

But there is django-hoptoad which brings basic hoptoad functionality to Django.

To use it, go to http://hoptoadapp.com, sign up and create a project. Go to “Edit Project” and you should see something like “Current API key: 3c81d132d2f28749eab2043bb4c987a5”.
There seems to be another type of key, the “auth_token”, which is not per project but per user. Don’t use this.

If you use the Lighthouse bugtracker, you can add it to your Hoptoad configuration at this point to integrate the two applications.

Now download django-hoptoad from its homepage and install it on your server.

Then add something like this to your settings.py:

HOPTOAD_NOTIFY_WHILE_DEBUG = True
HOPTOAD_API_KEY = '6c2...b3'
HOPTOAD_NOTIFY_404 = True
HOPTOAD_NOTIFY_403 = True
HOPTOAD_IGNORE_AGENTS = ['Googlebot', 'Yahoo! Slurp']
# list.append() returns None, so extend the setting by concatenation
MIDDLEWARE_CLASSES = list(MIDDLEWARE_CLASSES) + [
    'hoptoad.middleware.HoptoadNotifierMiddleware']
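
One detail is worth keeping in mind when extending MIDDLEWARE_CLASSES: list.append() changes the list in place and returns None, so its result must never be assigned back to the setting. A small sketch in plain Python, with a stand-in tuple instead of your real Django settings:

```python
# Stand-in for whatever middleware your settings.py already lists.
MIDDLEWARE_CLASSES = (
    'django.middleware.common.CommonMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
)

# append() mutates the temporary list and returns None ...
result_of_append = list(MIDDLEWARE_CLASSES).append(
    'hoptoad.middleware.HoptoadNotifierMiddleware')
print(result_of_append)  # None

# ... so build the new setting by concatenation instead.
MIDDLEWARE_CLASSES = list(MIDDLEWARE_CLASSES) + [
    'hoptoad.middleware.HoptoadNotifierMiddleware']
print(MIDDLEWARE_CLASSES[-1])  # hoptoad.middleware.HoptoadNotifierMiddleware
```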

That’s all!

You can now see your issues aggregated by cause, create Lighthouse tickets from them and have peace of mind that your exceptions aren’t lost, even when your server’s hard disks are full.

VPN between a Fritz!Box Fon and racoon/BSD

September 25th, 2009

The Fritz!Box is a nice WLAN/DSL/VoIP/DECT router from AVM with VPN capabilities. They have a VPN information page but no information on setting up a VPN to the KAME IPsec stack. To my understanding that stack is used in OpenBSD, NetBSD and FreeBSD and in some Linux distributions.

I’ll set up a VPN between a BSD router with a static IP address and the Fritz!Box with a dynamically changing IP. Sometimes this is called a “road warrior” setup.

Get a Dyndns.org name for your Fritzbox and configure it.

First you need a configuration file for the Fritzbox. Replace A.B.C.D with the IP of your gateway. Also replace the key with something more secret. phase2localid needs to describe the local net of the Fritzbox. accesslist needs to be the remote (BSD) network. Put your Dyndns Name into localid { fqdn }.

/*
 * C:\fritzbox_kame.cfg
 * Thu Sep 24 23:36:34 CEST 2009
 */
vpncfg {
        connections {
                enabled = yes;
                conn_type = conntype_lan;
                name = "BSD";
                always_renew = no;
                reject_not_encrypted = no;
                dont_filter_netbios = yes;
                localip = 0.0.0.0;
                local_virtualip = 0.0.0.0;
                remoteip = A.B.C.D;
                remote_virtualip = 0.0.0.0;
                localid {
                        fqdn = "example.ath.cx";
                }
                remoteid {
                        fqdn = "A.B.C.D";
                }
                mode = phase1_mode_aggressive;
                phase1ss = "all/all/all";
                keytype = connkeytype_pre_shared;
                key = "sekritt";
                cert_do_server_auth = no;
                use_nat_t = no;
                use_xauth = no;
                use_cfgmode = no;
                phase2localid {
                        ipnet {
                                ipaddr = 172.30.20.0;
                                mask = 255.255.255.0;
                        }
                }
                phase2ss = "esp-3des-sha/ah-no/comp-no/pfs";
                accesslist = "permit ip any 192.168.0.0 255.255.0.0";
        }
        ike_forward_rules = "udp 0.0.0.0:500 0.0.0.0:500",
                            "udp 0.0.0.0:4500 0.0.0.0:4500";
}

Now you have to set up your Unix box. I used FreeBSD, but to my understanding it’s the same with OpenBSD and probably also with NetBSD and some Linux variants. You need to install ipsec-tools and racoon. They might come in two packages or in one, or might already be installed. On my FreeBSD box I added something like this to /etc/rc.conf:

racoon_enable="YES"
racoon_flags="-l /var/log/racoon.log"
racoon_create_dirs="YES"

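The next thing is to save your IPsec “shared secret” (password) somewhere. racoon refuses to use the file if it is readable by group or others, so restrict its permissions:

echo "example.ath.cx sekritt" > /usr/local/etc/racoon/psk.txt
chmod 600 /usr/local/etc/racoon/psk.txt

The last part missing is /usr/local/etc/racoon/racoon.conf: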

# racoon.conf
path pre_shared_key "/usr/local/etc/racoon/psk.txt" ;
log info; # notify info debug;
padding {
        maximum_length 20;
        randomize off;
        strict_check off;
        exclusive_tail off;
      }
listen {
        isakmp A.B.C.D [500]; # add your public IP here
}
timer {
        counter 5;
        interval 20 sec;
        persend 1;
        phase1 30 sec;
        phase2 15 sec;
      }
remote anonymous { # we don't know the peers IP during phase 1
        exchange_mode main, aggressive;
        nonce_size 16;
        lifetime time 140 min;   # sec,min,hour
        initial_contact on;
        proposal_check obey;    # obey, strict or claim
        support_proxy on;
        ike_frag on;
        weak_phase1_check on;
        # important for automatically configuring
        # the Security Policy Database (SPD)
        generate_policy on;
        passive on;
        # Fritz!box
        proposal {
                encryption_algorithm aes;
                hash_algorithm sha1;
                authentication_method pre_shared_key;
                dh_group 2 ;
        }
      }
 # local net - Fritz net
sainfo address 192.168.0.0/16 any address 172.30.20.0/24 any {
	pfs_group 2;
	lifetime time 8 hour;
	encryption_algorithm aes, 3des, des;
	authentication_algorithm hmac_sha256, hmac_sha1, hmac_md5 ;
	compression_algorithm deflate;
}
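
The Fritz!Box file describes networks as address plus netmask, while the sainfo line in racoon.conf uses prefix notation, and both have to describe the same networks. Here is a small sketch for converting one into the other with Python’s standard ipaddress module. The helper name to_cidr is made up for this example.

```python
import ipaddress


def to_cidr(ipaddr, mask):
    """Convert address + dotted netmask into prefix notation."""
    return str(ipaddress.ip_network(f"{ipaddr}/{mask}"))


# phase2localid of the Fritz!Box configuration
print(to_cidr("172.30.20.0", "255.255.255.0"))  # 172.30.20.0/24
# network from the accesslist line
print(to_cidr("192.168.0.0", "255.255.0.0"))    # 192.168.0.0/16
```

ip_network() raises a ValueError if the address has host bits set for the given mask, which catches a typical typo early.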

If you use pf for Packet Filtering you need something like this in /etc/pf.conf:

pass on $ext_if proto udp from any port 500 to A.B.C.D port 500 keep state
pass on $ext_if proto udp from A.B.C.D port 500 to any keep state
pass quick on $ext_if proto { esp ah ipencap } from any to A.B.C.D
pass quick on $ext_if proto { esp ah ipencap } from A.B.C.D to any

This configuration still has issues, but it works.

HighPoint Rocket Raid driver breaks FreeBSD

July 12th, 2009

I run a FreeBSD box with two SuperMicro AOC-SAT2-MV8 SATA controllers providing 8 SATA ports each. Nice setup for a ZFS fileserver.

Unfortunately, since 6.3 or so, FreeBSD comes with the hptrr binary blob driver from HighPoint. The hptrr driver breaks detection of the SuperMicro AOC-SAT2-MV8 controllers (which are handled by the ata driver).

Worse, hptrr is compiled into recent GENERIC kernels and cannot simply be switched off. To disable it you have to edit the kernel configuration, comment out the “device hptrr # Highpoint RocketRAID 17xx, 22xx, 23xx, 25xx” line and rebuild the kernel.

You also have to add the line hptrr_load="NO" to /boot/loader.conf.

Very annoying.

Simple “full text” search with CouchDB

January 19th, 2009

To have a shiny application you need domain specific search. E.g. if our call center wants to enter a new order, they might not have the customer number ready. So they need a snappy way to get the customer number based on name, city or whatever.

We did experiment with lots of LIKE queries against our legacy ERP database system. This didn’t feel good, had some SQL injection vulnerabilities and required lots of full table scans.

Looking for alternatives, we decided to use CouchDB. There is some work on full text indexing for CouchDB, but you can build something much simpler yourself.

Once a day we copy all customer data from the legacy system into CouchDB. Then we use the map function to emit a line for each word in each data field of each document. It looks like this:

function(doc) {
    function output(value) {
        // Split into search terms
        if(value && (value != "-") && (value.length > 2)) {
            emit(value, 1);
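            // for..in would iterate over the array indexes, not the words
            var words = value.split(" ");
            for(var i = 0; i < words.length; i++) {
                var word = words[i];
                if(word && (word != "-") && (word.length > 2)) {
                    emit(word, 1);
                }
            }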
        }
    }
    output(doc.kundennr);
    output(doc.name1);
    output(doc.name2);
    output(doc.ort);
    output(doc.land + "-" + doc.plz);
}

This basically generates a view (index) containing every word and the document it occurs in.

You now can use that for a prefix based search in a function like this:

from couchdb.client import Server

def finde_kundendaten(searchstring):
    server = Server('http://couchdb.local.hudora.biz:5984/')
    db = server['kunden']
    rows = []
    # shorten the search string until the view returns something
    while len(rows) < 1 and len(searchstring) > 2:
        rows = db.view('suche/alle_felder', startkey=searchstring, limit=25)
        if rows:
            break
        searchstring = searchstring[:-1]
    return [(x.id, x.key) for x in rows]

>>> finde_kundendaten("Sport Dornseif") # no exact match in the DB
[(u'51320', u'Sport Alm SysIntersport'),
 (u'27094', u'Sport Freizeit'),
 (u'31071', u'SPORT FREIZEIT  TREFF'),
...]
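
To play with the idea without a CouchDB at hand, the view can be imitated with a sorted list. This is only a sketch: the sample rows and the function names are invented, and unlike the startkey query above it checks that a key really starts with the search string before accepting it.

```python
import bisect


def build_index(docs, fields):
    """Emit (term, doc_id) rows like the map function: whole value plus words."""
    rows = []
    for doc_id, doc in docs.items():
        for field in fields:
            value = doc.get(field)
            if value and value != "-" and len(value) > 2:
                rows.append((value, doc_id))
                for word in value.split(" "):
                    if word and word != "-" and len(word) > 2:
                        rows.append((word, doc_id))
    return sorted(set(rows))


def prefix_search(index, searchstring, limit=25):
    """Shorten the search string until at least one key starts with it."""
    while len(searchstring) > 2:
        start = bisect.bisect_left(index, (searchstring,))
        hits = [row for row in index[start:start + limit]
                if row[0].startswith(searchstring)]
        if hits:
            return [(doc_id, key) for key, doc_id in hits]
        searchstring = searchstring[:-1]
    return []


docs = {  # invented sample data
    "27094": {"name1": "Sport Freizeit", "ort": "Remscheid"},
    "51320": {"name1": "Sport Alm", "ort": "Wuppertal"},
}
index = build_index(docs, ["name1", "ort"])
print(prefix_search(index, "Sport Dornseif"))
```

With this sample data the search string is shortened to “Sport ” before anything matches, and both customers are returned.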

Nifty!

CouchDB: Improving the interval API

December 28th, 2008

I posted this to the couchdb-dev mailing list, but so far it didn’t arrive. So I store it here.

While writing something about using CouchDB I came across the issue of “slice indexes” (called startkey and endkey in CouchDB lingo).

I found no exact definition of startkey and endkey anywhere in the documentation. Testing reveals that for access on _all_docs and on views, documents are returned in the interval

[startkey, endkey] = (startkey <= k <= endkey).

I don’t know if this was a conscious design decision. But I would like to promote a slightly different interpretation (and thus an API change):

[startkey, endkey[ = (startkey <= k < endkey).

Both approaches are valid and used in the real world. Ruby’s two-dot ranges use the inclusive (“right-closed” in math speak) first approach:

>> l = [1,2,3,4]
>> l[1..2]
=> [2, 3]

Python uses the exclusive (“right-open” in math speak) second approach:

>>> l = [1,2,3,4]
>>> l[1:2]
[2]

For array indices both work fine, and which one to prefer is mostly an issue of habit. In spoken language both approaches are used: “Have the software done by Saturday” probably means right-open to the client and right-closed to the coder.

But if you are working with keys that are more than array indexes, then right-open is much easier to handle. That is because with right-closed intervals you have to *guess* the biggest value you want to get. The wiki at http://wiki.apache.org/couchdb/View_collation contains an example of that problem:

It is suggested that you use
startkey="_design/"&endkey="_design/ZZZZZZZZZ"
or
startkey="_design/"&endkey="_design/\u9999"
to get a list of all design documents.

This breaks if a design document is named “ZZZZZZZZZTop” or “\u9999Iñtërnâtiônàlizætiøn”. Such names might be unlikely, but we are computer scientists; “unlikely” is a bad approach to software engineering.

What we really want to ask CouchDB is to “get all documents with keys starting with ‘_design/’”.

This is basically impossible to do with right-closed intervals. We could use startkey="_design/"&endkey="_design0" ('0' is the ASCII character after '/') and this will work fine … until there is actually a document with the key “_design0” in the system. Unlikely, but …

To make selection by intervals reliable, clients currently have to guess the last key (the ZZZZ approach) or use the first key not to include (the _design0 approach) and then post-process the result to remove the last element returned if it exactly matches the given endkey value.
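
The difference is easy to try out on a sorted list of keys. A minimal sketch in Python with invented keys follows. It uses plain string comparison, which is a simplification of CouchDB’s collation; closed_range mimics the current behaviour and half_open_range the proposed one. Both function names are mine.

```python
import bisect

keys = sorted(["_design/Orders", "_design/Tracking", "_design/ZZZZZZZZZTop",
               "_design0", "customer-1"])


def closed_range(keys, startkey, endkey):
    """startkey <= k <= endkey, the current behaviour."""
    return keys[bisect.bisect_left(keys, startkey):
                bisect.bisect_right(keys, endkey)]


def half_open_range(keys, startkey, endkey):
    """startkey <= k < endkey, the proposed behaviour."""
    return keys[bisect.bisect_left(keys, startkey):
                bisect.bisect_left(keys, endkey)]


# Guessing the last key misses "_design/ZZZZZZZZZTop" ...
print(closed_range(keys, "_design/", "_design/ZZZZZZZZZ"))
# ... and using the first excluded key lets "_design0" leak into the result.
print(closed_range(keys, "_design/", "_design0"))
# Right-open: exactly the keys starting with "_design/".
print(half_open_range(keys, "_design/", "_design0"))
```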

If CouchDB changed to a right-open interval approach, post-processing would go away in most cases. See http://blogs.23.nu/c0re/2008/12/building-a-track-and-trace-application-with-couchdb/ for two real world examples.

At least for string keys and float keys changing the meaning to [startkey, endkey[ would allow selections like

* “all strings starting with ‘abc’”
* all numbers between 10.5 and 11

It also would hopefully not break too much existing code. Since the notion of endkey already seems to be considered “fishy” (see the ZZZZZ approach), most code seems to try to avoid that issue. For example, startkey="_design/"&endkey="_design/ZZZZZZZZZ" would still work unless you have a design document named exactly “ZZZZZZZZZ”.

Building a Track and Trace Application with CouchDB

December 28th, 2008

Background

In my company we use a logistics application called huLOG. Part of huLOG’s (award winning) functionality is to aggregate track and trace events from about two dozen sources. The events are schema-free in their nature: some might contain a ZIP code, some may have geographic coordinates attached, some relate to a certain packet or pallet (called “movable unit” in huLOG-speak), some relate to a certain shipment (which is a group of movable units). Some have file attachments, e.g. images of the packages, signatures proving delivery or pictures of the number plates of the trucks.

My first approach was an SQL database. Later I coupled it with a self-designed Document store called DoDoStorage. For background on that project see here, here and here.

In summer 2007 we experienced severe performance problems with DoDoStorage. I started a rewrite in Erlang and came across Damien Katz and his then obscure CouchDB project. It was just in the transition from XML to JSON and Damien assured me that “it will not be production ready for another two years”.

So we decided to solve the DoDoStorage speed issues with more hardware for the time being.

In fall 2008 the landscape for CouchDB had changed: it was now the hot thing in database technology and everybody’s darling. CouchDB 0.9 came around and it started to look usable for serious use. We hired Jan Lehnardt, one of the CouchDB core team, to work with us on moving huLOG from PostgreSQL and DoDoStorage to CouchDB.

In December we started migrating services and data over to a CouchDB based system. While CouchDB has its warts, we are very happy with it so far.

Data Model

As stated above, tracking data comes in many different flavors and colors. It might reference a MUI (“movable unit ID”), a shipment or both. Let’s concentrate on events referencing a MUI. Some typical tracking events might look like this:

{
   "_id": "01420000000378-20061005T064500.000000",
   "mui": "01420000000378",
   "shipment": 572,
   "message": "Processing, 0132-Vlotho, Route 0132, Code 101",
   "code": "410",
   "timestamp": "20061005T064500.000000",
   "facility": "DPD Depot 0132",
   "ort": "Vlotho (DE)",
   "plz": "32657"
},
{
   "_id": "01420000000378-20061005T085400.000000",
   "mui": "01420000000378",
   "shipment": 572,
   "message": "proof of delivery",
   "code": "421",
   "timestamp": "20061005T085400.000000",
   "plz": "32634",
   "_attachments": {
       "POD.pdf": {
           "stub": true,
           "content_type": "application/pdf",
           "length": 136446
       }
   }
}

One thing about my choice of keys (the _id field): I have chosen a “meaningful” ID over traditional “random” UUIDs. The IDs we use have the form MUI-Timestamp, where the timestamp is the time the event was generated (e.g. the pallet was loaded). The good news is that this automatically keeps my database free of duplicates. If I import the same file with events twice, the events will have the same IDs during both imports and thus the second import will overwrite the first one: exactly what I want.

The bad news is that while two events can’t really happen at the same instant physically, clock drift and the like mean I might get two different events for a MUI with exactly the same timestamp. This would result in the same key being generated for both events and thus the second event would overwrite the first one. With the relatively sparsely populated space of timestamps I expect this to happen once in 10,000 events or so.

For huLOG it is acceptable to lose one in 10,000 events, especially if it is the earlier one. We only get about 95% of the events we should get due to problems in the track and trace infrastructure of the freight companies. As long as the all-important “has been handed to the customer” messages arrive, our system has to be able to handle missing messages.
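
The effect of MUI-Timestamp keys on repeated imports and on timestamp collisions can be sketched in a few lines of Python. This is only an illustration: a plain dict stands in for the database, and the helper names make_event_id and import_events are made up for this example and are not part of huLOG or CouchDB.

```python
def make_event_id(mui, timestamp):
    """Build a document ID of the form MUI-Timestamp (hypothetical helper)."""
    return "%s-%s" % (mui, timestamp)


def import_events(store, events):
    """Store events keyed by their ID; a later event with the same ID wins.

    `store` is a plain dict standing in for the database. That is an
    assumption of this sketch: a real CouchDB update of an existing
    document also has to send the document's current _rev.
    """
    for event in events:
        store[make_event_id(event["mui"], event["timestamp"])] = event
    return store


events = [
    {"mui": "01420000000378", "timestamp": "20061005T064500.000000", "code": "410"},
    {"mui": "01420000000378", "timestamp": "20061005T085400.000000", "code": "421"},
]

store = {}
import_events(store, events)
import_events(store, events)  # importing the same file twice adds nothing
count_after_reimport = len(store)

# Two different events with an identical timestamp collide: the later one wins.
# The code "999" is invented for this example.
clash = dict(events[1], code="999")
import_events(store, [clash])
```

After the second import the store still holds two documents, and the colliding event has silently replaced the earlier one, which is the trade-off described above.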

Data Access by MUI

With that we just need a few view functions to work with the data. Most obviously, we want to get all documents for a certain MUI.

This is easy since our document IDs already contain the MUI. We only have to get a list of all IDs starting with that MUI. The CouchDB Document API already provides that functionality:

$ curl -s 'http://localhost:5984/hulog_events/_all_docs?startkey=%2209445147562251-0%22&endkey=%2209445147562253-9%22&include_docs=true'

{"total_rows":184708,"offset":184705,"rows":[
{"id":"09445147562251-20081114T165600.000000",
 "key":"09445147562251-20081114T165600.000000",
 "value":{"rev":"463392510"},
 "doc": {..., "message":"Einrollung, 0147, Route 0280, Code 461 105",
        "mui":"09445147562251", "shipment":124524}},
{"id":"09445147562253-20081114T165700.000000",
 "key":"09445147562253-20081114T165700.000000",
 "value":{"rev":"2436256439"},
 "doc":  {..., "message":"Einrollung, 0280, Route 0234, Code 461 105",
          "mui":"09445147562253", "shipment":124524}}
]}

Some things to note: startkey and endkey must be set to valid JSON values. We want a string containing the MUI, so we have to put “quotes” around it. After URL-encoding we end up with something like %2209445147562251%22.

The returned documents are within the interval [startkey:endkey]. So we have to choose a startkey which is guaranteed to be the same as or smaller than the keys we want to retrieve. That’s easy: 09445147562251 is smaller than any string which has more characters and starts with 09445147562251. The endkey is somewhat more tricky. What is the biggest value? 09445147562252? This might catch one record too many, because there actually might be a key 09445147562252.

A common idiom used in the Erlang community is to use something like 09445147562251Z as the endkey. But what if there actually is a key 09445147562251Zabc? So use 09445147562251ZZZZZ, but …

Another suggestion is using a “high unicode character” like \u9999. But \u9999 is not the highest Unicode code point, and you are well off treating the sorting rules for obscure Unicode characters as nondeterministic. See the CouchDB Wiki for further information.

In our case all this is no problem. The MUI is followed by an ISO 8601 timestamp. So startkey=09445147562251-0, endkey=09445147562253-9 should work fine until the end of year 8999.
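
The key range trick can be sketched in Python as follows. The function names are made up for this illustration, some of the IDs in the list are invented, and Python's string comparison is used as a stand-in for CouchDB's key ordering, an assumption that should hold for these ASCII-only IDs.

```python
import json
from urllib.parse import quote


def mui_key_range(first_mui, last_mui=None):
    """Return (startkey, endkey) covering all event IDs of the given MUIs.

    Relies on IDs of the form MUI-Timestamp where the timestamp starts
    with the first digit of the year, so "-0" sorts before and "-9"
    sorts after every timestamp up to the year 8999.
    """
    if last_mui is None:
        last_mui = first_mui
    return first_mui + "-0", last_mui + "-9"


def as_query_string(startkey, endkey):
    """Encode both keys as JSON strings and URL-encode them."""
    return "startkey=%s&endkey=%s" % (
        quote(json.dumps(startkey)), quote(json.dumps(endkey)))


# Document IDs in sorted order; the first, third and last are invented.
ids = [
    "09445147562250-20081114T165500.000000",
    "09445147562251-20081114T165600.000000",
    "09445147562252-20081114T165650.000000",
    "09445147562253-20081114T165700.000000",
    "09445147562254-20081114T165800.000000",
]
start, end = mui_key_range("09445147562251", "09445147562253")
selected = [i for i in ids if start <= i <= end]
query = as_query_string(start, end)
```

Here selected contains only the IDs for MUIs ending in 51, 52 and 53, and query is the startkey/endkey part of the URL with the quotes already encoded as %22.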

Data Access by Shipment

That was easy. Now we want to access data based on something which is not encoded in the document ID. Say the “shipment” number. For that we need a map function which gives us shipment numbers.

map: function(doc) {
    if(doc.shipment) {
        emit(doc.shipment, null);
    }
}

We permanently save this function in the server to allow CouchDB to play its clever optimization tricks. See the CouchDB wiki for more information on the HTTP View API. You could use the Futon GUI at http://localhost:5984/_utils/ for that or curl:

$ curl -X PUT -H 'Content-Type: application/json' -d '
  {"language": "javascript",
   "views": {"all": {"map": "function(doc){if(doc.shipment){emit(doc.shipment, null);}}"}}}
  ' http://localhost:5984/hulog_events/_design%2fsendung

{"ok":true,"id":"_design/sendung","rev":"23237918"}
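
If typing the JSON by hand gets tedious, the body of such a design document can also be generated. The following is a minimal sketch; build_design_doc is a name made up for this example, and the only structure it assumes is the "language" and "views" layout used in the request above.

```python
import json

# The map function is JavaScript source, stored as one JSON string.
MAP_SHIPMENT = "function(doc){if(doc.shipment){emit(doc.shipment, null);}}"


def build_design_doc(views):
    """Return the JSON body of a design document (hypothetical helper).

    `views` maps a view name to the JavaScript source of its map function.
    """
    doc = {
        "language": "javascript",
        "views": dict((name, {"map": src}) for name, src in views.items()),
    }
    return json.dumps(doc)


body = build_design_doc({"all": MAP_SHIPMENT})
```

Letting json.dumps do the quoting avoids the easy mistake of a stray line break or unescaped quote inside the map function string.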

Now we can query the view with the startkey and endkey parameters. Here we again get into trouble choosing the endkey. The hack we used for the IDs worked because there was a timestamp of well-known format in the ID, so we could construct a “slightly bigger” value. Shipments are numeric. So what is bigger than 128996? 128997 is obviously bigger but would get us documents for two shipments (128996 and 128997). Since the shipments are numeric, a trick like “128996Z” does not work out.

But numeric comparison in JavaScript works nicely across different numeric types. Shipments are integers. If we use a float key, we can place it neatly between the two: bigger than 128996 but smaller than 128997. We use 128996.1 as our endkey.

$ curl 'http://localhost:5984/hulog_events/_view/sendung/all?startkey=128996&endkey=128996.1'

{"total_rows":184707,"offset":184674,"rows":[
{"id":"09445122481336-20081223T105212.491567","key":128996,"value":null},
{"id":"09445122481336-20081223T155200.000000","key":128996,"value":null},
{"id":"09445122481336-20081223T202600.000000","key":128996,"value":null},
{"id":"09445122481337-20081223T105216.377305","key":128996,"value":null},
{"id":"09445122481337-20081223T155300.000000","key":128996,"value":null},
{"id":"09445122481337-20081223T202500.000000","key":128996,"value":null}
]}

You can also add &include_docs=true to the query – this will get you the complete documents, not only the IDs.
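
The float endkey trick is easy to check outside of CouchDB. In this sketch the function names are invented and a plain list of rows stands in for the view; the first and last row are made up to provide neighbouring shipment numbers.

```python
def shipment_key_range(shipment):
    """Return (startkey, endkey) matching exactly one integer shipment number."""
    return shipment, shipment + 0.1


def select_rows(rows, startkey, endkey):
    """Keep rows whose key lies in [startkey, endkey], like a view range query."""
    return [row for row in rows if startkey <= row["key"] <= endkey]


rows = [
    {"id": "09445122481330-20081223T101500.000000", "key": 128995},
    {"id": "09445122481336-20081223T105212.491567", "key": 128996},
    {"id": "09445122481337-20081223T155300.000000", "key": 128996},
    {"id": "09445122481340-20081223T160000.000000", "key": 128997},
]
start, end = shipment_key_range(128996)
matched = select_rows(rows, start, end)
```

Only the two rows for shipment 128996 end up in matched, because the endkey sits strictly between 128996 and 128997.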

At this point we have a scalable, elegant datastore for our track and trace related events and files.

Usage

Currently we are running CouchDB with a subset of our tracking archives. This subset is about a quarter million documents, of which 20% or so have attachments, resulting in a database size of about 5 GB. No complaints so far.

How map/reduce works in CouchDB

December 27th, 2008

I had huge trouble understanding how CouchDB’s system of views actually works.

By experimenting and reading the source I came up with this description in pseudo Python:

def mapstep(alldata):
    # the map is applied to every document
    # and the result is collected in two lists of rows
    k_rows = []
    v_rows = []
    for _id, doc in alldata:
        k, v = mapfun(doc)  # actually mapfun uses emit(), not return
        k_rows.append([k, _id])
        v_rows.append(v)
    return k_rows, v_rows

def reducestep(k_rows, v_rows):
    # now several reduce steps follow. For this example
    # we arbitrarily chose two
    # all elements at even positions
    tmp1 = reducefun(k_rows[::2], v_rows[::2], False)
    # all elements at odd positions
    tmp2 = reducefun(k_rows[1::2], v_rows[1::2], False)

    # finally several rereduce steps follow.
    # For this example we use only one.
    return reducefun(None, [tmp1, tmp2], True)

result = reducestep(*mapstep(alldocs()))

If you call the view with group=true the map step stays the same, but the server applies grouping and calls the reduce step for each group. It looks like this:

def reduce_with_grouping(k_rows, v_rows):
    gdict = {}
    # create dictionary mapping each emitted key to its rows
    for (k, _id), v in zip(k_rows, v_rows):
        gdict.setdefault(k, []).append(([k, _id], v))
    ret = []
    for k, rows in gdict.items():
        # one reduce per group, with the [key, id] rows and values of that group
        ret.append([k, reducestep([kr for kr, v in rows], [v for kr, v in rows])])
    return ret

result = reduce_with_grouping(*mapstep(alldocs()))

If you experiment with views, keep in mind that the Futon web client silently adds group=true to your views and that group=true is ignored if you don’t provide a reduce function.
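
To make the description above concrete, here is a small self-contained simulation with an actual map and reduce function that counts events per shipment. It is only a sketch of my understanding: the names, the sample documents and the chunk size of two are made up, and how the server really splits the rows between reduce and rereduce calls is not modelled.

```python
def map_shipment(doc):
    """Map function: yield (key, value) pairs in place of calling emit()."""
    if "shipment" in doc:
        yield doc["shipment"], 1


def reduce_sum(keys, values, rereduce):
    """Reduce function: add up the values; on rereduce, add up partial sums."""
    return sum(values)


def reduce_in_chunks(rows, reducefun, chunk=2):
    # reduce each chunk of rows, then rereduce the partial results once
    partials = []
    for i in range(0, len(rows), chunk):
        part = rows[i:i + chunk]
        partials.append(
            reducefun([k for k, v in part], [v for k, v in part], False))
    return reducefun(None, partials, True)


def run_view(docs, mapfun, reducefun, group=False):
    # map step: collect ([key, id], value) rows, sorted by key and id
    rows = [([k, _id], v) for _id, doc in docs for k, v in mapfun(doc)]
    rows.sort(key=lambda row: row[0])
    if not group:
        return reduce_in_chunks(rows, reducefun)
    # group=true: one reduce result per distinct key
    grouped = {}
    for (k, _id), v in rows:
        grouped.setdefault(k, []).append(([k, _id], v))
    return [[k, reduce_in_chunks(g, reducefun)]
            for k, g in sorted(grouped.items())]


docs = [
    ("a", {"shipment": 572}),
    ("b", {"shipment": 572}),
    ("c", {"shipment": 573}),
    ("d", {"note": "no shipment"}),
]
total = run_view(docs, map_shipment, reduce_sum)
per_shipment = run_view(docs, map_shipment, reduce_sum, group=True)
```

Without grouping the result is a single number for the whole view; with group=True there is one [key, result] pair per shipment. Because the reduce function may be fed its own output during rereduce, it has to work on partial results as well as on raw values, which a plain sum does.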

CouchDB broke my Box (not)

December 20th, 2008

I tried to see how much beating CouchDB can take. So I installed it on a modest box (1.8 GHz, 512 MB RAM, Debian) and at 12:00h started pouring data into it. At about 22:00h I asked for the computation of a simple view while still dumping data into it. At that time it contained about 400,000 documents, of which about 10% contained an attachment. DB size on disk was about 8 GB.

And an hour later (now) I can’t reach the box anymore.