Wednesday, June 27th, 2007
Programmers Don’t Like to Code – insightful.
So now our document management system works, but to be really enterprisey it has to get waaaay more complex. Good stuff has to be complex, everybody knows that.
Some Ideas:
* Allow searching for a string in all attributes
* Add OpenSearch response elements to our Atom feeds
* Create an OpenSearch description document
* Install it as a search plugin in our users' browsers
* Improve compliance with the Atom Publishing Protocol (APP)
* Look into APP collections again
* Add support for paging in Atom feeds
* Enforce a limited character set in keys, attributes and whatever
* Build unit tests that work without a running server
In the first two parts we designed the basic infrastructure of our Document Management System (Part 1, Part 2). Now let’s see how this all fits together:
So far we have this list of URLs:
POST /documents/ - add a document to the system
GET /documents/ - get a list of recently added documents (Atom feed)
GET /documents/search/{attribute}/{searchterm}/ - get a list of documents where attribute==searchterm (Atom feed)
GET /document/{key}/ - get the document identified by key
GET /document/{key}/metadata.atom - get metadata for the document (Atom feed with a single entry)
(GET /documents/ and GET /document/{key}/metadata.atom haven’t been discussed before but they follow closely the pattern of the other URLs).
This interface allows us to build a very simple (< 200 lines of code) library to put data into the document store and retrieve it again. In web applications we can even link to documents directly, since they are all available under static HTTP URLs. The server is also very simple (< 200 LoC) and spends most of its energy on generating Atom feeds.
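A client library of that size mostly has to know the URL layout. Here is a minimal sketch in Python, assuming the URL patterns listed above. The `DocumentStoreUrls` class, its method names and the base URL are made up for illustration and are not the actual library:

```python
from urllib.parse import quote


class DocumentStoreUrls:
    """Builds the resource URLs of the document store (hypothetical helper)."""

    def __init__(self, base_url):
        # base_url is an assumption, e.g. "http://127.0.0.1:8000"
        self.base_url = base_url.rstrip("/")

    def documents(self):
        # POST here to add a document, GET for the feed of recent documents
        return self.base_url + "/documents/"

    def search(self, attribute, searchterm):
        # Atom feed of all documents where attribute == searchterm
        return "%s/documents/search/%s/%s/" % (
            self.base_url, quote(attribute, safe=""), quote(searchterm, safe=""))

    def document(self, key):
        # The stored document itself
        return "%s/document/%s/" % (self.base_url, quote(key, safe=""))

    def metadata(self, key):
        # Atom feed with a single entry describing the document
        return self.document(key) + "metadata.atom"


urls = DocumentStoreUrls("http://127.0.0.1:8000")
print(urls.search("customerid", "12345"))
```

Everything else in such a library would be plain HTTP requests against these URLs.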
The other nice thing is that we can access this interface interactively: since it uses the Atom syndication format for all its data, and Firefox has decent Atom parsing and display capabilities, we can use Firefox to access the data in the store. See right for an example. The use of XHTML to encode our attributes helped enormously to make this usable in Firefox's feed reader mode, and identifying attributes by their search URLs makes everything nicely clickable.
Unfortunately we are missing a way to search. Let’s say you want to check the documents for customer no. 12345. Sure, you could just put /documents/search/customer/12345/ in the URL line and you are fine. But most people think there is something wrong with manipulating the URL line.
We can help with that by adding a tiny bit of HTML: a front page with input fields to search for the different attributes, and a redirector which redirects clients requesting /search?attname=client&attvalue=12345 to /search/client/12345/. This is needed because HTML forms cannot build URL paths; they can only generate query parameters. (But maybe HTML 5 will be able to do better.)
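The core of such a redirector is a small mapping from query parameters to a path. This is a sketch only: the parameter names `attname` and `attvalue` come from the example above, the target path follows the /documents/search/ pattern, and the function name `redirect_target` is hypothetical. A real server would send the result in the Location header of a redirect response.

```python
from urllib.parse import parse_qs, quote


def redirect_target(query_string):
    """Map 'attname=...&attvalue=...' to the search feed path, or None."""
    params = parse_qs(query_string)
    names = params.get("attname")
    values = params.get("attvalue")
    if not names or not values:
        return None  # incomplete form submission, nothing to redirect to
    # quote() with safe="" keeps a '/' in a value from adding path segments
    return "/documents/search/%s/%s/" % (
        quote(names[0], safe=""), quote(values[0], safe=""))


print(redirect_target("attname=client&attvalue=12345"))
```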
So we add two more URLs:
GET / - display an HTML form.
GET /redirect_to_attributefeed - redirect to /documents/search/{attribute}/{searchterm}/.
Now we have a nice API which can also be navigated quite comfortably with a web browser. See this movie for a demonstration. First I search via the web form and get redirected to an Atom feed for order number 180162. From there I can click through to an Atom feed with all documents for customer id 104685. I can request the scanned document. By manipulating the URL line I get to an overview of recently added documents, click through to the documents for another customer and then to the documents generated by staff member ‘40′. There again I display a document, click through to the documents for customer 14008 and check another document.
All this with only a single page of HTML.
Seems FreeBSD finally gets iSCSI support.
Different output for running the same code:
md$ cat test.py
print id(1000) == id(1000)
a,b = 1000,1000
print id(a) == id(b)
a = 1000
b = 1000
print id(a) == id(b)
md$ python test.py
True
True
True
md$ python
Python 2.5.1 (r251:54863, Jun 14 2007, 15:08:59)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print id(1000) == id(1000)
True
>>> a,b = 1000,1000
>>> print id(a) == id(b)
True
>>> a = 1000
>>> b = 1000
>>> print id(a) == id(b)
False
>>>
(found by chris)
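The likely explanation is a CPython implementation detail: a file is compiled as one unit and equal literals in it share one constant object, while the interactive prompt compiles each statement on its own. The following sketch reproduces the file case with `compile()`. It is written for Python 3 rather than the 2.5 shown above, and it relies on behaviour the language does not guarantee:

```python
# One compilation unit, as when CPython runs test.py as a whole file.
source = "a = 1000\nb = 1000\n"
code = compile(source, "<test>", "exec")

namespace = {}
exec(code, namespace)

# Equal literals in one code object share a single constant, so both
# names end up bound to the same int object (CPython behaviour).
same_object = namespace["a"] is namespace["b"]
print(same_object)
```

Code should therefore never compare numbers with `id()` or `is`; `==` is the portable check.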
After building the basic network protocol for our document store, we find something lacking: the documents we are working with have some metadata: what kind of document we are talking about (e.g. invoice, order) and what attributes this document has (e.g. customer number, order number, invoice number).
The Atom Publishing Protocol provides something called “collections” and something called “categories”. At first glance these seem to map nicely to “document type” and “document attributes”. Unfortunately categories seem to be more like tags, whereas for attributes I need key-value pairs. And from the current Atom draft I don’t fully understand how to create collections. Also, in our document storage application the client defines the document type and should be able to create new types on the fly with minimal hassle.
Just to clarify: We have some documents like this:
* Doc1, Invoice, customerid:12345, invoiceid:23456, orderid:345678
* Doc2, Invoice, customerid:12345, invoiceid:2912345, orderid:345678
* Doc3, Order, customerid:12345, orderid:345678
* Doc4, Offer, customerid:12345, offerid:345566
* Doc5, ProductPhoto, productid:901234
* Doc6, PizzaOrder
Maybe I just didn’t understand how to map this to Atom. For now I decided to go with custom HTTP headers:
X-de.hudora-attributes: {"customerid": "12345", "invoiceid": "23456"}
X-de.hudora-category: Invoice
X-de.hudora-timestamp: 2007-06-22
We use JSON to encode the attributes and a plain string to encode the document type. I also found no way for a client to POST a Last-Modified date to the server, so I crafted my own header. But I guess there is a better way.
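Building and reading these headers takes only a few lines. In this sketch the header names are the ones shown above, while the helper names `build_headers` and `parse_attributes` are hypothetical and not part of the actual client or server:

```python
import json


def build_headers(category, attributes, timestamp):
    """Client side: return the custom request headers for a new document."""
    return {
        # sort_keys only makes the output stable; the server does not need it
        "X-de.hudora-attributes": json.dumps(attributes, sort_keys=True),
        "X-de.hudora-category": category,
        "X-de.hudora-timestamp": timestamp,
    }


def parse_attributes(headers):
    """Server side: decode the JSON attribute header back into a dict."""
    return json.loads(headers.get("X-de.hudora-attributes", "{}"))


headers = build_headers(
    "Invoice", {"customerid": "12345", "invoiceid": "23456"}, "2007-06-22")
print(headers["X-de.hudora-attributes"])
```

A document without attributes (like the PizzaOrder above) simply has no attribute header and decodes to an empty dict.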
Now that we can store attributes on the server, we need a way to retrieve them. We define a /documents/search/{attrname}/{attrvalue}/ resource represented by Atom-formatted documents.
GET /documents/search/customerid/12345/ HTTP/1.1
Host: 127.0.0.1:8000
This gives us an atom feed with all documents where customerid=12345:
HTTP/1.0 200 OK
Content-Type: application/atom+xml;charset=utf-8
Content-Length: 22128
<feed xmlns="http://www.w3.org/2005/Atom">
<title>DoDoStore</title>
<id>tag:id.23.nameu,2007-05-01:/tauzero/search/customerid/12345/</id>
<author>
<name>HUODORA DoDoStore Search for customerid=12345</name>
</author>
<link href="http://.../documents/search/customerid/12345/" rel="self"/>
<entry>
<title>Document 1 (2007-06-21)</title>
<id>tag:id.23.nameu,2007-05-01:/tauzero/703...ea1/</id>
<category>Invoice</category>
<link href="http://.../document/703...ea1/" type="text/plain"/>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml" name="703...ea1">
</div>
</content>
</entry>
<entry>
[...]
</entry>
<updated>2007-06-22T06:11:39Z</updated>
</feed>
Now we still need a way to represent the attributes in our Atom entries. One simple way is to just use XHTML to represent the data. That keeps it readable in a browser. Call it a microformat. We drop the following XHTML into our Atom content elements:
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml" name="703...ea1">
<dl class="attributes">
<dt class="customerid">customerid</dt>
<dd>
<a class="customerid"
href="http://...s/search/customerid/12345/">12345</a>
</dd>
<dt class="invoiceid">invoiceid</dt>
<dd>
<a class="invoiceid"
href="http://...s/search/invoiceid/23456/">23456</a>
</dd>
</dl>
</div>
</content>
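Generating this markup from an attribute dict could look roughly like the sketch below, which uses `xml.etree.ElementTree` from the standard library. The function name `attributes_to_xhtml`, the placeholder key and the base URL are assumptions for illustration; the real server may build its feeds quite differently.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def attributes_to_xhtml(key, attributes, base_url):
    """Build the <div> that goes into the Atom <content type="xhtml">."""
    div = ET.Element("{%s}div" % XHTML_NS, {"name": key})
    dl = ET.SubElement(div, "{%s}dl" % XHTML_NS, {"class": "attributes"})
    for name, value in sorted(attributes.items()):
        dt = ET.SubElement(dl, "{%s}dt" % XHTML_NS, {"class": name})
        dt.text = name
        dd = ET.SubElement(dl, "{%s}dd" % XHTML_NS)
        # Each value links to the search feed for that attribute value
        link = ET.SubElement(dd, "{%s}a" % XHTML_NS, {
            "class": name,
            "href": "%s/documents/search/%s/%s/" % (base_url, name, value),
        })
        link.text = value
    return div


# Serialize XHTML as the default namespace instead of an ns0: prefix
ET.register_namespace("", XHTML_NS)
div = attributes_to_xhtml(
    "example-key",  # placeholder, real keys are generated by the server
    {"customerid": "12345", "invoiceid": "23456"},
    "http://127.0.0.1:8000")  # assumed base URL
print(ET.tostring(div, encoding="unicode"))
```

Because every value is wrapped in a link to its own search feed, a feed reader turns the attributes into navigation for free.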
I’m building a system to manage the documents accumulating in our company, starting with internal (paper-based) documents. We generate a few hundred of them every day.
The strategy is: scan them, throw the paper away, keep the scanned data forever. Then OCR it well enough to find out which document it actually is (see here, here and here, all in German) and drop it with the appropriate metadata into a permanent store.
So how to construct that storage? It should be networked. It should be able to spread over several hard disks and servers. So we probably need a client-server architecture and a network protocol. We use HTTP because this Web thingy is all around, totally rocks and was recently upgraded to version 2.0. Seemingly RESTful application design and the Atom Publishing Protocol are the way to go for content, if you want to play with the cool kids.
All in all the idea of Atom fits well with a document store: Atom is about documents which have an author, a publication date and so on.
So let’s get wild and just code away. We post new documents to /documents/:
POST /documents/ HTTP/1.1
Host: 127.0.0.1:8000
Content-type: text/plain
Content-Length: 9

blablafoo
If we do so we get back an Atom Document describing the just created entry:
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Document 31 (2007-06-21)</title>
<id>tag:id.23.nameu,2007-05-01:/f/1a3...b73</id>
<author><name>HUODORA DoDoStore</name></author>
<link href="http://.../document/1...3/metadata.atom" rel="self"/>
<entry>
<title>Document 31 (2007-06-21)</title>
<id>tag:id.23.nameu,2007-05-01:/e/1a3e270d7e8b73/</id>
<published>2007-06-21T21:58:41Z</published>
<link href="http://.../document/1a3...b73/" type="text/plain"/>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml" name="1a3...b73">
</div>
</content>
</entry>
<updated>2007-06-21T21:58:41Z</updated>
</feed>
Based on the Atom entry we now know where to request the document we have just posted:
$ curl http://127.0.0.1:8000/document/1a3...b73/
blablafoo
Voilà! New documents are POSTed to /documents/ and afterwards you can get them from /document/{id}/.
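From Python the same POST can be prepared with the standard library. This is a sketch, not the actual client: the helper name `build_post_request` is hypothetical, and the request is only built here, since sending it needs a running server.

```python
import urllib.request


def build_post_request(base_url, body, content_type="text/plain"):
    """Prepare (but do not send) the POST that adds a document."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/documents/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_post_request("http://127.0.0.1:8000", b"blablafoo")
print(request.get_method(), request.full_url)

# Sending it needs a running server:
# with urllib.request.urlopen(request) as response:
#     atom_document = response.read()
```

The response body would be the Atom document shown above, from which the client picks the link to the stored document.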
This

xmllint --format ugly.xml
is converted to that:

Probably it is already installed on your system without you knowing.