Archiving and preserving the Internet

I have been trying to archive content (mostly from the Internet) for a long time now. My first attempts were mirroring sites and burning what I considered important to CD. I started a project to scan the CCC's paper-based archive, digitized hundreds of hours of Radio Intergalaktik (it seems the CCC deleted it), and experimented with keeping copies of the sites I surfed to.

I experimented with several archiving proxies, such as Gerald Oskoboiny's system, Autojot, Archiver Proxy, Agent Frank and its predecessor.

It turned out that all of these proxies degraded my Web experience.
So I turned to low-level networking and sniffed the HTTP traffic directly from the network interface by modifying several parts of dsniff.

Since I’m also contemplating making the archive (semi-)public, it turned out to be a problem that password-protected pages were archived as well. I finally gave up on the idea of archiving the data on the fly and decided to use a separate crawler for archiving. Just sniffing the requested URLs from the wire was much easier, but it turned out to be even easier to extract the URLs to archive from the browser history, RSS reader and email archives. The URLs are then sent via XML-RPC to the archiving server, where larbin downloads them and they are archived in the ARC format.
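
To give a rough idea of the hand-off step, here is a minimal sketch in Python and not the actual code from my setup. It pulls http(s) URLs out of a chunk of text (say, an email body) and passes them to the archiving server over XML-RPC. The endpoint address, the method name `archive.submit` and both function names are assumptions made up for this illustration.

```python
import re
import xmlrpc.client

# Deliberately simple pattern: http(s) URLs up to whitespace or a closing delimiter.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return the unique http(s) URLs in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_RE.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def submit_urls(urls, endpoint="http://localhost:8000/RPC2"):
    """Send URLs to the archiving server.

    Both the endpoint and the remote method name 'archive.submit' are
    hypothetical; adapt them to whatever the server actually exposes.
    """
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.archive.submit(urls)
```

The same `extract_urls` helper could be fed with lines from a browser history export or with the item links of an RSS feed; only `submit_urls` touches the network.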
