In light of the recent DoD blunder regarding redacted documents, I will try to sum up what I have done in that area so far. The problem is that all kinds of documents contain information which they should not contain. MS Word documents contain previous revisions of a document and information about servers, filenames and authors. PDF files allow redacted text to reappear. JPEGs contain thumbnails of the uncropped original images. Web pages contain local paths. And so on.
I have presented on the topic on various occasions so far.
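As a rough illustration of how little effort it takes to find such leftovers, here is a minimal sketch that scans the raw bytes of any file for things that look like local paths. The regular expression and the function name `find_leaked_paths` are assumptions made up for this example, not part of any of the tools mentioned in these postings, and the handling of 16-bit strings (simply dropping NUL bytes) is deliberately crude.

```python
import re

# Hypothetical pattern for illustration: Windows drive paths, UNC shares,
# and Unix-style home directories, followed by typical path characters.
PATH_PATTERN = re.compile(
    r"(?:[A-Za-z]:\\|\\\\[\w.-]+\\|/(?:home|Users)/)[\w .\\/-]{3,}"
)


def find_leaked_paths(data):
    """Return a sorted list of path-like strings found in raw file bytes."""
    hits = set()
    # Binary formats often store strings both as 8-bit text and as UTF-16.
    # Dropping NUL bytes is a crude way to make 16-bit strings readable;
    # it can also glue unrelated fragments together, so expect false hits.
    for candidate in (data, data.replace(b"\x00", b"")):
        text = candidate.decode("latin-1")
        for match in PATH_PATTERN.finditer(text):
            hits.add(match.group(0).strip())
    return sorted(hits)


if __name__ == "__main__":
    import sys

    # Usage: python find_paths.py some_document.doc
    with open(sys.argv[1], "rb") as handle:
        for path in find_leaked_paths(handle.read()):
            print(path)
```

Anything this prints still needs a human look: the pattern will happily match harmless strings, and it will miss paths stored in compressed or encoded parts of a file.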
Our claim to fame is that we researched the problem with JPEG thumbnails to some degree. We, the RedTeam Pentesting group at RWTH Aachen University, published an advisory on this issue: Advisory: JPEG EXIF information disclosure. The EXIF issue also has a CVE number: CAN-2005-0406. We probably should get a CVE number for the MS Office issues, too. And we have an application for screening thumbnails which is somewhat fun. Some examples of what to expect can be seen here.
In the comments to blog entries regarding the Hidden Data issue, another tool came up besides Richard Smith’s famous WordDumper: WordLeaker. There is also a tool called revisionist by lcamtuf. I haven’t tried either of them so far.
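To make the thumbnail issue concrete, here is a minimal sketch of how an embedded thumbnail can be pulled out of a JPEG file. It is not the screening application mentioned above. It walks the JPEG header segments, looks for the APP1 segment that carries the EXIF data, and then simply searches that segment for a nested JPEG start and end marker. That search is a heuristic and an assumption of this example; a proper tool would parse the EXIF directory structure instead. The function name `extract_exif_thumbnail` is made up for illustration.

```python
import struct


def extract_exif_thumbnail(jpeg_bytes):
    """Return the bytes of a thumbnail embedded in the EXIF segment, or None."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing start-of-image marker)")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break  # lost sync with the segment structure, give up
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:
            break  # start of scan: header segments are over
        # The two-byte big-endian length includes the length field itself.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            # Heuristic: the thumbnail is itself a small JPEG, so look for
            # its start and end markers inside the EXIF payload.
            start = payload.find(b"\xff\xd8")
            end = payload.rfind(b"\xff\xd9")
            if start != -1 and end > start:
                return payload[start:end + 2]
        pos += 2 + length
    return None
```

Writing the returned bytes to a file with a `.jpg` extension and opening it next to the original is usually enough to see whether the thumbnail still shows what was cropped or painted over in the main image.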
Previous postings on the issue can be found at http://blogs.23.nu/disLEXia/topics/HiddenData/
Since I get mails like this from time to time:
I’m writing because I’ve read your slides on “Hidden data in Documents” and I found them very interesting. Unfortuately, I can’t find Word Dumper on the Internet. Can you send me a copy or the URL where I can download it from?
WordDumper was, to my knowledge, never available to the public. It was built by Richard M. Smith for internal use. You can learn more about some of the work by Mr. Smith at his site Computer Bytes Man. When I wrote to him asking about WordDumper he was very kind and mailed me a copy.
I respect Mr. Smith’s decision not to put WordDumper on his web page and thus will not distribute the tool any further. If you would like to have a copy of WordDumper, Mr. Smith is the right person to ask.
If you are interested in WordDumper, the following presentations might also be of interest to you: 1, 2, 3, 4
After my presentation on thumbnails and related problems somebody came to me and told me that there are commercial PDF to MS Office conversion programs (I didn’t know that) and that they have the side effect of removing redaction “bars” from documents. Interesting.
There is a topic I’m somewhat beating to death right now, but it is so fascinating. It seems that most people use the term “Hidden data in Documents” to describe the problem, although this might mislead the audience into expecting steganography. It is really about complex document formats and the unintended information in them. Simon Byers was the first to take a deep look into that problem, in his paper Scalable Exploitation of, and Responses to Information Leakage Through Hidden Data in Published Documents and in an article in IEEE Security & Privacy, Vol. 2, No. 2, pp. 23-27, named “Information Leakage Caused by Hidden Data in Published Documents”. I recently came across a paper in the SANS Reading Room which looks more into how to scrub Office documents.
I myself did a lot of talking on the topic from a broader view, looking not only at MS Office documents but at anything from audio files to mail headers to XML files. I have given presentations on this at Defcon, at the Chaos Computer Club (in German) and at the Aachen Summerschool Applied IT Security so far.
During the Summerschool Steven Murdoch and I started to do large scale research on hidden data in images. Steven did some clever trickery to detect thumbnails which contain something different than the actual images. That was interesting and we really found some nasty stuff. We will present our results at the Chaos Communication Congress (and possibly earlier in my computer forensics class).
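How Steven’s detection actually works is not described here, so the following is only a naive stand-in to show the general idea: shrink the main image to the size of the thumbnail and measure how far the two are apart. The sketch assumes both images have already been decoded into plain 2D lists of grayscale values (0 to 255), which in practice would need an imaging library. The function names and the threshold of 30 are assumptions picked for this example.

```python
def box_downscale(pixels, out_w, out_h):
    """Shrink a 2D list of grayscale values by averaging blocks of pixels."""
    in_h, in_w = len(pixels), len(pixels[0])
    out = []
    for oy in range(out_h):
        y0 = oy * in_h // out_h
        y1 = max(y0 + 1, (oy + 1) * in_h // out_h)
        row = []
        for ox in range(out_w):
            x0 = ox * in_w // out_w
            x1 = max(x0 + 1, (ox + 1) * in_w // out_w)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def mean_abs_difference(a, b):
    """Average absolute difference between two equally sized pixel grids."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def thumbnail_mismatch(image, thumbnail, threshold=30.0):
    """Guess whether the thumbnail shows something else than the image.

    The threshold is an arbitrary value chosen for this sketch.
    """
    shrunk = box_downscale(image, len(thumbnail[0]), len(thumbnail))
    return mean_abs_difference(shrunk, thumbnail) > threshold
```

A comparison like this squeezes both images into the same grid, so a thumbnail of an uncropped original with a different aspect ratio will be compared against a distorted picture. Checking the aspect ratios first is a cheap additional signal.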
None of this is breathtaking stuff. The interesting thing is how people react to it. It seems the legal community sees the importance of this problem and is working hard, at least in its non-technophobic part, to avoid such data. Journalists, on the other hand, are mad at the research community for publishing about it: it seems they have known for some time and fear that they will lose a source of information if organizations start to adopt scrubbing measures for their documents.
Very interesting are the reactions I got from the consulting community, which basically knows no way of communicating besides sending PowerPoint, MS Word and Excel documents: “Oh sure that’s bad but we have other problems to deal with.”
I think that is an approach which ultimately should and will lead to liability. I basically think everybody who processes confidential information using complex document formats like those of MS Office, which are well known for information leakage, should be held liable for any information actually leaked. But beware: while MS Office is certainly one of the worst offenders concerning information leakage, nearly all document formats leak.
This whole subject is an interesting place where theoretical computer science (“covert channels”) meets the real world violently. I’m looking forward to doing more research on that.
At Defcon 12 in Las Vegas I will present on the topic of unwanted data:
Far More Than You Ever Wanted To Tell – Hidden Data In Document Formats
Applications usually put all kinds of information into saved documents besides what you intend to put there. This can lead to embarrassing revelations. We will take a look at different types of application data and what can be hidden in there. This allows us to “scrub” our own documents to avoid unwanted information in there, but also to look for information in documents which the authors didn’t want to hand out. To grasp the scope of the problem we will present a large scale study of hidden information in documents on the Internet.