Recently I was looking at how /dev/urandom was implemented in the linux kernel. So I looked at drivers/char/random.c. While I was reading that code I ran into the following:
…
static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
size_t nbytes)
{
ssize_t ret = 0, i;
__u8 tmp[EXTRACT_SIZE];
…
while (nbytes) {
…
extract_buf(r, tmp);
i = min_t(int, nbytes, EXTRACT_SIZE);
if (copy_to_user(buf, tmp, i)) { … }
…
}
…
}
…
static ssize_t
urandom_read(struct file * file, char __user * buf,
size_t nbytes, loff_t *ppos)
{
return extract_entropy_user(&nonblocking_pool, buf, nbytes);
}
urandom_read() is a direct read() hook that get’s called from vfs_read() (which in turn gets called from sys_read()). The bug here is pretty obvious. There is size_t truncation here:
i = min_t(int, nbytes, EXTRACT_SIZE);
At this point I was thinking “WTF ? this is wrong, how can this be in /dev/urandom code !!!!”. I remembered that there are some range checks in vfs_read() and hence I looked in vfs_read():
/*
* rw_verify_area doesn’t like huge counts. We limit
* them to something that fits in “int” so that others
* won’t have to do range checks all the time.
*/
#define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)
int rw_verify_area(int read_write, struct file *file, loff_t *ppos, size_t count)
{
…
loff_t pos;
if (unlikely((ssize_t) count < 0))
goto Einval;
pos = *ppos;
if (unlikely((pos < 0) || (loff_t) (pos + count) MAX_RW_COUNT ? MAX_RW_COUNT : count;
Einval:
return -EINVAL;
}
ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
ssize_t ret;
…
ret = rw_verify_area(READ, file, pos, count);
if (ret >= 0) {
count = ret;
…
ret = file->f_op->read(file, buf, count, pos);
…
}
return ret;
}
(actually I was told about this code by linus torvalds himself, after I reported a ppc specific bug in /proc. I believe his (almost) exact words were (I lost the email) “We added this code in 2.6. to prevent integer abuse in read/write hooks, I believe it got backported to 2.4 aswell”. And it did get backported to 2.4, I checked. The reason I stumbled on this bug (it’s in src/arch/ppc/kernel/htab.c proc_dol2crvec() btw, bonus points for whoever can spot signedness issue) because of a capture the flag game at HITB bahrain last year. The game got pretty boring so we decided to own one of the organisers laptop (hi Dhillon !) WITH PERMISSION! He was using an iBook with some linux distro on it, and using the 2.6.10 kernel at the time. That kernel didn’t have the range checks yet and hence the proc_dol2crvec quickly gave us read access to most kernel memory. We eventually ended up rooting it because of that bug. Which brings me to my point, that all of these horrible read/write bugs in drivers were in fact exploitable (and were exploited) on linux prior to just adding the range checking code in rw_verify_area and hence should still have gotten fixed !.)
Hm, wandered off there for a bit, let’s get back to the code in rw_verify_area(). What it basicly does is check for:
- size not negative
- offset nog negative
- size + offset not negative
and then TRUNCATES size_t -> signed integer, see the code comment.
This is shocking imo! But done for security reasons, which is certainly a noble thing. But this hides horrible bugs. So in fact it does the opposite. It sends out a message that “It’s ok to write crappy kernel code, we’ve got code in place to mitigate”. Anyone who ever looked at some of the drivers in the linux kernel knows there is a huge pile of low quality code in there, and a lot of copypasting is done (let’s call it infecting other code with low quality code). So even if you have crappy code that’s covered by the range checks and truncation in rw_verify_area() and vfs_read() someone might copypast your code in some place where’s it’s not covered by these range checks and you’ve got a big fat security hole on your hands. Not to mention that there might be another codepath to read() hooks in drivers thru some ioctls (where it doesn’t go thru vfs_read()).
What this means in reality is that on systems where sizeof(size_t) > sizeof(int) (mostly 64 bit architectures) that you only get to read or write at most 2 gigabytes of data (in a single call to read() or write()), even though you might want to read or write more then that. Which is INSANE! There is no need for this. Obviously you could try to make a case for this and state that you should do this in a loop of 2gig reads or write’s at a time. But this is kind of silly. Because it adds a needless loop (there is already such a loop in place in the kernel most of the time, in the /dev/urandom case for example) and more importantly it adds syscall overhead !
I’m sure by now some people are going “But you could use mmap, USE MMAP !” NO, you can’t. Hm, well, not always. See, there are non-regular files out there that don’t let you mmap. /dev/urandom for example, or /proc/pid/mem (Zalewksi++). But wait, it get’s worse, write() is covered by the same restrictions as read() in rw_verify_area() and therefor you can only write 2 gigs at a time. Not something you really want (thanks fefe).
I bet there are apps out there that actually break on this! there have to be, with all the crap that people put out there there’s gotta be 1 or 2 applications that go up in flames because of this!
if you want to add some range checks, then check if the count given is int truncation. It breaks functionality !