Wednesday 3 August 2011

Faster files in R

R is fairly slow at reading files. read.table() is slow, scan() a bit faster, and readLines() fastest.
But none of these is anywhere near as fast as other tools that scan through files.

Let us look at an example. I have in front of me a 283M file.

(Small update: the timings were off before. First, because R hashes strings, one has to quit and restart R to get good timings. But I'm not really sure why my timings were off before...)


; time grep hello file
1.79 real 0.23 user 0.27 sys
; time wc file
774797 145661836 297147275 file
2.87 real 2.13 user 0.34 sys
; time cat file | rev >/dev/null
3.58 real 0.02 user 0.35 sys

> system.time({a=readLines("file")})
user system elapsed
25.158 0.829 26.500

> system.time({a=scan("file",what=character(),sep="\n")})
Read 774797 items
user system elapsed
30.526 0.827 31.734

> system.time({a=read.table("file",sep="\n",as.is=T)})
user system elapsed
31.880 0.802 32.837


Sad, isn't it? And what does R do? Process data. So we read in large files all the time.
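A minimal sketch for repeating this kind of comparison on your own machine. The test file, its contents and the helper name time.reader are made up for illustration; the file is far smaller than the 283M one above, so the absolute numbers will differ.

```r
# Build a small throwaway text file (hypothetical data, not the file timed above)
fname <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d hello world", 1:200000), fname)

# Hypothetical helper: elapsed seconds for one call of a reader function
time.reader <- function(reader) {
  system.time(reader(fname))[["elapsed"]]
}

timings <- c(
  readLines  = time.reader(function(f) readLines(f)),
  scan       = time.reader(function(f) scan(f, what = character(), sep = "\n", quiet = TRUE)),
  read.table = time.reader(function(f) read.table(f, sep = "\n", as.is = TRUE))
)
print(timings)

unlink(fname)
```

As noted in the update above, each measurement is best taken in a freshly started R session.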

But beyond readLines(), R has deeper routines for reading files. There are also readChar() and readBin().

It turns out that using these, one can read in files faster.

Here is the code: (The second version further down is much better...)

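my.read.lines <- function( fname, buf.size=5e7 ) {
  s = file.info( fname )$size      # bytes still to be read
  in.file = file( fname, "rb" )    # readChar() expects a binary-mode connection
  buf = ""                         # carries a partial last line between chunks
  res = list()
  i = 1
  while( s > 0 ) {
    n = min( c( buf.size, s ) )
    buf = paste(buf, readChar( in.file, n ),sep="" )

    s = s - n
    r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
    n = nchar(buf)
    if( substr(buf,n,n)=="\n" ) {
      # the chunk ends on a line boundary: keep every line
      res[[i]] = r
      buf = ""
    } else {
      # the last piece is an incomplete line: carry it into the next chunk
      res[[i]] = head(r,-1)
      buf = tail(r,1)
    }
    i = i+1
  }
  close( in.file )
  # append the leftover only if the file did not end with a newline
  res = as.character( unlist(res) )
  if( nchar(buf) > 0 ) c( res, buf ) else res
}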




The gain is not amazing:

> system.time({a=my.read.lines("file")})
user system elapsed
22.277 2.739 25.175


How does the code work?
 buf = paste(buf, readChar( in.file, n ),sep="" )
reads the file. We read the file in big chunks, by default 50MB at a time. The exact size of the buffer doesn't matter - on my computer, 1MB was as good, but 10k much slower. Since the loop over these chunks is done inside R code, we want the loop to have not too many iterations.
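To make the effect of the buffer size concrete, here is a small sketch that reads a file with readChar() in fixed-size pieces and counts the loop iterations for several buffer sizes. The helper count.chunks and the test file are hypothetical, and the file is assumed to be plain ASCII so that bytes and characters coincide.

```r
# Throwaway ASCII test file, written in binary mode so line ends are plain "\n"
fname <- tempfile()
con <- file(fname, "wb")
writeLines(rep("0123456789", 10000), con)
close(con)
size <- file.info(fname)$size

# Hypothetical helper: number of readChar() calls needed for a given buffer size
count.chunks <- function(fname, buf.size) {
  s <- file.info(fname)$size
  con <- file(fname, "rb")
  on.exit(close(con))
  iterations <- 0
  while (s > 0) {
    n <- min(buf.size, s)
    chunk <- readChar(con, n, useBytes = TRUE)   # read at most n bytes
    s <- s - n
    iterations <- iterations + 1
  }
  iterations
}

buf.sizes <- c(1e3, 1e4, 1e5, 1e6)
iters <- sapply(buf.sizes, function(b) count.chunks(fname, b))
print(data.frame(buf.size = buf.sizes, iterations = iters))

unlink(fname)
```

Each iteration is one pass through interpreted R code, which is why a very small buffer costs so much more than a large one.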

We then split the file using
r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
The fixed=T parameter makes strsplit faster, because the separator is matched as a literal string rather than as a regular expression.
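A small sketch of what fixed=T changes, using a made-up sample string: for a plain "\n" separator both variants give the same pieces, and only the matching method differs. The two system.time() calls at the end are there for comparing the cost on your own machine; no numbers are claimed here.

```r
# Made-up sample: 100000 identical lines joined by newlines
x <- paste(rep("some text on a line", 1e5), collapse = "\n")

r.regex <- strsplit(x, "\n")[[1]]                 # separator treated as a regular expression
r.fixed <- strsplit(x, "\n", fixed = TRUE)[[1]]   # separator treated as a literal string

identical(r.regex, r.fixed)   # same result either way
length(r.fixed)

# Compare the cost of the two variants
system.time(for (k in 1:5) strsplit(x, "\n"))
system.time(for (k in 1:5) strsplit(x, "\n", fixed = TRUE))
```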

The rest of the function deals with preserving the part of the buffer that might have a partial line, because we read in constant-sized chunks. (I hope this is done correctly)
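The carry-over logic can be checked in isolation. The sketch below is not the function itself: it applies the same idea to chunks that are already in memory, so that the chunk borders can be placed in the middle of lines on purpose. The helper my.split.chunks and the sample text are made up for illustration.

```r
# Hypothetical helper: split a sequence of in-memory chunks into lines,
# carrying an incomplete last line over into the next chunk
my.split.chunks <- function(chunks) {
  buf <- ""
  res <- list()
  for (i in seq_along(chunks)) {
    buf <- paste(buf, chunks[i], sep = "")
    r <- strsplit(buf, "\n", fixed = TRUE)[[1]]
    n <- nchar(buf)
    if (substr(buf, n, n) == "\n") {
      res[[i]] <- r             # chunk ends on a line boundary
      buf <- ""
    } else {
      res[[i]] <- head(r, -1)   # complete lines only
      buf <- tail(r, 1)         # partial line waits for the next chunk
    }
  }
  out <- as.character(unlist(res))
  if (nchar(buf) > 0) c(out, buf) else out
}

text <- "alpha\nbeta\ngamma\ndelta\n"
# Cut the text into 4-character chunks so that lines straddle chunk borders
starts <- seq(1, nchar(text), by = 4)
chunks <- substring(text, starts, pmin(starts + 3, nchar(text)))

my.split.chunks(chunks)
```

The result should be the four words, one per element, with no empty string at the end.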

The result, with the corrected timings above, is only a small gain over readLines() (about 25 s against 26.5 s elapsed), not the 1/3rd of the time I first measured. This function is based on the thread "Reading large files quickly" in the R help mailing list, in particular Rob Steele's note. Does anyone have a faster solution? Can we get as fast as rev? As fast as grep?



Update.

I got comments that it isn't fair to compare to grep and wc, because they don't need to keep the whole file in memory. So, I tried the following primitive program:

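#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f ;
    char *a ;
    f = fopen("file","rb") ;
    if (f == NULL) return 1 ;                   /* could not open the file */

    a = malloc(283000001) ;
    if (a == NULL) { fclose(f) ; return 1 ; }   /* allocation failed */
    fread( a, 1, 283000000, f ) ;               /* one large read into the buffer */

    fclose(f) ;
    free(a) ;
    return 0 ;
}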

That finished reading the file in 0.5sec. It is even possible to write a version of this function that uses R_alloc and can be dynamically loaded. That is quite fast. I then turned to study readChar(), and why it is slower than the C program. Maybe I'll write about it sometime. But it seems that reading the whole file in at once with readChar is much faster. Though it takes more memory...

Here is the new code:
my.read.lines2=function(fname) {
  s = file.info( fname )$size
  buf = readChar( fname, s, useBytes=T)
  strsplit( buf,"\n",fixed=T,useBytes=T)[[1]]
}
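A quick sanity check against readLines(), as a sketch under two assumptions: the test file (made up here) is plain ASCII, and its lines end in "\n" only, which is why it is written through a binary-mode connection. The function is repeated so that the example runs on its own.

```r
my.read.lines2 <- function(fname) {
  s <- file.info(fname)$size
  buf <- readChar(fname, s, useBytes = TRUE)
  strsplit(buf, "\n", fixed = TRUE, useBytes = TRUE)[[1]]
}

# Made-up test file, including an empty line in the middle
fname <- tempfile()
con <- file(fname, "wb")
writeLines(c("first line", "", "third line after a blank one", "last"), con)
close(con)

a <- my.read.lines2(fname)
b <- readLines(fname)
identical(a, b)

unlink(fname)
```

A file with Windows line ends would leave a trailing "\r" on every element returned by my.read.lines2, so the comparison is only meaningful for "\n" files.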



And the timings:

> system.time({f=file("file","rb");a=readChar(f,file.info("file")$size,useBytes=T);close(f)})
user system elapsed
1.302 0.746 2.080

> system.time({a=my.read.lines2("file")})
user system elapsed
10.721 1.163 12.046


Much better. readChar() could be made a bit faster, but strsplit() is where more gain can be made. This code can be used instead of readLines if needed. I haven't shown a better scan() or read.table()....
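To see where the time goes on your own data, the two steps can be timed separately. This is a sketch on a made-up, much smaller ASCII file, so the proportions may not match the ones measured above.

```r
# Made-up test file, written in binary mode so line ends are plain "\n"
fname <- tempfile()
con <- file(fname, "wb")
writeLines(sprintf("record %d with some payload text", 1:300000), con)
close(con)

s <- file.info(fname)$size

# Step 1: pull the whole file into one string
t.read <- system.time(buf <- readChar(fname, s, useBytes = TRUE))

# Step 2: cut that string into lines
t.split <- system.time(a <- strsplit(buf, "\n", fixed = TRUE, useBytes = TRUE)[[1]])

print(c(readChar = t.read[["elapsed"]], strsplit = t.split[["elapsed"]]))
length(a)

unlink(fname)
```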


This thread will continue for other read functions.

8 comments:

  1. No guarantees here, but you are not doing comparable things. You are reading in the file, then splitting it up into a bunch of "lines" each separately allocated in memory. wc and grep are not keeping much state, they are passing through the file reading and either outputting immediately if a match is found or just updating a few summary statistics. They are definitely not allocating and re-allocating space for a lot of lines. I don't know if R is smart enough to not re-allocate your "r" every time you read ... that only happens 6 times but still, you could have reallocations happening very often inside the split.

    Try checking without all the split stuff. Again, no guarantees, but that is a lot closer to what grep and wc do than what you have. There still might be an issue; in that case I suspect there is some double buffering going on.

    One other thing you need to do when you check this is run it over and over and run all three of the tests first, second and third in different order. 280M is going to be in a cache somewhere after the first time you read it. So you need to make sure it is there all the time or you will get some odd times.

    There is a difference between reading an entire file into memory, and just scanning the file as it passes.

  2. I just tried it out and on my platform (OS X + R 2.13.0) that function is about twice as slow as readLines.

    Also comparing readLines to programs that only need to keep a line at-a-time in memory seems a bit unfair.

  3. Strange, Hadley. That is exactly the setup I'm using, and here it is faster.
    How big is the file you're reading?

    One thing I noticed is that you have to quit R and restart for each measurement, because R keeps a hash of strings, and reuses them.

    Anyway, about the keep-in-memory thing, I'll write above in the blog.

  4. Are you actually doing readChar( in.file, n, TRUE )? If I add that, then your method is about twice as fast as readLines.

  5. Oh, and I did restart R between each timing, although it didn't seem to make that much of a difference.

  6. Yeah the big allocation stream for your readchar routine is possibly lost in the I/O time. I still would wonder about file caching - I am talking at the OS or even the disk level. Starting and stopping R won't change what is in system I/O buffers much, much less a hardware disk cache. To truly isolate that you would need to have a just rebooted machine on the one hand, and on the other hand run the program a couple of times and take the second time. I used to get several times the speed on some old batch programs depending on the I/O cache status.

    Assuming that isn't the case here, you can see that allocating and reallocating memory to keep things neat in lists or dataframes has a non-trivial overhead.

  7. Yes, you could change the order in which you run things for comparison. But that wouldn't test the "first read of a file" timing...
    You could also read different files - same size but different data, or in between runs read files of sufficient size to clear out buffers.

    Allocation is a problem. For example, doing a scan() to read the single fields in my file makes memory use of R grow to 3GB on my machine. I'll have to study that a bit more.

  8. Oh, and you can go to different machines on a cluster.... on a network where the disk access is faster than the network.
