Wednesday 3 August 2011

Faster files in R

R is fairly slow in reading files. read.table() is slow, scan() a bit faster, and readLines() fastest.
But all these are nowhere as fast as other tools that scan through files.

Let us look at an example. I have in front of me a 283M file.

(Small update: the timings where off before. First because R hashes strings, one has to quit and restart R to get good timings. But I'm not really sure why my timings were off before...)


; time grep hello file
1.79 real 0.23 user 0.27 sys
; time wc file
774797 145661836 297147275 file
2.87 real 2.13 user 0.34 sys
; time cat file | rev >/dev/null
3.58 real 0.02 user 0.35 sys

> system.time({a=readLines("file")})
user system elapsed
25.158 0.829 26.500

> system.time({a=scan("file",what=character(),sep="\n")})
Read 774797 items
user system elapsed
30.526 0.827 31.734

> system.time({a=read.table("file",sep="\n",as.is=T)})
user system elapsed
31.880 0.802 32.837


sad, isn't it? And what does R do? Process data. So we read large files in all the time.

But beyond readLines(), R has deeper routines for reading files. There are also readChar() and readBin().

It turns out that using these, one can read in files faster.

Here is the code: (The second version further down is much better...)

my.read.lines <- function( fname, buf.size=5e7 ) {
s = file.info( fname )$size
in.file = file( fname, "r" )
buf=""
res = list()
i=1
while( s > 0 ) {
n = min( c( buf.size, s ) )
buf = paste(buf, readChar( in.file, n ),sep="" )
 
s = s - n
r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
n=nchar(buf)
if( substr(buf,n,n)=="\n" ) {
res[[i]] = r
buf = ""
} else {
res[[i]] = head(r,-1)
buf = tail(r,1)
}
i=i+1
}
close( in.file )
c( unlist(res), buf )
}

Created by Pretty R at inside-R.org



The gain is not amazing:

> system.time({a=my.read.lines("file")})
user system elapsed
22.277 2.739 25.175


How does the code work?
 buf = paste(buf, readChar( in.file, n ),sep="" )
reads the file. We read the file in big chunks, by default 50MB at a time. The exact size of the buffer doesn't matter - on my computer, 1MB was as good, but 10k much slower. Since the loop over these chunks is done inside R code, we want the loop to have not too many iterations.

We then split the file using
r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
the fixed=T parameter makes strsplit faster.

The rest of the function deals with preserving the part of the buffer that might have a partial line, because we read in constant-sized chunks. (I hope this is done correctly)

The result is that we read in a file in 1/3rd of the time. This function is based on the thread Reading large files quickly in the R help mailing list. In particular, Rob Steele's note. Does anyone have a faster solution? Can we get as fast as rev? As fast as grep?



Update.

I got comments that it isn't fair to compare to grep and wc, because they don't need to keep the whole file in memory. So, I tried the following primitive program:

#include < stdio.h >
#include < stdlib.h >

main()
{
FILE *f ;
char *a ;
f = fopen("file","r") ;

a=malloc(283000001) ;
fread( a, 1, 283000000, f ) ;

fclose(f) ;
}

That finished reading the file in 0.5sec. It is even possible to write a version of this function that uses R_alloc and can by dynamically loaded. That is quite fast. I then turned to study readChar(), and why it is slower than the C program. Maybe I'll write about it sometime. But it seems that reading the whole file in at once with readChar is much faster. Though it takes more memory...

Here is the new code:
my.read.lines2=function(fname) {
s = file.info( fname )$size
buf = readChar( fname, s, useBytes=T)
strsplit( buf,"\n",fixed=T,useBytes=T)[[1]]
}

Created by Pretty R at inside-R.org


And the timings:

> system.time({f=file("file","rb");a=readChar(f,file.info("file")$size,useBytes=T);close(f)})
user system elapsed
1.302 0.746 2.080

> system.time({a=my.read.lines2("file")})
user system elapsed
10.721 1.163 12.046


Much better. readChar() could be made a bit faster, but strsplit() is where more gain can be made. This code can be used instead of readLines if needed. I haven't shown a better scan() or read.table()....


This thread will continue for other read functions.