MLT thinks: August 2011

R is fairly slow at reading files. read.table() is slow, scan() a bit faster, and readLines() fastest.
But none of these is anywhere near as fast as other tools that scan through files.

Let us look at an example. I have in front of me a 283M file.

(Small update: the timings were off before. For one thing, R hashes strings, so one has to quit and restart R to get good timings. But I'm not really sure why my timings were off before...)


; time grep hello file
        1.79 real         0.23 user         0.27 sys
; time wc file
  774797 145661836 297147275 file
        2.87 real         2.13 user         0.34 sys
; time cat file | rev >/dev/null
        3.58 real         0.02 user         0.35 sys

> system.time({a=readLines("file")})
   user  system elapsed 
 25.158   0.829  26.500 

> system.time({a=scan("file",what=character(),sep="\n")})
Read 774797 items
   user  system elapsed 
 30.526   0.827  31.734 

> system.time({a=read.table("file",sep="\n",as.is=T)})
   user  system elapsed 
 31.880   0.802  32.837

Sad, isn't it? And what does R do? It processes data, so we read large files in all the time.
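For anyone who wants to repeat the comparison on a smaller scale, here is a minimal sketch. The helper make.test.file() and the generated sample text are my own assumptions, not the 283M file above, so the absolute numbers will differ; for clean numbers each timing should also be taken in a fresh R session.

```r
# Hypothetical helper: write n.lines of sample text to a temporary file.
make.test.file <- function( n.lines=1e5 ) {
 fname = tempfile( fileext=".txt" )
 writeLines( sprintf( "line %d hello world", seq_len(n.lines) ), fname )
 fname
}

fname = make.test.file()

# Time each reader on the same file; only the elapsed time is kept.
timings = c(
 readLines  = system.time( a1 <- readLines( fname ) )[["elapsed"]],
 scan       = system.time( a2 <- scan( fname, what=character(), sep="\n",
                                       quiet=TRUE ) )[["elapsed"]],
 read.table = system.time( a3 <- read.table( fname, sep="\n", as.is=TRUE,
                                             quote="", comment.char="" ) )[["elapsed"]]
)
print( timings )
```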

But beyond readLines(), R has deeper routines for reading files. There are also readChar() and readBin().

It turns out that using these, one can read in files faster.

Here is the code: (The second version further down is much better...)

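my.read.lines <- function( fname, buf.size=5e7 ) {
 s = file.info( fname )$size            # what is left to read
 in.file = file( fname, "rb" )          # readChar() is meant for binary-mode connections
 buf = ""                               # carries a partial line from chunk to chunk
 res = list()
 i = 1
 while( s > 0 ) {
  n = min( c( buf.size, s ) )
  buf = paste( buf, readChar( in.file, n ), sep="" )

  s = s - n
  r = strsplit( buf, "\n", fixed=TRUE, useBytes=TRUE )[[1]]
  n = nchar( buf )
  if( substr(buf,n,n)=="\n" ) {
   # the chunk ends exactly at the end of a line
   res[[i]] = r
   buf = ""
  } else {
   # keep the complete lines, carry the partial last line over
   res[[i]] = head(r,-1)
   buf = tail(r,1)
  }
  i = i+1
 }
 close( in.file )
 # append the leftover only if the file did not end with a newline;
 # otherwise the result would get a spurious empty last line
 c( unlist(res), buf[nzchar(buf)] )
}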


The gain is not amazing:


> system.time({a=my.read.lines("file")})
   user  system elapsed 
 22.277   2.739  25.175

How does the code work?

 buf = paste(buf, readChar( in.file, n ),sep="" )

reads the file. We read the file in big chunks, by default 50MB at a time. The exact size of the buffer doesn't matter - on my computer, 1MB was as good, but 10k much slower. Since the loop over these chunks is done inside R code, we want the loop to have not too many iterations.
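To see what the chunked reads look like, here is a small sketch with a deliberately tiny buffer. The three-line sample file and the buffer size of 8 are assumptions chosen for illustration; the file is written with writeBin() so that the line ends are plain "\n" on every platform.

```r
fname = tempfile()
writeBin( charToRaw( "alpha\nbeta\ngamma\n" ), fname )   # 17 bytes

in.file = file( fname, "rb" )       # binary mode, as readChar() expects
s = file.info( fname )$size
chunks = character(0)
while( s > 0 ) {
 n = min( 8, s )                    # tiny buffer, to make the chunk boundaries visible
 chunks = c( chunks, readChar( in.file, n, useBytes=TRUE ) )
 s = s - n
}
close( in.file )
print( chunks )
```

The chunks cut through the middle of lines, which is why the rest of the function has to carry partial lines over.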

We then split the file using

r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]

The fixed=T parameter makes strsplit() faster, because the separator is then matched as a literal string rather than as a regular expression.
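A small sketch of what fixed=TRUE changes, on a made-up string: with it the split pattern is taken literally, without it the pattern is a regular expression.

```r
x = "one.two\nthree.four\n"

by.newline = strsplit( x, "\n", fixed=TRUE )[[1]]  # literal newline
by.dot     = strsplit( x, ".",  fixed=TRUE )[[1]]  # "." is an ordinary character here
by.regex   = strsplit( x, "." )[[1]]               # "." is a regex wildcard here

print( by.newline )
print( by.dot )
print( by.regex )
```

For a single newline both forms find the same split points; the literal match just skips the regular expression machinery.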

The rest of the function deals with preserving the part of the buffer that might have a partial line, because we read in constant-sized chunks. (I hope this is done correctly)
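The carry-over logic can be checked in isolation. This sketch feeds three made-up chunks (an assumption standing in for successive readChar() results) through the same if/else as in the function.

```r
chunks = c( "alpha\nbe", "ta\ngamma", "\n" )   # made-up chunks of one file
buf = ""
lines = character(0)
for( chunk in chunks ) {
 buf = paste( buf, chunk, sep="" )
 r = strsplit( buf, "\n", fixed=TRUE )[[1]]
 n = nchar( buf )
 if( substr( buf, n, n ) == "\n" ) {
  lines = c( lines, r )              # chunk ended on a line boundary
  buf = ""
 } else {
  lines = c( lines, head( r, -1 ) )  # complete lines only
  buf = tail( r, 1 )                 # partial line waits for the next chunk
 }
}
print( lines )
# [1] "alpha" "beta"  "gamma"
```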

By the timings above, this reads the file in about 25 seconds against 26.5 for readLines(), a small gain rather than the 1/3rd of the time I first reported. This function is based on the thread "Reading large files quickly" on the R-help mailing list, in particular on Rob Steele's note. Does anyone have a faster solution? Can we get as fast as rev? As fast as grep?

Update.

I got comments that it isn't fair to compare to grep and wc, because they don't need to keep the whole file in memory. So, I tried the following primitive program:


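#include <stdio.h>
#include <stdlib.h>

int main(void)
{
   FILE *f ;
   char *a ;
   f = fopen("file","rb") ;
   if( f == NULL ) return 1 ;

   /* note: wc reports 297147275 bytes, so this reads a bit less than the whole file */
   a = malloc(283000001) ;
   if( a == NULL ) { fclose(f) ; return 1 ; }
   fread( a, 1, 283000000, f ) ;

   fclose(f) ;
   free(a) ;
   return 0 ;
}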

That finished reading the file in 0.5 sec. It is even possible to write a version of this function that uses R_alloc and can be dynamically loaded. That is quite fast. I then turned to study readChar(), and why it is slower than the C program. Maybe I'll write about it sometime. But it seems that reading the whole file in at once with readChar() is much faster, though it takes more memory.

Here is the new code:

my.read.lines2=function(fname) {
 s = file.info( fname )$size 
 buf = readChar( fname, s, useBytes=T)
 strsplit( buf,"\n",fixed=T,useBytes=T)[[1]]
}


And the timings:


> system.time({f=file("file","rb");a=readChar(f,file.info("file")$size,useBytes=T);close(f)})
   user  system elapsed 
  1.302   0.746   2.080 

> system.time({a=my.read.lines2("file")})
   user  system elapsed 
 10.721   1.163  12.046

Much better. readChar() could be made a bit faster, but strsplit() is where more gain can be made. This code can be used instead of readLines if needed. I haven't shown a better scan() or read.table()....
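Before swapping this in for readLines(), it is worth checking that the two agree. A minimal sketch, assuming a small made-up test file with plain "\n" line ends (readLines() also accepts "\r\n" and "\r" line ends, which a split on "\n" alone does not handle):

```r
my.read.lines2 = function( fname ) {           # copied from the post
 s = file.info( fname )$size
 buf = readChar( fname, s, useBytes=TRUE )
 strsplit( buf, "\n", fixed=TRUE, useBytes=TRUE )[[1]]
}

fname = tempfile()
writeBin( charToRaw( "first\n\nthird line\nlast\n" ), fname )

a = my.read.lines2( fname )
b = readLines( fname )
print( identical( a, b ) )
```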

This thread will continue for other read functions.


Wednesday, 3 August 2011

Faster files in R
