
Saturday, 7 April 2012

More on atof()

Some more background on atof(). I analyze sequencing data, which means I am processing files with billions of lines. I haven't yet properly mastered how to do all I need in a parallel manner, so some of my operations are linear. That's where it all starts. At first, I had programs written in perl+python to analyze the data. The problem was that each file took many hours to process. The rate of processing (as measured by pv) was around 300 kB/s. Simply running 'cat' on the file works at a rate of 100 MB/s, quite a lot faster. I was trying to do all my processing at the rate of 'cat'. Here is where the problems start - other than the fact that I switched to flex/C/C++.... The data to be analyzed has lots of numbers, so I read and write a lot of numbers, and every read/write operation on a number takes a long time, basically because computers store data in binary while I'm storing text files with a decimal representation of the numbers.
There are 3 solutions to this problem:
  1. Store numbers in binary in the file. For that one needs to define a good format for storing binary data. That would really be the best approach (see the sketch below).
  2. Write or find a good BCD (binary-coded decimal) library. I know... sounds old. A bit like perl, I guess. But, as long as one doesn't do too complicated math, this will be almost as fast as 1.
  3. Improve the reading and writing of numbers.
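
To illustrate option 1, here is a minimal sketch of what binary storage looks like in C - just fwrite/fread of the raw bytes, with no text conversion in either direction. This is only an illustration, not the format I would actually use: there is no header and no endianness handling, so it assumes the same machine writes and reads the data (and error checking is omitted).

#include <stdio.h>

/* Option 1 sketch: dump an array of doubles as raw bytes and read them back. */
int main(void)
{
    double values[4] = { 1.5, 2.25, 3.0, 4.125 };
    double back[4];

    FILE *f = fopen("numbers.bin", "wb");
    fwrite(values, sizeof(double), 4, f);          /* no decimal conversion at all */
    fclose(f);

    f = fopen("numbers.bin", "rb");
    size_t n = fread(back, sizeof(double), 4, f);  /* and none on the way back in */
    fclose(f);

    printf("read %zu doubles, first = %g\n", n, back[0]);
    return 0;
}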
How much was there to improve? With sprintf my programs were running at rates of 10-30 MB/s, or 3-10 times "too slow".
I went on a hunt for an efficient itoa, but didn't really find much.
In the end, I wrote the itoa mentioned in my last post. That itoa was fixed-width; now I have also added a variable-width version.
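
To give an idea, here is a minimal sketch of a variable-width itoa of this general kind (not the code from the other post): write the digits backwards into a temporary buffer, then copy them out in the right order.

#include <stdio.h>

/* Variable-width itoa sketch: digits are produced least-significant first into
   tmp, then reversed into buf.  buf needs room for at least 12 bytes
   (sign + 10 digits + terminating '\0' for a 32-bit int). */
char *int_to_str(int value, char *buf)
{
    char tmp[12];
    int i = 0, j = 0;
    unsigned int v = (value < 0) ? -(unsigned int)value : (unsigned int)value;

    do {
        tmp[i++] = '0' + (v % 10);   /* least significant digit first */
        v /= 10;
    } while (v > 0);

    if (value < 0)
        buf[j++] = '-';
    while (i > 0)                    /* reverse into the output buffer */
        buf[j++] = tmp[--i];
    buf[j] = '\0';
    return buf;
}

int main(void)
{
    char buf[12];
    printf("%s\n", int_to_str(-2012, buf));
    printf("%s\n", int_to_str(1234567890, buf));
    return 0;
}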
Here is a table of the relative running times of various implementations (1.00 marks the fastest in each row; larger numbers are slower):

size of number   my itoa   my itoa_fill0   itoa1   ufast_itoa10   new_itoa   sstream   i32toa   sprintf
10^1                1.00            2.25    1.27           1.10       1.81      6.04     9.98     18.64
10^2                1.00            2.21    1.33           1.09       1.91      6.07     9.91     18.62
10^3                1.00            2.45    1.80           1.41       3.01      6.68    12.16     22.06
10^4                1.00            2.33    2.25           1.38       2.47      6.24    12.06     21.12
10^5                1.00            1.33    1.80           2.76       4.46      4.18     7.98     14.48
10^6                1.00            1.12    1.87           2.27       3.85      3.52     6.86     12.76
10^7                1.00            1.21    2.27           2.97       4.75      3.79     7.90     14.05
10^8                1.00            1.27    2.78           3.07       4.74      4.00     6.72     15.08
10^9                1.23            1.00    2.70           3.85       5.11      3.49     6.08     13.73

It is easy to see what the problem is: sprintf is really slow. sstream is actually surprisingly fast.... Next comes atoi, and then just the reading from and writing to the files themselves....
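
For the curious, the measurement is essentially a loop like the one below (a simplified sketch, not the actual benchmark driver): format the same number many times with one implementation, time it, and divide by the time of the fastest implementation to get the relative numbers in the table.

#include <stdio.h>
#include <time.h>

/* Sketch of a timing loop: swap in the conversion routine being tested. */
int main(void)
{
    char buf[32];
    const int reps = 10000000;

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        sprintf(buf, "%d", 123456);   /* the slow baseline from the table */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("%d conversions in %.2f s\n", reps, secs);
    return 0;
}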

Wednesday, 3 August 2011

Faster files in R

R is fairly slow at reading files: read.table() is slow, scan() is a bit faster, and readLines() is the fastest.
But all of these are nowhere near as fast as other tools that scan through files.

Let us look at an example. I have in front of me a 283M file.

(Small update: the timings were off before. For one thing, because R hashes strings, one has to quit and restart R between runs to get clean timings. But I'm not really sure why my earlier timings were off...)


; time grep hello file
1.79 real 0.23 user 0.27 sys
; time wc file
774797 145661836 297147275 file
2.87 real 2.13 user 0.34 sys
; time cat file | rev >/dev/null
3.58 real 0.02 user 0.35 sys

> system.time({a=readLines("file")})
user system elapsed
25.158 0.829 26.500

> system.time({a=scan("file",what=character(),sep="\n")})
Read 774797 items
user system elapsed
30.526 0.827 31.734

> system.time({a=read.table("file",sep="\n",as.is=T)})
user system elapsed
31.880 0.802 32.837


Sad, isn't it? And what does R do? Process data. So we read in large files all the time.

But beyond readLines(), R has lower-level routines for reading files: readChar() and readBin().

It turns out that using these, one can read in files faster.

Here is the code: (The second version further down is much better...)

my.read.lines <- function( fname, buf.size=5e7 ) {
  s = file.info( fname )$size          # bytes left to read
  in.file = file( fname, "r" )
  buf = ""                             # leftover partial line from the previous chunk
  res = list()
  i = 1
  while( s > 0 ) {
    n = min( c( buf.size, s ) )
    # read the next chunk and prepend the leftover partial line
    buf = paste( buf, readChar( in.file, n ), sep="" )
    s = s - n
    r = strsplit( buf, "\n", fixed=T, useBytes=T )[[1]]
    n = nchar( buf )
    if( substr( buf, n, n ) == "\n" ) {
      # the chunk ended exactly on a line break: all lines are complete
      res[[i]] = r
      buf = ""
    } else {
      # keep the trailing partial line for the next iteration
      res[[i]] = head( r, -1 )
      buf = tail( r, 1 )
    }
    i = i + 1
  }
  close( in.file )
  if( nchar( buf ) > 0 )               # last line had no trailing newline
    res[[i]] = buf
  unlist( res )
}




The gain is not amazing:

> system.time({a=my.read.lines("file")})
user system elapsed
22.277 2.739 25.175


How does the code work?
 buf = paste(buf, readChar( in.file, n ),sep="" )
reads the file. We read the file in big chunks, by default 50MB at a time. The exact size of the buffer doesn't matter much - on my computer, 1MB was just as good, but 10kB was much slower. Since the loop over these chunks is done in R code, we want it to have relatively few iterations.

We then split the file using
r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
The fixed=T parameter makes strsplit() faster.

The rest of the function deals with preserving the part of the buffer that might contain a partial line, because we read in constant-sized chunks. (I hope this is done correctly)

The result is that we read in the file a bit faster. This function is based on the thread "Reading large files quickly" on the R-help mailing list, in particular Rob Steele's note. Does anyone have a faster solution? Can we get as fast as rev? As fast as grep?



Update.

I got comments that it isn't fair to compare to grep and wc, because they don't need to keep the whole file in memory. So, I tried the following primitive program:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f;
    char *a;

    f = fopen("file", "r");
    a = malloc(283000001);
    fread(a, 1, 283000000, f);

    fclose(f);
    free(a);
    return 0;
}

That finished reading the file in 0.5 sec. It is even possible to write a version of this function that uses R_alloc and can be dynamically loaded, and that is quite fast. I then turned to study readChar() and why it is slower than the C program. Maybe I'll write about that sometime. But it seems that reading the whole file in at once with readChar() is much faster, though it takes more memory...
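
Roughly, such a dynamically loaded version could look like the sketch below. This is only an illustration (not the exact code I tried), and it assumes the file fits in memory and is smaller than 2GB; the function and file names are made up for the example.

#include <stdio.h>
#include <R.h>
#include <Rinternals.h>

/* Sketch: slurp a whole file into a single R character string.
   Memory from R_alloc is released automatically when the call returns.
   Assumes the file is smaller than 2GB. */
SEXP read_whole_file(SEXP fname)
{
    const char *path = CHAR(STRING_ELT(fname, 0));
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        error("cannot open '%s'", path);

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    char *buf = R_alloc(size, 1);
    long nread = (long) fread(buf, 1, size, f);
    fclose(f);

    return ScalarString(mkCharLen(buf, (int) nread));
}

Compile with R CMD SHLIB and then, in R, something like dyn.load("readfile.so") followed by a = .Call("read_whole_file", "file").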

Here is the new code:
my.read.lines2 = function( fname ) {
  s = file.info( fname )$size
  buf = readChar( fname, s, useBytes=T )             # slurp the whole file at once
  strsplit( buf, "\n", fixed=T, useBytes=T )[[1]]    # then split into lines
}



And the timings:

> system.time({f=file("file","rb");a=readChar(f,file.info("file")$size,useBytes=T);close(f)})
user system elapsed
1.302 0.746 2.080

> system.time({a=my.read.lines2("file")})
user system elapsed
10.721 1.163 12.046


Much better. readChar() could be made a bit faster, but strsplit() is where more gain can be made. This code can be used instead of readLines if needed. I haven't shown a better scan() or read.table()....


This thread will continue with the other read functions.