Saturday, 7 April 2012

More on atof()

Some more background on atof(). I analyzing sequencing data. That means, I am processing files with billions of lines. I haven't yet mastered properly how to do all I need in a parallel manner, so some of my operations are linear. That's where it all starts. At first, I had programs written in perl+python to analyze the data. The problem was that each file took many hours to precess. The rate of processing (as measured by pv) was round 300kb/s. Simply running 'cat' on the file works at a rate of 100Mb/s. Quite a lot faster. I was trying to do all my processing at the rate of 'cat'. Here is where the problems start - other than the fact that I switched to flex/C/C++.... The data to be analyzed has lots of numbers, and so I read and write a lot of numbers, and every read/write operation on numbers takes a long time, basically because computers store data in binary, and I'm storing text files with a decimal representation of numbers.
There are 3 solutions to this problem:
  1. Store numbers in binary in the file. For that one needs to define a good format for storing binary data. That wuold be really the best approach.
  2. Write or find a good BCD (Binary-coded decimal) library. I know... sounds old. A bit like perl, I guess. But, as long as one doesn't to too complicated math, this will be almost as fast as 1.
  3. improve read/write of numbers.
How much was there to improve? With sprintf my programs were running at rates of 10-30 MB/s, or 3-10 times "too slow".
I went on a hunt for an efficient itoa, but didn't really find much.
In the end, I wrote the itoa mentioned in my last post. That itoa was fixed width, now I also added variable width.
Here is a table of the relative speeds of various implementations:

size of number my itoa my itoa_fill0 itoa1 ufast_itoa10 new_itoa sstream i32toa sprintf
10^1 1.00 2.25 1.27 1.10 1.81 6.04 9.98 18.64
10^2 1.00 2.21 1.33 1.09 1.91 6.07 9.91 18.62
10^3 1.00 2.45 1.80 1.41 3.01 6.68 12.16 22.06
10^4 1.00 2.33 2.25 1.38 2.47 6.24 12.06 21.12
10^5 1.00 1.33 1.80 2.76 4.46 4.18 7.98 14.48
10^6 1.00 1.12 1.87 2.27 3.85 3.52 6.86 12.76
10^7 1.00 1.21 2.27 2.97 4.75 3.79 7.90 14.05
10^8 1.00 1.27 2.78 3.07 4.74 4.00 6.72 15.08
10^9 1.23 1.00 2.70 3.85 5.11 3.49 6.08 13.73

It is easy to see what the problem is. sprintf is really slow. sstream is actually surprisingly fast.... atoi next. Then, just reading and writing from files....

Wednesday, 28 March 2012

Optimizing atof()

I was annoyed that atof() was making my program run 6 times slower. So, I wrote a new, slightly faster one. Still not fast enough for my taste. I tried playing around with mmx, but don't have good results to show, yet. I'll add benchmarks soon. The program will use one integer multiplication, addition and subtraction per digit, and one float multiplication for the whole number.
double my_atof (const char *p)
{
   static unsigned char to_do[256] ;
   static double my_exp[512*2] ;
  
  int i;

 
  if( to_do[0] != DONE ) {
    // Initialize tables for interpreting characters and for exponents.
    for( i=0; i<255; i++) to_do[i] = 255 ;
    to_do['0']=0 ;
    for(i='1'; i<='9'; i++) to_do[i] = DIGIT ;
    
    
    to_do['-'] = MINUS ;
    to_do['.'] = POINT ;
    to_do['e'] = EXPONENT ; 
    to_do['E'] = EXPONENT ;
    to_do[' '] = SPACE ;
    to_do['\t'] = SPACE ;
    to_do[0] = DONE ;
    my_exp[512] = 1.0;
    for( i=1; i<511; i++)
      my_exp[i+512] = my_exp[i-1+512] * 10.0 ;
    for( i=1; i<511; i++)
      my_exp[512-i] = my_exp[512-i+1] * 0.1 ;
#define MY_EXP(x) (my_exp[((x)+512) & 1023])
  }
  
  int neg=FALSE , div=0, val = 0, eval=0 ;


  // The function works in 5 stages:
  // 1. jump over whitespace, till first char is found - either digit, sign or dot.
  // 2. Read digits till first non-digit - either '.' or eE
  // 3. Read digits after decimal point
  // 4. Read exponent digit or sign
  // 5. Read rest of exponent.
  while( TRUE ) {

    switch( to_do[ *p ] ) {
    case DIGIT: val = *p-'0' ; goto end_loop1 ;
    case SPACE: 
      break ;
    case MINUS:
      neg = TRUE ;
      goto end_loop1 ;
    case PLUS:
      goto end_loop1 ;
    case POINT:
      goto found_dot ;
    case 0: goto end_loop1 ;
    default:
      return 0.0 ;
    }
    p++ ;
  }


  end_loop1:
  p++ ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: val= val*10+(*p)-'0'; break ;
    case 0: val *= 10 ;break ;
    case POINT:
      goto found_dot;
    case EXPONENT:
      goto found_exp;
    default:
      if( neg ) val = -val ;
      return (double)( val) ;
    }
     p++ ; 
  }

  found_dot:
  p++ ;

  while( TRUE ) {
    {
      switch( to_do[ *p ] ) {
      case DIGIT: val= val*10+(*p)-'0';div--; break ;
      case 0: val *= 10 ;div--;break ;
      case EXPONENT:
 goto found_exp;
      default:
 if( neg ) val = -val;
 return MY_EXP(div) * val ;
      }
      p++ ;
    }
  }

  found_exp:
  p++ ;
  if( neg ) val = -val ;
  neg=FALSE ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: eval=(*p)-'0';goto end_loop2 ;
    case 0: goto end_loop2 ;
    case SPACE: 
      return MY_EXP(div) * val ;
    case MINUS:
      neg  = TRUE ; //fallthrough
    case PLUS:
      goto end_loop2 ;
    default:
      return MY_EXP(div) * val ;
    }
    p++ ;
  }

  end_loop2:
  p++ ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: eval= eval*10+(*p)-'0'; break ;
    case 0: eval *= 10 ;break ;
    default:
      goto done ;
    }
    p++ ; 
  } 
  done:

  if( neg ) 
    eval = div - eval ;
  else
    eval = div + eval ;
  return  MY_EXP( eval ) * val ;

}

Wednesday, 3 August 2011

Faster files in R

R is fairly slow in reading files. read.table() is slow, scan() a bit faster, and readLines() fastest.
But all these are nowhere as fast as other tools that scan through files.

Let us look at an example. I have in front of me a 283M file.

(Small update: the timings where off before. First because R hashes strings, one has to quit and restart R to get good timings. But I'm not really sure why my timings were off before...)


; time grep hello file
1.79 real 0.23 user 0.27 sys
; time wc file
774797 145661836 297147275 file
2.87 real 2.13 user 0.34 sys
; time cat file | rev >/dev/null
3.58 real 0.02 user 0.35 sys

> system.time({a=readLines("file")})
user system elapsed
25.158 0.829 26.500

> system.time({a=scan("file",what=character(),sep="\n")})
Read 774797 items
user system elapsed
30.526 0.827 31.734

> system.time({a=read.table("file",sep="\n",as.is=T)})
user system elapsed
31.880 0.802 32.837


sad, isn't it? And what does R do? Process data. So we read large files in all the time.

But beyond readLines(), R has deeper routines for reading files. There are also readChar() and readBin().

It turns out that using these, one can read in files faster.

Here is the code: (The second version further down is much better...)

my.read.lines <- function( fname, buf.size=5e7 ) {
s = file.info( fname )$size
in.file = file( fname, "r" )
buf=""
res = list()
i=1
while( s > 0 ) {
n = min( c( buf.size, s ) )
buf = paste(buf, readChar( in.file, n ),sep="" )
 
s = s - n
r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
n=nchar(buf)
if( substr(buf,n,n)=="\n" ) {
res[[i]] = r
buf = ""
} else {
res[[i]] = head(r,-1)
buf = tail(r,1)
}
i=i+1
}
close( in.file )
c( unlist(res), buf )
}

Created by Pretty R at inside-R.org



The gain is not amazing:

> system.time({a=my.read.lines("file")})
user system elapsed
22.277 2.739 25.175


How does the code work?
 buf = paste(buf, readChar( in.file, n ),sep="" )
reads the file. We read the file in big chunks, by default 50MB at a time. The exact size of the buffer doesn't matter - on my computer, 1MB was as good, but 10k much slower. Since the loop over these chunks is done inside R code, we want the loop to have not too many iterations.

We then split the file using
r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
the fixed=T parameter makes strsplit faster.

The rest of the function deals with preserving the part of the buffer that might have a partial line, because we read in constant-sized chunks. (I hope this is done correctly)

The result is that we read in a file in 1/3rd of the time. This function is based on the thread Reading large files quickly in the R help mailing list. In particular, Rob Steele's note. Does anyone have a faster solution? Can we get as fast as rev? As fast as grep?



Update.

I got comments that it isn't fair to compare to grep and wc, because they don't need to keep the whole file in memory. So, I tried the following primitive program:

#include < stdio.h >
#include < stdlib.h >

main()
{
FILE *f ;
char *a ;
f = fopen("file","r") ;

a=malloc(283000001) ;
fread( a, 1, 283000000, f ) ;

fclose(f) ;
}

That finished reading the file in 0.5sec. It is even possible to write a version of this function that uses R_alloc and can by dynamically loaded. That is quite fast. I then turned to study readChar(), and why it is slower than the C program. Maybe I'll write about it sometime. But it seems that reading the whole file in at once with readChar is much faster. Though it takes more memory...

Here is the new code:
my.read.lines2=function(fname) {
s = file.info( fname )$size
buf = readChar( fname, s, useBytes=T)
strsplit( buf,"\n",fixed=T,useBytes=T)[[1]]
}

Created by Pretty R at inside-R.org


And the timings:

> system.time({f=file("file","rb");a=readChar(f,file.info("file")$size,useBytes=T);close(f)})
user system elapsed
1.302 0.746 2.080

> system.time({a=my.read.lines2("file")})
user system elapsed
10.721 1.163 12.046


Much better. readChar() could be made a bit faster, but strsplit() is where more gain can be made. This code can be used instead of readLines if needed. I haven't shown a better scan() or read.table()....


This thread will continue for other read functions.

Saturday, 20 June 2009

Analysis of Iran absentee votes



On http://www.presstv.com/detail.aspx?id=98206&sectionid=351020101 the official Iranian election results outside of Iran are posted. Here is a bit of exploration of the results.

The graph shows the number of votes for Ahmadinejad (x-axis) vs. the number of votes for Moussavi (y-axis). Each point is a city (or a country). The diagonal line represents the ratio in official total results for all of Iran. Any point above the line has more votes for Moussavi than the official ratio says, any point below the line more for Ahmadinejad. (I didn't use some regions where the reported total was less than the sum of votes). The total percentages of the data I looked at is 56% for Moussavi, 40% for Ahmadinejad, 2% for Karroubi and 2% for Reazei.

There are two main lines that the points lie on: one is very close to the official ratio, the other has a much steeper ratio - many more votes for Moussavi.
The points that lie on the less steep line with more votes to Ahmadinejad are Mecca, Medina, Damascus, Kuwait, Karbala.
points that lie on the steep line, with more votes to Moussavi are New York, London, Kuala Lampur, Paris, Ottawa.
Dubai is somewhat of an outlier, and sits between the two lines.

Looking at the ratio of votes for Ahmadinejad to the other two candidates, Karroubi and Reazei gives the same overall picture. The graphs are below.

When looking at these results, the first thing one has to remember is that they are not expected to represent the population in Iran. The Iranian people who live in New York represents a very different socioeconomic sphere, and are exposed to very different Media than the people who live in Iran.

The results in these ballots definitely do not agree with the letter supposedly proving the election fraud in Iran. In no case do the percentage of votes for Karroubi and Reazei come close to the numbers given in the letter. In the letter they get 32% and 9%, in these votes, even in the most "western" booths, they get 4% and 1% (and Moussavi gets 85%). So, either these results are also fabricated, or the letter is fraudulent.

What I can infer from this are the following options

1. If both official results and these results are fabricated, then in this case they are cleverly fabricated. As opposed to the results in Iran, where there are no trends whatsoever with respect to urban vs. non-urban areas, here there are very strong trends.

2. If just the official results are fabricated, and these results are real, then in at least some places - Mecca, Medina, Damascus, Kuwait, and Karbala, the results of the vote are very close to the official published results. This at least means that the results in Iran were not totally fabricated - they reflect the ratio of opinions in some places.

3. It could be that the official results are fabricated, and at the same time, results in some cities outside Iran, but not in all cities where fabricated. Either because of where the counting was done, or who counted.

4. It could be that both the official results and these results are real. When I started the analysis I expected much more discrepancy. The official results are not that far off the results in Mecca and Medina, which I would expect to represent the population in Iran much more than New York.

Dubai is especially interesting. It lies between the two slopes. This could be a mixture of people who live under more or less western influence, or are on two different social status levels. It is interesting, because Nate's analysis of urban vs. non urban areas saw no such trend.

In short, after looking at these results, I still can not be sure of what happened. I'm pretty sure that the "Fraud letter" is fake, though.

Here are the same graphs comparing Ahmadinejad to the other two candidates: