Saturday, 7 April 2012

More on atof()

Some more background on atof(). I analyzing sequencing data. That means, I am processing files with billions of lines. I haven't yet mastered properly how to do all I need in a parallel manner, so some of my operations are linear. That's where it all starts. At first, I had programs written in perl+python to analyze the data. The problem was that each file took many hours to precess. The rate of processing (as measured by pv) was round 300kb/s. Simply running 'cat' on the file works at a rate of 100Mb/s. Quite a lot faster. I was trying to do all my processing at the rate of 'cat'. Here is where the problems start - other than the fact that I switched to flex/C/C++.... The data to be analyzed has lots of numbers, and so I read and write a lot of numbers, and every read/write operation on numbers takes a long time, basically because computers store data in binary, and I'm storing text files with a decimal representation of numbers.
There are 3 solutions to this problem:
  1. Store numbers in binary in the file. For that one needs to define a good format for storing binary data. That wuold be really the best approach.
  2. Write or find a good BCD (Binary-coded decimal) library. I know... sounds old. A bit like perl, I guess. But, as long as one doesn't to too complicated math, this will be almost as fast as 1.
  3. improve read/write of numbers.
How much was there to improve? With sprintf my programs were running at rates of 10-30 MB/s, or 3-10 times "too slow".
I went on a hunt for an efficient itoa, but didn't really find much.
In the end, I wrote the itoa mentioned in my last post. That itoa was fixed width, now I also added variable width.
Here is a table of the relative speeds of various implementations:

size of number my itoa my itoa_fill0 itoa1 ufast_itoa10 new_itoa sstream i32toa sprintf
10^1 1.00 2.25 1.27 1.10 1.81 6.04 9.98 18.64
10^2 1.00 2.21 1.33 1.09 1.91 6.07 9.91 18.62
10^3 1.00 2.45 1.80 1.41 3.01 6.68 12.16 22.06
10^4 1.00 2.33 2.25 1.38 2.47 6.24 12.06 21.12
10^5 1.00 1.33 1.80 2.76 4.46 4.18 7.98 14.48
10^6 1.00 1.12 1.87 2.27 3.85 3.52 6.86 12.76
10^7 1.00 1.21 2.27 2.97 4.75 3.79 7.90 14.05
10^8 1.00 1.27 2.78 3.07 4.74 4.00 6.72 15.08
10^9 1.23 1.00 2.70 3.85 5.11 3.49 6.08 13.73

It is easy to see what the problem is. sprintf is really slow. sstream is actually surprisingly fast.... atoi next. Then, just reading and writing from files....

Wednesday, 28 March 2012

Optimizing atof()

I was annoyed that atof() was making my program run 6 times slower. So, I wrote a new, slightly faster one. Still not fast enough for my taste. I tried playing around with mmx, but don't have good results to show, yet. I'll add benchmarks soon. The program will use one integer multiplication, addition and subtraction per digit, and one float multiplication for the whole number.
double my_atof (const char *p)
{
   static unsigned char to_do[256] ;
   static double my_exp[512*2] ;
  
  int i;

 
  if( to_do[0] != DONE ) {
    // Initialize tables for interpreting characters and for exponents.
    for( i=0; i<255; i++) to_do[i] = 255 ;
    to_do['0']=0 ;
    for(i='1'; i<='9'; i++) to_do[i] = DIGIT ;
    
    
    to_do['-'] = MINUS ;
    to_do['.'] = POINT ;
    to_do['e'] = EXPONENT ; 
    to_do['E'] = EXPONENT ;
    to_do[' '] = SPACE ;
    to_do['\t'] = SPACE ;
    to_do[0] = DONE ;
    my_exp[512] = 1.0;
    for( i=1; i<511; i++)
      my_exp[i+512] = my_exp[i-1+512] * 10.0 ;
    for( i=1; i<511; i++)
      my_exp[512-i] = my_exp[512-i+1] * 0.1 ;
#define MY_EXP(x) (my_exp[((x)+512) & 1023])
  }
  
  int neg=FALSE , div=0, val = 0, eval=0 ;


  // The function works in 5 stages:
  // 1. jump over whitespace, till first char is found - either digit, sign or dot.
  // 2. Read digits till first non-digit - either '.' or eE
  // 3. Read digits after decimal point
  // 4. Read exponent digit or sign
  // 5. Read rest of exponent.
  while( TRUE ) {

    switch( to_do[ *p ] ) {
    case DIGIT: val = *p-'0' ; goto end_loop1 ;
    case SPACE: 
      break ;
    case MINUS:
      neg = TRUE ;
      goto end_loop1 ;
    case PLUS:
      goto end_loop1 ;
    case POINT:
      goto found_dot ;
    case 0: goto end_loop1 ;
    default:
      return 0.0 ;
    }
    p++ ;
  }


  end_loop1:
  p++ ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: val= val*10+(*p)-'0'; break ;
    case 0: val *= 10 ;break ;
    case POINT:
      goto found_dot;
    case EXPONENT:
      goto found_exp;
    default:
      if( neg ) val = -val ;
      return (double)( val) ;
    }
     p++ ; 
  }

  found_dot:
  p++ ;

  while( TRUE ) {
    {
      switch( to_do[ *p ] ) {
      case DIGIT: val= val*10+(*p)-'0';div--; break ;
      case 0: val *= 10 ;div--;break ;
      case EXPONENT:
 goto found_exp;
      default:
 if( neg ) val = -val;
 return MY_EXP(div) * val ;
      }
      p++ ;
    }
  }

  found_exp:
  p++ ;
  if( neg ) val = -val ;
  neg=FALSE ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: eval=(*p)-'0';goto end_loop2 ;
    case 0: goto end_loop2 ;
    case SPACE: 
      return MY_EXP(div) * val ;
    case MINUS:
      neg  = TRUE ; //fallthrough
    case PLUS:
      goto end_loop2 ;
    default:
      return MY_EXP(div) * val ;
    }
    p++ ;
  }

  end_loop2:
  p++ ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: eval= eval*10+(*p)-'0'; break ;
    case 0: eval *= 10 ;break ;
    default:
      goto done ;
    }
    p++ ; 
  } 
  done:

  if( neg ) 
    eval = div - eval ;
  else
    eval = div + eval ;
  return  MY_EXP( eval ) * val ;

}