Monday, 14 January 2013

Using TeXmacs as an interface for R

Using TeXmacs as an interface for R (part 1)

(This post was copied from my post at wordpress. Because I didn't manage to switch r-bloggers.com to use that blog, I'm reposting it here....)
A nice, but not very well known, interface to R is
TeXmacs. (I have to say that I am not totally objective, since I wrote the interface between R and TeXmacs…)
Here’s a sample window:
Screen Shot 2012-12-04 at 12.05.43 PM
In the following few posts I’d like to explain how to use this interface.

Installation

First, install TeXmacs. Best is to install the latest and greatest version of the svn, because that will include the latest changes of the interface with R, and in particular syntax highlighting. If you don’t install the latest version, you’ll probably only miss syntax highlighting. In that case it would probably be good to still install the latest version of the TeXmacs library for R.
Now that TeXmacs is installed, we can start it.
From the command line, just enter:

texmacs

A nice window will open:
TeXmacs main window
I will not go into how to work with TeXmacs very much, just the R interface.
Still, a few hints might help.
  1. You can chose your favorite key bindings in the menu Edit→Preferences→Look and Feel. I like emacs bindings.
  2. TeXmacs initially communicates with the user via the status line (bottom edge of window). If you prefer popups, select this from the menu Edit→Preferences→Interactive questions→In popup window.
  3. Usually TeXmacs files are stored with the extension “.tm”
  4. TeXmacs has a somewhat awkward interaction with the clipboard. Cut and paste within TeXmacs works fine. But, if you want to cut of paste to a different program, the text has to be converted from the special TeXmacs representation to text. To do that do:
    Edit→Copy to→Verbatim and Edit→Paste from→Verbatim. If you copy/paste a lot from other programs, you can chose:
    Tools→Selections→Import→Verbatim and Tools→Selections→Export→Verbatim. Then the default behaviour will be import and export in verbatim.
  5. TeXmacs has a lot of documentation. Explore the help menu and use its search functions.

Using TeXmacs and R

Now let’s work with R. Select the little monitor icon, and chose “R”.
Dropdown menu for choosing session type
After starting an R session, you get the usual R prompt ‘>’.
Type:

> 1:100

TeXmacs responds with the output from R:

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100

What is nice in using TeXmacs as an interface to R is that you can now move up to the previous line, edit it, and press enter again. The output will be replaced by The new output.
If you would like to enter multiple lines of text at once, press shift-enter, instead of enter. The input field will expend, and nothing will be sent to the R process. Once you are done, press enter. Here is an example:
Screen Shot 2012-12-03 at 10.28.24 PM
You also see two additional features: syntax highlighting, and special background colors for input and output text. To turn on this coloring scheme, select the aptly named menu item Documentadd package→Programvarsession.
BTW: If you prefer to invert the behaviour of enter and shift-enter, select the option multiline input as shown below:
Screen Shot 2012-12-03 at 10.34.37 PM

Graphics

TeXmacs+R can also handle graphics.
Enter the following expression:

> plot(1:10);v()

Notice that after the first plot() command, usually a new window pops up, and you have to select the TeXmacs window again.
The result will look something like the following:
Screen Shot 2012-12-03 at 10.43.41 PM
As you see, we needed two commands: the usual plot call, and v(). v() inserts the current plot as is into TeXmacs. So you can now do the following:

> title("Our first sample plot");v()

Which gives:
Screen Shot 2012-12-04 at 11.42.04 AM
We now have a second copy of our graph, with a title. You could, of course, have done two plot commands and one v().

> plot(1:10);title("Our first sample plot");v()

Let us try a more complicated example:
Here is a nice plot taken from the R Graph Gallery.
mu1mu2s11s12s22rhox1<-seq(-10,10,length=41) # generating the vector series x1
x2#
f<-function(x1,x2){
 term1  term2  term3  term4  term5  term1*exp(term2*(term3+term4-term5))
} # setting up the function of the multivariate normal density
#
z<-outer(x1,x2,f) # calculating the density values
#
persp(x1, x2, z,
      main="Two dimensional Normal Distribution",
      sub=expression(italic(f)~(bold(x))==frac(1,2~pi~sqrt(sigma[11]~
                     sigma[22]~(1-rho^2)))~phantom(0)~exp~bgroup("{",
              list(-frac(1,2(1-rho^2)),
              bgroup("[", frac((x[1]~-~mu[1])^2, sigma[11])~-~2~rho~frac(x[1]~-~mu[1],
              sqrt(sigma[11]))~ frac(x[2]~-~mu[2],sqrt(sigma[22])) ~+~
              frac((x[2]~-~mu[2])^2, sigma[22]),"]")),"}")),
      col="lightgreen",
      theta=30, phi=20,
      r=50,
      d=0.1,
      expand=0.5,
      ltheta=90, lphi=180,
      shade=0.75,
      ticktype="detailed",
      nticks=5) # produces the 3-D plot
# adding a text line to the graph
mtext(expression(list(mu[1]==0,mu[2]==0,sigma[11]==10,sigma[22]==10,sigma[12]==15,rho==0.5)), side=3)

Created by Pretty R at inside-R.org

In order to paste this text properly into the R session, you will have to chose Edit→paste from→Verbatim. This is because TeXmacs normally treats the clipboard as representing a special TeXmacs format, not plain text.
One you press enter, you’ll see something like this:
Screen Shot 2012-12-03 at 11.03.43 PM
At the Bottom you see a strange combination of > and +. These represent the prompts given by R as the lines are sent to the session. The current interface between TeXmacs and R can not tell when the R process finished evaluating its input. Not only that, but it might be that we’re not even finished with sending/receiving all the input lines. Thesefore, hit enter a couple more times, followed by v():
Screen Shot 2012-12-03 at 11.15.02 PM
As you can see, the size of the graph doesn’t fit very well. We can control this by specifying the size of the image that should actually be produced by R v(width=8,height=8) (remember you can go up and edit the previous expression):
Screen Shot 2012-12-03 at 11.18.11 PM
Much better!

Help

There are at least two ways of calling help in R inside TeXmacs. The first, is the usual help.start(), and using a web browser. The second, which is what I usually use, is to just ask for help in the usual way:

> ?plot

The help page is inserted formated into the current Buffer:
Screen Shot 2012-12-03 at 11.24.50 PM

Working on a remote server

One of the best features of the TeXmacs+R interface is the ability to work on remote systems. To do that it is best to first set up a possibility for password-less ssh sessions (which means you need to ssh-keygen on the source system, and add the public key in the file ~/.ssh/authorized_hosts.)
You should also install the TeXmacs package on the remote system.
One that is set up, enter the following command in an R session:

> system("ssh -t REMOTE_HOST R")

You should then get the prompt of the remote system. (The argument ‘-t’ makes ssh open a ‘psydo-terminal’, which interacts better with TeXmacs) Now enter:

> library(TeXmacs)

And, if you’d like to use graphics,

> start.view()

This last command opens a postscript device, and sets it up so that the v() command will copy from it to TeXmacs.
Here is what things look like then:
Screen Shot 2012-12-04 at 12.26.40 AM
Everything should work as on the local machine. Thus, you can ask R for help, and it should be inserted nicely formated into the current buffer.

Saturday, 7 April 2012

More on atof()

Some more background on atof(). I analyzing sequencing data. That means, I am processing files with billions of lines. I haven't yet mastered properly how to do all I need in a parallel manner, so some of my operations are linear. That's where it all starts. At first, I had programs written in perl+python to analyze the data. The problem was that each file took many hours to precess. The rate of processing (as measured by pv) was round 300kb/s. Simply running 'cat' on the file works at a rate of 100Mb/s. Quite a lot faster. I was trying to do all my processing at the rate of 'cat'. Here is where the problems start - other than the fact that I switched to flex/C/C++.... The data to be analyzed has lots of numbers, and so I read and write a lot of numbers, and every read/write operation on numbers takes a long time, basically because computers store data in binary, and I'm storing text files with a decimal representation of numbers.
There are 3 solutions to this problem:
  1. Store numbers in binary in the file. For that one needs to define a good format for storing binary data. That wuold be really the best approach.
  2. Write or find a good BCD (Binary-coded decimal) library. I know... sounds old. A bit like perl, I guess. But, as long as one doesn't to too complicated math, this will be almost as fast as 1.
  3. improve read/write of numbers.
How much was there to improve? With sprintf my programs were running at rates of 10-30 MB/s, or 3-10 times "too slow".
I went on a hunt for an efficient itoa, but didn't really find much.
In the end, I wrote the itoa mentioned in my last post. That itoa was fixed width, now I also added variable width.
Here is a table of the relative speeds of various implementations:

size of number my itoa my itoa_fill0 itoa1 ufast_itoa10 new_itoa sstream i32toa sprintf
10^1 1.00 2.25 1.27 1.10 1.81 6.04 9.98 18.64
10^2 1.00 2.21 1.33 1.09 1.91 6.07 9.91 18.62
10^3 1.00 2.45 1.80 1.41 3.01 6.68 12.16 22.06
10^4 1.00 2.33 2.25 1.38 2.47 6.24 12.06 21.12
10^5 1.00 1.33 1.80 2.76 4.46 4.18 7.98 14.48
10^6 1.00 1.12 1.87 2.27 3.85 3.52 6.86 12.76
10^7 1.00 1.21 2.27 2.97 4.75 3.79 7.90 14.05
10^8 1.00 1.27 2.78 3.07 4.74 4.00 6.72 15.08
10^9 1.23 1.00 2.70 3.85 5.11 3.49 6.08 13.73

It is easy to see what the problem is. sprintf is really slow. sstream is actually surprisingly fast.... atoi next. Then, just reading and writing from files....

Wednesday, 28 March 2012

Optimizing atof()

I was annoyed that atof() was making my program run 6 times slower. So, I wrote a new, slightly faster one. Still not fast enough for my taste. I tried playing around with mmx, but don't have good results to show, yet. I'll add benchmarks soon. The program will use one integer multiplication, addition and subtraction per digit, and one float multiplication for the whole number.
double my_atof (const char *p)
{
   static unsigned char to_do[256] ;
   static double my_exp[512*2] ;
  
  int i;

 
  if( to_do[0] != DONE ) {
    // Initialize tables for interpreting characters and for exponents.
    for( i=0; i<255; i++) to_do[i] = 255 ;
    to_do['0']=0 ;
    for(i='1'; i<='9'; i++) to_do[i] = DIGIT ;
    
    
    to_do['-'] = MINUS ;
    to_do['.'] = POINT ;
    to_do['e'] = EXPONENT ; 
    to_do['E'] = EXPONENT ;
    to_do[' '] = SPACE ;
    to_do['\t'] = SPACE ;
    to_do[0] = DONE ;
    my_exp[512] = 1.0;
    for( i=1; i<511; i++)
      my_exp[i+512] = my_exp[i-1+512] * 10.0 ;
    for( i=1; i<511; i++)
      my_exp[512-i] = my_exp[512-i+1] * 0.1 ;
#define MY_EXP(x) (my_exp[((x)+512) & 1023])
  }
  
  int neg=FALSE , div=0, val = 0, eval=0 ;


  // The function works in 5 stages:
  // 1. jump over whitespace, till first char is found - either digit, sign or dot.
  // 2. Read digits till first non-digit - either '.' or eE
  // 3. Read digits after decimal point
  // 4. Read exponent digit or sign
  // 5. Read rest of exponent.
  while( TRUE ) {

    switch( to_do[ *p ] ) {
    case DIGIT: val = *p-'0' ; goto end_loop1 ;
    case SPACE: 
      break ;
    case MINUS:
      neg = TRUE ;
      goto end_loop1 ;
    case PLUS:
      goto end_loop1 ;
    case POINT:
      goto found_dot ;
    case 0: goto end_loop1 ;
    default:
      return 0.0 ;
    }
    p++ ;
  }


  end_loop1:
  p++ ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: val= val*10+(*p)-'0'; break ;
    case 0: val *= 10 ;break ;
    case POINT:
      goto found_dot;
    case EXPONENT:
      goto found_exp;
    default:
      if( neg ) val = -val ;
      return (double)( val) ;
    }
     p++ ; 
  }

  found_dot:
  p++ ;

  while( TRUE ) {
    {
      switch( to_do[ *p ] ) {
      case DIGIT: val= val*10+(*p)-'0';div--; break ;
      case 0: val *= 10 ;div--;break ;
      case EXPONENT:
 goto found_exp;
      default:
 if( neg ) val = -val;
 return MY_EXP(div) * val ;
      }
      p++ ;
    }
  }

  found_exp:
  p++ ;
  if( neg ) val = -val ;
  neg=FALSE ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: eval=(*p)-'0';goto end_loop2 ;
    case 0: goto end_loop2 ;
    case SPACE: 
      return MY_EXP(div) * val ;
    case MINUS:
      neg  = TRUE ; //fallthrough
    case PLUS:
      goto end_loop2 ;
    default:
      return MY_EXP(div) * val ;
    }
    p++ ;
  }

  end_loop2:
  p++ ;
  while( TRUE ) {
    switch( to_do[ *p ] ) {
    case DIGIT: eval= eval*10+(*p)-'0'; break ;
    case 0: eval *= 10 ;break ;
    default:
      goto done ;
    }
    p++ ; 
  } 
  done:

  if( neg ) 
    eval = div - eval ;
  else
    eval = div + eval ;
  return  MY_EXP( eval ) * val ;

}

Wednesday, 3 August 2011

Faster files in R

R is fairly slow in reading files. read.table() is slow, scan() a bit faster, and readLines() fastest.
But all these are nowhere as fast as other tools that scan through files.

Let us look at an example. I have in front of me a 283M file.

(Small update: the timings where off before. First because R hashes strings, one has to quit and restart R to get good timings. But I'm not really sure why my timings were off before...)


; time grep hello file
1.79 real 0.23 user 0.27 sys
; time wc file
774797 145661836 297147275 file
2.87 real 2.13 user 0.34 sys
; time cat file | rev >/dev/null
3.58 real 0.02 user 0.35 sys

> system.time({a=readLines("file")})
user system elapsed
25.158 0.829 26.500

> system.time({a=scan("file",what=character(),sep="\n")})
Read 774797 items
user system elapsed
30.526 0.827 31.734

> system.time({a=read.table("file",sep="\n",as.is=T)})
user system elapsed
31.880 0.802 32.837


sad, isn't it? And what does R do? Process data. So we read large files in all the time.

But beyond readLines(), R has deeper routines for reading files. There are also readChar() and readBin().

It turns out that using these, one can read in files faster.

Here is the code: (The second version further down is much better...)

my.read.lines <- function( fname, buf.size=5e7 ) {
s = file.info( fname )$size
in.file = file( fname, "r" )
buf=""
res = list()
i=1
while( s > 0 ) {
n = min( c( buf.size, s ) )
buf = paste(buf, readChar( in.file, n ),sep="" )
 
s = s - n
r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
n=nchar(buf)
if( substr(buf,n,n)=="\n" ) {
res[[i]] = r
buf = ""
} else {
res[[i]] = head(r,-1)
buf = tail(r,1)
}
i=i+1
}
close( in.file )
c( unlist(res), buf )
}

Created by Pretty R at inside-R.org



The gain is not amazing:

> system.time({a=my.read.lines("file")})
user system elapsed
22.277 2.739 25.175


How does the code work?
 buf = paste(buf, readChar( in.file, n ),sep="" )
reads the file. We read the file in big chunks, by default 50MB at a time. The exact size of the buffer doesn't matter - on my computer, 1MB was as good, but 10k much slower. Since the loop over these chunks is done inside R code, we want the loop to have not too many iterations.

We then split the file using
r = strsplit( buf, "\n", fixed=T, useBytes=T)[[1]]
the fixed=T parameter makes strsplit faster.

The rest of the function deals with preserving the part of the buffer that might have a partial line, because we read in constant-sized chunks. (I hope this is done correctly)

The result is that we read in a file in 1/3rd of the time. This function is based on the thread Reading large files quickly in the R help mailing list. In particular, Rob Steele's note. Does anyone have a faster solution? Can we get as fast as rev? As fast as grep?



Update.

I got comments that it isn't fair to compare to grep and wc, because they don't need to keep the whole file in memory. So, I tried the following primitive program:

#include < stdio.h >
#include < stdlib.h >

main()
{
FILE *f ;
char *a ;
f = fopen("file","r") ;

a=malloc(283000001) ;
fread( a, 1, 283000000, f ) ;

fclose(f) ;
}

That finished reading the file in 0.5sec. It is even possible to write a version of this function that uses R_alloc and can by dynamically loaded. That is quite fast. I then turned to study readChar(), and why it is slower than the C program. Maybe I'll write about it sometime. But it seems that reading the whole file in at once with readChar is much faster. Though it takes more memory...

Here is the new code:
my.read.lines2=function(fname) {
s = file.info( fname )$size
buf = readChar( fname, s, useBytes=T)
strsplit( buf,"\n",fixed=T,useBytes=T)[[1]]
}

Created by Pretty R at inside-R.org


And the timings:

> system.time({f=file("file","rb");a=readChar(f,file.info("file")$size,useBytes=T);close(f)})
user system elapsed
1.302 0.746 2.080

> system.time({a=my.read.lines2("file")})
user system elapsed
10.721 1.163 12.046


Much better. readChar() could be made a bit faster, but strsplit() is where more gain can be made. This code can be used instead of readLines if needed. I haven't shown a better scan() or read.table()....


This thread will continue for other read functions.

Saturday, 20 June 2009

Analysis of Iran absentee votes



On http://www.presstv.com/detail.aspx?id=98206&sectionid=351020101 the official Iranian election results outside of Iran are posted. Here is a bit of exploration of the results.

The graph shows the number of votes for Ahmadinejad (x-axis) vs. the number of votes for Moussavi (y-axis). Each point is a city (or a country). The diagonal line represents the ratio in official total results for all of Iran. Any point above the line has more votes for Moussavi than the official ratio says, any point below the line more for Ahmadinejad. (I didn't use some regions where the reported total was less than the sum of votes). The total percentages of the data I looked at is 56% for Moussavi, 40% for Ahmadinejad, 2% for Karroubi and 2% for Reazei.

There are two main lines that the points lie on: one is very close to the official ratio, the other has a much steeper ratio - many more votes for Moussavi.
The points that lie on the less steep line with more votes to Ahmadinejad are Mecca, Medina, Damascus, Kuwait, Karbala.
points that lie on the steep line, with more votes to Moussavi are New York, London, Kuala Lampur, Paris, Ottawa.
Dubai is somewhat of an outlier, and sits between the two lines.

Looking at the ratio of votes for Ahmadinejad to the other two candidates, Karroubi and Reazei gives the same overall picture. The graphs are below.

When looking at these results, the first thing one has to remember is that they are not expected to represent the population in Iran. The Iranian people who live in New York represents a very different socioeconomic sphere, and are exposed to very different Media than the people who live in Iran.

The results in these ballots definitely do not agree with the letter supposedly proving the election fraud in Iran. In no case do the percentage of votes for Karroubi and Reazei come close to the numbers given in the letter. In the letter they get 32% and 9%, in these votes, even in the most "western" booths, they get 4% and 1% (and Moussavi gets 85%). So, either these results are also fabricated, or the letter is fraudulent.

What I can infer from this are the following options

1. If both official results and these results are fabricated, then in this case they are cleverly fabricated. As opposed to the results in Iran, where there are no trends whatsoever with respect to urban vs. non-urban areas, here there are very strong trends.

2. If just the official results are fabricated, and these results are real, then in at least some places - Mecca, Medina, Damascus, Kuwait, and Karbala, the results of the vote are very close to the official published results. This at least means that the results in Iran were not totally fabricated - they reflect the ratio of opinions in some places.

3. It could be that the official results are fabricated, and at the same time, results in some cities outside Iran, but not in all cities where fabricated. Either because of where the counting was done, or who counted.

4. It could be that both the official results and these results are real. When I started the analysis I expected much more discrepancy. The official results are not that far off the results in Mecca and Medina, which I would expect to represent the population in Iran much more than New York.

Dubai is especially interesting. It lies between the two slopes. This could be a mixture of people who live under more or less western influence, or are on two different social status levels. It is interesting, because Nate's analysis of urban vs. non urban areas saw no such trend.

In short, after looking at these results, I still can not be sure of what happened. I'm pretty sure that the "Fraud letter" is fake, though.

Here are the same graphs comparing Ahmadinejad to the other two candidates: