The Power of the R Programming Language 382
BartlebyScrivener writes "The New York Times has an article on the R programming language. The Times describes it as: 'a popular programming language used by a growing number of data analysts inside corporations and academia. It is becoming their lingua franca partly because data mining has entered a golden age, whether being used to set ad prices, find new drugs more quickly or fine-tune financial models. Companies as diverse as Google, Pfizer, Merck, Bank of America, the InterContinental Hotels Group and Shell use it.'"
SAS strikes out ^H^H^H er, "back" (Score:5, Informative)
Good thing Boeing's not using free software for aircraft simulation tools [sourceforge.net], space station labs [nasa.gov], sub hunters [com.com], or moon rockets [popularmechanics.com] ;-)
Re:Based on S (Score:3, Informative)
"And I also don't know why it is called R"
The two guys who originally wrote it both had first names that started with R and, being the jokers that they were, they thought it would be funny to give it a name very similar to S.
r-project.org (Score:4, Informative)
The language is very well documented online and the mailing lists contain thousands of examples. It is primarily for statistical analysis, and the libraries available for doing such analysis are unparalleled.
Re:Show me some example code (Score:4, Informative)
It may not be "better" in the sense of "calculating stuff with higher efficiency" (i reckon you can do the same stuff in C, given the right libraries :P), but for statistical and data mining/visualization purposes it is a quite simple object-oriented functional language with many useful built-in procedures and lots of freely available packages/libraries that is simple enough for "non-programmers" and, so far, it does what i want it to do fast enough and.. it's free.
So.. probably not the best all-purpose programming language, but fits nicely in the "statistical software environment/language" niche and, unlike SPSS et al., it's free (as in "libre", as in "everyone can independently verify your results without having to shell out cash", which is useful in academia).
Example code:
results <- prcomp(datamatrix)
This does a PCA (Principal Component Analysis [wikipedia.org]) on the data contained in "datamatrix" and dumps the results into the "results" variable.
I have no idea how i would start to code that in C, python, etc. in a way that's remotely efficient ;)
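To make the one-liner above a bit more concrete, here is a minimal sketch that runs it on R's built-in USArrests data set. The data set is my pick, not the poster's, and scale. = TRUE assumes the columns should be standardized first.

```r
# USArrests ships with R; it stands in for "datamatrix" here
datamatrix <- as.matrix(USArrests)

# PCA on standardized columns
results <- prcomp(datamatrix, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- results$sdev^2 / sum(results$sdev^2)
print(round(var_explained, 3))

# Scores of the first few observations on the first two components
print(head(results$x[, 1:2]))
```

summary(results) and biplot(results) are the usual next steps for inspecting the fit.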
Re:Show me some example code (Score:5, Informative)
It's been a while since I worked with it and I don't have code examples with me at the moment, but think of it as the Matlab/Octave of statistics, including the preference for "function over each row/column" instead of loops.
Compared to other languages, R makes it easy to do statistical analysis tasks like Matlab/Octave makes it easy to do linear algebra tasks.
Plus, as other posts stated above, there's excellent documentation and tons of useful libraries (take a peek at the libraries available at the Debian repositories), Bioconductor being the finest example.
Oh, and nice emacs integration. :)
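For anyone wondering what "function over each row/column" looks like in practice, here is a small sketch with a made-up matrix. The explicit loop is only there for comparison.

```r
# A small made-up matrix: 3 rows, 4 columns
m <- matrix(1:12, nrow = 3)

# Apply a function over each row (MARGIN = 1) or column (MARGIN = 2)
row_ranges <- apply(m, 1, function(r) max(r) - min(r))
col_means  <- apply(m, 2, mean)

# The same column means with an explicit loop, for comparison
loop_means <- numeric(ncol(m))
for (j in seq_len(ncol(m))) loop_means[j] <- mean(m[, j])

print(row_ranges)
print(col_means)
```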
Re:Only for certain kind of analyst... (Score:5, Informative)
Sorry, but R is not relatively new. It's been around for at least 10 years (I was taught how to use R at university back in 2001), and S and later S+ (which R is a FOSS version of) have been around for even longer, since the mid-70s.
Re:Show me some example code (Score:1, Informative)
R is not a programming language, it is an environment which uses the S programming language. The S programming language was developed in the 70s at Bell Labs.
If you use Linux, you can install R with
yum install R*
which contains many examples.
People switching to R usually started with Splus, which a few years ago worked to close source code contributed by academics. They have chosen to move to R.
Re:Show me some example code (Score:5, Informative)
I use R a great deal. Think of it as an alternative to MATLAB, or Excel, rather than C or perl or lisp or whatever you like to use as a general purpose language. So, compared to MATLAB, functions are first class objects (rather like lisp), so you can write functions that take functions as arguments, and return them as well, just as though they were simple variables. It handles vectors rather easily, and has decent plotting tools.
#quick example
# function, which, given numerical arguments a and b, and a function g, returns a function of x
f <- function(a, b, g){
  function(x){ a * x + g(b * x)}
}
f1 <- f(1, 2.5, sin)
x <- seq(-pi, pi, length.out=100)
plot(x, f1(x), type='l')
Re:Show me some example code (Score:4, Informative)
i'm a PhD student in biostatistics at a fairly prestigious american university. we use R almost exclusively, because it is better than other statistical software options. reasons for its superiority are i) it's free ii) it's open source and iii) it's considerably more powerful than STATA, SPSS, SAS, etc.
it is true that other languages can be quicker for many tasks. proficiency in C is desirable, but C is not geared toward statistics, where many built-in libraries and user-contributed packages for R implement complex methodologies.
i'm not as versed in C as i am in R, so i can't provide a direct comparison of the languages, but i have included a sample below. it's a function that fits a simple linear model, taking the outcome data and input data (as a matrix) and a couple of other parameters as inputs. it returns a variety of values, including the model coefficients and fitted values. there is an R function that does this exact thing, but we have to do something for homework.
lm=function(y,x,returnHat=FALSE,addInt=FALSE){
if(addInt){
x=cbind(matrix(1,nrow(x),1),x)
}
#use range around 0, for roundoff error
if(-1e-5<=det(t(x)%*%x) & det(t(x)%*%x)<=1e-5){stop("x'x not invertible",call.=F)}
beta=solve(t(x) %*% x) %*% t(x) %*% y
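#residual standard deviation (note: var() divides by n-1, not n-p)
sigma = as.numeric(sqrt(var(y-(x%*%beta))))
#covariance of beta uses the variance sigma^2, not the standard deviation
varbeta=sigma^2 * (solve(t(x)%*% x))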
fitted=x %*% beta
residuals=y-fitted
if(!returnHat){
output=list(beta,sigma,varbeta,fitted,residuals)
names(output)=c("beta","sigma","varbeta","fitted","residuals")
}
if(returnHat){
hat=x%*% solve(t(x) %*% x) %*% t(x)
output=list(beta,sigma,varbeta,fitted,residuals,hat)
names(output)=c("beta", "sigma", "varbeta", "fitted", "residuals", "hat matrix")
}
output
}
i'd also say that i'm glad to see some press for R. it's popular in some circles, but not as accepted by companies and some academics because it is open source. the idea is that software you have to pay a licensing fee for must be more reliable because, well, you paid for it (thinking i'm sure you're familiar with).
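for comparison, a sketch of the built-in route mentioned above, checked against the same normal-equations formula the homework function uses. the mtcars data and the mpg ~ wt model are my own picks for illustration.

```r
# Built-in fit; stats::lm is spelled out in case a homework function named lm masks it
fit <- stats::lm(mpg ~ wt, data = mtcars)

# Same coefficients by hand: beta = (X'X)^-1 X'y, with an intercept column
x <- cbind(1, mtcars$wt)
y <- mtcars$mpg
beta <- solve(t(x) %*% x) %*% t(x) %*% y

print(coef(fit))
print(drop(beta))
```

summary(fit) also reports standard errors, which is where the sigma and varbeta pieces of the hand-written version come in.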
Re:Show me some example code (Score:2, Informative)
Re:Freak your colleagues out with "no loop" code.. (Score:5, Informative)
"The worse thing about R programming is its name. Googling for "R" turns up way to much noise and way too little signal"
Try searching from http://rseek.org/ [rseek.org] instead of directly from Google.
Re:Show me some example code (Score:1, Informative)
Want example code and comparison with other stats software? Here's 80 pages from an entire BOOK devoted to your request:
http://RforSASandSPSSusers.com
From the text:
Since its release in 1996, R has dramatically changed the landscape of research software. There are very few things that SAS or SPSS will do that R cannot, while R can do a wide range of things that the others cannot. Given that R is free and the others quite expensive, R is definitely worth investigating.
It takes most statistics packages at least five years to add a major new analytic method. Statisticians who develop new methods often work in R, so R users often get to use them immediately. There are now over 800 add-on packages available for R.
R also has full matrix capabilities that are quite similar to MATLAB, and it even offers a MATLAB emulation package.
If you'd like to see some examples with accompanying graphics, check out the newsletters or manuals at
http://cran.r-project.org/
I use R because it's free, there's lots of free add-on code, every other statistician I know uses R, it's quick and easy to test stuff out in R, and if you want you can speed up things by writing the most computationally intensive parts of your program in C, C++, or FORTRAN. Also, you can get great graphics out of R if you put in a little effort to learn how.
Re:Show me some example code (Score:1, Informative)
# Draws labelled diagrams with critical
# regions for the normal and t distributions
#
# Excuse my lack of code reuse, etc. this
# was meant to make diagrams just for a quick
# homework assignment
#
# Please show me how to do this in SAS!!!
# Tell me you'd even think of trying this in SAS
# to draw pictures for your short homework
# assignment
#
#
crit_norm_diag = function(alpha,lowertail=T) {
par(new=F);
end = -4;
crit_value = qnorm(alpha);
if(!lowertail) {
crit_value = -crit_value;
end = -end;
}
pts = c(end,crit_value);
pts = sort(pts);
x = seq(pts[1],pts[2],by=0.01);
y = dnorm(x);
x = append(x,c(pts[2],pts[1]));
y = append(y,c(0,0));
plot(dnorm,-4,4,xlab='z',ylab='p.d.f.');
par(new=T);
polygon(matrix(c(x,y),ncol=2,byrow=F),density=25);
if(crit_value<0) {
text(-3,0.3,round(crit_value,3),pos=3);
arrows(-3,0.3,crit_value,dnorm(crit_value),angle=20);
} else {
text(3,0.3,round(crit_value,3),pos=3);
arrows(3,0.3,crit_value,dnorm(crit_value),angle=20);
}
par(new=F);
}
crit_norm_diag(0.05);
crit_t_diag = function(alpha,df,lowertail=T) {
par(new=F);
end = -4;
crit_value = qt(alpha,df);
if(!lowertail) {
crit_value = -crit_value;
end = -end;
}
pts = c(end,crit_value);
pts = sort(pts);
x = seq(pts[1],pts[2],by=0.01);
y = dt(x,df);
x = append(x,c(pts[2],pts[1]));
y = append(y,c(0,0));
plot(function(x) { dt(x,df) },-4,4,xlab=paste('t, d.f.=',df,sep=''),ylab='p.d.f.');
par(new=T);
polygon(matrix(c(x,y),ncol=2,byrow=F),density=25);
if(crit_value<0) {
text(-3,0.3,round(crit_value,3),pos=3);
arrows(-3,0.3,crit_value,dt(crit_value,df),angle=20);
} else {
text(3,0.3,round(crit_value,3),pos=3);
arrows(3,0.3,crit_value,dt(crit_value,df),angle=20);
}
par(new=F);
}
crit_t_diag(0.05,10);
Re:R sucks as a language (Score:3, Informative)
The R language is optimized for writing statistical code. It's going to seem a little weird, especially if you have a traditional programming background. Once you spend some serious time writing R code, however, you will probably begin to appreciate many of the things that initially seemed odd.
For example, consider the way R handles function calls [moertel.com]:
All of these "oddities" serve to reduce the amount of boilerplate code you need to write when coding up statistics routines. (Click the link above if you want to see examples and take a more in-depth tour of R's fascinating and time-saving function call behavior.)
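A small sketch of the kind of function-call behavior in question. The features shown (named arguments in any order, partial matching of argument names, defaults that refer to other arguments and are evaluated lazily) are my own selection and not necessarily the list from the linked article; trimmed_mean is a hypothetical helper.

```r
# Hypothetical helper: trims a fraction of values from each end, then averages
trimmed_mean <- function(values, fraction = 0.1, verbose = FALSE,
                         label = paste("trim", fraction)) {
  # 'label' is a default that refers to another argument; it is only
  # evaluated if it is actually used (lazy evaluation)
  if (verbose) message(label)
  mean(values, trim = fraction)
}

v <- c(1, 2, 3, 4, 100)

default_call <- trimmed_mean(v)                           # defaults for the rest
named_call   <- trimmed_mean(fraction = 0.2, values = v)  # named, in any order
partial_call <- trimmed_mean(v, frac = 0.2)               # 'frac' matches 'fraction'

print(c(default_call, named_call, partial_call))
```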
Re:Show me some example code (Score:2, Informative)
x = 1:10 #integers from 1 to 10
#set all even elements of x that are less than 7 to -1
x[(x < 7)&(x %% 2 == 0)] = -1
#y is some big array with several dimensions
#I and J are vectors of integers
z = y[I,,J,,, drop = F]
#'z' is now a sub array
z = y[I,2,J,1,]
#now z is a subarray with fewer dimensions
Re:SAS strikes out ^H^H^H er, "back" (Score:5, Informative)
Re:Freak your colleagues out with "no loop" code.. (Score:2, Informative)
x = vector(mode="list")
x[["joe"]] = y
x[["bob"]] = z #z can be a function!
x = list(joe=y)
x$bob = z
The R language and its uses (Score:5, Informative)
The R language (yes, it's a language; an interpreted language is a language too) has become the main computer language of statisticians (both academics and sundry statistical researchers) around the world. It is used in those cases where researchers feel the need for customized computations rather than the use of a package like SAS or SPSS.
R became popular through history and a snowball effect. It started as a FOSS re-implementation-from-scratch of the "S" language designed for statistical work at Bell Labs (see http://en.wikipedia.org/wiki/S_(programming_language) [wikipedia.org]). Some academics and researchers of repute used S because at that time (1975) it was very innovative and far better than most alternatives, and others followed. The S language gained a measure of acceptance among statisticians. Then when R became available the cycle intensified because of the much improved availability of the interpreter and its libraries. This cycle continued to the point that by now probably most professional statisticians use it.
As far as I can see, the R language isn't especially sophisticated or elegant, and may strike people used to more modern languages as a bit repugnant. It does however excel in three respects:
(a) it allows for easy access of Fortran and C library routines
(b) it allows you to pass large blobs of data by name
(c) it makes it easy to pass data to and from your own compiled C and Fortran routines
The first reason is particularly important because it allows one to use pre-compiled libraries such as the LAPACK linear algebra package, Fourier transforms, or special function evaluations, and thereby gain execution speeds comparable to C despite being an interpreted language (just like Matlab, Octave, Scilab, Gauss, Ox and suchlike): the hard work is carried out by a compiled library routine which is made easily accessible through the interpreted language. Any algorithm needed in statistics that's available as C or Fortran code can be linked in and called without too much effort.
The second reason is important because it slows down execution much less than eager pass-by-value would: as far as I know, R does not copy an argument unless the function modifies it, and any such modification stays local to the function rather than changing the caller's data.
The third reason is particularly important because it helps researchers be more productive. Reading in your data, examining it, graphing it, tracing outliers and cleaning them up is best done in an interactive environment in an interpreted language. Coding such things in C or Fortran is an awful waste of time, and besides, researchers aren't code-monkeys and don't enjoy coding inane for-loops to read, clean, and display data. Vector and matrix primitives are far more powerful, and usually preferable unless they are so inefficient that you have to wait for the result. However, there are times when you just need to carry out standard algorithms (linear algebra, calculation of mathematical or statistical functions) or simply time-consuming repetitive algorithms that run so much faster in a genuine compiled language. You could start out by coding the algorithm in an interpreted language to check if it's working, and then isolate the computationally expensive part and code it up in C or Fortran. Using R (or Matlab or Scilab) you can *call* the compiled subroutine, pass it your (cleaned) data, and get the result back in an environment where you can easily analyze it.
That's why languages like R, Matlab, Scilab, Octave, Gauss, and Ox are so productive: you get the best of both worlds, the convenience, interactivity, and terseness of a high-level interpreted language together with the speed of compiled languages.
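As a rough illustration of the "vector primitives versus hand-written loops" point, here is a sketch with made-up data. Both functions compute the same sum of squared deviations; timings vary by machine, so none are quoted here.

```r
set.seed(1)
x <- rnorm(1e5)

# Loop version: accumulate squared deviations from the mean one element at a time
loop_ss <- function(v) {
  m <- mean(v)
  total <- 0
  for (value in v) total <- total + (value - m)^2
  total
}

# Vectorized version: the element-wise work happens inside compiled code
vec_ss <- function(v) sum((v - mean(v))^2)

print(system.time(loop_ss(x)))
print(system.time(vec_ss(x)))
```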
So why R, and why not Gauss or Matlab or whatever?
Well, part of that is cultural. If you're an econometrician you'll have been weane
Re:Freak your colleagues out with "no loop" code.. (Score:1, Informative)
The worst thing about R programming is its name. Googling for "R" turns up way too much noise and way too little signal.
use RSeek.org
Problem solved.
Re:Not a language, really (Score:3, Informative)
Actually, R is a real (Turing-complete) programming language like Perl, Python, Ruby, etc. It just happens to have lots of statistical libraries and matrix-oriented functions.
You put #!/usr/bin/Rscript in your first line and it can work just like any other scripting language, with command-line arguments, etc. I use it all the time as a replacement for other scripting languages (think PDL+Perl or Numpy+Python).
R is an excellent language for any scientist. The syntax and semantics of the language are very well thought-out.
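A minimal sketch of the scripting use described above, assuming Rscript is installed at /usr/bin/Rscript. The file name mean_args.R and the helper parse_numbers are hypothetical.

```r
#!/usr/bin/Rscript
# Hypothetical script "mean_args.R": prints the mean of the numbers
# given on the command line

# Convert character arguments to numbers
parse_numbers <- function(args) as.numeric(args)

# Only the arguments that follow the script name
args <- commandArgs(trailingOnly = TRUE)

if (length(args) > 0) {
  cat("mean:", mean(parse_numbers(args)), "\n")
} else {
  cat("usage: mean_args.R NUMBER [NUMBER ...]\n")
}
```

After chmod +x mean_args.R it can be run as ./mean_args.R 1 2 3, like any other script.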
Re:Only for certain kind of analyst... (Score:5, Informative)
Pfft. Matlab is the fastest way to connect to his testing equipment.
Well.. Labview, actually, but no one in their right mind would want to actually use it. Anyway, simulink gets you a lot of the graphical programming features if you need that.
Re:Based on S (Score:3, Informative)
"And I also don't know why it is called R"
"The guys who originally wrote both had first names that started with R and being the jokers that they were, they thought it would be funny to give it a name very similar to S."
Additionally, in statistics r is the letter used to denote the Pearson product-moment correlation coefficient [wikipedia.org].
Re:r-project.org (Score:2, Informative)
With multi-core processors becoming more and more prevalent, R's developers should remedy this as soon as possible.
Ask and ye shall receive [insidehpc.com]
Re:Based on S (Score:5, Informative)
http://www.rseek.org/ [rseek.org]
Re:Freak your colleagues out with "no loop" code.. (Score:2, Informative)
Re:r-project.org (Score:5, Informative)
With multi-core processors becoming more and more prevalent, R's developers should remedy this as soon as possible.
Already done. There's an R package called SNOW [www.sfu.ca] that allows you to handle code running in parallel.
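A minimal sketch of the idea, assuming a recent R where the parallel package ships with the base distribution (as far as I know it took over much of snow's interface). The two-worker cluster and the toy function are illustrative only.

```r
library(parallel)

# Start two local worker processes (the count is arbitrary)
cl <- makeCluster(2)

# Run a toy function on each input, spread across the workers
squares <- parLapply(cl, 1:4, function(i) i^2)

# Always shut the workers down when finished
stopCluster(cl)

print(unlist(squares))
```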
Re:Based on S (Score:3, Informative)
are you talking about R or S? searching for "R" on google returns pretty good results [google.com]--the first 6 links are all related to R. and 4 of the results on the next page are also related to R. searching for "S" on the other hand doesn't immediately come up with any relevant results.
i'd say it's fairly easy to find info on R using google considering its limited popularity relative to other languages. obviously you're not going to find a ton of information on it since it's a somewhat obscure niche language. but if you can find the r-project/CRAN website or other R resources on google, then you can probably find documentation for whatever info you need.
besides, you can always use multiple keywords and boolean search operators to narrow down your search results, like searching for "R" AND "statistics." or once you've found online documentation for "R" you can use the "site:" modifier to search that site only.
i mean, this is all pretty basic stuff. there are much harder things to search for information on--like pharmaceutical drugs. this just requires basic knowledge of search engines and a little commonsense.
Re:r-project.org (Score:2, Informative)
there are tools to help parallelize code:
http://www.stats.uwo.ca/faculty/yu/Rmpi/ [stats.uwo.ca]
http://www.sfu.ca/~sblay/R/snow.html [www.sfu.ca]
Re:The R language and its uses (Score:5, Informative)
I second that. R is terribly useful for the wide variety of libraries available and esoteric statistical procedures. But you would *never* want to write a long/complex program in R.
As you say, it's most convenient to work in some other language that's actually designed to be scalable, object-oriented, and easy to debug. It's usually straightforward to call R libraries when you need them. I find that python+scipy+rpy is an almost ideal environment for day to day scientific programming.
Re:Not a language, really (Score:3, Informative)
Re:Show me some example code (Score:4, Informative)
Re:Show me some example code (Score:3, Informative)
Production side: I would agree. However, statistical differential equations? SAS is good for predefined "statistical analysis", not for solving partial differential equations. Almost all mechanical problems in aerospace (read fluids, solids, thermal, electro) are expressed as partial differential equations. Solutions of these (barring a few special cases) require numerical methods. The most common of these methods are finite element, finite difference and finite volume.
And each one of these has its numerous "schemes" for solving a particular class of PDE. The choice of scheme/method depends on the problem at hand. You can use a prepackaged tool like Fluent/Gambit. But that limits you to the limitations of those packages. Need anything cutting edge, or applicable to a special case, you need to program it yourself (c/c++/fortran). Most design houses have tons of legacy code that they build upon and add modules to deal with their specific problem. A lot of these run on linux clusters or unix big irons. I don't think they use gcc though. For performance sake most use proprietary compilers (eg pgc, icc etc). But no SAS.
Now, on the control systems side, most researchers use Matlab, but most of the implementation is done using embedded C or Ada.
As for SAS, they do now support freeware aka Linux.
I have personally noticed a sense of unease when SAS employees are asked about R. They are quick to dismiss it with the usual FUD and then change the topic. It is quite amusing actually. Happens every time.
It is a pain in the ass to change. (Score:5, Informative)
Say you realize that you need to check for another corner case that you forgot, or need to extend a function for another purpose, or whatever. In any other language, you would type a few lines of code and be done with it. Not with labview. With labview you have to move things around to make room for the new code, disconnect wires and reconnect them. NI has added stuff into the newer version to help with this (auto growing, etc) but it still turns into a mess in short order.
Other things are just easier to type than to draw, and also easier to read in text than as a schematic, like equations. So much so that they have added the ability to type portions of the code, but the amount of setup that you need to do with a code block often defeats the time benefit you get from using it.
As someone who likes "clean code" I find LabView much more tedious and time consuming to keep neat, and when dealing with other coders that are not as picky, I find that their LabView code is much messier and harder to read than Java or C code by the same developer.
Re:Show me some example code (Score:2, Informative)
Plus, R not only does "vectors" or "arrays"... R does "lists", and "data frames".
These data structures make it much more logical to work with experimental data (e.g. to pass it between functions), and their use is logical and coherent throughout the language. All through R, data structures "do the right thing".
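A short sketch of what that looks like, with made-up measurements: a data frame holds the experiment, and a function hands back several differently shaped results bundled in a list. The names experiment and summarize_experiment are hypothetical.

```r
# A made-up experiment: one row per measurement
experiment <- data.frame(
  subject   = c("a", "b", "c", "d"),
  treatment = c("control", "drug", "control", "drug"),
  response  = c(1.2, 2.9, 0.8, 3.1)
)

# A function can return a count, a named vector, and a data frame row in one list
summarize_experiment <- function(df) {
  list(
    n     = nrow(df),
    means = tapply(df$response, df$treatment, mean),
    best  = df[which.max(df$response), ]
  )
}

result <- summarize_experiment(experiment)
print(result$means)
print(result$best)
```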
Re:Freak your colleagues out with "no loop" code.. (Score:3, Informative)
Do you happen to have a link to what you mean by "a program should not have state"? Because, I mean, that seems antithetic to the nature of a program.
Of course there is a state, you're using a standard computer to run the program, so there must be a state somewhere. Still, the point is that even if the language implementation works by changing the computer's memory state, the abstraction you use to program isn't state-based. In a pure functional programming language, you don't program by manipulating a state, but by computing the results of functions.
Regarding the SICP book, like most functional programming languages, Scheme isn't a pure functional language. It contains constructs with side effects, which actually change the program state directly. Such constructs are available because there are problems that are very difficult (but not impossible) to handle with pure functional programming, so language designers end up making compromises.
Just my 2 (Euro) cents
Re:Show me some example code (Score:2, Informative)
beta=solve(t(x) %*% x) %*% t(x) %*% y
NO, NO, NO!
beta=solve(crossprod(x,x),crossprod(x,y))
is much nicer, and it is less susceptible to round-off errors.
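A sketch comparing the two forms on simulated data. The data are made up; on a well-conditioned problem like this one the answers agree, and any difference would only be expected to show up with nearly collinear columns.

```r
set.seed(42)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))      # design matrix with an intercept column
y <- X %*% c(2, -1, 0.5) + rnorm(n)    # simulated response

# Explicit inverse followed by matrix multiplications
beta_inverse <- solve(t(X) %*% X) %*% t(X) %*% y

# Solve the normal equations directly, without forming the inverse
beta_solve <- solve(crossprod(X), crossprod(X, y))

print(cbind(beta_inverse, beta_solve))
```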
Labview sucks the most (Score:5, Informative)
Labview is utterly non-deterministic in its execution. The execution order of blocks does NOT follow the data flow of the lines joining them if there are more than a handful of blocks present. In fact, the execution sequence becomes random, and changes randomly when block positions are changed (even without changing the data connectivity). This forces the use of explicit sequence structures in any non-trivial function, increasing its complexity and opacity. Just try synchronizing shared data between asynchronous loops. Even their Knowledgebase admits that there's no way to do it properly.
And let's not get started on the crappy content of Labview's documentation. It's organized and formatted tolerably well, but the content is vacuous. Hardly any functions have any suggestion of their behaviour when faulty data arrives (e.g. a NaN), for example.
Re:Only for certain kind of analyst... (Score:3, Informative)
>The folks I know who use Excel for analysis use it because it's the package that everyone gets in their organization, there's a shit load of material on the web that uses excel, there's plenty of add-ons for it (no need to reinvent the wheel), and when sharing data and analysis, everyone is familiar with it
Back when I was in grad school, ten years ago, Excel was the preferred data analysis tool for most physical and biological scientists that I knew; even when they had high end analysis tools installed on their machines.
Re:Show me some example code (Score:1, Informative)
SAS is very threatened by R, in a number of markets, but from my experience extremely threatened in biomedical research. When you charge 10s of thousands of dollars for a platform and there's a strong open source free platform that due to its nature is generally 6-12 months ahead in terms of implementation of cutting edge algorithms, you have little but FUD to fall back on.
Re:Only for certain kind of analyst... (Score:4, Informative)
Re:Show me some example code (Score:4, Informative)
I used to use Matlab quite a lot (mostly for prototyping simulations and for visualization; I use C for my "real" simulations which take a lot of CPU time, since they run so much faster in C). I learned R about 2 years ago, and found that it can do pretty much everything Matlab can that I need for my own research.
Anyway, I wrote up a "Matlab / R Reference" that translates the basics between the two packages. It doesn't have highly specialized stuff, but many people have found it handy. I use my own reference quite a bit myself, since these days I mix up commands between the two packages quite a bit. It's available at:
http://www.math.umaine.edu/faculty/hiebeler/comp/matlabR.html [umaine.edu]
Re:Show me some example code (Score:2, Informative)
No, it isn't. The standard is maintained by the ISO and costs money. That there happen to be free compilers for C doesn't mean that C itself is free.