Friday, March 6, 2015

The R documentation is bad

I have been using R for some time now and can still find it frustrating to work with. Over the years I have come to the conclusion that this is primarily due to the documentation being bad. I offer no actual solutions here, but thought I would try to write down exactly what I dislike about it.

The docs are more or less premised on knowing which function you wish to use. Want to find out what some argument of some specific function does? Sorted.

But, if you are new to the language or just want to check out a few different ways of doing things, the built-in documentation is not going to help.

The “phone book” type reference that just lists all functions in a package alphabetically is useless for exploring the language. This in turn makes it very hard for people to adopt R. I know this from my own experience and from helping others who are new to R.

Example


As an example, let’s look at creating an identity matrix. Pretend we are new to R and staring at the prompt.

First I look at help(matrix) and see nothing about it there. Following the links in the “See Also” section, I visit the pages for data.matrix and array, neither of which gives any hints.

Let’s try a search, typing ??identity at the prompt. A big list of functions pops up, but only two from base and one from stats. First is called ‘dontCheck’ which probably isn’t going to make an identity matrix so I skip that, but second up is identity. 

A-ha! Let’s take a look

“A trivial identity function returning its argument.”

Oh. 

All right, let’s bring up the phone book listing for everything in base. I type help(base), navigate to the index and type identity in the search box. Nope, just the two functions I saw before.

Now, I personally know the right function is diag, but how are people meant to find that out? If they have to search Google for basic things, it’s kinda hard to conclude that the documentation is good.
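For anyone who lands here looking for the answer, a minimal sketch of what diag does when given a single positive integer (the variable names I3 and m are just for illustration):

```r
# diag() with a single positive integer n returns the n x n identity matrix
I3 <- diag(3)
print(I3)
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    0    1    0
# [3,]    0    0    1

# Sanity check: multiplying by the identity leaves a matrix unchanged
m <- matrix(1:9, nrow = 3)
all(m %*% I3 == m)
# [1] TRUE
```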

The second example is left as an exercise to the reader: find in the R help how to get the last element of a vector. And no, despite the helpful and intuitive name, the documentation for “[” doesn’t say.
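For readers who would rather skip the exercise, here is a short sketch of two common ways to do it. The vector x is made up for illustration, and these are not the only options.

```r
x <- c(10, 20, 30)

# Index with the length of the vector
x[length(x)]
# [1] 30

# Or ask tail() (from the utils package) for the last 1 element
tail(x, 1)
# [1] 30
```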

Help


The actual help browser I dislike as well. I have not yet figured out how to have more than one page open at a time, and I really don’t like the idea of having lots of windows popping up everywhere. Something like tabs would be a lot more useful.

Most people are reading on the screen, which makes big blocks of text hard to read, and big blocks of text feature quite often in the help pages. 

This becomes especially important when there are so many intricate details being explained. It’s easy to miss things, and many pages require careful and close readings. 

Usually, once you wade through all the text, there are examples. Many of these are not clear, seemingly having been written for brevity over elucidation. It’s hard to work out what exactly is going on, again especially for people new to R.

None of the examples include the output inline, which also makes it very hard to get a feel for what they do. 

This is made worse by the fact that often I find myself on a “function safari.” I know there must be a function that does what I need but I have no idea what it is called or how to find it. 

It’s a huge time sink if I have to go through a bunch of help pages and manually run all the examples to figure out if they do what I need.
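A partial workaround, which does not change the underlying complaint, is the example function from utils. It runs the examples section of a help page and echoes each command together with its output. A minimal sketch, using diag purely as the illustrative topic:

```r
# Run the examples from the help page for diag, echoing commands and output.
# ask = FALSE avoids being prompted before any plots are drawn.
example("diag", package = "base", ask = FALSE)
```

This still means visiting one topic at a time, so it only takes some of the pain out of a function safari.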

The font is ugly as well. Just throwing it out there.

Groups


Having specific documentation for each function is necessary, but there should be higher level grouping of related functions. 

For example when I look at help(matrix) why does it not include links to diag, rbind, cbind, t etc. or show examples of them all that include the output when run?

This is a good start, but it is all on one big page and not actually referenced in the help page for matrix. (Note also it does not mention diag)
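To illustrate the kind of grouped example with inline output that would help on the matrix page, here is a rough sketch. The matrix m and the values used are made up; the functions are the ones named above.

```r
m <- matrix(1:6, nrow = 2)   # values are filled column by column
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

t(m)                         # transpose
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6

rbind(m, 7:9)                # add a row
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
# [3,]    7    8    9

cbind(m, 7:8)                # add a column
#      [,1] [,2] [,3] [,4]
# [1,]    1    3    5    7
# [2,]    2    4    6    8

diag(2)                      # 2 x 2 identity matrix
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
```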

One of the problems is the use of generics and the general mish-mash that is OO in R. It makes it hard to find functions that are related to working with a particular data structure or the task at hand. 

For example, compare R vector with the Python or C++ equivalents.
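One tool that goes some way toward this is the methods function from utils, which can list the methods registered for a generic or for a class. A minimal sketch follows; the exact listing depends on which packages are loaded, so no output is shown, and it only covers functions dispatched as methods, not everything that happens to work on a given data structure.

```r
# Methods registered for the generic function plot()
plot_methods <- methods("plot")
print(plot_methods)

# Methods that have a version for objects of class "matrix"
matrix_methods <- methods(class = "matrix")
print(matrix_methods)
```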

I really feel all these things compound to be a huge weakness of R.

Outro



For me the most frustrating thing is that the documentation I am after is usually there, it’s just too hard to find and often poorly presented. There is much room for improvement of structure and presentation. As it is now it is hard to search, hard to navigate and hard to read.

I would love to see a nice HTML based reference with navigation menus and logical groupings. It is easy to provide logically structured information as well as a reference index.

Switching to something browser based would make it a lot easier to find relevant information, and would allow users to pick their own fonts (and font sizes) and have multiple pages open at the same time.

I am sure there are people out there thinking “but I like the way R does its docs!!” and I can’t argue with that. But I have seen enough videos on the Internet to know some people seem to like some very strange things indeed ...

I could go on and on with lots more examples/things that are just a bit silly. As I said at the start, I have no real solution to offer, but hopefully I have provided some useful specifics, or at least helped some people realize “it’s not you, it’s R.” I have gotten myself up to speed in a lot of languages over the years and by far, R took the longest to feel competent with.

[update]

Pat Burns has shared a useful map of information: R Navigation Tools

Also I found The R Cookbook a useful reference for answering those "how do I X a Y" type questions.

Ritchie Cotton has shared his thoughts on the topic here.

19 comments:

  1. Yes, you are absolutely correct.
    BTW. that's also the reason why a lot of people are switching to python.
    Plus, you cannot contribute to R as they have closed source behind SVN, which only a handful of people have access to.

    Replies
    1. Thanks yes I was not aware about that SVN thing. Seems a bit weird tbh!

    2. It is simply not true that only a handful of people have access to the source. The main Subversion repository is publicly readable, and is mirrored on GitHub. Only a handful of people have commit access, but that's true of every open source project.

  2. You're absolutely right. R is definitely not rising in popularity on every measured rank chart while Python flatlines. Oh… wait… http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html #nvrmnd

    Replies
    1. Comparing R and Python is apples vs oranges IMO. It's not a popularity contest and I would take the bet that there are more people using Python to do more stuff than R any day.

      At the moment I find it hard to recommend R to people, just because the learning curve is so steep.

      It's easy to pick up bad habits and learn to do things "the wrong way" when you end up winging things because it's hard to find "the right way" to do something.

      Those bad habits can last a long time, and as people move from new users to becoming contributors or making their own packages etc, bad practices can end up having a big impact.

  3. This comment has been removed by the author.

  4. I can see your point but I think it misunderstands how to go about documenting for lookup and for process use within R. I think that the documentation system is terrific for what it's designed to do; that is, be a lookup. The current system is nice because it is relatively straightforward to maintain and keeps documentation uniform. It's not the documentation's job to tell the user how to use it. That job goes to vignettes and/or README. But that's up to maintainers to include vignettes as they're not required for CRAN submission.

    Replies
    1. I suppose base R doesn't have vignettes though. But there is documentation on how base R works in a non lookup style: http://cran.r-project.org/manuals.html or http://manuals.bioinformatics.ucr.edu/home/programming-in-r or http://www.introductoryr.co.uk/R_Resources_for_Beginners.html

    2. Yes the vignettes are a good resource too. I guess what I would like to see is a nice HTML reference that has everything.

      When I am doing something new I usually start by searching on R-Bloggers. The resources are there; it all just seems a bit haphazard.

      It is hard when so many contributed packages also form an important part of the ecosystem.

  5. Yes the documentation is utterly bewildering to noobs. It should be called a technical reference, because that is what it is, not help. Yet there are now hundreds of gentle introductions to the language. Things have improved greatly for the novice over how things were when I first explored R (about 13 years ago).

    Example of how it is a technical reference. ?plot gives you:
    "Generic function for plotting of R objects. For more details about the graphical parameter arguments, see par."
    Technical information that means nothing to the new user. What is an object? What does generic mean? It doesn't even define "plotting." It doesn't tell you anything if you don't already know the meaning of those terms. It assumes you have already learned the fundamentals of the language and its terminology.

    But rejoice, we now have lots of free introductory tutorials and the user friendly Stack Overflow!

    Replies
    1. Yeah I agree there are some good resources out there. Yeah the plot manual in particular gives me a chuckle now and then: looking at par, everything is lty, cex... bewildering is a good word.

      I do kinda feel I should be able to be effective in a language without having to post questions to Stack Exchange (or similar), esp for basic stuff.

  6. No, you are absolutely wrong. The documentation in R is generally very good. As an experienced user I can't think of anything worse than sifting through reams of waffle looking for useful information. This is the same for any language. If you are looking for introductory information for beginners the internet is loaded with it.

    You might say you can't find beginner information that suits your level of proficiency and current need. However you can't say that R documentation is poor just because it doesn't suit you.

    Replies
    1. I too would consider myself an experienced user. My issue is mostly not with the actual content of the docs, just its structure and presentation.

      I still feel the way the R docs work is quantifiably the worst out of any of the languages I use regularly (Python, C++, Objective-C).

      We may just have to agree to disagree though :)

  7. There are tools to try to work around the weakness: http://www.burns-stat.com/r-navigation-tools/

  8. The "function safari" is awesome! Great writing. I knew exactly what you meant!

    From the beginning of my experience with programming languages, the documentation has been bad. Really, I don't think it's specific to just R nor is it a new problem. I still don't know how I learned C in the early 1990s. Anyone who worked on SAS in earlier days knows how almost worthless that documentation was.

    The key to learning languages and environments is the sharing, not the documentation. This is why so much more code is being written now...it's easier to share information and to ask questions than ever before.

    I don't think experienced R users pay that much attention to the examples that are included. Even in Python, I don't pay attention to the examples in the documentation.

    Learning programming languages is hard. R is not a good first programming language. It's better than SAS being a first language, but I'd steer newbs to C or Python because they'll develop a sense of computer science concepts in Python whereas in R it's about the numbers.

    Comparing languages and environments is good for identifying relative strengths and weaknesses of languages, but that's as far as it goes.

    Bigger problems for R are support for 64-bit integers and the early OO models.

    For intermediate to advanced users, grab a copy as quick as you can of Hadley's Advanced R book.

    R is going to be around for a long time.

  9. Hm, if I try to enter help("matrix") at the Python prompt, I get even less information than in R. So R is not that bad :)

  10. My main complaint against R, and the reason I don't use it, is that the statistical and machine learning packages (the main attraction to R, imo) all lack meaningful documentation.

    It's most often the case that the documentation gives no idea what specific model is being fit, what optimization formulation is used, what algorithm is used, how to access the parameters, etc. References to papers are pretty useless, since you have no idea how the authors of the package implemented the algorithms. (Iirc, even the documentation for the fft function doesn't say what format the fft output is in, e.g. where the DC component is!)

    R is useful if you're willing to take the package authors at their word, and just fit a model and use it, but not interrogate or verify the algorithm it uses. Since I design and analyze machine learning algorithms, this is not acceptable (and it wouldn't be even if I were doing data analysis: I need to know what exactly my tools are to know how to interpret their output!).

    Matlab and Python, although not as concise as R in terms of the workflow, are much more fitting environments, because e.g. scikit-learn or the various files you can download from Matlab Central are much more transparent about what is going on.

  11. This article makes a lot of sense. I started to learn Python and R at almost the same time. After one year, I'm very familiar with Python but still a total noob at R. The R documentation is good as a "technical reference", but I think it conceptually gets mixed up with the "introduction" or "tutorial". For instance, if I have something unclear in Python, I go to Google and it is always very easy to find the answers as the top links in the search results. In R, I know the help information is not for me. Then I google it, and what do I get? The exact same help information again! That is quite frustrating.

  12. This comment has been removed by the author.
