R — Reference

Source: https://www.r-project.org/

R

  • Created: 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland; an open implementation of the S language (Bell Labs, 1976) (Wikipedia).
  • Latest stable: R 4.6.0 (“Because it was There”), released 2026-04-24 (r-project.org).
  • Owner: R Foundation for Statistical Computing (since 2003); maintained by the R Core Team. License: GPL-2.0-or-later.
  • Paradigms: functional, procedural, object-oriented, reflective, array-oriented; lexically scoped (Scheme-influenced).
  • Typing: dynamic, with extensive implicit coercion. Most data are vectors: there are no scalars, only length-1 vectors.
  • Memory: tracing garbage collector; copy-on-modify semantics for most objects (R’s signature behavior).
  • Compilation: tree-walking interpreter plus a byte-code compiler and VM (compiler package); the JIT byte-compiles functions by default since R 3.4, and packages are byte-compiled on installation by default since R 3.5.
  • Primary domains: statistics, biostatistics, econometrics, data analysis, plotting, reproducible research, bioinformatics (Bioconductor).
  • Official docs: https://cran.r-project.org/manuals.html
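
The typing and memory bullets above can be illustrated with a minimal sketch in base R. The variable names are illustrative only.

```r
# A "scalar" is simply a vector of length 1.
x <- 42
length(x)        # 1
is.vector(x)     # TRUE
typeof(x)        # "double": numeric literals are doubles unless suffixed with L
typeof(42L)      # "integer"

# Copy-on-modify: b starts out sharing a's values,
# but modifying b leaves a untouched.
a <- c(1, 2, 3)
b <- a
b[1] <- 99
a                # 1 2 3
b                # 99 2 3
```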

At a glance

R is the lingua franca of statistical computing. It is an interactive, vectorized, copy-on-modify functional language with a brutally pragmatic standard library for tabular data, linear models, and graphics. Three OO systems ship in base (S3, S4, RC) plus the popular community R6. The CRAN repository hosts >22,000 packages with mandatory cross-platform building. Packaging, vignettes, and the help system are tightly integrated; ?fun and vignette() are core to the workflow.

Getting started

Install: download from CRAN (https://cran.r-project.org/); on macOS use Homebrew brew install r, on Debian apt install r-base, on Windows the installer.

Version manager: rig (R Installation Manager, by Posit) is the modern standard — rig add release, rig default 4.5. Older alternative: RSwitch on macOS.

Hello world (file hello.R):

cat("Hello, world!\n")
# or
message("Hello, world!")

Run: Rscript hello.R or paste in the REPL.

REPL: launch with R. RStudio (from Posit, the company formerly named RStudio) is the dominant IDE; radian is a modern terminal REPL with syntax highlighting; VS Code via the R extension also works.

Project layout (a package, the canonical “project” type):

mypkg/
  DESCRIPTION    # metadata + Imports/Suggests
  NAMESPACE      # exports + imports (often generated by roxygen2)
  R/             # .R source files
  man/           # .Rd help pages (often generated from roxygen2)
  tests/         # testthat/
  vignettes/     # long-form docs (.Rmd / .qmd)
  data/          # binary datasets (.rda)
  inst/          # arbitrary files copied to install dir

For analysis projects, usethis::create_project() and renv (lockfile-based dependency pinning) are the modern norm.

Package/build tool: install.packages("pkg") from CRAN; R CMD build / R CMD check / R CMD INSTALL for source packages. devtools (load_all, document, test, check) is the standard developer wrapper; pak is the fast modern installer; renv for project-local libraries.

Basics

Types and literals:

  • Atomic vectors: logical (TRUE/FALSE/NA), integer (1L), double (1.0, the default numeric), complex (1+2i), character ("x"), raw.
  • NULL (zero-length) vs NA (missing, type-specific: NA_integer_, NA_real_, NA_character_).
  • Compound: list (heterogeneous), data.frame (list of equal-length vectors), matrix/array (vectors with dim attribute), factor (integer with levels).

Variables/scoping: assignment via <- (idiomatic), =, or ->; <<- walks up enclosing environments. Lexical scoping with first-class environments. Function arguments use lazy evaluation (promises) — they aren’t evaluated until referenced.
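
A small sketch of lexical scoping, <<-, and lazy evaluation in practice. The function names make_counter and first_only are hypothetical, chosen for illustration.

```r
make_counter <- function() {
  n <- 0                 # lives in the environment created by this call
  function() {
    n <<- n + 1          # updates n in the enclosing environment
    n
  }
}
counter <- make_counter()
counter()                # 1
counter()                # 2

# Lazy evaluation: y is a promise that is never forced, so stop() never runs.
first_only <- function(x, y) x
first_only(1, stop("never evaluated"))   # 1
```

Each call to make_counter() creates a fresh environment, so separate counters do not share state.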

Control flow: if/else, for, while, repeat+break, next. Vectorized ifelse(cond, yes, no). Almost everything is an expression: x <- if (p) 1 else 2.

Functions:

add <- function(x, y = 1) x + y      # default arg, last expr is return value
add(2)                               # 3
add(y = 5, x = 3)                    # named args, any order
do.call(add, list(1, 2))             # apply with list of args
\(x) x^2                             # R 4.1+ lambda shorthand

Variadic via ...: capture the extra arguments with list(...), or forward them unchanged by passing ... to another call.
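
As a sketch of both uses of ... (capturing and forwarding), with hypothetical helper names count_args and label:

```r
count_args <- function(...) {
  args <- list(...)      # capture the extra arguments as a list
  length(args)
}
count_args(1, "a", TRUE)           # 3

# Forward ... untouched to another function.
label <- function(prefix, ...) paste(prefix, ..., sep = "-")
label("id", 2024, "x")             # "id-2024-x"
```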

Strings: 1-indexed, character vectors. paste() / paste0() for concat, sprintf() for format, gsub()/sub() for regex replace, strsplit() for split. stringr (tidyverse) wraps these with consistent argument order.
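
The base string functions named above, shown on one illustrative input (the string itself is an arbitrary example):

```r
s <- "statistical computing"
paste0("R: ", s)                    # "R: statistical computing"
sprintf("%s has %d characters", s, nchar(s))
sub("a", "A", s)                    # first match only: "stAtistical computing"
gsub("a", "A", s)                   # every match:      "stAtisticAl computing"
strsplit(s, " ")[[1]]               # "statistical" "computing"
substr(s, 1, 4)                     # "stat" (1-indexed, inclusive on both ends)
```

strsplit() returns a list with one element per input string, hence the [[1]].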

Collections: c() concatenates; list() builds heterogeneous; [/[[/$ for subset (single-bracket preserves type, double-bracket extracts). Vectorize everything you can: for loops are slow mainly when each iteration grows or copies an object (copy-on-modify), plus some per-iteration interpreter overhead, not because looping is inherently expensive.
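
A brief sketch of the three subsetting operators on a list and a named vector. The objects lst and v are illustrative.

```r
lst <- list(a = 1:3, b = "text")
lst["a"]          # still a list (of length 1)
lst[["a"]]        # the integer vector 1 2 3
lst$b             # "text"; $ is shorthand for [[ with a literal name

v <- c(x = 10, y = 20, z = 30)
v[c("x", "z")]    # by name:    x = 10, z = 30
v[v > 15]         # by logical: y = 20, z = 30
v[-1]             # negative index drops elements: y = 20, z = 30
```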

Intermediate

Type system depth: dynamic, but classes drive method dispatch.

  • S3: lightweight, name-based — class(x) <- "foo", then print.foo <- function(x, ...) .... Dispatch by UseMethod(). Single-dispatch on first arg.
  • S4: formal — setClass("Foo", representation(x = "numeric")), setGeneric, setMethod. Multiple dispatch on argument signatures. Used heavily in Bioconductor.
  • R5 (Reference Classes / “RC”): built-in mutable OO with $ method calls; less common.
  • R6 (CRAN package): mutable, environment-based, similar to typical class-based OO; widely used (Shiny, plumber).
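
To make the S3 and S4 entries concrete, here is a minimal sketch using only base R and the methods package. The class and generic names (temperature, describe, Point, norm2) are hypothetical.

```r
library(methods)

# S3: a constructor sets the class attribute; a generic dispatches on it.
new_temperature <- function(celsius) {
  structure(list(celsius = celsius), class = "temperature")
}
describe <- function(x, ...) UseMethod("describe")
describe.temperature <- function(x, ...) sprintf("%.1f degrees C", x$celsius)
describe.default <- function(x, ...) "no description available"

describe(new_temperature(21.5))   # "21.5 degrees C"
describe(42)                      # "no description available"

# S4: formal class definition with typed slots, accessed with @.
setClass("Point", representation(x = "numeric", y = "numeric"))
setGeneric("norm2", function(p) standardGeneric("norm2"))
setMethod("norm2", "Point", function(p) sqrt(p@x^2 + p@y^2))
norm2(new("Point", x = 3, y = 4))   # 5
```

S3 trusts the class attribute without checking it, while S4 rejects a new("Point", x = "a") call because the slot type does not match.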

Modules: packages are the unit of modularity. NAMESPACE declares export/importFrom. No file-level imports — everything in R/ is loaded into the package environment.

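Error handling: stop(), warning(), message(). tryCatch(expr, error = function(e) ..., warning = ..., finally = ...) unwinds the stack to the handler. Conditions are first-class S3 objects (class vectors such as c("simpleError", "error", "condition")). withCallingHandlers runs handlers at the point of signalling without unwinding, which pairs with restarts (withRestarts / invokeRestart). rlang::abort() / cli::cli_abort() are modern equivalents with rich formatting and structured conditions.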
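
The difference between tryCatch and withCallingHandlers can be sketched as follows. safe_log and warnings_seen are hypothetical names for illustration.

```r
safe_log <- function(x) {
  tryCatch(
    {
      if (!is.numeric(x)) stop("x must be numeric")
      log(x)
    },
    error = function(e) {
      message("recovered: ", conditionMessage(e))
      NA_real_                       # value returned in place of the failed call
    }
  )
}
safe_log(1)         # 0
safe_log("a")       # NA, after the message "recovered: x must be numeric"

# A calling handler runs without unwinding the stack; invoking the
# muffleWarning restart lets the computation continue and return its value.
warnings_seen <- character()
result <- withCallingHandlers(
  {
    warning("first problem")
    "finished"
  },
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
result          # "finished"
warnings_seen   # "first problem"
```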

Concurrency: R is single-threaded for user code. parallel (base): mclapply (fork on Unix), parLapply (PSOCK clusters everywhere). Modern: future + furrr for transparent parallel map-style; mirai for lightweight async; callr for clean R subprocesses. Real threading inside C/C++ via OpenMP in compiled extensions.
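
A minimal sketch of process-based parallelism with the base parallel package, assuming the machine allows two local worker processes to be started. slow_square is a hypothetical stand-in for real work.

```r
library(parallel)

slow_square <- function(x) {
  Sys.sleep(0.1)   # stand-in for real work
  x^2
}

cl <- makeCluster(2)                      # two PSOCK worker processes
res <- tryCatch(
  parLapply(cl, 1:4, slow_square),        # the function is serialized to the workers
  finally = stopCluster(cl)               # always shut the workers down
)
unlist(res)                               # 1 4 9 16
```

Workers are separate R processes, so any packages or global objects the function needs must be loaded or exported on them (for example with clusterExport()).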

I/O: base read.table/write.table, readRDS/saveRDS (binary R objects), readLines. Modern: readr (tidyverse, fast TSV/CSV), vroom (lazy CSV), data.table::fread (among the fastest CSV readers), arrow (Parquet/Feather), DBI + dbplyr for databases.
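
A short round-trip sketch with the base I/O functions, writing only to temporary files. The data frame is illustrative.

```r
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# RDS stores the R object itself, including column types.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
roundtrip <- readRDS(rds_path)
all.equal(roundtrip, df)                  # TRUE

# CSV is portable text; column types are re-inferred on read.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
back <- read.csv(csv_path)
str(back)

unlink(c(rds_path, csv_path))             # clean up temporary files
```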

Stdlib highlights: stats (lm, glm, t.test, dist, kmeans), graphics + grDevices (base plots, devices), utils (install.packages, head, str), methods (S4), parallel, compiler (cmpfun, enableJIT), tools (package machinery).

Advanced

Memory / GC: generational, non-moving mark-and-sweep. Trigger manually with gc(). Copy-on-modify is implemented via reference counting: when an object has only one reference, R modifies it in place; otherwise it copies first. This is why x[i] <- v in a loop can be O(n^2): when each assignment goes through a function or method that makes a copy (as with data frames), or when the vector is grown one element at a time, every iteration duplicates the whole object. Use data.table (reference semantics) or pre-allocate.
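
The cost difference between growing and pre-allocating can be sketched like this. The functions grow and prealloc are hypothetical, and timings depend on the machine, so no figures are claimed.

```r
n <- 1e4

grow <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, i^2)   # reallocates and copies every iteration
  out
}

prealloc <- function(n) {
  out <- numeric(n)                          # allocate once
  for (i in seq_len(n)) out[i] <- i^2        # fill the existing vector
  out
}

identical(grow(n), prealloc(n))              # TRUE: same result, different cost
system.time(grow(n))
system.time(prealloc(n))
```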

Concurrency deep dive: there is no shared-memory threading at the R level. The future package abstracts over multisession (PSOCK), multicore (fork), cluster, and remote backends; plan(multisession) makes furrr::future_map run in parallel. mirai uses NNG for fast IPC. For HPC: Rmpi, clustermq, batchtools. GPU: gpuR, torch (R bindings to LibTorch).

FFI:

  • .Call and .External interfaces to C with R’s SEXP API.
  • Rcpp (Dirk Eddelbuettel) is the dominant C++ bridge — cppFunction("..."), Rcpp::sourceCpp(), RcppArmadillo / RcppEigen for linalg.
  • cpp11 is a header-only modern alternative.
  • reticulate for Python interop, JuliaCall for Julia, V8 for JavaScript.

Reflection: deep. body(f), formals(f), environment(f) for any function. substitute(), quote(), bquote(), deparse(), eval() give full AST manipulation. sys.call(), match.call(), sys.function() for call introspection.
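
A compact sketch of the reflection functions listed above, using only base R. The function f and the expression are illustrative.

```r
f <- function(x, y = 2) x + y
formals(f)$y                 # 2
body(f)                      # x + y
environment(f)               # the environment where f was defined

expr <- quote(a * (b + 1))   # an unevaluated call object
class(expr)                  # "call"
deparse(expr)                # "a * (b + 1)"
eval(expr, list(a = 2, b = 3))   # 8

b_val <- 10
bquote(a * .(b_val))         # a * 10: .() splices in the current value

# Calls can be edited like lists: swap the function being called.
expr[[1]] <- as.name("+")
deparse(expr)                # "a + (b + 1)"
```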

Performance tools: Rprof() + summaryRprof() (sampling profiler), profvis (interactive flame graph, by Posit), bench::mark() (benchmarking with garbage-collection accounting), microbenchmark. compiler::cmpfun(f) byte-compiles a function (mostly automatic since R 3.4). lobstr::obj_size() and tracemem() to detect copies.

God mode

Non-standard evaluation (NSE): R’s killer trick. Functions can capture their unevaluated arguments and decide what they mean.

my_filter <- function(df, cond) {
  cond <- substitute(cond)         # capture AST instead of evaluating
  rows <- eval(cond, envir = df, enclos = parent.frame())   # columns first, then the caller's variables
  df[rows, , drop = FALSE]
}
my_filter(mtcars, mpg > 20 & cyl == 4)

Modern tidy eval (rlang) formalizes this with quosures (enquo, !!, {{ }}):

my_summary <- function(df, var) {
  df |> dplyr::summarise(m = mean({{ var }}), n = dplyr::n())
}

Environments as first-class objects: new.env(), parent.env(), globalenv(), topenv(). Enables closures, mutable state, hash-map use (env$key <- val is O(1)), and the entire NSE stack.
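
As a sketch of the hash-map and mutable-state uses, assuming nothing beyond base R (counts and bump are hypothetical names):

```r
counts <- new.env()                         # mutable; not copied on modify
for (word in c("a", "b", "a", "c", "a")) {
  old <- if (exists(word, envir = counts, inherits = FALSE)) counts[[word]] else 0
  counts[[word]] <- old + 1
}
sort(ls(counts))                            # "a" "b" "c"
counts[["a"]]                               # 3

# Reference semantics: the function mutates the caller's object.
bump <- function(env, key) env[[key]] <- env[[key]] + 1
bump(counts, "b")
counts$b                                    # 2
```

inherits = FALSE keeps exists() from finding same-named objects in parent environments.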

S3 dispatch tricks: class(x) <- c("subclass", "superclass") enables inheritance, and NextMethod() hands off to the method for the next class in the vector. UseMethod tries <generic>.<class> for each element of class(x) in order, then falls back to <generic>.default.
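
A minimal sketch of S3 inheritance through the class vector. The generic speak and the classes dog and animal are hypothetical.

```r
speak <- function(x, ...) UseMethod("speak")
speak.animal <- function(x, ...) "some sound"
speak.dog <- function(x, ...) {
  inherited <- NextMethod()          # continue to speak.animal
  paste("Woof, then", inherited)
}
speak.default <- function(x, ...) "silence"

rex <- structure(list(), class = c("dog", "animal"))
speak(rex)                                   # "Woof, then some sound"
speak(structure(list(), class = "animal"))   # "some sound"
speak(1)                                     # "silence"
inherits(rex, "animal")                      # TRUE
```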

Package internals: NAMESPACE controls visibility — internal functions aren’t exported but are accessible via pkg:::fun. .onLoad / .onAttach hooks run at load. S4 classes/generics need explicit exportClasses / exportMethods.

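Byte compilation: compiler::cmpfun(f) produces a byte-compiled closure; the byte-code interpreter is often several times faster than the AST interpreter on loop-heavy scalar code (roughly 2-5x, workload dependent). enableJIT(3), the default level since R 3.4, compiles closures shortly before use (first or second call) and top-level loops before they run. Inspect with compiler::disassemble.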
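
Explicit byte compilation can be sketched as below. sum_squares is a hypothetical function, and since the JIT usually compiles it anyway, the point is only that the compiled closure behaves identically.

```r
library(compiler)

sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

compiled <- cmpfun(sum_squares)                # explicit byte compilation
identical(sum_squares(1000), compiled(1000))   # TRUE
compiled                                       # printed source ends with a <bytecode: ...> line
```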

Rcpp deep magic: // [[Rcpp::export]] attribute creates the SEXP wrapper. Rcpp::sourceCpp() compiles and loads inline. Rcpp::Rcout for printing back to R.

devtools internals: load_all() simulates package install by sourcing R/ into a fresh namespace + binding to package env, without R CMD INSTALL.

Operators are functions: `+`(2, 3) works. Define your own infix: `%plusone%` <- function(a, b) a + b + 1.
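
Because operators are ordinary functions, they can be passed to higher-order functions. A short sketch follows; %plusone% is the example name from the text.

```r
`+`(2, 3)                          # 5
sapply(1:3, `*`, 10)               # 10 20 30
Reduce(`+`, 1:4)                   # 10

`%plusone%` <- function(a, b) a + b + 1
2 %plusone% 3                      # 6
```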

Idioms & style

  • Naming: snake_case is dominant in modern code (tidyverse style); base R uses dot.case (is.numeric, data.frame). Avoid dot.case in new code: S3 methods are named generic.class, so dotted names can be mistaken for methods.
  • Assignment: <- over = is idiomatic at the top level; = is for arguments. The tidyverse style guide insists on <-.
  • Pipe: |> (base, R 4.1+) is the modern default; %>% (magrittr; lhs becomes the first argument of rhs, or goes wherever . appears) is still common in tidyverse code.
  • Formatter: styler (Posit) — auto-formats to tidyverse style. Linter: lintr. Static analysis: lintr + goodpractice.
  • Vectorize first: avoid for (i in 1:n) x[i] <- ... where a vectorized op, vapply/lapply/Map, or purrr::map_* does the job.
  • Pre-allocate: result <- vector("list", n) then assign result[[i]] <- ... is O(n); growing with c() is O(n^2).
  • Use seq_along(x) not 1:length(x) (handles empty x).
  • vapply over sapply for type-stable code.
  • Expert review focus: copy-on-modify pitfalls, NSE/tidy-eval correctness (especially {{ }} vs !!), namespace pollution, S4 method dispatch edge cases, factor coercion bugs, stringsAsFactors legacy assumptions (defaulted to FALSE only since R 4.0).
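
Two of the idioms above (seq_along and vapply) are easiest to see on empty input. A small base-R sketch:

```r
x <- numeric(0)
1:length(x)          # 1 0: a loop over this runs twice on an empty vector
seq_along(x)         # integer(0): a loop over this runs zero times

# vapply declares the expected type and length of each result;
# sapply changes its return type with the input.
sapply(list(), length)                       # list()
vapply(list(), length, integer(1))           # integer(0)
vapply(list(1:3, 1:5), length, integer(1))   # 3 5
```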

Ecosystem

  • Data wrangling: dplyr, tidyr, data.table (high perf, reference semantics), arrow, dtplyr.
  • Plotting: ggplot2 (grammar of graphics), base graphics, lattice, plotly (interactive), htmlwidgets.
  • Modeling: stats (base), tidymodels (parsnip, recipes, rsample, yardstick), caret (older), mlr3, glmnet, xgboost, lme4 (mixed models), survival, forecast/fable.
  • Bayesian: rstan, brms, cmdstanr, rstanarm.
  • Web: shiny (reactive web apps), plumber (REST APIs), httr2 (HTTP client), rvest (scraping).
  • Reproducible reports: rmarkdown, Quarto (the modern multilang successor), knitr, bookdown, targets (pipeline orchestration).
  • Bioinformatics: Bioconductor (separate repo, 2,300+ packages, S4-heavy).
  • Testing: testthat (dominant), tinytest (zero-dep), covr (coverage).
  • Docs: roxygen2 (inline doc comments that generate .Rd), pkgdown (website generator), devtools::check() (runs R CMD check).
  • Notable users: Posit (RStudio), pharma (FDA accepts R submissions), finance, NYT graphics desk, BBC data team, Bioconductor consortium.

Gotchas

  • Copy-on-modify in loops creates O(n^2) surprises. data.table and reference classes opt out.
  • 1:length(x) explodes when length(x) == 0 (gives c(1, 0)); use seq_along(x).
  • sapply is type-unstable — returns vector, list, or matrix depending on input. Use vapply or purrr::map_*.
  • Partial matching of argument names: mean(x, na.r = TRUE) “works” (matches na.rm), a bug magnet. It cannot be switched off, but options(warnPartialMatchArgs = TRUE) makes R warn about it.
  • Factors stringify “helpfully” in unexpected places. stringsAsFactors = FALSE is the default since R 4.0 (2020-04) but legacy code assumes TRUE.
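  • <<- assigns in the nearest enclosing env that has a binding, not necessarily global, which is surprising in nested closures; if no binding exists anywhere, it creates one in the global environment.
  • NSE captures: library(dplyr); f <- function(col) df %>% select(col) doesn’t work as you expect: select() sees the literal name col, not the column the caller passed; use select({{ col }}).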
  • drop = TRUE default on [: df[, 1] may return a vector instead of a 1-column data.frame. Always pass drop = FALSE defensively, or use tibbles which never drop.
  • NA propagation: most arithmetic with NA yields NA. sum(x, na.rm = TRUE). Comparison: NA == NA is NA, not TRUE.
  • Integer overflow: returns NA with a warning instead of raising an error or wrapping around. .Machine$integer.max is 2^31-1.
  • Floating point equality: 0.1 + 0.2 == 0.3 is FALSE. Use all.equal(a, b) or dplyr::near().
  • T and F are not reserved — they’re variables bound to TRUE/FALSE and CAN be reassigned. Always write TRUE/FALSE.
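  • CRAN policies (approximate): examples should run in about 5s each and the whole check in about 10min, no writing outside tempdir(), and code that uses internet resources must fail gracefully. Slow examples can be wrapped in \donttest{}, which plain R CMD check skips unless --run-donttest is given.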
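
Several of the gotchas above can be reproduced in a few lines of base R. This is a sketch for experimenting at the REPL; the data frame is illustrative.

```r
NA == NA                               # NA
is.na(NA)                              # TRUE: the correct test for missingness
sum(c(1, NA, 3), na.rm = TRUE)         # 4
mean(c(1, NA), na.r = TRUE)            # 1: na.r partially matches na.rm

0.1 + 0.2 == 0.3                       # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))      # TRUE

# Integer overflow yields NA (a warning is raised when not suppressed).
suppressWarnings(.Machine$integer.max + 1L)   # NA

df <- data.frame(a = 1:3, b = 4:6)
class(df[, 1])                         # "integer": dropped to a vector
class(df[, 1, drop = FALSE])           # "data.frame"
```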

Citations