R — Reference
Source: https://www.r-project.org/
R
- Created: 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland; an open implementation of the S language (Bell Labs, 1976) (Wikipedia).
- Latest stable: R 4.6.0 (“Because it was There”), released 2026-04-24 (r-project.org).
- Owner: R Foundation for Statistical Computing (since 2003); maintained by the R Core Team. License: GPL-2.0-or-later.
- Paradigms: functional, procedural, object-oriented, reflective, array-oriented; lexically scoped (Scheme-influenced).
- Typing: dynamic, weak; all atomic data are vectors. No scalars: a “scalar” is a length-1 vector.
- Memory: tracing garbage collector; copy-on-modify semantics for most objects (R’s signature behavior).
- Compilation: tree-walking interpreter with byte-code compilation via the compiler package (JIT enabled by default since R 3.4; installed packages byte-compiled by default since R 3.5).
- Primary domains: statistics, biostatistics, econometrics, data analysis, plotting, reproducible research, bioinformatics (Bioconductor).
- Official docs: https://cran.r-project.org/manuals.html
At a glance
R is the lingua franca of statistical computing. It is an interactive, vectorized, copy-on-modify functional language with a brutally pragmatic standard library for tabular data, linear models, and graphics. Three OO systems ship in base (S3, S4, RC) plus the popular community R6. The CRAN repository hosts >22,000 packages with mandatory cross-platform building. Packaging, vignettes, and the help system are tightly integrated; ?fun and vignette() are core to the workflow.
Getting started
Install: download from CRAN (https://cran.r-project.org/); on macOS use Homebrew brew install r, on Debian apt install r-base, on Windows the installer.
Version manager: rig (R Installation Manager, by Posit) is the modern standard — rig add release, rig default 4.5. Older alternative: RSwitch on macOS.
Hello world (file hello.R):
cat("Hello, world!\n")
# or
message("Hello, world!")
Run: Rscript hello.R or paste in the REPL.
REPL: launch with R. RStudio (developed by Posit, the company formerly named RStudio) is the dominant IDE; radian is a modern terminal REPL with syntax highlighting; VS Code via the R extension also works.
Project layout (a package, the canonical “project” type):
mypkg/
  DESCRIPTION   # metadata + Imports/Suggests
  NAMESPACE     # exports + imports (auto-generated by roxygen2)
  R/            # .R source files
  man/          # .Rd help pages (often generated from roxygen2)
  tests/        # testthat/
  vignettes/    # long-form docs (.Rmd / .qmd)
  data/         # binary datasets (.rda)
  inst/         # arbitrary files copied to install dir
For analysis projects, usethis::create_project() and renv (lockfile-based dependency pinning) are the modern norm.
Package/build tool: install.packages("pkg") from CRAN; R CMD build / R CMD check / R CMD INSTALL for source packages. devtools (load_all, document, test, check) is the standard developer wrapper; pak is the fast modern installer; renv for project-local libraries.
Basics
Types and literals:
- Atomic vectors: logical (TRUE/FALSE/NA), integer (1L), double (1.0, the default numeric), complex (1+2i), character ("x"), raw. NULL (zero-length) vs NA (missing, type-specific: NA_integer_, NA_real_, NA_character_).
- Compound: list (heterogeneous), data.frame (list of equal-length vectors), matrix/array (vectors with a dim attribute), factor (integer with levels).
Variables/scoping: assignment via <- (idiomatic), =, or ->; <<- walks up enclosing environments. Lexical scoping with first-class environments. Function arguments use lazy evaluation (promises) — they aren’t evaluated until referenced.
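The scoping rules above can be sketched as follows. The function names make_counter and lazy_demo are illustrative only and do not come from any package.

```r
# Closure with mutable state: `<<-` rebinds `count` in the enclosing
# environment created by make_counter(), not in the global environment.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()
counter()
third_value <- counter()   # third call overall

# Lazy evaluation: `y` is a promise that is never forced here,
# so the stop() inside it never runs.
lazy_demo <- function(x, y) x * 2
lazy_result <- lazy_demo(21, stop("never evaluated"))
```

Each call to make_counter() creates a fresh environment, so separate counters do not share state.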
Control flow: if/else, for, while, repeat+break, next. Vectorized ifelse(cond, yes, no). Almost everything is an expression: x <- if (p) 1 else 2.
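A small sketch of the constructs just listed, using made-up values:

```r
x <- c(-2, 0, 3)

# `if` is an expression, but its condition must have length 1.
label <- if (length(x) > 2) "long" else "short"

# ifelse() is vectorized over the condition.
signs <- ifelse(x < 0, "neg", "non-neg")

# repeat + break, with next to skip odd values.
total <- 0
i <- 0
repeat {
  i <- i + 1
  if (i > 6) break
  if (i %% 2 == 1) next
  total <- total + i   # adds 2, 4 and 6
}
```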
Functions:
add <- function(x, y = 1) x + y # default arg, last expr is return value
add(2) # 3
add(y = 5, x = 3) # named args, any order
do.call(add, list(1, 2)) # apply with list of args
\(x) x^2 # R 4.1+ lambda shorthand
Variadic via ... (passed through with list(...)).
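As a rough illustration of variadic arguments, the sketch below collects ... into a list and also forwards it to another function. describe_args and mean_no_na are hypothetical helpers written for this example.

```r
# Hypothetical helper: collects `...` into a list and reports on it.
describe_args <- function(...) {
  args <- list(...)
  list(n = length(args), names = names(args))
}

info <- describe_args(a = 1, b = "two", 3)

# `...` can also be forwarded untouched to another function.
mean_no_na <- function(x, ...) mean(x, na.rm = TRUE, ...)
m <- mean_no_na(c(1, NA, 3))
```

Unnamed arguments show up with an empty string as their name once at least one argument is named.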
Strings: 1-indexed, character vectors. paste() / paste0() for concat, sprintf() for format, gsub()/sub() for regex replace, strsplit() for split. stringr (tidyverse) wraps these with consistent argument order.
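The base string functions named above, shown on sample values (the inputs are arbitrary):

```r
first <- "Ada"
last  <- "Lovelace"

full   <- paste(first, last)                      # default sep = " "
id     <- paste0(tolower(first), "_", 1815)       # no separator
padded <- sprintf("%-5s|%05.1f", first, 3.14159)  # C-style formatting
clean  <- gsub("[aeiou]", "", last)               # drop lowercase vowels
parts  <- strsplit("a,b,c", ",", fixed = TRUE)[[1]]
prefix <- substr(last, 1, 3)                      # 1-indexed, inclusive
```

strsplit() returns a list with one element per input string, hence the [[1]].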
Collections: c() concatenates; list() builds heterogeneous containers; [ / [[ / $ subset ([ returns the same kind of container, [[ and $ extract a single element). Vectorize where you can: for loops are not inherently slow, but per-iteration interpreter overhead and growing an object inside the loop (which copies it each time) add up quickly.
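The difference between the subsetting operators can be seen in a short sketch with toy data:

```r
lst <- list(a = 1:3, b = "text")

sub_list <- lst["a"]     # still a list, of length 1
element  <- lst[["a"]]   # the integer vector itself
same     <- lst$a        # shorthand for lst[["a"]]

v <- c(x = 10, y = 20, z = 30)
picked   <- v[c("x", "z")]   # subset by name
dropped  <- v[-1]            # a negative index drops elements
filtered <- v[v > 15]        # logical subsetting
```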
Intermediate
Type system depth: dynamic, but classes drive method dispatch.
- S3: lightweight, name-based: class(x) <- "foo", then print.foo <- function(x, ...) .... Dispatch by UseMethod(). Single dispatch on the first argument.
- S4: formal: setClass("Foo", representation(x = "numeric")), setGeneric, setMethod. Multiple dispatch on argument signatures. Used heavily in Bioconductor.
- R5 (Reference Classes / “RC”): built-in mutable OO with $ method calls; less common.
- R6 (CRAN package): mutable, environment-based, similar to typical class-based OO; widely used (Shiny, plumber).
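A minimal sketch contrasting S3 and S4 from the list above. The class names temperature and Point and the generics describe and norm2 are invented for illustration.

```r
library(methods)

# S3: a class is just an attribute; methods are found by name.
new_temperature <- function(celsius) {
  structure(list(celsius = celsius), class = "temperature")
}
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "something else"
describe.temperature <- function(x, ...) paste(x$celsius, "degrees C")

t1 <- new_temperature(21.5)
s3_out     <- describe(t1)   # dispatches to describe.temperature
s3_default <- describe(42)   # falls back to describe.default

# S4: formal class definition with typed slots and an explicit generic.
setClass("Point", representation(x = "numeric", y = "numeric"))
setGeneric("norm2", function(p) standardGeneric("norm2"))
setMethod("norm2", "Point", function(p) sqrt(p@x^2 + p@y^2))

p <- new("Point", x = 3, y = 4)
s4_out <- norm2(p)
```

S4 checks slot types at construction, so new("Point", x = "a", y = 1) would fail, whereas the S3 constructor accepts anything unless it validates by hand.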
Modules: packages are the unit of modularity. NAMESPACE declares export/importFrom. No file-level imports — everything in R/ is loaded into the package environment.
Error handling: stop(), warning(), message(). tryCatch(expr, error = function(e) ..., warning = ..., finally = ...). Conditions are first-class S3 objects (classed lists such as simpleError); withCallingHandlers() handles them without unwinding the stack, and withRestarts() / invokeRestart() provide restart-style recovery. rlang::abort() / cli::cli_abort() are modern equivalents with rich formatting and structured conditions.
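The two handler styles can be sketched with base R only. safe_log is a hypothetical helper, not a library function.

```r
# Hypothetical helper: returns NA instead of failing on bad input.
safe_log <- function(x) {
  tryCatch(
    {
      if (!is.numeric(x)) stop("x must be numeric")
      log(x)
    },
    error = function(e) {
      message("recovered: ", conditionMessage(e))
      NA_real_
    }
  )
}

ok  <- safe_log(1)
bad <- safe_log("a")   # NA, plus a message on stderr

# A caught error is an ordinary classed object.
err_class <- tryCatch(stop("boom"), error = function(e) class(e))

# withCallingHandlers() runs the handler without unwinding, so the
# computation continues once the warning is muffled.
seen <- character()
res <- withCallingHandlers(
  {
    warning("first warning")
    "finished"
  },
  warning = function(w) {
    seen <<- c(seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

tryCatch() abandons the failing expression and returns the handler's value; withCallingHandlers() lets the expression run to completion.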
Concurrency: R is single-threaded for user code. parallel (base): mclapply (fork on Unix), parLapply (PSOCK clusters everywhere). Modern: future + furrr for transparent parallel map-style; mirai for lightweight async; callr for clean R subprocesses. Real threading inside C/C++ via OpenMP in compiled extensions.
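A minimal sketch of process-based parallelism with the base parallel package, assuming the machine allows two local worker processes. slow_square is a stand-in for real work.

```r
library(parallel)

slow_square <- function(x) {
  Sys.sleep(0.01)   # placeholder for an expensive computation
  x^2
}

cl <- makeCluster(2)   # two PSOCK worker processes
par_res <- tryCatch(
  parLapply(cl, 1:8, slow_square),
  finally = stopCluster(cl)   # always shut the workers down
)
par_res <- unlist(par_res)

# Sequential reference result for comparison.
seq_res <- vapply(1:8, slow_square, numeric(1))
```

Workers are separate R sessions, so anything they need beyond the function and its arguments has to be shipped explicitly (for example with clusterExport()).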
I/O: base read.table/write.table, readRDS/saveRDS (binary R objects), readLines. Modern: readr (tidyverse, fast TSV/CSV), vroom (lazy CSV), data.table::fread (very fast multi-threaded CSV reader), arrow (Parquet/Feather), DBI + dbplyr for databases.
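A base-R round trip through both a binary and a text format, as a sketch. It assumes R 4.0 or later, where read.csv() keeps strings as character.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Binary round trip: preserves types and attributes exactly.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_rds <- readRDS(rds_path)

# Text round trip: column types are re-inferred on read.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_csv <- read.csv(csv_path)

first_line <- readLines(csv_path, n = 1)
unlink(c(rds_path, csv_path))   # clean up the temporary files
```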
Stdlib highlights: stats (lm, glm, t.test, dist, kmeans), graphics + grDevices (base plots, devices), utils (install.packages, head, str), methods (S4), parallel, compiler (cmpfun, enableJIT), tools (package machinery).
Advanced
Memory / GC: generational, non-moving mark-and-sweep. Trigger manually with gc(). Copy-on-modify is implemented via reference counting (since R 4.0; the older NAMED mechanism before that): when an object has a single reference, R modifies it in place. This is why x[i] <- v in a loop can be O(n^2): if anything else still refers to x, or the assignment goes through a method such as [<-.data.frame, each iteration copies the whole object. Pre-allocate, or use data.table (reference semantics).
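Copy-on-modify and the pre-allocation advice can be illustrated with a short sketch (sizes chosen arbitrarily):

```r
a <- c(1, 2, 3)
b <- a        # no copy yet: both names refer to the same vector
b[1] <- 10    # the modification triggers the copy; `a` is untouched

# Pre-allocating lets the loop fill an existing vector.
n <- 1000
squares <- numeric(n)
for (i in seq_len(n)) squares[i] <- i^2

# Growing with c() allocates a new, longer vector on every iteration.
grown <- numeric(0)
for (i in seq_len(n)) grown <- c(grown, i^2)
```

Both loops give the same result; only the amount of allocation differs, which tracemem() or bench::mark() can make visible.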
Concurrency deep dive: there is no shared-memory threading at the R level. The future package abstracts over multisession (PSOCK), multicore (fork), cluster, and remote backends; plan(multisession) makes future_map parallel. mirai uses NNG for fast IPC. For HPC: Rmpi, clustermq, batchtools. GPU: gpuR, torch (R bindings to LibTorch).
FFI:
- .Call and .External interfaces to C with R’s SEXP API.
- Rcpp (Dirk Eddelbuettel) is the dominant C++ bridge: cppFunction("..."), Rcpp::sourceCpp(), RcppArmadillo / RcppEigen for linalg.
- cpp11 is a header-only modern alternative.
- reticulate for Python interop, JuliaCall for Julia, V8 for JavaScript.
Reflection: deep. body(f), formals(f), environment(f) for any function. substitute(), quote(), bquote(), deparse(), eval() give full AST manipulation. sys.call(), match.call(), sys.function() for call introspection.
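The introspection functions above, applied to toy functions. f and who_called exist only for this sketch.

```r
f <- function(x, y = 2) x + y

arg_names <- names(formals(f))   # the argument names
body_text <- deparse(body(f))    # the body as text

expr <- quote(a * b)             # an unevaluated call
op   <- as.character(expr[[1]])  # the function being called
val  <- eval(expr, list(a = 6, b = 7))

n <- 3
built <- bquote(x^.(n))          # splices the current value of n
built_text <- deparse(built)

# match.call() records how the function was actually called.
who_called <- function(...) match.call()
call_text <- deparse(who_called(1, z = 2))
```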
Performance tools: Rprof() + summaryRprof() (sampling profiler), profvis (interactive flame graph, by Posit), bench::mark() (benchmarking with garbage-collection accounting), microbenchmark. compiler::cmpfun(f) byte-compiles a function (mostly automatic since R 3.4). lobstr::obj_size() and tracemem() to detect copies.
God mode
Non-standard evaluation (NSE): R’s killer trick. Functions can capture their unevaluated arguments and decide what they mean.
my_filter <- function(df, cond) {
  cond <- substitute(cond)                 # capture AST instead of evaluating
  rows <- eval(cond, df, parent.frame())   # columns first, then the caller's variables
  df[rows, , drop = FALSE]
}
my_filter(mtcars, mpg > 20 & cyl == 4)
Modern tidy eval (rlang) formalizes this with quosures (enquo, !!, {{ }}):
my_summary <- function(df, var) {
  df |> dplyr::summarise(m = mean({{ var }}), n = dplyr::n())
}
Environments as first-class objects: new.env(), parent.env(), globalenv(), topenv(). Enables closures, mutable state, hash-map use (env$key <- val is O(1)), and the entire NSE stack.
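A sketch of an environment used as a mutable key-value store. The names cache and bump are illustrative.

```r
# An environment used as a hash map.
cache <- new.env()
assign("alpha", 1, envir = cache)
cache$beta <- 2
cache[["gamma"]] <- 3

keys     <- sort(ls(cache))
has_beta <- exists("beta", envir = cache, inherits = FALSE)
beta_val <- get("beta", envir = cache)

# Environments have reference semantics: the callee's change is
# visible to the caller without returning anything.
bump <- function(env) env$beta <- env$beta + 1
bump(cache)
after <- cache$beta
```

Passing inherits = FALSE keeps lookups from falling through to parent environments, which matters when the environment is meant to behave like a plain dictionary.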
S3 dispatch tricks: NextMethod() walks the class vector; class(x) <- c("subclass", "superclass") enables inheritance. UseMethod looks up <generic>.<class> then <generic>.default.
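S3 inheritance through the class vector can be sketched as follows; the speak generic and the dog and animal classes are invented for the example.

```r
speak <- function(x, ...) UseMethod("speak")
speak.default <- function(x, ...) "silence"
speak.animal  <- function(x, ...) "generic animal noise"
speak.dog <- function(x, ...) {
  # Delegate to the next class in the vector, then extend the result.
  paste("Woof, then", NextMethod())
}

rex <- structure(list(), class = c("dog", "animal"))
dog_says   <- speak(rex)   # speak.dog, which calls speak.animal
other_says <- speak(1L)    # no matching class, so speak.default
```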
Package internals: NAMESPACE controls visibility — internal functions aren’t exported but are accessible via pkg:::fun. .onLoad / .onAttach hooks run at load. S4 classes/generics need explicit exportClasses / exportMethods.
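Byte compilation: compiler::cmpfun(f) produces a byte-compiled closure; the byte-code interpreter is roughly 2-5x faster than the AST interpreter on loop-heavy scalar code, with much smaller gains on already-vectorized code. enableJIT(3), the default level since R 3.4, compiles closures on first or second use and top-level loops before they run. Inspect the generated code with compiler::disassemble.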
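Explicit byte compilation looks like this in a minimal sketch. sum_sq is a made-up loop-heavy function, and no timing claims are made here.

```r
library(compiler)

sum_sq <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

sum_sq_bc <- cmpfun(sum_sq)   # explicit byte compilation
plain_result    <- sum_sq(100)
compiled_result <- sum_sq_bc(100)
```

With the JIT on by default, sum_sq itself is normally compiled after a call or two, so an explicit cmpfun() mainly matters for benchmarking or when the JIT has been disabled.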
Rcpp deep magic: // [[Rcpp::export]] attribute creates the SEXP wrapper. Rcpp::sourceCpp() compiles and loads inline. Rcpp::Rcout for printing back to R.
devtools internals: load_all() simulates package install by sourcing R/ into a fresh namespace + binding to package env, without R CMD INSTALL.
Operators are functions: `+`(2, 3) works. Define your own infix: `%plusone%` <- function(a, b) a + b + 1.
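A few lines showing operators used as ordinary functions. The %or% operator is a hypothetical helper defined here, not part of base R.

```r
five    <- `+`(2, 3)                   # same as 2 + 3
squares <- sapply(1:3, `^`, 2)         # operators can be passed around
second  <- `[[`(list("a", "b"), 2)     # even subsetting is a function

# User-defined infix operators must be wrapped in percent signs.
`%plusone%` <- function(a, b) a + b + 1
infix_result <- 2 %plusone% 3

# Hypothetical null-coalescing helper.
`%or%` <- function(a, b) if (is.null(a)) b else a
fallback <- NULL %or% "default"
```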
Idioms & style
- Naming: snake_case is dominant in modern code (tidyverse style); base R uses dot.case (is.numeric, data.frame). Avoid dot.case in new code, since it collides with S3 method naming (generic.class).
- Assignment: <- over = is idiomatic at the top level; = is for arguments. The tidyverse style guide insists on <-.
- Pipe: |> (base, R 4.1+) is the modern default; %>% (magrittr, pipes the lhs into the first argument of the rhs, or wherever . appears) is still common in tidyverse code.
- Formatter: styler (Posit), which auto-formats to tidyverse style. Linter: lintr. Static analysis: lintr + goodpractice.
- Vectorize first: avoid for (i in 1:n) x[i] <- ...; use vectorized ops, vapply/sapply/Map, or purrr::map_*.
- Pre-allocate: result <- vector("list", n) then assigning result[[i]] <- ... is O(n); growing with c() is O(n^2).
- Use seq_along(x), not 1:length(x) (handles empty x).
- vapply over sapply for type-stable code.
- Expert review focus: copy-on-modify pitfalls, NSE/tidy-eval correctness (especially {{ }} vs !!), namespace pollution, S4 method dispatch edge cases, factor coercion bugs, stringsAsFactors legacy assumptions (defaulted to FALSE only since R 4.0).
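Two of the idioms above, seq_along() and vapply(), are easiest to appreciate on an empty input. A short sketch:

```r
empty <- character(0)

bad_index  <- 1:length(empty)    # c(1, 0): a loop over it runs twice
good_index <- seq_along(empty)   # integer(0): a loop over it never runs

# sapply() changes its return type with the input; vapply() does not.
sapply_empty <- sapply(empty, nchar)               # an empty list
vapply_empty <- vapply(empty, nchar, integer(1))   # an empty integer vector

words <- c("alpha", "be", "gam")
word_lengths <- vapply(words, nchar, integer(1), USE.NAMES = FALSE)
```

vapply() also fails fast if any call returns something other than the declared template, which turns silent type drift into an immediate error.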
Ecosystem
- Data wrangling: dplyr, tidyr, data.table (high perf, reference semantics), arrow, dtplyr.
- Plotting: ggplot2 (grammar of graphics), base graphics, lattice, plotly (interactive), htmlwidgets.
- Modeling: stats (base), tidymodels (parsnip, recipes, rsample, yardstick), caret (older), mlr3, glmnet, xgboost, lme4 (mixed models), survival, forecast / fable.
- Bayesian: rstan, brms, cmdstanr, rstanarm.
- Web: shiny (reactive web apps), plumber (REST APIs), httr2 (HTTP client), rvest (scraping).
- Reproducible reports: rmarkdown, Quarto (the modern multilang successor), knitr, bookdown, targets (pipeline orchestration).
- Bioinformatics: Bioconductor (separate repo, 2,300+ packages, S4-heavy).
- Testing: testthat (dominant), tinytest (zero-dep), covr (coverage).
- Docs: roxygen2 (inline docstrings → .Rd), pkgdown (website generator), devtools::check() (CRAN-style lint).
- Notable users: Posit (RStudio), pharma (FDA accepts R submissions), finance, NYT graphics desk, BBC data team, Bioconductor consortium.
Gotchas
- Copy-on-modify in loops creates O(n^2) surprises. data.table and reference classes opt out.
- 1:length(x) explodes when length(x) == 0 (gives c(1, 0)); use seq_along(x).
- sapply is type-unstable: it returns a vector, list, or matrix depending on input. Use vapply or purrr::map_*.
- Partial matching of argument names: mean(x, na.r = TRUE) “works” (matches na.rm), which is a bug magnet. options(warnPartialMatchArgs = TRUE) does not disable it, but makes R warn on every partial match.
- Factors stringify “helpfully” in unexpected places. stringsAsFactors = FALSE is the default since R 4.0 (2020-04) but legacy code assumes TRUE.
- <<- assigns in the nearest enclosing env that has a binding (falling back to the global env if none does), not necessarily global; surprising in nested closures.
- NSE captures: library(dplyr); f <- function(col) df %>% select(col) doesn’t work as you expect: select() sees the symbol col rather than the column the caller passed; you need {{ col }}.
- drop = TRUE default on [: df[, 1] may return a vector instead of a 1-column data.frame. Always pass drop = FALSE defensively, or use tibbles which never drop.
- NA propagation: most arithmetic with NA yields NA; use sum(x, na.rm = TRUE). Comparison: NA == NA is NA, not TRUE; test with is.na().
- Integer overflow: returns NA with a warning rather than an error. .Machine$integer.max is 2^31-1.
- Floating point equality: 0.1 + 0.2 == 0.3 is FALSE. Use isTRUE(all.equal(a, b)) or dplyr::near().
- T and F are not reserved: they’re variables bound to TRUE/FALSE and CAN be reassigned. Always write TRUE/FALSE.
- CRAN policies: examples should run in a few seconds each (the check flags examples over 5s), the whole check should stay under roughly 10 minutes, no writing outside tempdir(), and code that needs the internet must fail gracefully. Slow or network-dependent examples go in \donttest{}, which plain R CMD check runs only with --run-donttest.
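Several of the gotchas above can be reproduced in a few lines of base R (toy values throughout):

```r
na_cmp   <- NA == NA                         # NA, not TRUE
na_check <- is.na(NA)                        # the correct test
na_sum   <- sum(c(1, NA, 3), na.rm = TRUE)

float_eq    <- (0.1 + 0.2) == 0.3                 # exact comparison fails
float_close <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # tolerance-based comparison

# Integer overflow yields NA with a warning (suppressed here).
overflow <- suppressWarnings(.Machine$integer.max + 1L)

df <- data.frame(a = 1:2, b = c("x", "y"))
as_vector <- df[, 1]                 # drops to an integer vector
as_frame  <- df[, 1, drop = FALSE]   # stays a data.frame
```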
Citations
- R Project home: https://www.r-project.org/
- CRAN manuals (Intro, Language Definition, Writing R Extensions, R Internals, Data Import/Export, Installation): https://cran.r-project.org/manuals.html
- R Language Definition: https://cran.r-project.org/doc/manuals/r-release/R-lang.html
- Writing R Extensions (canonical packaging reference): https://cran.r-project.org/doc/manuals/r-release/R-exts.html
- R Internals: https://cran.r-project.org/doc/manuals/r-release/R-ints.html
- Tidyverse style guide: https://style.tidyverse.org/
- Advanced R (Hadley Wickham), 2nd ed.: https://adv-r.hadley.nz/
- R Packages (Wickham & Bryan), 2nd ed.: https://r-pkgs.org/
- rlang tidy eval: https://rlang.r-lib.org/reference/topic-data-mask.html
- Rcpp: https://www.rcpp.org/
- Bioconductor: https://www.bioconductor.org/
- Posit (RStudio): https://posit.co/
- Wikipedia (history, license, OO systems): https://en.wikipedia.org/wiki/R_(programming_language)