“R is a shockingly dreadful language for an exceptionally useful data
analysis environment” (
Tim Smith & Kevin Ushey ). One of the strangest manifestations of
claims like these is that,
“Everything in R is a vector” . The simple question that then arises
is, What is a matrix? One commonly cited current repository of things R
is Hadley Wickham’s book,
Advanced R , which has a section on
Data Structures which simply states that a matrix is the two
dimensional equivalent of a vector, and that,
“Adding a dim
attribute to an atomic vector allows it to behave like a
multi-dimensional array.” The chapter linked to above goes on to
say that, “Vectors are not the only 1-dimensional data structure. You
can have matrices with a single row or single column, or arrays with a
single dimension. They may print similarly, but will behave differently.
The differences aren’t too important.” This blog entry will attempt to
illustrate the kind of circumstances under which differences between
vectors and matrices actually become quite important indeed.
Vectors do differ from matrices, as the following code clearly illustrates:
n <- 1e6
x <- runif (n)
y <- runif (n)
xy <- cbind (x, y) # a matrix
rbenchmark::benchmark (
res <- x + y,
res <- rowSums (xy),
replications = 100,
order = NULL) [, 1:4]
test | replications | elapsed | relative |
---|---|---|---|
res <- x + y | 100 | 0.245 | 1.000 |
res <- rowSums(xy) | 100 | 0.912 | 3.722 |
Adding the two rows of a matrix takes 3-4 times longer than adding two
otherwise equivalent vectors. And okay, that’s very likely something to
do with the rowSums
function rather than the matrix itself, but why
should these two behave so differently? At that point, I must freely
admit to being not sufficiently clever to have uncovered the actual
reason in the
underlying C source code. The answer must lie somewhere in there,
so any pointers would be greatly appreciated. Short of that, the
following is a phenomenological explanation, derived through attempting
to reconstruct in C code what rowSums
is actually doing.
Direct vector addition must work something like the following C code, written here in a form able to be directly parsed in R via the inline package .
library (inline)
add <- cfunction(c(a = "numeric", b = "numeric"), "
int n = length (a);
SEXP result = PROTECT (allocVector (REALSXP, n));
double *ra, *rb, *rout;
ra = REAL (a);
rb = REAL (b);
rout = REAL (result);
for (int i = 0; i < n; i++)
rout [i] = ra [i] + rb [i];
UNPROTECT (1);
return result;
")
That’s a simple C function to add two vectors and return the result, with most of the code providing the necessary scaffolding for an R function. The following benchmark compares that with the previous two equivalent functions.
rbenchmark::benchmark (
res <- x + y,
res <- add (x, y),
res <- rowSums (xy),
replications = 100,
order = NULL) [, 1:4]
test | replications | elapsed | relative |
---|---|---|---|
res <- x + y | 100 | 0.203 | 1.000 |
res <- add(x, y) | 100 | 0.217 | 1.069 |
res <- rowSums(xy) | 100 | 0.931 | 4.586 |
So our add
function is broadly equivalent to R’s underlying code for
vector addition, and correspondingly, considerably more efficient than
rowSums
applied to an equivalent matrix. This naturally fosters the
question of whether the inefficiency arises in rowSums
itself, or
whether it is somehow something inherent to R’s internal representation
of matrices and/or matrix operations? The following code provides an
initial answer to that quesiton.
rbenchmark::benchmark (
res <- x + y,
res <- add (x, y),
res <- rowSums (xy),
res <- xy [, 1] + xy [, 2],
replications = 100,
order = NULL) [, 1:4]
test | replications | elapsed | relative |
---|---|---|---|
res <- x + y | 100 | 0.208 | 1.000 |
res <- add(x, y) | 100 | 0.250 | 1.202 |
res <- rowSums(xy) | 100 | 0.840 | 4.038 |
res <- xy[, 1] + xy[, 2] | 100 | 0.742 | 3.567 |
And direct addition of two columns of a matrix, through indexing into
those columns, is roughly as inefficient as rowSums
itself, while
direct addition of the equivalent vectors remains 3-4 times more
efficient.
So the reason for the relative inefficiency of rowSums
is likely to
extend directly from the column selection operation, xy[, i]
. The
reference manual for the C-level details of data storage and
sub-selection in R is the online compendium,
R Internals , yet even this has remarkably little to say in regard
to how matrices are actually stored or manipulated. The key is a single
incidental statement that,
“Matrices are stored as vectors” . The storage can then be
understood through reading the details of vector storage, and then
simply figuring out how the indexing of a matrix-as-vector is
implemented. This can be easily discerned from direct conversion within
R:
as.vector (cbind (1:5, 6:10))
## [1] 1 2 3 4 5 6 7 8 9 10
The columns of a matrix are directly concatenated within the vector
object. This enables us to then re-write the above C code for vector
addition to instead accept a matrix object, noting that the indices i
and n + i
respectively refer to the first and second columns of the
matrix.
matadd <- cfunction(c(a = "numeric"), "
int n = floor (length (a) / 2);
SEXP result = PROTECT (allocVector (REALSXP, n));
double *ra, *rout;
ra = REAL (a);
rout = REAL (result);
for (int i = 0; i < n; i++)
rout [i] = ra [i] + ra [n + i];
UNPROTECT (1);
return result;
")
Benchmarking that against the previous versions, and including an additional comparison of direct matrix addition, gives the following results.
rbenchmark::benchmark (
res <- x + y,
res <- add (x, y),
res <- rowSums (xy),
res <- xy [, 1] + xy [, 2],
res <- matadd (xy),
res <- xy + xy,
replications = 100,
order = NULL) [, 1:4]
test | replications | elapsed | relative |
---|---|---|---|
res <- x + y | 100 | 0.214 | 1.000 |
res <- add(x, y) | 100 | 0.215 | 1.005 |
res <- rowSums(xy) | 100 | 0.824 | 3.850 |
res <- xy[, 1] + xy[, 2] | 100 | 0.767 | 3.584 |
res <- matadd(xy) | 100 | 0.219 | 1.023 |
res <- xy + xy | 100 | 0.407 | 1.902 |
That benchmark demonstrates that operations on matrix columns are only
as efficient as equivalent operations on vectors when the matrices are
treated as singular vector objects. Direct addition of entire matrices
(xy + xy
) is also as efficient as vector addition, taking here around
twice as long because twice as many values are being added.
Inefficiencies arise in handling matrices only when extracting
individual rows or columns – the xy[, i]
operations, presumably
because these operations involve creating an additional copy of the
entire row or column.
What the above code was intended to demonstrate was that matrices should only be considered to be like vectors in the sense of operations on the entire objects. Sub-setting or sub-selecting of matrices involves creating additional copies of the sub-set/sub-selected portions, and is comparably less efficient than equivalent vector operations. In particular, efficient C or C++ operations on matrices should index directly into the underlying vector object, rather than sub-setting particular rows or columns of the matrices.
The assertion that everything in R is a vector hereby deepens: Even matrices in R are vectors, and should in many circumstances be treated as such.
Copyright © 2019--22 mark padgham