The title of this blog entry should be fairly self-evident for those who might be inclined to read it, yet is motivated by the simple fact that there currently appear to be no online sources that clearly describe the relatively straightforward process of using background processes in R to cache objects. (Check out search engine results for “caching background R processes”: most of the top entries are for Android, and even opting for other search engines does little to help uncover any useful information.) Caching is implemented because it saves time, generally by saving the results of one function call for subsequent reuse. Background processes are also commonly implemented as time-saving measures, through delegating long-running tasks to “somewhere else”, allowing you to keep focussing on whatever (un)important things you were doing in the meantime.
Straightforward caching of the results of single function calls is often achieved through “memoisation”, implemented in several R packages including R.cache, memoise, memo, simpleCache, and simpleRCache, not to mention the extremely useful cache-management package, hoardr. None of these packages offers the ability to perform the caching via a background process, and thus the initial call to a function to-be-cached will have to wait until that function finishes before returning a value.
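To make the idea of “foreground” memoisation concrete, here is a minimal base-R sketch. The name memoise_simple is hypothetical and is not taken from any of the packages listed above, which are all considerably more robust. It only illustrates that the first call must wait for the wrapped function to finish.

```r
# Minimal memoisation sketch: wrap a function so that repeated calls with the
# same arguments return a stored result instead of recomputing it.
# `memoise_simple` is a hypothetical name, not a function from any package.
memoise_simple <- function (fn) {
    cache <- new.env (parent = emptyenv ())
    function (...) {
        # Build a lookup key from the deparsed arguments
        key <- paste (deparse (list (...)), collapse = "")
        if (!exists (key, envir = cache, inherits = FALSE)) {
            assign (key, fn (...), envir = cache)
        }
        get (key, envir = cache, inherits = FALSE)
    }
}

slow_square <- function (x) {
    Sys.sleep (0.5) # stand-in for an expensive computation
    x ^ 2
}
fast_square <- memoise_simple (slow_square)

first <- fast_square (4) # first call waits for `slow_square` to finish
second <- fast_square (4) # second call returns the stored value
```

The first call still blocks for the full half second, which is exactly the limitation that background caching is intended to remove.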
This blog entry describes how to implement caching via background processes. Using a background process to cache an object naturally requires a measure of anticipation that the object to be cached is likely to be useful sometime in the future, as opposed to necessarily needed right now. This is nevertheless a relatively common situation in complex, multi-stage analyses, where the results of one stage generally proceed in a predictable manner to subsequent stages. The typical inputs and outputs of those subsequent stages are the things that can be anticipated, and the results pre-calculated via background processes, and then cached for subsequent and immediate recall. So having briefly described “standard” caching (“foreground” caching, if you like), it’s time to describe background processes in R.
Background processes are, among other things, the key to the much-used future package. This package seems at first like a barely intelligible miracle of mysterious implementation. What are these “futures”? The host of highly informative vignettes provide a wealth of information on how the users of this package can implement their own “futures”, yet little information on how the futures themselves are implemented. (This is not a criticism; it reflects a reasonably self-justifying design choice, because the average user of this package will be generally satisfied with knowing how to use the package, and won’t necessarily want or need to know how the magic is performed.)
In short: a “future” is just a background process that dumps its results somewhere ready for later recall. What is a background process? Simply another R session running as a separate process. It’s easy to implement in base R. We first need a simple R script, as for example generated by the following code:
my_code <- c ("x <- rnorm (1e6)",
"y <- x ^ 2",
"y [x < 0] <- -y [x < 0]",
"saveRDS (sd (y), file = 'myresult.Rds')")
writeLines (my_code, con = "myfile.R")
That script can be executed as a background process by simply calling Rscript via a system or system2 call, both of which accept wait = FALSE to send the process to the background. (The more recent implementation of system calls via the sys package and its simple exec_background() function also deserves a mention here.) In base R terms, a script can be called from an interactive session via:
system2 (command = "Rscript", args = "myfile.R", wait = FALSE)
list.files (pattern = "^my")
## [1] "myfile.R" "myresult.Rds"
The script has been executed as a background process, and the result dumped to the file, “myresult.Rds”. This can then simply be read to retrieve the cached result generated by that background process:
readRDS ("myresult.Rds")
## [1] 1.728436
And that value was calculated in, and cached from, a background process. Simple.
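One caveat with wait = FALSE is that the call returns immediately, so the result file may not yet exist when the next line runs. The following is a minimal sketch of a polling helper in base R. The name wait_for_file and its timeout and interval parameters are hypothetical, introduced here purely for illustration.

```r
# Hypothetical helper: poll until `path` exists, or give up after `timeout`
# seconds. Returns TRUE if the file appeared in time, FALSE otherwise.
wait_for_file <- function (path, timeout = 30, interval = 0.1) {
    start <- Sys.time ()
    while (!file.exists (path)) {
        elapsed <- as.numeric (difftime (Sys.time (), start, units = "secs"))
        if (elapsed > timeout) {
            return (FALSE)
        }
        Sys.sleep (interval)
    }
    TRUE
}

# Usage with the script above (assumes "myfile.R" has been written and that
# `Rscript` can be found on the system path):
# system2 (command = "Rscript", args = "myfile.R", wait = FALSE)
# if (wait_for_file ("myresult.Rds")) readRDS ("myresult.Rds")
```

Note that a file can exist before it has been completely written, so this check is only a rough guard, and it says nothing about whether the background process itself succeeded.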
Where was the above value stored? In the working directory of that R session, of course. This is often neither a practicable nor a sensible approach, for example whenever any control over storage locations is desired. These cached values are generally going to be temporary in nature, and the tempdir() of the current R session offers an alternative location, and is in fact the only location acceptable for CRAN packages to write to during package tests. Other common options include a sub-directory of ~/.Rcache, as used for example in the R.cache package. I’ll only consider tempdir() from here on, but doing so will also reveal why the more enduring location of ~/.Rcache is often preferred.
Another complication arises in calling Rscript, by virtue of the claims in “Writing R Extensions” – the official CRAN guide to R packages – that one should,
… not invoke R by plain R, Rscript or (on Windows) Rterm in your examples, tests, vignettes, makefiles or other scripts. As pointed out in several places earlier in this manual, use something like “$(R_HOME)/bin/Rscript” or “$(R_HOME)/bin$(R_ARCH_BIN)/Rterm”
That comment is not very helpful because the “several places” alluded to are in different contexts, and are also only examples rather than actual guidelines. The problem is that those suggestions will usually, but not always, work, depending on operating system idiosyncrasies. So calling Rscript directly is less straightforward than it might seem.
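From within a running R session, one common way to approximate that advice is to build the path from R.home ("bin"), which points to the bin directory of the R installation currently in use. The sketch below is subject to the same caveats: it should work on typical installations, but is not guaranteed on every operating system. The script and output file names are temporary placeholders, and wait = TRUE is used only so that the result can be checked straight away.

```r
# Build the path to the Rscript belonging to the R installation that is
# currently running, rather than whichever `Rscript` is first on the PATH.
rscript <- file.path (R.home ("bin"), "Rscript")

# Temporary placeholder files for the script and its output
script <- tempfile (fileext = ".R")
outfile <- tempfile (fileext = ".Rds")

# The script reads the output location from its first command-line argument
writeLines (c ("out <- commandArgs (trailingOnly = TRUE) [1]",
               "saveRDS (sd (rnorm (1e3)), file = out)"),
            con = script)

# wait = TRUE here purely so that the exit status can be inspected
status <- system2 (command = rscript,
                   args = c (shQuote (script), shQuote (outfile)),
                   wait = TRUE)
```

An exit status of 0 indicates that the script ran to completion, after which readRDS (outfile) retrieves the value.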
A further problem arises in that both system and system2 will generally return values of 0 when everything works okay. “Works” then means that the process has been successfully started. But where is that process in relation to the current R session? And likely most importantly, has that process finished or is it still operating? While it is possible to use further system calls to determine the process identifier (PID), that process itself is fraught and perilous. There are further complications which arise through directly calling background R processes via Rscript, but those should suffice to argue for the fabulous alternative available thanks to Gábor Csárdi and …
The processx package states simply that it provides,
“Tools to run system processes in the background”
This package is designed to run any available system process, including ones that potentially have nothing to do with R, let alone a current R session. Using processx to run background R processes thus requires calling Rscript, with the associated problems described above. Fortunately for us, Gábor foresaw this need and created the “companion” package, callr, to simply
“Call R from R”
callr relies directly on processx, but provides the far simpler function, r_bg, to
“Evaluate an expression in another R session, in the background”
So r_bg provides the perfect tool for our needs. This function directly evaluates R code, without needing to render it to text as we did above in order to write it to an external script file. An r_bg version of the above would look like this:
f <- function () {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    saveRDS (sd (y), file = "myresult.Rds")
}
callr::r_bg (f)
## PROCESS 'R', running, pid 3494.
We immediately see that r_bg returns a handle to the process itself, along with the single piece of critical diagnostic information: whether the process is still running or not.
px <- callr::r_bg (f)
px
## PROCESS 'R', running, pid 3502.
Sys.sleep (1)
px
## PROCESS 'R', finished.
Multiple processes can be generated and queried this way. The package is designed around, and returns, R6 class objects, enabling function calls on the objects, notably including the following:
px <- callr::r_bg (f)
px
## PROCESS 'R', running, pid 3524.
while (px$is_alive ())
    px$wait ()
px
## PROCESS 'R', finished.
The px$is_alive() and px$wait() methods are all that is needed to wait until a background process is finished. In the context of using background processes to cache objects, these lines enable the primary R session to simply wait until caching is finished before retrieving the object.
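Waiting indefinitely is not always desirable. As a minimal sketch, and assuming the callr package is installed, the wait method also accepts a timeout in milliseconds, after which the primary session can decide whether to keep waiting or give up. The ten-second limit here is an arbitrary choice for illustration.

```r
f <- function () {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    saveRDS (sd (y), file = "myresult.Rds")
}
px <- callr::r_bg (f)

# Wait at most 10 seconds; the `timeout` argument is in milliseconds
px$wait (timeout = 10000)

if (px$is_alive ()) {
    # Still running after the timeout: stop it and carry on without the cache
    px$kill ()
    result <- NULL
} else {
    result <- readRDS ("myresult.Rds")
}
```

Whether to kill the process or simply check again later depends on how badly the cached object is needed.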
There is only one remaining issue with the above code: Where is “myresult.Rds” in the following code?
f <- function () {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    saveRDS (sd (y), file = file.path (tempdir (), "myresult.Rds"))
}
px <- callr::r_bg (f)
It’s in tempdir(), but not the tempdir() of the current process. Where is this other tempdir()? It’s temporary of course, so has been dutifully cleaned up, thereby removing our desired result. What is needed is a way to store the result in the tempdir() of the current (active) R session. This tempdir() is merely specified as a character string, which we can pass directly to our function:
f <- function (temp_dir) {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    saveRDS (sd (y), file = file.path (temp_dir, "mynewresult.Rds"))
}
We then only need to note that the second parameter of r_bg is args, which is,
“Arguments to pass to the function. Must be a list.”
That is then all we need, so let it run …
px <- callr::r_bg (f, list (tempdir ()))
while (px$is_alive ())
    px$wait ()
list.files (tempdir (), pattern = "^my")
## [1] "mynewresult.Rds"
And there is our new result, along with all we need to understand how to cache objects via background R processes.
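The pieces above can be gathered into a pair of small helpers, sketched here on the assumption that the callr package is installed. The names start_cache and read_cache are hypothetical placeholders rather than functions from any package. The first starts the calculation in the background and the second waits, if necessary, before reading the cached file.

```r
# Hypothetical helpers illustrating the overall pattern; the function and
# file names are placeholders rather than part of any package.
cache_file <- file.path (tempdir (), "mynewresult.Rds")

# Start the background calculation, returning the process handle
start_cache <- function (path) {
    f <- function (path) {
        x <- rnorm (1e6)
        y <- x ^ 2
        y [x < 0] <- -y [x < 0]
        saveRDS (sd (y), file = path)
    }
    callr::r_bg (f, args = list (path))
}

# Retrieve the cached value, waiting for the background process if needed
read_cache <- function (px, path) {
    px$wait ()
    if (!file.exists (path)) {
        stop ("background process finished without writing ", path)
    }
    readRDS (path)
}

px <- start_cache (cache_file) # returns immediately
# ... carry on with other work in the foreground ...
res <- read_cache (px, cache_file)
```

Passing the full file path from the active session, rather than calling tempdir() inside the function, ensures the result lands in the temporary directory of the session that will read it.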
1. Define a function to generate the object to be cached, including a tempdir() parameter if that is to be used as the cache location.
2. Use callr::r_bg() to call that function in the background and deliver the result to the desired location.
3. Examine the process handle returned by r_bg() to determine whether it has finished or not.
Copyright © 2019--22 mark padgham