Caching via background R processes

The title of this blog entry should be fairly self-evident for those who might be inclined to read it, yet is motivated by the simple fact that there currently appear to be no online sources that clearly describe the relatively straightforward process of using background processes in R to cache objects. (Check out search engine results for “caching background R processes”: most of the top entries are for Android, and even opting for other search engines does little to help uncover any useful information.) Caching is implemented because it saves time, generally by saving the results of one function call for subsequent reuse. Background processes are also commonly implemented as time-saving measures, through delegating long-running tasks to “somewhere else”, allowing you to keep focussing on whatever (un)important things you were doing in the meantime.

Straightforward caching of the results of single function calls is often achieved through “memoisation”, implemented in several R packages including R.cache, memoise, memo, simpleCache, and simpleRCache, not to mention the extremely useful cache-management package, hoardr. None of these packages offers the ability to perform the caching via a background process, and thus the initial call to a function to-be-cached will have to wait until that function finishes before returning a value.
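
For readers unfamiliar with memoisation, the idea can be sketched in a few lines of base R. This is only an illustration and not how any of the packages above are implemented; the names memoise_simple, slow_square, and fast_square are hypothetical.

```r
# Hypothetical helper: wrap a single-argument function so that each result is
# stored in an environment and reused on later calls with the same argument.
memoise_simple <- function (fn) {
    cache <- new.env (parent = emptyenv ())
    function (x) {
        key <- paste (deparse (x), collapse = "")
        if (!exists (key, envir = cache, inherits = FALSE)) {
            assign (key, fn (x), envir = cache) # first call: compute and store
        }
        get (key, envir = cache, inherits = FALSE)
    }
}

slow_square <- function (x) {
    Sys.sleep (0.5) # stand-in for an expensive computation
    x ^ 2
}
fast_square <- memoise_simple (slow_square)

t1 <- system.time (r1 <- fast_square (4)) [["elapsed"]] # computed in the foreground
t2 <- system.time (r2 <- fast_square (4)) [["elapsed"]] # recalled from the cache
```

The first call still has to wait for slow_square to finish, which is exactly the limitation that background caching is intended to avoid.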

This blog entry describes how to implement caching via background processes. Using a background process to cache an object naturally requires a measure of anticipation that the object to be cached is likely to be useful sometime in the future, as opposed to necessarily needed right now. This is nevertheless a relatively common situation in complex, multi-stage analyses, where the results of one stage generally proceed in a predictable manner to subsequent stages. The typical inputs and outputs of those subsequent stages are the things that can be anticipated, and the results pre-calculated via background processes, and then cached for subsequent and immediate recall. So having briefly described “standard” caching (“foreground” caching, if you like), it’s time to describe background processes in R.

Background processes in R

Background processes are, among other things, the key to the much-used future package. This package seems at first like a barely intelligible miracle of mysterious implementation. What are these “futures”? The host of highly informative vignettes provide a wealth of information on how the users of this package can implement their own “futures”, yet little information on how the futures themselves are implemented. (This is not a criticism; it reflects a reasonably self-justifying design choice, because the average user of this package will be generally satisfied with knowing how to use the package, and won’t necessarily want or need to know how the magic is performed.)

In short: a “future” is just a background process that dumps its results somewhere ready for later recall. What is a background process? Simply another R session running as a separate process. It’s easy to implement in base R. We first need a simple R script, as for example generated by the following code:

my_code <- c ("x <- rnorm (1e6)",
              "y <- x ^ 2",
              "y [x < 0] <- -y [x < 0]",
              "saveRDS (sd (y), file = 'myresult.Rds')")
writeLines (my_code, con = "myfile.R")

That script can be executed as a background process by simply calling Rscript via a system or system2 call, both of which allow wait = FALSE to send the process to the background. (The more recent implementation of system calls via the sys package and its simple exec_background() function also deserves a mention here.) In base R terms, a script can be called from an interactive session via

system2 (command = "Rscript", args = "myfile.R", wait = FALSE)
Sys.sleep (2) # give the background process a moment to write its result
list.files (pattern = "^my")
## [1] "myfile.R"     "myresult.Rds"

The script has been executed as a background process, and the result dumped to the file, “myresult.Rds”. This can then simply be read to retrieve the cached result generated by that background process:

readRDS ("myresult.Rds")
## [1] 1.728436

And that value was calculated in, and cached from, a background process. Simple.

Complications

Where was the above value stored? In the working directory of that R session, of course. This is often neither a practicable nor sensible approach, for example whenever any control over storage locations is desired. These cached values are generally going to be temporary in nature, and the tempdir() of the current R session offers an alternative location, and is in fact the only location acceptable for CRAN packages to write to during package tests. Other common options include a sub-directory of ~/.Rcache, as used for example in the R.cache package. I’ll only consider tempdir() from here on, but doing so will also reveal why the more enduring location of ~/.Rcache is often preferred.

Another complication arises in calling Rscript, by virtue of the claims in “Writing R Extensions” – the official CRAN guide to R packages – that one should,

… not invoke R by plain R, Rscript or (on Windows) Rterm in your examples, tests, vignettes, makefiles or other scripts. As pointed out in several places earlier in this manual, use something like “$(R_HOME)/bin/Rscript” or “$(R_HOME)/bin$(R_ARCH_BIN)/Rterm”

That comment is not very helpful because the “several places” alluded to are in different contexts, and are also only examples rather than actual guidelines. The problem is that those suggestions will usually, but not always, work, depending on operating system idiosyncrasies. So calling Rscript directly is less straightforward than it might seem.
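
One commonly used way to follow the spirit of that advice from within R itself is to build the path to Rscript from R.home(). The following is a minimal sketch under that assumption, and is not guaranteed to work on every system; the file names are placeholders, and the script here receives its output path as a command-line argument rather than having it hard-coded.

```r
# Path to the Rscript executable belonging to the currently running R
rscript <- file.path (R.home ("bin"), "Rscript")

# Placeholder locations for the script and its result
script <- file.path (tempdir (), "myfile.R")
result <- file.path (tempdir (), "myresult.Rds")

# The script reads its output path from the command line
my_code <- c ("args <- commandArgs (trailingOnly = TRUE)",
              "saveRDS (sd (rnorm (1e6)), file = args [1])")
writeLines (my_code, con = script)

# wait = TRUE here only to keep the example deterministic
status <- system2 (command = rscript,
                   args = shQuote (c (script, result)),
                   wait = TRUE)
res <- readRDS (result)
```

Note that system2 quotes the command itself, so a path containing spaces should not need any extra handling, while the arguments are quoted explicitly via shQuote.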

A further problem arises in that both system and system2 will generally return values of 0 when everything works okay. “Works” then means that the process has been successfully started. But where is that process in relation to the current R session? And, likely most importantly, has that process finished or is it still operating? While it is possible to use further system calls to determine the process identifier (PID), doing so is itself fraught and perilous. There are further complications which arise through directly calling background R processes via Rscript, but those should suffice to argue for the fabulous alternative available thanks to Gábor Csárdi and …
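
To make the “has it finished?” problem concrete, here is one rough base R workaround: have the script write to a temporary name and rename the file only once writing is complete, then poll for the final name from the main session. The helper wait_for_file and all file names are hypothetical, and this is a sketch of the general idea rather than a robust solution.

```r
rscript <- file.path (R.home ("bin"), "Rscript")
script <- file.path (tempdir (), "bg_script.R")
result <- file.path (tempdir (), "bg_result.Rds")

# The script saves to "<result>.tmp" first, then renames, so that the final
# file name should only appear once the result has been fully written.
bg_code <- c ("args <- commandArgs (trailingOnly = TRUE)",
              "tmp <- paste0 (args [1], '.tmp')",
              "saveRDS (sd (rnorm (1e6)), file = tmp)",
              "file.rename (tmp, args [1])")
writeLines (bg_code, con = script)
system2 (command = rscript, args = shQuote (c (script, result)), wait = FALSE)

# Hypothetical helper: poll until the file exists or `timeout` seconds elapse;
# returns TRUE if the file was found.
wait_for_file <- function (path, timeout = 30, interval = 0.1) {
    deadline <- Sys.time () + timeout
    while (!file.exists (path) && Sys.time () < deadline) {
        Sys.sleep (interval)
    }
    file.exists (path)
}

if (wait_for_file (result)) {
    res <- readRDS (result)
}
```

Even this small example needs a naming convention, a polling loop, and a timeout, and it still says nothing about whether the background process failed.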

The processx package

The processx package states simply that it provides,

“Tools to run system processes in the background”

This package is designed to run any available system process, including ones that potentially have nothing to do with R let alone a current R session. Using processx to run background R processes thus requires calling Rscript, with the associated problems described above. Fortunately for us, Gábor foresaw this need and created the “companion” package, callr, to simply

“Call R from R”

callr relies directly on processx, but provides the far simpler function, r_bg, to

“Evaluate an expression in another R session, in the background”

So r_bg provides the perfect tool for our needs. This function directly evaluates R code, without needing to render it to text as we did above in order to write it to an external script file. An r_bg version of the above would look like this:

f <- function () {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    saveRDS (sd (y), file = "myresult.Rds")
}
callr::r_bg (f)
## PROCESS 'R', running, pid 3494.

We immediately see that r_bg returns a handle to the process itself, along with the single piece of critical diagnostic information: Whether the process is still running or not:

px <- callr::r_bg (f)
px
## PROCESS 'R', running, pid 3502.
Sys.sleep (1)
px
## PROCESS 'R', finished.

Multiple processes can be generated and queried this way. The package is designed around, and returns, R6 class objects, enabling function calls on the objects, notably including the following:

px <- callr::r_bg (f)
px
## PROCESS 'R', running, pid 3524.
while (px$is_alive())
    px$wait ()
px
## PROCESS 'R', finished.

The px$is_alive() and px$wait() functions are all that is needed to wait until a background process is finished. In the context of using background processes to cache objects, these lines enable the primary R session to simply wait until caching is finished before retrieving the object.
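
The same handle offers a few other methods that may be useful here. The sketch below assumes the wait() method's timeout argument (in milliseconds) and the get_result() method of callr's background processes, and uses a hypothetical function f_value which returns its value instead of saving it to a file.

```r
# Hypothetical variant of the function above, returning rather than saving
f_value <- function () {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    sd (y)
}

px <- callr::r_bg (f_value)
px$wait (timeout = 10000) # block for at most 10 seconds (10000 milliseconds)

if (px$is_alive ()) {
    px$kill () # still running after the timeout, so give up
    res <- NULL
} else {
    res <- px$get_result () # value returned by f_value in the background session
}
```

Retrieving the value this way avoids an intermediate file, but the result then only lives as long as the process handle, so it is arguably less of a cache than the file-based approach described here.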

processx, callr, and caching

There is only one remaining issue with the above code: Where is “myresult.Rds” in the following code?

f <- function () {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    saveRDS (sd (y), file = file.path (tempdir (), "myresult.Rds"))
}
px <- callr::r_bg (f)

It’s in tempdir(), but not the tempdir() of the current process. Where is this other tempdir()? It’s temporary of course, so has been dutifully cleaned up, thereby removing our desired result. What is needed is a way to store the result in the tempdir() of the current – active – R session. This tempdir() is merely specified as a character string, which we can pass directly to our function:

f <- function (temp_dir) {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    saveRDS (sd (y), file = file.path (temp_dir, "mynewresult.Rds"))
}

We then only need to note that the second parameter of r_bg is args, which is,

“Arguments to pass to the function. Must be a list.”

That is then all we need, so let it run …

px <- callr::r_bg (f, list (tempdir ()))
while (px$is_alive())
    px$wait ()
list.files (tempdir (), pattern = "^my")
## [1] "mynewresult.Rds"

And there is our new result, along with all we need to understand how to cache objects via background R processes.
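
The reason for passing tempdir() explicitly can also be checked directly. This small sketch uses callr::r(), the synchronous counterpart of r_bg(), to ask a separate R session for its own tempdir(); the variable names are arbitrary.

```r
main_tmp <- tempdir ()                     # tempdir() of the current session
bg_tmp <- callr::r (function () tempdir ()) # tempdir() as seen by another session

same_dir <- identical (main_tmp, bg_tmp)   # are the two locations the same?
still_there <- dir.exists (bg_tmp)         # does the other one still exist?
print (c (same_dir = same_dir, still_there = still_there))
```

The two paths should differ, and the second session's directory would normally be removed once that session exits, which is why a file saved there is no longer available afterwards.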

Summary

  1. Define a function to generate the object to be cached, and include a tempdir() parameter if that is to be used as the cache location.
  2. Use callr::r_bg() to call that function in the background and deliver the result to the desired location.
  3. Examine the handle of the process returned by r_bg() to determine whether it has finished or not.
  4. … use the cached result.
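
Put together, the four steps above might look something like the following sketch. The function name make_object and the file name "mycache.Rds" are placeholders, and the exit-status check via get_exit_status() is an optional extra not used earlier in this post.

```r
# 1. A function generating the object, with the cache location as a parameter
make_object <- function (cache_dir) {
    x <- rnorm (1e6)
    y <- x ^ 2
    y [x < 0] <- -y [x < 0]
    saveRDS (sd (y), file = file.path (cache_dir, "mycache.Rds"))
}

# 2. Call it in the background, passing the tempdir() of the current session
px <- callr::r_bg (make_object, args = list (cache_dir = tempdir ()))

# ... get on with other things in the meantime ...

# 3. Examine the process handle to see whether it has finished
px$wait () # blocks only for as long as the process is still running
finished_ok <- !px$is_alive () && px$get_exit_status () == 0

# 4. Use the cached result
if (finished_ok) {
    cached <- readRDS (file.path (tempdir (), "mycache.Rds"))
}
```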

Originally posted: 06 Jun 19

Copyright © 2019--22 mark padgham