Triggered by the RStudio blog article about feather, I did the
one-line install and compared the results on a data frame
of 19 million rows. The first results indeed look promising:
# build the package
> devtools::install_github("wesm/feather/R")
# load an existing data frame (19 million rows with batch job execution results)
> load("batch-12-2015.rda")
# write it in feather format...
> write_feather(dt,"batch-12-2015.feather")
# ... which is not compressed, hence larger on disk
> system("ls -lh batch-12-2015.*")
-rw-r--r-- 1 dirkd staff 813M 7 Apr 11:35 batch-12-2015.feather
-rw-r--r-- 1 dirkd staff 248M 27 Jan 22:42 batch-12-2015.rda
# a few repeat reads on an older MacBook with SSD
> system.time(load("batch-12-2015.rda"))
user system elapsed
8.984 0.332 9.331
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.103 1.094 7.978
> system.time(load("batch-12-2015.rda"))
user system elapsed
9.045 0.352 9.418
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.110 0.658 3.997
> system.time(load("batch-12-2015.rda"))
user system elapsed
9.009 0.356 9.393
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.099 0.711 4.548
So, around half the elapsed time and about a tenth of the user CPU time (uncompressed)!
Of course, these repeat reads are served from the file system cache rather than the
laptop's SSD, but the reduction in wall time is nice for larger-volume loads.
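
For a cold-cache comparison one could flush the file system cache between runs;
a minimal sketch, assuming macOS, where the purge command drops the disk cache:

# flush the file system cache (macOS; needs admin rights and will
# prompt for a password), so the next read hits the SSD, not RAM
> system("sudo purge")
> system.time(load("batch-12-2015.rda"))
> system("sudo purge")
> system.time(dt1 <- read_feather("batch-12-2015.feather"))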
More important, though, is the cross-language support for R, Python,
Scala/Spark and others, which could make feather the obvious exchange
format within a team, or between workflow steps implemented in different
languages.
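
As a quick illustration of the exchange, the same feather file can be opened
from Python; the sketch below drives this from R via the reticulate package
purely for demonstration (reticulate, pandas and a feather-capable backend
such as pyarrow are assumptions, not part of the setup above):

# read the feather file written from R above through Python's pandas;
# reticulate converts the pandas DataFrame back into an R data frame
> library(reticulate)
> pd <- import("pandas")
> df <- pd$read_feather("batch-12-2015.feather")
> dim(df)

In a pure Python step the equivalent one-liner would be
pandas.read_feather("batch-12-2015.feather").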