Triggered by the RStudio blog article about feather, I did the
one-line install and compared the results on a data frame
of 19 million rows. The first results indeed look promising:
# build the package
> devtools::install_github("wesm/feather/R")
# load an existing data frame (19 million rows with batch job execution results)
> load("batch-12-2015.rda")
# write it in feather format...
> write_feather(dt,"batch-12-2015.feather")
# ... which is not compressed, hence larger on disk
> system("ls -lh batch-12-2015.*")
-rw-r--r-- 1 dirkd staff 813M 7 Apr 11:35 batch-12-2015.feather
-rw-r--r-- 1 dirkd staff 248M 27 Jan 22:42 batch-12-2015.rda
# a few repeat reads on an older MacBook with SSD
> system.time(load("batch-12-2015.rda"))
user system elapsed
8.984 0.332 9.331
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.103 1.094 7.978
> system.time(load("batch-12-2015.rda"))
user system elapsed
9.045 0.352 9.418
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.110 0.658 3.997
> system.time(load("batch-12-2015.rda"))
user system elapsed
9.009 0.356 9.393
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.099 0.711 4.548
So, around half the elapsed time and about a tenth of the user CPU time (uncompressed)!
Of course, these repeat reads are served from the file system cache rather than the
laptop's SSD, but the reduction in wall time is nice for larger-volume loads.
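
For a cold-cache comparison one could flush the file system cache between runs;
a minimal sketch, assuming macOS, where the purge command drops the disk cache:

# flush the file system cache (macOS; needs admin rights and will
# prompt for a password), so the next read hits the SSD, not RAM
> system("sudo purge")
> system.time(load("batch-12-2015.rda"))
> system("sudo purge")
> system.time(dt1 <- read_feather("batch-12-2015.feather"))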
More important, though, is the cross-language support for R, Python,
Scala/Spark and others, which could make feather the obvious exchange
format within a team, or between workflow steps implemented in different
languages.
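
As a quick illustration of the exchange, the same feather file can be opened
from Python; the sketch below drives this from R via the reticulate package
purely for demonstration (reticulate, pandas and a feather-capable backend
such as pyarrow are assumptions, not part of the setup above):

# read the feather file written from R above through Python's pandas;
# reticulate converts the pandas DataFrame back into an R data frame
> library(reticulate)
> pd <- import("pandas")
> df <- pd$read_feather("batch-12-2015.feather")
> dim(df)

In a pure Python step the equivalent one-liner would be
pandas.read_feather("batch-12-2015.feather").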