Theoretical Question: Data.table vs Data.frame with Big Data

I know that I can read in a very large csv file much faster with fread from the data.table library than with read.csv, which reads the file in as a data.frame. However, dplyr can only perform operations on a data.frame.

My questions are:

  1. Why was dplyr built to work with the slower of the two data structures?
  2. When working with big data is it good practice to read in as data.table then convert to data.frame to perform dplyr operations?
  3. Is there another strategy I am missing?


  1. They were developed independently. They served (and continue to serve) different purposes. Also, data.table was very hard to actually program (as opposed to use interactively) in its early days. See here for a detailed comparison.
  2. No. Use data.table::fread to read the file, then use dtplyr to run dplyr verbs efficiently on the resulting data.table objects (see the sketch after this list).
  3. See above.
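
A minimal sketch of that workflow, assuming a recent dtplyr (>= 1.0) with its lazy_dt() interface; the file name and column names are hypothetical:

```r
library(data.table)
library(dtplyr)
library(dplyr)

# Fast read: fread() returns a data.table, not a plain data.frame
dt <- fread("flights.csv")  # hypothetical file

# Wrap the data.table so dplyr verbs are translated to data.table code
# instead of copying the whole object at each step
result <- dt %>%
  lazy_dt() %>%
  filter(dep_delay > 0) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  as_tibble()  # evaluation only happens at this final step
```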

I have used data.table to manipulate data sets of several GB in memory and over 1 billion rows. I have not had the same success with dplyr.

Also, heed the note in the Readme for dtplyr:

dtplyr will always be a bit slower than data.table, because it creates copies of objects rather than mutating in place (that's the dplyr philosophy). Currently, dtplyr is quite a lot slower than bare data.table because the methods aren't quite smart enough. I hope interested dplyr & data.table users from the community will help me to improve the performance.


The tidyverse also includes readr, which provides faster functions for reading text files, e.g. read_csv (https://cran.r-project.org/web/packages/readr/README.html).
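
A short sketch of the readr approach; the file name and column specification are hypothetical:

```r
library(readr)

# read_csv() returns a tibble and is typically much faster than read.csv();
# supplying col_types up front also skips the type-guessing pass
df <- read_csv(
  "flights.csv",                      # hypothetical file
  col_types = cols(
    carrier   = col_character(),
    dep_delay = col_double(),
    .default  = col_guess()
  )
)
```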

You are conflating two issues here: the speed of operations on a data.table vs. a data.frame, and the speed of reading such data in.

I would venture a guess that read.csv and read.table are slow because they do a lot of inefficient type guessing and (by default) try to convert strings to factors. When dealing with large data sets, you should always tell the routine what it is reading rather than letting it guess, for example as shown below.
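
As a sketch, here is how you could declare column classes up front with base read.csv and with data.table::fread; the file and column names are hypothetical:

```r
# Declaring column classes skips type guessing; stringsAsFactors = FALSE
# avoids the old (pre-R 4.0) default conversion of strings to factors
df <- read.csv(
  "big_file.csv",                               # hypothetical file
  colClasses = c(id    = "integer",
                 value = "numeric",
                 label = "character"),
  stringsAsFactors = FALSE
)

# data.table::fread accepts the same style of colClasses argument
dt <- data.table::fread(
  "big_file.csv",
  colClasses = c(id    = "integer",
                 value = "numeric",
                 label = "character")
)
```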

Lastly, if you are dealing with very large data sets, specialized routines implemented with Rcpp or the like may be preferable, assuming your data is sufficiently structured.
