Faster alternatives to sparse.model.matrix?

I have a large dataset that is entirely categorical. I want to train an xgboost model on it, so I must first convert the categorical data to a numeric representation. So far I have been using sparse.model.matrix() from the Matrix package, but it is far too slow. I found a great solution here; however, the sparse matrix it returns is not the same as the one sparse.model.matrix returns. I know I can force sparse.model.matrix to return the same output as the solution in the link (by providing contrasts), but that is not an effective fix: it is still too slow, and it changes the representation (and hence the trained model).

Is there a way to do what sparse.model.matrix does, but nearly as fast as the linked solution? On my data, the linked solution runs in about 15% of the time.
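For reference, the two sparse.model.matrix encodings mentioned above can be sketched as follows. This is only an illustration on a hypothetical toy data frame (the column names color and size are made up): the first call uses the default treatment contrasts, and the second passes identity contrasts through contrasts.arg so that every level of every factor gets its own column.

```r
library(Matrix)

# Hypothetical toy data: two categorical columns
df <- data.frame(
  color = factor(c("red", "green", "blue", "green")),
  size  = factor(c("S", "M", "S", "L"))
)

# Default encoding: intercept plus treatment contrasts,
# so the first level of each factor is dropped
m_default <- sparse.model.matrix(~ ., data = df)

# Full one-hot encoding: no intercept, and identity "contrasts"
# so that all levels of every factor are kept
m_onehot <- sparse.model.matrix(
  ~ . - 1,
  data = df,
  contrasts.arg = lapply(df, contrasts, contrasts = FALSE)
)

dim(m_default)
dim(m_onehot)
```

The two matrices have different numbers of columns, which is why a model trained on one is not interchangeable with a model trained on the other.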

Topic representation r categorical-data



Try the function below, but remove the LHS (response) variables from the dataset beforehand. We use it for the data-processing step when putting our xgboost models into production.

library(Matrix)  # provides sparseMatrix()

fspmatrix <- function(df) {
  # Build the sparse matrix directly with sparseMatrix() to save time vs sparse.model.matrix()

  # Details about the df to help build triplet
  nr <- nrow(df)
  nc <- ncol(df)
  nlevels <- pmax(1, sapply(df, nlevels))
  fac <- which(sapply(df, is.factor))
  # Check that all factors have at least 2 levels
  stopifnot(min(nlevels[fac]) > 1)

  # Building the value vector removing zero positions and first levels of factors after the first factor
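  # Convert column by column: a plain unlist(df) on an all-factor data frame
  # merges the level sets into one factor and changes the integer codes
  x <- as.numeric(unlist(lapply(df, as.numeric), use.names = FALSE))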
  jx <- x # for use in building j
  # Positions of values coming from factors
  pos <- as.vector(sapply(fac * nr, function(x) x - nr + (1:nr)))
  # Positions of first levels of factors after the first factor
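  pos1 <- pos[-1:-nr]
  pos1 <- pos1[which(x[pos1] == 1)]
  # x[-integer(0)] would select nothing, so when there is nothing to drop
  # use an out-of-range index, which negative indexing silently ignores
  if (length(pos1) == 0) pos1 <- length(x) + 1L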
  # Replacing the values of factors with 1 for one hot encoding
  x[pos] <- 1
  # Positions of non-zero values
  nzpos <- which(x[-pos1] != 0)
  # Values to build triplet around
  x <- x[-pos1][nzpos]

  # Building the i vector, the first dim part of the triplet
  i <- rep(seq_len(nr), nc)[-pos1][nzpos]

  # Building the j vector, the second dim part of the triplet
  jnlevels <- nlevels
  jnlevels[fac[-1]] <- jnlevels[fac[-1]] - 1
  jx[-pos] <- 1
  jx[pos[-1:-nr]] <- jx[pos[-1:-nr]] - 1
  cs <- cumsum(c(0, head(jnlevels, -1)))
  j <- jx + rep(cs, each = nr)
  j <- j[-pos1][nzpos]

  # Building the dimnames vector
  tag <- TRUE
  fspnames <- as.vector(unlist(sapply(names(df), function(n) {
    clvl <- as.character(levels(df[[n]]))
    isL <- is.logical(df[[n]])
    isF <- is.factor(df[[n]])
    # The first factor keeps all of its levels; later factors drop their first level
    indx <- if (tag && isF) {
      seq_along(clvl)
    } else {
      -1
    }
    if (tag && isF) tag <<- FALSE
    paste0(n, clvl[indx], ifelse(isL, "TRUE", ""))
  })))

  fspm <- sparseMatrix(i = i, j = j, x = x, dims = c(nr, tail(cs, 1) + tail(jnlevels, 1)), dimnames = list(NULL, fspnames))

  return(fspm)
}
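The core idea the function relies on can be sketched on a single factor. This is a simplified illustration, not a replacement for the function above; the factor f and its values are hypothetical. Each row gets exactly one non-zero entry, located in the column given by the factor's integer code, so the (i, j, x) triplets can be written down directly without going through a model frame.

```r
library(Matrix)

# Hypothetical toy factor; any factor column would do
f <- factor(c("b", "a", "c", "a"))

# One-hot encode via triplets: row r has a single 1
# in the column given by the integer code of its level
onehot <- sparseMatrix(
  i = seq_along(f),
  j = as.integer(f),
  x = 1,
  dims = c(length(f), nlevels(f)),
  dimnames = list(NULL, paste0("f", levels(f)))
)

# Reference result from sparse.model.matrix
# (no intercept, so all levels of f are kept)
ref <- sparse.model.matrix(~ f - 1, data = data.frame(f = f))

# Compare values only; attributes such as row names may differ
all.equal(as.matrix(onehot), as.matrix(ref), check.attributes = FALSE)
```

To call the full function on a data frame df whose response column is named y (a hypothetical name), something like fspmatrix(df[, setdiff(names(df), "y"), drop = FALSE]) should work.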
