Faster alternatives to sparse.model.matrix?

Question

Isaac T

May 26, 2022 08:01

I have a large dataset that is entirely categorical. I want to train an xgboost model on it, so I first have to convert the categorical data to numeric form. So far I've been using sparse.model.matrix() from the Matrix package, but it is far too slow. I found a great solution here; however, the sparse matrix it returns is not the same as the one sparse.model.matrix returns. I know I can force sparse.model.matrix to return the same output as the linked solution by providing contrasts, but that doesn't help: it is still too slow, and it produces a different representation (and hence a different trained model).

Is there a way to do what sparse.model.matrix does that is even nearly as fast as the solution I linked? On my data, that solution runs in about 15% of the time.
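For readers unfamiliar with the contrasts option mentioned in the question, here is a minimal sketch of how the default encoding and a full dummy encoding differ. The toy data frame and the names m_default and m_full are made up for illustration, and whether the full encoding matches the linked solution's output is an assumption, since that solution is not shown here.

```r
library(Matrix)

# Hypothetical toy data: two factor columns
df <- data.frame(
  a = factor(c("x", "y", "x")),
  b = factor(c("u", "v", "v"))
)

# Default: intercept column plus treatment contrasts
# (the first level of every factor is dropped)
m_default <- sparse.model.matrix(~ ., data = df)

# Full dummy encoding: no intercept, one column per level of every factor.
# contrasts(f, contrasts = FALSE) returns an identity matrix over the levels.
m_full <- sparse.model.matrix(
  ~ . - 1,
  data = df,
  contrasts.arg = lapply(df, contrasts, contrasts = FALSE)
)

print(dim(m_default))
print(dim(m_full))
```

The two matrices have different column counts, which is why a model trained on one representation will not be identical to a model trained on the other.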

Topic representation r categorical-data

Category Data Science

Bruno Tremblay · Accepted Answer · December 1, 2018 10:55

Try this, but remove the LHS (response) variables from the dataset beforehand. We use it for the data-processing step when putting our xgboost models into production.

library(Matrix)  # provides sparseMatrix()

fspmatrix <- function(df) {
  # Building sparse matrix with sparseMatrix to save time vs sparse.model.matrix

  # Details about the df to help build triplet
  nr <- nrow(df)
  nc <- ncol(df)
  nlevels <- pmax(1, sapply(df, nlevels))
  fac <- which(sapply(df, is.factor))
  # Check that all factors have at least 2 levels
  stopifnot(min(nlevels[fac]) > 1)

  # Building the value vector removing zero positions and first levels of factors after the first factor
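  # Convert each column separately: unlist() on a data frame whose columns are
  # all factors returns one factor with merged levels, which changes the codes
  x <- unlist(lapply(df, as.numeric), use.names = FALSE)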
  jx <- x # for use in building j
  # Positions of values coming from factors
  pos <- as.vector(sapply(fac * nr, function(x) x - nr + (1:nr)))
  # Positions of first levels of factors after the first factor
  pos1 <- pos[-1:-nr]
  pos1 <- pos1[which(x[pos1] == 1)]
  # Replacing the values of factors with 1 for one hot encoding
  x[pos] <- 1
  # Positions of non-zero values
  # A logical mask is used instead of x[-pos1], which returns an empty vector
  # when pos1 is empty (e.g. when there is only one factor)
  keep <- rep(TRUE, length(x))
  keep[pos1] <- FALSE
  nzpos <- which(x[keep] != 0)
  # Values to build triplet around
  x <- x[keep][nzpos]

  # Building the i vector, the first dim part of the triplet
  i <- rep(seq_len(nr), nc)[keep][nzpos]

  # Building the j vector, the second dim part of the triplet
  jnlevels <- nlevels
  jnlevels[fac[-1]] <- jnlevels[fac[-1]] - 1
  jx[-pos] <- 1
  jx[pos[-1:-nr]] <- jx[pos[-1:-nr]] - 1
  cs <- cumsum(c(0, head(jnlevels, -1)))
  j <- jx + rep(cs, each = nr)
  j <- j[keep][nzpos]

  # Building the dimnames vector
  tag <- TRUE
  fspnames <- as.vector(unlist(sapply(names(df), function(n) {
    clvl <- as.character(levels(df[[n]]))
    isL <- is.logical(df[[n]])
    isF <- is.factor(df[[n]])
    indx <- if (tag & isF) {
      1:length(clvl)
    } else {
      -1
    }
    if (tag & isF) tag <<- FALSE
    paste0(n, clvl[indx], ifelse(isL, "TRUE", ""))
  })))

  fspm <- sparseMatrix(
    i = i, j = j, x = x,
    dims = c(nr, tail(cs, 1) + tail(jnlevels, 1)),
    dimnames = list(NULL, fspnames)
  )

  return(fspm)
}
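
To make the triplet idea easier to follow, here is a reduced sketch for the all-factor case only. It is not the function above: the toy data and the names X, codes, width, offset and keep are invented for illustration, and the comparison assumes that sparse.model.matrix(~ . - 1, ...) is the reference encoding being targeted (all levels of the first factor, first level dropped for later factors).

```r
library(Matrix)

# Hypothetical toy data: a response column y plus two factor predictors
df <- data.frame(
  y = c(1, 0, 1, 0),
  a = factor(c("x", "y", "z", "x")),
  b = factor(c("u", "v", "u", "v"))
)

# Drop the left-hand-side (response) column before encoding
X <- df[, setdiff(names(df), "y"), drop = FALSE]

nr    <- nrow(X)
codes <- lapply(X, as.integer)           # integer level codes per column
nlev  <- vapply(X, nlevels, integer(1))  # number of levels per column

# Keep every level of the first factor, drop the first level of later ones
drop_first <- c(FALSE, rep(TRUE, ncol(X) - 1))
width  <- nlev - drop_first               # output columns per factor
offset <- cumsum(c(0, head(width, -1)))   # starting column offset per factor

# Triplet pieces: row index, column index, and which entries survive
i <- rep(seq_len(nr), ncol(X))
j <- unlist(Map(function(cd, off, d) cd - d + off, codes, offset, drop_first),
            use.names = FALSE)
keep <- unlist(Map(function(cd, d) !(d & cd == 1L), codes, drop_first),
               use.names = FALSE)

m <- sparseMatrix(i = i[keep], j = j[keep], x = rep(1, sum(keep)),
                  dims = c(nr, sum(width)))

# Reference encoding for comparison
ref <- sparse.model.matrix(~ . - 1, data = X)
print(all(as.matrix(m) == as.matrix(ref)))
```

The speed gain comes from computing i and j with vectorized arithmetic on the level codes instead of going through the formula and model-frame machinery.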
