When to choose character instead of factor in R?

I am currently working on a dataset that contains a name attribute holding each person's first name. After reading the CSV file with read.csv, the variable is a factor by default (stringsAsFactors=TRUE) with ~10k levels. Since name does not reflect any group membership, I am unsure whether to leave it as a factor.

Is it necessary to convert name to character? Are there some advantages in doing (or not doing) this? Does it even matter?



A few thoughts on the question above:

  • I find this link about Factors in R very useful.
  • If you want to build a classification model or convert the character values to numbers, you have to convert them to a factor first: as.numeric(as.factor(name)). The resulting numbers are only the integer codes of the levels, so they make sense as categories, not as quantities. In your case, a meaningful category could be names with more or fewer than 4 letters, or names starting with a specific letter.
  • Converting a character vector to a factor can also save memory, mainly when the same values repeat many times.
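
The second point can be sketched roughly as follows. The sample names are made up for illustration, and the two derived groupings (name length and first letter) are just one possible reading of the suggestion above.

```r
# Hypothetical sample data, not taken from the original dataset
name <- c("Anna", "Bob", "Christopher", "Anna", "Al")

# Integer codes follow the sorted order of the levels;
# they identify categories and carry no numeric meaning
codes <- as.numeric(as.factor(name))

# A derived grouping that is genuinely categorical: short vs. long names
name_length <- factor(ifelse(nchar(name) > 4, "long", "short"))

# Another derived grouping: the first letter of each name
first_letter <- factor(substr(name, 1, 1))

table(name_length)
table(first_letter)
```

Groupings like these have only a handful of levels, which is where a factor is a natural fit.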

Happy coding!


Factors are stored as integer codes plus a table of levels. If you have categorical data with many repeated values, storing it as a factor may save a lot of memory.
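
A small sketch of that storage layout, using made-up values:

```r
# Hypothetical sample data
name <- factor(c("Bob", "Anna", "Bob"))

storage_type <- typeof(name)      # the underlying storage type
codes        <- as.integer(name)  # one integer code per element
lookup       <- levels(name)      # the table of distinct values

# Indexing the level table by the codes recovers the original strings
restored <- lookup[codes]
```

Here the codes are positions in the level table, which is why indexing the table by the codes gives back the original values.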

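For example, a character vector of length 1,000 whose strings are all distinct and 100 characters long holds at least 100,000 bytes of string data. Stored as a factor, it takes about 4,000 bytes for the integer codes (4 bytes each) plus the storage for the distinct levels. The saving therefore depends on how often values repeat. R also keeps only one copy of each distinct string behind a character vector, so in practice the difference is often smaller than this estimate suggests.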
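
Rather than relying on an estimate, the sizes can be measured with object.size(). The sketch below uses made-up data with only five distinct values. The exact numbers depend on the R build, so none are shown here; on a typical 64-bit build the factor should come out smaller.

```r
# Hypothetical data: 100,000 values drawn from 5 distinct 100-character strings
set.seed(1)
pool  <- strrep(letters[1:5], 100)
x_chr <- sample(pool, 1e5, replace = TRUE)
x_fct <- factor(x_chr)

# object.size() reports an estimate of the memory used by each object
size_chr <- object.size(x_chr)
size_fct <- object.size(x_fct)

print(size_chr, units = "Kb")
print(size_fct, units = "Kb")
```

With ~10k distinct names the level table itself becomes sizeable, so it is worth running the same comparison on the real column.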

Comparisons with factors should be quicker too because equality is tested by comparing the numbers, not the character values.

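The advantage of keeping it as character shows when you want to add new items. With a factor, a value that is not already a level cannot simply be assigned: it becomes NA with a warning unless you extend the levels first.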
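
A minimal sketch of the difference when adding a new value, with made-up names:

```r
# Hypothetical sample data
name_chr <- c("Anna", "Bob")
name_fct <- factor(name_chr)

# Character: a new value can simply be assigned
name_chr[3] <- "Carla"

# Factor: a value outside the levels becomes NA.
# R warns "invalid factor level, NA generated"; suppressed here for brevity.
bad <- name_fct
suppressWarnings(bad[3] <- "Carla")

# Factor: extend the levels first, then assign
good <- name_fct
levels(good) <- c(levels(good), "Carla")
good[3] <- "Carla"
```

With thousands of names and new ones arriving regularly, the extra step of maintaining the levels is the main practical cost of a factor.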

Store them as whatever makes the most sense for what the data represent. If name is not categorical, and it sounds like it isn't, then use character.
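
If you decide on character, there are two common ways to get there, sketched below. The file contents are invented for illustration. Since R 4.0.0, stringsAsFactors defaults to FALSE, so the first option only needs to be spelled out on older versions.

```r
# Hypothetical CSV content, written to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Anna,31", "Bob,45"), csv_path)

# Option 1: keep strings as character when reading the file
people <- read.csv(csv_path, stringsAsFactors = FALSE)

# Option 2: convert a column that was already read as a factor
people_fct <- read.csv(csv_path, stringsAsFactors = TRUE)
people_fct$name <- as.character(people_fct$name)

str(people)
```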
