R Studio - grepl compare a column in a dataframe to a list of pattern

Question


vicky

May 7, 2022 23:03

I have a column named "MATCH" in a dataframe and a list of patterns named "PATTERN".

df1.MATCH <- c("ABC", "abc", "BCD")
df1 <- as.data.frame(df1.MATCH)
df2.PATTERN <- c("ABC", "abc", "ABC abc")

I want to use grepl to compare MATCH column with PATTERN, if true, I will apply my functions. The desired result would be "ABC" matches "ABC" and "ABC abc". This is the code I used:

df1 %>% filter(grepl(df1.MATCH, df2.PATTERN)) %>% ...

I get error:

"Warning message: In grepl(TXN_GROUP, parm[3]) : argument 'pattern' has length > 1 and only the first element will be used"

I understand I can't use grepl to a list of vectors. Is there any way to solve it?

Topic regex r

Category Data Science

Ben Norris · Accepted Answer · June 6, 2020 14:30

TL;DR: grepl expects its first argument to be a single string (a character vector of length 1), not a longer vector. You can solve this with combinations of sapply and lapply (see below), but you are better served by a single regular expression that captures what you want to match in df1.MATCH, without using df2.PATTERN at all. This second option is much faster (if less intelligible) for a large data set. For this type of work, it is worth learning how to use regular expressions to their full potential.

df1 %>% filter(grepl(pattern = "^((ABC)( )*)+$", x = df1.MATCH, ignore.case = TRUE))

Explanation

The documentation for grepl shows the following usage:

grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
  fixed = FALSE, useBytes = FALSE)

The pattern argument is first, and this argument should be a string (one element). You are providing df1.MATCH to this argument, which is a vector.

We could use sapply to apply grepl to each element of df1.MATCH.

sapply(df1.MATCH, grepl, x = df2.PATTERN)
       ABC   abc   BCD
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,]  TRUE  TRUE FALSE

However, look at the output! You probably did not want a matrix. What happens when we run your grepl on just the first element of df1.MATCH?

grepl("ABC",df2.PATTERN)
[1]  TRUE FALSE  TRUE

We get a vector because grepl is checking ABC against each element of df2.PATTERN. To get a useful logical vector for filtering, you need to return a logical vector of the same length as df1.MATCH. I see two ways to do it.

Method 1: Use any

Since you want to know which elements in df1.MATCH match any element in df2.PATTERN, you can use any, which returns TRUE if any element of its arguments is TRUE. We need slightly different syntax to make this work. We wrap grepl in lapply to make a list of three logical vectors (one for each element in df1.MATCH), and then feed that list into sapply, which applies any to each list element. If we call any directly on the sapply output from earlier, it returns only one value, because that output is a single matrix.

any(grepl("ABC", df2.PATTERN))
[1]  TRUE

sapply(
  lapply(df1.MATCH, grepl, x = df2.PATTERN),
  any)
[1]  TRUE  TRUE FALSE
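
A more compact variant of the same idea, as a sketch only: vapply can run any(grepl(...)) once per element of df1.MATCH, and it checks that each call returns exactly one logical value. The helper name matches_any is made up for illustration, and fixed = TRUE is an assumption that the values in df1.MATCH are literal strings and not regular expressions.

```r
# Sample data from the question
df1.MATCH   <- c("ABC", "abc", "BCD")
df2.PATTERN <- c("ABC", "abc", "ABC abc")

# Hypothetical helper: TRUE when `value` occurs in at least one element
# of `targets`. fixed = TRUE treats `value` as a literal string
# (an assumption; drop it if the values are meant to be regexes).
matches_any <- function(value, targets) {
  any(grepl(value, targets, fixed = TRUE))
}

# vapply enforces exactly one logical per element of df1.MATCH;
# USE.NAMES = FALSE keeps the result an unnamed vector.
keep <- vapply(df1.MATCH, matches_any, logical(1),
               targets = df2.PATTERN, USE.NAMES = FALSE)
print(keep)
# [1]  TRUE  TRUE FALSE
```

The resulting vector has the same length as df1.MATCH, so it can be passed straight to filter() or used as df1[keep, , drop = FALSE].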

Method 2: Write a better regular expression.

You want to match the contents of df1.MATCH against possible values that look like abc, ABC, ABC ABC, or ABC abc, etc. You can encompass all of this in a single regex string. The string you want is

"^((ABC)( )*)+$"
^                   # Nothing else before this
(ABC)               # Must contain ABC together as a group
( )*                # followed by any number of spaces (including 0)
((ABC)( )*)+        # Look for the ABC (space) pattern repeated one or more times
$                   # And nothing else after it

Then use grepl with ignore.case = TRUE:

grepl("^((ABC)( )*)+$", df1.MATCH, ignore.case = TRUE)
[1]  TRUE  TRUE FALSE
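
To see what the regex breakdown above accepts and rejects, it can help to try the pattern on a few more strings. The candidate strings below are hypothetical test inputs chosen to probe each part of the pattern; they are not from the original question.

```r
pattern <- "^((ABC)( )*)+$"

# Hypothetical test strings, one per behaviour worth checking
candidates <- c(
  "ABC",       # exact match
  "abc",       # matches only because ignore.case = TRUE
  "ABC abc",   # the group repeated, separated by one space
  "ABC  ABC",  # any number of spaces between repeats
  "ABCABC",    # zero spaces between repeats is also allowed
  "BCD",       # does not contain the ABC group
  "xABC",      # rejected by ^ (something comes before ABC)
  "ABCD",      # rejected by $ (something comes after ABC)
  ""           # rejected by + (at least one ABC is required)
)

result <- grepl(pattern, candidates, ignore.case = TRUE)
print(data.frame(candidate = candidates, matches = result))
```

The first five candidates match and the last four do not. Note that "ABCABC" passes, so if repeats must be separated by at least one space, the pattern would need to be tightened.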

Benchmarking

In a large dataset, one of these will perform faster. Let's find out. Your benchmark results will vary by your machine's resources.

df1.MATCH <- sample(c("ABC", "abc" ,"BCD"), size = 100000, replace = TRUE)
df1 <- data.frame(df1.MATCH) 
df2.PATTERN <- c("ABC", "abc", "ABC abc")

library(dplyr)       # provides %>% and filter()
library(rbenchmark)

benchmark("any lapply" = {
              df1 %>% 
                filter(sapply(lapply(df1.MATCH, grepl, x=df2.PATTERN), any) )
          }, 
         "better regex" = {
              df1 %>%
                filter(grepl("^((ABC)( )*)+$", df1.MATCH, ignore.case = TRUE))
          }
          )

          test replications elapsed relative user.self sys.self user.child sys.child
1   any lapply          100  149.13   70.678    147.67     0.39         NA        NA   
2 better regex          100    2.11    1.000      2.10     0.02         NA        NA

It looks like the improved regex method is significantly faster (about 70 times faster in this run). That's because it makes a single vectorized grepl call over the whole column before filtering. The other method does far more work: lapply calls grepl once per row, each call checks all three elements of df2.PATTERN, and sapply then calls any once per row.
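
Before trusting a speed comparison, it is worth confirming that the two methods select the same rows. The check below is a minimal sketch in base R; the seed and the smaller sample size are arbitrary choices made here so the check is reproducible and quick, and are not part of the benchmark above.

```r
set.seed(42)  # arbitrary seed, only so the sample is reproducible
df1.MATCH   <- sample(c("ABC", "abc", "BCD"), size = 1000, replace = TRUE)
df2.PATTERN <- c("ABC", "abc", "ABC abc")

# Method 1: one grepl call per row, then any() per row
keep_any <- sapply(lapply(df1.MATCH, grepl, x = df2.PATTERN), any)

# Method 2: a single vectorized grepl call with the combined regex
keep_regex <- grepl("^((ABC)( )*)+$", df1.MATCH, ignore.case = TRUE)

# Both are unnamed logical vectors, so identical() is a fair comparison
same <- identical(keep_any, keep_regex)
print(same)
# [1] TRUE
```

The two methods agree here because the data contains only "ABC", "abc", and "BCD". They are not equivalent in general: a value such as "AB" would pass Method 1 (it is a substring of "ABC") but fail the anchored regex in Method 2.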
