TL;DR: grepl expects its first argument to be a single string (a character vector of length 1), not a longer vector. You can work around this with combinations of sapply and lapply (see below), but you are better served by a single regular expression that captures what you want to match in df1.MATCH, without using df2.PATTERN at all. This second option is much faster (if less intelligible) for a large data set. For this type of work, it is worth learning how to use regular expressions to their full potential.
df1 %>% filter(grepl(pattern = "^((ABC)( )*)+$", x = df1.MATCH, ignore.case = TRUE))
Explanation
The documentation for grepl shows the following usage:
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
The pattern argument comes first, and it should be a single string (one element). You are providing df1.MATCH to this argument, which is a vector.
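To see this concretely, here is a minimal sketch. The two vectors are assumptions inferred from the outputs shown in this answer, not your actual data. When pattern has more than one element, grepl warns and uses only the first one:

```r
# Assumed sample data, inferred from the outputs shown in this answer
df1.MATCH   <- c("ABC", "abc", "BCD")
df2.PATTERN <- c("ABC", "abc", "ABC abc")

# Passing a vector as `pattern`: grepl warns and uses only the first element.
# The handler prints the warning text and then lets the call continue.
res <- withCallingHandlers(
  grepl(df1.MATCH, df2.PATTERN),
  warning = function(w) {
    message("warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Same result as grepl("ABC", df2.PATTERN)
res
```

The result has one value per element of df2.PATTERN, not per element of df1.MATCH, which is why it is not directly usable as a filter on df1.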
We could use sapply to apply grepl to each element of df1.MATCH.
sapply(df1.MATCH, grepl, x = df2.PATTERN)
       ABC   abc   BCD
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,]  TRUE  TRUE FALSE
However, look at the output! You probably did not want a matrix. What happens when we run your grepl on just the first element of df1.MATCH?
grepl("ABC",df2.PATTERN)
[1] TRUE FALSE TRUE
We get a vector because grepl is checking ABC against each element of df2.PATTERN. To get a useful logical vector for filtering, you need to return a logical vector of the same length as df1.MATCH. I see two ways to do it.
Method 1: Use any
Since you want to know which elements in df1.MATCH match any elements in df2.PATTERN, you can use any, which returns TRUE if any element in its arguments is TRUE. We need slightly different syntax to make this work: wrap grepl in lapply to make a list of three logical vectors (one for each element in df1.MATCH), then feed that list into sapply, which applies any to each vector. If we called any directly on the sapply matrix from above, it would return only a single value for the whole matrix.
any(grepl("ABC", df2.PATTERN))
[1] TRUE
sapply(
  lapply(df1.MATCH, grepl, x = df2.PATTERN),
  any)
[1] TRUE TRUE FALSE
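As a variation on the same idea, the sketch below shows two other ways to get one logical value per element of df1.MATCH. The sample vectors are assumed (inferred from the outputs in this answer), and the names keep_vapply and keep_matrix are only illustrative. vapply checks that every result is a single logical, and colSums collapses the matrix from the earlier sapply call column by column.

```r
# Assumed sample data, inferred from the outputs shown in this answer
df1.MATCH   <- c("ABC", "abc", "BCD")
df2.PATTERN <- c("ABC", "abc", "ABC abc")

# One logical value per element of df1.MATCH; vapply enforces that shape
keep_vapply <- vapply(
  df1.MATCH,
  function(p) any(grepl(p, df2.PATTERN)),
  logical(1),
  USE.NAMES = FALSE
)

# Alternative: collapse the sapply matrix column by column.
# A column with at least one TRUE has a column sum greater than zero.
m <- sapply(df1.MATCH, grepl, x = df2.PATTERN)
keep_matrix <- unname(colSums(m) > 0)

keep_vapply
keep_matrix
```

Both vectors should agree with the sapply/lapply result above. They still call grepl once per element, so they are not expected to be meaningfully faster.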
Method 2: Write a better regular expression.
You want to match the contents of df1.MATCH against possible values that look like abc, ABC, ABC ABC, or ABC abc, etc. You can encompass all of this in a single regex string. The string you want is
"^((ABC)( )*)+$"
^ # Nothing else before this
(ABC) # Must contain ABC together as a group
( )* # followed by any number of spaces (including 0)
((ABC)( )*)+ # Look for the ABC (space) pattern repeated one or more times
$ # And nothing else after it
Then use grepl with ignore.case = TRUE:
grepl("^((ABC)( )*)+$", df1.MATCH, ignore.case = TRUE)
[1] TRUE TRUE FALSE
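Before relying on a regex like this, it helps to try it on a few edge cases. The candidate strings below are hypothetical test values, not taken from your data. This sketch checks which of them the pattern accepts:

```r
pattern <- "^((ABC)( )*)+$"

# Hypothetical test strings covering matches and near misses
candidates <- c("ABC", "abc", "ABC ABC", "ABC abc", "ABCABC",
                "BCD", "ABCD", " ABC")

matches <- grepl(pattern, candidates, ignore.case = TRUE)

# Show each candidate next to its result
data.frame(candidates, matches)
```

The first five candidates match and the last three do not. Note that ABCABC matches because ( )* allows zero spaces between repetitions, and that a leading space is rejected while trailing spaces are allowed. Tighten the pattern if either of those is not what you want.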
Benchmarking
In a large dataset, one of these will perform faster. Let's find out. Your benchmark results will vary by your machine's resources.
library(dplyr)       # provides %>% and filter()
library(rbenchmark)  # provides benchmark()

df1.MATCH <- sample(c("ABC", "abc", "BCD"), size = 100000, replace = TRUE)
df1 <- data.frame(df1.MATCH)
df2.PATTERN <- c("ABC", "abc", "ABC abc")

benchmark(
  "any lapply" = {
    df1 %>%
      filter(sapply(lapply(df1.MATCH, grepl, x = df2.PATTERN), any))
  },
  "better regex" = {
    df1 %>%
      filter(grepl("^((ABC)( )*)+$", df1.MATCH, ignore.case = TRUE))
  }
)
          test replications elapsed relative user.self sys.self user.child sys.child
1   any lapply          100  149.13   70.678    147.67     0.39         NA        NA
2 better regex          100    2.11    1.000      2.10     0.02         NA        NA
It looks like the improved regex method is significantly faster (about 70 times faster in this run). That's because it makes a single vectorized grepl call over the whole column before filtering. The other method does much more work per row: lapply calls grepl once for every row, each call testing that row's value against all three elements of df2.PATTERN, and sapply then calls any on each list element (each row).
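If you do not have rbenchmark installed, a rough check with base R alone is possible. This is only a sketch: the seed and the smaller sample size are arbitrary choices, and system.time on a single run is much noisier than a replicated benchmark. It also confirms that both methods select the same rows for this kind of data.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

df2.PATTERN <- c("ABC", "abc", "ABC abc")
values <- sample(c("ABC", "abc", "BCD"), size = 1000, replace = TRUE)

# Method 1: one grepl call per element, then any() per element
time_any <- system.time(
  keep_any <- sapply(lapply(values, grepl, x = df2.PATTERN), any)
)

# Method 2: a single vectorized grepl call
time_regex <- system.time(
  keep_regex <- grepl("^((ABC)( )*)+$", values, ignore.case = TRUE)
)

# Both methods should flag exactly the same elements
identical(keep_any, keep_regex)

# Compare elapsed times (expect run-to-run variation)
rbind(any_lapply = time_any, better_regex = time_regex)
```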