[R Tips] Use “grepl” to search through text data  

[For the Korean version, please click the title]

The project I got myself into requires me to identify brands from nothing but store names. I could read through all 3,000,000 store names by hand, but that would be silly. CTRL+F is the basic command for searching for keywords, and “grepl” is the CTRL+F of R: it checks every element of a text vector against a pattern and tells you which ones match.

Let's begin by creating a text data set.

####################
# Create a data set of random letter combinations
####################

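# 52 two-letter strings: "aa", "ba", "cb", "db", ...
sample <- paste(rep(letters, 2), rep(letters, each = 2), sep = "")

# The alphabet in reverse order: "z", "y", ..., "a"
x <- letters
x <- x[order(x, decreasing = TRUE)]

# Append the reversed letters: "aaz", "bay", "cbx", ...
sample <- paste(sample, rep(x, 2), sep = "")

# Build the data frame directly so that "number" stays numeric
# (going through cbind() would turn it into text)
sample <- data.frame(number = seq_along(sample), name = sample, stringsAsFactors = FALSE)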

####################
# The code above should give you a data frame that looks like this:
#
#   number name
# 1      1  aaz
# 2      2  bay
# 3      3  cbx
# 4      4  dbw
# 5      5  ecv
# 6      6  fcu
# 7      7  gdt
# …
####################

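Example 1) Find store names that contain “c”

sample[grepl("c", sample$name), ]
sample[grepl("c", sample[, 2]), ]

In grepl(A, B), A is the keyword (pattern) you are looking for and B is the text vector you are searching in. grepl returns TRUE or FALSE for each element of B, which is why it can be used to pick rows.

In this example, we are looking for “c” in the vector sample$name. The two lines do the same thing: sample[, 2] selects the name column by position instead of by name.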
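
To see what grepl returns on its own, here is a minimal sketch on a handful of made-up store names. The `stores` vector is hypothetical and only there for illustration; it is not part of the project data.

```r
# Hypothetical store names, for illustration only
stores <- c("cafe alpha", "book nook", "ice cream hut", "bakery")

# grepl() returns one TRUE/FALSE per element of the vector
hits <- grepl("c", stores)
print(hits)          # TRUE FALSE TRUE FALSE

# The logical vector can be used directly to subset
print(stores[hits])  # "cafe alpha" "ice cream hut"

# TRUE counts as 1, so sum() gives the number of matches
print(sum(hits))     # 2
```

The same logical vector is what selects the rows when it is placed inside sample[ , ].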


Example 2) Find store names beginning with “c”

sample[grepl("^c", sample$name), ]
sample[grepl("^c", sample[, 2]), ]

Regular expressions are a must when dealing with text data.

“^” is the symbol for the beginning of the text. Since grepl checks each element separately, “^c” matches every name whose first character is “c”.
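
A small sketch of the “^” anchor, again on made-up names (the `stores` vector is an assumption for illustration). For a plain prefix like this, base R (3.3.0 and later) also has startsWith(), which does not use regular expressions at all; I show it only as a cross-check.

```r
# Hypothetical store names, for illustration only
stores <- c("cafe alpha", "ice cream hut", "corner shop")

# "^c" only matches when "c" is the very first character
starts_with_c <- grepl("^c", stores)
print(starts_with_c)          # TRUE FALSE TRUE
print(stores[starts_with_c])  # "cafe alpha" "corner shop"

# Cross-check with the non-regex helper from base R
print(startsWith(stores, "c"))  # TRUE FALSE TRUE
```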


Example 3) Find store names ending with “c”

sample[grepl("c$", sample$name), ]
sample[grepl("c$", sample[, 2]), ]

“$” is the symbol for the end of the text, so “c$” matches every name whose last character is “c”.
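
Here is a short sketch of the “$” anchor on made-up strings (the `stores` vector is hypothetical). It also shows what happens when the two anchors are combined: the pattern then has to match the whole string.

```r
# Hypothetical names, for illustration only
stores <- c("abc", "abcd", "xabc", "mac")

# "c$" only matches when "c" is the very last character
ends_with_c <- grepl("c$", stores)
print(ends_with_c)  # TRUE FALSE TRUE TRUE

# "^" and "$" together: the whole string must be exactly "abc"
exact_abc <- grepl("^abc$", stores)
print(exact_abc)    # TRUE FALSE FALSE FALSE
```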


Example 4) Find store names with “c” or “a”

x <- c("c", "a")
sample[grepl(paste(x, collapse = "|"), sample$name), ]
sample[grepl(paste(x, collapse = "|"), sample[, 2]), ]

paste(x, collapse = "|") joins all the elements of x into the single pattern "c|a". In a regular expression “|” means “or”, so the pattern matches any name that contains “c” or “a”.
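
It helps to look at the pattern that paste builds before handing it to grepl. A minimal sketch, with hypothetical `keywords` and `stores` vectors:

```r
# Hypothetical keywords and names, for illustration only
keywords <- c("c", "a")
stores <- c("cat", "dog", "bee", "ant")

# collapse = "|" glues the keywords into one "or" pattern
pattern <- paste(keywords, collapse = "|")
print(pattern)                 # "c|a"

# A name matches if it contains at least one of the keywords
has_keyword <- grepl(pattern, stores)
print(has_keyword)             # TRUE FALSE FALSE TRUE
print(stores[has_keyword])     # "cat" "ant"
```

One thing to watch for: the keywords are read as regular expressions, so a keyword containing a special character such as “.” (which matches any character) may match more than intended unless it is escaped.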


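Example 5) Find store names without “c”

sample[!grepl("c", sample$name), ]
sample[!grepl("c", sample[, 2]), ]

The only difference from example 1 is the “!” in front of grepl.

“!” is the logical NOT operator in R. It flips every TRUE to FALSE and every FALSE to TRUE, so we get the complementary set: the rows that do not match.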
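
A quick sketch of what “!” does to the result of grepl, on a hypothetical `stores` vector. Every name ends up in exactly one of the two groups.

```r
# Hypothetical names, for illustration only
stores <- c("cat", "dog", "bee")

has_c <- grepl("c", stores)
no_c <- !has_c             # flip TRUE and FALSE

print(has_c)               # TRUE FALSE FALSE
print(no_c)                # FALSE TRUE TRUE
print(stores[no_c])        # "dog" "bee"

# The matches and non-matches together cover the whole vector
print(sum(has_c) + sum(no_c) == length(stores))  # TRUE
```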



If you are new to R, I recommend the video tutorials from Google.

https://www.youtube.com/playlist?list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP



Please comment if you have any questions about anything.

http://1000wonicecoffee.svbtle.com/

 