R - wordcloud, grep, 정규표현식,stringr

Positive_Monster 2022. 2. 6. 17:21

▣ 주요 키워드 ▣

wordcloud
grep
정규표현식(Regular Expression)
stringr

★wordcloud

library(wordcloud)
word <- c('건강','취업','비전','희망','공부', 'd','a','z','g','h')
freq <- c(100,200,300,250,350,100,50,60,70,80)
length(word)
length(freq)

wordcloud(word,freq,colors=brewer.pal(9,'Set1'),  #wordcloud(문자열,빈도수,컬러)
          random.order=F, # F는 크기순, T는 랜덤
          scale=c(2,1), # 글자 크기 조정
          min.freq = 60, # 최소 60이상부터 출력
          max.words=100) # 최대 100단어까지만 출력

[문제182] 공백문자를 기준으로 분리 한 후 단어의 빈도수를 구하고 wordcloud를 이용해서 시각화 해주세요.
data <- "R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.[6] The R language is widely used among statisticians and data miners for developing statistical software[7] and data analysis.[8] Polls, data mining surveys, and studies of scholarly literature databases show substantial increases in popularity;[9] as of January 2021, R ranks 9th in the TIOBE index, a measure of popularity of programming languages.[10]
A GNU package,[11] the official R software environment is written primarily in C, Fortran, and R itself[12] (thus, it is partially self-hosting) and is freely available under the GNU General Public License. Pre-compiled executables are provided for various operating systems. Although R has a command line interface, there are several third-party graphical user interfaces, such as RStudio, an integrated development environment, and Jupyter, a notebook interface.[13][14]"

data
data <- unlist(strsplit(data,split=' '))
data <- data.frame(table(data))
wordcloud(data$data,data$Freq,colors=brewer.pal(9,'Set1'),
          random.order=F,
          scale=c(2,1),
          min.freq = 60,
          max.words=100)

★ grep

- 동일한 문자열을 벡터에서 찾아서 인덱스번호, 문자열을 리턴하는 함수

text <-  c('a','ab','acb','accb','acccb','accccb')
'a' %in% text # text에 있는 값들중에 하나라도 a가 있으면 true출력
text %in% 'a' # text 각 값들을 a와 비교하여 맞으면 각각 true, 틀리면 false 출력
text[text %in% 'a'] # 
text[which(text %in% 'a')]

grep('a',text) # text변수에 a가 있는 인덱스를 모두 출력
grep('ab',text) 
grep('acb',text)
text[grep('c',text)] 
grep('c',text,value=T) # value=T c라는 글자가 있는 인덱스를 찾아서 단어를 보여줌

★ 정규표현식(Regular Expression)

test <- c('a','ab','(ab','ac','acb','abb','abbb','abc','abzzzzzzz','cab','bac','bbb','aaa','ss','s ab','a12b')
grep('a',test) # test변수에 a가 있는 값들의 인덱스를 추출
grep('a',test,value=T) # test변수에 a가 있는 인덱스가 아닌 값을 그대로 추출(value=T조건)

* -> 적어도 0번 이상의 패턴을 찾는다.
grep('ac*',test,value=T) # a가 있고 c가0번 이상 있는 값들 추출

+ -> 적어도 1번 이상의 패턴을 찾는다.
grep('ab+',test,value=T) # a가 있고 b가 1번 이상 있는 값들 추출

? -> 0번 또는 1번의 패턴을 찾는다.
grep('ab?c',test,value=T) # abc표현에서 b가 0번 있거나 1번 있는 값들 추출 

{n} -> n번의 패턴을 찾는다.
grep('ab{2}',test,value=T) # ab표현에서 b가 2번 있는 값들 추출

{n,} -> n번 이상의 패턴을 찾는다.
grep('ab{2,}',test,value=T) # ab표현에서 b가 2번 이상 있는 값들 추출

{n,m} -> n번부터 m번까지의 패턴을 찾는다.
grep('b{2,3}',test,value=T) # b가 2개 이상 3개 이하의 값들만 추출

^ -> 시작, 시작되는 문자를 찾는다.
grep('^b',test,value=T) # b로 시작하는 값만 추출

$ -> 끝, 끝나는 문자를 찾는다.
grep('b$',test,value=T) # b로 끝나는 값만 추출

\\b -> 시작되는 문자를 찾는데 빈문자열 뒤에 시작되는 문자도 찾는다.
grep('\\bab',test,value=T) # ab로 시작하는 값만 추출하는데  빈문자열 뒤에 시작되는 s ab도 찾는다.

. -> 어떤 문자 하나를 의미한다.
grep('ab.',test,value=T) # ab뒤에 문자 하나 이상있는 값들 추출
grep('ab...',test,value=T) #ab뒤에 문자가 3개 이상있는 값들 추출

[...] -> 리스트 안에 있는 문자패턴을 찾는다.
grep('ab[c,b]',test,value=T) #ab뒤에 c 또는 b가 있는 값들 추출

[n-m] -> 리스트 안에 n부터 m까지 문자 패턴을 찾는다.
grep('ab[a-d]',test,value=T) #ab뒤에 a부터 d까지 문자 패턴을 찾아서 추출

[^] -> 리스트 안에 있는 ^는 not을 의미한다.
grep('ab[^b]',test,value=T) # ab뒤에 b가 아닌 값들만 추출
grep('ab[^b,c]',test,value=T) # ab뒤에 b 또는 c가 나오지 않는 값들만 추출

\\*,\\+,\\?,\\^,\\$ -> 순수한 문자로 표현할 때 (역슬래시\\)사용
grep('\\(ab',test,value=T)

●숫자 찾는 방법
grep('[0-9]',text,value=T) #숫자가 들어간 값들 추출
grep('\\d',text,value=T)
grep('[[:digit:]]',text,value=T)

●대문자 찾는 방법
grep('[A-Z]',text,value=T)
grep('[[:upper:]]',text,value=T)

●소문자 찾는 방법
grep('[a-z]',text,value=T)
grep('[[:lower:]]',text,value=T)

●대소문자 다 찾는 방법
grep('[A-Za-z]',text,value=T)
grep('[A-z]',text,value=T)

●한글 찾는 방법
grep('[ㄱ-ㅣ]',text,value=T) # 자음모음
grep('[가-힣]',text,value=T) 
grep('[가-힣ㄱ-ㅣ]',text,value=T)

●글자들 다 찾기
grep('[[:alpha:]]',text,value=T)
grep('[A-Za-z가-힣ㄱ-ㅣ]',text,value=T)

●문자, 숫자가 있는 문자 패턴을 찾는 방법
grep('[A-Za-z가-힣ㄱ-ㅣ0-9]',text,value=T)
grep('[[:alnum:]]',text,value=T)
grep('\\w',text,value=T)

●특수문자가 있는 문자 패턴을 찾는 방법
grep('\\W',text,value=T)
grep('[[:punct:]]',text,value=T)

●숫자제외 (다른 문자들과 같이 있는 숫자는 출력됨)
grep('[^0-9]',text,value=T)
grep('\\D',text,value=T)
grep('[^[:digit:]]',text,value=T)

● | (또는)
grep('python|100',text,value=T) #python 또는 100이 들어간 값 추출

●(글자1|글자2) -> 글자1 또는 글자2의 문타패턴을 찾는 방법
text1 <- c('fight','figgt')
grep('fig(h|g)t',text1,value=T)

⊙ 한 줄로 들어온 긴 문장, 단어들을 공백으로 구분하고 조건에 맞는 단어들 추출하기

data <- "R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.[6] The R language is widely used among statisticians and data miners for developing statistical software[7] and data analysis.[8] Polls, data mining surveys, and studies of scholarly literature databases show substantial increases in popularity;[9] as of January 2021, R ranks 9th in the TIOBE index, a measure of popularity of programming languages.[10]
A GNU package,[11] the official R software environment is written primarily in C, Fortran, and R itself[12] (thus, it is partially self-hosting) and is freely available under the GNU General Public License. Pre-compiled executables are provided for various operating systems. Although R has a command line interface, there are several third-party graphical user interfaces, such as RStudio, an integrated development environment, and Jupyter, a notebook interface.[13][14]"

data <- unlist(strsplit(data,split=' ')) # strsplit함수를 이용하여 공백단위로 나눠주고 나면 리스트 형식으로 들어가는데 추출하기 더 수월한 백터형이 필요하므로 unlist함수를 사용


[문제183] 첫문자가 대문자로 시작되는 단어를 찾으세요.
grep('^[[:upper:]]{1,}',data,value=T)
grep('^[A-Z]{1,}',data,value=T)

[문제184] 숫자가 있는 단어를 찾아주세요.
grep('[[:digit:]]',data,value=T)
grep('\\d',data,value=T)
grep('[0-9]',data,value=T)

[문제185] 특수문자가 있는 단어를 찾아주세요.
grep('[[:punct:]]',data,value=T)
grep('\\W',data,value=T)

[문제186] [숫자]가 있는 단어를 찾아주세요.
grep('\\[\\d+\\]',data,value=T)

★library(stringr)

text <- c('sql','SQL','Sql100','PLSQL','plsql','R','r','r0','python','PYTHON','pyth0n',
          'python#','100','*100','*','$','^','!','@','#','$','%','(',')','~','?','행복',
          'ㅋㅋㅋㅋ','ㅜㅜㅜㅜ')

●str_detect : 특정한 문자가 있는지 검사해서 TRUE,FALSE를 리턴하는 함수
text %in% 'SQL'
grep('SQL',text,value=T)
str_detect(text,'SQL') # boolean형식으로 출력
text[str_detect(text,'SQL')]

●str_count : 주어진 단어에서 해당 글자가 몇번 나오는지를 리턴하는 함수
str_count(text,'s')
str_count(text,'S')
str_count(text,'sql')

●str_c : 문자열을 합쳐서 출력하는 함수
paste('R','빅데이터분석')
paste('R','빅데이터분석',sep='')
paste0('R','빅데이터분석')
str_c('R','빅데이터분석')

●str_dup : 주어진 문자열을 주어진 횟수만큼 반복해서 출력하는 함수
str_dup('아 졸리다',10)

●str_length : 주어진 문자열의 길이를 리턴하는 함수
str_length('좋은 하루 되세요')

●str_locate : 주어진 문자열에서 특정한 문자가 처음으로 나오는 위치를 리턴하는 함수
str_locate('Have a nice day','a')

●str_locate_all : 주어진 문자열에서 특정한 문자가 나오는 모든 위치를 리턴하는 함수
str_locate_all('Have a nice day','a')
str_locate_all('Have a nice day','a')[[1]][1]
str_locate_all('Have a nice day','a')[[1]][2]
str_locate_all('Have a nice day','a')[[1]][3]

●str_replace : 주어진 문자열에서 문자를 새로운 문자로 바꾸는 함수, 첫번째 일치하는 문자만 바꿈
sub('a','*','abracadabra')
str_replace('abracadabra','a','*')

●str_replace_all : 주어진 문자열에서 문자를 새로운 문자로 바꾸는 함수, 일치하는 모든 문자 바꿈
gsub('a','*','abracadabra')
str_replace_all('abracadabra','a','*')

●str_split : 주어진 문자열에서 지정된 문자를 기준으로 분리하는 함수
strsplit('R,Developer',split=',')
str_split('R,Developer',',')

●str_sub : 주어진 문자열에서 지정된 시작인덱스부터 끝인덱스까지 문자를 추출하는 함수
substr('RDeveloper',1,1)
str_sub('RDeveloper',start=1,end=1)
str_sub('RDeveloper',1,1)
str_sub('RDeveloper',3,5)

substr('RDeveloper',2,nchar('RDeveloper')) #substr은 시작과 끝을 필수적으로 적어야한다.
substring('RDveloper',2)
str_sub('RDeveloper',start=2)

●str_trim : 접두, 접미부분에 연속되는 공백문자를 제거하는 함수
str_trim('        z       ')
nchar('        z       ')
str_length('        z       ')
str_length(str_trim('        z       '))
str_trim('        z       ',side='both') # side = 'both'기본값
str_trim('        z       ',side='left') # side = 'left' 외쪽 공백만 제거
str_trim('        z       ',side='right') # side = 'right'오른쪽 공백만 제거
trimws('        z       ')


text <- c('sql','SQL','Sql100','PLSQL','plsql','R','r','r0','python','PYTHON','pyth0n',
          'python#','100','*100','*','$','^','!','@','#','$','%','(',')','~','?','행복',
          'ㅋㅋㅋㅋ','ㅜㅜㅜㅜ')

●str_extract : 문자열에서 지정된 문자열을 찾는 함수
grep('[[:digit:]]',text,value=T)
text[str_detect(text,'\\d')]
str_extract(text,'[[:digit:]]+')

●str_extract_all : 문자열에서 지정된 문자열을 모두 찾는 함수
str_extract_all(text,'[[:digit:]]{1,}') # 리스트 형식으로 추출
unlist(str_extract_all(text,'[[:digit:]]{1,}'))

text <- 'R is programming language PYTHON is programming language'
str_extract_all(text,'programming')

'R' 카테고리의 다른 글

R - reshape2,cut,히스토그램,상자그림 (0)	2022.01.26
R barplot, 산점도 (0)	2022.01.25
R dplyr, sqldf함수 (0)	2022.01.21
R dplyr, rank 함수 (0)	2022.01.20
R subset,ddply 함수 (0)	2022.01.19

현재글R - wordcloud, grep, 정규표현식,stringr

배움에 대한 열정과 끈기

2024년을 맞이하여 다시 시작하는 T-Story 공부하는 모든 데이터 분석 지식들과 IT지식들을 하나씩 꾸준히 기록.

빅데이터분석가, 빅데이터, 데이터베이스, 프로그래밍, 전처리, 데이터분석, 빅데이터분석, R프로그래밍, R, SQL, 오라클, 데이터, SQLD, Python, 머신러닝, 딥러닝, 파이썬, 자격증, 통계, It,

Today :
Yesterday :

배움에 대한 열정과 끈기

R - wordcloud, grep, 정규표현식,stringr

★wordcloud

★ grep

'R' 카테고리의 다른 글

'R'의 다른글

티스토리툴바

« 2025/06 »
일	월	화	수	목	금	토
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

R - wordcloud, grep, 정규표현식,stringr

★wordcloud

★ grep

'R' 카테고리의 다른 글

'R'의 다른글

관련글

티스토리툴바