Natural Language Preprocessing

yogae 2020. 5. 24. 19:12

Install

Install nltk with pip.

pip install nltk
pip install numpy

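Installing nltk with pip alone is not enough to use it. NLTK's datasets and models have to be installed separately. Running nltk.download() in a Python shell opens a UI where you can choose the additional packages to install. I find the command line more convenient, so I installed them with the command below.

sudo python -m nltk.downloader -d /usr/local/share/nltk_data "<package name>"

/usr/local/share/nltk_data is where NLTK data packages are stored on a Mac. If you use another OS, see the "Installing NLTK Data" page of the NLTK 3.5 documentation. The command above lets you install only the NLTK packages you actually use. If you are not sure which packages you need, use popular as the package name.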

Tokenization

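Install the punkt package. Newer NLTK releases (3.9 and later) look for punkt_tab instead, so install that package if word_tokenize reports it as missing.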

sudo python -m nltk.downloader -d /usr/local/share/nltk_data punkt

When tokens are split at the word level, this is called word tokenization. Tokenizing the sentence "It's an amazing experience here." by word gives the result below.

from nltk.tokenize import word_tokenize
text = "It's an amazing experience here."
tokenized = word_tokenize(text) # ['It', "'s", 'an', 'amazing', 'experience', 'here', '.']
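To see why a dedicated tokenizer helps, the sketch below compares plain whitespace splitting with a simple regular expression. It uses only the Python standard library. The pattern is my own assumption for illustration and is not the rule set word_tokenize uses, which is why the apostrophe comes out differently from the NLTK result above.

```python
import re

text = "It's an amazing experience here."

# Whitespace splitting leaves punctuation glued to the neighboring word.
whitespace_tokens = text.split()

# Illustrative pattern (an assumption, not NLTK's rule set):
# runs of word characters, or any single non-space punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ["It's", 'an', 'amazing', 'experience', 'here.']
print(regex_tokens)       # ['It', "'", 's', 'an', 'amazing', 'experience', 'here', '.']
```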

Next, let's look at how to tokenize at the sentence level.

from nltk.tokenize import sent_tokenize
text="I am actively looking for Ph.D. students. and you are a Ph.D student."
print(sent_tokenize(text)) # ['I am actively looking for Ph.D. students.', 'and you are a Ph.D student.']

NLTK does not simply treat every period as a sentence boundary, so it recognizes Ph.D. as a word inside the sentence and splits the two sentences correctly.
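For contrast, here is roughly what happens with a naive rule that ends a sentence at every period followed by whitespace. The rule is an assumption made up for this comparison, not something NLTK does.

```python
import re

text = "I am actively looking for Ph.D. students. and you are a Ph.D student."

# Naive rule (an assumption for illustration): a sentence ends at every
# period that is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", text)

print(naive_sentences)
# ['I am actively looking for Ph.D.', 'students.', 'and you are a Ph.D student.']
```

The naive rule cuts the first sentence in two right after Ph.D., which is the mistake sent_tokenize avoids.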

Checking Stopwords

Word tokens that carry little meaning are removed. In English, for example, a, an, the, i, and you are stopwords.

# requires the stopwords package, installed the same way as punkt
from nltk.corpus import stopwords
nltk_stopwords = stopwords.words("english")

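stopwords.words("english") returns the English stopwords. The list contains lowercase words only, so each token is converted to lowercase before the string comparison.

word_token = [word for word in tokenized if word.lower() not in nltk_stopwords]

word_token now holds only the strings that remain after the stopwords are removed.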
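The effect of the lowercase conversion is easy to check with a small example. The stopword set and token list below are hypothetical and hand-written for this sketch; the real NLTK list is much longer.

```python
# Tiny hand-written stopword set, used only for illustration.
toy_stopwords = {"a", "an", "the", "i", "you", "it"}

tokens = ["It", "is", "an", "amazing", "experience", "for", "you"]

# Without lowercasing, "It" does not match the lowercase entry "it".
case_sensitive = [w for w in tokens if w not in toy_stopwords]

# Lowercasing each token before the lookup removes it as intended.
case_insensitive = [w for w in tokens if w.lower() not in toy_stopwords]

print(case_sensitive)    # ['It', 'is', 'amazing', 'experience', 'for']
print(case_insensitive)  # ['is', 'amazing', 'experience', 'for']
```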

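Stemming / Lemmatization

Stemming and lemmatization both reduce the number of distinct words in a document by mapping related word forms to a single word.

Lemmatization finds the root word (lemma) behind different surface forms and so reduces the word count. am, are, and is are spelled differently, but their base form is be.

Lemmatization works best when morphological parsing is done first. A morpheme is "the smallest unit that carries meaning". There are two kinds of morphemes: the stem and the affix. The stem is the core part of a word that holds its meaning, and the affix is the part that adds extra meaning to the word.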

# Stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()

stem_example1 = ["cook", "cooked", "cooking"]
ps_stem1 = [ps.stem(w) for w in stem_example1] #['cook', 'cook', 'cook']

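from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

stem_example2 = ["friend", "friends", "friendship", "friendships"]
ps_stem2 = [ps.stem(w) for w in stem_example2] #['friend', 'friend', 'friendship', 'friendship']
ls_stem2 = [ls.stem(w) for w in stem_example2] # Lancaster strips suffixes more aggressively than Porter, so its stems can differ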

sudo python -m nltk.downloader -d /usr/local/share/nltk_data wordnet

# Lemmatization
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
lem.lemmatize("is", pos="v") # 'be'
lem.lemmatize("better", pos="a") # 'good'
lem.lemmatize("friendship", pos="n") # 'friendship'
# pos values
# n: noun
# v: verb
# a: adjective
# r: adverb
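The difference between the two approaches can be sketched without NLTK: stemming applies suffix rules, while lemmatization looks words up in a dictionary. The suffix list and lemma table below are hypothetical toys written for this post and are far simpler than the Porter algorithm or WordNet.

```python
# Hypothetical suffix list and lemma table, written only for this sketch.
TOY_SUFFIXES = ("ing", "ed", "s")
TOY_LEMMAS = {"am": "be", "are": "be", "is": "be", "better": "good"}


def toy_stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


print([toy_stem(w) for w in ["cook", "cooked", "cooking"]])  # ['cook', 'cook', 'cook']
print([toy_stem(w) for w in ["am", "are", "is"]])            # ['am', 'are', 'is']
print([toy_lemmatize(w) for w in ["am", "are", "is"]])       # ['be', 'be', 'be']
```

Suffix rules cannot connect am, are, and is to be, which is why lemmatization needs a dictionary and a part-of-speech hint.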

Integer Encoding

Computers handle numbers better than text. For this reason, natural language processing has several techniques for converting text into numbers. This post looks at two of them: CountVectorizer and the TF-IDF transform.

CountVectorizer

Counts how often each word token occurs in each row (document).

from sklearn.feature_extraction.text import CountVectorizer
# reviews: a list of review strings (not shown in this post)
# lowercases the text, removes English stop words, and builds vectors from the remaining words
cv = CountVectorizer(lowercase=True, stop_words="english")
cv_reviews = cv.fit_transform(reviews)
cv_reviews.toarray()
# array([[0, 0, 2, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0],
#        [1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]], dtype=int64)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
u_words = cv.get_feature_names_out().tolist()
print(u_words)
# ['appetizer', 'barbecue', 'delicious', 'favor', 'food', 'great', 'just', 'korean', 'large', 'soho', 'steak', 'tables', 'tartare']
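What the count matrix contains can be reproduced by hand with the standard library. This is a minimal sketch, not how scikit-learn is implemented: the two documents are hypothetical (they are not the reviews used above), tokens are split on whitespace, and no stop words are removed.

```python
from collections import Counter

# Hypothetical documents, not the reviews used in the post.
docs = ["delicious food delicious steak", "korean barbecue steak"]


def count_vectorize(documents):
    """Return (sorted vocabulary, one list of counts per document)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


vocabulary, rows = count_vectorize(docs)
print(vocabulary)  # ['barbecue', 'delicious', 'food', 'korean', 'steak']
print(rows)        # [[0, 2, 1, 0, 1], [1, 0, 0, 1, 1]]
```

Each column corresponds to one vocabulary word in alphabetical order, the same layout as the array and feature names printed above.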

TF-IDF Transform

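After counting token frequencies per row (document), a weight is computed for each word. Words that appear in many documents get a lower weight, and words that appear in few documents get a higher weight.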

# tf: term frequency
# idf: inverse document-frequency

tf-idf(t, d) = tf(t, d) * idf(t)

# n: the total number of documents
# df(t): the document frequency of t
# if smooth_idf=False
idf(t) = log [ n / df(t) ] + 1

# if smooth_idf=True - prevents division by zero
idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1
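The two idf formulas can be checked numerically with a few lines of Python. The sketch assumes the natural logarithm and a hypothetical corpus of two documents; the function name idf is my own. As far as I know, scikit-learn also L2-normalizes each row by default, which this sketch leaves out.

```python
import math


def idf(n, df, smooth_idf=True):
    """Inverse document frequency using the natural logarithm."""
    if smooth_idf:
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1


n = 2  # hypothetical corpus of two documents
for df in (1, 2):
    print(df, round(idf(n, df, smooth_idf=False), 4), round(idf(n, df, smooth_idf=True), 4))
# 1 1.6931 1.4055
# 2 1.0 1.0
```

A word found in only one of the two documents (df = 1) gets a larger idf than a word found in both (df = 2), which is the weighting described above.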

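import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# new_stopwords (custom stop word list) and reviews are defined elsewhere; ps is the PorterStemmer created above
tfidf1 = TfidfVectorizer(lowercase=True, stop_words=new_stopwords, ngram_range=(1, 2)).build_analyzer()
def tfidf_stem(x):
    # apply the Porter stemmer to each token (unigram or bigram string) from the base analyzer
    return [ps.stem(w) for w in tfidf1(x)]
tfidf_stem_vectorizer = TfidfVectorizer(analyzer=tfidf_stem)
tfidf_reviews = tfidf_stem_vectorizer.fit_transform(reviews)
tfidf_u_words = tfidf_stem_vectorizer.get_feature_names_out()
pd.DataFrame(tfidf_reviews.toarray(), columns=tfidf_u_words)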

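ngram_range sets which n-grams are used as tokens. An n-gram is a sequence of n consecutive words, so ngram_range=(1, 2) uses both single words and two-word sequences.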
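A minimal sketch of how word n-grams can be generated from a token list is shown below. The helper word_ngrams is hypothetical and written for this post; it is not scikit-learn's implementation.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for n in the given inclusive range."""
    min_n, max_n = ngram_range
    result = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the list.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


print(word_ngrams(["korean", "barbecue", "steak"]))
# ['korean', 'barbecue', 'steak', 'korean barbecue', 'barbecue steak']
```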

Reference

1) Tokenization - 딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning)
