[Theory] Morphological Analysis

toraa 2025. 1. 21. 14:16

Morphological analyzers

mecab
Okt
Kkma
Hannanum
Komoran
kiwipiepy

Installing mecab and kiwipiepy

https://python-mecab-ko.readthedocs.io/en/latest/

https://bab2min.github.io/kiwipiepy/v0.16.2/kr/

!pip install python-mecab-ko kiwipiepy
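
After installing, it can be useful to confirm that both packages are importable before running the rest of the post. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and the module names `mecab` and `kiwipiepy` are taken from the import statements used later in this post.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Module names as imported later in this post
for name in ('mecab', 'kiwipiepy'):
    status = 'found' if is_installed(name) else 'missing'
    print(f'{name}: {status}')
```

If either line reports "missing", rerun the pip command above in the same environment as the notebook kernel.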

1. mecab

import pandas as pd
from mecab import MeCab
import re


df = pd.read_csv('./data/reviews.csv')
mecab = MeCab()

text = df.loc[0,'text']
text

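Tokenizing by part-of-speech tag

# Each token is paired with its tag in a tuple
kwords = mecab.pos(text)
kwords

pos: splits the text into morphemes and labels each one with its part-of-speech tag

Selecting the desired parts of speech

## Keep only the desired tags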
use_tags = ['NNP', 'NNG', 'NP', 'IC', 'MAG']
kwords = [w[0].lower() for w in kwords if w[1] in use_tags]
rnew_text = ' '.join(kwords)
rnew_text
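
The tag filter can be tried without installing an analyzer. The sketch below is only an illustration: the (form, tag) pairs are hand-written, copied from the Kiwi output shown later in this post rather than produced by mecab, and `filter_by_tag` is a hypothetical helper name. Using a set instead of a list for the tags is a small convenience for membership tests, not something the original code requires.

```python
# Hand-written (form, tag) pairs standing in for analyzer output (assumption)
sample_pairs = [('좋', 'VA'), ('은', 'ETM'), ('컨텐츠', 'NNG'), ('들', 'XSN')]

USE_TAGS = {'NNP', 'NNG', 'NP', 'IC', 'MAG'}

def filter_by_tag(pairs, use_tags=USE_TAGS):
    """Keep the surface form of every pair whose tag is in use_tags."""
    return [form.lower() for form, tag in pairs if tag in use_tags]

filtered_text = ' '.join(filter_by_tag(sample_pairs))
print(filtered_text)
```

Only the general noun (NNG) survives here; the adjective stem, ending, and suffix are dropped because their tags are not in the list.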

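2. kiwipiepy

from kiwipiepy import Kiwi
kw = Kiwi()

Tokenizing by part-of-speech tag

# Each token and its tag are bundled in an object
kwords = kw.tokenize(text)
kwords

Selecting only the desired parts of speech

## Keep only the desired tags
use_tags = ['NNP', 'NNG', 'NP', 'IC', 'MAG']
kwords = [w[0].lower() for w in kwords if w[1] in use_tags]  # .lower() only affects Latin letters, so it can be omitted for Korean-only text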
rnew_text = ' '.join(kwords)
rnew_text

Decide which parts of speech are meaningful for the analysis, then keep only those tags.
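
One way to make that decision is to look at which words fall under each tag. The sketch below groups forms by tag with the standard library; the pairs are hand-written from the Kiwi output shown further down in this post, and the variable names are illustrative rather than part of either analyzer's API.

```python
from collections import defaultdict

# Hand-written (form, tag) pairs standing in for analyzer output (assumption)
sample_pairs = [('좋', 'VA'), ('은', 'ETM'), ('컨텐츠', 'NNG'), ('들', 'XSN')]

# Collect every form seen under each tag
forms_by_tag = defaultdict(list)
for form, tag in sample_pairs:
    forms_by_tag[tag].append(form)

for tag, forms in sorted(forms_by_tag.items()):
    print(tag, forms)
```

On a real corpus, scanning this listing per tag gives a quick sense of whether a tag mostly carries content words or grammatical glue.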

from kiwipiepy import Kiwi
kw = Kiwi()

# Each token and its tag are bundled in an object
kwords = kw.tokenize(text)
kwords


[Token(form='좋', tag='VA', start=0, len=1),
 Token(form='은', tag='ETM', start=1, len=1),
 Token(form='컨텐츠', tag='NNG', start=3, len=3),
 Token(form='들', tag='XSN', start=6, len=1),
 ...
  Token(form='~', tag='SO', start=188, len=1),
 Token(form=')', tag='SSC', start=189, len=1)]

Each result is a Token object.
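
The printed fields suggest how a Token relates to the input: start and len are character offsets into the original string. The sketch below illustrates this with `FakeToken`, a hypothetical stand-in that mirrors the four printed fields and is not the kiwipiepy class. The `source` string is also an assumption, reconstructed from the offsets of the first four tokens above.

```python
from typing import NamedTuple

class FakeToken(NamedTuple):
    """Hypothetical stand-in with the same four fields as the printed Token."""
    form: str
    tag: str
    start: int
    len: int

# Assumed beginning of the review text, reconstructed from the offsets
source = '좋은 컨텐츠들'

tokens = [
    FakeToken('좋', 'VA', 0, 1),
    FakeToken('은', 'ETM', 1, 1),
    FakeToken('컨텐츠', 'NNG', 3, 3),
    FakeToken('들', 'XSN', 6, 1),
]

# Slice the source text with each token's offsets
spans = [source[t.start:t.start + t.len] for t in tokens]
print(spans)
```

For these four tokens the sliced span equals the form. That is not guaranteed in general, since an analyzer may normalize a form so that it differs from the surface text.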

Selecting only the desired parts of speech

## Keep only the desired tags
use_tags = ['NNP', 'NNG', 'NP', 'IC', 'MAG']
kwords = [w.form for w in kwords if w.tag in use_tags]
rnew_text = ' '.join(kwords)
rnew_text

As before, join() combines the selected tokens into a single string.

3. Comparing mecab and kiwipiepy

# Set pandas display options (no limit on rows or columns)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

def show_tag(text):
    # Tag the same text with both analyzers as (form, tag) pairs
    mc_text = mecab.pos(text)
    kw_text = [(t.form, t.tag) for t in kw.tokenize(text)]
    max_len = max(len(mc_text), len(kw_text))

    # Pad the shorter list with None so both columns have equal length
    mc_text += [None] * (max_len - len(mc_text))
    kw_text += [None] * (max_len - len(kw_text))

    text_set = {
        'Mecab': mc_text,
        'Kiwi': kw_text
    }
    return pd.DataFrame.from_dict(text_set)

show_tag(text)
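
The padding step in show_tag can also be written with itertools.zip_longest from the standard library. The sketch below is an alternative formulation, not the method used in this post. `align_columns` is a hypothetical helper name, and the shorter list contains placeholder pairs rather than real mecab output.

```python
from itertools import zip_longest

def align_columns(left, right, fill=None):
    """Pair items by position, padding the shorter list with fill."""
    return list(zip_longest(left, right, fillvalue=fill))

# Hand-written pairs copied from the Kiwi output shown earlier (assumption)
longer = [('좋', 'VA'), ('은', 'ETM'), ('컨텐츠', 'NNG'), ('들', 'XSN')]
# Placeholder pairs, NOT real analyzer output
shorter = [('placeholder1', 'TAG'), ('placeholder2', 'TAG')]

rows = align_columns(longer, shorter)
for row in rows:
    print(row)
```

Each row is a pair of cells, so the list can be passed to pd.DataFrame with two column names to get the same side-by-side table.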

# Put the display limits back to 60 rows and 60 columns
pd.set_option('display.max_columns', 60)
pd.set_option('display.max_rows', 60)

Cleaning the full dataset

from mecab import MeCab

store_df = pd.read_csv('./data/sentence2_test.csv').dropna()

mecab = MeCab()

# Part-of-speech tags to keep
use_tags = ['NNP', 'NNG', 'NP', 'IC', 'MAG']
# Stopword list
stopwords = ['잘', '안', '너무', '더', '또', 'ㅎㅎ']

# Cleaning function: tokenize, filter by tag, drop stopwords
def cleaning(review):
    # Tokenize together with part-of-speech tags
    kwords = mecab.pos(review)

    # Keep only the desired parts of speech
    kwords = [w[0] for w in kwords if w[1] in use_tags]
    # Remove stopwords
    kwords = [word for word in kwords if word not in stopwords]
    clean = ' '.join(kwords)

    return clean

# Apply the cleaning function to every review
store_df['text_tag'] = store_df['text'].apply(cleaning)
store_df.to_csv('./data/sentence2_tag.csv',index = False)
store_df
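
The cleaning logic can be checked in isolation by passing the tagger in as an argument. The sketch below is a variation on the function above, not a replacement for it. `clean_with` and `fake_pos` are hypothetical names, and `fake_pos` returns fixed hand-written pairs instead of calling mecab.

```python
USE_TAGS = {'NNP', 'NNG', 'NP', 'IC', 'MAG'}
STOPWORDS = {'잘', '안', '너무', '더', '또', 'ㅎㅎ'}

def clean_with(pos_fn, review, use_tags=USE_TAGS, stopwords=STOPWORDS):
    """Tag review with pos_fn, keep wanted tags, drop stopwords, join."""
    if not isinstance(review, str):
        return ''  # guard against NaN or other non-text cells
    kept = [form for form, tag in pos_fn(review)
            if tag in use_tags and form not in stopwords]
    return ' '.join(kept)

def fake_pos(text):
    # Hypothetical stand-in for a tagger: ignores its input and
    # returns fixed, hand-written (form, tag) pairs
    return [('너무', 'MAG'), ('좋', 'VA'), ('은', 'ETM'), ('컨텐츠', 'NNG')]

cleaned = clean_with(fake_pos, '너무 좋은 컨텐츠')
print(cleaned)
```

With a real analyzer, the same function would be called as clean_with(mecab.pos, review). The adverb 너무 passes the tag filter (MAG) but is then removed by the stopword list, which is why both steps are needed.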
