[Theory] Text Visualization - LDA Topic Modeling


toraa 2025. 1. 22. 14:07

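Topic extraction: uncovers the hidden (latent) topics in a set of documents

Documents are represented as vectors and then grouped by topic

Preprocessing has to be completed up through vectorization before modeling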

Uses scikit-learn's LatentDirichletAllocation together with the pyLDAvis module

!pip install pyldavis

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html


https://github.com/bmabey/pyLDAvis


LDA modeling with count (term-frequency) vectors

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


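# store_df['text_tag'] is assumed to hold the preprocessed, space-joined tokens of each document
# max_df=0.9: drop tokens found in more than 90% of documents; min_df=5: drop tokens found in fewer than 5 documents
cvec = CountVectorizer(max_df=0.9, min_df=5)
tf = cvec.fit_transform(store_df['text_tag'])  # fit_transform vectorizes the text into a count matrix
model = LatentDirichletAllocation(n_components=4,  # number of topics (number of groups)
                                  max_iter=10,  # number of iterations (larger values take longer)
                                  learning_method='online',  # online updates in mini-batches
                                  random_state=0,  # random seed for reproducibility
                                  )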

model.fit(tf)

Call fit() and pass in the vectorized text data

→ The topics are learned and the grouping is carried out
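To see the result of the grouping per document, one option is to look at the document-topic distribution returned by fit_transform() (or transform() on an already fitted model). The sketch below uses a made-up toy corpus and only two topics, so the names and sizes here are assumptions for illustration, not the setup used in this post.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the post's data)
docs = [
    "coffee latte cafe coffee",
    "cafe dessert coffee cake",
    "parking seat wide parking",
    "seat wide parking lot",
]

X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,
                                max_iter=10,
                                learning_method="online",
                                random_state=0)

# Rows are documents, columns are topics; each row sums to 1
doc_topic = lda.fit_transform(X)

# Index of the highest-weighted topic for each document
dominant_topic = doc_topic.argmax(axis=1)

print(np.round(doc_topic, 3))
print(dominant_topic)
```

Unlike a hard clustering, every document gets a weight for every topic; taking argmax is just a simple way to assign one label per document.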

model.components_


array([[2.61397759e-01, 2.60956015e-01, 2.27356715e+03, ...,
        5.48652641e+00, 2.79002745e-01, 2.56603783e-01],
       [2.59358247e-01, 2.58084851e-01, 2.60812946e-01, ...,

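Shows how much influence each token has within each topic

Each value is a weight for how much that token contributes to the topic

The tokens with the highest weights are the ones used to interpret a topic

Shape: [topics, weight of each token]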

https://scikit-learn.org/1.6/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html


Visualizing with pyLDAvis

import pyLDAvis
import pyLDAvis.lda_model

pyLDAvis.enable_notebook()
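# Required so the visualization renders inline in a Jupyter notebook


# Build the visualization data from the LDA model, the count matrix, and the vectorizer
pl = pyLDAvis.lda_model.prepare(model, tf, cvec)
# model: the fitted LDA model, tf: the vectorized text data, cvec: the fitted vectorizer (supplies the vocabulary)

# Save as an interactive HTML file (the ./model directory must already exist)
pyLDAvis.save_html(pl, './model/lda_cvec.html')
# Render in the notebook with display()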
pyLDAvis.display(pl)

Because the chart is not drawn with matplotlib,

it is saved as an interactive web page (HTML)
