Title | 以基因演算法為基礎建立自動化文件分類模式 = A Genetic Algorithm Based Approach for Text Categorization |
---|---|
Authors | 胡雅涵; 黃正魁; 楊承翰 | Journal | 資訊管理學報 (Journal of Information Management) |
Volume/Issue | 21:3, 2014.07 [ROC 103.07] |
Pages | 305-339 |
Classification No. | 028.7 |
Keywords | Document categorization; Genetic algorithm; Feature selection; Classifier |
Language | Chinese |
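Chinese Abstract (English translation) | The rapid growth of digital information raises the cost of finding information, and how to classify and manage documents effectively has become an important research issue. The importance of document categorization research is therefore growing. Document categorization suffers from excessively high feature dimensionality, so we use a genetic algorithm (GA) as the basis for selecting feature terms from documents. By designing the GA chromosome over the document feature vector and tuning the GA parameter settings, a classifier selects feature terms from the training data, and a document categorization model is then built. The proposed GA-based Feature Selection (GAFS) approach lets each individual classifier learn on its own toward an optimum, which improves the performance of each classifier and yields a document categorization model with optimized classification results. In the experiments, this study uses the WebKB web document dataset to evaluate the categorization models built with GAFS and compares them with the conventional approach of training on the full feature set (TOTAL). Six classifiers are used: naïve Bayesian classifier, decision tree, classification and regression tree, random forest, support vector machine, and k nearest neighbor. The results show that the proposed GAFS method effectively improves the classification performance of every classification model, confirming that the GA-based GAFS automated document categorization model clearly outperforms TOTAL. As the feature dimensionality grows, GAFS still improves classification performance and maintains stable classification accuracy. |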
English Abstract | Purpose: Digital data has accumulated rapidly, significantly increasing the cost of searching for information in the data source. How to manage documents effectively (i.e., text categorization, TC) has become an important research issue. However, in TC, a huge number of index terms are selected to represent document vectors, resulting in poor prediction outcomes. This study proposes a genetic algorithm based feature selection (GAFS) method to optimize the selection of index terms. Design/methodology/approach: Before training classifiers, GAFS selects a reduced set of index terms that can optimize the prediction accuracy of the classifiers. In the experimental study, the WebKB dataset was used to evaluate the performance of GAFS. A total of six well-known classification techniques were considered: naïve Bayesian classifier (NB), decision tree (DT), classification and regression tree (CART), random forest (RF), support vector machine (SVM), and k-nearest neighbor (kNN). The baseline model, denoted TOTAL, uses the complete set of index terms in all experiments. Findings: The results show that the proposed GAFS method outperforms the TOTAL method. The performance of the kNN and RF classifiers deteriorates as the number of features increases. Across different numbers of features, the SVM, NB, and DT classifiers perform stably, but the CART classifier is relatively unstable. Research limitations/implications: This study considers only the WebKB dataset. Future research is recommended to include other well-known datasets in the TC domain. Other feature selection methods could also be considered in the experimental evaluation. Practical implications: Two practical implications are provided. First, this study reveals that different parameter settings in the genetic algorithm (GA) can significantly affect the performance of feature selection in TC. Second, the proposed GAFS method allows users to systematically construct a robust classifier for TC. Originality/value: This paper investigates the influence of GA parameters on feature selection in TC. It advances the literature on choosing GA parameters and classification techniques to optimize TC performance. |
The Chinese and English abstracts in this system are taken from the content published with each article.
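
The abstracts describe GA-based feature selection only at a high level: a chromosome marks which index terms are kept, and a classifier's accuracy on the selected terms drives the search. The following is a minimal sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`make_toy_data`, `subset_accuracy`, `ga_feature_selection`), the GA settings (population size, crossover and mutation rates, tournament selection, elitism), the synthetic data, and the nearest-centroid classifier, which stands in for the six classifiers named in the abstract. Seeding the population with the all-ones chromosome (the equivalent of TOTAL) is also an illustrative choice.

```python
import random


def make_toy_data(n=60, n_features=10, seed=1):
    """Hypothetical synthetic data: only features 0 and 1 carry class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for k in range(n):
        label = k % 2
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 2.0 * label
        row[1] -= 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


def subset_accuracy(X_train, y_train, X_val, y_val, mask):
    """Validation accuracy of a nearest-centroid classifier (a stand-in,
    not one of the paper's classifiers) using only features with mask bit 1."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty feature set cannot classify anything
    sums, counts = {}, {}
    for x, label in zip(X_train, y_train):
        s = sums.setdefault(label, [0.0] * len(idx))
        for j, i in enumerate(idx):
            s[j] += x[i]
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    correct = 0
    for x, label in zip(X_val, y_val):
        pred = min(
            centroids,
            key=lambda c: sum((x[i] - centroids[c][j]) ** 2
                              for j, i in enumerate(idx)),
        )
        correct += pred == label
    return correct / len(y_val)


def ga_feature_selection(fitness, n_features, pop_size=20, generations=30,
                         crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search binary chromosomes (1 = keep feature) that maximize `fitness`."""
    rng = random.Random(seed)
    # Assumption: start from the full feature set plus random chromosomes.
    pop = [[1] * n_features]
    pop += [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size - 1)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        scored = [(fitness(c), c) for c in pop]

        def tournament():
            a, b = rng.sample(scored, 2)
            return (a if a[0] >= b[0] else b)[1]

        next_pop = [best[:]]  # elitism: the best chromosome always survives
        while len(next_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
        best = max(pop, key=fitness)
    return best, fitness(best)
```

A possible way to use the sketch is to split the toy data into training and validation parts, wrap `subset_accuracy` in a one-argument fitness function, and pass it to `ga_feature_selection(fitness, n_features=10)`. Because the all-ones chromosome is in the initial population and the best chromosome is carried forward each generation, the returned score on the validation split is never lower than that of the full feature set. That guarantee holds only for the data used as fitness; a separate held-out test set would be needed to judge generalization.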