Title | 基於文本概念和kNN的跨語種文本過濾 = Cross-Language Text Filtering Based on Text Concepts and kNN |
---|---|
Authors | 蘇偉峰; 李紹滋; 李堂秋; 尤文建 |
Journal | International Journal of Computational Linguistics & Chinese Language Processing |
Volume/Issue | 7:1, 2002.02 [Minguo 91.02] |
Pages | 79-90 |
Classification no. | 312.13 |
Keywords | Classifiable sememe (可分義原); Vector space (向量空間); kNN; Text representation (文本表示); HowNet (知網) |
Language | Chinese |
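Abstract (translated from Chinese) | This paper presents a model that filters, from a large volume of Chinese or English information, the documents that match a user's interests. The texts the user is interested in are represented by a cluster of vectors in the classifiable-sememe vector space. Each text to be processed is likewise represented as a vector in that space and is compared with its k nearest vectors to decide whether the text should be presented to the user. Experiments show that this is a fairly good filtering method. |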
English abstract | The WWW is increasingly being used as a source of information. The volume of information is accessed by users using direct manipulation tools. Obviously, we would like a tool that keeps the texts we want and removes the texts we do not want from the large flow of information reaching us. This paper describes a module that sifts through the large number of texts retrieved by the user. The module is based on HowNet, a knowledge dictionary developed by Mr. Zhendong Dong. In this dictionary, the concept of a word is divided into sememes. In the philosophy of HowNet, all concepts in the world can be expressed by a combination of more than 1500 sememes. The sememe is a very useful concept for settling the problem of synonymy, which is the most difficult problem in text filtering. We divided the set of sememes into two subsets: classifiable sememes and unclassifiable sememes. Classifiable sememes are those that are more useful in distinguishing a document's class from other documents. Unclassifiable sememes are those that occur similarly across all documents. The classifiable set includes about 800 sememes. We used these 800 classifiable sememes to build the Classifiable Sememes Vector Space (CSVS). A text is represented as a vector in the CSVS after the following steps: 1. Text preprocessing: determine the language of the text and apply processing appropriate to that language. 2. Part-of-speech tagging |
The Chinese and English abstracts in this system are taken from the content of each published article.
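
The filtering idea described in the abstracts can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the toy word-to-sememe lexicon (`TOY_LEXICON`, not real HowNet data), the five-entry classifiable set (`CLASSIFIABLE`), the use of cosine similarity, and the decision rule (mean similarity to the k nearest interest vectors compared against a hypothetical `threshold`). The abstract does not state which similarity measure or decision rule the authors used.

```python
import math
from collections import Counter

# Hypothetical toy lexicon mapping words to sememes (NOT real HowNet data).
# Synonyms share sememes, so they land on the same axes of the vector space.
TOY_LEXICON = {
    "doctor": ["human", "medical"],
    "physician": ["human", "medical"],
    "hospital": ["place", "medical"],
    "medicine": ["medical", "drug"],
    "football": ["sport", "exercise"],
    "soccer": ["sport", "exercise"],
    "match": ["sport", "compete"],
}

# Hypothetical classifiable sememes; their order fixes the vector axes.
# Sememes outside this list (e.g. "human", "place") are ignored.
CLASSIFIABLE = ["medical", "drug", "sport", "exercise", "compete"]


def to_vector(words):
    """Count classifiable sememes over the words of a text."""
    counts = Counter(s for w in words for s in TOY_LEXICON.get(w, []))
    return [float(counts[s]) for s in CLASSIFIABLE]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_score(vec, interest_vectors, k):
    """Mean similarity to the k most similar interest vectors (assumed rule)."""
    top = sorted((cosine(vec, p) for p in interest_vectors), reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def should_present(words, interest_vectors, k=2, threshold=0.5):
    """Present the text if its kNN score reaches the assumed threshold."""
    return knn_score(to_vector(words), interest_vectors, k) >= threshold


# Texts the user marked as interesting, stored as sememe vectors.
INTEREST = [
    to_vector(["doctor", "hospital"]),
    to_vector(["medicine", "hospital"]),
    to_vector(["doctor", "medicine"]),
]

if __name__ == "__main__":
    print(should_present(["physician", "medicine"], INTEREST))
    print(should_present(["soccer", "match"], INTEREST))
```

In this sketch "physician" never appears in the interest texts, yet it maps to the same sememes as "doctor", which is how a sememe-based representation could sidestep the synonym problem mentioned in the English abstract.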