頁籤選單縮合
題 名 | 高效率探勘關聯規則之演算法--EFI=An Efficient Algorithm for Mining Association Rules--EFI |
---|---|
作 者 | 黃仁鵬; 藍國誠; | 書刊名 | 資訊管理學報 |
卷 期 | 14:2 2007.04[民96.04] |
頁 次 | 頁139-167 |
分類號 | 312.1 |
關鍵詞 | 資料探勘; 關聯法則; 二階段過濾; Data mining; Association rules; Two phase filtrations; |
語 文 | 中文(Chinese) |
中文摘要 | 資料探勘的技術變得日益重要,也廣泛的應用在商業上的預測以及決策的支援。關聯法則在資料探勘的領域中也扮演相當重要的地位,許多關聯法則演算法不斷被提出、改進,以增進效能或節省記憶體空間;本研究也朝著這個目標,試著改進關聯規則演算法為主要方向。 本研究主要是針對探勘關聯規則QDI演算法的特性及缺點來加以改進,雖然QDI演算法已是很有效率的演算法之一,不過,它還是有兩個最主要的問題。第一,QDI演算法無法探勘交易長度太長的交易資料庫。第二,QDI演算法對記憶體使用率不佳;因此,QDI演算法的實用性大打折扣。 基本於上述理由,本研究提出一個改良QDI演算法產生項目集的核心概念新演算法EFI (An Efficient Approach for Filtering Infrequent Itemsets)。EFI演算法的特色就是二階段過濾的方式,因為該過濾方式可大量減少非高頻項目集的數量,將更能適用於探勘交易長度較長的資料庫,僅需掃描資料庫四次且不需要產生任何候選項目集,即可快速找出關聯規則。另外,EFI演算法也改進ICI-like演算法因儲存大量項目集須耗用龐大記憶體空間的缺點,每筆交易經過二階段過濾機制的處理後,僅會產生最有可能成為高頻的項目集,因此,EFI能大量降低項目表須耗用的記憶體空間,以提升記憶體的使用率。在現實生活中的資料庫容量通常都是大於記憶體容量,為了解決此問題,EFI演算法將選擇採用資料庫分割方式繼續執行探勘任務,每個子資料庫僅需四次I/O動作,不隨著高頻項目集的長度增長而增加I/O次數,以避免耗費過多的I/O時間,也可有效提高執行效率與實用性。 |
英文摘要 | The technology of data mining is more important in recent years, and it is generally applied to commercial forecast and decision supports. Association rules mining algorithms in the field of data mining play the important role. Many of association rules mining algorithms were proposed to improve the efficiency of data mining or save the utility rate of memory. So, our major study tries to improve the efficiency of association rules mining algorithms. In this paper, our major study is to improve the defects of the QDI algorithm. Although QDI algorithm was one of the most efficient algorithms, but it still has two serious problems; in the first place, QDI algorithm can't mine the transactions of databases whose record length is very long; in the second place, QDI algorithm isn't very efficient at utility rate of the memory. Therefore, the QDI algorithm is not very practical. Based on above reasons we propose a new algorithm-EFI (An Efficient Approach for Filtering Infrequent Itemsets) that is improved from QDI algorithm. The one of the characters of EFI algorithm is the two phrase filtrations which can reduce lots of non-frequent itemsets and is very suitable to mine the transactions of databases whose record length is very long. To find association rules quickly the EFI algorithm only scans database four times and doesn't generate any candidate itemset in mining process. Besides, the EFI algorithm also improves the defect of the ICI-like algorithms that need lots of memory spaces to store lots of sub-itemsets which are decomposed from the transaction records of database. However, the EFI algorithm uses the two phrase filtrations to filter out lots of non-frequent itemsets; it only generates the itemsets which are the most possible to be the frequent itemsets. So, the EFI algorithm can decrease a large number of non-frequent itemsets and increase the utility rate of memory. The size of the databases in the real world is always greater than the size of the memory. In order to solve this problem, the EFI algorithm divides a large database into many sub-databases and mines association rules from those sub-databases. The EFI algorithm only scans database four times and will not be affect by the length of frequent itemsets. The EFI algorithm avoids wasting a lot of I/O time and increases the efficiency and the practicability in application. |
本系統中英文摘要資訊取自各篇刊載內容。