A Classical Apriori Algorithm for Mining Association Rules
What is an Association Rule?
Given a set of transactions {t1, t2, ..., tn}, where each transaction ti is a set of items {Xi1, ..., Xim}.
An association rule is an expression A ==> B, where A and B are sets of items and A ∩ B = ∅.
Meaning: transactions that contain A tend to also contain B.
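Two Thresholds
Measurement of rule strength in a relational transaction database:
A ==> B [support, confidence]
support(AB) = (number of transactions containing every item of A and B) / (total number of transactions)
confidence(A ==> B) = support(AB) / support(A)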
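The two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming transactions are held in memory as sets; the function names and the toy data are hypothetical and do not come from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A and B together) / support(A).

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy data, not taken from the slides.
toy = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "diapers"},
]

print(support(toy, {"bread", "butter"}))               # 2 of 4 transactions
print(confidence(toy, {"bread", "butter"}, {"milk"}))  # 1 of those 2 has milk
```

Support is expressed as a fraction of the database here, so it can be compared directly with a percentage threshold such as min_sup.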
Strong Rules
We are interested in strong associations, i.e., support ≥ min_sup and confidence ≥ min_conf.
Examples:
bread & butter ==> milk [support=5%, confidence=60%]
beer ==> diapers [support=10%, confidence=80%]
Mining Association Rules
Mining association rules from a large data set of items can improve the quality of business decisions.
For a supermarket with a large collection of items, typical business decisions include what to put on sale, how to design coupons, and how to place merchandise on shelves to maximize profit.
Mining Association Rules (2)
There are two main steps in mining association rules:
1. Find all combinations of items whose transaction support reaches the minimum support (frequent itemsets).
2. Generate association rules from the frequent itemsets.
Most existing algorithms focus on the first step because it requires a great deal of computation, memory, and I/O, and has a significant impact on overall performance.
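The second step is comparatively cheap once the supports are known. The sketch below shows one straightforward way to do it, assuming the first step has produced a dictionary of support counts for every frequent itemset; the function name `generate_rules` and the sample counts are hypothetical.

```python
from itertools import combinations


def generate_rules(support_count, min_conf):
    """Return (antecedent, consequent, confidence) for every rule A ==> B
    whose confidence is at least `min_conf`.

    `support_count` maps frozenset -> number of supporting transactions.
    It must hold every frequent itemset, which means it also holds all
    of their non-empty subsets.
    """
    rules = []
    for itemset, count in support_count.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty A and a non-empty B
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = count / support_count[a]
                if conf >= min_conf:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical support counts, for illustration only.
counts = {
    frozenset("A"): 4,
    frozenset("B"): 3,
    frozenset("AB"): 3,
}
for a, b, conf in generate_rules(counts, min_conf=0.8):
    print(sorted(a), "==>", sorted(b), conf)
```

No database scan is needed in this step, because confidence is a ratio of two counts that the first step has already collected.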
The Classical Mining Algorithm
Apriori (Agrawal et al. '94)
At the first iteration, scan all the transactions and count the number of occurrences of each item. This yields the frequent 1-itemsets, L1.
At the k-th iteration, form the candidate set Ck: the k-itemsets whose every (k-1)-item subset is in Lk-1.
Scan the database and count the number of occurrences of each candidate k-itemset; the candidates that reach the minimum support form Lk.
In total, x database scans are needed for x levels.
Moving 1 level at a time (Apriori) through an itemset lattice
[Figure: itemset lattice with its levels labelled Level 1, Level 2, Level 3, ..., Level k, Level (k+1), ..., Level x]
The Algorithm Apriori
L1 = {frequent 1-itemsets}
for (k = 2; Lk-1 ≠ ∅; k++) {
    Ck = Apriori_gen(Lk-1);
    for all transactions t in D do
        for all candidates c in Ck contained in t do
            c.count++;
    Lk = {c in Ck | c.count >= minimum support}
}
Result = ∪k Lk
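The level-wise loop can be sketched in Python as follows. This is a minimal in-memory illustration, not a tuned implementation: the minimum support is passed as an absolute count (`min_count`), and candidate generation is simplified to "union two frequent (k-1)-itemsets and keep the result if all of its (k-1)-subsets are frequent", which is an assumption of this sketch rather than the join formulation of Apriori_gen. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(level)

    k = 2
    while level:  # stop when the previous level is empty
        prev = set(level)
        # Simplified candidate generation (assumption of this sketch):
        # size-k unions of two frequent (k-1)-itemsets whose
        # (k-1)-subsets are all frequent.
        candidates = {
            a | b
            for a in prev for b in prev
            if len(a | b) == k
            and all(frozenset(s) in prev for s in combinations(a | b, k - 1))
        }
        # One database scan per level: count candidates contained in t.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        result.update(level)
        k += 1
    return result


# Hypothetical toy data, for illustration only.
toy = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
for itemset, n in sorted(apriori(toy, 2).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Testing every candidate against every transaction is the simplest possible counting scheme; it keeps the sketch short at the cost of speed on large candidate sets.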
The Algorithm Apriori_gen
Pre: all itemsets in Lk-1
Post: itemsets in Ck
Insert into Ck
Select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
From Lk-1 p, Lk-1 q
Where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
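The self-join can be mirrored in Python by storing each itemset as a sorted tuple. The sketch below pairs itemsets that agree on their first k-2 items and orders the pair so that p's last item comes before q's, which produces each candidate exactly once. The function name `apriori_join` is hypothetical, and the input is the L2 set from the worked example later in these slides, with items written as single letters.

```python
def apriori_join(prev_level):
    """Join L(k-1) with itself.

    Itemsets are sorted tuples of length k-1; the result is a list of
    sorted tuples of length k.
    """
    prev = sorted(set(prev_level))
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                # Same first k-2 items. Because `prev` is sorted and
                # q comes after p, p's last item is smaller than q's.
                candidates.append(p + (q[-1],))
            else:
                # Tuples sharing a prefix are contiguous in sorted
                # order, so no later q can match p.
                break
    return candidates


L2 = [tuple(s) for s in
      "AB AC AD AE AF BC BD BE BF CD CE CF DE EF".split()]
C3 = apriori_join(L2)
print(len(C3))
print(" ".join("".join(c) for c in C3))
```

Sorting once and breaking out of the inner loop at the first prefix mismatch avoids comparing every pair, which is the same benefit the ordered join condition gives the SQL form.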
The prune step
Pre: itemsets in Ck and Lk-1
Post: Ck with every itemset c deleted that has some (k-1)-subset not in Lk-1
forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if (s ∉ Lk-1) then
            delete c from Ck
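A short Python sketch of the prune step follows. It assumes each itemset is a string of single-letter items, which is a convenience of this illustration only; the function name `apriori_prune` is hypothetical. The inputs are the joined C3 and the L2 set from the worked example later in these slides.

```python
from itertools import combinations


def apriori_prune(candidates, prev_level):
    """Keep only the candidates whose (k-1)-subsets all appear in L(k-1)."""
    prev = {frozenset(s) for s in prev_level}
    kept = []
    for c in candidates:
        k = len(c)
        # A single missing (k-1)-subset is enough to discard c.
        if all(frozenset(s) in prev for s in combinations(c, k - 1)):
            kept.append(c)
    return kept


L2 = "AB AC AD AE AF BC BD BE BF CD CE CF DE EF".split()
C3_joined = ("ABC ABD ABE ABF ACD ACE ACF ADE ADF AEF "
             "BCD BCE BCF BDE BDF BEF CDE CDF CEF").split()

C3 = apriori_prune(C3_joined, L2)
print(len(C3))
print(sorted(set(C3_joined) - set(C3)))  # the candidates that were pruned
```

Pruning costs no database access: it relies only on Lk-1, so every candidate it removes is one fewer itemset to count during the next scan.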
An Example
minsup = 20%
Input Dataset
Tid   Items
1     A B C
2     B C E
3     A B C E F
4     A B C D
5     A B C E
6     A B C E F
7     B C D E F
8     A B C
9     A C D E
10    B C E F
L1 = {A, B, C, D, E, F}
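The first pass over this dataset can be reproduced with a few lines of Python. This is a small sketch in which each transaction is written as a string of single-letter items; the variable names are hypothetical.

```python
from collections import Counter

# The ten transactions of the example, one string per transaction.
DATASET = ["ABC", "BCE", "ABCEF", "ABCD", "ABCE",
           "ABCEF", "BCDEF", "ABC", "ACDE", "BCEF"]
MINSUP = 0.20

# 20% of 10 transactions means an item needs at least 2 occurrences.
min_count = MINSUP * len(DATASET)

# Count each item once per transaction that contains it.
item_counts = Counter(item for t in DATASET for item in set(t))
L1 = sorted(item for item, n in item_counts.items() if n >= min_count)

print(dict(sorted(item_counts.items())))
print(L1)
```

Every item occurs in at least two transactions, so all six items pass the 20% threshold.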
An Example (2)
C2 = {AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF}
After counting C2 = {AB(6), AC(7), AD(2), AE(4), AF(2), BC(9), BD(2), BE(6), BF(4), CD(3), CE(7), CF(4), DE(2), DF(1), EF(4)}
L2 = {AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, EF}
An Example (3)
C3 = {ABC, ABD, ABE, ABF, ACD, ACE, ACF, ADE, ADF, AEF, BCD, BCE, BCF, BDE, BDF, BEF, CDE, CDF, CEF}
After pruning C3 = {ABC, ABD, ABE, ABF, ACD, ACE, ACF, ADE, AEF, BCD, BCE, BCF, BDE, BEF, CDE, CEF}
After counting C3 = {ABC(6), ABD(1), ABE(3), ABF(2), ACD(2), ACE(4), ACF(2), ADE(1), AEF(2), BCD(2), BCE(6), BCF(4), BDE(1), BEF(4), CDE(2), CEF(4)}
L3 = {ABC, ABE, ABF, ACD, ACE, ACF, AEF, BCD, BCE, BCF, BEF, CDE, CEF}
An Example (4)
C4 = {ABCE, ABCF, ABEF, ACDE, ACDF, ACEF, BCDE, BCDF, BCEF}
After pruning C4 = {ABCE, ABCF, ABEF, ACEF, BCEF}
After counting C4 = {ABCE(3), ABCF(2), ABEF(2), ACEF(2), BCEF(4)}
L4 = {ABCE, ABCF, ABEF, ACEF, BCEF}
An Example (5)
C5 = {ABCEF}
After counting C5 = {ABCEF(2)}
L5 = {ABCEF}
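The frequent itemsets of the example can be cross-checked without Apriori. The sketch below is a brute-force enumeration, not the Apriori algorithm: it counts every possible itemset over the six items, which is only feasible because the example is tiny. The variable names are hypothetical.

```python
from itertools import combinations

# The ten transactions of the example, one string per transaction.
DATASET = ["ABC", "BCE", "ABCEF", "ABCD", "ABCE",
           "ABCEF", "BCDEF", "ABC", "ACDE", "BCEF"]
MIN_COUNT = 2  # 20% of 10 transactions

transactions = [frozenset(t) for t in DATASET]
items = sorted(set().union(*transactions))

frequent_by_level = {}
for k in range(1, len(items) + 1):
    level = {}
    for combo in combinations(items, k):
        # Count the transactions that contain every item of combo.
        n = sum(1 for t in transactions if set(combo) <= t)
        if n >= MIN_COUNT:
            level["".join(combo)] = n
    if level:
        frequent_by_level[k] = level

for k, level in frequent_by_level.items():
    print(k, len(level), sorted(level))
```

The enumeration finds 6, 14, 13, 5, and 1 frequent itemsets at levels 1 through 5, the same sets as L1 through L5 above. An independent check of this kind is a convenient way to validate an Apriori implementation on small inputs.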
Assignment 1
Work: Write a program that follows the Apriori algorithm to generate the frequent itemsets at each level of the itemset lattice.
Data sets: can be downloaded from the machine "angsila/~nuansri/310214". Run the program with the following minimum support values:
xt10.data ==> minsup = 20%, 15%, and 10%
tr2000.data ==> minsup = 10%, 8%, and 5%
Assignment 1 (2)
Due: Monday, 15 September 2003 (B.E. 2546)
Demonstrate the program and its accompanying documentation in room SD417.
Note: For a given data set, the frequent itemsets at every level of the itemset lattice must be identical, whether they are produced by different programs or by programs that use different data structures. Every student can therefore check the number and contents of the frequent itemsets against classmates' results on the same data set.