2. (75 points) Download the file "trans.txt" and implement a streaming algorithm for...

Question

2. (75 points) Download the file "trans.txt" and implement a streaming algorithm for mining the top-$k$ most frequent patterns. In the data file "trans.txt", every line is a transaction represented by a set of item ids, and the largest transaction contains 15 items.

a) (15 points) Prove that to mine the top-$k$ most frequent patterns, we do not need to consider patterns of size greater than $m = \log_2(k+1)$.

b) (60 points) Apply the idea of the Misra-Gries Algorithm to mine approximate frequent patterns by scanning each transaction only once. Specifically, implement your algorithm as follows.

(1) Maintain at most $C$ counters. Each counter is a (key, value) pair where "key" represents a specific pattern and "value" indicates the corresponding (approximate) support of the pattern.

(2) When reading a transaction, enumerate all its subsets of size at most $m$. Suppose that for the $i$-th transaction we have $L_i$ such valid subsets; clearly, $L_i = \sum_{j=1}^{\min(l_i, m)} \binom{l_i}{j}$, where $l_i$ is the size of the $i$-th transaction. Transform the $i$-th transaction into a stream of $L_i$ subsets (the order can be arbitrary) and use the Misra-Gries Algorithm to count each subset's number of appearances (support).

b.1) (8 points) Suppose in total we have $M$ transactions. Let $L = \sum_{i=1}^{M} L_i$. Suppose $f_S$ is the real support of a pattern $S$ and $\hat{f}_S$ is the approximate support maintained by your Misra-Gries Algorithm. Prove that for any pattern $S$, we have $f_S \ge \hat{f}_S \ge f_S - \frac{L}{C+1}$.

b.2) (7 points) Suppose $S_k$ is the real $k$-th most frequent pattern. Let $\hat{f}_k$ be the $k$-th largest (approximate) support obtained by your Misra-Gries Algorithm. Prove that $f_{S_k} \ge \hat{f}_k \ge f_{S_k} - \frac{L}{C+1}$.

b.3) (15 points) Since we only have the approximate supports of patterns obtained by our Misra-Gries Algorithm, we can only use such approximate supports to return approximate top-$k$ patterns. We hope to collect all the true top-$k$ patterns by returning a collection of patterns $A = \{S \mid \hat{f}_S \ge t\}$, where $t$ is a threshold for filtering out nonfrequent patterns. Prove that, for $t = \hat{f}_k - \frac{L}{C+1}$:

(1) [statement garbled in the source; given the stated goal, presumably that $A$ contains all true top-$k$ patterns]

(2) The minimum support of patterns in $A$ satisfies $\mathrm{minSup}(A) = \min_{S \in A} f_S \ge f_{S_k} - \frac{2L}{C+1}$. (9 points)

b.4) (30 points) Set $k = 500$. Run your Misra-Gries Algorithm on the "trans.txt" dataset and report the values of $L$ and $\mathrm{minSup}(A)$ when setting $C = 500000, 750000, 1000000$. To compute $\mathrm{minSup}(A)$, you can refer to the file
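
The procedure described in part b) can be sketched in Python as follows. This is a minimal illustration, not a reference solution. Several things are assumptions that the question does not state: the function names (`max_pattern_size`, `subsets_up_to`, `misra_gries_patterns`, `candidate_set`, `read_transactions`) are made up for this sketch, `m` is rounded down to an integer, patterns are stored as sorted tuples, and "trans.txt" is assumed to hold whitespace-separated item ids on each line.

```python
from itertools import combinations


def max_pattern_size(k):
    """Largest integer m with 2**m <= k + 1 (assumption: m is rounded down)."""
    return (k + 1).bit_length() - 1


def subsets_up_to(items, m):
    """Yield each non-empty subset of `items` of size at most m as a sorted tuple."""
    items = sorted(set(items))
    for size in range(1, min(len(items), m) + 1):
        yield from combinations(items, size)


def misra_gries_patterns(transactions, m, C):
    """One pass over the subset stream with at most C counters.

    Returns (counters, L): the approximate supports and the stream length L.
    """
    counters = {}
    L = 0
    for transaction in transactions:
        for pattern in subsets_up_to(transaction, m):
            L += 1
            if pattern in counters:
                counters[pattern] += 1
            elif len(counters) < C:
                counters[pattern] = 1
            else:
                # Table is full: decrement every counter and drop the zeros.
                # The incoming pattern itself is not stored.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
    return counters, L


def candidate_set(counters, k, C, L):
    """Patterns with approximate support >= t, where t = f_hat_k - L / (C + 1)."""
    ranked = sorted(counters.values(), reverse=True)
    if len(ranked) < k:
        raise ValueError("fewer than k patterns survived in the counters")
    t = ranked[k - 1] - L / (C + 1)
    return {p: c for p, c in counters.items() if c >= t}


def read_transactions(path):
    """Assumed file format: one transaction per line, ids split on whitespace."""
    with open(path) as handle:
        for line in handle:
            items = line.split()
            if items:
                yield items
```

A run for part b.4 would then look like `counters, L = misra_gries_patterns(read_transactions("trans.txt"), max_pattern_size(500), C)` followed by `candidate_set(counters, 500, C, L)`. The sketch does not compute minSup(A), because that requires the true supports, which the question says come from a separate file.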
