AI and Deep Learning in Python
Current location: Home > Workshops > AI and Deep Learning in Python

Python 5 - FP-growth Algorithm and FP-Tree

Posted:2019-06-08 21:54:47 Click:4050

FP-growth algorithm


Have you ever gone to a search engine, typed in a word or part of a word, and the search engine automatically completed the search term for you? Perhaps it recommended something you didn’t even know existed, and you searched for that instead. This requires a way to find frequent itemsets efficiently. FP-growth algorithm find frequent itemsets or pairs, sets of things that commonly occur together, by storing the dataset in a special structure called an FP-tree.


The FP-Growth Algorithm, proposed by Han in , is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns named frequent-pattern tree (FP-tree). In his study, Han proved that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori Algorithm and the TreeProjection .


The FP-growth algorithm scans the dataset only twice. The basic approach to finding frequent itemsets using the FP-growth algorithm is as follows:

1 Build the FP-tree.

2 Mine frequent itemsets from the FP-tree.

The FP stands for “frequent pattern.” An FP-tree looks like other trees in computer science, but it has links connecting similar items. The linked items can be thought of as a linked list.

The FPtree is used to store the frequency of occurrence for sets of items. Sets are stored as paths in the tree.

Sets with similar items will share part of the tree.

Only when they differ will the tree split.

A node identifies a single item from the set and the number of times it occurred in this sequence.

A path will tell you how many times a sequence occurred.

The links between similar items, known as node links, will be used to rapidly find the location of similar items.



FP-growth Algorithm (Pros and Cons)

Pros: Usually faster than Apriori.
Cons: Difficult to implement; certain datasets degrade the performance.
Works with: Nominal values.



General approach to FP-growth algorithm

1. Collect: Any method.

2. Prepare: Discrete data is needed because we’re storing sets. If you have continuous data, it will

3. need to be quantized into discrete values.

4. Analyze: Any method.
5. Train: Build an FP-tree and mine the tree.

6. Test: Doesn’t apply.
7. Use: This can be used to identify commonly occurring items that can be used to make decisions, suggest items, make forecasts, and so on.


FP-Tree using Spyder in Anaconda


1. Download and install Anaconda (distribution version) from Anaconda official site.





2. Choose your OS platform (Win / Mac / Linux) and download Anaconda installer.





3. Once the installation is completed. Launch the Anaconda Navigator.




Anaconda Navigator is a useful platform which contain all commonly used anaconda applications such as:

1. Spyder - Scientific python development environment IDE with advanced editing, interactive testing, debugging features.

2. Jupyter - Web-based, interactive computing environment. It allows edit and run human-readable docs while describing the data analysis.

3. JupyterLab - An extensible environment for interactive and reproducible computing, based on the Jupyter Notebook and Architecture.


Besides, Anaconda Environments function allow user to install different computation environments (default is "base (root)" environment ) via the installation of different conda python packages.

The following figure show the Anaconda platform with the installation of both "base root" and "tensorflow" environment which is used for Deep Learning Networks. We will talk about it in the future labs.





4. Launch the Spyder application by pressing the "Launch" button or run the Spyder application directly from the Windows menu. Like this.

(Note: If your Spyder haven't installed yet, press the "Install" button in the Anaconda Navigator platform. )



As shown in the above figure, Spyder allows us to open/view/edit/run any python program (Left) and the results will be shown on the Right, which is a typical command-line console.

Of course, we can also directly enter any commands and scripts in this console and run the python code directly.

In this lab, we will use these functions and features in this lab to run the FP-tree scripts.


Step 1 - Define the variables and treeNode class


In [1]:

#variables:
#name of the node, a count
#nodelink used to link similar items
#parent vaiable used to refer to the parent of the node in the tree
#node contains an empty dictionary for the children in the node
class treeNode:
def __init__(self, nameValue, numOccur, parentNode):
self.name = nameValue
self.count = numOccur
self.nodeLink = None
self.parent = parentNode      #needs to be updated
self.children = {}
#increments the count variable with a given amount
def inc(self, numOccur):
self.count += numOccur
#display tree in text. Useful for debugging
def disp(self, ind=1):
print ('  '*ind, self.name, ' ', self.count)
for child in self.children.values():
child.disp(ind+1)


Step 2 - Define the Root node "Pyramid"

In [2]:

rootNode = treeNode('pyramid',9,None)



Step 3 - Define the first child node "eye"

In [3]:

rootNode.children['eye'] = treeNode('eye',13,None)



Step 4 - Display nodes

In [4]:

rootNode.disp()


Output:

pyramid   9
eye   13



Constructing the FP-tree



In addition to the FP-tree, you need a header table to point to the first instance of a given type. The header table will allow you to quickly access all of the elements of a given type in the FP-tree.

In addition to storing pointers, you can use the header table to keep track of the total count of every type of element in the FP-tree.


createTree(), takes the dataset and the minimum support as arguments and builds the FP-tree. This makes two passes through the dataset. The first pass goes through everything in the dataset and counts the frequency of each term. These are stored in the header table.


Step 5 - Create FP-tree

In [5]:

def createTree(dataSet, minSup=1): #create FP-tree from dataset but don't mine
headerTable = {}
#go over dataSet twice
for trans in dataSet:#first pass counts frequency of occurance
for item in trans:
headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
for k in list(headerTable):  #remove items not meeting minSup
if headerTable[k] < minSup:
del(headerTable[k])
freqItemSet = set(headerTable.keys())
#print 'freqItemSet: ',freqItemSet
if len(freqItemSet) == 0: return None, None  #if no items meet min support -->get out
for k in headerTable:
headerTable[k] = [headerTable[k], None] #reformat headerTable to use Node link
#print 'headerTable: ',headerTable
retTree = treeNode('Null Set', 1, None) #create tree
for tranSet, count in dataSet.items():  #go through dataset 2nd time
localD = {}
for item in tranSet:  #put transaction items in order
if item in freqItemSet:
localD[item] = headerTable[item][0]
if len(localD) > 0:
orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
updateTree(orderedItems, retTree, headerTable, count)#populate tree with ordered freq itemset
return retTree, headerTable #return tree and header table

Step 6 - UpdateTree() grow the Fp-tree with an itemset.

In [6]:

def updateTree(items, inTree, headerTable, count):
if items[0] in inTree.children:#check if orderedItems[0] in retTree.children
inTree.children[items[0]].inc(count) #incrament count
else:   #add items[0] to inTree.children
inTree.children[items[0]] = treeNode(items[0], count, inTree)
if headerTable[items[0]][1] == None: #update header table
headerTable[items[0]][1] = inTree.children[items[0]]
else:
updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
if len(items) > 1:#call updateTree() with remaining ordered items
updateTree(items[1::], inTree.children[items[0]], headerTable, count)




Step 7 - UpdateHeader() makes sure the node links points to every instance of the this item on the tree

In [7]:

def updateHeader(nodeToTest, targetNode):   #this version does not use recursion
while (nodeToTest.nodeLink != None):    #Do not use recursion to traverse a linked list!
nodeToTest = nodeToTest.nodeLink
nodeToTest.nodeLink = targetNode


Step 8 - Load sample data

In [8]:


def loadSimpDat():
simpDat = [['r', 'z', 'h', 'j', 'p'],
['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
['z'],
['r', 'x', 'n', 'o', 's'],
['y', 'r', 'x', 'z', 'q', 't', 'p'],
['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
return simpDat


Step 9 - The createTree() function doesn’t take the input data as lists. It expects a dictionary with the itemsets as the dictionary keys and the frequency as the value. A createInitSet() function does this conversion for you.

In [9]:

def createInitSet(dataSet):
retDict = {}
for trans in dataSet:
retDict[frozenset(trans)] = 1
return retDict




Step 10 - Load Sample Data

In [10]:

simpDat = loadSimpDat()




Step 11 - Display Sample Data

In [11]:

simpDat

Output:

Out[11]:
[['r', 'z', 'h', 'j', 'p'],
['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
['z'],
['r', 'x', 'n', 'o', 's'],
['y', 'r', 'x', 'z', 'q', 't', 'p'],
['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]




Step 12 - Create InitSet

In [12]:

initSet = createInitSet(simpDat)



Step 13 - Display InitSet

In [13]:

initSet


Output:

Out[13]:
{frozenset({'h', 'j', 'p', 'r', 'z'}): 1,
frozenset({'s', 't', 'u', 'v', 'w', 'x', 'y', 'z'}): 1,
frozenset({'z'}): 1,
frozenset({'n', 'o', 'r', 's', 'x'}): 1,
frozenset({'p', 'q', 'r', 't', 'x', 'y', 'z'}): 1,
frozenset({'e', 'm', 'q', 's', 't', 'x', 'y', 'z'}): 1}



Step 14 - Create FP Tree - myFPtree

In [14]:

#The FP-tree
myFPtree, myHeaderTab = createTree(initSet, 3)





Step 15 - Display myFPtree

In [15]:

myFPtree.disp()

Output:


Null Set   1
z   5
r   1
x   3
t   3
y   3
s   2
r   1
x   1
s   1
r   1

The item and its frequency count are displayed with indentation representing the depth of the tree.



Mining frequent items from an FP-tree

There are three basic steps to extract the frequent itemsets from the FP-tree:

1 Get conditional pattern bases from the FP-tree.

2 From the conditional pattern base, construct a conditional FP-tree.

3 Recursively repeat steps 1 and 2 on until the tree contains a single item.

The conditional pattern base is a collection of paths that end with the item you’re looking for.

Each of those paths is a prefix path. In short, a prefix path is anything on the tree between the item you’re looking for and the tree root.

ascendTree(), which ascends the tree, collecting the names of items it encounters



Step 16 - Define ascendTree

In [16]:

def ascendTree(leafNode, prefixPath): #ascends from leaf node to root
    if leafNode.parent != None:
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent, prefixPath)


The findPrefixPath() function iterates through the linked list until it hits the end.

For each item it encounters, it calls ascendTree().
This list is returned and added to the conditional pattern base dictionary called condPats.


Step 17 - Define findPrefixPath

In [17]:

def findPrefixPath(basePat, treeNode): #treeNode comes from header table
condPats = {}
while treeNode != None:
prefixPath = []
ascendTree(treeNode, prefixPath)
if len(prefixPath) > 1:
condPats[frozenset(prefixPath[1:])] = treeNode.count
treeNode = treeNode.nodeLink
return condPats




Step 18 - Execute findPrefixPath for 'x'

In [18]:

findPrefixPath('x', myHeaderTab['x'][1])

Output:


Out[18]: {frozenset({'z'}): 3}



Step 19 - Execute findPrefixPath for 'r'

In [19]:

findPrefixPath('r', myHeaderTab['r'][1])

Output:

Out[19]:
{frozenset({'z'}): 1,
frozenset({'s', 'x'}): 1,
frozenset({'t', 'x', 'y', 'z'}): 1}



Conclusion

The FP-growth algorithm is an efficient way of finding frequent patterns in a dataset. 

The FP-growth algorithm works with the Apriori principle but is much faster. 

The Apriori algorithm generates candidate itemsets and then scans the dataset to see if they’re frequent.

FP-growth is faster because it goes over the dataset only twice. 

The dataset is stored in a structure called an FP-tree. 

After the FP-tree is built, you can find frequent itemsets by finding conditional bases for an item and building a conditional FP-tree. 

This process is repeated, conditioning on more items until the conditional FPtree has only one item.






到价提醒[关闭]