FP-Tree 频繁模式树算法

优质

小牛编辑

141浏览

2023-12-01

参考资料：http://blog.csdn.net/sealyao/article/details/6460578
更多数据挖掘算法：https://github.com/linyiqun/DataMiningAlgorithm

介绍

FP-Tree算法全称是FrequentPattern Tree算法，就是频繁模式树算法，他与Apriori算法一样也是用来挖掘频繁项集的，不过不同的是，FP-Tree算法是Apriori算法的优化处理，他解决了Apriori算法在过程中会产生大量的候选集的问题，而FP-Tree算法则是发现频繁模式而不产生候选集。但是频繁模式挖掘出来后，产生关联规则的步骤还是和Apriori是一样的。

算法原理

FP树，FP树，那他当然是最终被构造成一个树的形状了。所以步骤如下：

1、创建根节点，用NULL标记。

2、统计所有的事务数据，统计事务中各个类型项的总支持度(在下面的例子中就是各个商品ID的总个数)

3、依次读取每条事务，比如T1， 1， 2， 5，因为按照总支持度计数数量降序排列，输入的数据顺序就是2， 1， 5，然后挂到根节点上。

4、依次读取后面的事务，并以同样的方式加入的FP树中，顺着根节点路径添加，并更新节点上的支持度计数。

最后就会形成这样的一棵树：

然后还要新建一个项头表，代表所有节点的类型和支持度计数。这个东西在后面会有大用处。如果你以为FP树的算法过程到这里就结束了，你就大错特错了，算法的终结过程为最后的FP树只包括但路径，就是树呈现直线形式，也就是节点都只有1个孩子或没有孩子，顺着一条线下来，没有其他的分支。这就算是一条挖掘出的频繁模式。所以上面的算法还要继续递归的构造FP树，递归构造FP树的过程：

1、这时我们从最下面的I5开始取出。把I5加入到后缀模式中。后缀模式到时会于频繁模式组合出现构成最终的频繁模式。

2、获取频繁模式基，<I2, Ii>，<I2, I1, I3>，计数为I5节点的count值，然后以这2条件模式基为输入的事务，继续构造一个新的FP树

3、这就是我们要达到的FP树单路径的目标了，不过这里个要求，要把支持度计数不够的点排除，这里的I3:1就不符号，所以最后I5后缀模式下的<I2, I1>与I5的组合模式了，就为<I2, I5>, <I1, I5>,<I1, I2, I5>。

I5下的挖掘频繁模式是比较简单的，没有出现递归，看一下I3下的递归构造，这就不简单了，同样的操作，最后就会出现下面这幅图的样子：

发现还不是单条路径，继续递归构造，此时的后缀模式硬卧I3+I1,就是<I3, I1>，然后就来到了下面这幅图的情形了。

后面的例子会有更详细的说明。

算法的实现

输入数据如下：

交易ID	商品ID列表
T100	I1，I2，I5
T200	I2，I4
T300	I2，I3
T400	I1，I2，I4
T500	I1，I3
T600	I2，I3
T700	I1，I3
T800	I1，I2，I3，I5
T900	I1，I2，I3

在文件中的形式就是：

算法的树节点类：

/**
 * FP树节点
 * 
 * @author lyq
 * 
 */
public class TreeNode implements Comparable<TreeNode>, Cloneable{
  // 节点类别名称
  private String name;
  // 计数数量
  private Integer count;
  // 父亲节点
  private TreeNode parentNode;
  // 孩子节点，可以为多个
  private ArrayList<TreeNode> childNodes;
  
  public TreeNode(String name, int count){
    this.name = name;
    this.count = count;
  }

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public Integer getCount() {
    return count;
  }

  public void setCount(Integer count) {
    this.count = count;
  }

  public TreeNode getParentNode() {
    return parentNode;
  }

  public void setParentNode(TreeNode parentNode) {
    this.parentNode = parentNode;
  }

  public ArrayList<TreeNode> getChildNodes() {
    return childNodes;
  }

  public void setChildNodes(ArrayList<TreeNode> childNodes) {
    this.childNodes = childNodes;
  }

  @Override
  public int compareTo(TreeNode o) {
    // TODO Auto-generated method stub
    return o.getCount().compareTo(this.getCount());
  }

  @Override
  protected Object clone() throws CloneNotSupportedException {
    // TODO Auto-generated method stub
    //因为对象内部有引用，需要采用深拷贝
    TreeNode node = (TreeNode)super.clone(); 
    if(this.getParentNode() != null){
      node.setParentNode((TreeNode) this.getParentNode().clone());
    }
    
    if(this.getChildNodes() != null){
      node.setChildNodes((ArrayList<TreeNode>) this.getChildNodes().clone());
    }
    
    return node;
  }
  
}

算法主要实现类：

package DataMining_FPTree;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * FPTree算法工具类
 * 
 * @author lyq
 * 
 */
public class FPTreeTool {
  // 输入数据文件位置
  private String filePath;
  // 最小支持度阈值
  private int minSupportCount;
  // 所有事物ID记录
  private ArrayList<String[]> totalGoodsID;
  // 各个ID的统计数目映射表项，计数用于排序使用
  private HashMap<String, Integer> itemCountMap;

  public FPTreeTool(String filePath, int minSupportCount) {
    this.filePath = filePath;
    this.minSupportCount = minSupportCount;
    readDataFile();
  }

  /**
   * 从文件中读取数据
   */
  private void readDataFile() {
    File file = new File(filePath);
    ArrayList<String[]> dataArray = new ArrayList<String[]>();

    try {
      BufferedReader in = new BufferedReader(new FileReader(file));
      String str;
      String[] tempArray;
      while ((str = in.readLine()) != null) {
        tempArray = str.split(" ");
        dataArray.add(tempArray);
      }
      in.close();
    } catch (IOException e) {
      e.getStackTrace();
    }

    String[] temp;
    int count = 0;
    itemCountMap = new HashMap<>();
    totalGoodsID = new ArrayList<>();
    for (String[] a : dataArray) {
      temp = new String[a.length - 1];
      System.arraycopy(a, 1, temp, 0, a.length - 1);
      totalGoodsID.add(temp);
      for (String s : temp) {
        if (!itemCountMap.containsKey(s)) {
          count = 1;
        } else {
          count = ((int) itemCountMap.get(s));
          // 支持度计数加1
          count++;
        }
        // 更新表项
        itemCountMap.put(s, count);
      }
    }
  }

  /**
   * 根据事物记录构造FP树
   */
  private void buildFPTree(ArrayList<String> suffixPattern,
      ArrayList<ArrayList<TreeNode>> transctionList) {
    // 设置一个空根节点
    TreeNode rootNode = new TreeNode(null, 0);
    int count = 0;
    // 节点是否存在
    boolean isExist = false;
    ArrayList<TreeNode> childNodes;
    ArrayList<TreeNode> pathList;
    // 相同类型节点链表，用于构造的新的FP树
    HashMap<String, ArrayList<TreeNode>> linkedNode = new HashMap<>();
    HashMap<String, Integer> countNode = new HashMap<>();
    // 根据事物记录，一步步构建FP树
    for (ArrayList<TreeNode> array : transctionList) {
      TreeNode searchedNode;
      pathList = new ArrayList<>();
      for (TreeNode node : array) {
        pathList.add(node);
        nodeCounted(node, countNode);
        searchedNode = searchNode(rootNode, pathList);
        childNodes = searchedNode.getChildNodes();

        if (childNodes == null) {
          childNodes = new ArrayList<>();
          childNodes.add(node);
          searchedNode.setChildNodes(childNodes);
          node.setParentNode(searchedNode);
          nodeAddToLinkedList(node, linkedNode);
        } else {
          isExist = false;
          for (TreeNode node2 : childNodes) {
            // 如果找到名称相同，则更新支持度计数
            if (node.getName().equals(node2.getName())) {
              count = node2.getCount() + node.getCount();
              node2.setCount(count);
              // 标识已找到节点位置
              isExist = true;
              break;
            }
          }

          if (!isExist) {
            // 如果没有找到，需添加子节点
            childNodes.add(node);
            node.setParentNode(searchedNode);
            nodeAddToLinkedList(node, linkedNode);
          }
        }

      }
    }

    // 如果FP树已经是单条路径，则输出此时的频繁模式
    if (isSinglePath(rootNode)) {
      printFrequentPattern(suffixPattern, rootNode);
      System.out.println("-------");
    } else {
      ArrayList<ArrayList<TreeNode>> tList;
      ArrayList<String> sPattern;
      if (suffixPattern == null) {
        sPattern = new ArrayList<>();
      } else {
        // 进行一个拷贝，避免互相引用的影响
        sPattern = (ArrayList<String>) suffixPattern.clone();
      }

      // 利用节点链表构造新的事务
      for (Map.Entry entry : countNode.entrySet()) {
        // 添加到后缀模式中
        sPattern.add((String) entry.getKey());
        //获取到了条件模式机，作为新的事务
        tList = getTransactionList((String) entry.getKey(), linkedNode);
        
        System.out.print("[后缀模式]：{");
        for(String s: sPattern){
          System.out.print(s + ", ");
        }
        System.out.print("}, 此时的条件模式基：");
        for(ArrayList<TreeNode> tnList: tList){
          System.out.print("{");
          for(TreeNode n: tnList){
            System.out.print(n.getName() + ", ");
          }
          System.out.print("}, ");
        }
        System.out.println();
        // 递归构造FP树
        buildFPTree(sPattern, tList);
        // 再次移除此项，构造不同的后缀模式，防止对后面造成干扰
        sPattern.remove((String) entry.getKey());
      }
    }
  }

  /**
   * 将节点加入到同类型节点的链表中
   * 
   * @param node
   *            待加入节点
   * @param linkedList
   *            链表图
   */
  private void nodeAddToLinkedList(TreeNode node,
      HashMap<String, ArrayList<TreeNode>> linkedList) {
    String name = node.getName();
    ArrayList<TreeNode> list;

    if (linkedList.containsKey(name)) {
      list = linkedList.get(name);
      // 将node添加到此队列中
      list.add(node);
    } else {
      list = new ArrayList<>();
      list.add(node);
      linkedList.put(name, list);
    }
  }

  /**
   * 根据链表构造出新的事务
   * 
   * @param name
   *            节点名称
   * @param linkedList
   *            链表
   * @return
   */
  private ArrayList<ArrayList<TreeNode>> getTransactionList(String name,
      HashMap<String, ArrayList<TreeNode>> linkedList) {
    ArrayList<ArrayList<TreeNode>> tList = new ArrayList<>();
    ArrayList<TreeNode> targetNode = linkedList.get(name);
    ArrayList<TreeNode> singleTansaction;
    TreeNode temp;

    for (TreeNode node : targetNode) {
      singleTansaction = new ArrayList<>();

      temp = node;
      while (temp.getParentNode().getName() != null) {
        temp = temp.getParentNode();
        singleTansaction.add(new TreeNode(temp.getName(), 1));
      }

      // 按照支持度计数得反转一下
      Collections.reverse(singleTansaction);

      for (TreeNode node2 : singleTansaction) {
        // 支持度计数调成与模式后缀一样
        node2.setCount(node.getCount());
      }

      if (singleTansaction.size() > 0) {
        tList.add(singleTansaction);
      }
    }

    return tList;
  }

  /**
   * 节点计数
   * 
   * @param node
   *            待加入节点
   * @param nodeCount
   *            计数映射图
   */
  private void nodeCounted(TreeNode node, HashMap<String, Integer> nodeCount) {
    int count = 0;
    String name = node.getName();

    if (nodeCount.containsKey(name)) {
      count = nodeCount.get(name);
      count++;
    } else {
      count = 1;
    }

    nodeCount.put(name, count);
  }

  /**
   * 显示决策树
   * 
   * @param node
   *            待显示的节点
   * @param blankNum
   *            行空格符，用于显示树型结构
   */
  private void showFPTree(TreeNode node, int blankNum) {
    System.out.println();
    for (int i = 0; i < blankNum; i++) {
      System.out.print("\t");
    }
    System.out.print("--");
    System.out.print("--");

    if (node.getChildNodes() == null) {
      System.out.print("[");
      System.out.print("I" + node.getName() + ":" + node.getCount());
      System.out.print("]");
    } else {
      // 递归显示子节点
      // System.out.print("【" + node.getName() + "】");
      for (TreeNode childNode : node.getChildNodes()) {
        showFPTree(childNode, 2 * blankNum);
      }
    }

  }

  /**
   * 待插入节点的抵达位置节点，从根节点开始向下寻找待插入节点的位置
   * 
   * @param root
   * @param list
   * @return
   */
  private TreeNode searchNode(TreeNode node, ArrayList<TreeNode> list) {
    ArrayList<TreeNode> pathList = new ArrayList<>();
    TreeNode tempNode = null;
    TreeNode firstNode = list.get(0);
    boolean isExist = false;
    // 重新转一遍，避免出现同一引用
    for (TreeNode node2 : list) {
      pathList.add(node2);
    }

    // 如果没有孩子节点，则直接返回，在此节点下添加子节点
    if (node.getChildNodes() == null) {
      return node;
    }

    for (TreeNode n : node.getChildNodes()) {
      if (n.getName().equals(firstNode.getName()) && list.size() == 1) {
        tempNode = node;
        isExist = true;
        break;
      } else if (n.getName().equals(firstNode.getName())) {
        // 还没有找到最后的位置，继续找
        pathList.remove(firstNode);
        tempNode = searchNode(n, pathList);
        return tempNode;
      }
    }

    // 如果没有找到，则新添加到孩子节点中
    if (!isExist) {
      tempNode = node;
    }

    return tempNode;
  }

  /**
   * 判断目前构造的FP树是否是单条路径的
   * 
   * @param rootNode
   *            当前FP树的根节点
   * @return
   */
  private boolean isSinglePath(TreeNode rootNode) {
    // 默认是单条路径
    boolean isSinglePath = true;
    ArrayList<TreeNode> childList;
    TreeNode node;
    node = rootNode;

    while (node.getChildNodes() != null) {
      childList = node.getChildNodes();
      if (childList.size() == 1) {
        node = childList.get(0);
      } else {
        isSinglePath = false;
        break;
      }
    }

    return isSinglePath;
  }

  /**
   * 开始构建FP树
   */
  public void startBuildingTree() {
    ArrayList<TreeNode> singleTransaction;
    ArrayList<ArrayList<TreeNode>> transactionList = new ArrayList<>();
    TreeNode tempNode;
    int count = 0;

    for (String[] idArray : totalGoodsID) {
      singleTransaction = new ArrayList<>();
      for (String id : idArray) {
        count = itemCountMap.get(id);
        tempNode = new TreeNode(id, count);
        singleTransaction.add(tempNode);
      }

      // 根据支持度数的多少进行排序
      Collections.sort(singleTransaction);
      for (TreeNode node : singleTransaction) {
        // 支持度计数重新归为1
        node.setCount(1);
      }
      transactionList.add(singleTransaction);
    }

    buildFPTree(null, transactionList);
  }

  /**
   * 输出此单条路径下的频繁模式
   * 
   * @param suffixPattern
   *            后缀模式
   * @param rootNode
   *            单条路径FP树根节点
   */
  private void printFrequentPattern(ArrayList<String> suffixPattern,
      TreeNode rootNode) {
    ArrayList<String> idArray = new ArrayList<>();
    TreeNode temp;
    temp = rootNode;
    // 用于输出组合模式
    int length = 0;
    int num = 0;
    int[] binaryArray;

    while (temp.getChildNodes() != null) {
      temp = temp.getChildNodes().get(0);

      // 筛选支持度系数大于最小阈值的值
      if (temp.getCount() >= minSupportCount) {
        idArray.add(temp.getName());
      }
    }

    length = idArray.size();
    num = (int) Math.pow(2, length);
    for (int i = 0; i < num; i++) {
      binaryArray = new int[length];
      numToBinaryArray(binaryArray, i);

      // 如果后缀模式只有1个，不能输出自身
      if (suffixPattern.size() == 1 && i == 0) {
        continue;
      }

      System.out.print("频繁模式：{【后缀模式：");
      // 先输出固有的后缀模式
      if (suffixPattern.size() > 1
          || (suffixPattern.size() == 1 && idArray.size() > 0)) {
        for (String s : suffixPattern) {
          System.out.print(s + ", ");
        }
      }
      System.out.print("】");
      // 输出路径上的组合模式
      for (int j = 0; j < length; j++) {
        if (binaryArray[j] == 1) {
          System.out.print(idArray.get(j) + ", ");
        }
      }
      System.out.println("}");
    }
  }

  /**
   * 数字转为二进制形式
   * 
   * @param binaryArray
   *            转化后的二进制数组形式
   * @param num
   *            待转化数字
   */
  private void numToBinaryArray(int[] binaryArray, int num) {
    int index = 0;
    while (num != 0) {
      binaryArray[index] = num % 2;
      index++;
      num /= 2;
    }
  }

}

算法调用测试类：

/**
 * FPTree频繁模式树算法
 * @author lyq
 *
 */
public class Client {
  public static void main(String[] args){
    String filePath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";
    //最小支持度阈值
    int minSupportCount = 2;
    
    FPTreeTool tool = new FPTreeTool(filePath, minSupportCount);
    tool.startBuildingTree();
  }
}

输出的结果为：

[后缀模式]：{3, }, 此时的条件模式基：{2, }, {1, }, {2, 1, }, 
[后缀模式]：{3, 2, }, 此时的条件模式基：
频繁模式：{【后缀模式：3, 2, 】}
-------
[后缀模式]：{3, 1, }, 此时的条件模式基：{2, }, 
频繁模式：{【后缀模式：3, 1, 】}
频繁模式：{【后缀模式：3, 1, 】2, }
-------
[后缀模式]：{2, }, 此时的条件模式基：
-------
[后缀模式]：{1, }, 此时的条件模式基：{2, }, 
频繁模式：{【后缀模式：1, 】2, }
-------
[后缀模式]：{5, }, 此时的条件模式基：{2, 1, }, {2, 1, 3, }, 
频繁模式：{【后缀模式：5, 】2, }
频繁模式：{【后缀模式：5, 】1, }
频繁模式：{【后缀模式：5, 】2, 1, }
-------
[后缀模式]：{4, }, 此时的条件模式基：{2, }, {2, 1, }, 
频繁模式：{【后缀模式：4, 】2, }
-------

读者可以自己手动的构造一下，可以更深的理解这个过程，然后对照本人的代码做对比。

算法编码时的难点

1、在构造树的时候要重新构建一棵树的时候，要不能对原来的树做更改，在此期间用了老的树的对象，又造成了重复引用的问题了，于是果断又new了一个TreeNode，只把原树的name，和count值拿了过来，父子节点关系完全重新构造。

2、在事务生产树的过程中，把事务映射到TreeNode数组中，然后过程就是加Node节点或者更新Node节点的count值，过程简单许多，也许会让人很难理解，应该个人感觉这样比较方便，如果是死板的String[]字符串数组的形式，中间还要与TreeNode各种转化非常麻烦。

3、在计算条件模式基的时候，我是存在了HashMap<String, ArrayList<TreeNode>>map中，并并没有搞成链表的形式，直接在生成树的时候就全部统计好。

4、此处算法用了2处递归，一个地方是在添加树节点的时候，搜索要在哪个node上做添加的方法，searchNode(TreeNode node, ArrayList<TreeNode> list)，还有一个是整个的buildFPTree()算法，都不是能够一眼就能看明白的地方。希望大家能够理解我的用意。

FP-Tree算法的缺点

尽管FP-Tree算法在挖掘频繁模式的过程中相较Apriori算法里没有产生候选集了，比Apriori也快了一个数量级上了，但是整体上FP-Tree算法的时间，空间消耗开销上还是挺大的。