Using the PFP and FPG algorithms in Mahout

Date: 2021-06-03 20:18:48

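Mahout provides two ways to mine frequent itemsets: the in-memory FPG (FP-Growth) and the distributed PFP (Parallel FP-Growth). The PFP implementation also works by splitting the features into groups and then running the FPG algorithm independently on each node. PFP uses 50 groups by default; if the number of items is very large, consider changing this value.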
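The grouping idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: it assumes each item has an integer id (for example its rank in the frequency-sorted item list) and that groups are contiguous id ranges. The names `GroupingSketch`, `maxPerGroup` and `groupOf` are hypothetical, and the scheme is not taken from Mahout's source.

```java
// Illustrative only: one simple way to spread item ids over a fixed number of groups.
// Assumption: item ids are 0-based integers; groups are contiguous id ranges.
final class GroupingSketch {

    // Largest number of items a group holds when numItems are spread over numGroups.
    static int maxPerGroup(int numItems, int numGroups) {
        return (numItems + numGroups - 1) / numGroups; // ceiling division
    }

    // Group id of an item under the contiguous-range assumption.
    static int groupOf(int itemId, int maxPerGroup) {
        return itemId / maxPerGroup;
    }

    public static void main(String[] args) {
        int numItems = 1000;   // hypothetical number of distinct items
        int numGroups = 50;    // the default group count mentioned in the text
        int perGroup = maxPerGroup(numItems, numGroups);
        System.out.println("items per group: " + perGroup);
        System.out.println("item 0 -> group " + groupOf(0, perGroup));
        System.out.println("item 999 -> group " + groupOf(999, perGroup));
    }
}
```

With a fixed group count, the number of items per group grows with the number of distinct items, which is one reason a larger group count may help when there are very many items.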

Let's first look at the FPG test code from Mahout 0.5:

public void testMaxHeapFPGrowth() throws Exception {
  FPGrowth<String> fp = new FPGrowth<String>();

  Collection<Pair<List<String>,Long>> transactions = new ArrayList<Pair<List<String>,Long>>();
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("E", "A", "D", "B"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("D", "A", "C", "E", "B"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("C", "A", "B", "E"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("B", "A", "D"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("D"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("D", "B"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("A", "D", "E"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("B", "C"), 1L));

  Path path = getTestTempFilePath("fpgrowthTest.dat");
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  SequenceFile.Writer writer =
      new SequenceFile.Writer(fs, conf, path, Text.class, TopKStringPatterns.class);
  fp.generateTopKFrequentPatterns(
      transactions.iterator(),
      fp.generateFList(transactions.iterator(), 3),
      3,
      100,
      new HashSet<String>(),
      new StringOutputConverter(new SequenceFileOutputCollector<Text,TopKStringPatterns>(writer)),
      new ContextStatusUpdater(null));
  writer.close();

  List<Pair<String, TopKStringPatterns>> frequentPatterns = FPGrowth.readFrequentPattern(conf, path);
  assertEquals(
      "[(C,([B, C],3)), "
      + "(E,([A, E],4), ([A, B, E],3), ([A, D, E],3)), "
      + "(A,([A],5), ([A, D],4), ([A, E],4), ([A, B],4), ([A, B, E],3), ([A, D, E],3), ([A, B, D],3)), "
      + "(D,([D],6), ([B, D],4), ([A, D],4), ([A, D, E],3), ([A, B, D],3)), "
      + "(B,([B],6), ([A, B],4), ([B, D],4), ([A, B, D],3), ([A, B, E],3), ([B, C],3))]",
      frequentPatterns.toString());
}
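The supports in the expected string can be cross-checked by hand. The sketch below does that in plain Java, without Mahout, by counting how many of the eight test transactions contain a given itemset. It is a brute-force check and not the FP-Growth algorithm itself; the class and method names (`SupportCountSketch`, `supportOf`, `frequentItems`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java cross-check of the supports in the test; it does not use Mahout.
final class SupportCountSketch {

    // The eight transactions from the test, in the same order.
    static List<List<String>> testTransactions() {
        return Arrays.asList(
            Arrays.asList("E", "A", "D", "B"),
            Arrays.asList("D", "A", "C", "E", "B"),
            Arrays.asList("C", "A", "B", "E"),
            Arrays.asList("B", "A", "D"),
            Arrays.asList("D"),
            Arrays.asList("D", "B"),
            Arrays.asList("A", "D", "E"),
            Arrays.asList("B", "C"));
    }

    // Number of transactions that contain every item of the given itemset.
    static long supportOf(List<List<String>> transactions, List<String> itemset) {
        long support = 0;
        for (List<String> t : transactions) {
            if (t.containsAll(itemset)) {
                support++;
            }
        }
        return support;
    }

    // Single items whose support reaches minSupport, mapped to that support.
    static Map<String, Long> frequentItems(List<List<String>> transactions, long minSupport) {
        Map<String, Long> counts = new TreeMap<>();
        for (List<String> t : transactions) {
            for (String item : new HashSet<>(t)) { // count each item once per transaction
                counts.merge(item, 1L, Long::sum);
            }
        }
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> tx = testTransactions();
        System.out.println(frequentItems(tx, 3));
        System.out.println(supportOf(tx, Arrays.asList("A", "B", "E")));
    }
}
```

With a minimum support of 3, all five items qualify (A=5, B=6, C=3, D=6, E=4), and the itemset [A, B, E] has support 3, which agrees with the entries in the expected string.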

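Note that this test writes its output to a sequence file, through a writer wrapped in a StringOutputConverter. Doing the same in a real job can cause problems: the cluster may enforce access control, and writing to HDFS directly through the writer may leave the output file with the wrong permissions (so that nobody else can read or write it).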

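So if you only want to see the output on the console, you can adapt this wrapper class. The class below prints the frequent itemsets to the console: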

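import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.mahout.common.Pair;

import java.io.IOException;
import java.io.PrintStream;
import java.util.List;

public final class PrintStreamConverter implements OutputCollector<String, List<Pair<List<String>, Long>>> {

  private final PrintStream collector;

  public PrintStreamConverter(PrintStream collector) {
    this.collector = collector;
  }

  // Prints one line per pattern: "<key>: <item,item,...>\t<support>".
  @Override
  public void collect(String key, List<Pair<List<String>, Long>> values) throws IOException {
    for (Pair<List<String>, Long> pair : values) {
      collector.print(key + ": " + StringUtils.join(pair.getFirst(), ",") + "\t" + pair.getSecond() + "\n");
    }
  }
}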

Note that the output here is keyed by feature: for each feature, the closed frequent itemsets and their support.

The FPG code can then be changed to:

fp.generateTopKFrequentPatterns(
    transactions.iterator(),
    fp.generateFList(transactions.iterator(), 3),
    3,
    100,
    new HashSet<String>(),
    new PrintStreamConverter(System.out),
    new ContextStatusUpdater(null));

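Sometimes you want to run FPG in memory on a cluster node. This needs some extra wrapping; the class below provides it and emits the frequent itemsets as <Text, Text>: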

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.mahout.common.Pair;

import java.io.IOException;
import java.util.List;

public final class TextOutputConverter implements OutputCollector<String, List<Pair<List<String>, Long>>> {

  private final OutputCollector<Text, Text> collector;

  public TextOutputConverter(OutputCollector<Text, Text> collector) {
    this.collector = collector;
  }

  // Emits key "<feature>,<item;item;...>" with the support as the value.
  @Override
  public void collect(String key, List<Pair<List<String>, Long>> values) throws IOException {
    for (Pair<List<String>, Long> pair : values) {
      collector.collect(
          new Text(key + "," + StringUtils.join(pair.getFirst(), ";")),
          new Text(pair.getSecond().toString()));
    }
  }
}

The code in reduce is as follows:

public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
  FPGrowth<String> fp = new FPGrowth<String>();
  Collection<Pair<List<String>, Long>> transactions = new ArrayList<Pair<List<String>, Long>>();

  while (values.hasNext()) {
    List<String> list = new ArrayList<String>();
    String[] parts = values.next().toString().split(" ");
    Collections.addAll(list, parts);
    transactions.add(new Pair<List<String>, Long>(list, 1L));
  }

  fp.generateTopKFrequentPatterns(
      transactions.iterator(),
      fp.generateFList(transactions.iterator(), 5),
      5,
      1000,
      new HashSet<String>(),
      new TextOutputConverter(output),
      new ContextStatusUpdater(null));
}
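The reducer treats each incoming value as one transaction whose items are separated by single spaces. The sketch below shows that parsing step on its own, in plain Java. It is a slightly more defensive variant than the `split(" ")` used above, since it tolerates repeated or surrounding whitespace, which would otherwise produce empty items. The names `TransactionParseSketch` and `parse` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative parsing of one whitespace-separated line into a transaction.
final class TransactionParseSketch {

    // Splits on runs of whitespace; a blank line yields an empty transaction.
    static List<String> parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<String>();
        }
        return new ArrayList<String>(Arrays.asList(trimmed.split("\\s+")));
    }

    public static void main(String[] args) {
        System.out.println(parse("D A  C E B")); // note the double space after A
        System.out.println(parse("   "));
    }
}
```

Whether empty transactions should be skipped or passed on is a choice for the job; this sketch only makes them easy to detect.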

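The FP algorithms also have some tunable parameters, wrapped in the Parameters class, which is a collection of <key, value> pairs.

numGroups: the number of feature groups, 50 by default. For large item sets, a larger value may work better.

input: the input path

output: the output path

minSupport: the minimum support, 3 by default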
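The parameter set can be pictured as a string-to-string map in which missing keys fall back to a default. The sketch below uses a plain `java.util.Map` as a stand-in; it is not Mahout's Parameters class, and the helper `intParam`, the class name `FpParamsSketch` and the example paths are hypothetical. The defaults (50 groups, minimum support 3) are the ones listed above.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a <key, value> parameter holder; a plain Map, not Mahout's Parameters class.
final class FpParamsSketch {

    // Reads an integer parameter, falling back to defaultValue when the key is absent.
    static int intParam(Map<String, String> params, String key, int defaultValue) {
        String value = params.get(key);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<String, String>();
        params.put("input", "/path/to/input");   // hypothetical path
        params.put("output", "/path/to/output"); // hypothetical path
        params.put("numGroups", "200");          // raised for a large item set

        System.out.println("numGroups = " + intParam(params, "numGroups", 50));
        System.out.println("minSupport = " + intParam(params, "minSupport", 3));
    }
}
```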
