Data Algorithms: Hadoop/Spark Big Data Processing --- Chapter 15

2018-07-08 _Kantin

This chapter covers sentiment analysis.

The idea behind the sentiment analysis algorithm

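Whether a sentence is positive or negative is decided by the words it contains. First build a set of positive words and a set of negative words. Then compute what fraction of the words in the sentence belongs to each set, and use those two fractions to decide which kind of sentiment the sentence expresses.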
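The idea above can be sketched in a few lines of plain Java. This is only a minimal illustration: the class name `SentimentRatioSketch`, the tiny word lists, and the rule "score above zero means positive" are all assumptions made for the example, not something the chapter prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SentimentRatioSketch {
    // Hypothetical word lists, for illustration only.
    static final Set<String> POSITIVE = new HashSet<>(Arrays.asList("good", "great", "love"));
    static final Set<String> NEGATIVE = new HashSet<>(Arrays.asList("bad", "awful", "hate"));

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Positive ratio minus negative ratio; 0.0 for text with no words.
    static double score(String text) {
        List<String> words = tokenize(text);
        if (words.isEmpty()) {
            return 0.0;
        }
        int positiveCount = 0;
        int negativeCount = 0;
        for (String w : words) {
            if (POSITIVE.contains(w)) positiveCount++;
            if (NEGATIVE.contains(w)) negativeCount++;
        }
        return (double) (positiveCount - negativeCount) / words.size();
    }

    // Assumed decision rule: the sign of the score picks the label.
    static String label(String text) {
        double s = score(text);
        if (s > 0) return "positive";
        if (s < 0) return "negative";
        return "neutral";
    }

    public static void main(String[] args) {
        System.out.println(label("I love this great plan"));
        System.out.println(score("bad idea, awful result"));
    }
}
```

Real word lists would be far larger, and the threshold for "positive" versus "negative" is a tuning choice rather than a fixed rule.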

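How this chapter implements it

1. Pseudocode in MapReduce form, using the US presidential election as the example.

++Implementation based on MapReduce pseudocode++

1. Map side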

setup(){
    positiveWords =<load positive words from distributed cache>
    negativeWords =<load negative words from distributed cache>
    allCandidates =<load all candidates from distributed cache>
}
map(key,value){
    date = key  // date of the tweet
    List<String> tweetWords = normalizeAndTokenize(value)
    int positiveCount = 0;
    int negativeCount = 0;
    // check every candidate
    for(String candidate : allCandidates){
        if(candidate is in the tweetWords){
            // count the positive and the negative words in the tweet
            positiveCount = <count of positive words in tweetWords>
            negativeCount = <count of negative words in tweetWords>
            // divide by the total number of words (cast to avoid integer division)
            double positiveRatio = (double) positiveCount / tweetWords.size();
            double negativeRatio = (double) negativeCount / tweetWords.size();
            outputKey = Pair(date,candidate)
            outputValue = positiveRatio - negativeRatio;
            emit(outputKey,outputValue)
        }
    }
}
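To make the map step concrete, here is a rough plain-Java sketch that mimics it without any Hadoop dependency. Everything specific in it is an assumption for illustration: the class name `SentimentMapSketch`, the in-memory word lists standing in for the distributed cache, the made-up candidate names, and the use of a `"date,candidate"` string in place of the `Pair` key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SentimentMapSketch {
    // Hypothetical stand-ins for the data that setup() loads from the distributed cache.
    static final Set<String> POSITIVE_WORDS = new HashSet<>(Arrays.asList("good", "great", "win"));
    static final Set<String> NEGATIVE_WORDS = new HashSet<>(Arrays.asList("bad", "weak", "lose"));
    static final List<String> ALL_CANDIDATES = Arrays.asList("smith", "jones");

    // Lowercase the tweet and split on anything that is not a letter or digit.
    static List<String> normalizeAndTokenize(String tweet) {
        List<String> words = new ArrayList<>();
        for (String w : tweet.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Returns the ("date,candidate" -> score) pairs that map() would emit for one tweet.
    static Map<String, Double> map(String date, String tweet) {
        Map<String, Double> emitted = new LinkedHashMap<>();
        List<String> tweetWords = normalizeAndTokenize(tweet);
        if (tweetWords.isEmpty()) {
            return emitted;
        }
        int positiveCount = 0;
        int negativeCount = 0;
        for (String w : tweetWords) {
            if (POSITIVE_WORDS.contains(w)) positiveCount++;
            if (NEGATIVE_WORDS.contains(w)) negativeCount++;
        }
        // Cast to double so the ratios are not truncated by integer division.
        double positiveRatio = (double) positiveCount / tweetWords.size();
        double negativeRatio = (double) negativeCount / tweetWords.size();
        for (String candidate : ALL_CANDIDATES) {
            if (tweetWords.contains(candidate)) {
                emitted.put(date + "," + candidate, positiveRatio - negativeRatio);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(map("2016-10-01", "Smith will win, great speech"));
    }
}
```

Note that the score is computed per tweet, so a tweet that names two candidates gives both of them the same value. That follows directly from the pseudocode and is a known limitation of this simple approach.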

2. Reduce side

//key: the (date, candidate) pair emitted by map; values: the list of sentiment scores for that key
reduce(key,values){
    double sumOfRatio = 0.0
    int n = 0;
    for(Double value : values){
        n++;
        sumOfRatio += value;
    }
    emit(key,sumOfRatio/n)
}
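The reduce step is just an average per key. The sketch below shows it in plain Java, together with a small in-memory stand-in for the shuffle phase that groups the emitted pairs by key. The class name `SentimentReduceSketch`, the `shuffleAndReduce` helper, the sample keys, and the choice to return 0.0 for an empty group are assumptions made for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SentimentReduceSketch {
    // Averages the scores collected for one key.
    static double reduce(List<Double> values) {
        double sumOfRatio = 0.0;
        int n = 0;
        for (double value : values) {
            n++;
            sumOfRatio += value;
        }
        // Assumption: an empty group yields 0.0 instead of dividing by zero.
        return n == 0 ? 0.0 : sumOfRatio / n;
    }

    // In-memory stand-in for shuffle: group values by key, then reduce each group.
    static Map<String, Double> shuffleAndReduce(List<Map.Entry<String, Double>> emitted) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (Map.Entry<String, Double> pair : emitted) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : groups.entrySet()) {
            averages.put(group.getKey(), reduce(group.getValue()));
        }
        return averages;
    }

    public static void main(String[] args) {
        // Hypothetical map output for two made-up candidates on one day.
        List<Map.Entry<String, Double>> emitted = new ArrayList<>();
        emitted.add(new SimpleEntry<>("2016-10-01,smith", 0.4));
        emitted.add(new SimpleEntry<>("2016-10-01,smith", 0.0));
        emitted.add(new SimpleEntry<>("2016-10-01,jones", -0.25));
        System.out.println(shuffleAndReduce(emitted));
    }
}
```

In a real job the framework performs the grouping, and the reducer only ever sees one key with its list of values.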
