数据挖掘

第一章:简单推荐算法

2018-10-24  本文已影响6人  无赖宵小

找出相似用户


曼哈顿距离

最简单的距离计算方式;

在二维模型中,每个目标对象都可以用 (x, y) 的点来表示,我们可以用下标来表示不同的对象, (x1, y1)表示A, (x2, y2)表示B,那么他们之间的曼哈顿距离就是:

曼哈顿距离

欧几里得距离

另一种计算距离的方法就是看两点之间的直线距离,即利用勾股定理计算距离:

欧几里得距离

推广:闵可夫斯基距离

将曼哈顿距离和欧几里得距离归纳成一个公式,这个公式称为闵可夫斯基距离:

其他:

r = ∞ 极大距离, r 值越大,单个唯独的差值大小会对整体距离有更大的影响

def minkowski(rating1, rating2, r):
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += pow(abs(rating1[key] - rating2[key]), r)
    return pow(distance, 1.0 / r)

皮尔逊相关数

皮尔逊相关数用于衡量两个变量之间的相关性,它的值在 -1 至 1 之间, 1 表示完全吻合, -1 表示完全相悖。

“分数膨胀”:审计标准较低导致的数据结果数值偏高。

def pearson(rating1, rating2):
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0

    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += pow(x, 2)
            sum_y2 += pow(y, 2)

    # 计算分母
    denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)
    if denominator == 0:
        return 0
    else:
        return (sum_xy - (sum_x * sum_y) / n) / denominator

余弦相似度

余弦相似度的范围从 1 到 -1 , 1 表示完全匹配, - 1 表示完全相悖。

小结:

1、如果数据存在“分数膨胀”问题,就是用皮尔逊相关系数。

2、如果数据比较“密集”,变量之间基本都存在公有值,且这些距离数据是非常重要的,那就是用欧几里得或曼哈顿距离。

3、如果数据是稀疏的,则使用余弦近似值。


# -*- coding:utf-8 -*-

"""
曼哈顿、皮尔逊系数、余弦相似度整合简单推荐算法
"""

import codecs 
from math import sqrt

users = {
    "Angelica": {
        "Blues Traveler": 3.5, 
        "Broken Bells": 2.0, 
        "Norah Jones": 4.5, 
        "Phoenix": 5.0, 
        "Slightly Stoopid": 1.5, 
        "The Strokes": 2.5, 
        "Vampire Weekend": 2.0
        },
    "Bill":{
        "Blues Traveler": 2.0, 
        "Broken Bells": 3.5, 
        "Deadmau5": 4.0, 
        "Phoenix": 2.0, 
        "Slightly Stoopid": 3.5, 
        "Vampire Weekend": 3.0
        },
    "Chan": {
        "Blues Traveler": 5.0, 
        "Broken Bells": 1.0, 
        "Deadmau5": 1.0, 
        "Norah Jones": 3.0, 
        "Phoenix": 5, 
        "Slightly Stoopid": 1.0
        },
    "Dan": {
        "Blues Traveler": 3.0, 
        "Broken Bells": 4.0, 
        "Deadmau5": 4.5, 
        "Phoenix": 3.0, 
        "Slightly Stoopid": 4.5, 
        "The Strokes": 4.0, 
        "Vampire Weekend": 2.0
        },
    "Hailey": {
        "Broken Bells": 4.0, 
        "Deadmau5": 1.0, 
        "Norah Jones": 4.0, 
        "The Strokes": 4.0, 
        "Vampire Weekend": 1.0
        },
    "Jordyn":  {
        "Broken Bells": 4.5, 
        "Deadmau5": 4.0, 
        "Norah Jones": 5.0, 
        "Phoenix": 5.0, 
        "Slightly Stoopid": 4.5, 
        "The Strokes": 4.0, 
        "Vampire Weekend": 4.0
        },
    "Sam": {
        "Blues Traveler": 5.0, 
        "Broken Bells": 2.0, 
        "Norah Jones": 3.0, 
        "Phoenix": 5.0, 
        "Slightly Stoopid": 4.0, 
        "The Strokes": 5.0
        },
    "Veronica": {
        "Blues Traveler": 3.0, 
        "Norah Jones": 5.0, 
        "Phoenix": 4.0, 
        "Slightly Stoopid": 2.5, 
        "The Strokes": 3.0
        }
    }

class recommender:
    def __init__(self, data, k = 1, metric = 'pearson', n =5):
        """ 初始化推荐模块
        data 训练数据
        k K 邻近算法中的值
        metric 使用何种距离计算方式
        n 推荐结果数量
        """
        self.k = k 
        self.n = n 
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # 将距离计算方式保存下来
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        
        # 如果 data 是一个字典类型,则保存下来,否则忽略
        if type(data).__name__ == 'dict':
            self.data = data
            
    def convertProductID2name(self, id):
        # 通过产品 ID 获取名称
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id        
            
    def userRatings(self, id, n):
        # 返回该用户评分最高的物品
        print("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # 排序并返回结果
        ratings.sort(key=lambda artisTuple: artisTuple[1], reverse=True)
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))
            
    def loadBookDB(self, path = ''):
        # 加载 BX 数据集,path 是数据文件位置
        self.data = {}
        i = 0
        # 将数据评分数据放入 self.data
        f = codecs.open(path + "\BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        # 将数据信息存入 self.productid2name
        # 包括 isbn 号、书名、作者等
        f = codecs.open(path + "\BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        # 将用户信息存入 self.userid2name 和 self.username2id
        f = codecs.open(path + "\BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)
        
    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator
            
    def computeNearestNeighbor(self, username):
        """获取邻近用户"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username], self.data[instance])
                distances.append((instance, distance))
        # 按距离排序,距离近的排在前面
        distances.sort(key=lambda artistTuple: artistTuple[1], reverse=True)
        return distances
    
    def recommend(self, user):
        """返回推荐列表"""
        recommendations = {}
        # 首先获取邻近用户
        nearest = self.computeNearestNeighbor(user)
        # 获取用户评价过的商品
        userRatings = self.data[user]
        # 计算总距离
        totalDistance = 0.0
        for i in range(self.k):
            totalDistance += nearest[i][1]
        # 汇总 k 邻近用户的评分
        for i in range(self.k):
            # 计算饼图的每个分片
            weight = nearest[i][1] / totalDistance
            # 获取用户名称
            name = nearest[i][0]
            # 获取用户评分
            neighborRatings = self.data[name]
            # 获取没有评价过的商品
            for artist in  neighborRatings:
                if not artist in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist] * weight)
                    else:
                        recommendations[artist] = (recommendations[artist] + neighborRatings[artist] * weight)
            # 开始推荐
            recommendations = list(recommendations.items())
            recommendations = [(self.convertProductID2name(k), v)
                               for (k, v) in recommendations]
            # 排序并返回
            recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse=True)
            # 返回前 n 个结果      
            return recommendations[:self.n]
                  
                  
R = recommender(users)
R.loadBookDB("...")
print(R.recommend('276747'))
print(R.recommend('276822'))
print(R.recommend('276813'))

参考原文作者:Ron Zacharski CC BY-NC 3.0] https://github.com/egrcc/guidetodatamining

参考原文原文 http://guidetodatamining.com/

参考译文来自 @egrcchttps://github.com/egrcc/guidetodatamining

上一篇 下一篇

猜你喜欢

热点阅读