Regression HW3: Logistic Regression in Python

2018-07-26  在做算法的巨巨

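While writing this post, two questions kept coming up: what logistic regression actually means mathematically, and what data preprocessing I need to do before fitting it.
The mathematics (the hypothesis function, the cost function, the sigmoid function) is something I studied a long time ago; I will fill it in gradually in later posts.
Data preprocessing includes the usual steps, such as dropping null values and slicing the data as needed. What stands out for logistic regression is that text columns have to be converted into discrete numeric values. Why? Because fitting the model means numerically optimizing the hypothesis function, and that only works on numbers. This step is somewhat like text vectorization in naive Bayes, where text is turned into a data matrix. I won't go further here; I will write up the details separately when I have time.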
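As a reminder of the three terms mentioned above, here is a minimal NumPy sketch of the sigmoid, the hypothesis function, and the cross-entropy cost. The function names, the toy matrix `X`, and the labels `y` are made up for illustration; this is the textbook formulation, not necessarily what scikit-learn runs internally (which also adds regularization by default).

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """Predicted probability of the positive class for each row of X."""
    return sigmoid(X @ theta)

def cost(theta, X, y):
    """Average cross-entropy (log loss) between predictions and 0/1 labels."""
    h = hypothesis(theta, X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data (invented): the first column is the intercept term.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0])
theta = np.zeros(2)

# With all-zero parameters every prediction is 0.5, so the cost equals log(2).
print(cost(theta, X, y))
```

Every operation here is arithmetic on `X`, which is why text columns have to become numbers before the model can be fitted.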


Enough talk; here is the code.

import pandas as pd
import numpy as np

# Load the data and drop rows that contain missing values
data = pd.read_csv('c:\\PDM\\data.csv')
data = data.dropna()

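There are two ways to do this.

  1. Use pandas' built-in get_dummies, which expands every category of a column into its own new column holding 0 or 1.
  2. Define the encoding by hand: label each text category in a column with an int value, then apply it with the map method of the column (a pandas Series).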
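To make the difference concrete, here is a small sketch of both approaches on a toy DataFrame. The column names, values, and the `order` dict are invented for this example and are not taken from the dataset used in this post.

```python
import pandas as pd

# Toy data; column names and values are invented for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Freq": ["Never", "Daily", "Weekly"],
})

# Approach 1: one indicator column per category (no inherent order).
onehot = pd.get_dummies(toy, columns=["Gender"], prefix_sep=" ")

# Approach 2: a hand-written dict for categories that do have an order.
order = {"Never": 0, "Weekly": 1, "Daily": 2}
toy["Freq Map"] = toy["Freq"].map(order)

print(onehot.columns.tolist())
print(toy["Freq Map"].tolist())
```

A rough rule of thumb: get_dummies suits categories with no natural order, while a hand-written mapping suits ordered ones such as frequencies or education levels, where the size of the number is meant to carry meaning.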

First, the get_dummies approach:

dummyColumns = ['Gender', 'Home Ownership', 'Internet Connection', 'Marital Status', 'Movie Selector', 'Prerec Format', 'TV Signal']
# Convert the text columns to the pandas category dtype
for column in dummyColumns:
    data[column] = data[column].astype('category')
# One-hot encode; drop_first=True drops the first category of each column as the baseline
dummiesData = pd.get_dummies(data, columns=dummyColumns, prefix=dummyColumns, prefix_sep=" ", drop_first=True)
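The generated column names follow the pattern prefix + prefix_sep + category, and drop_first=True removes the first category of each column. The sketch below shows this on a one-column toy frame (the data is invented); it is why a name like 'Gender Male' exists after encoding while 'Gender Female' does not.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})
toy["Gender"] = toy["Gender"].astype("category")

# Without drop_first: one column per category.
full = pd.get_dummies(toy, columns=["Gender"], prefix=["Gender"], prefix_sep=" ")

# With drop_first: the first category (in sorted order) becomes the baseline.
reduced = pd.get_dummies(toy, columns=["Gender"], prefix=["Gender"],
                         prefix_sep=" ", drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one category per column avoids perfectly redundant columns: if 'Gender Male' is 0, the row must be 'Female'.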

Next, the second approach: a hand-defined dictionary applied with map

educationLevelDict = {
    'Post-Doc': 9,
    'Doctorate': 8,
    'Master\'s Degree': 7,
    'Bachelor\'s Degree': 6,
    'Associate\'s Degree': 5,
    'Some College': 4,
    'Trade School': 3,
    'High School': 2,
    'Grade School': 1
}
dummiesData['Education Level Map'] = dummiesData['Education Level'].map(educationLevelDict)
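One thing worth checking with map: any value that is not a key in the dictionary silently becomes NaN, and the model cannot be fitted on NaN. A minimal sketch of such a check, using an invented Series with one deliberately misspelled entry:

```python
import pandas as pd

# Invented values; the last one deliberately does not match any dict key.
levels = pd.Series(["High School", "Doctorate", "high school"])
mapping = {"High School": 2, "Doctorate": 8}

mapped = levels.map(mapping)

# List the original values that did not match any key.
unmapped = levels[mapped.isna()].unique().tolist()
print(unmapped)
```

If the list is non-empty, either the dictionary is missing a key or the data needs cleaning before mapping.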

freqMap = {
    'Never': 0,
    'Rarely': 1,
    'Monthly': 2,
    'Weekly': 3,
    'Daily': 4
}
dummiesData['PPV Freq Map'] = dummiesData['PPV Freq'].map(freqMap)
dummiesData['Theater Freq Map'] = dummiesData['Theater Freq'].map(freqMap)
dummiesData['TV Movie Freq Map'] = dummiesData['TV Movie Freq'].map(freqMap)
dummiesData['Prerec Buying Freq Map'] = dummiesData['Prerec Buying Freq'].map(freqMap)
dummiesData['Prerec Renting Freq Map'] = dummiesData['Prerec Renting Freq'].map(freqMap)
dummiesData['Prerec Viewing Freq Map'] = dummiesData['Prerec Viewing Freq'].map(freqMap)
dummiesSelect = [
    'Age', 'Num Bathrooms', 'Num Bedrooms', 'Num Cars', 'Num Children', 'Num TVs', 
    'Education Level Map', 'PPV Freq Map', 'Theater Freq Map', 'TV Movie Freq Map', 
    'Prerec Buying Freq Map', 'Prerec Renting Freq Map', 'Prerec Viewing Freq Map', 
    'Gender Male',
    'Internet Connection DSL', 'Internet Connection Dial-Up', 
    'Internet Connection IDSN', 'Internet Connection No Internet Connection',
    'Internet Connection Other', 
    'Marital Status Married', 'Marital Status Never Married', 
    'Marital Status Other', 'Marital Status Separated', 
    'Movie Selector Me', 'Movie Selector Other', 'Movie Selector Spouse/Partner', 
    'Prerec Format DVD', 'Prerec Format Laserdisk', 'Prerec Format Other', 
    'Prerec Format VHS', 'Prerec Format Video CD', 
    'TV Signal Analog antennae', 'TV Signal Cable', 
    'TV Signal Digital Satellite', 'TV Signal Don\'t watch TV'
]
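# Features and target ('Home Ownership Rent' is the indicator column for renters)
inputData = dummiesData[dummiesSelect]
outputData = dummiesData['Home Ownership Rent']  # 1-D target, as scikit-learn expects

from sklearn import linear_model
lrModel = linear_model.LogisticRegression()
lrModel.fit(inputData, outputData)
# Accuracy measured on the training data itself
lrModel.score(inputData, outputData)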
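Note that the score above is computed on the same rows the model was fitted on, so it tends to look better than performance on unseen data. A hedged sketch of a held-out evaluation with scikit-learn's train_test_split follows; the data is synthetic and the 25% split size is an arbitrary choice for illustration, not something from the original homework.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: labels are positive when a noisy linear score exceeds zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on rows seen during fitting
test_acc = model.score(X_test, y_test)     # accuracy on held-out rows
print(train_acc, test_acc)
```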

The data to predict on has to be preprocessed first, repeating the steps above:

newData = pd.read_csv('C:\\PDM\\newData.csv')
newData = newData.dropna()
# Reuse the training categories so get_dummies yields the same columns as before
for column in dummyColumns:
    newData[column] = newData[column].astype(pd.api.types.CategoricalDtype(categories=data[column].cat.categories))
dummiesNewData = pd.get_dummies(newData, columns=dummyColumns, prefix=dummyColumns, prefix_sep=" ", drop_first=True)
# Add the mapped columns to dummiesNewData, since the features are selected from it
dummiesNewData['Education Level Map'] = dummiesNewData['Education Level'].map(educationLevelDict)
dummiesNewData['PPV Freq Map'] = dummiesNewData['PPV Freq'].map(freqMap)
dummiesNewData['Theater Freq Map'] = dummiesNewData['Theater Freq'].map(freqMap)
dummiesNewData['TV Movie Freq Map'] = dummiesNewData['TV Movie Freq'].map(freqMap)
dummiesNewData['Prerec Buying Freq Map'] = dummiesNewData['Prerec Buying Freq'].map(freqMap)
dummiesNewData['Prerec Renting Freq Map'] = dummiesNewData['Prerec Renting Freq'].map(freqMap)
dummiesNewData['Prerec Viewing Freq Map'] = dummiesNewData['Prerec Viewing Freq'].map(freqMap)
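The reason for passing the training categories to the new data is that get_dummies only creates columns for categories it knows about. If a category never appears in the new file, its column would otherwise be missing and the column selection would fail. A small sketch on invented data (the `Signal` column and its values are made up):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Invented data: the new frame contains only one of the three training categories.
train = pd.DataFrame({"Signal": ["Cable", "Analog", "Satellite"]})
new = pd.DataFrame({"Signal": ["Cable", "Cable"]})

train["Signal"] = train["Signal"].astype("category")
# Reuse the training categories so the new data gets the same dummy columns.
shared = CategoricalDtype(categories=train["Signal"].cat.categories)
new["Signal"] = new["Signal"].astype(shared)

train_dummies = pd.get_dummies(train, columns=["Signal"], prefix_sep=" ", drop_first=True)
new_dummies = pd.get_dummies(new, columns=["Signal"], prefix_sep=" ", drop_first=True)

print(new_dummies.columns.tolist())
```

The unseen category still gets a column in the new data; it simply contains no positive entries.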

Build the new matrix containing only the numeric (discrete) columns

inputNewData = dummiesNewData[dummiesSelect]

Predict

lrModel.predict(inputNewData)
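predict returns hard class labels. If the estimated probabilities are of interest as well, LogisticRegression also offers predict_proba. A minimal sketch on a tiny invented dataset (one feature, six rows), separate from the data used in this post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny invented dataset: small feature values are class 0, large ones class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

X_new = np.array([[0.5], [4.5]])
labels = model.predict(X_new)       # hard class labels
probs = model.predict_proba(X_new)  # one row per sample, one column per class
print(labels)
print(probs)
```

Each row of the probability array sums to 1, and the columns follow the order of model.classes_.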