Julia MLJ: logistic regression, machine learning, gradient descent, hyperparameter tuning, kagg

2020-09-28  二方亨
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.    
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.0-rc1.0 (2020-06-26)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release  
|__/                   |
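Overview of the dataset fields

This is a standard classification problem: given various attributes of a known user, predict whether that user will buy vehicle insurance (Response). The dataset can be downloaded from Kaggle.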

  1. Load the data
using Queryverse, MLJ, StatsKit, PrettyPrinting, LossFunctions, Plots
train_data = Queryverse.load("D:\\data\\archive\\train.csv") |> DataFrame
test_data = Queryverse.load("D:\\data\\archive\\test.csv") |> DataFrame

"|>" is Julia's pipe operator, comparable to "%>%" in R. It passes the result on its left as the argument to the function on its right. In the statements above, it converts the loaded data into a DataFrame.

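As a small illustration of the pipe operator described above, the sketch below chains two functions and compares the result with the equivalent nested call. The helper `double` and the input vector are made up for this example.

```julia
# Hypothetical helper used only for this illustration
double(x) = 2x

# Piped form: each result is passed to the function on the right
piped = [1, 2, 3] |> sum |> double

# Equivalent nested form
nested = double(sum([1, 2, 3]))

println(piped == nested)
```

Note that "|>" only works with functions of a single argument, which is why the article wraps multi-argument calls in ordinary function syntax elsewhere.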
  2. Inspect the scientific types (scitypes) of the data
train_data |> MLJ.schema
[Screenshot: output of MLJ.schema]

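The output shows two kinds of type:
1. types (machine types)
2. scitypes (scientific types)
A machine type is easy to understand: as in R, Python, or SQL, it is how the data is stored. A scientific type is defined by MLJ to describe how a model should interpret the data. Different models accept different scientific types, so check this before using a model.
The MLJ documentation explains scientific types in detail.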
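To make the difference between machine types and scientific types concrete, here is a minimal sketch on a toy table. The table `toy` and its values are invented for illustration, and the snippet assumes MLJ is installed.

```julia
using MLJ  # assumes MLJ is installed; it re-exports the scitype functions used here

# Toy table (hypothetical values) with one integer and one string column
toy = (Age = [44, 76, 47], Gender = ["Male", "Male", "Female"])

schema(toy)                        # lists each column's machine type and scitype

age_scitype  = scitype(44)         # default interpretation of an Int64
text_scitype = scitype("Male")     # default interpretation of a String

# Coercion changes how the column is interpreted by models
gender = coerce(toy.Gender, Multiclass)
gender_scitype = scitype(gender)

println((age_scitype, text_scitype, gender_scitype))
```

An integer column is read as Count and a string column as Textual by default, which is why the article coerces columns before passing them to a model that expects Continuous or Finite inputs.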

  3. View summary statistics of the training set
train_data |> describe |> print

│ Row │ variable             │ mean     │ min      │ median   │ max       │ nunique │ nmissing │ eltype   │
│     │ Symbol               │ Union…   │ Any      │ Union…   │ Any       │ Union…  │ Nothing  │ DataType │
├─────┼──────────────────────┼──────────┼──────────┼──────────┼───────────┼─────────┼──────────┼──────────┤
│ 1   │ id                   │ 190555.0 │ 1        │ 190555.0 │ 381109    │         │          │ Int64    │
│ 2   │ Gender               │          │ Female   │          │ Male      │ 2       │          │ String   │
│ 3   │ Age                  │ 38.8226  │ 20       │ 36.0     │ 85        │         │          │ Int64    │
│ 4   │ Driving_License      │ 0.997869 │ 0        │ 1.0      │ 1         │         │          │ Int64    │
│ 5   │ Region_Code          │ 26.3888  │ 0.0      │ 28.0     │ 52.0      │         │          │ Float64  │
│ 6   │ Previously_Insured   │ 0.45821  │ 0        │ 0.0      │ 1         │         │          │ Int64    │
│ 7   │ Vehicle_Age          │          │ 1-2 Year │          │ > 2 Years │ 3       │          │ String   │
│ 8   │ Vehicle_Damage       │          │ No       │          │ Yes       │ 2       │          │ String   │
│ 9   │ Annual_Premium       │ 30564.4  │ 2630.0   │ 31669.0  │ 540165.0  │         │          │ Float64  │
│ 10  │ Policy_Sales_Channel │ 112.034  │ 1.0      │ 133.0    │ 163.0     │         │          │ Float64  │
│ 11  │ Vintage              │ 154.347  │ 10       │ 154.0    │ 299       │         │          │ Int64    │
│ 12  │ Response             │ 0.122563 │ 0        │ 0.0      │ 1         │         │          │ Int64    │

id: does not help train the model and must be dropped
Gender, Driving_License, Region_Code, Previously_Insured, Vehicle_Age, Vehicle_Damage, and Response: categorical variables, to be one-hot encoded
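For readers unfamiliar with one-hot encoding, the following is a hand-rolled sketch in plain Julia, not the MLJ implementation. The function `onehot` and the sample column are hypothetical. Its `drop_last` option mirrors the idea behind the `drop_last = true` setting passed to ContinuousEncoder later in this article: one indicator column per categorical feature is dropped because it is implied by the others.

```julia
# Hand-rolled one-hot encoding of a categorical column (illustration only)
function onehot(values::AbstractVector; drop_last::Bool = false)
    levels = sort(unique(values))
    if drop_last
        levels = levels[1:end-1]   # the last level is implied by the others
    end
    # One Float64 indicator column per retained level
    return Dict(level => Float64.(values .== level) for level in levels)
end

vehicle_damage = ["Yes", "No", "Yes"]   # hypothetical sample values
encoded = onehot(vehicle_damage; drop_last = true)

println(encoded)
```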

  4. Check whether positive and negative samples are balanced
train_data.Response |> StatsKit.countmap
[Screenshot: output of StatsKit.countmap]

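The positive and negative samples are imbalanced; this is handled later in the model. (Undersampling the training set is another option.)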

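  5. Drop the id variable from the training set
train_data = train_data[:, Not(:id)]
  6. Unpack: split the data into predictors and the target variable
y, X = unpack(train_data, ==(:Response), colname -> true)
[Screenshot: MLJ.unpack]
  7. Use automatic scitype conversion to coerce the predictors to scientific types the model accepts
X = coerce(X, autotype(X)) # first auto-coerce the training set's scitypes to types the learner supports
[Screenshot: 微信截图_20200928201525.jpg]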

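The predictors were converted to three scientific types: Multiclass (unordered categorical), OrderedFactor (ordered factor), and Continuous (continuous numeric).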

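  8. Encode as continuous values
X = MLJ.transform(fit!(machine(ContinuousEncoder(drop_last = true), X)), X)
  9. Standardize
X = MLJ.transform(fit!(machine(Standardizer(), X)), X)
[Screenshot: Standardizer]

To make gradient descent more efficient, standardize the data to mean 0 and standard deviation 1.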
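The arithmetic behind standardization can be sketched in a few lines of plain Julia. This is an illustration of the z-score formula, not the code MLJ runs. The function name `standardize_column` and the sample values are made up.

```julia
using Statistics

# z-score: subtract the column mean, then divide by the sample standard deviation
standardize_column(x) = (x .- mean(x)) ./ std(x)

age = [20.0, 36.0, 85.0]          # hypothetical sample values
z = standardize_column(age)

println((mean(z), std(z)))
```

After the transformation the column has mean 0 and standard deviation 1 (up to floating-point error), so features measured on very different scales, such as Age and Annual_Premium, contribute comparably to the gradient.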

  10. Coerce the target variable's scientific type to OrderedFactor
y = coerce(y, OrderedFactor)
  11. Inspect the hyperparameters of the logistic regression learner
info("LogisticClassifier", pkg = "ScikitLearn") |> pprint

[ Info: Training Machine{ContinuousEncoder} @192.
name = "LogisticClassifier",
 package_name = "ScikitLearn",
 is_supervised = true,
 docstring = "Logistic regression classifier.\n→ based on [ScikitLearn](https://github.com/cstjean/ScikitLearn.jl).\n→ do `@load LogisticClassifier pkg=\"ScikitLearn\"` to use the model.\n→ do `?LogisticClassifier` for documentation.",  
 hyperparameter_ranges = (nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing),
 hyperparameter_types = ("String", "Bool", "Float64", "Float64", "Bool", "Float64", "Any", "Any", "String", "Int64", "String", "Int64", "Bool", "Union{Nothing, Int64}", "Union{Nothing, Float64}"),
 hyperparameters = (:penalty, :dual, :tol, :C, :fit_intercept, :intercept_scaling, :class_weight, :random_state, :solver, :max_iter, :multi_class, :verbose, :warm_start, :n_jobs, :l1_ratio),
 implemented_methods = [:clean!, :fit, :fitted_params, :predict],
 is_pure_julia = false,
 is_wrapper = true,
 load_path = "MLJScikitLearnInterface.LogisticClassifier",
 package_license = "BSD",
 package_url = "https://github.com/cstjean/ScikitLearn.jl",
 package_uuid = "3646fa90-6ef7-5e7e-9f22-8aca16db6324",
 prediction_type = :probabilistic,
 supports_online = false,
 supports_weights = false,
 input_scitype = Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous),
 target_scitype = AbstractArray{_s267,1} where _s267<:Finite,
 output_scitype = Unknown)
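  12. Load the model
@load LogisticClassifier pkg="ScikitLearn"

lc = LogisticClassifier(class_weight = "balanced",  # classes are imbalanced, so let the model compute class weights automatically
                        solver = "sag") # optimizer: SAG (stochastic average gradient descent)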
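To give a feel for what class_weight = "balanced" does, the sketch below reproduces the weighting rule documented by scikit-learn, n_samples / (n_classes * count_of_class), in plain Julia. The function `balanced_class_weights` and the toy labels are hypothetical; the real computation happens inside scikit-learn.

```julia
# Balanced class weights: rarer classes receive proportionally larger weights
function balanced_class_weights(y::AbstractVector)
    counts = Dict{eltype(y), Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    n_samples, n_classes = length(y), length(counts)
    return Dict(label => n_samples / (n_classes * c) for (label, c) in counts)
end

y_toy = [0, 0, 0, 0, 0, 0, 0, 1]   # toy labels: 7 negatives, 1 positive
weights = balanced_class_weights(y_toy)

println(weights)
```

With these toy labels the single positive sample gets a weight seven times that of each negative sample, so both classes contribute equally to the loss in total.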
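  13. Train the model
r = range(lc, :max_iter, lower = 100, upper = 500) # range of max_iter (maximum solver iterations) to search

tm = TunedModel(model = lc,
                tuning = Grid(), # search strategy over the parameter range
                resampling = CV(rng = 11, nfolds = 10),
                range = [r], # parameter range
                measure = area_under_curve # metric used to pick the best result: area under the ROC curve
                )

mtm = machine(tm, X, y)  # build the machine (binds the model to the data)

fit!(mtm) # fit the tuned model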


[ Info: Training Machine{ProbabilisticTunedModel{Grid,…}} @931.
[ Info: Attempting to evaluate 10 models.
Evaluating over 10 metamodels: 100%[=========================] Time: 0:07:00
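The log above reports 10 candidate models. One plausible reading, assuming Grid() uses a default resolution of 10 and spaces integer ranges evenly before rounding, is that the candidate max_iter values are as follows. This plain-Julia sketch only reproduces that arithmetic and does not call MLJ.

```julia
# Assumed defaults: 10 evenly spaced points between the range bounds, rounded to integers
lower, upper, resolution = 100, 500, 10
candidates = round.(Int, range(lower, upper; length = resolution))

println(candidates)
```

Each candidate is evaluated with 10-fold cross-validation, so the run above fits the classifier 100 times, which helps explain the seven-minute runtime.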

14. Visualize the tuning results

res = report(mtm).plotting
scatter(res.parameter_values[:,1],
        res.measurements)
[Plot: scatter]
best_model = fitted_params(mtm).best_model # inspect the best model's hyperparameters
best_model

AUC (area under the ROC curve) is highest at max_iter = 278.
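AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison in plain Julia. It is an illustration of the definition, not MLJ's area_under_curve implementation, and the function name, scores, and labels are made up.

```julia
# AUC by pairwise comparison: fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a win
function pairwise_auc(scores::AbstractVector{<:Real}, labels::AbstractVector{Bool})
    pos = scores[labels]
    neg = scores[.!labels]
    wins = 0.0
    for p in pos, n in neg
        wins += p > n ? 1.0 : (p == n ? 0.5 : 0.0)
    end
    return wins / (length(pos) * length(neg))
end

scores = [0.1, 0.4, 0.35, 0.8]          # hypothetical predicted probabilities
labels = [false, false, true, true]     # hypothetical true classes
auc = pairwise_auc(scores, labels)

println(auc)
```

Because AUC depends only on the ranking of scores, it is less sensitive to class imbalance than accuracy, which suits the roughly 12% positive rate in this dataset.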

15. Process the test set with the same transformations

test_data |> describe |> pprint
id = test_data[:, :id]
test_data = select(test_data, Not(:id))

test_data = coerce(test_data, autotype(test_data)) # auto-coerce scitypes
test_data = MLJ.transform(fit!(machine(ContinuousEncoder(drop_last = true), test_data)), test_data) # encode as continuous values
test_data = MLJ.transform(fit!(machine(Standardizer(), test_data)), test_data) # standardize
test_data
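The code above fits a new encoder and standardizer on the test set. A common alternative, shown here only as a sketch on toy data, is to keep the machine fitted on the training data and reuse it, so that test features are scaled with the training mean and standard deviation. The tables and the name `std_mach` are hypothetical, and the snippet assumes MLJ is installed.

```julia
using MLJ  # assumes MLJ is installed

# Toy stand-ins (hypothetical values) for a training and a test feature table
X_train_toy = (Age = [20.0, 30.0, 40.0],)
X_test_toy  = (Age = [30.0, 50.0],)

# Fit the standardizer once, on training data only, and keep the machine
std_mach = machine(Standardizer(), X_train_toy)
fit!(std_mach, verbosity = 0)

# Apply the training mean and standard deviation to both tables
X_train_std = MLJ.transform(std_mach, X_train_toy)
X_test_std  = MLJ.transform(std_mach, X_test_toy)

println(X_test_std.Age)
```

The same pattern would apply to ContinuousEncoder, which also guards against the test set containing a different set of category levels than the training set.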
  16. Predict with the trained model
result = predict_mode(mtm, test_data)
[Screenshot: output of predict_mode]
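The info output earlier lists prediction_type = :probabilistic, so the classifier predicts class probabilities and predict_mode reduces each prediction to its most probable class. For a binary target this amounts to the rule sketched below in plain Julia, with made-up probabilities.

```julia
# Hypothetical predicted probabilities P(Response = 1) for three customers
p_response = [0.1, 0.7, 0.45]

# Most probable class: 1 when P(Response = 1) exceeds 0.5, otherwise 0
modes = [p > 0.5 ? 1 : 0 for p in p_response]

println(modes)
```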
  17. Check the class proportions of the predictions
result |> countmap
[Screenshot: output of countmap]
  18. Combine id and the predictions into a DataFrame
result_data = DataFrame(id = id, Response = result)
result_data