ML - hw3

2019-01-17 · 谢小帅

1. Neural Networks

Test: loss and accuracy

Epoch 10: loss and accuracy

2. K-Nearest Neighbor

(a) Try KNN with different K

Conclusion:

(b) How to choose a proper K

We can use Cross Validation.
Since we don't know in advance which K is best, we can choose a range of candidate K values, evaluate each KNN's performance on the validation data, and pick the K with the lowest validation error as our best K.
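A minimal sketch of this selection using scikit-learn's cross_val_score; the random data and the candidate range 1..15 are placeholders for illustration, not the homework's actual dataset:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# placeholder data; swap in the homework's training samples and labels
X, y = np.random.rand(100, 2), np.random.randint(0, 2, 100)

best_k, best_score = None, -1.0
for k in range(1, 16):                                # candidate K values
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()   # 5-fold CV accuracy
    if score > best_score:
        best_k, best_score = k, score
print('best K:', best_k, 'CV accuracy:', best_score)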

(c) Hack the CAPTCHA

I chose 20 CAPTCHAs, i.e. 100 digits, as the training set and 10 CAPTCHAs as the test set, like the pictures below.

Train set / test set

Then I labeled the training set and stored the 100 samples and labels in hack_data.npz. KNN is then used to predict the digits in the test set; the accuracy is 100%.

The 10 test CAPTCHAs
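A sketch of the prediction step, assuming hack_data.npz stores its arrays under the keys 'samples' and 'labels' (the key names and the choice of 1-NN are assumptions, not confirmed by the write-up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

data = np.load('hack_data.npz')
X_train, y_train = data['samples'], data['labels']   # assumed key names inside the .npz

knn = KNeighborsClassifier(n_neighbors=1)            # 1-NN: the digit glyphs are near-identical templates
knn.fit(X_train, y_train)

# X_test: digits segmented from the 10 test CAPTCHAs, flattened the same way as X_train
# print(knn.predict(X_test))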

3. Decision Tree and ID3

ID3 uses Information Gain to choose the partition feature.
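For reference, the quantities computed in the script below are the entropy of a node and the information gain of splitting it on a feature $A$:

$$H(D) = -\sum_k p_k \log_2 p_k, \qquad \mathrm{Gain}(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|}\, H(D_v)$$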
Below is the calculation process.

Draw the decision tree and annotate each non-leaf node.

The script I used to calculate the Information Gain:

import numpy as np

def cal_info_entropy(*probs):
    """Information entropy: H = -sum(p * log2(p)); assumes every p > 0."""
    info_e = 0
    for p in probs:
        info_e -= p * np.log2(p)
    return info_e

# Root node: entropy of the whole dataset (200 vs. 250 of 450 samples) and of the Gender / GPA splits
h = cal_info_entropy(4 / 9, 5 / 9)
h_gender = 205 / 450 * cal_info_entropy(105 / 205, 100 / 205) + 245 / 450 * cal_info_entropy(95 / 245, 150 / 245)
h_gpa = 215 / 450 * cal_info_entropy(15 / 215, 200 / 215) + 235 / 450 * cal_info_entropy(185 / 235, 50 / 235)
print('Dataset info:', h)
print('Gender info:', h_gender)
print('GPA info:', h_gpa)
print('GPA info Gain:', h - h_gpa)

# Left child of the GPA split (235 samples): gain of splitting it on Gender
h = cal_info_entropy(185 / 235, 50 / 235)
h_gender = 115 / 235 * cal_info_entropy(95 / 115, 20 / 115) + 120 / 235 * cal_info_entropy(90 / 120, 30 / 120)
print('Gender left info Gain:', h - h_gender)

# Right child of the GPA split (215 samples): gain of splitting it on Gender
h = cal_info_entropy(15 / 215, 200 / 215)
h_gender = 90 / 215 * cal_info_entropy(10 / 90, 80 / 90) + 125 / 215 * cal_info_entropy(5 / 125, 120 / 125)
print('Gender right info Gain:', h - h_gender)

Results:

Dataset info: 0.9910760598382222
Gender info: 0.9798427350133525
GPA info: 0.5643777078470715
GPA info Gain: 0.4266983519911507
Gender left info Gain: 0.006269007038336882
Gender right info Gain: 0.01352135817465705

4. K-Means Clustering

(a) k-means two trials

Black and green points mark the initial and final cluster centers, respectively, in the trial images below.

When k = 2

When k = 3

Conclusion:

(b) Get a stable result using k-means

We can choose the k initial centers to be as far apart from each other as possible. The process (a code sketch follows the list):

  1. randomly choose the 1st cluster center
  2. choose the 2nd center as the point farthest from the 1st center
  3. choose the 3rd center as the point with the largest sum of distances to the 1st and 2nd centers
  4. repeat step 3 until all k centers are chosen
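A minimal sketch of this farthest-point initialization in plain NumPy; the random 2-D data X and the fixed seed are placeholders for illustration:

import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k initial centers, each maximizing its summed distance to those already chosen."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # step 1: a random first center
    while len(centers) < k:
        # summed distance of every point to all centers chosen so far
        dist_sum = np.sum([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist_sum)])   # steps 2-4: take the farthest point
    return np.array(centers)

X = np.random.rand(200, 2)
print(farthest_point_init(X, k=3))

Unlike k-means++, which samples new centers with probability proportional to squared distance, this variant is deterministic once the first center is fixed, which is what makes the final clustering stable across runs.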

Alternatively, we can run Hierarchical Clustering or the Canopy algorithm first and use the resulting cluster centers as the initial centers for k-means.

(c) Run k-means on digit_data.mat
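A sketch of loading the data and running k-means with scipy and scikit-learn; the variable name 'X' inside digit_data.mat is an assumption, so inspect data.keys() for the real one:

import scipy.io
from sklearn.cluster import KMeans

data = scipy.io.loadmat('digit_data.mat')
X = data['X']                      # assumed key; check data.keys() for the actual name
X = X.reshape(len(X), -1)          # flatten each digit image into a feature vector

kmeans = KMeans(n_clusters=10, n_init=10).fit(X)
print(kmeans.labels_[:20])         # cluster assignment of the first 20 digits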

Test 1

Test 2

Conclusion:

(d) Vector quantization

When using Fixed-Length Encoding

When using Huffman Encoding
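The comparison behind these two results can be sketched as follows: after vector quantization maps each pixel (or block) to one of k cluster indices, fixed-length encoding spends ceil(log2 k) bits per index, while Huffman coding spends fewer bits on frequent indices. The random indices below are a stand-in for the real quantized image:

import heapq
import math
from collections import Counter

import numpy as np

k = 8
indices = np.random.randint(0, k, size=100_000)   # stand-in for k-means cluster indices

# Fixed-length cost: every index costs ceil(log2 k) bits
fixed_bits = indices.size * math.ceil(math.log2(k))

# Huffman cost: repeatedly merge the two least frequent nodes; the total
# encoded length equals the sum of all merged-node weights.
freqs = list(Counter(indices.tolist()).values())
heapq.heapify(freqs)
huffman_bits = 0
while len(freqs) > 1:
    a, b = heapq.heappop(freqs), heapq.heappop(freqs)
    huffman_bits += a + b
    heapq.heappush(freqs, a + b)

print('fixed-length bits:', fixed_bits)
print('Huffman bits:     ', huffman_bits)

With uniform random indices the two costs come out nearly equal; on a real quantized image the cluster frequencies are skewed, and that skew is exactly what Huffman encoding exploits.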
