ML - hw3

2019-01-17 · 谢小帅

1. Neural Networks

Test: loss and accuracy

Epoch 10: loss and accuracy

2. K-Nearest Neighbor

(a) Try KNN with different K

Conclusion:

(b) How to choose a proper K

We can use Cross Validation.
Since we don't know in advance which K is best, we can choose a range of candidate K values, evaluate each KNN's performance on the validation data, and pick the K with the lowest validation error as our best K.
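A minimal sketch of this selection using scikit-learn's cross_val_score; the random data and the candidate range 1..15 are placeholders for illustration, not the homework's actual dataset:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# placeholder data; swap in the homework's training samples and labels
X, y = np.random.rand(100, 2), np.random.randint(0, 2, 100)

best_k, best_score = None, -1.0
for k in range(1, 16):                                # candidate K values
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()   # 5-fold CV accuracy
    if score > best_score:
        best_k, best_score = k, score
print('best K:', best_k, 'CV accuracy:', best_score)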

(c) Hack the CAPTCHA

I chose 20 CAPTCHAs, i.e. 100 digits, as the training set and 10 CAPTCHAs as the test set, like the pictures below.

Train set / test set

Then I labeled the training set and stored the 100 samples and labels in hack_data.npz. KNN is then used to predict the digits in the test set; the accuracy is 100%.

The 10 test CAPTCHAs
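A sketch of the prediction step, assuming hack_data.npz stores its arrays under the keys 'samples' and 'labels' (the key names and the choice of 1-NN are assumptions, not confirmed by the write-up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

data = np.load('hack_data.npz')
X_train, y_train = data['samples'], data['labels']   # assumed key names inside the .npz

knn = KNeighborsClassifier(n_neighbors=1)            # 1-NN: the digit glyphs are near-identical templates
knn.fit(X_train, y_train)

# X_test: digits segmented from the 10 test CAPTCHAs, flattened the same way as X_train
# print(knn.predict(X_test))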

3. Decision Tree and ID3

ID3 uses Information Gain to choose the partition feature.
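For reference, the quantities computed in the script below are the entropy of a node and the information gain of splitting it on a feature $A$:

$$H(D) = -\sum_k p_k \log_2 p_k, \qquad \mathrm{Gain}(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|}\, H(D_v)$$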
Below is the calculation process.

Draw the decision tree and annotate each non-leaf node.

The script I used to calculate the Information Gain:

import numpy as np

def cal_info_entropy(*probs):
    """Information entropy: H = -sum(p * log2(p)); assumes every p > 0."""
    info_e = 0
    for p in probs:
        info_e -= p * np.log2(p)
    return info_e

# Root node: entropy of the whole dataset (200 vs. 250 of 450 samples) and of the Gender / GPA splits
h = cal_info_entropy(4 / 9, 5 / 9)
h_gender = 205 / 450 * cal_info_entropy(105 / 205, 100 / 205) + 245 / 450 * cal_info_entropy(95 / 245, 150 / 245)
h_gpa = 215 / 450 * cal_info_entropy(15 / 215, 200 / 215) + 235 / 450 * cal_info_entropy(185 / 235, 50 / 235)
print('Dataset info:', h)
print('Gender info:', h_gender)
print('GPA info:', h_gpa)
print('GPA info Gain:', h - h_gpa)

# Left child of the GPA split (235 samples): gain of splitting it on Gender
h = cal_info_entropy(185 / 235, 50 / 235)
h_gender = 115 / 235 * cal_info_entropy(95 / 115, 20 / 115) + 120 / 235 * cal_info_entropy(90 / 120, 30 / 120)
print('Gender left info Gain:', h - h_gender)

# Right child of the GPA split (215 samples): gain of splitting it on Gender
h = cal_info_entropy(15 / 215, 200 / 215)
h_gender = 90 / 215 * cal_info_entropy(10 / 90, 80 / 90) + 125 / 215 * cal_info_entropy(5 / 125, 120 / 125)
print('Gender right info Gain:', h - h_gender)

Results:

Dataset info: 0.9910760598382222
Gender info: 0.9798427350133525
GPA info: 0.5643777078470715
GPA info Gain: 0.4266983519911507
Gender left info Gain: 0.006269007038336882
Gender right info Gain: 0.01352135817465705

4. K-Means Clustering

(a) k-means two trials

Black and green points mark the initial and final cluster centers, respectively, in the trial images below.

When k = 2

When k = 3

Conclusion:

(b) Get a stable result using k-means

We can choose the k initial centers to be as far apart from each other as possible. The process (a code sketch follows the list):

  1. randomly choose the 1st cluster center
  2. choose the 2nd center as the point farthest from the 1st center
  3. choose the 3rd center as the point with the largest sum of distances to the 1st and 2nd centers
  4. repeat step 3 until all k centers are chosen
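A minimal sketch of this farthest-point initialization in plain NumPy; the random 2-D data X and the fixed seed are placeholders for illustration:

import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k initial centers, each maximizing its summed distance to those already chosen."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # step 1: a random first center
    while len(centers) < k:
        # summed distance of every point to all centers chosen so far
        dist_sum = np.sum([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist_sum)])   # steps 2-4: take the farthest point
    return np.array(centers)

X = np.random.rand(200, 2)
print(farthest_point_init(X, k=3))

Unlike k-means++, which samples new centers with probability proportional to squared distance, this variant is deterministic once the first center is fixed, which is what makes the final clustering stable across runs.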

Alternatively, we can run Hierarchical Clustering or the Canopy algorithm first and use the resulting cluster centers as the initial centers for k-means.

(c) Run k-means on digit_data.mat
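A sketch of loading the data and running k-means with scipy and scikit-learn; the variable name 'X' inside digit_data.mat is an assumption, so inspect data.keys() for the real one:

import scipy.io
from sklearn.cluster import KMeans

data = scipy.io.loadmat('digit_data.mat')
X = data['X']                      # assumed key; check data.keys() for the actual name
X = X.reshape(len(X), -1)          # flatten each digit image into a feature vector

kmeans = KMeans(n_clusters=10, n_init=10).fit(X)
print(kmeans.labels_[:20])         # cluster assignment of the first 20 digits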

Test 1

Test 2

Conclusion:

(d) Vector quantization

When using Fixed-Length Encoding

When using Huffman Encoding
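The comparison behind these two results can be sketched as follows: after vector quantization maps each pixel (or block) to one of k cluster indices, fixed-length encoding spends ceil(log2 k) bits per index, while Huffman coding spends fewer bits on frequent indices. The random indices below are a stand-in for the real quantized image:

import heapq
import math
from collections import Counter

import numpy as np

k = 8
indices = np.random.randint(0, k, size=100_000)   # stand-in for k-means cluster indices

# Fixed-length cost: every index costs ceil(log2 k) bits
fixed_bits = indices.size * math.ceil(math.log2(k))

# Huffman cost: repeatedly merge the two least frequent nodes; the total
# encoded length equals the sum of all merged-node weights.
freqs = list(Counter(indices.tolist()).values())
heapq.heapify(freqs)
huffman_bits = 0
while len(freqs) > 1:
    a, b = heapq.heappop(freqs), heapq.heappop(freqs)
    huffman_bits += a + b
    heapq.heappush(freqs, a + b)

print('fixed-length bits:', fixed_bits)
print('Huffman bits:     ', huffman_bits)

With uniform random indices the two costs come out nearly equal; on a real quantized image the cluster frequencies are skewed, and that skew is exactly what Huffman encoding exploits.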
