
12. Support Vector Machines

2020-08-21  玄语梨落

Support Vector Machines

Optimization objective

Starting from the logistic regression hypothesis:

h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}

SVM cost function:

\min_\theta C\sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum_{j=1}^n\theta_j^2

SVM hypothesis: predict y=1 if \theta^Tx\ge0, and y=0 otherwise.
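The course only requires cost_1 and cost_0 to be straight-line approximations of the logistic costs; a minimal sketch assuming the common hinge form (the function names and the shape of the costs are my own choice, not fixed by the notes):

```python
import numpy as np

def cost_1(z):
    # cost when y = 1: zero once z = theta^T x >= 1, grows linearly as z falls below 1
    return np.maximum(0, 1 - z)

def cost_0(z):
    # cost when y = 0: zero once z = theta^T x <= -1, grows linearly as z rises above -1
    return np.maximum(0, 1 + z)

def svm_objective(theta, X, y, C):
    # X: (m, n+1) design matrix with a leading bias column, y in {0, 1}, theta: (n+1,)
    z = X @ theta
    data_term = C * np.sum(y * cost_1(z) + (1 - y) * cost_0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)  # the bias theta_0 is not regularized
    return data_term + reg_term
```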

Large Margin Intuition

\min_\theta C\sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum_{j=1}^n\theta_j^2

If y=1, we want \theta^Tx\ge1 (not just \ge0)
If y=0, we want \theta^Tx\le-1 (not just \le0)

If C is too large, the decision boundary will be sensitive to outliers.

The mathematics behind large margin classification (optional)

Vector Inner Product

u^Tv = p\cdot||u|| = u_1v_1+u_2v_2, where p is the (signed) length of the projection of v onto u.

SVM Decision Boundary

\min\limits_\theta\frac{1}{2}\sum\limits_{j=1}^n\theta_j^2=\frac{1}{2}||\theta||^2

subject to \theta^Tx^{(i)}\ge1 if y^{(i)}=1, and \theta^Tx^{(i)}\le-1 if y^{(i)}=0.
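To spell out the projection argument from the lecture: since

\theta^Tx^{(i)} = p^{(i)}\cdot||\theta||

where p^{(i)} is the signed length of the projection of x^{(i)} onto \theta, the constraints become p^{(i)}\cdot||\theta||\ge1 (for y^{(i)}=1) and p^{(i)}\cdot||\theta||\le-1 (for y^{(i)}=0). Keeping ||\theta|| small therefore forces the projections |p^{(i)}| to be large, which is exactly a large margin.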

Kernels I

Non-linear decision boundary:

Given x, compute new features depending on proximity to landmarks l^{(1)}, l^{(2)}, \dots chosen manually.

Kernels and Similarity (Gaussian kernel):

f_1=similarity(x,l^{(1)})=\exp (-\frac{||x-l^{(1)}||^2}{2\sigma^2})=\exp(-\frac{\sum_{j=1}^n(x_j-l_j^{(1)})^2}{2\sigma^2})

If x\approx l^{(1)}, then f_1\approx 1.
If x is far from l^{(1)}, then f_1\approx 0.
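A small numeric sketch of this similarity (the function name gaussian_kernel and the example points are my own, not from the course):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
l1 = np.array([1.0, 2.1])    # close to x  -> similarity near 1
l2 = np.array([10.0, -5.0])  # far from x  -> similarity near 0
print(gaussian_kernel(x, l1), gaussian_kernel(x, l2))
```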


Kernels II

Choosing the landmarks:
Where do the l's come from?
Given (x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)}),
choose l^{(1)}=x^{(1)},l^{(2)}=x^{(2)},\dots,l^{(m)}=x^{(m)}.

For a training example (x^{(i)},y^{(i)}), compute

f_j^{(i)} = sim(x^{(i)},l^{(j)}) for j=1,\dots,m, with f_0^{(i)} = 1
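A sketch of how the feature matrix could be built when the landmarks are the training examples themselves (the helper name make_features and the toy data are mine):

```python
import numpy as np

def make_features(X, landmarks, sigma=1.0):
    # X: (m, n) training inputs, landmarks: (k, n) -> returns F: (m, k+1)
    m = X.shape[0]
    F = np.ones((m, landmarks.shape[0] + 1))  # first column is the bias feature f_0 = 1
    for i in range(m):
        for j in range(landmarks.shape[0]):
            diff = X[i] - landmarks[j]
            F[i, j + 1] = np.exp(-diff @ diff / (2 * sigma ** 2))
    return F

X = np.random.randn(5, 2)
F = make_features(X, landmarks=X)  # choose l^(j) = x^(j)
print(F.shape)  # (5, 6): m examples, m similarity features plus the bias
```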

SVM with Kernels

Hypothesis: Given x, compute features f\in R^{m+1}
Predict 'y=1' if \theta^Tf\ge0
Training: \min\limits_\theta C\sum\limits_{i=1}^m[y^{(i)}cost_1(\theta^Tf^{(i)})+(1-y^{(i)})cost_0(\theta^Tf^{(i)})]+\frac{1}{2}\sum\limits_{j=1}^n\theta_j^2\quad (n=m)

Kernels are usually used with SVMs. Although they can also be used with logistic regression, that combination runs slowly, because the computational tricks that make kernelized SVMs fast do not carry over.

SVM parameters

C : large C means lower bias and higher variance (prone to overfitting); small C means higher bias and lower variance.

\sigma^2 : large \sigma^2 makes the features f_i vary more smoothly, giving higher bias and lower variance; small \sigma^2 gives lower bias and higher variance.

Using an SVM

Need to specify: the choice of parameter C, and the choice of kernel (similarity function), e.g. no kernel ("linear kernel") or the Gaussian kernel (which also needs \sigma^2).

Note: Do perform feature scaling before using the Gaussian kernel.
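To make these two choices and the feature-scaling note concrete, a sketch using scikit-learn; the library, the pipeline, and the toy data are my own choices, not something the notes specify. In sklearn's RBF kernel, gamma corresponds to 1/(2\sigma^2):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# toy data: two features on very different scales
X = np.random.randn(200, 2) * np.array([1.0, 100.0])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

# scale features first, then fit an SVM with a Gaussian (RBF) kernel;
# C plays the role described above, gamma corresponds to 1 / (2 * sigma^2)
model = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf", gamma=0.5))
model.fit(X, y)
print(model.predict(X[:5]))
```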

Other choices of kernel

Not all similarity functions similarity(x,l) make valid kernels; they need to satisfy a technical condition called Mercer's theorem, which ensures that SVM packages' optimizations run correctly and do not diverge.

Many off-the-shelf kernels are available (e.g. polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel).

Multi-class classification

Many SVM packages already have built-in multi-class classification functionality. Otherwise, use the one-vs-all method: train K SVMs, one to distinguish each class from the rest, and pick the class k with the largest (\theta^{(k)})^Tx.
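If a package only handled the binary case, a one-vs-all wrapper could look like the sketch below (sklearn is my choice here, and its SVC/LinearSVC already handle multi-class internally, so this is only to make the idea concrete):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all_fit(X, y, num_classes, C=1.0):
    # train one binary SVM per class: class k vs. the rest
    return [LinearSVC(C=C).fit(X, (y == k).astype(int)) for k in range(num_classes)]

def one_vs_all_predict(models, X):
    # pick the class whose classifier is most confident (largest decision value)
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```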

Logistic regression vs. SVM

n = number of features, m = number of training examples.
If n is large relative to m: use logistic regression, or an SVM without a kernel ("linear kernel").
If n is small and m is intermediate: use an SVM with a Gaussian kernel.
If n is small and m is large: create or add more features, then use logistic regression or an SVM without a kernel.
