Andrew Ng Machine Learning Notes
Notation
n = number of features
m = number of training examples
x(i) = input of ith training example
xj(i) = value of feature j in ith training example
Model
We use x(i) to denote the "input" variables (e.g., living area in the housing example), also called input features.
We use y(i) to denote the "output" or target variable that we are trying to predict.
A pair (x(i),y(i)) is called a training example
A list of m training examples (x(i), y(i)), i = 1, ..., m, is called a training set.
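For linear regression with n features, the hypothesis function takes the form hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn = θᵀx, where x0 = 1 by convention.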
Cost Function
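The cost function measures the accuracy of the hypothesis using a (halved) mean squared error:
J(θ) = 1/(2m) · Σ from i=1 to m of (hθ(x(i)) − y(i))²
The goal is to choose θ so that J(θ) is minimized.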
Gradient Descent
The gradient descent algorithm is: repeat until convergence { θj := θj − α · ∂/∂θj J(θ) }, simultaneously updating every θj, where j = 0, 1 represents the feature index number and α is the learning rate.
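A minimal sketch of batch gradient descent for linear regression in NumPy; the function name, the fixed iteration count, and the default α are illustrative assumptions rather than course notation:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (for theta_0)
    y: (m,) vector of targets
    Returns the learned parameter vector theta of shape (n+1,).
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        predictions = X @ theta                 # h_theta(x) for every example
        gradient = (X.T @ (predictions - y)) / m
        theta -= alpha * gradient               # simultaneous update of all theta_j
    return theta
```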
Feature Scaling
Gradient descent converges faster when each of the input features is in roughly the same range of values. Two techniques to help with this are feature scaling and mean normalization.
- Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
- Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
To implement both techniques, adjust the input values as xi := (xi − μi) / si, where μi is the average of all the values for feature i and si is the range of values (max − min), or alternatively the standard deviation.
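A short NumPy sketch applying both techniques column-wise to a feature matrix; using the standard deviation for si is one of the two options mentioned above:

```python
import numpy as np

def feature_normalize(X):
    """Mean-normalize and scale each feature (column) of X.

    Returns the normalized matrix along with mu and sigma so the same
    transformation can be applied to new examples later.
    """
    mu = X.mean(axis=0)      # average value of each feature
    sigma = X.std(axis=0)    # could also use X.max(axis=0) - X.min(axis=0)
    return (X - mu) / sigma, mu, sigma
```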
Learning Rate
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration and thus may not converge (it may even diverge).
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
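For example, extra columns holding powers of an existing feature can be appended to the design matrix, after which ordinary linear regression (gradient descent or the normal equation) is used unchanged. A small illustrative sketch, with a hypothetical helper name:

```python
import numpy as np

def add_polynomial_features(x, degree=3):
    """Map a single feature x (shape (m,)) to columns [x, x**2, ..., x**degree].

    Feature scaling becomes especially important here, since x**3 can be
    orders of magnitude larger than x.
    """
    return np.column_stack([x ** d for d in range(1, degree + 1)])
```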
Normal Equation
In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to the θj ’s, and setting them to zero. This allows us to find the optimum theta without iteration. The normal equation formula is given below:
In practice, when n exceeds about 10,000 it is a good time to switch from the normal equation to an iterative process such as gradient descent, since computing the inverse of XᵀX costs roughly O(n³).
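A one-line NumPy sketch of the normal equation; np.linalg.pinv is used so the computation still works when XᵀX is non-invertible (e.g., with redundant features):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution theta = pinv(X'X) X'y for linear regression."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```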
Logistic Regression
For multiclass classification (one-vs-all), we predict the probability that 'y' is a member of each one of our classes.
We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
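A compact sketch of one-vs-all prediction, assuming a Theta matrix with one row of already-trained logistic regression parameters per class (the training of each binary classifier is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, X):
    """For every example, pick the class whose classifier returns the highest probability.

    Theta: (num_classes, n+1) matrix, one row of parameters per class
    X:     (m, n+1) design matrix with a leading column of ones
    """
    probabilities = sigmoid(X @ Theta.T)      # (m, num_classes)
    return np.argmax(probabilities, axis=1)   # index of the most confident classifier
```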
Overfitting
The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
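For regularized linear regression the cost function becomes J(θ) = 1/(2m) · [ Σ from i=1 to m of (hθ(x(i)) − y(i))² + λ · Σ from j=1 to n of θj² ], where the penalty term deliberately excludes θ0. If λ is chosen too large, it smooths the hypothesis too much and causes underfitting.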
Neural Networks
Reference
- Coursera - Machine Learning Stanford University