Published 2017-05-11 · Updated 2022-10-30 · Fundamentals / Related Courses · 8-minute read (about 1167 characters)

Coursera ML (10): Machine Learning Diagnostics

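Suppose you are building a machine learning system, or trying to improve the performance of an existing one. What should you try next?

The options available so far: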

Getting more training examples
Trying smaller sets of features
Trying additional features
Trying polynomial features
Increasing or decreasing λ

Each of these options suits a different situation.

Evaluating a Hypothesis

Learn the parameters from the training set, then apply the model to the test set. There are two ways to compute the test error:

For linear regression:
$J_{test}(\Theta) = \dfrac{1}{2m_{test} } \sum_{i=1}^{m_{test} }(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2$
For classification :
The misclassification rate. For each test example, compute:
$err(h_\Theta(x),y) = \begin{cases} 1 & \text{if } (h_\Theta(x) \geq 0.5 \text{ and } y = 0) \text{ or } (h_\Theta(x) < 0.5 \text{ and } y = 1) \\ 0 & \text{otherwise} \end{cases}$
Then take the average over the test set:
$\text{Test Error} = \dfrac{1}{m_{test} } \sum^{m_{test} }_{i=1} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$
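
The two error measures above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the course: the function names are hypothetical, and the inputs are assumed to be lists of model outputs $h_\Theta(x)$ paired with the true labels.

```python
def regression_test_error(predictions, targets):
    """J_test for linear regression: squared error summed, divided by 2 * m_test."""
    m_test = len(targets)
    squared = sum((p - y) ** 2 for p, y in zip(predictions, targets))
    return squared / (2 * m_test)


def classification_test_error(probabilities, labels, threshold=0.5):
    """Fraction of test examples whose thresholded prediction differs from y."""
    errors = 0
    for h, y in zip(probabilities, labels):
        predicted = 1 if h >= threshold else 0  # h >= 0.5 is read as class 1
        if predicted != y:
            errors += 1                          # err(h, y) = 1 for this example
    return errors / len(labels)


# Small hand-made inputs (assumed values, for illustration only)
reg_error = regression_test_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
clf_error = classification_test_error([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
```

In the classification example, the last two predictions fall on the wrong side of 0.5, so two of the four examples count as errors.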

Model Selection and Train/Validation/Test Sets (Cross Validation Set)

Use 60% of the data as the training set, 20% as the cross validation set, and 20% as the test set.
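
One possible way to produce such a 60/20/20 split is shown below. The helper name `split_dataset`, the shuffling step, and the fixed seed are assumptions made for this sketch; the notes only specify the proportions.

```python
import random


def split_dataset(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the examples and cut them into train / cross validation / test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)      # shuffle so the split is not order-dependent
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]           # whatever remains (about 20%)
    return train, cv, test


train, cv, test = split_dataset(range(100))
```

Shuffling before cutting matters when the data is stored in some meaningful order (by date or by class, for example); otherwise the three sets would not follow the same distribution.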

Optimize the parameters in Θ using the training set for each polynomial degree.
Find the polynomial degree d with the least error using the cross validation set.
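Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, where $\Theta^{(d)}$ is the parameter vector learned for the polynomial degree $d$ with the lowest cross validation error.
In short:
Train 10 models on the training set -> compute each model's cross validation error (the value of the cost function) on the cross validation set -> pick the model with the lowest cost -> compute the generalization error (the value of the cost function) of the chosen model on the test set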
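
The selection procedure above might look like this with NumPy. It is a sketch under stated assumptions: the data is synthetic (a noisy quadratic), `np.polyfit` stands in for whatever optimizer is actually used to learn $\Theta$, and the helper names `half_mse` and `select_degree` are hypothetical.

```python
import numpy as np


def half_mse(theta, x, y):
    """J(theta) = 1/(2m) * sum((h(x) - y)^2) for a polynomial hypothesis."""
    residuals = np.polyval(theta, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(y)))


def select_degree(x_train, y_train, x_cv, y_cv, degrees):
    """Fit one model per degree on the training set, pick the lowest J_cv."""
    best = None
    for d in degrees:
        theta = np.polyfit(x_train, y_train, d)   # parameters come from training data only
        j_cv = half_mse(theta, x_cv, y_cv)        # model choice uses the CV set only
        if best is None or j_cv < best[1]:
            best = (d, j_cv, theta)
    return best


# Synthetic data (assumption): quadratic signal plus Gaussian noise, split 60/20/20
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(0, 0.3, 200)
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]

best_degree, best_j_cv, best_theta = select_degree(
    x_train, y_train, x_cv, y_cv, range(1, 11)
)
# The test set is touched once, after the degree has been fixed
j_test = half_mse(best_theta, x_test, y_test)
```

Because the degree was chosen to minimize the cross validation error, `best_j_cv` is an optimistic estimate; `j_test` is the number to report as the generalization error.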

Diagnosing Bias vs. Variance

High bias and high variance

High bias (underfitting): both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$.
High variance (overfitting): $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$.

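In most cases, underfitting shows up as high bias (high error on both sets), while high variance means the model is overfitting.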

Decide Bias or Variance

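Training error and cross validation error are roughly equal (and both high): bias / underfitting
Cross validation error >> training error: variance / overfitting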
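
The rule of thumb above can be written as a small function. The thresholds here are assumptions, not values from the course: `acceptable_error` is whatever error level you would consider good enough, and `gap_ratio` decides how much larger $J_{CV}$ must be before it counts as "much greater" than $J_{train}$.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_ratio=2.0):
    """Rough bias/variance diagnosis from the two error values."""
    # High bias: training error itself is too high and J_cv is close to it
    if j_train > acceptable_error and j_cv <= gap_ratio * j_train:
        return "high bias (underfitting)"
    # High variance: J_cv is much larger than J_train
    if j_cv > gap_ratio * j_train and j_cv > acceptable_error:
        return "high variance (overfitting)"
    return "no clear bias or variance problem"


underfit = diagnose(j_train=0.9, j_cv=1.0, acceptable_error=0.2)
overfit = diagnose(j_train=0.05, j_cv=0.8, acceptable_error=0.2)
healthy = diagnose(j_train=0.05, j_cv=0.06, acceptable_error=0.2)
```

Real error values rarely separate this cleanly, so the output is best read as a hint about which remedies to try first.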

Regularization and Bias/Variance

Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
Create a set of models with different degrees or any other variants.
Iterate through the $\lambda$s and for each $\lambda$ go through all the models to learn some $\Theta$.
Compute the cross validation error $J_{CV}(\Theta)$ using the learned Θ (computed with λ), but evaluate it without the regularization term (that is, with λ = 0).
Select the best combo that produces the lowest error on the cross validation set.
Using the best combo Θ and λ, apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem.
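In short:
Train 12 models with different regularization strengths -> compute each model's error on the cross validation set -> pick the one with the lowest error -> evaluate it on the test set
Conclusions about Regularization
When $\lambda$ is small, the training error is low (overfitting) while the cross validation error is high.
As $\lambda$ increases, the training error keeps increasing (underfitting), while the cross validation error first decreases and then increases.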
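
The λ sweep described in this section could be sketched as follows. Several things here are assumptions for illustration: the data is synthetic, only a single model (a degree-8 polynomial) is swept rather than a set of models, and $\Theta$ is learned with the regularized normal equation instead of an iterative optimizer. The helper names are hypothetical.

```python
import numpy as np


def poly_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)


def fit_regularized(X, y, lam):
    """Regularized normal equation; the intercept theta_0 is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


def unregularized_cost(theta, X, y):
    """J(theta) without the lambda term, used for J_cv and J_test."""
    residuals = X @ theta - y
    return float(residuals @ residuals / (2 * len(y)))


lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

# Synthetic data (assumption): noisy sine curve, split 60/20/20
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 90)
y = np.sin(3 * x) + rng.normal(0, 0.2, 90)
X = poly_features(x, 8)
X_train, y_train = X[:54], y[:54]
X_cv, y_cv = X[54:72], y[54:72]
X_test, y_test = X[72:], y[72:]

results = []
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)      # learn with regularization
    j_cv = unregularized_cost(theta, X_cv, y_cv)        # evaluate without it
    results.append((j_cv, lam, theta))

best_j_cv, best_lam, best_theta = min(results, key=lambda r: r[0])
j_test = unregularized_cost(best_theta, X_test, y_test)
```

Evaluating $J_{CV}$ without the penalty term keeps the comparison fair: otherwise models trained with a larger λ would be charged extra for the size of their parameters rather than for their prediction error.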

Learning Curves

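Learning curves are a useful tool. We often use them to judge whether a learning algorithm is suffering from a bias problem or a variance problem.
A learning curve plots the training error and the cross validation error as functions of the number of training examples (m).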

Experiencing high bias:

Low training set size: causes $J_{train}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.
Large training set size: causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be high with $J_{train}(\Theta) \approx J_{CV}(\Theta)$.
Therefore, with high bias (underfitting), collecting more training examples will not help much by itself. In that case we should add features instead.
Experiencing high variance:
Low training set size: $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be high.
Large training set size: $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta) < J_{CV}(\Theta)$ but the difference between them remains significant.
By contrast, with high variance (overfitting), adding more training examples can noticeably reduce the error and improve the algorithm.
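
The data behind a learning curve can be computed as sketched below. The setup is an assumption chosen to show the high-bias case: a straight line is fit to synthetic quadratic data, so both errors end up high and close together. The helper names are hypothetical, and plotting is left out.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit; lstsq also copes with the very small first subsets."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    residuals = X @ theta - y
    return float(residuals @ residuals / (2 * len(y)))


def learning_curve(X_train, y_train, X_cv, y_cv):
    """Train on the first m examples for m = 1..M and record both errors."""
    sizes, j_train, j_cv = [], [], []
    for m in range(1, len(y_train) + 1):
        theta = fit_linear(X_train[:m], y_train[:m])
        sizes.append(m)
        j_train.append(cost(theta, X_train[:m], y_train[:m]))  # only the m examples used
        j_cv.append(cost(theta, X_cv, y_cv))                   # always the full CV set
    return sizes, j_train, j_cv


# Synthetic high-bias setup (assumption): linear hypothesis, quadratic data
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 140)
y = x ** 2 + rng.normal(0, 0.5, 140)
X = np.column_stack([np.ones_like(x), x])
sizes, j_train, j_cv = learning_curve(X[:100], y[:100], X[100:], y[100:])
```

Note that the training error is measured on the same m examples the model was fit to, while the cross validation error always uses the whole cross validation set. Plotting `j_train` and `j_cv` against `sizes` gives the curves discussed above.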

Deciding What to Do Next

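Get more training examples: fixes high variance
Try a smaller set of features: fixes high variance
Try additional features: fixes high bias
Try adding polynomial features: fixes high bias
Try decreasing the regularization strength λ -> better fit to the training data -> fixes high bias
Try increasing the regularization strength λ -> less overfitting -> fixes high variance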
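
The list above amounts to a lookup from diagnosis to candidate remedies. A minimal sketch, with a hypothetical `REMEDIES` table and `next_steps` helper:

```python
# Remedies grouped by the problem they address, as listed in the notes above
REMEDIES = {
    "high variance": [
        "get more training examples",
        "try a smaller set of features",
        "increase lambda",
    ],
    "high bias": [
        "try additional features",
        "add polynomial features",
        "decrease lambda",
    ],
}


def next_steps(diagnosis):
    """Return the candidate remedies for a diagnosis, or an empty list."""
    return REMEDIES.get(diagnosis, [])


variance_fixes = next_steps("high variance")
bias_fixes = next_steps("high bias")
```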

https://iii.run/archives/cbe707809e44.html

Author: mmmwhy

#Coursera Machine Learning
