Coursera ML (10): Machine Learning Diagnostics

Suppose you are building a machine learning system, or trying to improve the performance of an existing one. How should you proceed?


Common options include:

  • Getting more training examples
  • Trying smaller sets of features
  • Trying additional features
  • Trying polynomial features
  • Increasing or decreasing λ

Each of these options suits a different situation.

Evaluating a Hypothesis

Learn the parameters Θ from the training set, then apply the model to the test set. There are two ways to compute the test error:

  • For linear regression:
    $J_{test}(\Theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test} \right)^2$

  • For classification:
    the proportion of misclassified examples. For each test example, compute:

    $err(h_\Theta(x), y) = \begin{cases} 1 & \text{if } h_\Theta(x) \ge 0.5 \text{ and } y = 0,\ \text{or } h_\Theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}$

    then take the average:

    $\text{Test Error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$
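The two test-error measures above can be sketched in NumPy as follows (a minimal sketch, not from the original post; the function names and the sigmoid hypothesis for classification are my own assumptions):

```python
import numpy as np

def linear_test_error(theta, X_test, y_test):
    """J_test(theta) = 1/(2m) * sum((h(x) - y)^2) for linear regression."""
    m = len(y_test)
    residuals = X_test @ theta - y_test
    return (residuals @ residuals) / (2 * m)

def classification_test_error(theta, X_test, y_test):
    """Fraction of misclassified test examples (0/1 error)."""
    h = 1 / (1 + np.exp(-(X_test @ theta)))  # sigmoid hypothesis (assumed)
    predictions = (h >= 0.5).astype(int)
    return np.mean(predictions != y_test)
```

Both functions assume `X_test` already includes a bias column of ones.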

Model Selection and Train/Validation/Test Sets (Cross-Validation Set)

Use 60% of the data as the training set, 20% as the cross-validation set, and 20% as the test set.

  • Optimize the parameters in Θ using the training set for each polynomial degree.
  • Find the polynomial degree d with the least error using the cross validation set.
  • Estimate the generalization error using the test set: $J_{test}(\Theta^{(d)})$, where d is the degree of the polynomial with the lowest cross-validation error.
    In short:
    train 10 models on the training set -> compute each model's cross-validation error (cost function value) on the cross-validation set -> pick the model with the lowest cross-validation error -> use that model to compute the generalization error (cost function value) on the test set.
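The selection procedure above can be sketched with NumPy's polynomial fitting (a toy sketch under my own assumptions: synthetic sine data, `np.polyfit` as the trainer, degrees 1 to 10):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)

# 60/20/20 split into train / cross-validation / test indices
idx = rng.permutation(100)
tr, cv, te = idx[:60], idx[60:80], idx[80:]

def half_mse(coeffs, xs, ys):
    """Cost-function value: half the mean squared error."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2) / 2

# Fit one model per polynomial degree, using the training set only
models = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 11)}
# Pick the degree with the lowest cross-validation error
best_d = min(models, key=lambda d: half_mse(models[d], x[cv], y[cv]))
# Report the generalization error on the held-out test set
test_error = half_mse(models[best_d], x[te], y[te])
```

Note that the test set is touched exactly once, after the degree has been chosen, so `test_error` remains an unbiased estimate of generalization error.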

Diagnosing Bias vs. Variance

(Figure: high bias vs. high variance)

  • High bias (underfitting): both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$.
  • High variance (overfitting): $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$.

In other words, high bias corresponds to underfitting, while high variance corresponds to overfitting.

Decide Bias or Variance

  • Training error and cross-validation error are both high and close to each other: high bias / underfitting.
  • Cross-validation error >> training error: high variance / overfitting.
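This rule of thumb can be written as a tiny helper (my own sketch; the numeric thresholds `high` and `gap` are illustrative assumptions, not values from the course):

```python
def diagnose(j_train, j_cv, high=1.0, gap=0.5):
    """Rule-of-thumb diagnosis from training and cross-validation error.

    `high` and `gap` are illustrative thresholds; in practice they depend
    on the scale of the cost function for the problem at hand.
    """
    if j_train > high and abs(j_cv - j_train) < gap:
        return "high bias (underfitting)"
    if j_cv - j_train > gap:
        return "high variance (overfitting)"
    return "looks fine"
```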

Regularization and Bias/Variance


  • Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
  • Create a set of models with different degrees or any other variants.
  • Iterate through the $\lambda$s and for each $\lambda$ go through all the models to learn some $\Theta$.
  • Compute the cross validation error $J_{CV}(\Theta)$ using the learned Θ (trained with regularization λ), but evaluate the cost without regularization, i.e. with λ = 0.
  • Select the best combo that produces the lowest error on the cross validation set.
  • Using the best combo Θ and λ, apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem.
    In short:
    train 12 models with different regularization strengths -> compute the cross-validation error of each -> pick the one with the lowest error -> evaluate it on the test set.
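The λ-selection loop can be sketched with closed-form ridge regression (a toy sketch under my own assumptions: synthetic linear data, a normal-equation solver named `ridge_fit`, and the course's list of 12 λ values):

```python
import numpy as np

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

def ridge_fit(X, y, lam):
    """Closed-form regularized linear regression; the intercept is not penalized."""
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0  # do not regularize the bias term
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def cost(theta, X, y):
    """Unregularized cost, used for CV and test error (lambda = 0)."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(90), rng.standard_normal((90, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(90)
Xtr, ytr, Xcv, ycv, Xte, yte = X[:60], y[:60], X[60:80], y[60:80], X[80:], y[80:]

# One learned Theta per lambda; select on CV error, report test error
thetas = {lam: ridge_fit(Xtr, ytr, lam) for lam in lambdas}
best_lam = min(lambdas, key=lambda lam: cost(thetas[lam], Xcv, ycv))
test_error = cost(thetas[best_lam], Xte, yte)
```

The key detail, matching the steps above, is that λ is used only during training: both the cross-validation error and the test error are computed with the unregularized cost.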

    Conclusions about regularization

  • When $\lambda$ is small, the training error is small (overfitting) while the cross-validation error is large.
  • As $\lambda$ increases, the training error keeps increasing (underfitting), while the cross-validation error first decreases and then increases.

Learning Curves

  • Learning curves are a useful tool for diagnosing whether a learning algorithm suffers from a bias or a variance problem.
  • A learning curve plots the training error and the cross-validation error as functions of the number of training examples (m).

Experiencing high bias:

  • Low training set size: causes $J_{train}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.
  • Large training set size: causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be high, with $J_{train}(\Theta) \approx J_{CV}(\Theta)$.
    Therefore, when an algorithm suffers from high bias (underfitting), adding more training data is of little help; instead, we should add more features.

    Experiencing high variance:

  • Low training set size: $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be high.
  • Large training set size: $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta) < J_{CV}(\Theta)$, but the difference between them remains significant.
    By contrast, when an algorithm suffers from high variance (overfitting), adding more training data clearly reduces the error and improves performance.
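A learning curve can be computed by retraining on the first m examples for growing m and recording both errors (a minimal sketch on synthetic linear data; the split sizes and step of 7 are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(120), rng.standard_normal(120)])
y = 3 * X[:, 1] + rng.standard_normal(120)
Xtr, ytr, Xcv, ycv = X[:100], y[:100], X[100:], y[100:]

def fit(X, y):
    """Ordinary least squares via numpy.linalg.lstsq."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cost(theta, X, y):
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

# Train on the first m examples and record both errors for each m
train_err, cv_err = [], []
for m in range(2, 101, 7):
    theta = fit(Xtr[:m], ytr[:m])
    train_err.append(cost(theta, Xtr[:m], ytr[:m]))  # error on the m used examples
    cv_err.append(cost(theta, Xcv, ycv))             # error on the full CV set
```

Plotting `train_err` and `cv_err` against m then shows the characteristic shapes described above: for this well-specified model the two curves converge, whereas a high-variance model would leave a persistent gap.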

Deciding What to Do Next

  • Get more training examples -> fixes high variance
  • Try a smaller set of features -> fixes high variance
  • Try getting additional features -> fixes high bias
  • Try adding polynomial features -> fixes high bias
  • Try decreasing the regularization parameter λ -> fit the training data more closely -> fixes high bias
  • Try increasing the regularization parameter λ -> prevent overfitting -> fixes high variance

https://iii.run/archives/cbe707809e44.html

Author: mmmwhy
Published: 2017-05-11
Updated: 2022-10-30
