Why do we need to find sparse models in machine learning? (Part 1)
2022-02-25 辽阔天空 8797
Main text

Why do we need to find sparse models in machine learning?

Comments
Vitali Zagorodnov, Executive Director at Pupsik Studio (2013-present)

There are quite a few reasons:
1) sparse models contain fewer features and hence are easier to train on limited data. Fewer features also mean less chance of overfitting.
2) fewer features also make the model easier to explain to users, as only the most meaningful features remain
3) in face recognition, sparse models provide a unique way to recognize a face from a database of profiles taken under different orientations
4) in MRI, sparse models promise faster image acquisition.
5) sparse models allow using something called an over-complete representation, where two or more non-orthogonal basis sets are mixed. This is useful when none of the basis sets alone is ideal.

Prasoon Goyal, Have been working in Machine Learning for a few years

Other than the reasons already mentioned, sparse models are more memory-efficient and faster.
As a simple example, consider a non-linear SVM that's trained on 1 million data points. You will then have 1 million dual variables, and prediction requires taking a linear combination of all of them.
Now, if only 1% of these are non-zero, then you only need to store the non-zero values. So if you deploy your model on a low-memory device, such as a mobile phone or a drone, you need much less disk space. Second, during prediction, you need less RAM to load the model. Finally, instead of 1 million computations, you need to perform only 10 thousand.
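A minimal NumPy sketch of this point, assuming (purely for illustration) an already-trained SVM with an RBF kernel: only the roughly 1% non-zero dual coefficients and their training points (the support vectors) need to be stored and summed over at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_features = 1_000_000, 20

# One dual variable per training point, but only ~1% end up non-zero.
alpha = np.zeros(n_train)
support_idx = rng.choice(n_train, size=n_train // 100, replace=False)
alpha[support_idx] = rng.normal(size=support_idx.size)

# Sparse storage: keep only the ~10k non-zero coefficients and their points.
support_alpha = alpha[support_idx]
support_vectors = rng.normal(size=(support_idx.size, n_features))  # stand-in data

def rbf_kernel(X, x, gamma=0.1):
    """RBF kernel between each row of X and a single point x."""
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def predict(x_new, bias=0.0):
    # Kernel expansion over ~10k support vectors instead of all 1M points.
    return np.sign(support_alpha @ rbf_kernel(support_vectors, x_new) + bias)

print(predict(rng.normal(size=n_features)))
```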

Mikael Rusin, studied at Lund University

I think sparsity inherently entails differentiated matrix parsing and differentiated mathematical convex accounting.
Meaning the way you approach the complexity of parsing matrices - the way you approach the idea of parsing data to begin with, akin to error accounting - and the amount of data clustering you need to account for in each layer of the hyperdimension of the kernel.
Overall, I'd think it has to do with new optimization in terms of parsing structures, accounting for zero gradient declines and whatnot.
I mean, as soon as you start to run factor analysis with complexity above a certain level - i.e., more complexity and noisier kernels - the more issues you are going to have due to error factors.
Past this, I also think it has to do with the variance rate.
If you just have a sparse model, you need not account for the same intensity of complexity in terms of variance accounting, since the model is very sparse to begin with.
You have your clusters and you work on those; you need not run into overfitting.
Overfitting is a common recurring theme in ML.
There's also the inherent issue of the number of variables needed.
Like, the inherent need to account for the input set relative to the output as you go along is a differential that is very hard to imagine, in my opinion.
See, the problem is that if we change one set of hyperparameters, the means can change, the overfitting, the convergence, etc.
It's super convoluted.
So we are attempting to break down and derive new models that make sense from sparse methods instead.

Arjun Gowda, Masters, Computer Science from University of Queensland (2017)

What are the common disadvantages of having a sparse data set when creating an ML model?
Common disadvantages would be (not exactly for the learning itself, but maybe from a resource-availability perspective):
Limited memory
A combination of such constraints makes it impossible to run the algorithms at times, or makes them really slow. You can evaluate the situation, though.
Sparsity is not always a disadvantage, even though we talk about "The Curse of Dimensionality". In fact, it gives more insight into the estimation of the model.
You could use PCA, LDA, Autoencoders etc. to reduce the dimensionality.
Handling high dimensional data is tricky and sometimes impossible if the chosen algorithm requires whole of the data to be in the memory.
However, there are "online learning" methods, such as stochastic gradient descent, which do not require you to load the whole dataset into memory (a sketch follows below).
(PS: This is just from my understanding! There may be a lot of other things affecting the process.)
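As a hedged illustration of that online-learning point (scikit-learn and the chunked data generator are illustrative assumptions, not something from the answer): SGDClassifier can be fed the data one chunk at a time via partial_fit, so the full dataset never has to sit in memory.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def data_chunks(n_chunks=100, chunk_size=1_000, n_features=50, seed=0):
    """Stand-in for streaming a large dataset from disk, chunk by chunk."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X[:, 0] + 0.1 * rng.normal(size=chunk_size) > 0).astype(int)
        yield X, y

clf = SGDClassifier()                      # linear model trained by stochastic gradient descent
classes = np.array([0, 1])                 # must be declared on the first partial_fit call
for X_chunk, y_chunk in data_chunks():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```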

Anonymous
Must we avoid sparse and high dimensional features in machine learning models?
No, it’s not a must, but it can help.
Intuitively, excessive dimensionality is problematic: it is difficult for anybody — a machine or human — to make predictions about items that are unlike any it has ever seen before.
If you have more dimensions (features) than training examples, that means some combinations of feature values will never have been seen before. So as a rule of thumb it's good to have orders of magnitude more training data (rows) than features (columns).
There are ways to encode sparse features and reduce dimensionality, but another way is to just find or create a lot more training data.
At the end of the day it really depends on the problem and how you are optimising across accuracy, performance, engineering work and other costs.

Kien Huynh, I know some ML

Why are sparse autoencoders sparse?
You should read the lecture notes from Prof. Andrew Ng. It is available here. He explained it quite well.
Basically, when you train an autoencoder, the hidden units in the middle layer would fire (activate) too frequently for most training samples. We don't want this characteristic. We want to lower their activation rate so that each unit only activates for a small fraction of the training examples. This constraint is also called the sparsity constraint. It is sparse because each unit only activates for a certain type of input, not all of them.
Why is the sparsity constraint important? Think of it in terms of a jack of all trades. If a person can do many jobs, from A, B, C… to Z, then generally he/she is not a master of any of them, while someone who does only A or B their entire life can become a master. Similarly, if a neuron unit is forced to fire for whatever training samples it is fed, even if those training samples are vastly different, then that unit will not work well for all of those samples.
Here are 100 images that would maximally activate 100 trained hidden units:
If you look at the first image (first row, first column), you can see that this 1st unit is only activated strongly if the input has some sort of sharp, diagonal edge in it. If you feed it a horizontal edge, it would not activate much. Since it would only maximally respond to this type of edge, and not all training samples have it, we should be confident to call it sparse enough. The same goes for the rest of the units.
By putting the KL-Divergence into the objective function, we can force the units to only activate to a small fraction of the training samples. If you read further into the lecture notes, you can see how beautiful the KL-Divergence is.
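A small NumPy sketch of the sparsity penalty described above, following the formulation in those lecture notes: rho is the target average activation, rho_hat is each hidden unit's observed average activation over a batch, and the KL divergence between the two is added to the reconstruction loss (the batch of activations here is made up for illustration).

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05):
    """hidden_activations: (batch_size, n_hidden) sigmoid outputs in (0, 1)."""
    rho_hat = hidden_activations.mean(axis=0)   # average firing rate of each unit
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return kl.sum()                             # scaled by a weight and added to the loss

rng = np.random.default_rng(0)
activations = rng.uniform(0.01, 0.99, size=(256, 100))   # hypothetical hidden-layer outputs
print(kl_sparsity_penalty(activations))
```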

Elena Sergeev, grad student in applied ML

What are the advantages of using sparse representation in machine learning, especially in deep learning models?
All right, I will try (There is a chance I have forgotten to mention something important so please feel free to add more)
1) Computational considerations:
a) You can store sparse matrices efficiently, and for some methods you have to have your whole data matrix in active memory.
b) Since sparsity sort of implies that there are a lot of multiplications by zero, it lessens the number of computations you have to perform (since multiplying by zero gives zero anyway).
2) Result related considerations:
a) Everything correlates with everything due to noise, distant dependencies, etc. These nuisance dependencies, assuming there are a lot of them, can overwhelm the true dependencies and have an undue influence on the result. You can think of sparsity constraints as a way to do feature selection for your examples.
P.S. This does not always mean sparse representations are intrinsically better than non-sparse ones. In fact, I have seen a recent paper on useful overcomplete representations.
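A quick illustration of points 1a and 1b, using SciPy's sparse matrices as one concrete (assumed) implementation: CSR format stores only the non-zero entries, and a matrix-vector product touches only those entries.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.normal(size=(10_000, 1_000))
dense[rng.random(dense.shape) < 0.99] = 0.0     # make ~99% of the entries zero

csr = sparse.csr_matrix(dense)                  # compressed sparse row storage
print("dense bytes: ", dense.nbytes)
print("sparse bytes:", csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)

v = rng.normal(size=1_000)
y = csr @ v   # only the ~1% non-zero entries participate in the product
```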

Yisong Yue, Machine Learning Professor @Caltech
What machine learning theory do I need to know in order to be a successful machine learning practitioner?
The list below probably isn't exhaustive, but contains the first things I thought of. So hopefully, they're the most important/fundamental ones! (Although I'm sure I missed some.)

Statistical Learning Theory:
Overfitting -- A central concept in machine learning is that of overfitting. Roughly speaking, overfitting happens when you train a model that captures the idiosyncrasies of your training data. A model that overfits to training data cannot generalize well to new unseen test examples, which is ultimately what we want most machine learning models to do.
Generalization error -- One way to quantify overfitting is through generalization error. Roughly speaking, generalization error measures the gap between the error on the training set and the error on the test set. Thus, the larger the generalization error, the more the model is overfitting.
Bias–variance tradeoff -- Sometimes it's OK if the model you train overfits, so long as the generalization error is not too large. For example, if you train a complex model that achieves 0.2 error on the training set and 0.5 error on the test set, that might be preferable to a simple model that achieves 0.5 error on the training set and 0.6 error on the test set. Even though the simple model overfits less, it is so simple that it still performs worse on the test set than the complex model that overfits more (the sketch below gives a small numerical example). The bias-variance tradeoff is a way of reasoning about this issue: when does it make sense to use a more complex model even though it overfits more?
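A hedged sketch of these two ideas, using scikit-learn and a made-up data-generating function purely for illustration: the more flexible model has a larger train/test gap (more overfitting), yet may still be the better choice if its test error is lower.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=200)      # made-up ground truth + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 15):                                 # a simple and a complex model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}: train {train_err:.3f}  test {test_err:.3f}  gap {test_err - train_err:.3f}")
```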

Empirical risk minimization -- When most people think of machine learning, they're probably thinking of empirical risk minimization. That is, they want a model that achieves low error on some training set. However, it is important to keep in mind what the assumptions are of empirical risk minimization. Most notably, that the training set is sampled independently from the test distribution you really care about. If this assumption is violated, you can get machine learning models that don't behave in the way you want (cf. Algorithms and Bias: Q. and A. With Cynthia Dwork).

Cross-validation (statistics) -- Typically, one cannot test on data that one trained on, so one must split the existing data into training and test sets. However, this is statistically wasteful and also increases the variability, since you aren't testing on every data point at your disposal. Cross-validation is a way of getting around that by rotating which data points are in the training versus test sets.
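A minimal illustration with scikit-learn (the model and synthetic data are placeholder choices): with 5-fold cross-validation, every data point is used for testing exactly once as the folds rotate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)                      # toy labels

scores = cross_val_score(LogisticRegression(), X, y, cv=5)   # 5 rotating train/test splits
print("per-fold accuracy:", np.round(scores, 3), " mean:", round(scores.mean(), 3))
```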

Confidence interval -- The most direct quantitative way to compare two models is by looking at their respective test errors (e.g., via cross validation). However, how do we know if two numbers actually reflect meaningful differences between the two models or are just due to some spurious effects caused by a finite sample size? Confidence intervals are the most common way to deal with this issue.

Statistical hypothesis testing -- A related concept to confidence intervals is statistical hypothesis testing. The most common thing to use this for is to answer whether two models have statistically distinguishable accuracies. The way statistical hypothesis testing is typically implemented involves using confidence intervals and setting the size of the confidence intervals at an appropriate width w.r.t. the desired statistical significance level.
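One common way to do this in practice, sketched here with SciPy and scikit-learn as assumed tools (the models, the data, and the conventional 0.05 threshold are illustrative choices), is a paired t-test over the two models' per-fold cross-validation scores:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

scores_a = cross_val_score(LogisticRegression(), X, y, cv=10)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

stat, p_value = ttest_rel(scores_a, scores_b)   # paired: the same folds are used for both models
print(f"p-value: {p_value:.3f}",
      "(statistically distinguishable)" if p_value < 0.05 else "(not statistically distinguishable)")
```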

Bootstrapping (statistics) -- Another way of evaluating the variability of the model is via bootstrapping, which effectively samples from the training set with replacement to generate new training sets that are statistically similar to the original training set.
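A small NumPy/scikit-learn sketch of that idea (the linear model and synthetic data are placeholders): resample the training set with replacement, refit on each resample, and look at how much the fitted model varies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

coefs = []
for _ in range(500):                                   # 500 bootstrap resamples
    idx = rng.integers(0, len(X), size=len(X))         # sample rows with replacement
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)

print("coefficient std across bootstrap fits:", np.round(np.std(coefs, axis=0), 3))
```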

Statistical Modeling:
Metrics | Kaggle -- It's important to understand what your metric of choice is for whatever modeling problem you're solving. For some tasks, you only care that your model can make good predictions at the top (e.g., ranking in web search).
Regularization (mathematics) -- Regularization serves two purposes. First, it is commonly used to control for overfitting so that the learned model is not too complex. Second, different choices of regularization reflect different assumptions about what "simple" means. For instance, using L1 regularization encourages sparsity in the trained model, and interprets simple as having few non-zero parameters. On the other hand, using L2 regularization encourages the norm of the learned model to be low, and interprets simple as having a small magnitude.
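A hedged sketch of the L1-vs-L2 contrast, using scikit-learn's Lasso and Ridge (the synthetic data and the alpha values are arbitrary illustrative choices): the L1-regularized fit drives most coefficients exactly to zero, while the L2-regularized fit merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]        # only 5 of the 50 features matter
y = X @ true_w + 0.5 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)              # L1 regularization
ridge = Ridge(alpha=1.0).fit(X, y)              # L2 regularization
print("non-zero coefficients with L1:", int(np.sum(lasso.coef_ != 0)))
print("non-zero coefficients with L2:", int(np.sum(ridge.coef_ != 0)))
```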

Machine learning Types and Tasks -- People often think of supervised learning when they think of machine learning (and in fact most of the topics listed here are described through the lens of supervised learning). But supervised learning is not the only learning setup. Others include unsupervised learning, semi-supervised learning, transductive learning, etc. It's important to understand what kind of learning problem you're dealing with. If you would rather deal with the supervised learning problem, that often means throwing away data for which you do not have labels. Sometimes that's good, and sometimes that's not so good.
Correlation vs. causation -- It's important to keep in mind when inspecting a learned model that many things learned by the model are purely correlation and should not be interpreted causally.

Optimization:
Stochastic gradient descent -- Most machine learning models are trained via some form of stochastic gradient descent. It's generally useful to understand when different methods work well, so you can train your models more efficiently.
Nesterov’s Accelerated Gradient Descent -- One important concept in gradient descent is momentum, of which Nesterov's method is arguably the most beautiful instance. Momentum is typically extremely useful for speeding up training.
Convex analysis -- It's important to understand when your learning problem is convex versus non-convex. Convex learning problems always converge to the same optimal model, so you don't have to be too careful about how you train (apart from speed considerations). Non-convex learning problems can get stuck in local optima, and so the model you get back can vary greatly. As such, it's often important to be careful about how you initialize the non-convex learning problem.
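A small NumPy sketch of the momentum idea above, applied to a convex quadratic (the step size and constant momentum coefficient are illustrative choices, not a prescription): the gradient is evaluated at a "looked-ahead" point, which is the characteristic ingredient of Nesterov's method.

```python
import numpy as np

# Convex quadratic f(w) = 0.5 * w^T A w - b^T w, with A symmetric positive definite.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

w, w_prev = np.zeros(2), np.zeros(2)
lr, momentum = 0.1, 0.9

for _ in range(200):
    lookahead = w + momentum * (w - w_prev)   # Nesterov: look ahead along the momentum direction
    w_prev = w
    w = lookahead - lr * grad(lookahead)      # gradient step taken at the look-ahead point

print("gradient descent with Nesterov momentum:", w)
print("closed-form optimum:                    ", np.linalg.solve(A, b))
```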

Linear Algebra:
Norm (mathematics) -- Norms are used a lot in machine learning. For instance, many regularization formulations are written as norms. Understanding the behavior of different norms will help you in deciding which kind of regularization you want to impose.
Matrix (mathematics) -- A lot of times, data and models are expressed using matrices. Sometimes, you can save a lot of computation time by being clever about how you order the matrix operations. Other times, you can figure out how to transform your data into the format that some learning toolkit uses by using matrix transforms.
The Statistical Whitening Transform -- One particularly useful approach for standardizing your data is whitening. It's good to understand what assumptions are built into whitening, so that you'll have a good sense of when whitening will and won't work.
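A compact NumPy sketch of one common form of the whitening transform (ZCA-style, via an eigendecomposition of the covariance matrix); the small eps and the synthetic correlated data are illustrative assumptions.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """ZCA-style whitening: return data whose covariance is (approximately) the identity."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T   # whitening matrix
    return X_centered @ W

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(1_000, 3)) @ mixing              # correlated, differently scaled features
print(np.round(np.cov(whiten(X), rowvar=False), 2))   # close to the identity matrix
```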

Outlook:
These days, there are a lot of tools being developed that can automate away a lot of the issues described above, and will thus make machine learning more intuitive & easier to use for more people. For example, many machine learning packages already do cross validation automatically. However, those tools are far from perfect, and so having a solid grasp of the theoretical fundamentals will be very beneficial in the long run, because it'll allow you to more intelligently use and compose the existing tools to achieve whatever data modeling task you're trying to solve.
