Why do we need to find sparse models in machine learning? (Part 2)

Original question: Why do we need to find sparse models in machine learning?

Translated comments
Patrick Senti, Founder@omegaml.io - productizing data science (2018-present)
First of all let’s be specific what machine learning is — and most importantly what it is not.
It’s all math, not magic.
There is no intelligence in those machine learning algorithms. Just math. And lots of automation and data processing.
The intelligence part is entirely provided by humans. The rest is math.
Any decision taken by a machine learning algorithm, however magical that may seem, is an application of its programming, done by humans. It's not the machine taking decisions; it's humans telling it when and how (by which criteria) to take one decision over another. We might call that very sophisticated programming. Note, however, that unlike traditional 'fixed rules' programming, in machine learning it is only the data presented to the algorithm that eventually finalizes the rules.
Any other claims regarding seemingly intelligent products are marketing and PR. Yes, it’s truly amazing what math combined with data & automation can achieve. Still, no known algorithm is intelligent.
How it works
With that out of the way, here's the gist of how machine learning works, in layman's terms:
Learning refers to the process by which the machine uses a given set of data to fill in the blanks in one or more math formulas of some sort.
This process is a painstakingly defined step-by-step calculation of some sort, called an algorithm.
It works something like this. The algorithm looks at the data (e.g. using statistics) and fills in the blanks in a given formula.
It does this repeatedly until it has filled in the blanks in such a way that the results of its computations match the expected values (give or take). This process is called training.
Once the training is complete (all blanks filled and answers match expectations) we can present the algorithm with new data and it will compute a result. That's called prediction.
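The fill-in-the-blank loop above can be sketched in a few lines of Python. The `train`/`predict` names and the simple error-nudging update rule are illustrative, not taken from any particular library:

```python
# The "blank" is w in the formula y = w * x; training nudges it until the
# formula's outputs match the example data.

def train(examples, steps=1000, lr=0.01):
    w = 0.0                            # the blank, initially unknown
    for _ in range(steps):
        for x, y in examples:
            error = w * x - y          # how far off the current guess is
            w -= lr * error * x        # nudge w to shrink the error
    return w

def predict(w, x):
    # Once training is done, apply the filled-in formula to new data.
    return w * x

w = train([(1, 2), (2, 4), (3, 6), (4, 8)])
print(round(w, 3))                 # ≈ 2.0
print(round(predict(w, 5), 3))     # ≈ 10.0
```

Training here is the repeated loop; prediction is a single application of the finished formula.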
Example
A simple example is to figure out what value should be inserted in this formula: y = __ × x to match the following example data: (x=1, y=2), (x=2, y=4), (x=3, y=6), (x=4, y=8). There is a specific algorithm to solve this, called Linear Regression, but the example is simple enough to do in your head (spoiler alert: the correct value is 2). Now consider how this changes with a different set of examples: (x=4, y=12), (x=9, y=27), (x=15, y=35). What is the blank's value now?
Data is the key ingredient
As can be seen from this simple example, the same formula can produce very different results depending on how the blanks are filled, which in turn depends on the data presented during training. That's why getting a good set of data is of utmost importance. In fact, that's the most difficult part of applying machine learning.
Note that while this example works out perfectly and there is only one answer (for each dataset), this is usually not the case.
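For readers who want to check both answers, a minimal least-squares sketch (assuming the no-intercept model y = w × x) fits the blank for each dataset. Note that no single value fits the second dataset exactly, so least squares settles on a compromise:

```python
import numpy as np

def fit_slope(pairs):
    """Closed-form least-squares value for the blank w in y = w * x."""
    x = np.array([p[0] for p in pairs], dtype=float)
    y = np.array([p[1] for p in pairs], dtype=float)
    return float(x @ y / (x @ x))

print(fit_slope([(1, 2), (2, 4), (3, 6), (4, 8)]))        # 2.0, an exact fit
print(round(fit_slope([(4, 12), (9, 27), (15, 35)]), 3))  # ~2.534, a compromise
```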
No silver bullet
In real-life applications, machine learning is much more complex, because not all data is as nicely presented or clear-cut as in this intentionally simple example, and because some tasks where machine learning is applied are so complex that there is no right or wrong, only better or worse choices (e.g. in self-driving cars).
Colleen Farrelly, Data Scientist/Poet/Social Scientist/Topologist (2009-present)
What are some good books to learn sparse modeling?
Not sure on books, but there are a lot of papers covering this on Google Scholar and ArXiv. I'd suggest looking at some of the foundational papers for Ridge Regression and LASSO, as well as elastic net, as these were developed for sparse modeling. I have a few of those papers linked to this PPT if you'd like a quick reference sheet that explains the principles of sparse modeling and has the main associated papers: https://www.slideshare.net/ColleenFarrelly/machine-learning-by-analogy-59094152
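As a quick taste of what those papers formalize, here is a hedged sketch of LASSO via coordinate descent in plain NumPy (the function names and toy data are invented for illustration). The point is that the L1 penalty drives most coefficients to exactly zero, which is what makes the fitted model "sparse":

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, iters=200):
    """Coordinate descent for 0.5/n * ||y - Xw||^2 + alpha * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(iters):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]   # residual with feature j removed
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # 10 features, only 2 matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)
w = lasso_cd(X, y, alpha=0.1)
print(np.round(w, 2))  # most coefficients come out exactly 0
```

Ridge regression uses the same setup with a squared (L2) penalty, which shrinks coefficients but does not zero them; elastic net blends the two.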
Pragathi, AI & ML at Royal Dutch Shell (2021-present)
Generally, for sparse high-dimensional data, XGBoost works great as it provides a separate path for sparse data; better still, you can try the LightGBM model. Boruta works best, but it takes time for feature selection on sparse data. You can also try other methods like VIF, which may provide better results with reduced dimensions.
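VIF here is the variance inflation factor, a collinearity diagnostic. A small illustrative sketch (the data and function name are made up) computes it by regressing each column on the others; columns that are nearly redundant get large VIFs and are candidates for removal:

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))           # VIF_j = 1 / (1 - R_j^2)
    return np.array(out)

rng = np.random.default_rng(1)
a, b = rng.normal(size=300), rng.normal(size=300)
X = np.column_stack([a, b, a + 0.05 * rng.normal(size=300)])  # col 2 ≈ col 0
print(np.round(vif(X), 1))  # columns 0 and 2 get huge VIFs; column 1 stays ~1
```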
Clem Wang, Practicing Feature Engineer for over 15 years.
You're on dangerous ground when the feature dimension is larger than the number of examples in data set.
If the problem is looked at as N linear equations in M unknowns, where N < M, the results are under constrained and you'll have a theoretical set of solutions. In practice, the solutions are disturbingly uncertain.
Machine Learning can't magically solve problems like this…
You have to either try to get more training set data or identify and remove “useless” features. I say “useless” because you won't know for sure without more training data.
Another possible approach is to see if you can meaningfully combine features, and remove the input features and replace them with the new, “synthetic” feature. That means you need a good understanding of the problem Domain to know how Features might be related.
Your situation illustrates the importance of having good training sets: ML can't magically make something out of nothing.
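The N < M situation can be seen directly with NumPy. In this toy system (the numbers are arbitrary) 2 equations in 4 unknowns are fit perfectly by infinitely many weight vectors, which is exactly the "disturbingly uncertain" part:

```python
import numpy as np

# N=2 equations (training examples) in M=4 unknowns (features).
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 2.0]])
y = np.array([4.0, 7.0])

w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(A @ w)                             # fits the data exactly (≈ [4, 7])

# Shift w along the null space of A: a different model, equally perfect fit.
null_basis = np.linalg.svd(A)[2][2:].T   # basis of the 2-dim null space
w2 = w + null_basis @ np.array([5.0, -3.0])
print(A @ w2)                            # still ≈ [4, 7], yet w2 != w
```

`lstsq` happens to return the minimum-norm solution, but nothing in the data favors it over `w2`; more examples (rows) are what pin the solution down.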
Original translation: 龙腾网 https://www.ltaaa.cn. Please credit the source when reposting.
Andrew Morgan, Head of Data Science and Engineering
What is high dimensional sparse data?
While the other answer explains the structural side very well, I'll try to explain to a non-technical audience how they create high-dimensional sparse data all the time, as this is an important concept.
Now, I’m sure you’ve been grocery shopping in a large supermarket, and I’ll assume you’ve taken a look at your receipt.
This is highly dimensional sparse data.
The question is Why?
Each item on the shelves of the shop has a barcode. That’s how you scan your shopping when you pay, right?
The electronic till and the scanner have a lookup table of barcodes, and that lookup table details the numeric barcode, the human description of the product, the price it is sold at, perhaps whether it incurs sales tax, as well as other things. This could be quite large, as it describes everything they sell in the shop.
Now, if I had all the receipt data for all customers, and I wanted to compare your shopping behaviour to other people's, I would build the following very, very large matrix (which you can think of as a very big spreadsheet):
We have a column for each barcode. We have a row for each customer. This will be one very big spreadsheet! Imagine there are something like 300,000 barcodes in a large Walmart, for example (columns). There might be 15 million customers (rows).
Now, getting back to your grocery receipt, imagine you scroll down to find your customer ID in that spreadsheet, then move across the columns putting in a zero if you didn’t buy that product, or putting in the number items you did buy that has that same barcode.
You probably didn’t buy 300,000 things on your trip shopping, so it’s pretty obvious most of the cells would have a zero in them for you, and this is the case for everyone else too. If you bought 4 cans of soup, that one column for your row would have a count of 4 in the cell, meaning you bought 4 cans of that soup.
We use the word Sparse to describe this situation where most cells are zero for everyone.
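A minimal sketch of how such a matrix is actually stored (the customer IDs and barcode names below are made up): keep only the nonzero cells, and treat every missing cell as an implicit zero:

```python
# The full matrix would have 15,000,000 x 300,000 = 4.5 trillion cells.
basket = {}  # (customer_id, barcode) -> count; only nonzero cells are stored

def record_purchase(customer, barcode, count):
    basket[(customer, barcode)] = basket.get((customer, barcode), 0) + count

def cell(customer, barcode):
    return basket.get((customer, barcode), 0)  # missing cells read as zero

record_purchase(0, "soup_123", 4)   # the 4 cans of soup from the example
record_purchase(0, "milk_987", 1)
record_purchase(1, "soup_123", 2)

print(cell(0, "soup_123"))   # 4
print(cell(0, "bread_555"))  # 0, even though it was never stored
print(len(basket))           # 3 stored cells out of 4.5 trillion
```

Real sparse-matrix libraries (e.g. compressed row formats) use the same idea with more compact bookkeeping.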
Håkon Hapnes Strand, CTO and Machine Learning Engineer
Say I have trained a Random Forest and want to deploy it to make predictions online. How do I update it with the arrival of new data?
You’re talking about online learning.
It’s not very common in practice. If you have enough training data, you can train once offline.
In practice, models degrade over time, and you might not have as much training data as you’d like yet. That means online learning is a good idea in principle.
The reason it’s not very common is that it’s fairly complicated to implement. It’s very easy to just train your model, deploy it and forget about it.
Like Meir Maor says in his answer, decision trees have their own obstacles that make online learning more difficult. A compromise could be to re-train your model in scheduled batches. That way, you take new data into account without having to invent a way of updating your random forest model incrementally. This approach still requires some sort of model management system and plenty of automation.
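The scheduled-batch compromise might be sketched like this. The class name, the rolling-window policy, and the toy train function are all invented for illustration; any real model fit (e.g. a random forest) would slot in as `train_fn`:

```python
from collections import deque

class BatchRetrainer:
    """Re-train a model on a schedule instead of on every new observation."""

    def __init__(self, train_fn, window=1000):
        self.train_fn = train_fn
        self.buffer = deque(maxlen=window)  # rolling window of recent data
        self.model = None
        self.version = 0                    # minimal model management

    def observe(self, x, y):
        self.buffer.append((x, y))          # data arrives between retrains

    def retrain(self):
        # Called by a scheduler (e.g. nightly), not once per observation.
        self.model = self.train_fn(list(self.buffer))
        self.version += 1
        return self.model

# Toy train_fn: the "model" is just the mean label, standing in for a forest.
r = BatchRetrainer(train_fn=lambda data: sum(y for _, y in data) / len(data))
r.observe([1, 2], 10.0)
r.observe([3, 4], 20.0)
print(r.retrain(), r.version)  # 15.0 1
```

Versioning each retrained model makes it possible to roll back if a new batch degrades predictions.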
Jeremy Cox, Computer Science
What is a clear explanation of data sparsity?
Sparsity and Density go hand in hand:
If data is meaningful / useful / not random, you will have regions where data points come together and cluster, and you will have areas they avoid coming together.
One way to think of sparsity is that most of the space is empty (say 60%), whereas the rest of the space (40%) is dense, or filled.
Now this is important, because as more and more variables are added to a database, the points spread further apart. This makes the dense bubbles smaller and smaller until the data is totally uniform, with even spacing between every point. At that point, everything looks the same, the statistics say it's the same, and you really can't do anything with it.
So Data Scientists will look for ways to maximize sparsity so that they can get good clusters or well defined answers to their questions.
Did you have a specific application or context in mind? It's a little bit of a broad question. We can try to talk specifics.
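One concrete way to see the "everything looks the same" effect (a sketch, with made-up sample sizes): as the dimension grows, the pairwise distances between random points concentrate around a single value, so their relative spread shrinks toward zero:

```python
import numpy as np

def distance_spread(dim, n=200, seed=0):
    """Relative spread (std/mean) of pairwise distances of n random points."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, dim))
    sq = (pts ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * pts @ pts.T     # squared distances
    d = np.sqrt(np.maximum(d2[np.triu_indices(n, k=1)], 0.0))
    return float(d.std() / d.mean())

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_spread(dim), 3))  # the ratio shrinks with dim
```

When the ratio is near zero, every point is roughly equidistant from every other, so clustering has nothing to grab onto.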