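Quora discussion | How should missing values in categorical variables be handled?
I have asked many people, including statistics experts, how missing values in categorical variables should be handled. Some say to impute the data; others say to keep the records and fill in a global constant. Even after hearing many answers I was still unsure, so below is some further material I collected to share with everyone. If you have run into the same problem, you are welcome to join our group and discuss it with us.
Below are some of the answers from Quora:
Most upvoted answer
From [Michiel Van Herwegen], merely curious
The first question to ask yourself: why are those values missing? In practice, data is rarely MAR (missing at random), so the fact that a value is missing carries meaning of its own.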
Keeping missing values as a valid value
If you're going to impute (no matter the technique) a categorical variable where the data is not MAR, you are, in the best case, missing out on information. In the worst case, you're really clobbering the variable.
For that reason alone, I often find it valuable to consider missing as just another value for the categorical variable.
My favorite example here is one I heard in a class, years ago. The department had done a churn project for a large retailer, based on loyalty card data. The best predictor of churn? Whether the customer had filled in his/her email address when registering for the card.
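To make the churn example concrete, here is a minimal sketch in plain Python. The records, the `email_level` helper, and the `MISSING`/`PROVIDED` labels are all hypothetical and made up for illustration; they are not data from the project described above. The idea is only to show "missing" being kept as a level of its own and compared against the outcome.

```python
from collections import defaultdict

# Hypothetical toy records: (email given on the loyalty card, churned?)
records = [
    ("a@example.com", False),
    (None, True),
    ("b@example.com", False),
    (None, True),
    (None, False),
    ("c@example.com", True),
]

def email_level(email):
    """Map the raw field to a categorical level, keeping 'missing' as a level."""
    return "MISSING" if email is None else "PROVIDED"

# totals[level] = [number of churned customers, number of customers]
totals = defaultdict(lambda: [0, 0])
for email, churned in records:
    level = email_level(email)
    totals[level][0] += int(churned)
    totals[level][1] += 1

# Churn rate per level, including the MISSING level
churn_rate = {level: churned / n for level, (churned, n) in totals.items()}
```

In this toy data the `MISSING` level has a higher churn rate than `PROVIDED`, which is the kind of signal that imputation would erase.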
Imputing missing values
Sometimes, imputing the value can be the best option though:
- When there is no reason to expect missing to have a meaning of its own.
- When missing values are really rare, it may not be worth it to have an extra value for the categorical variable.
- When you are using a technique which has trouble with high-dimensional data, since an extra categorical value means an extra variable when dummy-encoding.
The common way to impute the missing values then is to use the mode. Main advantage: really simple, really fast. But of course not that nuanced.
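Mode imputation can be sketched in a few lines. The `impute_mode` function and the `colors` column below are hypothetical names used only for illustration, and `None` is assumed to mark a missing entry.

```python
from collections import Counter

def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, so there is no mode to use
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]

colors = ["red", "blue", None, "red", None, "green"]
imputed = impute_mode(colors)
```

Every missing entry receives the same value, which is why the approach is fast but, as noted above, not very nuanced.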
Sometimes, a model is used to predict the missing values. I'm less a fan of that, for two reasons:
- Time constraints. :) It is the scarcest resource you have, so the variable had better be really important and the missing values common before you spend a lot of time imputing it with a model of its own.
The author says he is not a fan of using a model to predict missing values. His first reason is time constraints: he regards time as the scarcest resource, and building a model just to impute missing data takes too much of it.
- It introduces - to some extent - artificial relations between your predictors. So you had better be careful to remember that when interpreting the model afterwards. Also, it is likely to introduce correlation between predictors, which may lead to higher uncertainty in your coefficients.
This also introduces artificial correlations among the predictors (the X variables) in the model, which makes the model harder to interpret.
Dropping the records
Thirdly, you can remove the records with missing values. But that means you lose data. If there are many variables (each of which can have missing values), you may end up losing a lot of data even though the total number of 'cells' with missing data is rather limited. (This approach is acceptable when only a little data is missing, but if the variables under study have many missing values, deleting records this way loses a lot of data and may introduce bias.)
On the other hand, if you are really interested in an explanatory model, you may want to get rid of them (again, assuming MAR) in order not to dilute/confound effects.
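The gap between missing cells and lost records is easy to see on a small table. The rows and column names below are hypothetical; `None` is assumed to mark a missing cell.

```python
# Hypothetical table: 5 records, 4 variables, None marks a missing cell.
rows = [
    {"plan": "basic", "region": "north", "channel": None,    "device": "ios"},
    {"plan": None,    "region": "south", "channel": "web",   "device": "android"},
    {"plan": "pro",   "region": None,    "channel": "store", "device": "ios"},
    {"plan": "basic", "region": "east",  "channel": "web",   "device": None},
    {"plan": "pro",   "region": "west",  "channel": "web",   "device": "android"},
]

total_cells = sum(len(r) for r in rows)
missing_cells = sum(v is None for r in rows for v in r.values())

# Listwise deletion: keep only records with no missing cell at all
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

cell_missing_share = missing_cells / total_cells     # share of cells missing
row_loss_share = 1 - len(complete_rows) / len(rows)  # share of records dropped
```

Here only 4 of 20 cells are missing, yet dropping incomplete records discards 4 of the 5 records.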
Hybrid approach
Lastly, there is actually yet another 'trick'. You impute, but you also add an extra variable which keeps track of whether you imputed the variable for the record or not. If the fact that it is missing has a meaning, its impact can then be taken into account by that variable.
Bonus advantage: if someone else uses the data, the fact that you imputed data is not lost on them. :)
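A minimal sketch of this trick, assuming mode imputation as the imputation step (the answer does not prescribe one) and `None` as the missing marker. The function and variable names are hypothetical.

```python
from collections import Counter

def impute_with_indicator(values):
    """Return (imputed values, was_missing flags); None marks a missing entry."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mode_value = Counter(observed).most_common(1)[0][0]
    imputed = [mode_value if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]  # 1 where the value was imputed
    return imputed, was_missing

segment = ["gold", None, "silver", "gold", None]
segment_imputed, segment_was_missing = impute_with_indicator(segment)
```

The indicator column travels with the data, so a model can still pick up on missingness, and anyone reusing the data can tell which entries were imputed.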
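Other views
[Shehroz Khan], ML Researcher, Postdoc @U of Toronto
He sees three ways to handle missing values:
- Delete the records with missing values
- Leave them as is, or
- Perform imputation
You can delete the rows or records with missing values to avoid dealing with them. However, if a large share of the data is missing, doing so will hurt the model's predictive power, so this practice should be avoided.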
If you are using decision trees, then you can keep missing attribute values as '?' or any arbitrary value not present in your attributes. Decision trees will take this as a separate attribute value and will give you predictions. You can use the same idea for Bagging, Boosting and Random Forest. I think this is the strategy you mentioned above, but you cannot declare a variable as "0" unless you define what "0" means, or else it can mess up your computations. However, the problem with this approach is that if missingness is high in your data, then the predictive ability of your model deteriorates drastically. This is more of a lazy and ad hoc approach. Therefore, better approaches are needed. Read below.
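Imputation means replacing the missing attribute value with something else. A common choice is to replace it with the mode of all values of that attribute. However, a central question is "Is the mode a good representation for a missing attribute value?". You can use another classification model to predict the missing attribute values. However, it is unclear how such predictive models would learn without complete data. KNN may work because in principle it doesn't do any training, but it is O(N^2), so it is super slow on large datasets, and the value of K needs to be optimized. There has been more work done to handle this question in a principled manner. Read below.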
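A rough sketch of KNN imputation for a categorical column is given below. It assumes Hamming distance (the count of differing attributes) as the similarity measure, assumes all other columns are fully observed, and uses `None` as the missing marker; none of these choices come from the answer itself, and the `knn_impute` function and the data are hypothetical.

```python
from collections import Counter

def knn_impute(rows, target, k=3):
    """Fill None in column `target` by majority vote of the k nearest donors."""
    donors = [r for r in rows if r[target] is not None]
    other_cols = [c for c in rows[0] if c != target]

    def hamming(a, b):
        # Number of non-target attributes on which two records differ
        return sum(a[c] != b[c] for c in other_cols)

    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        # Each incomplete record is compared against every donor record
        nearest = sorted(donors, key=lambda d: hamming(r, d))[:k]
        vote = Counter(d[target] for d in nearest).most_common(1)[0][0]
        filled.append({**r, target: vote})
    return filled

customers = [
    {"plan": "pro",   "device": "ios",     "region": "north"},
    {"plan": "pro",   "device": "ios",     "region": "north"},
    {"plan": "basic", "device": "android", "region": "south"},
    {"plan": "basic", "device": "android", "region": "south"},
    {"plan": "pro",   "device": "ios",     "region": None},
]
filled_customers = knn_impute(customers, target="region", k=2)
```

There is no training step, but every incomplete record is scored against all donors, which is where the quadratic cost on large datasets comes from. The choice of `k` is left to the user, as the answer points out.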
Multiple imputation
There are systematic ways to handle missing data by performing Multiple Imputation (http://www.stefvanbuuren.nl/mi/MI.html). Rubin [1], Joseph L. Schafer [2], and others have done a lot of work on handling missing data.
- Chapters 7 and 8 of Schafer's book deal specifically with categorical data imputation. Some software packages impute categorical data; see Multiple imputation software (http://www.stefvanbuuren.nl/mi/Software.html).
- Search for "categorical" and you will find CRAN - Package cat (http://cran.r-project.org/web/packages/cat/index.html), an R package based on Schafer's work. Another R package uses Non-Parametric Bayesian Multiple Imputation for Categorical Data [3].
- In multiple imputation, each missing value is imputed multiple times. Therefore, once you have imputed the data multiple times, you can do different things, such as creating ensembles or averaging the different imputations (for categorical data, averaging is not the obvious thing to do).
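The impute-several-times-then-combine workflow can be sketched as follows. This is a deliberately simplified illustration, not the method of Rubin, Schafer, or the packages above: each imputation here just draws at random from the observed values of the same column, whereas proper multiple imputation draws from a model that accounts for parameter uncertainty and the other variables. All names and data are hypothetical.

```python
import random
from collections import Counter

def impute_once(values, rng):
    """Fill each None with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def multiple_impute(values, m=5, seed=0):
    """Return m completed copies of `values`, each imputed independently."""
    rng = random.Random(seed)
    return [impute_once(values, rng) for _ in range(m)]

def share(values, level):
    """Analysis step: proportion of records at a given level."""
    return Counter(values)[level] / len(values)

answers = ["yes", "no", None, "yes", None, "yes", "no", None]
completed = multiple_impute(answers, m=5)

# Run the analysis on every completed dataset, then pool by averaging.
estimates = [share(c, "yes") for c in completed]
pooled_yes_share = sum(estimates) / len(estimates)
```

Note that the averaging is applied to the analysis result (a proportion), not to the category labels themselves, which sidesteps the problem that categories cannot be averaged directly.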
Data scientist [Giuliano Janson]
- Since nowadays there are plenty of models that deal with missing values very well, like Gradient Boosting or Random Forest, I usually just set a special value for missing categorical levels (like 'MISSING', or -999) and let the model figure out if "missingness" has predictive power.
The author says he usually sets missing values to a special value such as 'MISSING' and then lets the model determine whether missingness has predictive power.
- The algorithms I mentioned do not even need one-hot encoding, but if you wanted to, you could encode the 'MISSING' level the same way you encode any other level.
- Everything else, like using the median, using another model to predict the missing value, or dropping the row, in my experience either does not work as well or is way more complex and produces minimal improvement.
The author finds that using the median, using another model to predict the missing values, and dropping observations all work less well, and that even the more complex methods yield only minimal improvement.
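Encoding the 'MISSING' level like any other level can be sketched as below. The `one_hot` helper and the sample column are hypothetical, `None` is assumed to mark a missing entry, and levels are ordered alphabetically only to make the column order deterministic.

```python
def one_hot(values, missing_label="MISSING"):
    """One-hot encode a categorical column, treating None as its own level."""
    filled = [missing_label if v is None else v for v in values]
    levels = sorted(set(filled))  # fixed column order
    encoded = [[int(v == level) for level in levels] for v in filled]
    return levels, encoded

levels, encoded = one_hot(["web", None, "store", "web"])
```

The missing entries end up with their own indicator column, so a downstream model can weigh missingness exactly as it weighs any other category.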
References
The discussion above is for learning purposes only. Source:
https://www.quora.com/How-do-I-handle-missing-categorical-variable-in-an-easy-way