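Data Types Exercises
1. Give the advantages of storing data in text files rather than in a binary format:
a. Text files can be easily inspected by typing the file or viewing it with a text editor.
b. Text files are more portable than binary files, both across systems and across programs.
c. Text files are easier to modify, for example with a text editor or perl.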
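2. Distinguish between noise and outliers. Be sure to consider the following questions.
(a) Is noise ever interesting or desirable? Outliers?
Noise: no, by definition. Outliers: yes.
Noise is the random component of a measurement error.
An outlier is a data object whose characteristics differ, in some sense, from those of most other data objects in the data set, or an attribute value that is unusual with respect to the typical values for that attribute.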
(b) Can noise objects be outliers?
Yes. Random distortion of the data is often responsible for outliers.
(c) Are noise objects always outliers?
No. Random distortion can result in an object or value much like a normal one.
(d) Are outliers always noise objects?
No. Often outliers merely represent a class of objects that are different from normal objects.
(e) Can noise make a typical value into an unusual one, or vice versa?
Yes.
3.
![](https://img.haomeiwen.com/i20887903/7aca3ab96c2ed4e0.jpg)
(a) Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will only return a distance of 0 for objects that are the same.
First, the order of duplicate objects on a nearest neighbor list will depend on details of the algorithm and on the order of the objects in the data set. Second, if there are enough duplicates, the nearest neighbor list may consist only of duplicates. Third, an object may not be its own nearest neighbor.
(b) How would you fix this problem?
There are various approaches depending on the situation. One approach is to keep only one object for each group of duplicate objects. In this case, each neighbor can represent either a single object or a group of duplicate objects.
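The fix described above can be sketched in a few lines. This is only an illustration, assuming objects are tuples of numbers and the distance is Euclidean; the names `collapse_duplicates` and `k_nearest` are hypothetical and are not taken from the algorithm in the figure.

```python
import math
from collections import Counter


def collapse_duplicates(objects):
    """Keep one representative per group of identical objects, plus group sizes."""
    counts = Counter(objects)  # keys keep the first-seen order of distinct objects
    return list(counts.keys()), counts


def k_nearest(representatives, k):
    """For each representative, list its k nearest *other* representatives."""
    neighbors = {}
    for obj in representatives:
        others = [o for o in representatives if o != obj]
        others.sort(key=lambda o: math.dist(obj, o))
        neighbors[obj] = others[:k]
    return neighbors


# Hypothetical data: the object (1.0, 1.0) appears three times.
data = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
reps, counts = collapse_duplicates(data)
nn = k_nearest(reps, k=2)
print(nn[(1.0, 1.0)])      # neighbors are distinct objects, not copies of itself
print(counts[(1.0, 1.0)])  # the size of the duplicate group is still available
```

Keeping the counts alongside the representatives means a neighbor can still be interpreted as a whole group of duplicates when that matters downstream.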
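14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
These attributes are all numerical, but they can have widely varying ranges of values, depending on the scale used to measure them. Furthermore, the attributes are not asymmetric, and the magnitude of an attribute matters. These latter two facts eliminate the cosine and correlation measures. Euclidean distance, applied after standardizing the attributes to have a mean of 0 and a standard deviation of 1, would be appropriate.
Cosine similarity is one of the most commonly used measures of document similarity.
The correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects.
5. Given a set of m objects that is divided into K groups, where the i-th group is of size mi. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)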
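As a rough illustration of the elephant answer, the sketch below standardizes each attribute to mean 0 and standard deviation 1 and then compares objects with Euclidean distance. The measurements are made up for the example, and `standardize` is a hypothetical helper name.

```python
import math
import statistics


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]  # assumes no constant column
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Made-up measurements: weight (kg), height (cm), tusk length (cm),
# trunk length (cm), ear area (cm^2).
elephants = [
    (2700.0, 240.0, 90.0, 170.0, 5200.0),
    (3100.0, 255.0, 110.0, 180.0, 5600.0),
    (4000.0, 275.0, 150.0, 195.0, 6100.0),
]

z = standardize(elephants)
# After standardization no single attribute (such as weight) dominates the distance.
print(math.dist(z[0], z[1]), math.dist(z[0], z[2]))
```

Without the standardization step, attributes with large numeric ranges such as weight and ear area would dominate the distance.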
(a) We randomly select n ∗ mi/m elements from each group.
(b) We randomly select n elements from the data set, without regard for the group to which an object belongs.
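The first scheme guarantees a fixed number of objects, n ∗ mi/m, from each group i, while for the second scheme the number of objects drawn from each group will vary. More specifically, the second scheme only guarantees that, on average, the number of objects from each group will be n ∗ mi/m.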
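The difference between the two schemes can be seen in a small simulation. This is a sketch under stated assumptions: the group sizes, the labels, and the function names `stratified_sample` and `simple_random_sample` are all hypothetical.

```python
import random
from collections import Counter


def stratified_sample(groups, n, rng):
    """Scheme (a): draw n * m_i / m objects (rounded) from each group, with replacement."""
    m = sum(len(members) for members in groups.values())
    sample = []
    for label, members in groups.items():
        size = round(n * len(members) / m)
        sample.extend((label, rng.choice(members)) for _ in range(size))
    return sample


def simple_random_sample(groups, n, rng):
    """Scheme (b): draw n objects from the pooled data, ignoring group labels."""
    pooled = [(label, obj) for label, members in groups.items() for obj in members]
    return [rng.choice(pooled) for _ in range(n)]


# Hypothetical data set: m = 100 objects in K = 3 groups of sizes 50, 30 and 20.
groups = {"A": list(range(50)), "B": list(range(30)), "C": list(range(20))}
rng = random.Random(0)

counts_a = Counter(label for label, _ in stratified_sample(groups, 10, rng))
counts_b = Counter(label for label, _ in simple_random_sample(groups, 10, rng))
print(counts_a)  # per-group counts are fixed at 5, 3 and 2
print(counts_b)  # per-group counts vary between runs; 5, 3 and 2 only on average
```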
6.
![](https://img.haomeiwen.com/i20887903/1f58a9530217d7f5.jpg)
b. What might be the purpose of this transformation?
This normalization reflects the observation that terms that occur in every document do not have any power to distinguish one document from another, while those that are relatively rare do.
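A small numeric sketch makes this concrete. It assumes the transformation is the usual inverse document frequency weighting, tf' = tf ∗ log(m/df), where m is the number of documents and df is the number of documents containing the term; the function name `idf_weight` and the numbers are hypothetical.

```python
import math


def idf_weight(tf, df, m):
    """Scale a term frequency by log(m / df), the inverse document frequency."""
    return tf * math.log(m / df)


m = 1000  # hypothetical number of documents
print(idf_weight(tf=3, df=1000, m=m))  # term in every document: weight is 0.0
print(idf_weight(tf=3, df=10, m=m))    # rare term: weight is much larger
```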
7. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x∗. As part of your analysis, you identify an interval (a, b) in which x∗ has a linear relationship to another attribute y.
(a) What is the corresponding interval (a, b) in terms of x?
(a^2, b^2)
(b) Give an equation that relates y to x.
In this interval, y is linear in x∗ = √x, so y = c√x + d for some constants c and d.
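10. This exercise explores the cosine and correlation measures further.
a. What is the range of values that are possible for the cosine measure?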
[−1, 1]. Many times the data has only positive entries and in that case the range is [0, 1].
b. If two objects have a cosine measure of 1, are they identical? Explain.
Not necessarily. All we know is that the values of their attributes differ by a positive constant factor.
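A quick check of this claim, using a hypothetical pair of vectors where one is twice the other:

```python
import math


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


x = (1.0, 2.0, 3.0)
y = (2.0, 4.0, 6.0)  # y = 2 * x, so the two objects are not equal
print(math.isclose(cosine(x, y), 1.0), x == y)  # True False
```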
c. What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)
For two vectors x and y that have a mean of 0, corr(x, y) = cos(x, y).
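The relationship can be verified numerically: Pearson correlation equals the cosine of the mean-centered vectors, and generally differs from the cosine of the raw vectors. The data and the helper names `center` and `pearson` below are hypothetical.

```python
import math
import statistics


def cosine(x, y):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


def center(x):
    """Subtract the mean so that the vector has mean 0."""
    mean = statistics.fmean(x)
    return [v - mean for v in x]


def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


x = [1.0, 3.0, 2.0, 5.0]
y = [2.0, 9.0, 4.0, 7.0]

print(pearson(x, y), cosine(center(x), center(y)))  # equal up to rounding
print(cosine(x, y))                                 # differs: the means are not 0
```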
![](https://img.haomeiwen.com/i20887903/7ccbe040f6989b65.jpg)
![](https://img.haomeiwen.com/i20887903/2e8409743ea6a5f3.jpg)
d. Since all of the 100,000 points fall on the curve, there is a functional relationship between Euclidean distance and cosine similarity for normalized data. More specifically, there is an inverse relationship between cosine similarity and Euclidean distance. For example, if two data points are identical, their cosine similarity is one and their Euclidean distance is zero, but if two data points have a high Euclidean distance, their cosine value is close to zero. Note that all the sample data points were from the positive quadrant, i.e., had only positive values. This means that all cosine (and correlation) values will be positive.
e. Same as the previous answer, but with correlation substituted for cosine.
f. Let x and y be two vectors where each vector has an L2 length of 1. For such vectors, the sum of the squared attribute values is 1, and the cosine similarity between the two vectors is simply their dot product.
![](https://img.haomeiwen.com/i20887903/056569058401d431.jpg)
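For vectors of L2 length 1, expanding the squared Euclidean distance gives d(x, y)^2 = 2(1 − cos(x, y)). The sketch below checks this identity numerically on random positive vectors; the seed, the dimension, and the helper name `normalize` are arbitrary choices made for the illustration.

```python
import math
import random


def normalize(x):
    """Scale a vector to L2 length 1."""
    length = math.hypot(*x)
    return [v / length for v in x]


rng = random.Random(42)
x = normalize([rng.random() for _ in range(10)])
y = normalize([rng.random() for _ in range(10)])

cos_xy = sum(a * b for a, b in zip(x, y))  # for unit vectors, cosine = dot product
d = math.dist(x, y)
print(d, math.sqrt(2 * (1 - cos_xy)))      # the two values agree up to rounding
```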
g. Let x and y be two vectors where each vector has a mean of 0 and a standard deviation of 1. For such vectors, the sum of the squared attribute values is n times the variance (standard deviation squared, taken with divisor n), that is, n, and the correlation between the two vectors is their dot product divided by n.
![](https://img.haomeiwen.com/i20887903/2d55ce2751372297.jpg)
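For standardized vectors, the same expansion gives d(x, y)^2 = 2n(1 − corr(x, y)). The sketch below checks this numerically, assuming the standard deviation is the population version (divisor n); the seed, the value of n, and the helper name `standardize` are arbitrary choices for the illustration.

```python
import math
import random
import statistics


def standardize(x):
    """Rescale a vector to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(x)
    std = statistics.pstdev(x)  # population standard deviation (divisor n)
    return [(v - mean) / std for v in x]


n = 10
rng = random.Random(7)
x = standardize([rng.random() for _ in range(n)])
y = standardize([rng.random() for _ in range(n)])

corr = sum(a * b for a, b in zip(x, y)) / n  # correlation = dot product / n
d = math.dist(x, y)
print(d, math.sqrt(2 * n * (1 - corr)))      # the two values agree up to rounding
```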