Machine Learning in Action Chapter 2

This post contains the source code for the k-nearest neighbors (kNN) algorithm from Chapter 2 of *Machine Learning in Action*.

Dating-Site Match Prediction

Given three features of a dating-site user — the percentage of time spent playing video games, the number of frequent flyer miles earned per year, and the liters of ice cream consumed per week — we predict how much Helen will like that person.
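Because frequent flyer miles are numerically far larger than the other two features, they would dominate a raw Euclidean distance, which is why the code below min-max normalizes every feature into [0, 1]. A minimal sketch of that effect, using made-up feature values and assumed per-feature ranges:

```python
import numpy as np

# Hypothetical feature vectors: [flyer miles, % time gaming, liters of ice cream]
a = np.array([40000.0, 8.0, 0.5])
b = np.array([60000.0, 9.0, 0.6])

# Raw Euclidean distance: the miles column swamps everything else
raw_dist = np.sqrt(((a - b) ** 2).sum())

# Min-max normalization with assumed mins and ranges for each feature
mins = np.array([0.0, 0.0, 0.0])
ranges = np.array([100000.0, 100.0, 2.0])
norm_dist = np.sqrt((((a - mins) / ranges - (b - mins) / ranges) ** 2).sum())

print(raw_dist)   # ~20000: driven almost entirely by flyer miles
print(norm_dist)  # ~0.21: all three features contribute comparably
```

After normalization each feature moves the distance by a comparable amount, which is exactly what `autoNorm` in the listing below does with the real data.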

Handwriting Recognition System

The digits to be recognized are preprocessed black-and-white images of uniform size (32×32) that have already been converted to text format.
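Each 32×32 text image is flattened into a single 1×1024 row vector so that the same kNN distance computation applies. A small sketch with a synthetic 4×4 "image" (standing in for the real 32×32 digit files) shows the idea:

```python
import numpy as np

# A synthetic 4x4 black-and-white "image" in text form, standing in
# for the real 32x32 digit files used in the post.
lines = ["0110",
         "1001",
         "1001",
         "0110"]

n = len(lines)
vect = np.zeros((1, n * n))
for i, line in enumerate(lines):
    for j in range(n):
        vect[0, n * i + j] = int(line[j])  # row i fills columns n*i .. n*i+n-1

print(vect.shape)       # (1, 16): one row per image, ready for kNN
print(int(vect.sum()))  # 8 "on" pixels in this synthetic image
```

The `img2vector` function in the listing below does exactly this, with `n = 32` and a file handle instead of an in-memory list.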

Code

# -*- coding: utf-8 -*-

############################################################
# k-nearest neighbors (kNN) algorithm:
# 1) Compute the distance between the query point and every
#    point in the labeled data set;
# 2) Sort the points by increasing distance;
# 3) Take the k points closest to the query point;
# 4) Count how often each class appears among those k points;
# 5) Return the most frequent class among the k points as the
#    predicted class of the query point.
############################################################

from numpy import *     # scientific computing package
import operator         # operator module (for itemgetter)
from os import listdir  # directory listing, needed by handwritingClassTest

# Create a toy data set and labels. This is only a demo function,
# unrelated to the dating-site prediction below.
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# k-nearest neighbors classifier
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # distance computation
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):  # take the k closest points
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort classes by vote count; note the use of the operator module
    sortedClassCount = sorted(classCount.iteritems(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Parse a tab-separated text file into a NumPy feature matrix
# and a label vector
def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

# Normalize feature values into the [0, 1] range
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

# Test code for the dating-site classifier
def datingClassTest():
    hoRatio = 0.05  # hold out 5% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :],
                                     normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is: %d" \
            % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print "the total error rate is: %f" % (errorCount / float(numTestVecs))

# Dating-site prediction function
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(raw_input(
        "percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges,
                                 normMat, datingLabels, 3)
    print "You will probably like this person: ", \
        resultList[classifierResult - 1]

# Convert an image to a vector: create a 1x1024 NumPy array, open the
# given file, loop over the first 32 lines, store the first 32
# characters of each line in the array, and return the array.
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

# Test code for the handwritten-digit recognition system
def handwritingClassTest():
    hwLabels = []
    # list the contents of the training directory
    trainingFileList = listdir('chapter_2/digits/trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        # parse the class digit from the file name, e.g. "9_45.txt" -> 9
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector(
            'chapter_2/digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('chapter_2/digits/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector(
            'chapter_2/digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest,
                                     trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" \
            % (classifierResult, classNumStr)
        if classifierResult != classNumStr:
            errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount / float(mTest))
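To try `classify0` without the data files, the toy set from `createDataSet` is enough. The sketch below is a self-contained Python 3 rendition of the same logic (the listing above is Python 2, e.g. `dict.iteritems()` and `print` statements):

```python
import numpy as np
from collections import Counter

def classify0(inX, dataSet, labels, k):
    # Euclidean distance from inX to every row of dataSet
    distances = np.sqrt(((dataSet - np.asarray(inX)) ** 2).sum(axis=1))
    # labels of the k nearest training points
    k_nearest = [labels[i] for i in distances.argsort()[:k]]
    # majority vote among those k labels
    return Counter(k_nearest).most_common(1)[0][0]

# The toy data from createDataSet
group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(classify0([0, 0], group, labels, 3))      # 'B': the 3 nearest are B, B, A
print(classify0([1.0, 1.2], group, labels, 3))  # 'A': the 3 nearest are A, A, B
```

`Counter.most_common` replaces the `sorted(..., key=operator.itemgetter(1), reverse=True)` idiom from the listing; the result is the same majority vote.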

Summary

k-nearest neighbors is one of the simplest and most effective classification algorithms, but it requires training samples that are representative of the real data. If the training set is large, it consumes a great deal of storage; and because a distance must be computed to every point in the data set, classification can be very slow in practice. Another drawback is that kNN gives no information about the underlying structure of the data, so we cannot learn what an average or typical instance of each class looks like.
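The storage and speed cost follows directly from the algorithm: there is no training step, so every single query must touch all m stored examples. A rough sketch of the per-query work, assuming m training vectors of dimension d (the numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10000, 1024          # e.g. handwriting-scale data: m images, 32*32 = 1024 features
train = rng.random((m, d))  # the entire training set must stay in memory
query = rng.random(d)

# One query costs m * d subtractions and squares: the work grows linearly
# with the training-set size, which is why large kNN deployments are slow
# without index structures such as kd-trees.
dists = np.sqrt(((train - query) ** 2).sum(axis=1))
print(dists.shape)  # (10000,): one distance per stored training example
```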