As is the convention, every algorithm should be followed by an example that shows it is actually useful...

Articles about Naive Bayes almost always use spam filtering as the example, because there Naive Bayes performs anything but naively; it does remarkably well. But that example has been done to death, so let's try a different one: guessing whose song a set of lyrics belongs to.

Whose song is this

To run this experiment I crawled some lyrics from the web. Speaking of crawlers, allow me a small plug for my own crawler framework, cockroach; a star would be much appreciated. End of the ad, back to the topic.

I picked the lyrics of two singers, 汪峰 & 郑钧. Both are rock singers, which makes the comparison fairer. I didn't crawl much: 20 songs per singer, which is enough for a demo.

The screenshot shows the crawled lyrics: one file per song.

Since this post is mainly about the Naive Bayes classifier, I'll skip the crawling part and focus on the classification process.

Classification process

Data preparation

Before the classifier can produce anything useful, the data has to be put into the format the classifier expects. From the previous article, the classifier needs two things: a data set in which every document is a list of words, and a label list with one class per document.
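
Concretely, the target shape looks like this (the words and labels below are made up purely to show the structure):

# Hypothetical example of the two inputs the classifier expects:
data_set = [
    ["晚安", "北京"],        # song 1, tokenized into words
    ["回到", "拉萨"],        # song 2, tokenized into words
]
targets = [0, 1]             # one label per song: 0 = 汪峰, 1 = 郑钧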

Now that we know what data is needed, let's process the lyrics.

def load_data(path_dir):
    # Read every lyrics file under path_dir and return one list of words per song.
    stop_tokens = {"\n", " ", "：", ":", "，", ", ", "-", "(", ")", "《", "》", "？", "?"}
    filenames = os.listdir(path_dir)
    result_list = []
    for file in filenames:
        file_path = "{}/{}".format(path_dir, file)
        with open(file_path, encoding="UTF-8") as gc:
            lines = gc.readlines()
            # Remove the singers' names so they can't give the label away.
            lines = [line.replace("汪峰", "").replace("郑钧", "") for line in lines]
            # Tokenize each line with jieba and flatten into a single word list per song.
            word_list = []
            for line in lines:
                word_list.extend(jieba.cut(line))
            # Drop whitespace and punctuation tokens (half- and full-width).
            word_list = [word for word in word_list if word not in stop_tokens]
            result_list.append(word_list)
    return result_list

To explain: the function above goes through the following steps:

  1. Walk the files in the lyrics directory
  2. Read each file's contents
  3. Tokenize every line of lyrics with jieba (a short example follows this list)
  4. Merge the per-line tokens, so every song ends up as one list of words
  5. Filter out whitespace and punctuation tokens (this could also be done with a stop-word list at tokenization time)
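
As a quick illustration of step 3, this is roughly what jieba.cut gives back for a single line (the sample line is made up for the example, and the exact segmentation can differ by jieba version and dictionary):

import jieba

line = "我该如何存在"            # an illustrative line of lyrics
words = list(jieba.cut(line))    # jieba.cut returns a generator of tokens
print(words)                     # e.g. ['我', '该', '如何', '存在']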

At this point the data formatting is essentially done.

path_wf = "D:/work/code/zhangyingwei/python/python-demos/input/ml/classify/naivebayes/geci/wf"
path_zj = "D:/work/code/zhangyingwei/python/python-demos/input/ml/classify/naivebayes/geci/zj"

data_wf = load_data(path_wf)
data_zj = load_data(path_zj)

The code above formats 汪峰's and 郑钧's lyrics and loads them into memory.

Next we build the label list.

target_wf = [0]*len(data_wf)
target_zj = [1]*len(data_zj)

These two lines build one list filled with 0s and one filled with 1s, each the same length as the corresponding data set: 0 stands for 汪峰 and 1 stands for 郑钧.

target = target_wf + target_zj
data_set = data_wf+data_zj

Here the two data sets are concatenated, and so are the two label lists, giving the final data set and its labels. The complete function is below.

def load_data_set():
    path_wf = "D:/work/code/zhangyingwei/python/python-demos/input/ml/classify/naivebayes/geci/wf"
    path_zj = "D:/work/code/zhangyingwei/python/python-demos/input/ml/classify/naivebayes/geci/zj"

    data_wf = load_data(path_wf)
    data_zj = load_data(path_zj)

    target_wf = [0]*len(data_wf)
    target_zj = [1]*len(data_zj)

    target = target_wf + target_zj
    data_set = data_wf+data_zj

    return data_set,target
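
With 20 songs per singer, load_data_set therefore returns 40 documents and 40 labels that line up by index; a quick sanity check could look like this:

data_set, target = load_data_set()
assert len(data_set) == len(target)      # 40 songs, 40 labels
print(target[:3], target[-3:])           # [0, 0, 0] [1, 1, 1]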

Next the data is split: part of it becomes the training set and the rest the test set.

data_set,target = load_data_set()
length = len(data_set)

test_data_size = 2
# The last two songs in the data set (labelled 1, i.e. 郑钧)
test_data = data_set[length-test_data_size:]
test_target = target[length-test_data_size:]

# The first two songs in the data set (labelled 0, i.e. 汪峰)
test_data2 = data_set[:test_data_size]
test_target2 = target[:test_data_size]

The first two and the last two songs, four in total, are held out as the test set.
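
Taking the first two and the last two songs is the simplest possible hold-out. If you want the test set to be independent of file order, a randomized split (not part of the original code) could be sketched like this:

import random

def split_data(data_set, target, test_size=4, seed=42):
    # Shuffle the indices, then carve off the first test_size items as the test set.
    indices = list(range(len(data_set)))
    random.seed(seed)
    random.shuffle(indices)
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    train_data = [data_set[i] for i in train_idx]
    train_target = [target[i] for i in train_idx]
    test_data = [data_set[i] for i in test_idx]
    test_target = [target[i] for i in test_idx]
    return (train_data, train_target), (test_data, test_target)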

That completes the preparation; time to predict.

model = NaiveBayes()
# Train on everything except the four held-out songs.
model.trainNB(data_set=data_set[test_data_size:length-test_data_size],
              targets=target[test_data_size:length-test_data_size])
for index, line in enumerate(test_data):
    res = model.classify(word_list=line)
    print("res is {} and should be {}".format(res, test_target[index]))

for index, line in enumerate(test_data2):
    res = model.classify(word_list=line)
    print("res is {} and should be {}".format(res, test_target2[index]))

Running this prints:

res is 1 and should be 1
res is 1 and should be 1
res is 0 and should be 0
res is 0 and should be 0

So, at least on this small test set, Naive Bayes does surprisingly well at binary text classification.
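
The NaiveBayes class itself comes from the previous article and is not repeated here. For readers who skipped that post, the sketch below shows one minimal way a multinomial Naive Bayes with Laplace smoothing can be written; it only illustrates the idea and is not the exact implementation behind trainNB/classify:

import math
from collections import Counter, defaultdict

class SimpleNB:
    # Minimal multinomial Naive Bayes over word lists (illustration only).
    def train(self, data_set, targets):
        self.vocab = set(word for doc in data_set for word in doc)
        self.class_counts = Counter(targets)        # documents per class
        self.word_counts = defaultdict(Counter)     # class -> word frequencies
        for doc, label in zip(data_set, targets):
            self.word_counts[label].update(doc)
        self.total_words = {c: sum(self.word_counts[c].values()) for c in self.class_counts}
        self.n_docs = len(targets)

    def classify(self, word_list):
        best_class, best_score = None, float("-inf")
        vocab_size = len(self.vocab)
        for c in self.class_counts:
            # log P(c) + sum over words of log P(word | c), with add-one smoothing
            score = math.log(self.class_counts[c] / self.n_docs)
            for word in word_list:
                count = self.word_counts[c][word]
                score += math.log((count + 1) / (self.total_words[c] + vocab_size))
            if score > best_score:
                best_class, best_score = c, score
        return best_class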

Finally, the complete code of the test script:

#!/usr/bin/env python  
# encoding: utf-8  

"""
@version: v1.0 
@author: zhangyw
@site: http://blog.zhangyingwei.com
@software: PyCharm 
@file: naivebayes_test.py 
@time: 2017/12/7 14:14 
"""
import os
import jieba
import numpy as np
from ml.classify.naivebayes.naivebayes import NaiveBayes


def load_data(path_dir):
    # Read every lyrics file under path_dir and return one list of words per song.
    stop_tokens = {"\n", " ", "：", ":", "，", ", ", "-", "(", ")", "《", "》", "？", "?"}
    filenames = os.listdir(path_dir)
    result_list = []
    for file in filenames:
        file_path = "{}/{}".format(path_dir, file)
        with open(file_path, encoding="UTF-8") as gc:
            lines = gc.readlines()
            # Remove the singers' names so they can't give the label away.
            lines = [line.replace("汪峰", "").replace("郑钧", "") for line in lines]
            # Tokenize each line with jieba and flatten into a single word list per song.
            word_list = []
            for line in lines:
                word_list.extend(jieba.cut(line))
            # Drop whitespace and punctuation tokens (half- and full-width).
            word_list = [word for word in word_list if word not in stop_tokens]
            result_list.append(word_list)
    return result_list


def load_data_set():
    path_wf = "D:/work/code/zhangyingwei/python/python-demos/input/ml/classify/naivebayes/geci/wf"
    path_zj = "D:/work/code/zhangyingwei/python/python-demos/input/ml/classify/naivebayes/geci/zj"

    data_wf = load_data(path_wf)
    data_zj = load_data(path_zj)

    target_wf = [0]*len(data_wf)
    target_zj = [1]*len(data_zj)

    target = target_wf + target_zj
    data_set = data_wf+data_zj

    return data_set,target



if __name__ == '__main__':
    data_set,target = load_data_set()
    length = len(data_set)

    test_data_size = 2
    # The last two songs (郑钧) and the first two songs (汪峰) are held out for testing.
    test_data = data_set[length-test_data_size:]
    test_target = target[length-test_data_size:]

    test_data2 = data_set[:test_data_size]
    test_target2 = target[:test_data_size]

    # Train on everything in between, then classify the held-out songs.
    model = NaiveBayes()
    model.trainNB(data_set=data_set[test_data_size:length-test_data_size],
                  targets=target[test_data_size:length-test_data_size])
    for index, line in enumerate(test_data):
        res = model.classify(word_list=line)
        print("res is {} and should be {}".format(res, test_target[index]))

    for index, line in enumerate(test_data2):
        res = model.classify(word_list=line)
        print("res is {} and should be {}".format(res, test_target2[index]))