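In this section, we work through a few related problems.
Category Prediction
In a collection of documents, not only the words matter but also the category of text in which a particular word tends to occur. For example, we may want to predict whether a given sentence belongs to the category email, news, sports, computers, and so on. In the following example, we use tf-idf to build feature vectors and find the category of documents, using data from the 20 newsgroups dataset available through sklearn.
Import the necessary packages -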
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
Define the category map. We use five different categories: Religion, Autos, Hockey, Electronics, and Space.
category_map = {'talk.religion.misc': 'Religion', 'rec.autos': 'Autos',
    'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics', 'sci.space': 'Space'}
Create the training set -
training_data = fetch_20newsgroups(subset = 'train',
    categories = list(category_map.keys()), shuffle = True, random_state = 5)
Build a count vectorizer and extract the term counts -
vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)
Create the tf-idf transformer and apply it to the term counts -
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)
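To make the weighting concrete, the following is a minimal pure-Python sketch of the computation that TfidfTransformer performs with its default settings (smoothed idf followed by L2 normalization). The toy corpus and the helper name tfidf_vector are assumptions made for illustration, and tokenization is simplified to whitespace splitting, which differs from what CountVectorizer does.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the 20 newsgroups data).
docs = [
    "the shuttle reached space",
    "the puck hit the goal",
    "the shuttle landed",
]

# Simplified tokenization: split on whitespace only.
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    """Return the L2-normalized tf-idf weights of one tokenized document."""
    counts = Counter(tokens)
    weights = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in sorted(vec.items())})
```

A term such as "the", which occurs in every document, receives the lowest idf, while a term found in a single document, such as "space", is weighted more heavily within its document.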
Now, define the test data -
input_data = [
    'Discovery was a space shuttle',
    'Hindu, Christian, Sikh all are religions',
    'We must have to drive safely',
    'Puck is a disk made of rubber',
    'Television, Microwave, Refrigrated all uses electricity'
]
Train a Multinomial Naive Bayes classifier on the tf-idf features of the training data -
classifier = MultinomialNB().fit(train_tfidf, training_data.target)
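As a rough illustration of what a multinomial Naive Bayes classifier computes, here is a minimal pure-Python sketch with Laplace smoothing (alpha = 1, which is also the default of MultinomialNB). The toy corpus and the function names log_posterior and predict are hypothetical, and the sketch works on raw term counts rather than tf-idf weights, so it is a simplification and not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus (hypothetical).
train = [
    ("shuttle orbit launch", "Space"),
    ("orbit moon shuttle", "Space"),
    ("puck goal ice", "Hockey"),
    ("ice rink puck goal", "Hockey"),
]

alpha = 1.0  # Laplace smoothing constant

# Count terms per class and documents per class.
term_counts = defaultdict(Counter)
class_docs = Counter()
for text, label in train:
    class_docs[label] += 1
    term_counts[label].update(text.split())

vocabulary = {term for counts in term_counts.values() for term in counts}

def log_posterior(text, label):
    """Unnormalized log P(label | text) under the multinomial model."""
    total = sum(term_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # log prior
    for term in text.split():
        if term not in vocabulary:
            continue  # terms never seen in training are ignored
        likelihood = (term_counts[label][term] + alpha) / (
            total + alpha * len(vocabulary))
        score += math.log(likelihood)
    return score

def predict(text):
    """Return the class with the highest log posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("the shuttle left orbit"))
print(predict("goal scored on ice"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.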
Transform the input data using the count vectorizer -
input_tc = vectorizer_count.transform(input_data)
Now, transform the vectorized data using the tf-idf transformer -
input_tfidf = tfidf.transform(input_tc)
Predict the output categories -
predictions = classifier.predict(input_tfidf)
Print the results -
for sent, category in zip(input_data, predictions):
    print('\nInput Data:', sent, '\n Category:',
        category_map[training_data.target_names[category]])
The category predictor produces the following output -
Dimensions of training data: (2755, 39297)
Input Data: Discovery was a space shuttle
Category: Space
Input Data: Hindu, Christian, Sikh all are religions
Category: Religion
Input Data: We must have to drive safely
Category: Autos
Input Data: Puck is a disk made of rubber
Category: Hockey
Input Data: Television, Microwave, Refrigrated all uses electricity
Category: Electronics
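The printing step relies on a two-stage lookup: the classifier returns integer class indices, target_names turns an index into a newsgroup name, and category_map turns that name into a readable label. The sketch below reproduces this chain without downloading the dataset. It assumes that target_names holds the selected newsgroups in alphabetical order, and the predicted indices are hypothetical values chosen for illustration.

```python
category_map = {
    'talk.religion.misc': 'Religion', 'rec.autos': 'Autos',
    'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics',
    'sci.space': 'Space',
}

# Assumption: the dataset lists the selected newsgroups alphabetically.
target_names = sorted(category_map)

# Hypothetical class indices, standing in for classifier.predict(...).
predictions = [3, 4, 0]

# index -> newsgroup name -> readable label
labels = [category_map[target_names[index]] for index in predictions]
print(labels)
```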
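Gender Finder
In this problem, a classifier is trained to find the gender (male or female) from a given name. We need a heuristic to construct the feature vector and train the classifier. The labeled data comes from the names corpus of the NLTK package; if the corpus is not installed yet, it can be downloaded once with nltk.download('names'). The following is the Python code to build the gender finder -
Import the necessary packages -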
import random
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names
Now extract the last N letters from the input word. These letters act as the feature -
def extract_features(word, N = 2):
    last_n_letters = word[-N:]
    return {'feature': last_n_letters.lower()}
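A quick way to see what the classifier actually receives is to call the feature extractor on a single name with several values of N. The snippet below is a small self-contained sketch; the function body mirrors the one above, and the sample names are arbitrary.

```python
def extract_features(word, N=2):
    """Return the lowercased last N letters of a word as a feature dict."""
    return {'feature': word[-N:].lower()}

# The same name yields a different feature for each suffix length.
for n in range(1, 4):
    print(n, extract_features('Gaurav', n))

# A name shorter than N simply yields the whole name.
print(extract_features('Al', 5))
```

Because every name maps to exactly one feature value, longer suffixes produce many rare values, and the classifier has fewer training examples for each of them.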
Create the training data using the labeled names (male and female) available in NLTK -
male_list = [(name, 'male') for name in names.words('male.txt')]
female_list = [(name, 'female') for name in names.words('female.txt')]
data = (male_list + female_list)
random.seed(5)
random.shuffle(data)
Now, create the test data as follows -
namesInput = ['Rajesh', 'Gaurav', 'Swati', 'Shubha']
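Define the number of samples used for training; the first 80% of the shuffled data is used for training and the rest for testing -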
train_sample = int(0.8 * len(data))
Now iterate over different suffix lengths so that the accuracies can be compared -
for i in range(1, 6):
    print('\nNumber of end letters:', i)
    features = [(extract_features(n, i), gender) for (n, gender) in data]
    train_data, test_data = features[:train_sample], features[train_sample:]
    classifier = NaiveBayesClassifier.train(train_data)
Still inside the loop, compute the accuracy of the classifier on the test data -
    accuracy_classifier = round(100 * nltk_accuracy(classifier, test_data), 2)
    print('Accuracy = ' + str(accuracy_classifier) + '%')
Finally, inside the same loop, predict the gender of each input name -
    for name in namesInput:
        print(name, '->', classifier.classify(extract_features(name, i)))
The above program generates the following output -
Number of end letters: 1
Accuracy = 74.7%
Rajesh -> female
Gaurav -> male
Swati -> female
Shubha -> female
Number of end letters: 2
Accuracy = 78.79%
Rajesh -> male
Gaurav -> male
Swati -> female
Shubha -> female
Number of end letters: 3
Accuracy = 77.22%
Rajesh -> male
Gaurav -> female
Swati -> female
Shubha -> female
Number of end letters: 4
Accuracy = 69.98%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female
Number of end letters: 5
Accuracy = 64.63%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female
As the output shows, accuracy is highest with two ending letters and decreases as the number of ending letters grows beyond that.
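The comparison can also be made programmatically. The sketch below copies the accuracies from the output above into a dictionary and picks the suffix length with the highest value; the names accuracy_by_n and best_n are introduced here only for illustration.

```python
# Accuracies (in percent) copied from the program output above,
# keyed by the number of end letters used as the feature.
accuracy_by_n = {1: 74.7, 2: 78.79, 3: 77.22, 4: 69.98, 5: 64.63}

# Select the suffix length with the highest test accuracy.
best_n = max(accuracy_by_n, key=accuracy_by_n.get)
print('Best number of end letters:', best_n)
```

These numbers come from a single 80/20 split with a fixed random seed, so the ranking of nearby values such as N = 2 and N = 3 could plausibly change with a different split.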