NLP理论基础
NLP理论基础
wolfoxliu 发表于9个月前
NLP理论基础
  • 发表于 9个月前
  • 阅读 20
  • 收藏 0
  • 点赞 1
  • 评论 0

腾讯云 新注册用户 域名抢购1元起>>>   

1.NLTK的安装

    pip install nltk

 

2.安装语料库

import nltk
nltk.download()

例如安装布朗大学的语料库:brown

也可以在程序中下载

nltk.download('brown')

 

3.NLTK自带语料库

from nltk.corpus import brown
from pprint import pprint
pprint(brown.categories())
print len(brown.sents())
print len(brown.words())

 

4.文本处理流程

    文本-->分词-->特征工程-->机器学习

 

5.分词

(1)英文分词    

import nltk
nltk.download('punkt')
sentence = 'Hello, Python'
tokens = nltk.word_tokenize(sentence)
print tokens

词性标注

print nltk.pos_tag(tokens)

(2)中文分词

    1)原理:

        HMM模型:http://yanyiwu.com/work/2014/04/07/hmm-segment-xiangjie.html

        CRF模型:http://blog.csdn.net/ifengle/article/details/3849852

    2)中文分词的Python库

        jieba:https://github.com/fxsjy/jieba

        corenlp-python:http://stanfordnlp.github.io/CoreNLP/

                                     https://pypi.python.org/pypi/corenlp-python

 

6.中文分词之jieba

import jieba
sentence = '我来到北京清华大学'
sentence2 = '小明硕士毕业于中国科学院计算所,后在日本京都大学深造'
seg1 = jieba.cut(sentence, cut_all=True)     # 全模式
# print type(seg)                            # 分词后的结果并不是一个list,而是一个generator
print '[全模式:]', '/'.join(seg1)
seg2 = jieba.cut(sentence, cut_all=False)    # 精准模式
print '[精准模式:]', '/'.join(seg2)
seg3 = jieba.cut(sentence)                   # 默认情况是精准模式
print '[默认情况:]', '/'.join(seg3)
seg4 = jieba.cut_for_search(sentence2)        # 搜索引擎模式
print '[搜索引擎模式:]', '/'.join(seg4)

 

7.停用词(stopwords)

from nltk import word_tokenize
from nltk.corpus import stopwords
sentence = 'NetworkX is a Python language software package for studying the complex networks'
words = word_tokenize(sentence)
print words
filtered_words = [word for word in words if word not in stopwords.words('english')]
print filtered_words

 

8.NLP的应用

 (1)情感分析

         1)最简单的就是基于sentiment dictionary的方法,类似于基于关键词的打分机制,目前应用于广告投放领域。存在的问题:无法解决新词,特殊词以及更深层次的情感

from nltk import word_tokenize
from nltk.corpus import stopwords
sentiment_dictionary = {}
for line in open(r'E:\Python\QiyueZaixian\1_lesson\data\AFINN\AFINN-111.txt', 'r'):
    word, score = line.split('\t')
    sentiment_dictionary[word] = int(score)
sentence = 'He is a brave man'
words = word_tokenize(sentence)
filtered_words = [word for word in words if word not in stopwords.words('english')]
total_score = sum(sentiment_dictionary.get(word, 0) for word in filtered_words)
print total_score

     2)基于Machine Learning的情感分析

# 配上Machine Learning的情感分析
from nltk.classify import NaiveBayesClassifier
# 训练集
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'


# 文本预处理函数
def pre_process(s):
    return {word: True for word in s.lower().split()}

# 标准化训练集
training_data = [
    [pre_process(s1), 'pos'],
    [pre_process(s2), 'pos'],
    [pre_process(s3), 'neg'],
    [pre_process(s4), 'neg']
]

# 训练
model = NaiveBayesClassifier.train(training_data)

# 测试
test_s = 'This is a great book'
print model.classify(pre_process(test_s))

     

(2)文本相似度:基于词频的独热编码(One-hot encoding),使用FreqDist进行词频统计

from nltk import word_tokenize
from nltk import FreqDist

# 构建词库
sentence = 'this is my sentence this is my life this is the day'
tokens = word_tokenize(sentence)
print tokens
fd = FreqDist(tokens)
print fd['is']
standard_vector = fd.most_common(6)
print standard_vector
size = len(standard_vector)


# 记录下出现次数最多的几个单词的位置信息
def position(st_vec):
    ps_dt = {}
    index = 0
    for word in st_vec:
        ps_dt[word[0]] = index
        index += 1
    return ps_dt
standard_position = position(standard_vector)
print standard_position

# 新的句子
new_sentence = 'this is cool'
freq_list = [0] * size
new_tokens = word_tokenize(new_sentence)
for word in new_tokens:
    try:
        freq_list[standard_position[word]] += 1
    except KeyError:
        continue
print freq_list

    (3)文本分类:TF-IDF(词频-逆文档频率)

        举例:一个文档中有100个单词,单词sheep出现了7次,那么TF(sheep) = 7/100 = 0.07

                   现在我们有10,000,000个文档,出现sheep的文档有10,000个,

                    那么IDF(sheep) = lg(10,000,000/10,000) = 3,

                    那么TF-IDF(sheep) = TF(sheep) * IDF(sheep) = 0.07 * 3 = 0.21

from nltk.text import TextCollection
sentence1 = 'this is sentence one'
sentence2 = 'I love you'
sentence3 = 'He is a brave man'
sentence4 = 'Hello, Python'
corpus = TextCollection([sentence1, sentence2, sentence3])
print corpus.tf('Hello', sentence4)       # tf = sentence4.count('Hello')/ len(sentence4)
print corpus.tf_idf('Hello', sentence4)

 

 

 

 

 

 

 

共有 人打赏支持
粉丝 7
博文 13
码字总数 10108
×
wolfoxliu
如果觉得我的文章对您有用,请随意打赏。您的支持将鼓励我继续创作!
* 金额(元)
¥1 ¥5 ¥10 ¥20 其他金额
打赏人
留言
* 支付类型
微信扫码支付
打赏金额:
已支付成功
打赏金额: