ngrams

python - Training a Naive Bayes classifier on ngrams

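I have been using the Ruby Classifier library to classify privacy policies. I have come to the conclusion that the simple bag-of-words approach built into this library is not enough. To improve my classification accuracy, I want to train the classifier on n-grams in addition to individual words. I was wondering whether there is a library for preprocessing documents to get the relevant n-grams (and handle punctuation correctly). One idea is that I could preprocess the documents and feed pseudo-ngrams to the Ruby Classifier, for example: wordone_wordtwo_wordthree. Or maybe there is a better way to do this, such as a library that has ngram-based Naive Bayes classification built in from the start. I am open to using languages other than Ruby here if they get the job done (if need be, P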

python ruby nlp machine-learning classification
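The pseudo-ngram idea from the question above can be sketched as a preprocessing step. This is a minimal illustration in Python, not the asker's code: the function name `pseudo_ngram_tokens`, the `max_n` parameter, and the punctuation rule (split on punctuation first, so no n-gram crosses a clause boundary) are all assumptions made for the example.

```python
import re

def pseudo_ngram_tokens(text, max_n=3):
    """Return unigrams plus underscore-joined n-grams (hypothetical helper)."""
    tokens = []
    # Assumption: split on punctuation first so that no n-gram
    # spans a sentence or clause boundary.
    for fragment in re.split(r"[.!?;:,()]+", text.lower()):
        words = re.findall(r"[a-z0-9']+", fragment)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                tokens.append("_".join(words[i:i + n]))
    return tokens

tokens = pseudo_ngram_tokens("We share data. Never sold!", max_n=2)
print(" ".join(tokens))
```

The joined string can then be passed to any bag-of-words classifier that tokenizes on whitespace, since each n-gram now looks like a single word.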

python - nltk language model (ngram): calculating the probability of a word from context

I am using Python and NLTK to build a language model as follows:
    from nltk.corpus import brown
    from nltk.probability import LidstoneProbDist, WittenBellProbDist
    estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
    lm = NgramModel(3, brown.words(categories='news'), estimator)
    # Thanks to miku, I fixed this problem
    print lm.prob("word", ["This is a context which g

python ngram context prob word nlp nltk

Python list of Ngrams with frequencies

I need to get the most popular ngrams from a text. The ngram length must be between 1 and 5 words. I know how to get bigrams and trigrams. For example:
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(3)
    finder.apply_word_filter(filter_stops)
    matches1 = finder.nbest(bigram_measures.pmi, 20)
However, I found that sc

Python Ngram CountVectorizer nltk scikit-learn
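For the 1-to-5-word requirement in the question above, raw frequency counting needs nothing beyond the standard library. The following is a rough sketch, not the asker's solution: `ngram_counts` and its parameters are hypothetical names, and it ranks by plain frequency rather than by the PMI measure used in the excerpt.

```python
from collections import Counter

def ngram_counts(words, min_n=1, max_n=5):
    """Count every n-gram of length min_n..max_n (hypothetical helper)."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Zipping n shifted views of the list yields each run of
        # n consecutive words as a tuple.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts

words = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(words)
for gram, freq in counts.most_common(3):
    print(" ".join(gram), freq)
```

If scikit-learn is already a dependency, `CountVectorizer(ngram_range=(1, 5))` covers the same range of lengths and also handles tokenization.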

unicode - Building an ngram frequency table and handling multibyte runes

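I am currently learning Go and making good progress. One way I do this is by porting past projects and prototypes from a previous language to the new one. Right now I am busy with a "language detector", a prototype I made in Python a while ago. In this module I generate an ngram frequency table and then compute the difference between a given text and a known corpus. This lets one efficiently determine which corpus is the best match by returning the cosine of the two vector representations of the given ngram tables. Yay. Math. I have a prototype written in Go that works perfectly with plain ASCII characters, but I would very much like it to support Unicode multibyte characters. That is what I am focusing on now. Here is a simple example of what I am dealing with: h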

unicode go rune
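The frequency-table-plus-cosine approach described above can be sketched in Python, the language of the asker's earlier prototype. This is an illustrative reconstruction, not the original code: `char_ngrams` and `cosine_similarity` are hypothetical names. Python strings are indexed by code point, so slicing does not split a multibyte character; in Go, indexing a string yields bytes, and the usual equivalent is to convert to `[]rune` before slicing.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram frequency table (hypothetical helper)."""
    # str slicing works on code points, not bytes.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

sample = char_ngrams("héllo wörld", n=2)
print(cosine_similarity(sample, char_ngrams("héllo", n=2)))
```

A detector along these lines would build one table per known corpus and report the corpus whose table has the highest cosine with the input text.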

python - Generating Ngrams (Unigrams, Bigrams, etc.) from a large corpus of .txt files and their frequency

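I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written code to input my files into the program. The input is 300 .txt files written in English, and I want the output in the form of Ngrams, specifically the frequency counts. I know NLTK has Bigram and Trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html but I am not advanced enough to feed them into my program. Input: txt files, not single sentences. Output example: Bigram [('Hi', 'How'), ('How', 'are'), ('are',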

Unigrams ngrams corpus python nltk

ruby-on-rails - Elasticsearch and Rails: Using ngram to search for part of a word

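I am trying to use the Elasticsearch gem in my project. As I understand it, the Tire gem is no longer needed, or am I wrong? In my project I have a search (obviously) that currently works on one model. I am now trying to avoid wildcards because they do not scale well, but I cannot seem to get the ngram analyzers to work properly. If I search for whole words the search still works, but searching for parts of words does not. class Pictures {:analyzer => {:my_index_analyzer => {:tokenizer => "keyword", :filter => ["lowercase", "substring"]}, :my_search_analyzer => {:tokeniz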

ruby-on-rails Elasticsearch analyzer ruby search rails-activerecord
