Open-Sourcing a Machine Learning Text Analysis Project

Published by 超人汪小建 on 2018-06-01

Original URL: https://juejin.im/post/5b10920ae51d4506a97d4a2f

TextAnalyzer

TextAnalyzer is a text analysis toolkit based on machine learning.

So far, it supports hot word extraction, text classification, part-of-speech tagging, named entity recognition, Chinese word segmentation, address extraction, synonym recognition, text clustering, word2vec models, edit distance, and sentence similarity.

Features

extracting hot words from a text.

gathering statistics by frequency.
gathering statistics by the tf-idf algorithm.
gathering statistics additionally by a score factor.

extracting addresses from a text.

recognizing synonyms.

SVM Classifier

This analyzer can classify text with an SVM, which involves vectorizing the text first. We train on the samples and then classify new text with the resulting model.

For convenience, the model, the tf-idf data, and the vectors are stored.

k-means clustering and x-means clustering

This analyzer can cluster text with k-means and x-means.

vsm clustering

This analyzer can cluster text with a VSM (vector space model).

part-of-speech tagging

It is implemented with an HMM model and decoded with the Viterbi algorithm.

google word2vec model

This analyzer can load and use a word2vec model.

Chinese word segmentation

This analyzer can segment Chinese text into words.

edit distance

This analyzer supports calculating edit distance at the character level or the word level.

sentence similarity

This analyzer supports calculating similarity between two sentences.

How To Use

It is as simple as this:

Extracting Hot Words

Index a document and get a docId.

long docId = TextIndexer.index(text);

Extract hot words by docId.

HotWordExtractor extractor = new HotWordExtractor();
List<Result> list = extractor.extract(0, 20, false);
if (list != null) {
    for (Result s : list) {
        System.out.println(s.getTerm() + " : " + s.getFrequency() + " : " + s.getScore());
    }
}

Each result contains the term, its frequency, and its score.

失業證 : 1 : 0.31436604
戶口 : 1 : 0.30099702
單位 : 1 : 0.29152703
提取 : 1 : 0.27927202
領取 : 1 : 0.27581802
職工 : 1 : 0.27381304
勞動 : 1 : 0.27370203
關係 : 1 : 0.27080503
本市 : 1 : 0.27080503
終止 : 1 : 0.27080503
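The feature list mentions tf-idf as one way hot-word statistics are gathered. The sketch below shows how a tf-idf score can be computed over tokenized documents. It is a minimal illustration, not TextAnalyzer's own code: the class name `TfIdfSketch`, its methods, and the smoothed idf variant are all assumptions, and the project's score factor is not modeled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative tf-idf scoring. Hypothetical class, NOT part of TextAnalyzer.
final class TfIdfSketch {

    // Term frequency: occurrences of each term divided by the document length.
    static Map<String, Double> termFrequency(List<String> doc) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            e.setValue(e.getValue() / doc.size());
        }
        return tf;
    }

    // Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    // This smoothing is one common variant, chosen here as an assumption.
    static double inverseDocumentFrequency(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
    }

    // tf-idf score of every term in doc, relative to the given corpus.
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> scores = termFrequency(doc);
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            e.setValue(e.getValue() * inverseDocumentFrequency(e.getKey(), corpus));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy, pre-segmented documents (made up for this example).
        List<String> doc1 = Arrays.asList("戶口", "單位", "戶口");
        List<String> doc2 = Arrays.asList("單位", "職工");
        List<List<String>> corpus = Arrays.asList(doc1, doc2);
        tfIdf(doc1, corpus).forEach((term, score) -> System.out.println(term + " : " + score));
    }
}
```

A term that is frequent in one document but rare across the corpus scores higher than a term that appears everywhere, which is why tf-idf is a natural fit for hot-word ranking.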

Extracting Address

String str ="xxxx";
AddressExtractor extractor = new AddressExtractor();
List<String> list = extractor.extract(str);

SVM Classifier

Train on the samples.

SVMTrainer trainer = new SVMTrainer();
trainer.train();

Predict the class of a text.

double[] data = trainer.getWordVector(text);
trainer.predict(data);

K-means Clustering and X-means Clustering

List<String> list = DataReader.readContent(KMeansCluster.DATA_FILE);
int[] labels = new KMeansCluster().learn(list);
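To make the clustering step more concrete, here is a minimal k-means sketch over numeric vectors. It assumes the texts have already been converted to vectors (for example tf-idf vectors). The class `KMeansSketch`, its `cluster` method, and the choice of the first k points as initial centroids are assumptions made for a deterministic illustration; they do not describe how `KMeansCluster.learn` works internally.

```java
import java.util.Arrays;

// Illustrative k-means (Lloyd's algorithm). Hypothetical class, NOT part of TextAnalyzer.
final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns one cluster label per point. Requires 1 <= k <= points.length.
    // Initial centroids are simply the first k points (an assumption for determinism).
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] labels = new int[n];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(points[i], centroids[c]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[labels[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster: keep the previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {{1.0, 1.0}, {8.0, 8.0}, {1.2, 0.8}, {8.2, 7.9}};
        System.out.println(Arrays.toString(cluster(points, 2, 100))); // prints [0, 1, 0, 1]
    }
}
```

X-means extends this idea by choosing the number of clusters automatically, which the sketch does not attempt.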

VSM Clustering

List<String> list = DataReader.readContent(VSMCluster.DATA_FILE);
List<String> labels = new VSMCluster().learn(list);

Part Of Speech Tagging

HMMModel model = new HMMModel();
model.train();
ViterbiDecoder decoder = new ViterbiDecoder(model);
decoder.decode(words);
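Since tagging is described as an HMM decoded with the Viterbi algorithm, a small standalone sketch of Viterbi decoding may help. Here hidden states stand for tags and observations stand for word indices. The class `ViterbiSketch` and all probabilities are made up for illustration and are not taken from the project's `HMMModel` or `ViterbiDecoder`.

```java
import java.util.Arrays;

// Illustrative Viterbi decoding for an HMM. Hypothetical class, NOT part of TextAnalyzer.
final class ViterbiSketch {

    // obs:   observation indices (e.g. word ids)
    // start: start[j]    = P(first state is j)
    // trans: trans[i][j] = P(next state is j | current state is i)
    // emit:  emit[j][o]  = P(observation o | state j)
    // Returns the most likely state sequence, computed in log space to avoid underflow.
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length;
        int s = start.length;
        if (n == 0) {
            return new int[0];
        }
        double[][] score = new double[n][s];
        int[][] back = new int[n][s];
        for (int j = 0; j < s; j++) {
            score[0][j] = Math.log(start[j]) + Math.log(emit[j][obs[0]]);
        }
        for (int t = 1; t < n; t++) {
            for (int j = 0; j < s; j++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < s; i++) {
                    double candidate = score[t - 1][i] + Math.log(trans[i][j]);
                    if (candidate > best) {
                        best = candidate;
                        bestPrev = i;
                    }
                }
                score[t][j] = best + Math.log(emit[j][obs[t]]);
                back[t][j] = bestPrev;
            }
        }
        // Pick the best final state, then follow the back-pointers.
        int last = 0;
        for (int j = 1; j < s; j++) {
            if (score[n - 1][j] > score[n - 1][last]) {
                last = j;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy model with two tags (0 and 1) and three word ids; numbers are invented.
        double[] start = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
        int[] path = decode(new int[]{0, 1, 2}, start, trans, emit);
        System.out.println(Arrays.toString(path)); // prints [0, 0, 1]
    }
}
```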

Define Your Own Named Entity

MITIE is an information extraction library from the MIT NLP team. Its GitHub repository is https://github.com/mit-nlp/MITIE .

train total_word_feature_extractor

Prepare your word set. You can put it into a txt file in the 'data' directory.

Then run the following:

git clone https://github.com/mit-nlp/MITIE.git
cd MITIE
cd tools
cd wordrep
mkdir build
cd build
cmake ..
cmake --build . --config Release
wordrep -e data

This produces the total_word_feature_extractor model.

train ner_model

Java, C++, or Python can be used to train the NER model. In every case, training requires the total_word_feature_extractor model.

In Java:

NerTrainer nerTrainer = new NerTrainer("model/mitie_model/total_word_feature_extractor.dat");

In C++:

ner_trainer trainer("model/mitie_model/total_word_feature_extractor.dat");

In Python:

trainer = ner_trainer("model/mitie_model/total_word_feature_extractor.dat")

build shared library

Run the following commands:

D:\MITIE>cd mitielib
D:\MITIE\mitielib>mkdir build
D:\MITIE\mitielib>cd build
D:\MITIE\mitielib\build>cmake ..
D:\MITIE\mitielib\build>cmake --build . --config Release --target install

The build prints output like this:

-- Install configuration: "Release"
-- Installing: D:/MITIE/mitielib/java/../javamitie.dll
-- Installing: D:/MITIE/mitielib/java/../javamitie.jar
-- Up-to-date: D:/MITIE/mitielib/java/../msvcp140.dll
-- Up-to-date: D:/MITIE/mitielib/java/../vcruntime140.dll
-- Up-to-date: D:/MITIE/mitielib/java/../concrt140.dll

Word2vec

The path to the word2vec model must be set as a system property at startup, for example -Dword2vec.path=D:\Google_word2vec_zhwiki1710_300d.bin.

Word2Vec vec = Word2Vec.getInstance();
System.out.println("狗|貓: " + vec.wordSimilarity("狗", "貓"));
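The source does not say how `wordSimilarity` is computed. Cosine similarity between the two word vectors is the usual measure for word2vec embeddings, so the sketch below assumes that. The class `CosineSketch` and the toy vectors are hypothetical and are not taken from the project.

```java
// Illustrative cosine similarity between two embedding vectors.
// Hypothetical class, NOT part of TextAnalyzer.
final class CosineSketch {

    // Returns a value in [-1, 1]; returns 0 if either vector has zero length.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up 3-dimensional vectors; real word2vec vectors have hundreds of dimensions.
        float[] dog = {0.9f, 0.1f, 0.3f};
        float[] cat = {0.8f, 0.2f, 0.4f};
        System.out.println("similarity: " + cosine(dog, cat));
    }
}
```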

Segment

DictSegment segment = new DictSegment();
System.out.println(segment.seg("我是中國人"));

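The name `DictSegment` suggests a dictionary-based segmenter, but the source does not describe its algorithm. As an illustration of one common dictionary-based approach, the sketch below implements forward maximum matching. The class `MaxMatchSketch`, the tiny dictionary, and the `maxWordLength` parameter are assumptions for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative forward maximum matching segmentation.
// Hypothetical class, NOT part of TextAnalyzer.
final class MaxMatchSketch {

    // At each position, take the longest dictionary word (up to maxWordLength
    // characters); fall back to a single character when nothing matches.
    static List<String> segment(String text, Set<String> dict, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("我", "是", "中國", "中國人"));
        System.out.println(segment("我是中國人", dict, 3)); // prints [我, 是, 中國人]
    }
}
```

Forward maximum matching is greedy, so it can mis-segment ambiguous text. Statistical segmenters handle such cases better at the cost of needing a trained model.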

Edit Distance

Character level:

CharEditDistance cdd = new CharEditDistance();
cdd.getEditDistance("what", "where");
cdd.getEditDistance("我們是中國人", "他們是日本人吖，四貴子");
cdd.getEditDistance("是我", "我是");
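For readers unfamiliar with edit distance, here is a minimal sketch of the classic Levenshtein distance at the character level, where insertion, deletion, and substitution each cost 1. The class `LevenshteinSketch` is hypothetical. The project's `CharEditDistance` may weight operations differently, so its results are not guaranteed to match.

```java
// Illustrative Levenshtein distance using two rolling rows.
// Hypothetical class, NOT part of TextAnalyzer.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions, and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("what", "where")); // prints 3
        System.out.println(distance("是我", "我是"));     // prints 2
    }
}
```

The same recurrence works at the word level by comparing list elements instead of characters.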

Word level:

// NOTE: the original snippet never declares `ed`. It is assumed to be an
// instance of the project's word-level edit distance class.
List<EditBlock> list1 = new ArrayList<>();
list1.add(new EditBlock("計算機",""));
list1.add(new EditBlock("多少",""));
list1.add(new EditBlock("錢",""));
List<EditBlock> list2 = new ArrayList<>();
list2.add(new EditBlock("電腦",""));
list2.add(new EditBlock("多少",""));
list2.add(new EditBlock("錢",""));
ed.getEditDistance(list1, list2);

Sentence Similarity

String s1 = "我們是中國人";
String s2 = "他們是日本人，四貴子";
SentenceSimilarity ss = new SentenceSimilarity();
System.out.println(ss.getSimilarity(s1, s2));
s1 = "我們是中國人";
s2 = "我們是中國人";
System.out.println(ss.getSimilarity(s1, s2));
