Open-sourcing a text analysis project

Posted by 超人汪小建 on 2017-06-12

Github

github.com/sea-boat/Te…

TextAnalyzer

A text analyzer. So far it can extract hot words from a text segment using the tf-idf algorithm, with an additional score factor applied to refine the final score.

It also provides machine learning for text classification.

Features

Extracting hot words from a text.

  1. Gather statistics by term frequency.
  2. Gather statistics by the tf-idf algorithm.
  3. Additionally weight the statistics by a score factor.
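The three steps above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the class name `TfIdfSketch`, the smoothed idf formula, and the `factor` multiplier are all assumptions, and the input is assumed to be already segmented into tokens.

```java
import java.util.List;

public class TfIdfSketch {
    // Term frequency: share of the document's tokens that equal the term.
    public static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) return 0.0;
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1,
    // where N is the corpus size and df the number of documents containing the term.
    public static double idf(List<List<String>> corpus, String term) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
    }

    // Final score. `factor` is a hypothetical multiplier standing in for the
    // score factor; how the project really computes it is not shown in the README.
    public static double score(List<String> doc, List<List<String>> corpus,
                               String term, double factor) {
        return tf(doc, term) * idf(corpus, term) * factor;
    }
}
```

A term that is frequent in one document but rare across the corpus gets the highest score, which is what makes it a candidate hot word.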

Synonyms can be recognized.

SVM Classifier

The analyzer supports classifying text with an SVM. This involves converting the text into vectors: we train on the samples and then classify new text with the resulting model.

For convenience, the model, the tf-idf data, and the vectors are stored.
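To make the vectorization step concrete, here is a rough sketch of turning tokenized text into a fixed-length `double[]` that an SVM could consume. The class and method names are hypothetical and raw term counts are used for brevity. The project itself stores tf-idf data, so its real vectors are presumably weighted differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VectorizeSketch {
    // Maps each distinct training term to a column index, in first-seen order.
    public static Map<String, Integer> buildVocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                vocab.putIfAbsent(term, vocab.size());
            }
        }
        return vocab;
    }

    // Counts how often each vocabulary term occurs in the document.
    // Terms that were never seen during training are ignored.
    public static double[] vectorize(List<String> doc, Map<String, Integer> vocab) {
        double[] vector = new double[vocab.size()];
        for (String term : doc) {
            Integer index = vocab.get(term);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }
}
```

The same vocabulary has to be used for training and for prediction, which is one reason to persist it alongside the model.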

k-means and x-means clustering

The analyzer supports clustering text with k-means and x-means.
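For readers unfamiliar with k-means, the sketch below shows the basic assign-and-update loop (Lloyd's algorithm) on numeric vectors. It is illustrative only: the class name is made up, the initial centroids are simply the first k points, and it assumes k is no larger than the number of points. It does not cover x-means, which additionally chooses the number of clusters.

```java
import java.util.Arrays;

public class KMeansSketch {
    // Returns a cluster label in [0, k) for every point.
    public static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // naive initialization
        }
        int[] labels = new int[n];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) break; // converged

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < dim; j++) {
                    sums[labels[i]][j] += points[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster keeps its old centroid
                for (int j = 0; j < dim; j++) {
                    centroids[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return labels;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }
}
```

For text, each point would be a document vector, for example tf-idf weights over a shared vocabulary.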

VSM clustering

The analyzer supports clustering text with VSM.
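Assuming VSM here refers to the vector space model, clustering of this kind usually rests on a similarity measure between document vectors. The following sketch computes cosine similarity over term-count maps. The names are hypothetical, and how the project groups documents once similarities are known is not described in the README.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CosineSketch {
    // Builds a sparse term-count vector from a tokenized document.
    public static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            int value = entry.getValue();
            normA += value * value;
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += value * other;
            }
        }
        for (int value : b.values()) {
            normB += value * value;
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Documents sharing no terms score 0, and documents with identical term distributions score 1 (up to floating-point rounding), so a threshold on this value is one simple way to decide whether two texts belong together.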

Dependencies

github.com/sea-boat/IK…

TODO

  • Other ML algorithms.
  • Sentiment analysis.

How to use

It is as simple as this.

Extracting hot words

  1. Index a document to get a docId.
long docId = TextIndexer.index(text);
  2. Extract hot words by docId.
 HotWordExtractor extractor = new HotWordExtractor();
 List<Result> list = extractor.extract(0, 20, false);
 if (list != null) for (Result s : list)
    System.out.println(s.getTerm() + " : " + s.getFrequency() + " : " + s.getScore());

Each result contains a term, its frequency, and its score.

失業證 : 1 : 0.31436604
戶口 : 1 : 0.30099702
單位 : 1 : 0.29152703
提取 : 1 : 0.27927202
領取 : 1 : 0.27581802
職工 : 1 : 0.27381304
勞動 : 1 : 0.27370203
關係 : 1 : 0.27080503
本市 : 1 : 0.27080503
終止 : 1 : 0.27080503

SVM classifier

  1. Train on the samples.
SVMTrainer trainer = new SVMTrainer();
trainer.train();
  2. Predict the classification of a text.
double[] data = trainer.getWordVector(text);
trainer.predict(data);

k-means and x-means clustering

List<String> list = DataReader.readContent(KMeansCluster.DATA_FILE);
int[] labels = new KMeansCluster().learn(list);

VSM clustering

List<String> list = DataReader.readContent(VSMCluster.DATA_FILE);
List<String> labels = new VSMCluster().learn(list);

