Open-Sourcing a Text Analysis Project

Posted by 超人汪小建 on 2017-06-12

GitHub

TextAnalyzer

A text analyzer. So far, it can extract hot words from a text segment using the tf-idf algorithm, and it applies an additional score factor to optimize the final score.

It also provides machine learning for text classification.

Features

Extracting hot words from a text.

Gather statistics by frequency.
Gather statistics by the tf-idf algorithm.
Gather statistics with an additional score factor.

Synonyms can be recognized.
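To make the scoring idea concrete, here is a minimal sketch of tf-idf with a per-term score factor. It is not TextAnalyzer's own code: the class `TfIdfSketch`, its method names, the smoothed idf formula, and the multiplicative use of the factor are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of TextAnalyzer.
final class TfIdfSketch {

    /** Term frequency: occurrences of each term divided by the document length. */
    static Map<String, Double> termFrequency(List<String> tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1.0, Double::sum);
        }
        tf.replaceAll((term, count) -> count / tokens.size());
        return tf;
    }

    /** Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1. */
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(doc -> doc.contains(term)).count();
        return Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
    }

    /**
     * Scores every term of one document as tf * idf * factor.
     * Terms without an entry in factors get a neutral factor of 1.0 (an assumption).
     */
    static Map<String, Double> score(List<String> doc,
                                     List<List<String>> corpus,
                                     Map<String, Double> factors) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> e : termFrequency(doc).entrySet()) {
            double factor = factors.getOrDefault(e.getKey(), 1.0);
            scores.put(e.getKey(), e.getValue() * idf(e.getKey(), corpus) * factor);
        }
        return scores;
    }
}
```

Sorting the returned map by value in descending order would give a hot-word ranking. Terms that appear in many documents receive a lower idf, and the factor lets a caller boost or dampen individual terms.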

SVM Classifier

The analyzer supports text classification with an SVM. This involves converting the text into vectors: we train on the samples and then classify new text with the resulting model.

For convenience, the model, the tf-idf data and the vectors are stored.
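As an illustration of what "vectorizing" text means before it reaches an SVM, the sketch below maps tokenized text to a fixed-length count vector over a vocabulary. The class `VectorizerSketch` and its methods are hypothetical; TextAnalyzer's stored vectors are tf-idf based according to the description above, so the real representation is likely weighted rather than raw counts.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of TextAnalyzer.
final class VectorizerSketch {

    /** Assigns each distinct term an index, in order of first appearance. */
    static Map<String, Integer> buildVocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                vocab.putIfAbsent(term, vocab.size());
            }
        }
        return vocab;
    }

    /** Counts term occurrences into a vector; terms outside the vocabulary are ignored. */
    static double[] toVector(List<String> tokens, Map<String, Integer> vocab) {
        double[] vector = new double[vocab.size()];
        for (String term : tokens) {
            Integer index = vocab.get(term);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }
}
```

The vocabulary has to be built once from the training samples and reused at prediction time, which is one reason for persisting it alongside the model.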

k-means and x-means clustering

The analyzer supports clustering text with k-means and x-means.
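For readers unfamiliar with k-means, the following is a minimal sketch of the standard Lloyd iteration on already-vectorized documents. It is not the project's implementation: `KMeansSketch` is a hypothetical name, the centroids are seeded from the first k points to keep the example deterministic, and it assumes k is no larger than the number of points. x-means, which also estimates k, is not shown.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of TextAnalyzer.
final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns a cluster label in [0, k) for every point. Requires k <= points.length. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        // Assumption: seed centroids with the first k points (real code usually randomizes).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(points[i], centroids[c]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[labels[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep the old centroid for an empty cluster
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }
}
```

Like the `int[] labels` returned by `KMeansCluster().learn(list)` in the usage section, the result here is one integer label per input document.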

VSM clustering

The analyzer supports clustering text with a vector space model (VSM).
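In a vector space model, documents are usually compared by the cosine of the angle between their term vectors. The sketch below shows only that similarity measure; whether TextAnalyzer's VSM clustering uses cosine similarity, and how it groups documents afterwards, is an assumption not confirmed by the project description. `CosineSketch` is a hypothetical name.

```java
// Hypothetical sketch, not part of TextAnalyzer.
final class CosineSketch {

    /**
     * Cosine similarity of two equal-length vectors.
     * Returns 0.0 when either vector has zero length, to avoid dividing by zero.
     */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A simple clustering rule on top of this would put two documents in the same group when their similarity exceeds a chosen threshold.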

Dependencies

github.com/sea-boat/IK…

TODO

Other ML algorithms.
Sentiment analysis.

How to use

Usage is simple:

Extracting hot words

Index a document and get a docId.

long docId = TextIndexer.index(text);

Extract hot words by docId.

HotWordExtractor extractor = new HotWordExtractor();
List<Result> list = extractor.extract(0, 20, false);
if (list != null) {
    for (Result s : list) {
        System.out.println(s.getTerm() + " : " + s.getFrequency() + " : " + s.getScore());
    }
}

Each result contains a term, its frequency and its score.

失業證 : 1 : 0.31436604
戶口 : 1 : 0.30099702
單位 : 1 : 0.29152703
提取 : 1 : 0.27927202
領取 : 1 : 0.27581802
職工 : 1 : 0.27381304
勞動 : 1 : 0.27370203
關係 : 1 : 0.27080503
本市 : 1 : 0.27080503
終止 : 1 : 0.27080503

SVM classifier

Train on the samples.

SVMTrainer trainer = new SVMTrainer();
trainer.train();

Predict the class of a text.

double[] data = trainer.getWordVector(text);
trainer.predict(data);

k-means and x-means clustering

List<String> list = DataReader.readContent(KMeansCluster.DATA_FILE);
int[] labels = new KMeansCluster().learn(list);

VSM clustering

List<String> list = DataReader.readContent(VSMCluster.DATA_FILE);
List<String> labels = new VSMCluster().learn(list);

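========== Advertisement ==========

My new book, 《Tomcat核心設計剖析》 (on the design of the Tomcat core), is now available for pre-order on JD.com. If you are interested, you can pre-order it at item.jd.com/12185360.ht… Thank you all.

=========================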
