Ospaf project: commits word-frequency module

Posted by 李博Garvin on 2014-08-15

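1. Background

In the ospaf project I have been working on recently (see the ospaf midterm report for background), I noticed while extracting features from commit data that commit messages in open-source projects have two main characteristics: 1. The verb usually appears as the first word, for example add or revert. 2. The verbs are fairly fixed, mainly a handful such as add, revert, update, merge, and remove.

So the work to be done is fairly clear.

Step 1. Extract the first word of each commit message.

Step 2. Because every project has many contributors, writing habits vary: add, for example, may be written as Add, added, or Added. These variants need to be recognized as the same word.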
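Step 1 can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper name `first_words` and the sample messages are hypothetical, and the sketch assumes that words in a commit message are separated by whitespace.

```python
from collections import Counter


def first_words(messages):
    """Return the first whitespace-separated token of each non-empty message."""
    words = []
    for message in messages:
        tokens = message.split()
        if tokens:  # skip empty or whitespace-only messages
            words.append(tokens[0])
    return words


# Hypothetical sample messages, for illustration only.
sample = ["Add parser", "added tests", "add docs", "Revert bad merge", ""]

# Counting the raw first words keeps each spelling separate.
raw_counts = Counter(first_words(sample))
print(raw_counts)
```

In this sample, Add, added, and add are each counted once as separate entries, which is exactly the problem that step 2 has to solve.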

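2. Algorithm and code

1. Recognizing same-form words

For step 2 above, that is, recognizing words of the same form, I came up with the following algorithm (if you have a better one, please leave a comment). Take two words A and B. First convert both to lowercase, giving a and b, then take the length of the shorter one, k=min(len(a),len(b)). If k is even, set distance=k/2; if k is odd, set distance=k/2+1 (using integer division). Then compare a and b letter by letter: if their first distance letters are identical, A and B are treated as the same form. The algorithm is not very precise, but it is good enough for the ospaf project. The code is below. It returns 1 if A and B have the same form, and 0 otherwise.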

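def WordCompare(a,b):
    # Compare case-insensitively.
    a_low=a.lower()
    b_low=b.lower()
    a_length=len(a_low)
    b_length=len(b_low)
    # k: length of the shorter word.
    distance=min(a_length,b_length)
    # Number of leading letters to compare: k/2 for even k, k/2+1 for odd k.
    # Integer division (//) keeps the value usable as a range() bound.
    if distance%2 ==0:
        distance_cop=distance//2
    else:
        distance_cop=distance//2+1
    for i in range(0,distance_cop):
        if a_low[i]!=b_low[i]:
            return 0
    return 1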
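To make the rule concrete, here is a small worked sketch of the same prefix comparison. The name `same_form` is a hypothetical stand-in for `WordCompare`; it returns a boolean instead of 1/0 and uses slicing, which is an assumption about style rather than the original implementation. The last two calls show where the heuristic is imprecise.

```python
def same_form(a, b):
    """Return True when a and b share their first ceil(k/2) letters,
    where k is the length of the shorter word (case-insensitive)."""
    a_low, b_low = a.lower(), b.lower()
    k = min(len(a_low), len(b_low))
    prefix_len = (k + 1) // 2  # k/2 for even k, k//2 + 1 for odd k
    return a_low[:prefix_len] == b_low[:prefix_len]


print(same_form("Added", "add"))       # True: k = 3, first 2 letters "ad" match
print(same_form("removed", "remove"))  # True: k = 6, first 3 letters "rem" match
print(same_form("get", "add"))         # False: "ge" differs from "ad"
print(same_form("address", "add"))     # True: a false positive, only "ad" is compared
print(same_form("", "add"))            # True: k = 0, so nothing is compared
```

Because only about half of the shorter word is compared, short keywords match many unrelated words, so the keyword list is best kept to verbs that do not share prefixes.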

2.記錄詞頻

First, a word list KeyWords holds the words whose frequency should be recorded, and commite holds a few sample commit messages:

'''
compare the different word
@author: Garvin
'''
KeyWords=['add','remove','update']
commite=['Added testh ','removed fae gew','update cewf','add cek','get tawge']

def WordCompare(a,b):
    # Compare case-insensitively.
    a_low=a.lower()
    b_low=b.lower()
    a_length=len(a_low)
    b_length=len(b_low)
    # k: length of the shorter word.
    distance=min(a_length,b_length)
    # Number of leading letters to compare: k/2 for even k, k/2+1 for odd k.
    if distance%2 ==0:
        distance_cop=distance//2
    else:
        distance_cop=distance//2+1
    for i in range(0,distance_cop):
        if a_low[i]!=b_low[i]:
            return 0
    return 1

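def GetKeyWordFreq(KeyWords,commits):
    # Start every keyword at a count of zero.
    WordFreqDic={}
    for i in KeyWords:
        WordFreqDic[i]=0
    # Iterate over the commits argument rather than the global sample list.
    for j in commits:
        # Only the first word of each commit message is compared.
        first_word=j.split()[0]
        for key in WordFreqDic.keys():
            if WordCompare(first_word,key)==1:
                WordFreqDic[key]=WordFreqDic[key]+1
    return WordFreqDic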

          
    
if __name__=='__main__':
    print(GetKeyWordFreq(KeyWords,commite))
#     print(WordCompare('commited','commit'))

The result is as follows:
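As a cross-check, the counting step can also be sketched in a compact, self-contained form for Python 3. The names `same_form` and `keyword_freq` are hypothetical stand-ins for `WordCompare` and `GetKeyWordFreq`, and skipping empty messages is an added assumption (the listing above expects every message to contain at least one word).

```python
def same_form(a, b):
    """Prefix rule from the article: compare the first ceil(k/2) letters."""
    a_low, b_low = a.lower(), b.lower()
    k = min(len(a_low), len(b_low))
    prefix_len = (k + 1) // 2
    return a_low[:prefix_len] == b_low[:prefix_len]


def keyword_freq(keywords, commits):
    """Count how many commit messages start with a word matching each keyword."""
    freq = {key: 0 for key in keywords}
    for message in commits:
        tokens = message.split()
        if not tokens:  # assumption: ignore empty messages
            continue
        for key in keywords:
            if same_form(tokens[0], key):
                freq[key] += 1
    return freq


keywords = ["add", "remove", "update"]
commits = ["Added testh ", "removed fae gew", "update cewf", "add cek", "get tawge"]
print(keyword_freq(keywords, commits))  # {'add': 2, 'remove': 1, 'update': 1}
```

On the sample data, "Added" and "add" both count toward add, "removed" counts toward remove, "update" counts toward update, and "get" matches no keyword.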

/********************************

* This article is from the blog "李博Garvin"

* Please credit the source when reposting: http://blog.csdn.net/buptgshengod

******************************************/
