Personal Project: Paper Plagiarism Checker

Posted by 息息系 on 2024-09-15

Original URL: https://www.cnblogs.com/J1mmyC/p/18411365

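| Which course this assignment belongs to | Course class |
| --- | --- |
| Where the assignment requirements are | Personal Project |
| Goal of this assignment | Design a paper plagiarism-detection algorithm: given an original file and a plagiarized version of that paper produced by adding, deleting, and modifying content, write the duplication rate to the answer file. |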

GitHub Link

👉👉👉👉 My GitHub link 👈👈👈👈

PSP

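| PSP2.1 | Personal Software Process Stages | Estimated time (min) | Actual time (min) |
| --- | --- | --- | --- |
| Planning | Planning | 30 | 35 |
| · Estimate | Estimate how long the task will take | 30 | 35 |
| Development | Development | 400 | 420 |
| · Analysis | Requirements analysis (including learning new technologies) | 60 | 80 |
| · Design Spec | Write the design document | 40 | 45 |
| · Design Review | Design review | 30 | 30 |
| · Coding Standard | Coding standard (define suitable conventions for the current development) | 20 | 20 |
| · Design | Detailed design | 50 | 60 |
| · Coding | Coding | 120 | 130 |
| · Code Review | Code review | 30 | 35 |
| · Test | Testing (self-testing, fixing code, committing changes) | 50 | 60 |
| Reporting | Reporting | 90 | 100 |
| · Test Report | Test report | 30 | 35 |
| · Size Measurement | Measure the workload | 30 | 30 |
| · Postmortem & Process Improvement Plan | Postmortem and process improvement plan | 30 | 35 |
| | Total | 520 | 555 |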

Project Structure

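/resources
    ├── orig.txt                          # Original text
    ├── orig_0.8_add.txt                  # Modified text (some content added)
    ├── orig_0.8_del.txt                  # Modified text (some content deleted)
    ├── orig_0.8_dis_1.txt                # Modified text (few word changes)
    ├── orig_0.8_dis_10.txt               # Modified text (moderate word changes)
    ├── orig_0.8_dis_15.txt               # Modified text (many word changes)

/utils
    ├── SimHash.py                        # SimHash implementation: computes text hashes and similarity
    ├── SimilarityCal.py                  # Similarity calculation module: file reading and similarity output logic
    ├── TextUtil.py                       # Text utilities: tokenization, punctuation removal, and so on

main.py                                   # Program entry point: calls each module, computes and outputs the text similarity

test_main.py                              # Unit test file containing the project's test cases

requirements.txt                          # Libraries and versions the project depends on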

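Implementation Overview

The main purpose of this project is to implement a paper plagiarism-detection algorithm: it computes the similarity between an original text and a plagiarized text and outputs their duplication rate. The core algorithm is SimHash, an algorithm for efficient text similarity calculation.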

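1. Algorithm Principle

SimHash algorithm: SimHash is a locality-sensitive hashing algorithm that maps high-dimensional text features to a fixed-length binary vector. The similarity of two texts is measured by the Hamming distance between their hash values.
- Tokenization: the text is first converted into a sequence of tokens (words or punctuation), and some punctuation marks are removed.
- Feature weighting: a hash value is computed for each word, and each bit position of an accumulated vector is adjusted up or down by the word's weight.
- Hash generation: the weighted vector is reduced to a fixed-length SimHash value.
- Similarity calculation: the Hamming distance between the two texts' SimHash values is computed; the smaller the distance, the higher the similarity.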
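The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual `SimHash.py`: the names `simhash_sketch` and `hamming_distance_sketch` are hypothetical, the 64-bit width and the MD5-based per-token hash are choices made for the example, and token frequency stands in for the weight because the post does not say how weights are derived.

```python
import hashlib
from collections import Counter


def simhash_sketch(tokens, bits=64):
    """Return a `bits`-wide SimHash fingerprint for a list of tokens.

    Assumptions for illustration: weight = token frequency, per-token hash =
    leading bytes of MD5, and `bits` is a multiple of 8 no larger than 128.
    """
    if not tokens:
        return 0
    vector = [0] * bits
    for token, weight in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit pushes the position up by the weight, a clear bit pushes it down.
            if (token_hash >> i) & 1:
                vector[i] += weight
            else:
                vector[i] -= weight
    fingerprint = 0
    for i, value in enumerate(vector):
        # Positions with a positive total become 1 bits in the fingerprint.
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance_sketch(a, b):
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Because the fingerprint depends only on which tokens occur and how often, reordering the tokens leaves it unchanged, and two texts that share most of their tokens tend to differ in only a few bits.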

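2. Call Flow

Program startup:
- Run the main program main.py from the command line, passing the original file path, the plagiarized file path, and the output file path.
- Example command: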
```
python main.py orig.txt plag.txt output.txt
```
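File reading:
- The read_file() function in TextUtil.py reads the contents of the original and plagiarized texts. If a file does not exist or is empty, the program raises an exception and reports an error message.
Tokenization:
- The tokenize() function in TextUtil.py splits the file contents into tokens and removes some punctuation marks, producing the processed token list.
SimHash calculation:
- The simhash() function in SimHash.py hashes each token and computes the SimHash value of each of the two texts.
Similarity calculation:
- The hamming_distance() function in SimHash.py computes the Hamming distance between the two SimHash values.
- Based on the Hamming distance, the calculate_similarity() function in SimilarityCal.py computes the similarity of the two texts and outputs the duplication rate as a percentage, rounded to two decimal places.
Result output:
- Once the similarity has been calculated, the program writes the result to the output file in the format "Duplication rate: percentage".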
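The post does not say how the Hamming distance is converted into a percentage. One common choice, assumed here purely for illustration, is to scale the distance linearly by the fingerprint width. The function names `similarity_percent`, `format_similarity`, and `write_result`, as well as the 64-bit width, are hypothetical and are not taken from `SimilarityCal.py`.

```python
def similarity_percent(distance, bits=64):
    """Linearly map a Hamming distance in [0, bits] to a percentage in [0, 100]."""
    if not 0 <= distance <= bits:
        raise ValueError("distance must be between 0 and bits")
    return (1 - distance / bits) * 100


def format_similarity(distance, bits=64):
    """Format the percentage with two decimal places, e.g. '75.00'."""
    return f"{similarity_percent(distance, bits):.2f}"


def write_result(path, text):
    """Write the formatted result to the output file (UTF-8 assumed)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

Under this assumption, identical fingerprints (distance 0) give "100.00" and fingerprints that differ in a quarter of their bits give "75.00".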

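3. Exception Handling

When an input file does not exist or is empty, the program raises an exception with an error message (the unit tests below expect a ValueError). When the files contain no usable tokens, for example only whitespace, it writes a similarity of 0.00, so the run still completes and the program stays robust.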
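A file reader with this kind of validation might look like the sketch below. It is an assumption-laden illustration rather than the project's `read_file()`: the name `read_file_sketch`, the UTF-8 encoding, and the error messages are hypothetical. Only the choice of `ValueError` follows what the unit tests in this post expect.

```python
import os


def read_file_sketch(path):
    """Return the file's text, raising ValueError if it is missing or empty."""
    if not os.path.isfile(path):
        raise ValueError(f"File not found: {path}")
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    # Only a truly empty file is rejected here; whitespace-only content is
    # passed on so a later stage can decide how to score it.
    if content == "":
        raise ValueError(f"File is empty: {path}")
    return content
```

Raising a single exception type for both failure modes lets the caller handle them in one `except ValueError` branch.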

Testing

Test Code

import unittest
from utils.SimHash import simhash, hamming_distance
from utils.SimilarityCal import calculate_similarity
from utils.TextUtil import tokenize, read_file
import os

class TestSimHash(unittest.TestCase):

  def test_simhash_identical_text(self):
      tokens = tokenize("Hello world!")
      simhash_value = simhash(tokens)
      self.assertEqual(simhash_value, simhash(tokens))  # SimHash values should be identical

  def test_simhash_different_text(self):
      tokens1 = tokenize("Hello world!")
      tokens2 = tokenize("Goodbye world!")
      self.assertNotEqual(simhash(tokens1), simhash(tokens2))  # Different texts should produce different SimHash values

  def test_hamming_distance(self):
      tokens1 = tokenize("Hello world!")
      tokens2 = tokenize("Goodbye world!")
      simhash1 = simhash(tokens1)
      simhash2 = simhash(tokens2)
      self.assertGreater(hamming_distance(simhash1, simhash2), 0)  # Ensure Hamming distance is non-zero

  def test_empty_file(self):
      with open('empty.txt', 'w') as f:
          f.write("")
      with open('empty.txt', 'r') as f:
          result = f.read()
      self.assertEqual(result, "")  # Empty files should return empty content
      os.remove('empty.txt')

  def test_calculate_similarity_identical(self):
      with open('test1.txt', 'w') as f:
          f.write("This is a test.")
      with open('test2.txt', 'w') as f:
          f.write("This is a test.")
      calculate_similarity('test1.txt', 'test2.txt', 'output.txt')
      with open('output.txt', 'r') as f:
          output = f.read()
      self.assertEqual(output, "100.00")  # Identical files should have 100% similarity
      os.remove('test1.txt')
      os.remove('test2.txt')
      os.remove('output.txt')

  def test_calculate_similarity_different(self):
      with open('test1.txt', 'w') as f:
          f.write("This is a test.")
      with open('test2.txt', 'w') as f:
          f.write("Completely different content.")
      calculate_similarity('test1.txt', 'test2.txt', 'output.txt')
      with open('output.txt', 'r') as f:
          output = f.read()
      self.assertLess(float(output), 100.00)  # Completely different files should have less than 100% similarity
      os.remove('test1.txt')
      os.remove('test2.txt')
      os.remove('output.txt')

  def test_tokenize_with_punctuation(self):
      text = "Hello, world!"
      tokens = tokenize(text)
      self.assertIn("world", tokens)
      self.assertIn("Hello", tokens)

  def test_tokenize_random_punctuation_removal(self):
      text = "This, is a test!"
      tokens = tokenize(text)
      self.assertIn("test", tokens)  # Ensure words remain intact
      # Check that some punctuation is removed, but not all
      self.assertTrue(any(punct in tokens for punct in [',', '!']))

  def test_tokenize_no_punctuation(self):
      text = "Hello world"
      tokens = tokenize(text)
      self.assertEqual(tokens, ["Hello", "world"])

  def test_calculate_similarity_partially_similar(self):
      with open('test1.txt', 'w') as f:
          f.write("Hello world!")
      with open('test2.txt', 'w') as f:
          f.write("Hello universe!")
      calculate_similarity('test1.txt', 'test2.txt', 'output.txt')
      with open('output.txt', 'r') as f:
          output = f.read()
      self.assertGreater(float(output), 0.00)
      self.assertLess(float(output), 100.00)  # Files with some overlap should have similarity between 0 and 100
      os.remove('test1.txt')
      os.remove('test2.txt')
      os.remove('output.txt')


  def test_read_file_nonexistent(self):
      # Test non-existent file case
      with self.assertRaises(ValueError):
          read_file("nonexistent_file.txt")

  def test_calculate_similarity_nonexistent_file(self):
      # Test similarity calculation with nonexistent file
      with self.assertRaises(ValueError):
          calculate_similarity('nonexistent1.txt', 'nonexistent2.txt', 'output.txt')

  def test_calculate_similarity_empty_file(self):
      # Test similarity calculation with empty file
      with open('empty.txt', 'w') as f:
          f.write("")
      try:
          with self.assertRaises(ValueError):
              calculate_similarity('empty.txt', 'empty.txt', 'output.txt')
      finally:
          os.remove('empty.txt')  # Clean up even if the assertion fails

  def test_calculate_similarity_invalid_content(self):
      # Case where files have invalid content (e.g., only spaces or empty tokens)
      with open('test_invalid1.txt', 'w') as f:
          f.write("     ")  # Invalid content, only spaces
      with open('test_invalid2.txt', 'w') as f:
          f.write("     ")  # Invalid content, only spaces

      calculate_similarity('test_invalid1.txt', 'test_invalid2.txt', 'output_invalid.txt')

      # Check if the output is "0.00" due to invalid content
      with open('output_invalid.txt', 'r') as f:
          output = f.read()

      self.assertEqual(output, "0.00")  # Expecting 0.00 similarity for invalid content

      # Cleanup
      os.remove('test_invalid1.txt')
      os.remove('test_invalid2.txt')
      os.remove('output_invalid.txt')

Test Results