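Personal Project: Paper Plagiarism Checker

Posted by lsr0930 on 2024-09-14

Original URL: https://www.cnblogs.com/Zanama/p/18410334

Course this assignment belongs to	https://edu.cnblogs.com/campus/gdgy/CSGrade22-34
Assignment requirements	https://edu.cnblogs.com/campus/gdgy/CSGrade22-34/homework/13228
Goal of this assignment	Design a paper plagiarism-checking program and learn the software development process
GitHub project URL	https://github.com/Abaistudy/3122004760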

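I. PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	30	30
Estimate	Estimate how long this task will take	600	1220
Development	Development	300	350
Analysis	Requirements analysis (including learning new technologies)	120	120
Design Spec	Write the design document	120	90
Design Review	Design review	20	30
Coding Standard	Coding standard (define suitable conventions for this development)	20	30
Design	Detailed design	60	90
Coding	Implementation	180	150
Code Review	Code review	60	30
Test	Testing (self-testing, fixing code, committing changes)	60	120
Reporting	Reporting	120	180
Test Report	Test report	20	10
Size Measurement	Measure the workload	10	10
Postmortem & Process Improvement Plan	Postmortem and process improvement plan	10	10
	Total	1120	1220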

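II. Approach and Interface Design

Task recap:

Design a paper plagiarism-checking algorithm: given an original paper file and a plagiarized version of that paper, compute the duplication rate and write it to the specified answer file. The answer written to the file is a floating-point number with two digits after the decimal point.
Input and output format: file locations are passed as command-line arguments. The program must read the files from the given locations and write the answer to the given file.
```
python main.py [original file] [plagiarized paper file] [answer file]
```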
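As a rough illustration of the command-line contract above, the entry point could validate the argument count and format the score with two decimal places. This is only a sketch: `parse_args`, `format_similarity`, and this version of `main` are hypothetical helpers, not functions taken from the project.

```python
import sys


def parse_args(argv):
    """Return (original, plagiarized, answer) paths or raise ValueError.

    Hypothetical helper: expects exactly three paths after the script name.
    """
    if len(argv) != 4:
        raise ValueError(
            "usage: python main.py [original file] [plagiarized file] [answer file]"
        )
    return argv[1], argv[2], argv[3]


def format_similarity(similarity):
    """Format a similarity score with two digits after the decimal point."""
    return f"{similarity:.2f}"


def main(argv):
    """Return a process exit code: 0 on success, 1 on a usage error."""
    try:
        parse_args(argv)
    except ValueError as err:
        print(err, file=sys.stderr)
        return 1
    # The real program would read both files and write the score here.
    return 0
```

Returning an exit code from `main` instead of calling `sys.exit` inside it keeps the function easy to call from tests.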

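Runtime environment:

PyCharm 2021.1.3 (Professional Edition)

Python 3.9

Approach:

Both texts are preprocessed and segmented into words, vectorized with TF-IDF, and compared by cosine similarity; the result is then written to the answer file.

Main interface design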

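1. Text preprocessing (preprocess_text): strip punctuation and line breaks, segment the text with the jieba library, and return the words as a space-separated sequence.

import string

import jieba

# string.punctuation only contains ASCII punctuation, so common full-width
# Chinese punctuation is listed explicitly (an assumed set; extend as needed)
CJK_PUNCTUATION = '，。！？；：、“”‘’（）《》【】…'


def preprocess_text(text):
    """
    Text preprocessing: strip punctuation and line breaks, then segment into words
    """
    # Guard against empty input
    if not text:
        return ""

    # Strip punctuation and line breaks
    text = text.translate(str.maketrans('', '', string.punctuation + CJK_PUNCTUATION + '\n\r\t'))

    # Segment the text with jieba
    words = jieba.lcut(text)

    # Return the words as a space-separated sequence
    return ' '.join(words)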

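2. Text vectorization (vectorize_texts): convert the two texts into vectors with the TF-IDF algorithm.

from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize_texts(text1, text2):
    """
    Vectorize two texts with TF-IDF
    """
    vectorizer = TfidfVectorizer()
    # Fit on both texts so they share one vocabulary; row 0 is text1, row 1 is text2
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    return tfidf_matrix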
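To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF over whitespace-separated tokens. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn documents as its default behaviour. `tfidf_vectors` is a hypothetical name, and unlike the default `TfidfVectorizer` tokenizer this sketch keeps single-character tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times the smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so that every non-empty vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors
```

A term that appears in both documents gets the minimum idf of 1, so terms unique to one document weigh more in that document's vector.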

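3. Cosine similarity (calculate_cosine_similarity): compute the cosine similarity of the vectorized texts with the function provided by sklearn.

from sklearn.metrics.pairwise import cosine_similarity


def calculate_cosine_similarity(tfidf_matrix):
    """
    Compute the cosine similarity
    """
    if tfidf_matrix.shape[1] == 0:  # If the vectors have zero dimensions, return a similarity of 0
        return 0.0

    # Compute the cosine similarity between the two row vectors
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
    return similarity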
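For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths. A minimal sketch over sparse {term: weight} dictionaries follows. `cosine` is a hypothetical helper, and returning 0.0 when either vector is all zeros is a choice made for this sketch.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption of this sketch: an empty text matches nothing
    return dot / (norm_a * norm_b)
```

Because TF-IDF weights are never negative, the result always falls between 0 and 1 for these inputs.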

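III. Performance Analysis

Profile the program with the cProfile module, which writes its results to a file named performance_analysis_result, then visualize those results with snakeviz.

cProfile.run("plagiarism_check(orig_file, plagiarism_file, output_file)", filename="performance_analysis_result")

snakeviz.exe -p 8080 .\performance_analysis_result

The profiling results are shown in the figure:

Within the main function plagiarism_check, the function preprocess_text takes the longest to run.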
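If a text summary is enough, the same kind of profile data can be inspected with the standard-library pstats module instead of a browser. The sketch below profiles a placeholder workload (`busy_work` stands in for plagiarism_check and is not part of the project) and returns a report sorted by cumulative time; `profile_report` is likewise a hypothetical helper.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Placeholder workload standing in for plagiarism_check."""
    return sum(i * i for i in range(n))


def profile_report(func, *args, top=5):
    """Run func under cProfile and return a text report of the top entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()

    # Send the report to a string buffer instead of stdout
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


print(profile_report(busy_work, 100_000))
```

Sorting by cumulative time puts functions that are slow together with everything they call at the top, which is how a hotspot such as preprocess_text would show up.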

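IV. Unit Tests

Test point analysis

1. Testing file reading and writing

Test points:

Whether an existing file is read correctly
Whether reading a non-existent file reports an error
Whether the similarity result is saved correctly

Test code

    def test_read_file_existing(self):
        """
        Test point 1.1: an existing file can be read correctly
        Expected: text is not None
        """
        test_file = './test_text/orig.txt'
        text = main.read_file(test_file)
        assert text is not None

    def test_read_file_not_found(self):
        """
        Test point 1.2: reading a non-existent file
        Expected: an error is raised that names the missing file
        """
        # 404.txt does not exist
        test_file = './test_text/404.txt'
        with pytest.raises(FileNotFoundError):
            main.read_file(test_file)

    def test_save_similarity_to_file(self):
        """
        Test point 1.3: the similarity result is saved correctly (the output is written to the file)
        Expected: the content read back matches what was written
        """
        test_output_file = './test_text/test_output.txt'
        main.save_similarity_to_file(test_output_file, 0.75)

        with open(test_output_file, 'r', encoding='utf-8') as f:
            result = f.read()
        assert result == '0.75'
        os.remove(test_output_file)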
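One way to keep a test like 1.3 from leaving files in the repository when an assertion fails is to write into a temporary directory. The sketch below uses only the standard library. `save_similarity` is a stand-in that mirrors the behaviour described for save_similarity_to_file (two decimal places, UTF-8) rather than the project's own function, and `round_trip` is a hypothetical helper.

```python
import os
import tempfile


def save_similarity(output_file, similarity):
    """Stand-in writer: store the score with two decimal places."""
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(f"{similarity:.2f}")


def round_trip(similarity):
    """Write a score into a temporary directory and read it back."""
    # The directory and its contents are deleted when the with-block exits,
    # even if an exception is raised inside it
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "output.txt")
        save_similarity(path, similarity)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
```

With pytest, the built-in tmp_path fixture serves the same purpose without the explicit context manager.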

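2. Testing text preprocessing

Test points:

Whether normal text is processed correctly
Whether empty text is processed correctly
Whether garbled text is processed without errors

Test code

    def test_preprocess_text_normal(self):
        """
        Test point 2.1: preprocessing of normal text
        Expected: returns the segmented string
        """
        text = "你好，世界！這是一個測試。"
        result = main.preprocess_text(text)
        assert result == "你好 世界 這是 一個 測試"

    def test_preprocess_empty_text(self):
        """
        Test point 2.2: preprocessing of empty text
        Expected: returns an empty string
        """
        text = ""
        result = main.preprocess_text(text)
        assert result == ""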

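3. Testing text vectorization

Test points:

Whether normal text is vectorized correctly

Test code

    def test_vectorize_texts_normal(self):
        """
        Test point 3.1: vectorization of normal text
        Expected: the matrix shape matches the computed result
        """
        text1 = "這是一個測試"
        text2 = "這是另一個測試"
        tfidf_matrix = main.vectorize_texts(text1, text2)
        assert tfidf_matrix.shape == (2, 2)  # 2 rows (texts), 2 columns (vocabulary size)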

4. Testing the cosine similarity calculation

Test points:

Whether the similarity of different texts is computed correctly (no greater than 1)
Whether the similarity of identical texts is computed correctly (equal to 1)
Whether the similarity of completely different texts (or an empty text) is computed correctly (very small, below 0.1)

Test code

    def test_calculate_similarity_normal(self):
        """
        Test point 4.1: similarity of different texts
        Expected: returns a float no greater than 1
        """
        text1 = "這是一個測試"
        text2 = "這是另一個測試"
        text1 = main.preprocess_text(text1)
        text2 = main.preprocess_text(text2)
        tfidf_matrix = main.vectorize_texts(text1, text2)
        similarity = main.calculate_cosine_similarity(tfidf_matrix)
        assert 0 <= similarity <= 1

    def test_calculate_similarity_identical(self):
        """
        Test point 4.2: similarity of identical texts
        Expected: returns 1.0
        """
        text1 = "這是一個測試"
        text2 = "這是一個測試"
        text1 = main.preprocess_text(text1)
        text2 = main.preprocess_text(text2)
        tfidf_matrix = main.vectorize_texts(text1, text2)
        similarity = main.calculate_cosine_similarity(tfidf_matrix)
        similarity = round(similarity, 2)
        assert similarity == 1.0

    def test_calculate_similarity_completely_different(self):
        """
        Test point 4.3: similarity of completely different texts (or an empty text)
        Expected: the cosine similarity is very small, below 0.1
        """
        text1 = "這是一個測試"
        text2 = "完全不同的文字"
        text3 = ""
        tfidf_matrix1 = main.vectorize_texts(text1, text2)
        tfidf_matrix2 = main.vectorize_texts(text1, text3)
        similarity1 = main.calculate_cosine_similarity(tfidf_matrix1)
        similarity2 = main.calculate_cosine_similarity(tfidf_matrix2)
        assert similarity1 < 0.1
        assert similarity2 == 0.0

5. Testing the main flow

Test points:

Whether the main flow runs correctly
Whether missing command-line arguments are handled
Whether more than 3 command-line arguments are handled

Test code

    def test_main_flow(self):
        """
        Test point 5.1: the main flow
        Expected: every assertion passes
        """
        orig_file = './test_text/orig.txt'
        plagiarism_file = './test_text/orig_0.8_add.txt'
        output_file = './test_text/orig_output.txt'

        orig_file = main.read_file(orig_file)
        plagiarism_file = main.read_file(plagiarism_file)
        assert orig_file is not None
        assert plagiarism_file is not None

        tfidf_matrix = main.vectorize_texts(orig_file, plagiarism_file)

        similarity = main.calculate_cosine_similarity(tfidf_matrix)
        assert 0 <= float(similarity) <= 1

        main.save_similarity_to_file(output_file, similarity)
        with open(output_file, 'r', encoding='utf-8') as f:
            result = f.read()
        # Compare with the same two-decimal format the writer uses;
        # str(round(x, 2)) would give '0.8' where the file contains '0.80'
        assert result == f"{similarity:.2f}"

    def test_missing_arguments(self):
        """
        Test point 5.2: simulate a missing command-line argument
        Expected: non-zero exit code
        """
        orig_file = './test_text/orig.txt'
        plagiarism_file = './test_text/orig_0.8_add.txt'
        output_file = './test_text/orig_output.txt'

        # Run the command with os.system, passing one argument too few
        exit_code = os.system(f'python main.py {orig_file} {plagiarism_file}')
        # The program should return a non-zero status because an argument is missing
        assert exit_code != 0  # a non-zero value from os.system() indicates an error

    def test_extra_arguments(self):
        """
        Test point 5.3: simulate more than 3 command-line arguments
        Expected: non-zero exit code
        """
        orig_file = './test_text/orig.txt'
        plagiarism_file = './test_text/orig_0.8_add.txt'
        extra_file = "./test_text/orig_0.8/del.txt"
        output_file = './test_text/orig_output.txt'

        # Run the command with os.system, passing one argument too many
        exit_code = os.system(f'python main.py {orig_file} {plagiarism_file} {extra_file} {output_file}')
        # The program should return a non-zero status because there are too many arguments
        assert exit_code != 0  # a non-zero value from os.system() indicates an error
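The value returned by os.system is platform-dependent (on Unix it is an encoded wait status rather than the plain exit code), so one alternative is subprocess.run, whose returncode attribute is the child's exit code. The sketch below is self-contained: it runs a tiny inline script in place of main.py, and the argument check inside that script is an assumption about how main.py might behave, not its actual code. `FAKE_MAIN` and `run_cli` are hypothetical names.

```python
import subprocess
import sys

# Inline stand-in for main.py: exits with 1 unless exactly three arguments follow
FAKE_MAIN = "import sys; sys.exit(0 if len(sys.argv) == 4 else 1)"


def run_cli(*args):
    """Run the stand-in script with the given arguments and return its exit code."""
    # sys.executable is the interpreter running the tests; passing a list
    # avoids shell quoting problems with paths that contain spaces
    completed = subprocess.run([sys.executable, "-c", FAKE_MAIN, *args])
    return completed.returncode
```

With this helper the missing-argument and extra-argument cases can assert on an exact exit code instead of any non-zero value.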

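Coverage

V. Exception Handling

1. Read error: FileNotFoundError

Cause: the file path is wrong or the file does not exist.

import os


def read_file(file_path):
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Invalid path or file does not exist: {file_path}")

    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise IOError(f"Error while reading file: {e}") from e

2. Write error: IOError

Cause: writing may fail, for example because of insufficient permissions.

def save_similarity_to_file(output_file, similarity):
    """
    Save the similarity result to the given file, keeping two decimal places
    """
    try:
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(f"{similarity:.2f}")
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise IOError(f"Error while writing file: {e}") from e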

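VI. Test Report

The test report is generated with allure, as shown in the figure:

# Run on the command line
pytest main.py --alluredir result
# The final path is the directory where the JSON result files are written

allure generate result -o resport_allure --clean
# Reads the data from the JSON directory and writes an HTML report to the given path

allure open resport_allure
# Opens the test report page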
