[Python] Using regular expressions to filter HTML tags, URL hyperlinks, and img links out of text

Posted by CS青雀 on 2019-02-15

Test text:

"給大家看看原始文字。。。 ----------------------------<a class="member" href="https://www.aaa.com/people/0970f947b898ecc0ec035f9126dd4e08" data-hash="0970f947b898ecc0ec035f9126dd4e08" data-hovercard="p$b$0970f947b898ecc0ec035f9126dd4e08">@vczh</a> 輪子哥求擴散。</p><figure><noscript><img src="https://pic1.aaa.com/ width="1080" data-original="https://pic1.aaa.com/v2-40f4c4f1bdd70db1c7a34046440e3e64_r.jpg"></noscript>---------------”
 

Code:

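# coding: utf-8
import os
import re


def filter_file(path, filename):

    def filter_text(text):
        re_tag = re.compile(r'</?\w+[^>]*>')  # any opening or closing HTML tag
        new_text = re.sub(re_tag, '', text)  # drops <a>, <img>, ... together with their href/src attributes
        new_text = re.sub(r",+", ",", new_text)   # collapse repeated commas
        new_text = re.sub(r" +", " ", new_text)   # collapse repeated spaces
        # Collapse runs of "...", "…" or "。。。" into a single "...".
        # Alternation is used instead of a character class so that a lone
        # ".", "|" or "。" is left untouched.
        new_text = re.sub(r"(?:\.\.\.|…|。。。)+", "...", new_text)
        new_text = re.sub(r"-{2,}", "--", new_text)  # collapse runs of hyphens; a single "-" is kept
        new_text = re.sub(r"———+", "———", new_text)  # collapse runs of em-dashes
        return new_text

    print("Start!")
    file_path = os.path.join(path, filename)
    with open(file_path, "r", encoding="utf-8") as fr:
        data = fr.readlines()
        print(len(data))  # number of lines read
    # The filtered copy is written next to the input as "<filename>.filter".
    with open(file_path + ".filter", "w", encoding="utf-8") as fw:
        for line in data:
            new_line = filter_text(line)
            fw.write(new_line)
    print("Done!")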
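The tag pattern above removes URLs only when they sit inside a tag attribute such as href or src. A URL written directly in the text survives it. The following is a minimal sketch of one way to drop those as well, not part of the original code: the names TAG_RE, URL_RE and strip_tags_and_urls are made up for illustration, the URL pattern is a deliberately simple assumption (http or https, ending at whitespace, a quote, or an angle bracket), and the sample string is a shortened, invented variant of the test text.

```python
import re

# Hypothetical names for illustration; they do not appear in the original post.
TAG_RE = re.compile(r"</?\w+[^>]*>")          # same tag pattern as in filter_text
URL_RE = re.compile(r"https?://[^\s<>\"']+")  # assumed: a bare http(s) URL


def strip_tags_and_urls(text):
    """Remove HTML tags first, then any bare URLs left in the text."""
    text = TAG_RE.sub("", text)
    return URL_RE.sub("", text)


# Shortened, made-up sample in the style of the test text.
sample = (
    'See <a class="member" href="https://www.aaa.com/people/0970f947">@vczh</a> '
    'or https://www.aaa.com/people for more.'
    '<img src="https://pic1.aaa.com/v2-40f4_r.jpg">'
)
cleaned = strip_tags_and_urls(sample)
print(cleaned)
```

Tags are removed before URLs so that the URL pattern never has to deal with attribute quoting. Removing a URL can leave two spaces side by side, which the space-collapsing step in filter_text would then merge.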

 

 
