Large-Scale Asynchronous News Crawler: Extracting the Main Content of Web Pages

Posted by 王平 on 2018-12-03


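The news crawler we built in the previous posts can fetch a large number of news pages soon after it starts running, but what it stores in the database is raw HTML, which is not the final result we want. The final result should be structured data that includes at least the URL, title, publication time, body text, and source website.

Methods for extracting the main content of a web page

So the crawler has to do more than download pages; it also has to clean and extract data. That is why writing a crawler is a test of all-round ability.

A typical news page is made up of several distinct regions: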

[Figure: regions of a news web page]

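The news elements we want to extract are found in:

The title region
The metadata region (publication time, etc.)
The image region (if you also want to extract images)
The body region

Text in the navigation bar and in the related-links region is not part of the news item.

The title, publication time, and body text are generally extracted from the HTML we fetched. If the pages all came from a single website, extracting these three items would be simple: three regular expressions would do the job. Our crawler, however, fetches pages from hundreds or thousands of websites. Writing regular expressions for that many page formats would be exhausting, and an expression can stop working as soon as a site makes a small layout change, so maintaining them would be just as exhausting.

An approach that exhausting is obviously not workable, so we need to look for a better algorithm.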

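1. Extracting the title

The title almost always appears in the HTML <title> tag, but there it has extra information appended, such as the channel name and the site name;

The title also appears in the page's "title region".

So which of these two places is easier to extract the title from?

The "title region" has no obvious marker, and the HTML for it varies enormously from site to site, so this region is not easy to extract.

That leaves the <title> tag. The tag itself is easy to extract, whether with a regular expression or with lxml; the hard part is stripping out the channel name, site name, and similar information.

Let us first look at what kind of extra information shows up in <title> tags: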

上海用“智慧”啟用城市交通脈搏，讓道路更安全更有序更通暢_浦江頭條_澎湃新聞-The Paper
“滬港大學聯盟”今天在復旦大學成立_教育_新民網
三亞老人腳踹司機致公交車失控撞牆被判刑3年_社會
外交部：中美外交安全對話9日在美舉行
進博會：中國行動全球矚目，中國擔當世界點贊_南方觀瀾_南方網
資本市場迎來重大改革設立科創板有何深意？-新華網

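Looking at these titles, it is easy to see that the headline is joined to the channel name and site name by separator characters (here "_" and "-"). So we can split the title on these separators and take the longest part as the headline.

The idea is simple to implement; working it out is left to readers as an exercise.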
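As one possible starting point for this exercise, here is a minimal sketch of the split-and-take-the-longest idea. The function name longest_title_segment and the separator list are assumptions made for illustration (the separators are drawn from the sample titles above); neither is part of the crawler's own code.

```python
import re

# Separators seen between headline, channel name and site name.
# Assumption: this list comes from the sample titles and is not exhaustive.
SEPARATORS = re.compile(r'\s*(?:_|\||::|—|–|-)\s*')


def longest_title_segment(raw_title):
    """Split a <title> string on separators and return the longest segment."""
    parts = [p.strip() for p in SEPARATORS.split(raw_title) if p.strip()]
    if not parts:
        return raw_title.strip()
    return max(parts, key=len)


print(longest_title_segment('“滬港大學聯盟”今天在復旦大學成立_教育_新民網'))
```

Splitting on a bare "-" will also cut headlines that contain hyphenated words, so a version meant for real use would probably need a more careful separator list.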

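2. Extracting the publication time

The publication time is the time the page went live on its website. It usually appears below the headline, in the metadata region. In the HTML, this region has no distinctive feature we can use to locate it, and across a large number of site layouts, locating it is practically impossible. We need a different approach.
As with titles, let us first look at how some sites write the publication time: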

央視網2018年11月06日 22:22
時間：2018-11-07 14:27:00
2018-11-07 11:20:37 來源：新華網
來源：中國日報網 2018-11-07 08:06:39
2018年11月07日 07:39:19
2018-11-06 09:58 來源：澎湃新聞

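All these on-page publication times share one trait: each is a string expressing a time, built from year, month, day, hour, minute, and second and nothing else. By enumerating regular expressions for the different ways of writing a time (there are only a handful), we can match them against the page text and extract the publication time.

This idea is also easy to implement, but there are many details, and the patterns need to cover as many formats as possible, so writing a good function for extracting the publication time is not that easy. Readers are encouraged to try it and see what they come up with; this is another exercise.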
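To make the regular-expression idea concrete, here is a minimal sketch that covers the sample formats listed above. The function name extract_publish_time and the pattern itself are illustrative assumptions, not the author's implementation, and real pages will need more formats than this handles.

```python
import re
from datetime import datetime

# Year, month and day separated by '-', '/', '.' or 年/月/日, followed by an
# optional HH:MM[:SS] time. Illustrative only; not an exhaustive pattern.
PUBLISH_TIME_RE = re.compile(
    r'(?P<year>20\d{2})\s*[-/.年]\s*'
    r'(?P<month>\d{1,2})\s*[-/.月]\s*'
    r'(?P<day>\d{1,2})\s*日?'
    r'(?:\s*(?P<hour>\d{1,2}):(?P<minute>\d{2})(?::(?P<second>\d{2}))?)?'
)


def extract_publish_time(text):
    """Return the first plausible timestamp in text as a datetime, or None."""
    for m in PUBLISH_TIME_RE.finditer(text):
        # Missing time fields (no clock time on the page) default to 0
        parts = {k: int(v) if v else 0 for k, v in m.groupdict().items()}
        try:
            return datetime(parts['year'], parts['month'], parts['day'],
                            parts['hour'], parts['minute'], parts['second'])
        except ValueError:
            # Not a real date (for example month 13): keep searching
            continue
    return None


print(extract_publish_time('央視網2018年11月06日 22:22'))
```

Returning the first match is a simplification: on a full page, dates inside the body text or in related links can appear before the real publication time, so some notion of position or context would be needed.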

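3. Extracting the body text

The body (including news images) is the main part of a news page. Visually it occupies the center of the page and is the main text area of the story. There are many ways to extract it, some complex and some simple. The method introduced here is a simple, fast one that came out of the author's years of practice and reflection; let us call it the "node text density method".

As we know, the HTML of a web page consists of tags arranged in a tree, with each tag being a node of the tree. By traversing every node of this tree and finding the node with the most text, we find the node that holds the body. Let us implement this idea in code.

3.1 Source code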

#!/usr/bin/env python3
#File: maincontent.py
#Author: veelion

import re
import traceback

import cchardet
import lxml.etree
import lxml.html
from lxml.html import HtmlComment

REGEXES = {
    'okMaybeItsACandidateRe': re.compile(
        'and|article|artical|body|column|main|shadow', re.I),
    'positiveRe': re.compile(
        ('article|arti|body|content|entry|hentry|main|page|'
         'artical|zoom|arti|context|message|editor|'
         'pagination|post|txt|text|blog|story'), re.I),
    'negativeRe': re.compile(
        ('copyright|combx|comment|com-|contact|foot|footer|footnote|decl|copy|'
         'notice|'
         'masthead|media|meta|outbrain|promo|related|scroll|link|pagebottom|bottom|'
         'other|shoutbox|sidebar|sponsor|shopping|tags|tool|widget'), re.I),
}



class MainContent:
    def __init__(self,):
        self.non_content_tag = set([
            'head',
            'meta',
            'script',
            'style',
            'object', 'embed',
            'iframe',
            'marquee',
            'select',
        ])
        self.title = ''
        self.p_space = re.compile(r'\s')
        self.p_html = re.compile(r'<html|</html>', re.IGNORECASE|re.DOTALL)
        self.p_content_stop = re.compile(r'正文.*結束|正文下|相關閱讀|宣告')
        self.p_clean_tree = re.compile(r'author|post-add|copyright')

    def get_title(self, doc):
        title = ''
        title_el = doc.xpath('//title')
        if title_el:
            title = title_el[0].text_content().strip()
        if len(title) < 7:
            tt = doc.xpath('//meta[@name="title"]')
            if tt:
                title = tt[0].get('content', '')
        if len(title) < 7:
            tt = doc.xpath('//*[contains(@id, "title") or contains(@class, "title")]')
            if not tt:
                tt =  doc.xpath('//*[contains(@id, "font01") or contains(@class, "font01")]')
            for t in tt:
                ti = t.text_content().strip()
                if ti in title and len(ti)*2 > len(title):
                    title = ti
                    break
                if len(ti) > 20: continue
                if len(ti) > len(title) or len(ti) > 7:
                    title = ti
        return title

    def shorten_title(self, title):
        spliters = [' - ', '–', '—', '-', '|', '::']
        for s in spliters:
            if s not in title:
                continue
            tts = title.split(s)
            if len(tts) < 2:
                continue
            title = tts[0]
            break
        return title

    def calc_node_weight(self, node):
        weight = 1
        attr = '%s %s %s' % (
            node.get('class', ''),
            node.get('id', ''),
            node.get('style', '')
        )
        if attr:
            mm = REGEXES['negativeRe'].findall(attr)
            weight -= 2 * len(mm)
            mm = REGEXES['positiveRe'].findall(attr)
            weight += 4 * len(mm)
        if node.tag in ['div', 'p', 'table']:
            weight += 2
        return weight

    def get_main_block(self, url, html, short_title=True):
        ''' return (title, etree_of_main_content_block)
        '''
        if isinstance(html, bytes):
            encoding = cchardet.detect(html)['encoding']
            if encoding is None:
                return None, None
            html = html.decode(encoding, 'ignore')
        try:
            doc = lxml.html.fromstring(html)
            doc.make_links_absolute(base_url=url)
        except Exception:
            # Unparseable HTML: log the error and give up on this page
            traceback.print_exc()
            return None, None
        self.title = self.get_title(doc)
        if short_title:
            self.title = self.shorten_title(self.title)
        body = doc.xpath('//body')
        if not body:
            return self.title, None
        candidates = []
        nodes = body[0].getchildren()
        while nodes:
            node = nodes.pop(0)
            children = node.getchildren()
            tlen = 0
            for child in children:
                if isinstance(child, HtmlComment):
                    continue
                if child.tag in self.non_content_tag:
                    continue
                if child.tag == 'a':
                    continue
                if child.tag == 'textarea':
                    # FIXME: this tag is only part of content?
                    continue
                attr = '%s%s%s' % (child.get('class', ''),
                                   child.get('id', ''),
                                   child.get('style', ''))
                if 'display' in attr and 'none' in attr:
                    continue
                nodes.append(child)
                if child.tag == 'p':
                    weight = 3
                else:
                    weight = 1
                text = '' if not child.text else child.text.strip()
                tail = '' if not child.tail else child.tail.strip()
                tlen += (len(text) + len(tail)) * weight
            if tlen < 10:
                continue
            weight = self.calc_node_weight(node)
            candidates.append((node, tlen*weight))
        if not candidates:
            return self.title, None
        candidates.sort(key=lambda a: a[1], reverse=True)
        good = candidates[0][0]
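        if good.tag in ['p', 'pre', 'code', 'blockquote']:
            # Text-level nodes are too narrow; climb to the enclosing div
            for i in range(5):
                parent = good.getparent()
                if parent is None:
                    break
                good = parent
                if good.tag == 'div':
                    break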
        good = self.clean_etree(good, url)
        return self.title, good

    def clean_etree(self, tree, url=''):
        to_drop = []
        drop_left = False
        for node in tree.iterdescendants():
            if drop_left:
                to_drop.append(node)
                continue
            if isinstance(node, HtmlComment):
                to_drop.append(node)
                if self.p_content_stop.search(node.text or ''):
                    drop_left = True
                continue
            if node.tag in self.non_content_tag:
                to_drop.append(node)
                continue
            attr = '%s %s' % (
                node.get('class', ''),
                node.get('id', '')
            )
            if self.p_clean_tree.search(attr):
                to_drop.append(node)
                continue
            aa = node.xpath('.//a')
            if aa:
                text_node = len(self.p_space.sub('', node.text_content()))
                text_aa = 0
                for a in aa:
                    alen = len(self.p_space.sub('', a.text_content()))
                    if alen > 5:
                        text_aa += alen
                if text_aa > text_node * 0.4:
                    to_drop.append(node)
        for node in to_drop:
            try:
                node.drop_tree()
            except Exception:
                # Best effort: skip nodes that can no longer be removed
                pass
        return tree

    def get_text(self, doc):
        lxml.etree.strip_elements(doc, 'script')
        lxml.etree.strip_elements(doc, 'style')
        for ch in doc.iterdescendants():
            if not isinstance(ch.tag, str):
                continue
            if ch.tag in ['div', 'h1', 'h2', 'h3', 'p', 'br', 'table', 'tr', 'dl']:
                if not ch.tail:
                    ch.tail = '\n'
                else:
                    ch.tail = '\n' + ch.tail.strip() + '\n'
            if ch.tag in ['th', 'td']:
                if not ch.text:
                    ch.text = '  '
                else:
                    ch.text += '  '
            # if ch.tail:
            #     ch.tail = ch.tail.strip()
        lines = doc.text_content().split('\n')
        content = []
        for l in lines:
            l = l.strip()
            if not l:
                continue
            content.append(l)
        return '\n'.join(content)

    def extract(self, url, html):
        '''return (title, content)
        '''
        title, node = self.get_main_block(url, html)
        if node is None:
            print('\tno main block got !!!!!', url)
            return title, ''
        content = self.get_text(node)
        return title, content

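3.2 Code walkthrough

As with the news crawler, we implement the whole algorithm as a class: MainContent.

First, a global variable is defined: REGEXES. It collects keywords that often appear in the class and id attributes of tags and that signal whether a tag is likely to be the body or not. We use these words to compute a weight for each tag node, which is what the method calc_node_weight() does.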
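A small worked example may help show how the scoring in calc_node_weight() plays out. The sketch below applies the same rule (start at 1, subtract 2 per negative keyword, add 4 per positive keyword, add 2 for div, p, and table) to plain strings. The function node_weight and the shortened keyword lists are assumptions for illustration; the full lists are the ones in REGEXES.

```python
import re

# Abbreviated keyword lists for illustration; the full lists live in REGEXES.
POSITIVE = re.compile(r'article|content|main|post|text', re.I)
NEGATIVE = re.compile(r'comment|footer|related|sidebar|sponsor', re.I)


def node_weight(tag, css_class='', element_id=''):
    """Apply the scoring rule of calc_node_weight() to plain strings."""
    attr = '%s %s' % (css_class, element_id)
    weight = 1
    weight -= 2 * len(NEGATIVE.findall(attr))   # penalise boilerplate hints
    weight += 4 * len(POSITIVE.findall(attr))   # reward body-text hints
    if tag in ('div', 'p', 'table'):
        weight += 2                             # typical container tags
    return weight


print(node_weight('div', 'article-content'))   # two positive hits
print(node_weight('div', 'sidebar related'))   # two negative hits
```

Because get_main_block() multiplies this weight by the node's text length, a node with a negative weight ends up with a negative score and sorts behind every positively weighted candidate.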

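When MainContent is initialized, it first defines self.non_content_tag, a set of tags that never contain body text; nodes with these tags are simply skipped.

Title extraction is implemented in get_title(). It first takes the content of the <title> tag. If that is too short (fewer than 7 characters), it tries to find a title in <meta>, then looks in <body> for nodes whose id or class contains "title", and finally compares the candidate titles from these places to arrive at the title. The comparison rules are:

If a candidate found in <meta> or <body> is contained in the <title> text, it is a clean title (no channel or site name);
If a candidate is too long, it is ignored;
The <title> tag is the primary source of the title.

Taking the title from the <title> tag means we have to clean it. A simple method is implemented for this: shorten_title(). It splits the title on the first separator it finds and keeps the first segment.

In this implementation we use lxml.html to turn the page HTML into a tree, traverse every node starting from body, look at the length of the text each node holds directly (not text nested deeper in the tree), and pick the node with the most text. In the code, this length is the sum of the text and tail of the node's immediate children, with <p> children counted three times. The process is implemented in get_main_block(). Readers may want to study some of its details closely.

One such detail is the clean_etree() function. The node returned by get_main_block() may contain links to related news. These links carry many news headlines, and if they are not removed they pollute the news content (with related headlines, summaries, and so on). clean_etree() therefore drops any descendant in which link text makes up more than 40% of the text.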
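The link-filtering step in clean_etree() can be illustrated in isolation. The sketch below computes the share of a node's text that sits inside <a> tags; the helper name link_density and the sample HTML are made up for this example, and unlike clean_etree() it does not skip very short anchors.

```python
import re
import lxml.html

WHITESPACE = re.compile(r'\s')


def link_density(node):
    """Fraction of a node's non-whitespace characters that are inside <a> tags."""
    total = len(WHITESPACE.sub('', node.text_content()))
    if total == 0:
        return 0.0
    linked = sum(len(WHITESPACE.sub('', a.text_content()))
                 for a in node.xpath('.//a'))
    return linked / total


html = (
    '<div>'
    '<p id="body">The council approved the new transit plan on Tuesday '
    'after a long debate. <a href="/plan">Details</a></p>'
    '<ul id="related">'
    '<li><a href="/a">Earlier coverage of the transit plan</a></li>'
    '<li><a href="/b">Opinion: what the plan means for commuters</a></li>'
    '</ul>'
    '</div>'
)
doc = lxml.html.fromstring(html)
body = doc.get_element_by_id('body')
related = doc.get_element_by_id('related')
print(link_density(body), link_density(related))
```

The paragraph stays well under the 0.4 threshold used in clean_etree(), while the related-links list consists entirely of link text and would be dropped.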

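Another detail is the get_text() function. To extract text from the main block we do not call text_content() directly; we first do some formatting, such as adding a newline (\n) after certain tags and adding spaces between table cells. The resulting text is closer to the layout of the original page.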
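The difference is easy to see on a tiny fragment. The sketch below is a simplified version of the newline handling in get_text(), without the table-cell spacing; block_text is a hypothetical helper name. Like get_text(), it modifies the tree it is given.

```python
import lxml.html

BLOCK_TAGS = ('div', 'h1', 'h2', 'h3', 'p', 'br', 'table', 'tr', 'dl')


def block_text(node):
    """Return the node's text with one line per block-level element."""
    for el in node.iterdescendants():
        # Comments have a non-string tag, so check the type first
        if isinstance(el.tag, str) and el.tag in BLOCK_TAGS:
            el.tail = '\n' + (el.tail or '')
    lines = (line.strip() for line in node.text_content().split('\n'))
    return '\n'.join(line for line in lines if line)


html = '<div><h1>Headline</h1><p>First paragraph.</p><p>Second paragraph.</p></div>'
flat = lxml.html.fromstring(html).text_content()
structured = block_text(lxml.html.fromstring(html))
print(repr(flat))
print(repr(structured))
```

With plain text_content() the headline and paragraphs run together into one string; after the newline is inserted into each block element's tail, every block lands on its own line.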

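Key takeaways for crawlers

1. The cchardet module
A module for fast detection of text encoding.

2. The lxml.html module
A module that gives structure to HTML and a tool for parsing pages with XPath. It is efficient and easy to use, and is a staple for writing crawlers.

3. The complexity of content extraction
The body-extraction algorithm implemented here can, for the most part, correctly handle more than 90% of news pages.
However, just as no two web pages in the world are exactly alike, no extraction algorithm works once and for all. When you use this algorithm at scale you will run into odd pages, and you will then need to improve the class to handle them. Readers are very welcome to submit their improvements on github so that, together, we keep making the algorithm better.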

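Exercises

Write code that removes the channel name, site name, and other noise from the <title> string to get a clean headline.
Write a function that extracts the publication time from the page text. (Hint: use regular expressions.)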

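After a dozen or so posts of build-up, it is finally time for the big boss. In the next post we cover:
Implementing an asynchronous crawler

I share more notes and experience on my WeChat official account, 猿人學 Python; feel free to follow it.

***Copyright notice: unless otherwise stated, all articles are original works of 猿人學 yuanrenxue.com. Do not reproduce them in any form without authorization from 猿人學.***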
