How to crawl Zhihu topics with a Python crawler

Posted by 文小靜 on 2019-02-16

Original URL: https://flycode.co/archives/80548

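I'm building 觀點, whose rooms are similar to Zhihu topics, so I needed a way to crawl the topics. After a fair amount of trial and error I got it working reliably. The code is written in Python 2; if you don't know Python you will need to learn it on your own. If you do, go straight to the code, which is ready to use.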

#coding:utf-8
"""
@author:haoning
@create time:2015.8.5
"""
from __future__ import division  # true division
from Queue import Queue
import json
import os
import re
import platform
import uuid
import urllib
import urllib2
import sys
import time
import MySQLdb as mdb
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding( "utf-8" )

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://www.zhihu.com/topics',
    'Cookie': '__utma=51854390.517069884.1416212035.1416212035.1416212035.1; q_c1=c02bf44d00d240798bfabcfc95baeb56|1455778173000|1416205243000; _za=b1c8ae35-f986-46a2-b24a-cb9359dc6b2a; aliyungf_tc=AQAAAJ1m71jL1woArKqF22VFnL/wRy6C; _xsrf=9d494558f9271340ab24598d85b2a3c8; cap_id="MDNiMjcwM2U0MTRhNDVmYjgxZWVhOWI0NTA2OGU5OTg=|1455864276|2a4ce8247ebd3c0df5393bb5661713ad9eec01dd"; n_c=1; _alicdn_sec=56c6ba4d556557d27a0f8c876f563d12a285f33a'
}

DB_HOST = '127.0.0.1'
DB_USER = 'root'
DB_PASS = 'root'

queue = Queue()  # queue of topics waiting to be crawled
nodeSet=set()
keywordSet=set()
stop=0
offset=-20
level=0
maxLevel=7
counter=0
base=""

conn = mdb.connect(DB_HOST, DB_USER, DB_PASS, 'zhihu', charset='utf8')
conn.autocommit(False)
curr = conn.cursor()

def get_html(url):
    try:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req,None,3)  # a proxy should be added here
        html = response.read()
        return html
    except:
        pass
    return None

def getTopics():
    url = 'https://www.zhihu.com/topics'
    print url
    try:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)  # a proxy should be added here
        html = response.read().decode('utf-8')
        print html
        soup = BeautifulSoup(html)
        lis = soup.find_all('li', {'class' : 'zm-topic-cat-item'})

        for li in lis:
            data_id=li.get('data-id')
            name=li.text
            # insert the category only if it is not stored yet
            curr.execute('select id from classify_new where name=%s',(name,))
            y= curr.fetchone()
            if not y:
                curr.execute('INSERT INTO classify_new(data_id,name) VALUES(%s,%s)',(data_id,name))
        conn.commit()
    except Exception as e:
        print "get topic error",e
        

def get_extension(name):  
    where=name.rfind('.')
    if where!=-1:
        return name[where:len(name)]
    return None


def which_platform():
    sys_str = platform.system()
    return sys_str

def GetDateString():
    when=time.strftime('%Y-%m-%d',time.localtime(time.time()))
    foldername = str(when)
    return foldername 

def makeDateFolder(par,classify):
    try:
        if os.path.isdir(par):
            newFolderName=par + '//' + GetDateString() + '//'  +str(classify)
            if which_platform()=="Linux":
                newFolderName=par + '/' + GetDateString() + "/" +str(classify)
            if not os.path.isdir( newFolderName ):
                os.makedirs( newFolderName )
            return newFolderName
        else:
            return None 
    except Exception,e:
        print "kk",e
    return None 

def download_img(url,classify):
    try:
        extention=get_extension(url)
        if(extention is None):
            return None
        req = urllib2.Request(url)
        resp = urllib2.urlopen(req,None,3)
        dataimg=resp.read()
        name=str(uuid.uuid1()).replace("-","")+"_www.guandn.com"+extention
        top="E://topic_pic"
        folder=makeDateFolder(top, classify)
        filename=None
        if folder is not None:
            filename  =folder+"//"+name
        try:
            if "e82bab09c_m" in str(url):
                return True
            if not os.path.exists(filename):
                file_object = open(filename,'w+b')
                file_object.write(dataimg)
                file_object.close()
                return '/room/default/'+GetDateString()+'/'+str(classify)+"/"+name
            else:
                print "file exist"
                return None
        except IOError,e1:
            print "e1=",e1
            pass
    except Exception as e:
        print "eee",e
        pass
    return None  # nothing was downloaded; fall back to the original site's link

def getChildren(node,name):
    global queue,nodeSet
    try:
        url="https://www.zhihu.com/topic/"+str(node)+"/hot"
        html=get_html(url)
        if html is None:
            return
        soup = BeautifulSoup(html)
        p_ch='父話題'  # "parent topic"
        node_name=soup.find('div', {'id' : 'zh-topic-title'}).find('h1').text
        topic_cla=soup.find('div', {'class' : 'child-topic'})
        if topic_cla is not None:
            try:
                p_ch=str(topic_cla.text)
                aList = soup.find_all('a', {'class' : 'zm-item-tag'})  # collect all child nodes
                if u'子話題' in p_ch:  # "child topics"
                    for a in aList:
                        token=a.get('data-token')
                        # strip line breaks and tabs from the tag markup
                        a=str(a).replace('\n','').replace('\t','').replace('\r','')
                        start=str(a).find('>')
                        end=str(a).rfind('</a>')
                        new_node=str(str(a)[start+1:end])
                        curr.execute('select id from rooms where name=%s',(new_node,))  # skip names that are already stored
                        y= curr.fetchone()
                        if not y:
                            print "y=",y,"new_node=",new_node,"token=",token
                            queue.put((token,new_node,node_name))
            except Exception as e:
                print "add queue error",e
    except Exception as e:
        print "get html error",e
        
    

def getContent(n,name,p,top_id):
    try:
        global counter
        curr.execute('select id from rooms where name=%s',(name,))  # skip names that are already stored
        y= curr.fetchone()
        print "exist?? ",y,"n=",n
        if not y:
            url="https://www.zhihu.com/topic/"+str(n)+"/hot"
            html=get_html(url)
            if html is None:
                return
            soup = BeautifulSoup(html)
            title=soup.find('div', {'id' : 'zh-topic-title'}).find('h1').text
            pic_path=soup.find('a',{'id':'zh-avartar-edit-form'}).find('img').get('src')
            description=soup.find('div',{'class':'zm-editable-content'})
            if description is not None:
                description=description.text

            if (u"未歸類" in title or u"根話題" in title):  # "uncategorized" / "root topic": still stored, to avoid an endless loop
                description=None
                
            tag_path=download_img(pic_path,top_id)
            print "tag_path=",tag_path
            if (tag_path is not None) or tag_path==True:
                if tag_path==True:
                    tag_path=None
                father_id=2  # default parent: the miscellaneous room
                curr.execute('select id from rooms where name=%s',(p,))
                results = curr.fetchall()
                for r in results:
                    father_id=r[0]
                name=title
                curr.execute('select id from rooms where name=%s',(name,))  # skip names that are already stored
                y= curr.fetchone()
                print "store see..",y
                if not y:
                    friends_num=0
                    temp = time.time()
                    x = time.localtime(float(temp))
                    create_time = time.strftime("%Y-%m-%d %H:%M:%S",x)  # current time
                    creater_id=None
                    room_avatar=tag_path
                    is_pass=1
                    has_index=0
                    reason_id=None
                    #print father_id,name,friends_num,create_time,creater_id,room_avatar,is_pass,has_index,reason_id
                    # only rows that passed every check above are stored
                    counter=counter+1
                    curr.execute("INSERT INTO rooms(father_id,name,friends_num,description,create_time,creater_id,room_avatar,is_pass,has_index,reason_id)VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",(father_id,name,friends_num,description,create_time,creater_id,room_avatar,is_pass,has_index,reason_id))
                    conn.commit()  # commit immediately, otherwise child rows cannot find their parent
                    if counter % 200==0:
                        print "current node",name,"num",counter
    except Exception as e:
        print "get content error",e       

def work():
    global queue
    curr.execute('select id,node,parent,name from classify where status=1')
    results = curr.fetchall()
    for r in results:
        top_id=r[0]
        node=r[1]
        parent=r[2]
        name=r[3]
        try:
            queue.put((node,name,parent))  # seed the queue
            while queue.qsize() >0:
                n,name,p=queue.get()  # dequeue the head node (token, name, parent)
                getContent(n,name,p,top_id)
                getChildren(n,name)  # enqueue the children of the dequeued node
            conn.commit()
        except Exception as e:
            print "what's wrong",e
            
def new_work():
    global queue
    curr.execute('select id,data_id,name from classify_new_copy where status=1')
    results = curr.fetchall()
    for r in results:
        top_id=r[0]
        data_id=r[1]
        name=r[2]
        try:
            get_topis(data_id,name,top_id)
        except:
            pass


def get_topis(data_id,name,top_id):
    global queue
    url = 'https://www.zhihu.com/node/TopicsPlazzaListV2'
    isGet = True
    offset = -20
    data_id=str(data_id)
    while isGet:
        offset = offset + 20
        values = {'method': 'next', 'params': '{"topic_id":'+data_id+',"offset":'+str(offset)+',"hash_id":""}'}
        try:
            msg=None
            try:
                data = urllib.urlencode(values)
                request = urllib2.Request(url,data,headers)
                response = urllib2.urlopen(request,None,5)
                html=response.read().decode('utf-8')
                json_str = json.loads(html)
                ms=json_str['msg']
                if len(ms) <5:
                    break
                msg=ms[0]
            except Exception as e:
                print "eeeee",e
            #print msg
            if msg is not None:
                soup = BeautifulSoup(str(msg))
                blks = soup.find_all('div', {'class' : 'blk'})
                for blk in blks:
                    page=blk.find('a').get('href')
                    if page is not None:
                        node=page.replace("/topic/","")  # use each listed topic as a new seed
                        parent=name
                        ne=blk.find('strong').text
                        try:
                            queue.put((node,ne,parent))  # seed the queue
                            while queue.qsize() >0:
                                n,name,p=queue.get()  # dequeue the head node
                                size=queue.qsize()
                                if size > 0:
                                    print size
                                getContent(n,name,p,top_id)
                                getChildren(n,name)  # enqueue the children of the dequeued node
                            conn.commit()
                        except Exception as e:
                            print "what's wrong",e
        except urllib2.URLError, e:
            print "error is",e
            pass 
            
        
if __name__ == '__main__':
    i=0
    while i<400:
        new_work()
        i=i+1
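
The script above is essentially a breadth-first traversal: a seed topic goes into the queue, each dequeued topic is stored, and its child topics are enqueued in turn. The following is only a minimal sketch of that loop, not the author's code. It assumes a hypothetical in-memory `TOPIC_TREE` in place of live Zhihu requests, a hypothetical `crawl` helper, and a `seen` set instead of the database lookup the script uses to skip duplicates. The depth limit is also an assumption; the script defines `maxLevel` but never applies it.

```python
from collections import deque

# Hypothetical topic graph standing in for Zhihu pages: token -> child tokens.
TOPIC_TREE = {
    "root": ["tech", "life"],
    "tech": ["python", "life"],  # "life" is reachable twice on purpose
    "life": [],
    "python": [],
}


def children_of(token):
    """Stand-in for getChildren: look the children up instead of parsing HTML."""
    return TOPIC_TREE.get(token, [])


def crawl(seed, get_children, max_level=7):
    """Breadth-first walk from seed; returns (token, parent, level) in visit order."""
    seen = {seed}
    queue = deque([(seed, None, 0)])
    visited = []
    while queue:
        token, parent, level = queue.popleft()
        visited.append((token, parent, level))
        if level >= max_level:
            continue  # do not expand nodes at the depth limit
        for child in get_children(token):
            if child not in seen:  # each topic is queued at most once
                seen.add(child)
                queue.append((child, token, level + 1))
    return visited
```

Calling `crawl("root", children_of)` visits every reachable topic once, recording the parent through which it was first found. In the real crawler the `visited.append` step would correspond to `getContent` and the `get_children` callback to `getChildren`.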

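A note on the database: I won't upload an attachment here. Create the tables yourself from the column names used in the code, since the schema really is simple. I use MySQL; adapt the schema to your own needs.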
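
As a starting point, the tables can be sketched from the column names that appear in the script's SELECT and INSERT statements. Everything else here is an assumption: the column types, the `id` primary keys, the hypothetical helpers `create_tables` and `room_exists`, and the use of `sqlite3` in place of MySQL so that the sketch runs without a database server. A real MySQL schema would use its own types and `AUTO_INCREMENT`.

```python
import sqlite3

# Column names come from the script; types and keys are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS classify (
    id     INTEGER PRIMARY KEY AUTOINCREMENT,
    node   TEXT,
    parent TEXT,
    name   TEXT,
    status INTEGER DEFAULT 1
);
CREATE TABLE IF NOT EXISTS classify_new (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    data_id TEXT,
    name    TEXT
);
CREATE TABLE IF NOT EXISTS classify_new_copy (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    data_id TEXT,
    name    TEXT,
    status  INTEGER DEFAULT 1
);
CREATE TABLE IF NOT EXISTS rooms (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    father_id   INTEGER,
    name        TEXT,
    friends_num INTEGER,
    description TEXT,
    create_time TEXT,
    creater_id  INTEGER,
    room_avatar TEXT,
    is_pass     INTEGER,
    has_index   INTEGER,
    reason_id   INTEGER
);
"""


def create_tables(conn):
    """Create the four tables the crawler reads from and writes to."""
    conn.executescript(SCHEMA)
    conn.commit()


def room_exists(conn, name):
    """Mirror of the script's duplicate check on rooms.name."""
    row = conn.execute("SELECT id FROM rooms WHERE name = ?", (name,)).fetchone()
    return row is not None
```

With `conn = sqlite3.connect(":memory:")` followed by `create_tables(conn)`, the duplicate check can be tried out before pointing the crawler at a real MySQL instance.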

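If anything is unclear, you can find me on 轉盤網, which I also developed. The current QQ group number is kept up to date there; I'm not leaving a QQ number here so that the post doesn't get removed by the system.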
