Python Crawler Camp, Cohort 2, Assignment No. 1: Jianshu

Posted by weixin_34377065 on 2017-05-23

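As for Qiushibaike, I've scraped something similar before:
http://www.jianshu.com/p/a191726ed66d
To stay focused, this time I mainly scraped Jianshu. The students in Group 2 had been scraping Jianshu enthusiastically, but I had never tried it even once. At the opening ceremony Teacher Xiangyou (向右) singled me out for praise, and sure enough, by the law of conservation of luck, I struggled with a site this typical and with this many reference articles... (in truth, my skills just aren't there yet).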

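Based on the screenshot, the first level is jianshu.com.
For the second level I chose class=note-list (I also tried id=list-container, which works too). For the third level I followed the loop pattern I had used on other sites and selected author, title, column, and so on.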

[Image: 000.png]

My own code:

import requests
from lxml import etree

url = 'http://www.jianshu.com'
# getReqHeaders() is defined in the full listing further down
html = requests.get(url, headers=getReqHeaders()).content
selector = etree.HTML(html)
infos = selector.xpath('//*[@class="note-list"]/li')
print(infos)

for info in infos:
    title = info.xpath('//a[@class="title"]/text()')[0]
    author = info.xpath('//div[@class="name"]/text()')[0]
    collection = info.xpath('//div[2]/a[1]/text()')[0]
    print(title, '      ', author, '      ', collection)

It did produce output, but the results just repeat over and over, like walking in circles...

[Image: 圖片.png]

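If I change the [0] after text() to [1], [2], [3] in turn, each item does get listed line by line. But that is not how I did it before, and I can't tell from the page structure why it would be necessary this time.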

So the main problem is still that my XPath selection was wrong.
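
One likely explanation for the repeated rows: in lxml, an XPath that starts with `//` is evaluated from the document root even when it is called on a child element, so every `<li>` sees the same full list of matches and `[0]` always returns the first one. The sketch below reproduces this on a small made-up HTML snippet (the markup and titles are assumptions for illustration, not the real Jianshu page) and contrasts it with a path starting with `.//`, which is relative to the current `<li>`.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the list markup (not the real page).
HTML = """
<ul class="note-list">
  <li><div class="content"><a class="title">First post</a></div></li>
  <li><div class="content"><a class="title">Second post</a></div></li>
  <li><div class="content"><a class="title">Third post</a></div></li>
</ul>
"""

selector = etree.HTML(HTML)
infos = selector.xpath('//ul[@class="note-list"]/li')

# '//' starts at the document root, so each <li> sees all three titles
# and [0] always picks the first one.
absolute = [info.xpath('//a[@class="title"]/text()')[0] for info in infos]

# './/' starts at the current <li>, so each iteration sees only its own title.
relative = [info.xpath('.//a[@class="title"]/text()')[0] for info in infos]

print(absolute)  # ['First post', 'First post', 'First post']
print(relative)  # ['First post', 'Second post', 'Third post']
```

This would also explain why swapping `[0]` for `[1]`, `[2]`, `[3]` appeared to work: those indexes walk through the document-wide match list rather than through the current item.
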
Here is the code as later revised by Cheng Gong (程工); I have kept my original lines as comments underneath:

import random
import requests
from lxml import etree

def getReqHeaders():  # Purpose: pick a random HTTP User-Agent
    user_agents = ["Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"]
    user_agent = random.choice(user_agents)
    req_headers = {'User-Agent': user_agent}
    return req_headers

url = 'http://www.jianshu.com'
html = requests.get(url, headers=getReqHeaders()).content
selector = etree.HTML(html)
infos=selector.xpath('//div[@id="list-container"]/ul[@class="note-list"]/li')
#infos= selector.xpath('//*[@class="note-list"]/li')
print(infos)

for info in infos:
    author = info.xpath('div/div[1]/div/a/text()')[0]
    #author=info.xpath('//div[@class="name"]/text()')[0]
    authorurl = 'http://www.jianshu.com' + info.xpath('div/div[1]/div/a/@href')[0]
    title = info.xpath('div/a/text()')[0] if len(info.xpath('div/a/text()')) > 0 else ""
    #title=info.xpath('//a[@class="title"]/text()')[0]
    print(title, '      ', author)
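
The title line above evaluates the same XPath twice, once to check the length and once to take the first element. A small helper can do the lookup once and fall back to a default. This is only a sketch: `first_text` is a hypothetical name, the HTML is made up for illustration, and unlike the original line it also strips surrounding whitespace.

```python
from lxml import etree

def first_text(node, path, default=""):
    """Return the first text match of `path` under `node`, stripped, or `default`."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default

# Hypothetical markup: the second entry has no title link.
HTML = """
<ul class="note-list">
  <li><div><a>Has a title</a></div></li>
  <li><div></div></li>
</ul>
"""

items = etree.HTML(HTML).xpath('//ul[@class="note-list"]/li')
titles = [first_text(li, 'div/a/text()') for li in items]
print(titles)  # ['Has a title', '']
```
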

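Cheng Gong's code came straight from the browser's Copy XPath. I tried that at first too, but for some reason it didn't work, and reading the paths myself was hard going (they aren't very intuitive), so I switched back to class=name. In short, my skills still aren't up to scratch.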

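Cheng Gong also dug up Teacher Xiangyou's reference article, "Revisiting Structured Data Extraction with Scrapy" (《再談Scrapy抓取結構化資料》):
http://www.jianshu.com/p/3d52e6046782
It is about Scrapy, but it also covers the structure of the Jianshu home page, so I compared the three versions. (In each group, from top to bottom: Teacher Xiangyou, Cheng Gong, and me, the newbie.)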

infos = selector.xpath('//ul[@class="note-list"]/li')
infos = selector.xpath('//div[@id="list-container"]/ul[@class="note-list"]/li')
#infos= selector.xpath('//*[@class="note-list"]/li')

for info in infos:
    title = info.xpath('div/a/text()').extract()[0]  # Scrapy selector
    title = info.xpath('div/a/text()')[0]  # conditional omitted for now
    #title=info.xpath('//a[@class="title"]/text()')[0]

    author = info.xpath('div/div[1]/div/a/text()').extract()[0]  # Scrapy selector
    author = info.xpath('div/div[1]/div/a/text()')[0]
    #author=info.xpath('//div[@class="name"]/text()')[0]
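
To check where the three versions actually differ, the sketch below runs the three list selectors against a small made-up page (an assumption modelled on the structure described here, not the real Jianshu markup). On such a structure all three return the same `<li>` elements, which suggests the difference that matters is in the per-item paths inside the loop rather than in how `infos` is selected.

```python
from lxml import etree

# Hypothetical, simplified markup modelled on the structure described in the post.
HTML = """
<div id="list-container">
  <ul class="note-list">
    <li><div class="content"><a class="title">A</a></div></li>
    <li><div class="content"><a class="title">B</a></div></li>
  </ul>
</div>
"""

selector = etree.HTML(HTML)
paths = [
    '//ul[@class="note-list"]/li',
    '//div[@id="list-container"]/ul[@class="note-list"]/li',
    '//*[@class="note-list"]/li',
]
results = [selector.xpath(p) for p in paths]

# All three expressions return the same two <li> elements.
same_nodes = all(r == results[0] for r in results)
print(len(results[0]), same_nodes)  # 2 True
```
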

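From now on I should use the right-click Copy XPath approach as much as possible, get familiar with how the structure is written, and spell out each level of the selection more completely.
I also need to keep practicing BeautifulSoup. This month is still going to be a big challenge!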
