Crawler Practice (Draft)

Posted by weixin_34194087 on 2018-04-27

Jianshu's robots.txt

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-agent: *
Disallow: /search
Disallow: /convos/
Disallow: /notes/
Disallow: /admin/
Disallow: /adm/
Disallow: /p/0826cf4692f9
Disallow: /p/d8b31d20a867
Disallow: /collections/*/recommended_authors
Disallow: /trial/*
Disallow: /keyword_notes
Disallow: /stats-2017/*

User-agent: trendkite-akashic-crawler
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 60

User-agent: YisouSpider
Request-rate: 1/10 # load 1 page per 2 seconds
Crawl-delay: 60

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 1/1 # load 1 page per 2 seconds
Crawl-delay: 10
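
Before crawling, the rules above can be checked programmatically. The following is a minimal sketch, assuming Python's standard `urllib.robotparser` and a hand-copied subset of the rules. The wildcard paths are left out because, as far as I know, this parser only matches plain path prefixes. The helper `allowed` and the user agent name `my-practice-bot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Subset of the Jianshu rules quoted above, copied by hand for an offline check.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /notes/
Disallow: /admin/

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 1/1
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """Hypothetical helper: True if user_agent may fetch url under the rules above."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    bot = "my-practice-bot"  # assumed name; it falls under the "*" group
    print(allowed(bot, "https://www.jianshu.com/c/bd38bd199ec6"))
    print(allowed(bot, "https://www.jianshu.com/search?q=python"))
    print(allowed("Cliqzbot", "https://www.jianshu.com/c/bd38bd199ec6"))
    print(parser.crawl_delay("Googlebot"))
```

For a live check, `set_url` plus `read()` on the same parser class would fetch the site's current robots.txt instead of the copied text.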

import urllib.request
import urllib.parse
import re

url="https://www.jianshu.com/c/bd38bd199ec6"
req=urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36')
response=urllib.request.urlopen(req)
html=response.read().decode("utf-8")
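# print(html)  # uncomment to inspect the raw page

# Each article summary sits in <p class="abstract"> ... </p>, padded by whitespace.
pattern = re.compile(r'<p class="abstract">\s+(.*)\s+</p>')
result = pattern.findall(html)

# for each in result:
#     print(each)

print("the length=============", len(result))

# Guard the index so an empty or single-item result does not raise IndexError.
if len(result) > 1:
    print("----------------", result[1])
    print("*******", len(result[1]))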
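
The abstract pattern can be tried offline before hitting the site. The sketch below assumes a made-up HTML fragment (`SAMPLE_HTML`) and a hypothetical helper `extract_abstracts`; neither comes from the original script. It swaps in a slightly more tolerant variant of the pattern: `\s*` also accepts a one-line paragraph, and the lazy `(.*?)` with `re.S` keeps trailing whitespace out of the capture and allows summaries that span several lines.

```python
import re

# Hypothetical fragment imitating the structure the pattern targets; not real page data.
SAMPLE_HTML = """
<p class="abstract">
      First summary text...
    </p>
<p class="abstract">
      Second summary text...
    </p>
<p class="abstract">One-line summary</p>
"""

# Tolerant variant of the original pattern (an assumption, not the author's regex).
ABSTRACT_RE = re.compile(r'<p class="abstract">\s*(.*?)\s*</p>', re.S)


def extract_abstracts(html):
    """Return the text of every <p class="abstract"> element found in html."""
    return ABSTRACT_RE.findall(html)


if __name__ == "__main__":
    for text in extract_abstracts(SAMPLE_HTML):
        print(repr(text))
```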

[Image: 爬蟲.png]

Modeled on the post "Python Crawler for Beginners (1): Scraping Jokes".

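There is still work to do and a lot to revise, for example downloading the friend-making (交友) articles or scraping images.
I am also not yet comfortable with regular expressions (the re module).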


<a class="nickname" target="_blank" href="/u/1195c9b43c46">
大大懶魚</a>
<span class="time" data-shared-at="2018-04-26T21:15:25+08:00">
</span>
<a class="title" target="_blank" href="/p/a1d691ab1111">
【簡書交友】大大懶魚:愛好服裝搭配的特別能吃麻辣中年少女</a>

I still need to write regular expressions for these fields, and handle pagination and so on.
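
One possible shape for those expressions is sketched below. It assumes the markup stays as quoted above, with the links in their plain `href` form. The names `parse_entry` and `page_url` are hypothetical, and the `page` query parameter used for pagination is a guess that should be checked against the real site.

```python
import re
from urllib.parse import urlencode

# Fragment modeled on the markup quoted above (links in plain href form).
SAMPLE = '''
<a class="nickname" target="_blank" href="/u/1195c9b43c46">
大大懶魚</a>
<span class="time" data-shared-at="2018-04-26T21:15:25+08:00">
</span>
<a class="title" target="_blank" href="/p/a1d691ab1111">
【簡書交友】大大懶魚:愛好服裝搭配的特別能吃麻辣中年少女</a>
'''

# Group 1 is the link target, group 2 the visible text.
NICKNAME_RE = re.compile(r'<a class="nickname"[^>]*href="([^"]+)"[^>]*>\s*(.*?)\s*</a>', re.S)
TITLE_RE = re.compile(r'<a class="title"[^>]*href="([^"]+)"[^>]*>\s*(.*?)\s*</a>', re.S)
# The timestamp lives in an attribute, not in the element text.
TIME_RE = re.compile(r'<span class="time" data-shared-at="([^"]+)"')


def parse_entry(html):
    """Return author, timestamp and title fields from one list entry, or None."""
    nick = NICKNAME_RE.search(html)
    shared = TIME_RE.search(html)
    title = TITLE_RE.search(html)
    if not (nick and shared and title):
        return None
    return {
        "author_url": nick.group(1),
        "author": nick.group(2),
        "shared_at": shared.group(1),
        "article_url": title.group(1),
        "title": title.group(2),
    }


def page_url(base, page):
    """Build the URL of a later page; the `page` query parameter is an assumption."""
    return base + "?" + urlencode({"page": page})


if __name__ == "__main__":
    print(parse_entry(SAMPLE))
    print(page_url("https://www.jianshu.com/c/bd38bd199ec6", 2))
```

For anything beyond a quick exercise, an HTML parser such as the standard `html.parser` module tends to be less brittle than regular expressions when attributes change order.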
