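A Simple Python Crawler Really Is This Simple

Posted by meryin on 2017-12-14

Original URL: https://juejin.im/post/5a31d7166fb9a0451f30f2e0

Notes from a Python beginner. I use urllib2 and regular expressions to scrape the article list on the Jianshu (簡書) home page and store it in sqlite3.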

1. Development tools and libraries

Python: I am currently using a 2.x release.
Editor: I use PyCharm Community.
Libraries: urllib2, re, and sqlite3.

2. Writing the code

import re
import urllib2
import sqlite3

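These lines import the regular expression library, the URL request library, and the sqlite3 database module that ships with Python. If regular expressions are new to you, the official documentation for the re module is a good place to start.

Fetching the page content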

url = "http://www.jianshu.com"
req = urllib2.urlopen(url)
buf = req.read().decode('utf-8')
print buf

That gives us the page content. To make the request look like it comes from a browser rather than a crawler, add a User-Agent header:

url = 'http://www.jianshu.com'
request = urllib2.Request(url)
request.add_header('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36')
req = urllib2.urlopen(request)
buf = req.read().decode('utf-8')
print buf
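
For readers on Python 3, where urllib2 was folded into urllib.request, the same header trick might look like the sketch below. The helper name build_request and the shortened User-Agent string are illustrative assumptions, not part of the original post, and the request is only built here, not sent.

```python
import urllib.request


# Hypothetical helper (not from the original post): build a request that
# carries a browser-like User-Agent header.
def build_request(url, user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)"):
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


request = build_request("http://www.jianshu.com")
# urllib stores header names capitalized, so the key becomes "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen(request) would then send it. In Python 3, read() returns bytes, so the decode('utf-8') step from the original is still needed.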

Some sites expect a POST request, so you also need to send parameters:

# A POST request also needs the urllib library (for urlencode)
import urllib
str1 = 'http://xxxx'
params = {'key1':'value1','key2':'value2'}
data = urllib.urlencode(params)
request = urllib2.Request(str1,data=data)
req = urllib2.urlopen(request)
buf = req.read()
print buf
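
As a rough Python 3 counterpart to the POST snippet above, the sketch below builds the request without sending it. The endpoint http://example.com/api and the field names are placeholders chosen for illustration only.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint and form fields, for illustration only.
url = "http://example.com/api"
params = {"key1": "value1", "key2": "value2"}

# urlencode returns a str; in Python 3 the request body must be bytes.
data = urllib.parse.urlencode(params).encode("utf-8")
request = urllib.request.Request(url, data=data)

# Supplying data switches the request method from GET to POST.
print(request.get_method())
print(data)
```

The only real difference from the Python 2 version is the explicit encode step, because urllib.request refuses a str body.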

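Next, parse the page to get the article list. For each article I only keep the author, time, title, summary, category, read count, comment count, favorite count, and reward count.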

pattern = re.compile('<a.*?blue-link.*?>(.*?)</a>'
                     '.*?<span.*?data-shared-at="(.*?)">'
                     '.*?<a.*?title.*?>(.*?)</a>'
                     '.*?<p.*?>(.*?)</p>'
                     '.*?<a.*?collection-tag.*?>(.*?)</a>'
                     '.*?</i>(.*?)</a>'
                     '.*?</i>(.*?)</a>'
                     '.*?</i>(.*?)</span>', re.S|re.M)
items = re.findall(pattern,buf)

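.*? is a non-greedy match, used to skip over the content you do not need
(.*?) is a capture group, used to extract the content you do need
re.S makes . match every character, including newlines
re.M enables multi-line matching, which affects ^ and $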
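
To see these pieces in isolation, here is a small sketch on a made-up two-line HTML fragment. The fragment is an assumption for illustration; the real Jianshu markup is more involved.

```python
import re

# Made-up HTML fragment for illustration; real Jianshu markup differs.
html = '<a class="blue-link">alice</a>\n<a class="title">Hello</a>'

# Greedy .* runs on to the last possible match; non-greedy .*? stops at
# the first one, so each tag is captured separately.
greedy = re.findall(r'<a.*>(.*)</a>', html, re.S)
lazy = re.findall(r'<a.*?>(.*?)</a>', html, re.S)

# Without re.S, . cannot cross the newline between the two tags,
# so a pattern spanning both lines finds nothing.
two_fields = r'blue-link.*?>(.*?)</a>.*?title.*?>(.*?)</a>'
without_s = re.findall(two_fields, html)
with_s = re.findall(two_fields, html, re.S)

print(greedy, lazy, without_s, with_s)
```

With more than one group, re.findall returns a list of tuples, one tuple per match, which is the shape the database step relies on.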

Finally, save the results to the sqlite3 database

def saveJianShuContent(list):
    coon=sqlite3.connect("jianshu.db")
    c = coon.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS articelList(id INTEGER primary key AUTOINCREMENT,name text,time text,title text,content text
    ,tag text,read text,commond text,love text)''')
    c.executemany('INSERT INTO articelList VALUES(NULL,?,?,?,?,?,?,?,?)',list)
    coon.commit()
    coon.close()
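
To try the insert logic without creating jianshu.db, the sketch below runs the same CREATE TABLE and executemany statements against an in-memory database. The two sample rows and the variable names are made up for illustration; the table and column names are copied from the function above.

```python
import sqlite3

# Made-up rows shaped like re.findall output: one 8-field tuple per article.
rows = [
    ("alice", "2017-12-14", "Hello", "First post", "python", "10", "2", "3"),
    ("bob", "2017-12-15", "World", "Second post", "sqlite", "20", "4", "6"),
]

# ":memory:" keeps the database in RAM, so no file is created.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    """CREATE TABLE IF NOT EXISTS articelList(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT, time TEXT, title TEXT, content TEXT,
        tag TEXT, read TEXT, commond TEXT, love TEXT)"""
)
# NULL lets SQLite assign the id; each ? is bound to one tuple element.
cur.executemany("INSERT INTO articelList VALUES(NULL,?,?,?,?,?,?,?,?)", rows)
conn.commit()

saved = cur.execute("SELECT id, name, title FROM articelList ORDER BY id").fetchall()
print(saved)
conn.close()
```

The number of ? placeholders has to match the number of fields in each tuple, so adding a capture group to the regular expression means adding a column and a placeholder as well.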

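coon=sqlite3.connect("jianshu.db") opens jianshu.db, creating the file automatically if it does not exist
c = coon.cursor() obtains a cursor
c.execute creates the table
id INTEGER primary key AUTOINCREMENT makes id an auto-incrementing primary key
c.executemany('INSERT INTO articelList VALUES(NULL,?,?,?,?,?,?,?,?)',list) inserts the article list rows into the table

3. Complete code: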

import re
import urllib2
import sqlite3

# Fetch the Jianshu home page
def getContent():
    try:
        str1 = 'http://www.jianshu.com'
        request = urllib2.Request(str1)
        request.add_header('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36')
        req = urllib2.urlopen(request)
        buf = req.read().decode('utf-8')
        pattern = re.compile('<a.*?blue-link.*?>(.*?)</a>'
                             '.*?<span.*?data-shared-at="(.*?)">'
                             '.*?<a.*?title.*?>(.*?)</a>'
                             '.*?<p.*?>(.*?)</p>'
                             '.*?<a.*?collection-tag.*?>(.*?)</a>'
                             '.*?</i>(.*?)</a>'
                             '.*?</i>(.*?)</a>'
                             '.*?</i>(.*?)</span>'
                             '.*?<span>.*?</i>(.*?)</span>', re.S|re.M)
        items = re.findall(pattern,buf)
        saveJianShuContent(items)
        # Print each article once (a nested loop over items would repeat every row)
        for item in items:
            print 'name---'+item[0].encode('utf-8'),'time---'+item[1],\
                'title---'+item[2].encode('utf-8'),'content---'+item[3].encode('utf-8'),\
                'tag---'+item[4].encode('utf-8'),'read--'+item[5].encode('utf-8'),\
                'commond--'+item[6].encode('utf-8'),'collect--'+item[7].encode('utf-8'),\
                'money---'+item[8].encode('utf-8')
    # URLError lives in urllib2; HTTPError is a subclass and also has a code
    except urllib2.URLError as e:
        if hasattr(e,"code"):
            print e.code
        if hasattr(e,"reason"):
            print e.reason

# Save to the database
def saveJianShuContent(list):
    coon= sqlite3.connect("jianshu.db")
    c = coon.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS articelList(id INTEGER primary key AUTOINCREMENT,name text,time text,title text,content text
    ,tag text,read text,commond text,love text,moeny text)''')
    c.executemany('INSERT INTO articelList VALUES(NULL,?,?,?,?,?,?,?,?,?)',list)
    coon.commit()
    coon.close()


if __name__ == '__main__':
    getContent()
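
The error handling in getContent checks for code and reason because an HTTPError carries both, while a plain URLError only has a reason. A Python 3 sketch of that pattern follows. The helper names describe_url_error and fetch are hypothetical, and the two errors are constructed by hand so the example can be tried without network access.

```python
import urllib.error
import urllib.request


# Hypothetical helper: collect whichever of code/reason the error carries.
def describe_url_error(error):
    details = {}
    if hasattr(error, "code"):  # only HTTPError has a status code
        details["code"] = error.code
    if hasattr(error, "reason"):
        details["reason"] = error.reason
    return details


def fetch(url):
    """Return the decoded page, or None after printing the failure."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as error:  # HTTPError is a subclass
        print(describe_url_error(error))
        return None


# Errors built by hand so the sketch runs without touching the network.
http_error = urllib.error.HTTPError("http://example.com", 404, "Not Found", None, None)
url_error = urllib.error.URLError("connection refused")
print(describe_url_error(http_error))
print(describe_url_error(url_error))
```

Catching URLError alone is enough to cover both cases, which is why the original uses a single except clause.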

Result of running the script:
