A Beginner's Python Web Scraping Project

Posted by 豬哥66 on 2017-12-25

Python web scraping

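What is Python

Python was created by Guido van Rossum, known to Chinese developers as "Uncle Turtle" (龜叔). He started writing it over the Christmas holiday of 1989 to keep himself occupied.

Guido van Rossum is a devoted fan of the BBC comedy series Monty Python's Flying Circus, and he named the language Python after it.

The popular Chinese slogan 人生苦短，我用python ("Life is short, I use Python") comes from the English phrase "Life is short, you need Python".

In British English, Python is pronounced /ˈpaɪθən/, roughly 拍森 in Chinese. In American English it is /ˈpaɪθɑːn/, roughly 拍賞. The MIT lecturer I watched uses the American pronunciation, but I think most people in China use the British one.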

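That Python ranked first among programming languages in 2017 is hard to dispute. It is, after all, the first-choice language for AI, and with AI and big data booming it has earned that title. C, C++ and Java are the perennial veterans, and in sheer volume of existing code I believe Java far outstrips every other language.

Judging by this year's trends, though, Java is still the most widely used language, since apps, web back ends and cloud computing all depend on it. Its learning curve is steeper than Python's, so for anyone who wants to switch careers into programming and keep up with the trend, Python is the best choice.

Many large websites and organizations use Python. In China: Douban, Sohu, Kingsoft, Tencent, Shanda, NetEase, Baidu, Alibaba, Taobao, 熱酷, Tudou, Sina, Guokr and others. Abroad: Google, NASA, YouTube, Facebook, Industrial Light & Magic, Red Hat and others.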

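Python is being added to the gaokao (college entrance exam)

Zhejiang province has published its information technology curriculum reform plan. Python will be part of the province's information technology gaokao, and from 2018 the programming language in the Zhejiang textbooks changes from VB to Python. Zhejiang is not alone: Beijing and Shandong, both major education regions, have also decided to add Python programming basics to their information technology curriculum and gaokao content, so Python in the classroom is becoming a trend. Shandong's newest sixth-grade primary school information technology textbook even includes Python, which means primary school pupils are already learning it.

Start learning now, or the primary school kids will overtake you.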

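Python tutorials for beginners

Python tutorial - Liao Xuefeng's official website (廖雪峰的官方網站)
Python official website
Python 100 examples | 菜鳥教程 (Runoob)
Python Chinese community
WeChat official account: 裸睡的豬
Python helper project for the WeChat game 跳一跳 (Jump Jump): https://github.com/wangshub/wechat_jump_game
WeChat bots, scraping JD inflatable-doll listings, Douban film reviews and Sina topics, simulated Taobao login and Taobao scraping examples are all on the WeChat official account 裸睡的豬
500 GB of Python beginner material on Baidu Cloud, free: reply "python入門" to the official account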

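What Python can do

Web scraping
Web application development
System and network operations
Scientific and numerical computing
GUI development
Network programming
Natural language processing (NLP)
Artificial intelligence
Blockchain
And much more.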

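Getting started with scraping in Python

This is my first Python project, and I'd like to share it here.

Requirements
- We are building a product that works roughly like this: the user receives an SMS about an event, such as buying a cinema ticket, a train ticket or a flight ticket. The app reads and parses the SMS to extract the time and place, then automatically creates a reminder in the background that alerts the user 1 hour before the event starts.
Design
- At first we put the parsing on the server, but out of concern for user privacy we moved it into the app. The server only collects data and sends new data to the app.
- The server is split into two functions: (1) responding to app requests and returning data; (2) scraping data and storing it in the database.
- Responding to requests is done in Java, and scraping data into the database is done in Python. Each language has its strengths: Java runs faster than Python and suits the web side, while scraping is not performance-critical and Python's syntax and large library ecosystem suit it well.
Code
- This project uses Python 3.
- To get the source code: scan the QR code below to follow the WeChat official account 裸睡的豬 and reply "爬蟲入門".
- To follow this project you only need basic Python and an understanding of its syntax. I hadn't finished learning Python myself when I started writing it; whenever I hit a problem I searched online. Learning by doing keeps things from getting dull, because Python lets you do fun things, such as simulating daily logins to earn points, or the script I have been writing recently to book 模範出行 cars. I recommend Liao Xuefeng's Python tutorial for beginners.
- First, a look at the directory structure. I originally planned to define a thorough, well-organized layout, but since I am a beginner and not familiar with any framework, I gave that up and kept it simple.
- Below we walk through the files one at a time, from top to bottom in directory order. __init__.py is the file that marks a package: a Python package is just a folder, and once that folder contains an __init__.py file it becomes a package. In this one I import the modules that the other .py files call.

__init__.py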

# -*- coding: UTF-8 -*-  

# import the modules used across the project
import MongoUtil  
import FileUtil  
import conf_dev  
import conf_test  
import scratch_airport_name  
import scratch_flight_number  
import scratch_movie_name  
import scratch_train_number  
import scratch_train_station  
import MainUtil

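The next two files are configuration files: the first is for the development environment (Windows) and the second is for the test environment (Linux). The code picks one of them according to the operating system it runs on.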

conf_dev.py

# -*- coding: UTF-8 -*-  
# the configuration file of develop environment  

# path configure  
data_root_path = 'E:/APK98_GNBJ_SMARTSERVER/Proj-gionee-data/smart/data'  

# mongodb configure  
user = "cmc"  
pwd = "123456"  
server = "localhost"  
port = "27017"  
db_name = "smartdb"

conf_test.py

# -*- coding: UTF-8 -*-  
# the configuration file of test environment  

#path configure  
data_root_path = '/data/app/smart/data'  

#mongodb configure  
user = "smart"  
pwd = "123456"  
server = "10.8.0.30"  
port = "27017"  
db_name = "smartdb"

The next file is a utility module. It reads the contents of an existing data file and appends new content to that file.

FileUtil.py

# -*- coding: UTF-8 -*-  
import conf_dev  
import conf_test  
import platform  


# select the configuration for the current environment:
# detect the operating system and load the matching config module
platform_os = platform.system()
config = conf_dev
if platform_os == 'Linux':
    config = conf_test
# path
data_root_path = config.data_root_path


# load old data, skipping the "//" date-marker lines
def read(resources_file_path, encode='utf-8'):
    file_path = data_root_path + resources_file_path
    outputs = []
    with open(file_path, encoding=encode) as f:
        for line in f:
            if not line.startswith("//"):
                # keep only the last comma-separated field of each line
                outputs.append(line.strip('\n').split(',')[-1])
    return outputs


# append newly scraped data to the file
def append(resources_file_path, data, encode='utf-8'):
    file_path = data_root_path + resources_file_path
    with open(file_path, 'a', encoding=encode) as f:
        f.write(data)
    # no explicit close is needed: the with block closes the file
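
To make the file format concrete, here is a minimal sketch of the same read and append logic run against a temporary file instead of the real data directory. The sample lines and the helper names read_items and append_items are made up for illustration; only the "//" date-marker convention and the "keep the last comma-separated field" rule come from FileUtil.py.

```python
import os
import tempfile


def read_items(file_path, encode='utf-8'):
    """Return the last comma-separated field of every non-marker line."""
    outputs = []
    with open(file_path, encoding=encode) as f:
        for line in f:
            if not line.startswith("//"):      # "//..." lines are date markers
                outputs.append(line.strip('\n').split(',')[-1])
    return outputs


def append_items(file_path, data, encode='utf-8'):
    """Append a block of text to the end of the file."""
    with open(file_path, 'a', encoding=encode) as f:
        f.write(data)


# hypothetical sample content: a date marker followed by "city,airport" lines
fd, sample_path = tempfile.mkstemp(suffix='.ini')
os.close(fd)
append_items(sample_path, "//2017-12-01\nBeijing,Capital Airport\nShanghai,Pudong Airport\n")
append_items(sample_path, "//2017-12-25\nChengdu,Shuangliu Airport\n")

items = read_items(sample_path)
os.remove(sample_path)
print(items)
```

The date markers never reach the caller, and for "city,airport" lines only the airport name is compared against freshly scraped data.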

The main function below controls the overall flow; each scraper script calls it.

MainUtil.py

# -*- coding: UTF-8 -*-  

import sys  
from datetime import datetime  
import MongoUtil  
import FileUtil  

# @param resources_file_path  path of the resource file
# @param base_url             URL to scrape
# @param scratch_func         scraping function
def main(resources_file_path, base_url, scratch_func):
    old_data = FileUtil.read(resources_file_path)   # read the existing data
    new_data = scratch_func(base_url, old_data)     # scrape new data
    if new_data:        # there is new data
        # prefix the new data with a "//YYYY-MM-DD" date line
        date_new_data = "//" + datetime.now().strftime('%Y-%m-%d') + "\n" + "\n".join(new_data) + "\n"
        FileUtil.append(resources_file_path, date_new_data)     # append the new data to the file
        MongoUtil.insert(resources_file_path, date_new_data)    # insert the new data into MongoDB
    else:   # nothing new: just log it
        print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'), '----', scratch_func.__name__, ": nothing to update ")
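
The control flow of main can be tried out without files or a database. The sketch below is an illustration under assumptions, not the project's code: run_update, fake_scraper and the two lists standing in for the file and MongoDB are all hypothetical, and the URL is a placeholder that is never requested.

```python
from datetime import datetime


def run_update(old_data, scratch_func, base_url, sinks):
    """Scrape, and if anything is new, hand one dated block to every sink."""
    new_data = scratch_func(base_url, old_data)
    if not new_data:
        return None                      # nothing to update
    block = "//" + datetime.now().strftime('%Y-%m-%d') + "\n" + "\n".join(new_data) + "\n"
    for sink in sinks:
        sink(block)
    return block


# hypothetical scraper: pretends the site lists three trains
def fake_scraper(base_url, old_items):
    scraped = ["G1", "G2", "D301"]
    return [item for item in scraped if item not in old_items]


file_log, db_log = [], []                # stand-ins for the file and MongoDB
sinks = [file_log.append, db_log.append]
first = run_update(["G1"], fake_scraper, "http://example.invalid/", sinks)
second = run_update(["G1", "G2", "D301"], fake_scraper, "http://example.invalid/", sinks)
print(repr(first), second)
```

The first call writes one dated block containing only the two unseen entries; the second call finds nothing new and writes nothing.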

This module inserts the updated content into MongoDB.

MongoUtil.py

# -*- coding: UTF-8 -*-  

import platform  
from pymongo import MongoClient  
from datetime import datetime, timedelta, timezone  
import conf_dev  
import conf_test  

# select the configuration for the current environment
platform_os = platform.system()
config = conf_dev
if platform_os == 'Linux':
    config = conf_test
# mongodb
uri = 'mongodb://' + config.user + ':' + config.pwd + '@' + config.server + ':' + config.port + '/' + config.db_name


# Write data to MongoDB (connects with the module-level uri)
# @author chenmc
# @param path       value stored in the "path" field
# @param data       value stored in the "content" field
# @param operation  value stored in the "operation" field, default 'append'
# @date 2017/12/07 16:30
# Before first use, create the auto-increment counter in MongoDB:
#   db.sequence.insert({ "_id" : "version","seq" : 1})

def insert(path, data, operation='append'):
    client = MongoClient(uri)
    resources = client.smartdb.resources
    sequence = client.smartdb.sequence
    seq = sequence.find_one({"_id": "version"})["seq"]      # read the current counter value
    sequence.update_one({"_id": "version"}, {"$inc": {"seq": 1}})       # increment the counter
    post_data = {"_class": "com.gionee.smart.domain.entity.Resources", "version": seq, "path": path,
                 "content": data, "status": "enable", "operation": operation,
                 "createtime": datetime.now(timezone(timedelta(hours=8)))}
    resources.insert_one(post_data)     # insert the document (Collection.insert was removed in PyMongo 4)
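
One thing to watch in insert: reading the counter with find_one and then incrementing it with update_one are two separate round trips, so two scrapers running at the same moment could read the same version. PyMongo's Collection.find_one_and_update does both in one atomic step and, by default, returns the document as it was before the update. The sketch below is a possible variant rather than the project's code; next_version is a hypothetical helper, and FakeSequence is only a stand-in so the example runs without a MongoDB server.

```python
def next_version(sequence):
    """Atomically fetch the current counter value and increment it.

    `sequence` is expected to behave like a pymongo Collection; with PyMongo,
    find_one_and_update returns the pre-update document by default.
    """
    doc = sequence.find_one_and_update({"_id": "version"}, {"$inc": {"seq": 1}})
    return doc["seq"]


class FakeSequence:
    """In-memory stand-in that mimics the one call used above (illustration only)."""

    def __init__(self, seq):
        self.doc = {"_id": "version", "seq": seq}

    def find_one_and_update(self, query, update):
        before = dict(self.doc)                      # copy of the pre-update document
        self.doc["seq"] += update["$inc"]["seq"]
        return before


counter = FakeSequence(seq=1)
versions = [next_version(counter) for _ in range(3)]
print(versions)
```

With a real connection the helper would be called as next_version(client.smartdb.sequence), replacing the find_one and update_one pair.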

These are the third-party libraries the project uses; install them with pip install -r requirements.txt

requirements.txt

# modules that need to be installed
# (json is part of the Python standard library and needs no installation)
bs4
pymongo
requests

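Now for the scripts that do the real work. These five .py files scrape five kinds of data: airport names, flight numbers, movie titles, train numbers and train stations. They all share roughly the same structure:

Part 1: define the URL to query;
Part 2: fetch the data, compare it with the old data and return the new entries;
Part 3: the main block, which writes the new data to the file and to MongoDB.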
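
The three parts can be sketched as a skeleton like the one below. It is an illustration under assumptions, not one of the five real scripts: the URL, the resource path, the scratch_example name and the injected fetch function are all hypothetical, and the "page" is a hard-coded string so the sketch runs offline.

```python
# Part 1: the URL to query and the resource file (both hypothetical)
resources_file_path = '/resources/example/exampleNameList.ini'
scratch_url = 'http://example.invalid/list'


def fake_fetch(url):
    """Stand-in for requests.get(url).text; returns one name per line."""
    return "alpha\nbeta\ngamma\n"


# Part 2: fetch, compare with the old data, return only the new entries
def scratch_example(scratch_url, old_items, fetch=fake_fetch):
    new_items = []
    for name in fetch(scratch_url).splitlines():
        if name and name not in old_items and name not in new_items:
            new_items.append(name)
    return new_items


# Part 3: the entry point; the real scripts call MainUtil.main(...) here
if __name__ == '__main__':
    print(scratch_example(scratch_url, old_items=['beta']))
```

Passing the fetch function in as a parameter is a choice made only for this sketch, so that the comparison logic can be exercised without touching the network.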

scratch_airport_name.py: scrapes airports nationwide

# -*- coding: UTF-8 -*-  
import requests  
import bs4  
import json  
import MainUtil  

resources_file_path = '/resources/airplane/airportNameList.ini'  
scratch_url_old = 'https://data.variflight.com/profiles/profilesapi/search'  
scratch_url = 'https://data.variflight.com/analytics/codeapi/initialList'  
get_city_url = 'https://data.variflight.com/profiles/Airports/%s'  


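# Takes the URL to query and the old data, checks each scraped airport against the old data,
# skips it if it is already there and adds it otherwise, then returns the new entries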
def scratch_airport_name(scratch_url, old_airports):  
    new_airports = []  
    data = requests.get(scratch_url).text  
    all_airport_json = json.loads(data)['data']  
    for airport_by_word in all_airport_json.values():  
        for airport in airport_by_word:  
            if airport['fn'] not in old_airports:  
                get_city_uri = get_city_url % airport['id']  
                data2 = requests.get(get_city_uri).text  
                soup = bs4.BeautifulSoup(data2, "html.parser")  
                city = soup.find('span', text="城市").next_sibling.text  
                new_airports.append(city + ',' + airport['fn'])  
    return new_airports  

# main block: runs when this file is executed directly, similar to Java's main
if __name__ == '__main__':  
    MainUtil.main(resources_file_path, scratch_url, scratch_airport_name)

scratch_flight_number.py: scrapes flight numbers nationwide

#!/usr/bin/python  
# -*- coding: UTF-8 -*-  

import requests  
import bs4  
import MainUtil  

resources_file_path = '/resources/airplane/flightNameList.ini'  
scratch_url = 'http://www.variflight.com/sitemap.html?AE71649A58c77='  


def scratch_flight_number(scratch_url, old_flights):  
    new_flights = []  
    data = requests.get(scratch_url).text  
    soup = bs4.BeautifulSoup(data, "html.parser")  
    a_flights = soup.find('div', class_='list').find_all('a', recursive=False)  
    for flight in a_flights:  
        if flight.text not in old_flights and flight.text != '國內航段列表':  
            new_flights.append(flight.text)  
    return new_flights  


if __name__ == '__main__':  
    MainUtil.main(resources_file_path, scratch_url, scratch_flight_number)

scratch_movie_name.py: scrapes recently released movies

#!/usr/bin/python  
# -*- coding: UTF-8 -*-  
import re  
import requests  
import bs4  
import json  
import MainUtil  

# relative path; this path is also stored in the database
resources_file_path = '/resources/movie/cinemaNameList.ini'  
scratch_url = 'http://theater.mtime.com/China_Beijing/'  


# scrape data from the given URL
def scratch_latest_movies(scratch_url, old_movies):  
    data = requests.get(scratch_url).text  
    soup = bs4.BeautifulSoup(data, "html.parser")  
    new_movies = []  
    new_movies_json = json.loads(  
        soup.find('script', text=re.compile("var hotplaySvList")).text.split("=")[1].replace(";", ""))  
    coming_movies_data = soup.find_all('li', class_='i_wantmovie')  
    # movies now showing
    for movie in new_movies_json:
        movie_name = movie['Title']
        if movie_name not in old_movies:
            new_movies.append(movie_name)
    # movies coming soon
    for coming_movie in coming_movies_data:  
        coming_movie_name = coming_movie.h3.a.text  
        if coming_movie_name not in old_movies and coming_movie_name not in new_movies:  
            new_movies.append(coming_movie_name)  
    return new_movies  


if __name__ == '__main__':  
    MainUtil.main(resources_file_path, scratch_url, scratch_latest_movies)

scratch_train_number.py: scrapes train numbers nationwide

#!/usr/bin/python  
# -*- coding: UTF-8 -*-  
import requests  
import bs4  
import json  
import MainUtil  

resources_file_path = '/resources/train/trainNameList.ini'  
scratch_url = 'http://www.59178.com/checi/'  


def scratch_train_number(scratch_url, old_trains):  
    new_trains = []  
    resp = requests.get(scratch_url)  
    data = resp.text.encode(resp.encoding).decode('gb2312')  
    soup = bs4.BeautifulSoup(data, "html.parser")  
    a_trains = soup.find('table').find_all('a')  
    for train in a_trains:  
        if train.text not in old_trains and train.text:  
            new_trains.append(train.text)  
    return new_trains  


if __name__ == '__main__':  
    MainUtil.main(resources_file_path, scratch_url, scratch_train_number)
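
The line resp.text.encode(resp.encoding).decode('gb2312') deserves a note. A likely reason for it is that when a server sends a text page without a charset in its Content-Type header, requests falls back to ISO-8859-1, so resp.text comes out garbled for a GB2312 page. Because ISO-8859-1 maps every byte to exactly one character, encoding the garbled text back recovers the original bytes, which can then be decoded correctly. The sketch below reproduces that round trip with plain bytes and no network access; the sample string is an arbitrary choice.

```python
original = "北京西"                       # arbitrary sample text
raw = original.encode('gb2312')           # bytes as the server would send them

garbled = raw.decode('iso-8859-1')        # what a wrong default decoding produces
recovered = garbled.encode('iso-8859-1').decode('gb2312')

print(garbled != original, recovered == original)
```

With requests, setting resp.encoding = 'gb2312' before reading resp.text should give the same result without the round trip.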

scratch_train_station.py: scrapes train stations nationwide

#!/usr/bin/python  
# -*- coding: UTF-8 -*-  
import requests  
import bs4  
import random  
import MainUtil  

resources_file_path = '/resources/train/trainStationNameList.ini'  
scratch_url = 'http://www.smskb.com/train/'  


def scratch_train_station(scratch_url, old_stations):  
    new_stations = []  
    provinces_eng = (  
        "Anhui", "Beijing", "Chongqing", "Fujian", "Gansu", "Guangdong", "Guangxi", "Guizhou", "Hainan", "Hebei",  
        "Heilongjiang", "Henan", "Hubei", "Hunan", "Jiangsu", "Jiangxi", "Jilin", "Liaoning", "Ningxia", "Qinghai",  
        "Shandong", "Shanghai", "Shanxi", "Shanxisheng", "Sichuan", "Tianjin", "Neimenggu", "Xianggang", "Xinjiang",  
        "Xizang",  
        "Yunnan", "Zhejiang")  
    provinces_chi = (  
        "安徽", "北京", "重慶", "福建", "甘肅", "廣東", "廣西", "貴州", "海南", "河北",  
        "黑龍江", "河南", "湖北", "湖南", "江蘇", "江西", "吉林", "遼寧", "寧夏", "青海",  
        "山東", "上海", "陝西", "山西", "四川", "天津", "內蒙古", "香港", "新疆", "西藏",  
        "雲南", "浙江")  
    for i in range(len(provinces_eng)):
        cur_url = scratch_url + provinces_eng[i] + ".htm"  
        resp = requests.get(cur_url)  
        data = resp.text.encode(resp.encoding).decode('gbk')  
        soup = bs4.BeautifulSoup(data, "html.parser")  
        a_stations = soup.find('left').find('table').find_all('a')  
        for station in a_stations:  
            if station.text not in old_stations:  
                new_stations.append(provinces_chi[i] + ',' + station.text)  
    return new_stations  


if __name__ == '__main__':  
    MainUtil.main(resources_file_path, scratch_url, scratch_train_station)

I deployed the project on the test server (CentOS 7) and wrote a crontab to run the scripts on a schedule. Here is the crontab.

/etc/crontab

SHELL=/bin/bash  
PATH=/sbin:/bin:/usr/sbin:/usr/bin  
MAILTO=root  

# For details see man 4 crontabs  

# Example of job definition:  
# .---------------- minute (0 - 59)  
# |  .------------- hour (0 - 23)  
# |  |  .---------- day of month (1 - 31)  
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...  
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat  
# |  |  |  |  |  
# *  *  *  *  * user-name  command to be executed  
  0  0  *  *  * root python3 /data/app/smart/py/scratch_movie_name.py    >> /data/logs/smartpy/out.log 2>&1  
  0  1  *  *  1 root python3 /data/app/smart/py/scratch_train_station.py >> /data/logs/smartpy/out.log 2>&1  
  0  2  *  *  2 root python3 /data/app/smart/py/scratch_train_number.py  >> /data/logs/smartpy/out.log 2>&1  
  0  3  *  *  4 root python3 /data/app/smart/py/scratch_flight_number.py >> /data/logs/smartpy/out.log 2>&1  
  0  4  *  *  5 root python3 /data/app/smart/py/scratch_airport_name.py  >> /data/logs/smartpy/out.log 2>&1

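Follow-up

The project has now been running smoothly for more than three months.

Questions and feedback

If you have any questions while reading or learning, feel free to send me feedback through any of these channels:

WeChat official account: 裸睡的豬
Leave a comment below
Send me a private message

About this official account

I may offer free activation codes for various software later on
I post technical articles on Python, Java and other programming topics, plus interview tips
You are also welcome to send me anything you find interesting
Thank you for following. All future income from this account will be given back to readers through prize draws
If I start a company someday, I will look for partners among this account's followers
Please share this so that more people who want to learn Python can see it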
