Which course does this assignment belong to | <首頁 - 2024資料採集與融合技術實踐 - 福州大學 - 班級部落格 - 部落格園 (cnblogs.com)> |
---|---|
Where are the assignment requirements | <作業3 - 作業 - 2024資料採集與融合技術實踐 - 班級部落格 - 部落格園 (cnblogs.com)> |
Student ID | 102202126 |
一、Assignment Content
Assignment ①
- Requirements: Specify a website and crawl all of the images on that site, for example the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement the crawl in both single-threaded and multi-threaded modes. Be sure to limit the crawl, e.g. the total number of pages (last two digits of the student ID) and the total number of downloaded images (last three digits of the student ID).
- The code is as follows:
work.py

```python
import scrapy
from Practical_work3.items import work1_Item


class Work1Spider(scrapy.Spider):
    name = 'work1'
    # allowed_domains = ['www.weather.com.cn']
    start_urls = ['http://www.weather.com.cn/']

    def parse(self, response):
        data = response.body.decode()
        selector = scrapy.Selector(text=data)
        img_datas = selector.xpath('//a/img/@src')
        for img_data in img_datas:
            item = work1_Item()
            item['img_url'] = img_data.extract()
            yield item
```

items.py

```python
import scrapy


class work1_Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()
```

pipelines.py

```python
import threading
import urllib.request
import os
import pathlib

from itemadapter import ItemAdapter
from Practical_work3.items import work1_Item


class work1_Pipeline:
    count = 0
    desktopDir = str(pathlib.Path.home()).replace('\\', '\\\\') + '\\Desktop'
    threads = []

    def open_spider(self, spider):
        picture_path = self.desktopDir + '\\images'
        if os.path.exists(picture_path):  # if the folder already exists, empty and remove it
            for root, dirs, files in os.walk(picture_path, topdown=False):
                for name in files:
                    os.remove(os.path.join(root, name))  # delete files
                for name in dirs:
                    os.rmdir(os.path.join(root, name))   # delete sub-folders
            os.rmdir(picture_path)                       # delete the folder itself
        os.mkdir(picture_path)                           # create a fresh folder

    # # single-threaded version
    # def process_item(self, item, spider):
    #     url = item['img_url']
    #     print(url)
    #     img_data = urllib.request.urlopen(url=url).read()
    #     img_path = self.desktopDir + '\\images\\' + str(self.count) + '.jpg'
    #     with open(img_path, 'wb') as fp:
    #         fp.write(img_data)
    #     self.count = self.count + 1
    #     return item

    # multi-threaded version
    def process_item(self, item, spider):
        if isinstance(item, work1_Item):
            url = item['img_url']
            print(url)
            T = threading.Thread(target=self.download_img, args=(url,))
            T.daemon = False  # non-daemon so close_spider can join it
            T.start()
            self.threads.append(T)
        return item

    def download_img(self, url):
        img_data = urllib.request.urlopen(url=url).read()
        img_path = self.desktopDir + '\\images\\' + str(self.count) + '.jpg'
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        self.count = self.count + 1

    def close_spider(self, spider):
        for t in self.threads:
            t.join()
```
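The assignment also asks for limits on the total number of pages and downloaded images (last two and last three digits of the student ID, i.e. 26 and 126 here). One way this could be configured, sketched below under the assumption that the project is named Practical_work3, is Scrapy's built-in CloseSpider settings together with registering the pipeline; this is a minimal sketch, not the full submitted configuration:

```python
# settings.py -- a minimal sketch, not the full submitted configuration
BOT_NAME = 'Practical_work3'

ROBOTSTXT_OBEY = False

# Register the image-downloading pipeline shown above.
ITEM_PIPELINES = {
    'Practical_work3.pipelines.work1_Pipeline': 300,
}

# Built-in CloseSpider limits: stop after this many crawled pages / scraped items.
CLOSESPIDER_PAGECOUNT = 26    # last two digits of the student ID
CLOSESPIDER_ITEMCOUNT = 126   # last three digits of the student ID

# Throttle requests so the target site is not hammered.
DOWNLOAD_DELAY = 0.5
```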
- Output:
- Gitee folder link: 陳家凱第三次實踐作業
- Reflections:
- Single-threaded crawling: simple and easy to understand, well suited to beginners. It lets you work through the basic logic of a crawler step by step and control the request rate so the target site is not put under too much pressure.
- Multi-threaded crawling: significantly faster, but you have to watch out for thread-safety issues and still throttle requests to the target site to avoid getting your IP banned; a thread-safety sketch is given below.
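Since the multi-threaded pipeline increments self.count from several worker threads, two threads could in principle read the same counter value and overwrite each other's image file. A minimal thread-safe counter sketch using threading.Lock (my own illustration, not part of the submitted code):

```python
import threading
import urllib.request


class SafeImageDownloader:
    """Hypothetical helper: download images with a lock-protected counter."""

    def __init__(self, save_dir):
        self.save_dir = save_dir
        self.count = 0
        self.lock = threading.Lock()

    def download_img(self, url):
        img_data = urllib.request.urlopen(url=url).read()
        # Reserve a unique index while holding the lock, then write the file
        # outside the critical section so downloads can still overlap.
        with self.lock:
            index = self.count
            self.count += 1
        with open(f'{self.save_dir}/{index}.jpg', 'wb') as fp:
            fp.write(img_data)
```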
Assignment ②
- Requirements: Become proficient with the serialized output of Item and Pipeline data in Scrapy; use the Scrapy framework + XPath + MySQL storage route to crawl stock information.
- The code is as follows:
work.py

```python
from typing import Iterable
import re
import json

import scrapy
from scrapy.http import Request
from Practical_work3.items import work2_Item


class Work2Spider(scrapy.Spider):
    name = 'work2'
    # allowed_domains = ['25.push2.eastmoney.com']
    start_urls = ['http://25.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124021313927342030325_1696658971596&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f2,f3,f4,f5,f6,f7,f12,f14,f15,f16,f17,f18&_=1696658971636']

    def parse(self, response):
        data = response.body.decode()
        # The response is JSONP; pull the "diff" array out with a regex.
        data = re.compile(r'"diff":\[(.*?)\]', re.S).findall(data)
        columns = {'f2': '最新價', 'f3': '漲跌幅(%)', 'f4': '漲跌額', 'f5': '成交量', 'f6': '成交額',
                   'f7': '振幅(%)', 'f12': '程式碼', 'f14': '名稱', 'f15': '最高',
                   'f16': '最低', 'f17': '今開', 'f18': '昨收'}
        for one_data in re.compile(r'\{(.*?)\}', re.S).findall(data[0]):
            data_dic = json.loads('{' + one_data + '}')
            item = work2_Item()  # create a fresh item per stock record
            for k, v in data_dic.items():
                item[k] = v
            yield item
```

items.py

```python
import scrapy


class work2_Item(scrapy.Item):
    f2 = scrapy.Field()   # latest price
    f3 = scrapy.Field()   # change (%)
    f4 = scrapy.Field()   # change amount
    f5 = scrapy.Field()   # trading volume
    f6 = scrapy.Field()   # turnover
    f7 = scrapy.Field()   # amplitude (%)
    f12 = scrapy.Field()  # stock code
    f14 = scrapy.Field()  # stock name
    f15 = scrapy.Field()  # day high
    f16 = scrapy.Field()  # day low
    f17 = scrapy.Field()  # today's open
    f18 = scrapy.Field()  # previous close
```

pipelines.py

```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from Practical_work3.items import work2_Item


class work2_Pipeline:
    def open_spider(self, spider):
        try:
            self.db = pymysql.connect(host='127.0.0.1', user='root', passwd='Cjkmysql.',
                                      port=3306, charset='utf8', database='chenoojkk')
            self.cursor = self.db.cursor()
            self.cursor.execute('DROP TABLE IF EXISTS stock')
            sql = """CREATE TABLE stock(Latest_quotation Double,Chg Double,up_down_amount Double,
                     turnover Double,transaction_volume Double,amplitude Double,
                     id varchar(12) PRIMARY KEY,name varchar(32),highest Double,
                     lowest Double,today Double,yesterday Double)"""
            self.cursor.execute(sql)
        except Exception as e:
            print(e)

    def process_item(self, item, spider):
        if isinstance(item, work2_Item):
            sql = """INSERT INTO stock VALUES (%f,%f,%f,%f,%f,%f,"%s","%s",%f,%f,%f,%f)""" % (
                item['f2'], item['f3'], item['f4'], item['f5'], item['f6'], item['f7'],
                item['f12'], item['f14'], item['f15'], item['f16'], item['f17'], item['f18'])
            self.cursor.execute(sql)
            self.db.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
```
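The pipeline above builds the INSERT statement with %-formatting, which fails if a field is not a float (for example, if the API returns a placeholder such as '-') and requires quoting by hand. A hedged alternative sketch using pymysql's parameterized execute, assuming the same open_spider/close_spider as above (this is not the submitted code):

```python
import pymysql
from Practical_work3.items import work2_Item


class work2_Pipeline_parameterized:
    """Hypothetical variant of work2_Pipeline that uses query parameters."""

    # open_spider / close_spider would be identical to work2_Pipeline above.

    def process_item(self, item, spider):
        if isinstance(item, work2_Item):
            sql = ('INSERT INTO stock VALUES '
                   '(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)')
            # The driver escapes and quotes each value, so building the
            # statement no longer fails when a field is not a float.
            self.cursor.execute(sql, (
                item['f2'], item['f3'], item['f4'], item['f5'], item['f6'], item['f7'],
                item['f12'], item['f14'], item['f15'], item['f16'], item['f17'], item['f18']))
            self.db.commit()
        return item
```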
- Output:
- Gitee folder link: 陳家凱第三次實踐作業
- Reflections:
- While storing the data I learned how to connect to MySQL from Python and how to execute insert and update statements. Sensible database design (table structure and indexes) noticeably improves read/write efficiency.
- During crawling I ran into problems such as failed network requests and unexpected data formats. By adding exception handling and using debugging tools I could locate and fix the issues quickly, which made the crawler more stable; a small error-handling sketch follows.
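As an illustration of the error handling mentioned above, a minimal sketch (my own assumption, not the submitted code) that wraps a single database write in try/except and rolls back on failure, so one bad record does not abort the whole crawl:

```python
import pymysql


def safe_insert(db, cursor, sql, params):
    """Hypothetical helper: run one parameterized INSERT, roll back on failure."""
    try:
        cursor.execute(sql, params)
        db.commit()
        return True
    except pymysql.MySQLError as e:
        db.rollback()                   # discard the failed transaction
        print(f'insert failed: {e}')    # in a spider, prefer spider.logger.warning
        return False
```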
Assignment ③
- Requirements: Become proficient with the serialized output of Item and Pipeline data in Scrapy; use the Scrapy framework + XPath + MySQL storage route to crawl data from a foreign-exchange website.
- The code is as follows (Scrapy framework + XPath + MySQL database):
work.py

```python
import scrapy
from Practical_work3.items import work3_Item


class Work3Spider(scrapy.Spider):
    name = 'work3'
    # allowed_domains = ['www.boc.cn']
    start_urls = ['https://www.boc.cn/sourcedb/whpj/']

    def parse(self, response):
        data = response.body.decode()
        selector = scrapy.Selector(text=data)
        data_lists = selector.xpath('//table[@align="left"]/tr')
        for data_list in data_lists:
            datas = data_list.xpath('.//td')
            if datas != []:  # skip rows that contain no <td> cells
                item = work3_Item()
                keys = ['name', 'price1', 'price2', 'price3', 'price4', 'price5', 'date']
                str_lists = datas.extract()
                for i in range(len(str_lists) - 1):
                    item[keys[i]] = str_lists[i].strip('<td class="pjrq"></td>').strip()
                yield item
```

items.py

```python
import scrapy


class work3_Item(scrapy.Item):
    name = scrapy.Field()    # currency name
    price1 = scrapy.Field()  # quoted rate columns from the table
    price2 = scrapy.Field()
    price3 = scrapy.Field()
    price4 = scrapy.Field()
    price5 = scrapy.Field()
    date = scrapy.Field()    # publication time
```

pipelines.py

```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from Practical_work3.items import work3_Item


class work3_Pipeline:
    def open_spider(self, spider):
        try:
            self.db = pymysql.connect(host='127.0.0.1', user='root', passwd='Cjkmysql.',
                                      port=3306, charset='utf8', database='chenoojkk')
            self.cursor = self.db.cursor()
            self.cursor.execute('DROP TABLE IF EXISTS bank')
            sql = """CREATE TABLE bank(Currency varchar(32),p1 varchar(17),p2 varchar(17),
                     p3 varchar(17),p4 varchar(17),p5 varchar(17),Time varchar(32))"""
            self.cursor.execute(sql)
        except Exception as e:
            print(e)

    def process_item(self, item, spider):
        if isinstance(item, work3_Item):
            sql = 'INSERT INTO bank VALUES ("%s","%s","%s","%s","%s","%s","%s")' % (
                item['name'], item['price1'], item['price2'],
                item['price3'], item['price4'], item['price5'], item['date'])
            self.cursor.execute(sql)
            self.db.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
```
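The parse method above removes the <td> markup with str.strip(), which treats its argument as a set of characters to trim from both ends rather than as a tag, so it is fairly fragile. A hedged alternative sketch that lets XPath return the cell text directly (same row structure and item keys assumed; not the submitted code):

```python
import scrapy


def parse_row(row: scrapy.Selector) -> dict:
    """Hypothetical helper: map one <tr> of the rate table to a dict of strings."""
    # string(.) returns the concatenated text content of each cell,
    # so no HTML tags need to be stripped by hand.
    cells = [td.xpath('string(.)').get(default='').strip()
             for td in row.xpath('.//td')]
    keys = ['name', 'price1', 'price2', 'price3', 'price4', 'price5', 'date']
    return dict(zip(keys, cells))
```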
- Output:
- Gitee folder link: 陳家凱第三次實踐作業
- Reflections:
- Precise data extraction: XPath is a very flexible selector language and can pull the required information out of an HTML document efficiently. When crawling the foreign-exchange site, I wrote XPath expressions to capture the exchange rates, changes and other key fields accurately.
- Connecting and storing data: I used Python's MySQL library to open a database connection and execute SQL statements. A well-chosen table structure and indexes improve data-access speed, especially when handling large volumes of data; a small indexing sketch follows below.
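To make the point about indexes concrete, a small sketch (an assumption of mine, not part of the submitted pipeline) that adds an index on the Currency column of the bank table after it has been created:

```python
import pymysql

# Connection parameters mirror the pipelines above; adjust them to your environment.
db = pymysql.connect(host='127.0.0.1', user='root', passwd='Cjkmysql.',
                     port=3306, charset='utf8', database='chenoojkk')
cursor = db.cursor()

# An index on Currency speeds up lookups such as
#   SELECT * FROM bank WHERE Currency = '美元';
cursor.execute('CREATE INDEX idx_bank_currency ON bank (Currency)')
db.commit()

cursor.close()
db.close()
```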