Scraping Fund Data with Python

Posted by 右介 on 2018-02-28

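This script scrapes funded-project records from the Science Fund Shared Service Network (科學基金共享服務網, npd.nsfc.gov.cn) and stores them in MongoDB.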

# -*- coding: utf-8 -*-
# Python 3 port of the original Python 2 script.
import html

import requests
from lxml import etree
from pymongo import MongoClient

# Form fields of the search request; only currentPage changes between requests.
data = {'pageSize':10,'currentPage':1,'fundingProject.projectNo':'','fundingProject.name':'','fundingProject.person':'','fundingProject.org':'',
'fundingProject.applyCode':'','fundingProject.grantCode':'','fundingProject.subGrantCode':'','fundingProject.helpGrantCode':'','fundingProject.keyword':'',
'fundingProject.statYear':'','checkCode':'%E8%AF%B7%E8%BE%93%E5%85%A5%E9%AA%8C%E8%AF%81%E7%A0%81'}
url = 'http://npd.nsfc.gov.cn/fundingProjectSearchAction!search.action'
# Request headers copied from a browser session. Content-Length is left out
# because requests computes it from the body. The JSESSIONID cookie belongs to
# the author's session and has to be replaced with a current one.
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate',
'Accept-Language':'zh-CN,zh;q=0.9',
'Cache-Control':'max-age=0',
'Connection':'keep-alive',
'Content-Type':'application/x-www-form-urlencoded',
'Cookie':'JSESSIONID=8BD27CE37366ED8022B42BFC68FF82D4',
'Host':'npd.nsfc.gov.cn',
'Origin':'http://npd.nsfc.gov.cn',
'Referer':'http://npd.nsfc.gov.cn/fundingProjectSearchAction!search.action',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

def main():
    # The credentials were blanked out in the original post, and
    # db.authenticate() no longer exists in PyMongo 4. If the server requires
    # authentication, pass username= and password= to MongoClient instead.
    client = MongoClient('localhost', 27017)
    db = client.ScienceFund
    collection = db.science_fund
    # Walk result pages 1 through 43183, ten records per page.
    for i in range(1, 43184):
        print(i)
        data['currentPage'] = i
        result = requests.post(url, data=data, headers=headers)
        tree = etree.HTML(result.text)
        # Each search result is rendered as a <dl class="time_dl"> block.
        for item in tree.xpath("//dl[@class='time_dl']"):
            # Serialize the block back to HTML text and decode any entities.
            content = etree.tostring(item, method='html', encoding='unicode')
            content = html.unescape(content)
            record = jiexi(content)
            # insert_one() replaces the removed Collection.insert().
            collection.insert_one(record)
        
def jiexi(content):
    """Parse one <dl class="time_dl"> block into a dict (jiexi = "parse").

    The labels must match the page text exactly. The site is in Simplified
    Chinese, so the Simplified forms are used here. Each offset skips the
    label plus one following character (presumably a colon).
    """
    # Title: text between the first '">' after position 20 and the first '</'
    title1 = content.find('">', 20)
    title2 = content.find('</')
    title = content[title1+2:title2]
    # Approval number
    standard_no1 = content.find(u'批准号', title2)
    standard_no2 = content.find('</dd>', standard_no1)
    standard_no = content[standard_no1+4:standard_no2].strip()
    # Project category
    standard_type1 = content.find(u'项目类别', standard_no2)
    standard_type2 = content.find('</dd>', standard_type1)
    standard_type = content[standard_type1+5:standard_type2].strip()
    # Supporting institution
    supporting_institution1 = content.find(u'依托单位', standard_type2)
    supporting_institution2 = content.find('</dd>', supporting_institution1)
    supporting_institution = content[supporting_institution1+5:supporting_institution2].strip()
    # Project principal
    project_principal1 = content.find(u'项目负责人', supporting_institution2)
    project_principal2 = content.find('</dd>', project_principal1)
    project_principal = content[project_principal1+6:project_principal2].strip()
    # Funding amount
    funds1 = content.find(u'资助经费', project_principal2)
    funds2 = content.find('</dd>', funds1)
    funds = content[funds1+5:funds2].strip()
    # Approval year
    year1 = content.find(u'批准年度', funds2)
    year2 = content.find('</dd>', year1)
    year = content[year1+5:year2].strip()
    # Keywords
    keywords1 = content.find(u'关键词', year2)
    keywords2 = content.find('</dd>', keywords1)
    keywords = content[keywords1+4:keywords2].strip()
    # Collect the fields into the document that main() stores in MongoDB.
    dc = {}
    dc['title'] = title
    dc['standard_no'] = standard_no
    dc['standard_type'] = standard_type
    dc['supporting_institution'] = supporting_institution
    dc['project_principal'] = project_principal
    dc['funds'] = funds
    dc['year'] = year
    dc['keywords'] = keywords
    return dc

if __name__ == '__main__':
    main()
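
The fixed offsets in `jiexi` (`+4`, `+5`, `+6`) assume that every label is present and in the expected order. When a label is missing, `str.find` returns -1 and the slice silently yields the wrong text. Below is a minimal sketch of a more defensive lookup. It assumes each field appears as a label, an optional colon, and then the value up to the next `</dd>`, which is what the offsets in the script imply. The name `extract_field` and the sample markup are hypothetical and are not taken from the site.

```python
# -*- coding: utf-8 -*-
import re


def extract_field(content, label):
    """Return the text between `label` (plus an optional colon) and the next </dd>.

    Returns None when the label is missing, instead of slicing from index -1.
    """
    # Accept an ASCII or full-width colon after the label, then capture lazily
    # up to the closing </dd>. re.S lets the value span several lines.
    pattern = re.escape(label) + u'[:：]?(.*?)</dd>'
    match = re.search(pattern, content, re.S)
    if match is None:
        return None
    return match.group(1).strip()


# Hypothetical sample block, shaped the way the offsets in jiexi() imply.
sample = (u'<dl class="time_dl"><dt><a href="#">Demo project</a></dt>'
          u'<dd>批准年度： 2017 </dd></dl>')

print(extract_field(sample, u'批准年度'))        # the value, stripped
print(extract_field(u'<dl></dl>', u'批准年度'))  # None, because the label is absent
```

With a helper like this, a missing field shows up as `None` in the stored record rather than as a fragment of unrelated markup, so bad pages are easier to spot afterwards.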
