Implementing a crawler with ajax + PHP

Posted by alansleep on 2013-02-18

Original URL: http://www.ituring.com.cn/article/30952

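This article combines basic PHP and ajax knowledge to do something cool: a crawler and a proxy. As an exercise, it only fetches an image from someone else's page and shows it on our own page, which is already pretty cool.
1. Cross-domain security restriction:
JavaScript in the browser cannot access files under a different domain, unless the target site is yours and you can modify the page you want to access. But no website refuses ordinary client requests, and a request made by server-side code is not subject to this browser restriction. So we first implement a proxy page on our own site, much like the proxy tools that used to be common for getting around firewalls.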
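
To make the "different domain" rule concrete, here is a minimal sketch of how two addresses can be compared. The helper name `sameOrigin` is made up for illustration and is not part of the original code; it only shows the comparison (scheme, host, and port must all match) and does not reproduce what the browser does internally.

```javascript
// Illustrative helper (not from the original article): two URLs share an
// origin only when scheme, host, and port all match.
function sameOrigin(a, b) {
    return new URL(a).origin === new URL(b).origin;
}

// Our page and our proxy live on the same site.
console.log(sameOrigin('http://epics.cn/lofter.html', 'http://epics.cn/dl.php')); // true

// Our page and the remote site do not, so js cannot read it directly.
console.log(sameOrigin('http://epics.cn/lofter.html', 'http://www.lofter.com/')); // false
```

This is why the page below only ever talks to dl.php on its own site and leaves the remote fetch to PHP.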
2. PHP proxy:

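<?php
// dl.php: fetch a remote page on the server and echo one image tag from it
require_once("simple_html_dom.php");
$user_agent = "Mozilla/4.0"; // browser identity sent with the request
$url = '';

// The target address arrives base64-encoded in the "p" query parameter
if (!empty($_GET['p'])) {
    $url = base64_decode($_GET['p']);
}

if ($url)
{
    // Fetch the remote page with curl
    $session = curl_init();
    curl_setopt($session, CURLOPT_URL, $url);
    curl_setopt($session, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($session, CURLOPT_HEADER, false);
    curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
    $lofter = curl_exec($session);
    curl_close($session);

    // Parse the page and read the content of the textarea named "js"
    $html = str_get_html($lofter);
    $url = $html->find('textarea[name=js]', 0)->innertext;
    $html->clear();

    // Keep only the text between the first pair of single quotes
    $ibpos = stripos($url, '\'');
    $iepos = stripos($url, '\'', $ibpos + 1);
    $url = substr($url, $ibpos + 1, $iepos - $ibpos - 1);
    echo '<img src="' . htmlspecialchars($url) . '"/>';
}
?>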

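The require_once line pulls in a very handy DOM utility class that makes it easy to read and modify the information in each node. It is called simplehtmldom, and it is simple and convenient. You can think of it as similar to tinyxml, which is often used in c/cpp.
The block that reads $_GET['p'] takes the site address passed in through the GET method, which has been base64-encoded, and decodes it.
The curl_* calls fetch the page.
The str_get_html and find calls use simplehtmldom to get the content of one node.
The string operations that follow trim it down to the content we want, here one of the image addresses. In the same way you can extract any data you want from the page, store it in a database, and so on.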
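
The trimming step is easy to try on its own. Below is a small sketch of the same idea in JavaScript: take whatever sits between the first pair of single quotes. The function name `extractFirstQuoted` and the sample string are invented for illustration; the real content of the textarea on the target page is not shown in this article.

```javascript
// Hypothetical helper mirroring the stripos/substr steps in dl.php:
// return the text between the first two single quotes, or null if
// there is no complete pair.
function extractFirstQuoted(text) {
    const begin = text.indexOf("'");
    if (begin === -1) return null;
    const end = text.indexOf("'", begin + 1);
    if (end === -1) return null;
    return text.slice(begin + 1, end);
}

// Made-up sample input, standing in for the textarea content.
const sample = "var bg = 'http://example.com/a.jpg';";
console.log(extractFirstQuoted(sample)); // http://example.com/a.jpg
```

Unlike the PHP version, this sketch returns null when a quote is missing, which is one way to avoid printing a broken image tag.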

OK, first visit this page with a correct parameter, for example: http://epics.cn/dl.php?p=aHR0cDovL3d3dy5sb2Z0ZXIuY29tLw== . Looks pretty good.
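
The p value above is just the target address in Base64. As a sketch, it could be built like this; `buildProxyUrl` is a made-up name, and the extra encodeURIComponent step is my own addition rather than something the original does. It is there because Base64 output can contain "+", which PHP would otherwise read as a space in a query string.

```javascript
// Use the browser's btoa when available; fall back to Buffer under Node.
const toBase64 = (typeof btoa === 'function')
    ? btoa
    : function (s) { return Buffer.from(s, 'binary').toString('base64'); };

// Hypothetical helper: build the dl.php address for a given target page.
function buildProxyUrl(target) {
    return 'dl.php?p=' + encodeURIComponent(toBase64(target));
}

console.log(buildProxyUrl('http://www.lofter.com/'));
// dl.php?p=aHR0cDovL3d3dy5sb2Z0ZXIuY29tLw%3D%3D
```

PHP decodes the %3D back to "=" before the script sees $_GET['p'], so dl.php receives the same Base64 string as in the example address.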

3. Our own page, lofter.html

<html>
<head>
     <title>Lofter</title>
     <script src="lofter.js"></script>
</head>
<body>
     <p>
        <a id="makeImgRequest" href="#">Get</a>   
     </p>
     <div id="ImgArea"> </div>
</body>
</html>

This page has just one link and an area ready to show the fetched image. The important work is done in the js.

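4. ajax:
ajax has many benefits. The most visible is probably that it makes a web program feel more like a local application, because the page responds quickly without reloading. The principle is simple too, and a quick search will explain it.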
lofter.js

window.onload = initAll; 
var xhr = false;

function initAll() {
     document.getElementById("makeImgRequest").onclick = getNewFile;
}

function getNewFile() {
     makeRequest();
     return false;
}

function makeRequest() {
     if (window.XMLHttpRequest) {
        xhr = new XMLHttpRequest();
     }
     else {
        if (window.ActiveXObject) {
           try {
              xhr = new ActiveXObject("Microsoft.XMLHTTP");
           }
           catch (e) { }
        }
     }

     if (xhr) {
        xhr.onreadystatechange = showContents;
        xhr.open("GET", 'dl.php?p=aHR0cDovL3d3dy5sb2Z0ZXIuY29tLw==', true);
        xhr.send(null);
     }
     else {
        document.getElementById("ImgArea").innerHTML = "Sorry, but I couldn't create an XMLHttpRequest";
     }
}

function showContents() {
    var outMsg = "";
    if (xhr.readyState == 4) {
        if (xhr.status == 200) {
            // getElementById returns a single node or null; fall back to the raw text
            var bg = xhr.responseXML ? xhr.responseXML.getElementById("bg") : null;
            if (bg) {
                outMsg = getText(bg);
            }
            else {
                outMsg = xhr.responseText;
            }
        }
        else {
            outMsg = "There was a problem with the request " + xhr.status;
        }
        document.getElementById("ImgArea").innerHTML = outMsg;
    }
}

function getText(inVal) {
    if (inVal.textContent) {
        return inVal.textContent;
    }
    return inVal.text;
}

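The code is all boilerplate. The flow is to fetch, asynchronously, a file under the same domain (remember the earlier point that js normally cannot access files on a different domain?). The proxy fetches the remote page, parses and filters it down to one image address, and the result is then written into the ImgArea node of the page.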
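
The same flow can be tried without a browser or a server. The sketch below replaces XMLHttpRequest with a stand-in function so that only the decision logic remains. `loadThroughProxy`, `fakeTransport`, and the example.com image address are all invented for illustration.

```javascript
// `transport` is a hypothetical stand-in for XMLHttpRequest: it takes a URL
// and a callback, and calls back with (status, body).
function loadThroughProxy(transport, proxyUrl, render) {
    transport(proxyUrl, function (status, body) {
        if (status === 200) {
            render(body); // dl.php already returns a ready-made <img> tag
        } else {
            render('There was a problem with the request ' + status);
        }
    });
}

// Fake transport that answers the way dl.php would after a successful fetch.
function fakeTransport(url, callback) {
    callback(200, '<img src="http://example.com/a.jpg"/>');
}

let shown = '';
loadThroughProxy(fakeTransport, 'dl.php?p=aHR0cDovL3d3dy5sb2Z0ZXIuY29tLw==', function (html) {
    shown = html; // in the real page this would be ImgArea.innerHTML
});
console.log(shown);
```

Keeping the transport separate from the rendering is only a convenience for experimenting; the lofter.js listing does both inside its XMLHttpRequest callback.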

You can see the result at http://epics.cn/lofter.html. Over~
