Linux Enterprise Project Practice: Web Crawler (4) - Main Program Flow

Posted by 尹成 on 2014-08-28

Once the program framework has been designed, it is time to implement it. The first step is, of course, to implement the skeleton of the main program flow; after that we gradually fill in the details of each stage and the modules each stage needs to call.

 

The main program flow is as follows:

1. Parse the command-line arguments and branch to the corresponding handler

2. Parse the configuration file

3. Load the processing modules

4. Load the seed URLs

5. Start the crawl tasks

 

The code of the main program, which implements the five steps above, is as follows:

int main(int argc, char *argv[])
{
   struct epoll_event events[10];
   int daemonized = 0;
   int ch;  /* getopt() returns int; declaring this as char can break the comparison with -1 */
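   /* Step 1: parse command-line options: -v prints version info, -d runs the crawler as a daemon, -h/-? prints usage */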
 
   while ((ch = getopt(argc, argv, "vhd")) != -1) {
        switch(ch) {
            case 'v':
                version();
                break;
            case 'd':
                daemonized = 1;
                break;
            case 'h':
            case '?':
            default:
                usage();
        }
   }
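   /* Step 2: create the global config object, parse the configuration file and raise the open-file limit */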
 
   g_conf = initconfig();
   loadconfig(g_conf);
 
   set_nofile(1024);
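   /* Step 3: load every processing module listed in the configuration */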
 
   vector<char *>::iterator it = g_conf->modules.begin();
   for(; it != g_conf->modules.end(); it++) {
        dso_load(g_conf->module_path, *it);
   }
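   /* Step 4: split the comma-separated seed list, normalize each URL and push it onto the Surl queue */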
 
   if (g_conf->seeds == NULL) {
        SPIDER_LOG(SPIDER_LEVEL_ERROR, "We have no seeds, Buddy!");
   } else {
        int c = 0;
        char **splits = strsplit(g_conf->seeds, ',', &c, 0);
        while (c--) {
            Surl *surl = (Surl *)malloc(sizeof(Surl));
            surl->url = url_normalized(strdup(splits[c]));
            surl->level = 0;
            surl->type = TYPE_HTML;
            if (surl->url != NULL)
                push_surlqueue(surl);
        }
   }
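   /* Step 5: start crawling: daemonize if requested, switch to the download directory and spawn the urlparser thread */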
 
   if (daemonized)
        daemonize();
 
   chdir("download");
 
   int err = -1;
   if ((err = create_thread(urlparser, NULL, NULL, NULL)) < 0) {
        SPIDER_LOG(SPIDER_LEVEL_ERROR, "Create urlparser thread fail: %s", strerror(err));
   }
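   /* Wait with exponential backoff for the urlparser thread to produce the first crawlable URLs */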
 
   int try_num = 1;
   while(try_num < 8 && is_ourlqueue_empty())
        usleep((10000 << try_num++));
 
   if (try_num >= 8) {
        SPIDER_LOG(SPIDER_LEVEL_ERROR, "No ourl! DNS parse error?");
   }
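   /* If a statistics interval is configured, report stats periodically via SIGALRM */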
 
   if (g_conf->stat_interval > 0) {
        signal(SIGALRM, stat);
        set_ticker(g_conf->stat_interval);
   }
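   /* Create the epoll instance and attach up to max_job_num fetch tasks */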
 
   int ourl_num = 0;
   g_epfd = epoll_create(g_conf->max_job_num);
 
   while(ourl_num++ < g_conf->max_job_num) {
        if (attach_epoll_task() < 0)
            break;
   }
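   /* Event loop: wait for readable sockets and hand each one to a recv_response worker thread;
    * exit once no worker threads remain and both URL queues are empty */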
 
   int n, i;
   while(1) {
         n = epoll_wait(g_epfd, events, 10, 2000);
         printf("epoll:%d\n", n);
         if (n == -1)
             printf("epoll errno:%s\n", strerror(errno));
        fflush(stdout);
 
        if (n <= 0) {
             if (g_cur_thread_num <= 0 && is_ourlqueue_empty() && is_surlqueue_empty()) {
                 sleep(1);
                 if (g_cur_thread_num <= 0 && is_ourlqueue_empty() && is_surlqueue_empty())
                    break;
            }
        }
 
        for (i = 0; i < n; i++) {
             evso_arg *arg = (evso_arg *)(events[i].data.ptr);
             if ((events[i].events & EPOLLERR) ||
                 (events[i].events & EPOLLHUP) ||
                 (!(events[i].events & EPOLLIN))) {
                 SPIDER_LOG(SPIDER_LEVEL_WARN, "epoll fail, close socket %d", arg->fd);
                 close(arg->fd);
                 continue;
             }
             epoll_ctl(g_epfd, EPOLL_CTL_DEL, arg->fd, &events[i]); /* del event */
 
             printf("hello epoll: event=%d\n", events[i].events);
             fflush(stdout);
             create_thread(recv_response, arg, NULL, NULL);
        }
   }
 
   SPIDER_LOG(SPIDER_LEVEL_DEBUG, "Task done!");
   close(g_epfd);
   return 0;
}
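A note on building this listing: it is only a fragment of the project's main source file, so helpers such as initconfig, loadconfig, dso_load, strsplit, url_normalized, push_surlqueue, create_thread, attach_epoll_task, recv_response and the SPIDER_LOG macro are defined in other modules of the crawler. The sketch below shows roughly which headers the fragment depends on; the project header name spider.h is an assumption for illustration, not necessarily the project's actual file name.

/* Minimal header sketch for the fragment above; "spider.h" is an assumed
 * project header that would declare g_conf, g_epfd, Surl, evso_arg,
 * SPIDER_LOG and the other crawler helpers. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <vector>       /* g_conf->modules is used as a std::vector, so the file is compiled as C++ */
#include "spider.h"

using std::vector;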

