Notes on Trying Out the Nutch 0.7.2 Search Engine

Posted by gudesheng on 2008-01-03


I searched Google for material on Nutch and found surprisingly little. I spent two hours today getting it running, so here are my notes.

My JDK is 1.5.x and my Tomcat is 5.0.x.

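1 Download the 0.7.2 release package. It already contains the war file, so there is no need to build it with Ant.

2 Install Cygwin. Nothing much to say about this one.

3 Extract Nutch to D:/nutch

4 Under D:/nutch, create a file named urls (no extension) containing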

http://www.xxx.edu.cn/

5 In D:/nutch/conf/crawl-urlfilter.txt, add

+^http://www.xxx.edu.cn/
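To get a feel for what this rule lets through, here is a rough sketch that mimics the filter with grep. It assumes, as far as I understand Nutch's regex URL filter, that rules are tried from top to bottom, the first pattern that matches decides, `+` keeps the URL, `-` drops it, and a URL that matches nothing is dropped. The function name `url_allowed` and the sample URLs are made up for illustration, and grep -E is only an approximation of the Java regular expressions Nutch really uses.

```shell
#!/bin/sh
# url_allowed RULES_FILE URL
# Approximates a crawl-urlfilter.txt check: the first rule whose pattern
# matches decides; '+' accepts, '-' rejects, no match at all rejects.
url_allowed() {
  rules_file=$1
  url=$2
  while IFS= read -r rule; do
    case $rule in
      ''|'#'*) continue ;;            # skip blank lines and comments
    esac
    sign=${rule%"${rule#?}"}          # first character: + or -
    pattern=${rule#?}                 # the rest is the regular expression
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      [ "$sign" = "+" ]               # status of this test is the answer
      return
    fi
  done < "$rules_file"
  return 1
}

# Demo with the rule from step 5 followed by a catch-all reject.
rules=$(mktemp)
printf '%s\n' '+^http://www.xxx.edu.cn/' '-.' > "$rules"
for u in http://www.xxx.edu.cn/news/index.html http://www.example.com/; do
  if url_allowed "$rules" "$u"; then echo "keep $u"; else echo "skip $u"; fi
done
rm -f "$rules"
```

The first URL is kept and the second is skipped, which is the behaviour you want when the crawl should stay on one site.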

6 Add an environment variable

NUTCH_JAVA_HOME = D:/jdk1.5.0_06
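If you would rather set the variable only for the current Cygwin session, something like the sketch below should do. The path is the one from this step; `check_java_home` is a helper name I made up, and all it does is check that a java executable exists under bin, so treat it as a sanity check rather than anything Nutch itself requires.

```shell
#!/bin/sh
# check_java_home DIR: succeed only if DIR looks like a JDK/JRE root,
# i.e. it contains an executable bin/java (or bin/java.exe on Windows).
check_java_home() {
  [ -n "$1" ] && { [ -x "$1/bin/java" ] || [ -x "$1/bin/java.exe" ]; }
}

# Value from step 6, unless the variable is already set in the environment.
NUTCH_JAVA_HOME=${NUTCH_JAVA_HOME:-D:/jdk1.5.0_06}
export NUTCH_JAVA_HOME

if check_java_home "$NUTCH_JAVA_HOME"; then
  echo "using Java from $NUTCH_JAVA_HOME"
else
  echo "no java executable under $NUTCH_JAVA_HOME/bin" >&2
fi
```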

7 Open a Cygwin window,

cd D:/nutch

and, from D:/nutch, run

bin/nutch crawl urls -dir crawled -depth 3 >& crawl.log

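This command creates a folder named crawled under the current directory and then starts crawling the www.xxx.edu.cn site.

The depth here is 3 levels; around 10 is generally better. The output goes to crawl.log.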
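Since the depth is the knob you will most likely change, the command can be wrapped in a small function. This is only a sketch: `run_crawl` and the `NUTCH_BIN` override are names I invented so the function can be dry-run without Nutch, and the "already exists" check reflects my understanding that the crawl tool wants a fresh output directory.

```shell
#!/bin/sh
# run_crawl SEEDS OUT_DIR [DEPTH] [LOG]
# NUTCH_BIN is a hypothetical override (defaults to bin/nutch).
run_crawl() {
  seeds=$1
  out_dir=$2
  depth=${3:-3}                 # same default depth as the command above
  log=${4:-crawl.log}
  if [ -e "$out_dir" ]; then
    echo "refusing to crawl: $out_dir already exists" >&2
    return 1
  fi
  # '> file 2>&1' is the portable spelling of bash's '>& file'.
  "${NUTCH_BIN:-bin/nutch}" crawl "$seeds" -dir "$out_dir" -depth "$depth" > "$log" 2>&1
}

# Real use, from D:/nutch:
#   run_crawl urls crawled 10
# Dry run that only records the command line in the log:
#   NUTCH_BIN=echo run_crawl urls crawled 10 dry.log
```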

8 Copy nutch-0.7.2.war into tomcat/webapps and rename it to nutch (that is, nutch.war)

9 Create nutch.xml under D:/tomcat/conf/Catalina/localhost/

10 Start Tomcat, wait for the war to be unpacked, then open

D:/tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml

and edit it as follows



  <property>
    <name>searcher.dir</name>
    <value>D:/nutch/crawled/</value>
  </property>
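Before restarting Tomcat it is worth checking that the directory really is a crawl directory. The recrawl script in step 13 works on db, segments and index underneath it, so I assume those are the three subdirectories the search side needs too. `check_crawl_dir` is a made-up helper name.

```shell
#!/bin/sh
# check_crawl_dir DIR: report which of the expected subdirectories are
# missing; succeeds only when db, segments and index are all present.
check_crawl_dir() {
  missing=0
  for sub in db segments index; do
    if [ ! -d "$1/$sub" ]; then
      echo "missing: $1/$sub" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_crawl_dir D:/nutch/crawled && echo "looks like a crawl directory"
```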

11 Under D:/tomcat/webapps/nutch/zh/include, create header.jsp. Its content is a copy of header.html, but

with the following added at the top

<%@ page 
  contentType="text/html; charset=UTF-8"
  pageEncoding="UTF-8"
%>
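Creating header.jsp by hand is easy enough, but the same thing can be scripted. A minimal sketch, assuming header.html already exists in the include directory; `make_header_jsp` is a name I made up.

```shell
#!/bin/sh
# make_header_jsp INCLUDE_DIR
# Writes INCLUDE_DIR/header.jsp as the UTF-8 page directive followed by
# the unchanged contents of INCLUDE_DIR/header.html.
make_header_jsp() {
  include_dir=$1
  {
    printf '%s\n' '<%@ page' \
                  '  contentType="text/html; charset=UTF-8"' \
                  '  pageEncoding="UTF-8"' \
                  '%>'
    cat "$include_dir/header.html"
  } > "$include_dir/header.jsp"
}

# Example:
#   make_header_jsp D:/tomcat/webapps/nutch/zh/include
```

The point of the directive is that a JSP page can declare its own encoding, whereas the plain HTML fragment gives the container nothing to go on.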

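In D:/tomcat/webapps/nutch/search.jsp, find the include line for the header (it ends in "/>) and change it so that it loads the new header.jsp rather than header.html.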

While you are there, comment out the following JavaScript

function queryfocus() {
  // search.query.focus();
}

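12 In D:/tomcat/conf/server.xml, find the following Connector section and add the URIEncoding and useBodyEncodingForURI attributes shown on its last line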

    <Connector port="8080"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               debug="0" connectionTimeout="20000"
               disableUploadTimeout="true"
               URIEncoding="UTF-8" useBodyEncodingForURI="true" />

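That's it. Restart Tomcat and visit http://localhost:8080 to see the search home page. Search now supports Chinese, including word segmentation, although the segmentation is not very good.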
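Why the URIEncoding setting matters: a Chinese query reaches Tomcat as percent-encoded bytes in the URL, and without that attribute Tomcat 5 decodes them as ISO-8859-1 instead of UTF-8. The sketch below just shows what those bytes look like. `percent_encode` is a throwaway helper of mine, and I am assuming the search form submits its text in a parameter called query.

```shell
#!/bin/sh
# percent_encode STRING: print every byte of STRING as %XX.
# Real browsers leave unreserved ASCII characters unescaped; encoding
# every byte keeps the sketch short and is still a valid encoding.
percent_encode() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \t\n' \
    | sed 's/\(..\)/%\1/g' | tr 'a-f' 'A-F'
}

# The two characters of "中文" given as UTF-8 bytes in octal, so the
# script does not depend on the encoding of the script file itself.
query=$(printf '\344\270\255\346\226\207')
echo "search.jsp?query=$(percent_encode "$query")"
# prints search.jsp?query=%E4%B8%AD%E6%96%87
```

Each Chinese character becomes three bytes here. Decoded as UTF-8 they turn back into two characters; decoded one byte at a time they turn into six meaningless ones, which is why searches for Chinese terms fail without the setting.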

13 Next comes an important feature: updating the index incrementally.

Create recrawl.sh under D:/nutch with the following content

#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
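One detail in the script that is easy to miss: the generate/fetch/update loop finds the segment it has just generated with ls and tail. That only works if segment directories sort by creation time, which they do as long as they are named by timestamp (my understanding of how Nutch names them). A small sketch with made-up directory names; `latest_segment` is my own helper name.

```shell
#!/bin/sh
# latest_segment SEGMENTS_DIR: print the entry that sorts last, which is
# the newest one when the names are timestamps like 20080103020000.
latest_segment() {
  ls -d "$1"/* | tail -n 1
}

demo=$(mktemp -d)
mkdir "$demo/20080101020000" "$demo/20080103020000" "$demo/20080102020000"
latest_segment "$demo"      # prints the path ending in 20080103020000
rm -r "$demo"
```

If anything else ends up in the segments directory and happens to sort after the timestamps, the loop would pick that instead, so it is best to keep the directory clean.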

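In Cygwin, run

./recrawl.sh crawled 8

This updates the index in the crawled directory to a depth of 8. A number of days can be given as a third argument (adddays in the script). Restart Tomcat once it has finished, because the Nutch web app caches data for queries.

Finally, you can put this script in crontab so that it runs at 2 a.m. every night, restarting Tomcat when it finishes.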
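A sketch of what the nightly job could look like. Everything here is an assumption on my part: the function names, the nightly.sh and nightly.log file names, the pause between stopping and starting, and the use of Tomcat's own bin/shutdown.sh and bin/startup.sh to restart it. /cygdrive/d is how Cygwin exposes the D: drive.

```shell
#!/bin/sh
# nightly_recrawl NUTCH_DIR CRAWL_DIR DEPTH TOMCAT_DIR
# Runs the recrawl, then restarts Tomcat only if the recrawl succeeded.
nightly_recrawl() {
  nutch_dir=$1
  crawl_dir=$2
  depth=$3
  tomcat_dir=$4
  ( cd "$nutch_dir" && ./recrawl.sh "$crawl_dir" "$depth" ) || return 1
  "$tomcat_dir/bin/shutdown.sh"
  sleep "${RESTART_PAUSE:-10}"    # hypothetical pause so Tomcat can stop
  "$tomcat_dir/bin/startup.sh"
}

# cron_line COMMAND: crontab entry that runs COMMAND at 02:00 every day
# (the five fields are minute, hour, day of month, month, day of week).
cron_line() {
  printf '0 2 * * * %s\n' "$1"
}

# nightly.sh would be a small script that calls, for example:
#   nightly_recrawl /cygdrive/d/nutch crawled 8 /cygdrive/d/tomcat
cron_line '/cygdrive/d/nutch/nightly.sh >> /cygdrive/d/nutch/nightly.log 2>&1'
```

The printed line is what would go into the file edited with crontab -e.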
