ubuntu安裝Scrapy

鴨脖發表於2013-05-22

Scrapy是Python開發的一個快速,高層次的螢幕抓取和web抓取框架,用於抓取web站點並從頁面中提取結構化的資料。Scrapy用途廣泛,可以用於資料探勘、監測和自動化測試。 官網網站http://www.scrapy.org/
1、安裝如下軟體

sudo apt-get install build-essential;
sudo apt-get install python-dev;
sudo apt-get install libxml2-dev;
sudo apt-get install libxslt1-dev;
sudo apt-get install python-setuptools;

2、安裝Scrapy

sudo easy_install Scrapy;
wang@ubuntu:/usr/local/lib/python2.7/dist-packages$ sudo easy_install Scrapy
Searching for Scrapy
Best match: Scrapy 0.16.1
Processing Scrapy-0.16.1-py2.7.egg
Scrapy 0.16.1 is already the active version in easy-install.pth
Installing scrapy script to /usr/local/bin
 
Using /usr/local/lib/python2.7/dist-packages/Scrapy-0.16.1-py2.7.egg
Processing dependencies for Scrapy
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 3.0.1
Downloading http://pypi.python.org/packages/source/l/lxml/lxml-3.0.1.tar.gz#md5=0f2b1a063ab3b6b0944cbc4a9a85dcfa
Processing lxml-3.0.1.tar.gz
Running lxml-3.0.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-qibAzL/lxml-3.0.1/egg-dist-tmp-mSvUVN
Building lxml version 3.0.1.
Building without Cython.
Using build configuration of libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /usr/lib/x86_64-linux-gnu
warning: no files found matching '*.txt' under directory 'src/lxml/tests'
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree__getFilenameForFile’:
src/lxml/lxml.etree.c:26310:7: warning: variable ‘__pyx_clineno’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c:26309:15: warning: variable ‘__pyx_filename’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c:26308:7: warning: variable ‘__pyx_lineno’ set but not used [-Wunused-but-set-variable]
src/lxml/lxml.etree.c: In function ‘__pyx_pf_4lxml_5etree_4XSLT_18__call__’:
src/lxml/lxml.etree.c:132608:81: warning: passing argument 1 of ‘__pyx_f_4lxml_5etree_12_XSLTContext__copy’ from incompatible pointer type [enabled by default]
src/lxml/lxml.etree.c:130569:52: note: expected ‘struct __pyx_obj_4lxml_5etree__XSLTContext *’ but argument is of type ‘struct __pyx_obj_4lxml_5etree__BaseContext *’
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree__copyXSLT’:
src/lxml/lxml.etree.c:133997:79: warning: passing argument 1 of ‘__pyx_f_4lxml_5etree_12_XSLTContext__copy’ from incompatible pointer type [enabled by default]
src/lxml/lxml.etree.c:130569:52: note: expected ‘struct __pyx_obj_4lxml_5etree__XSLTContext *’ but argument is of type ‘struct __pyx_obj_4lxml_5etree__BaseContext *’
src/lxml/lxml.etree.c: At top level:
src/lxml/lxml.etree.c:12128:13: warning: ‘__pyx_f_4lxml_5etree_displayNode’ defined but not used [-Wunused-function]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFile’:
src/lxml/lxml.etree.c:86715:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDoc’:
src/lxml/lxml.etree.c:86403:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseUnicodeDoc’:
src/lxml/lxml.etree.c:86093:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
src/lxml/lxml.etree.c: In function ‘__pyx_f_4lxml_5etree_11_BaseParser__parseDocFromFilelike’:
src/lxml/lxml.etree.c:86925:3: warning: ‘__pyx_r’ may be used uninitialized in this function [-Wuninitialized]
Adding lxml 3.0.1 to easy-install.pth file
 
Installed /usr/local/lib/python2.7/dist-packages/lxml-3.0.1-py2.7-linux-x86_64.egg
Searching for w3lib>=1.2
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=f929d5973a9fda59587b09a72f185a9e
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-ZAXTgy/w3lib-1.2/egg-dist-tmp-aU3vpc
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file
 
Installed /usr/local/lib/python2.7/dist-packages/w3lib-1.2-py2.7.egg
Searching for Twisted>=8.0
Reading http://pypi.python.org/simple/Twisted/
Reading http://www.twistedmatrix.com
Reading http://twistedmatrix.com/products/download
Reading http://twistedmatrix.com/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/9.0/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/10.0/
Reading http://twistedmatrix.com/projects/core/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/8.2/
Reading http://tmrc.mit.edu/mirror/twisted/Twisted/8.1/
Best match: Twisted 12.2.0
Downloading http://pypi.python.org/packages/source/T/Twisted/Twisted-12.2.0.tar.bz2#md5=9a321b904d01efd695079f8484b37861
Processing Twisted-12.2.0.tar.bz2
Running Twisted-12.2.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kw897y/Twisted-12.2.0/egg-dist-tmp-sZWFYb
In file included from /usr/include/python2.7/Python.h:8:0,
                 from twisted/internet/_sigchld.c:9:
/usr/include/python2.7/pyconfig.h:1161:0: warning: "_POSIX_C_SOURCE" redefined [enabled by default]
/usr/include/features.h:215:0: note: this is the location of the previous definition
twisted/internet/_sigchld.c: In function ‘got_signal’:
twisted/internet/_sigchld.c:15:13: warning: variable ‘ignored_result’ set but not used [-Wunused-but-set-variable]
Adding Twisted 12.2.0 to easy-install.pth file
Installing mailmail script to /usr/local/bin
Installing conch script to /usr/local/bin
Installing pyhtmlizer script to /usr/local/bin
Installing twistd script to /usr/local/bin
Installing lore script to /usr/local/bin
Installing tkconch script to /usr/local/bin
Installing tapconvert script to /usr/local/bin
Installing ckeygen script to /usr/local/bin
Installing tap2rpm script to /usr/local/bin
Installing manhole script to /usr/local/bin
Installing trial script to /usr/local/bin
Installing cftp script to /usr/local/bin
Installing tap2deb script to /usr/local/bin
 
Installed /usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg
Finished processing dependencies for Scrapy

表示安裝成功。

3、測試

scrapy shell http://ziki.cn

獲取所有a標籤

hxs.select('//a').extract()

參考資料

http://doc.scrapy.org/en/latest/intro/install.html
http://doc.scrapy.org/en/latest/intro/tutorial.html

原創文章,轉載請註明: 轉載自海波無痕