Scrapy Source Code Reading and Analysis, Part 2: Startup Flow

Posted by 擒賊先擒王 on 2019-02-19

 

From:https://blog.csdn.net/weixin_37947156/article/details/74436333

 

Open the downloaded Scrapy source code in PyCharm (GitHub: https://github.com/scrapy/scrapy).

 

 

The scrapy command

 

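Once you have written a spider with Scrapy, you run it with the scrapy crawl <spider_name> command. What actually happens during that run, and where does the scrapy command come from?

In fact, once Scrapy is installed, you can locate the command like this: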

$ which scrapy
/usr/local/bin/scrapy

Open it with vim or any other editor: $ vim /usr/local/bin/scrapy

It turns out to be a Python script, and a very short one:

#!/usr/bin/python3

# -*- coding: utf-8 -*-
import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())
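The re.sub line in the wrapper is easy to skim past. As a small illustration (the helper name normalize_argv0 is my own, not part of Scrapy or setuptools), the same regular expression strips the suffixes that Windows launchers append, so that argv[0] ends up as the plain command name:

```python
import re


def normalize_argv0(argv0):
    """Strip a trailing '-script.py', '-script.pyw' or '.exe', as the wrapper does."""
    return re.sub(r'(-script\.pyw?|\.exe)?$', '', argv0)


# A Unix-style path has no such suffix and is returned unchanged.
print(normalize_argv0('/usr/local/bin/scrapy'))
# Windows-style launcher names are reduced to the bare command name.
print(normalize_argv0('scrapy-script.py'))
print(normalize_argv0('scrapy.exe'))
```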

Why is this the entry point after Scrapy is installed? Because Scrapy's install script, setup.py, declares the program's entry point:

from os.path import dirname, join
from pkg_resources import parse_version
from setuptools import setup, find_packages, __version__ as setuptools_version


with open(join(dirname(__file__), 'scrapy/VERSION'), 'rb') as f:
    version = f.read().decode('ascii').strip()


def has_environment_marker_platform_impl_support():
    """Code extracted from 'pytest/setup.py'
    https://github.com/pytest-dev/pytest/blob/7538680c/setup.py#L31

    The first known release to support environment marker with range operators
    it is 18.5, see:
    https://setuptools.readthedocs.io/en/latest/history.html#id235
    """
    return parse_version(setuptools_version) >= parse_version('18.5')


extras_require = {}

if has_environment_marker_platform_impl_support():
    extras_require[':platform_python_implementation == "PyPy"'] = [
        'PyPyDispatcher>=2.1.0',
    ]


setup(
    name='Scrapy',
    version=version,
    url='https://scrapy.org',
    description='A high-level Web Crawling and Web Scraping framework',
    long_description=open('README.rst').read(),
    author='Scrapy developers',
    maintainer='Pablo Hoffman',
    maintainer_email='pablo@pablohoffman.com',
    license='BSD',
    packages=find_packages(exclude=('tests', 'tests.*')),
    include_package_data=True,
    zip_safe=False,
    entry_points={
        'console_scripts': ['scrapy = scrapy.cmdline:execute']
    },
    classifiers=[
        'Framework :: Scrapy',
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: BSD License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: Implementation :: CPython',
        'Programming Language :: Python :: Implementation :: PyPy',
        'Topic :: Internet :: WWW/HTTP',
        'Topic :: Software Development :: Libraries :: Application Frameworks',
        'Topic :: Software Development :: Libraries :: Python Modules',
    ],
    python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*',
    install_requires=[
        'Twisted>=13.1.0',
        'w3lib>=1.17.0',
        'queuelib',
        'lxml',
        'pyOpenSSL',
        'cssselect>=0.9',
        'six>=1.5.2',
        'parsel>=1.5',
        'PyDispatcher>=2.0.5',
        'service_identity',
    ],
    extras_require=extras_require,
)

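entry_points declares that the entry point is the execute function in cmdline.py. During installation, setuptools (the packaging tool) generates the short script shown above and places it on the executable path.

It is also worth mentioning how to turn a Python script into an executable file. It only takes a few steps:

  • Write a Python module with a main function (the first line must be a shebang giving the path to the Python interpreter)
  • Drop the .py extension (optional, but it makes the file read like a command)
  • Make the file executable: chmod +x <script>

You can then run the script directly by its file name instead of through python <file.py>.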
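As a minimal sketch of those steps, here is a hypothetical script (the file name hello and its greeting are made up for illustration). Saved as hello and marked executable with chmod +x hello, it can be started as ./hello scrapy:

```python
#!/usr/bin/env python3
"""A hypothetical 'hello' command used only to illustrate the steps above."""
import sys


def main(argv=None):
    # Fall back to the real command-line arguments when none are passed in.
    argv = sys.argv[1:] if argv is None else argv
    name = argv[0] if argv else "world"
    message = "hello, %s" % name
    print(message)
    return message


if __name__ == '__main__':
    main()
```

The shebang line is what lets the shell find an interpreter; without it, the file would have to be started through python explicitly.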

 

 

Entry point (cmdline.py)

 

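Now that we know Scrapy's entry point is the execute function in scrapy/cmdline.py, let's look at that function.

The main flow is annotated in the source comments; it breaks down into the steps walked through in the sections below.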

 

 

Walking through the flow

 

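Initializing the project settings

This step is fairly simple: it initializes the environment from environment variables and scrapy.cfg, and ends up producing a Settings instance. Here is the code of get_project_settings (from scrapy.utils.project import inside_project, get_project_settings):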

def get_project_settings():
    # Is SCRAPY_SETTINGS_MODULE already set in the environment?
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')

        # Initialize the environment: locate the user's settings.py and store its module path in SCRAPY_SETTINGS_MODULE
        init_env(project)

    # Load the defaults from default_settings.py and build the Settings instance
    settings = Settings()

    # Get the user's settings module path
    settings_module_path = os.environ.get(ENVVAR)

    # Update the settings: user settings override the defaults
    if settings_module_path:
        settings.setmodule(settings_module_path, priority='project')

    # XXX: remove this hack
    # Override with any other Scrapy settings passed through the environment
    pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
    if pickled_settings:
        settings.setdict(pickle.loads(pickled_settings), priority='project')

    # XXX: deprecate and remove this functionality
    env_overrides = {k[7:]: v for k, v in os.environ.items() if
                     k.startswith('SCRAPY_')}
    if env_overrides:
        settings.setdict(env_overrides, priority='project')

    return settings

This step instantiates Settings (from scrapy.settings import Settings):

class Settings(BaseSettings):
    """
    This object stores Scrapy settings for the configuration of internal
    components, and can be used for any further customization.

    It is a direct subclass and supports all methods of
    :class:`~scrapy.settings.BaseSettings`. Additionally, after instantiation
    of this class, the new object will have the global default settings
    described on :ref:`topics-settings-ref` already populated.
    """

    def __init__(self, values=None, priority='project'):
        # Do not pass kwarg values here. We don't want to promote user-defined
        # dicts, and we want to update, not replace, default dicts with the
        # values given by the user
        
        # Call the parent constructor
        super(Settings, self).__init__()

        # Copy every setting from default_settings.py into this Settings instance
        self.setmodule(default_settings, 'default')
        # Promote default dictionaries to BaseSettings instances for per-key
        # priorities
        # Wrap dict-valued defaults in BaseSettings so each key carries its own priority
        for name, val in six.iteritems(self):
            if isinstance(val, dict):
                self.set(name, BaseSettings(val, 'default'), 'default')
        self.update(values, priority)

The program loads every setting from the default settings file, default_settings.py, into Settings, and each value is stored together with a priority.
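To make the idea of prioritized settings concrete, here is a minimal sketch. PrioritySettings is a hypothetical stand-in written for this article, not Scrapy's BaseSettings; the numeric levels are meant to mirror Scrapy's priority names (default, command, project, spider, cmdline), and the rule shown is simply that a value is replaced only by one of equal or higher priority:

```python
# Priority names and levels, intended to mirror Scrapy's priority table.
PRIORITIES = {'default': 0, 'command': 10, 'project': 20, 'spider': 30, 'cmdline': 40}


class PrioritySettings:
    """Hypothetical stand-in for BaseSettings: each key keeps (value, priority)."""

    def __init__(self):
        self._store = {}

    def set(self, name, value, priority='project'):
        level = PRIORITIES[priority]
        current = self._store.get(name)
        # Only overwrite when the new priority is at least as high as the stored one.
        if current is None or level >= current[1]:
            self._store[name] = (value, level)

    def get(self, name, default=None):
        return self._store[name][0] if name in self._store else default


settings = PrioritySettings()
settings.set('LOG_LEVEL', 'DEBUG', 'default')
settings.set('LOG_LEVEL', 'INFO', 'project')     # overrides the default
settings.set('LOG_LEVEL', 'WARNING', 'default')  # ignored: lower priority
print(settings.get('LOG_LEVEL'))
```

This is why the order of the setmodule and setdict calls in get_project_settings matters less than the priority each call passes in.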

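default_settings.py is important, and I think it is worth reading through. It holds all the defaults, for example the scheduler class, the spider middleware classes, the downloader middleware classes, the download handler classes, and so on.

You can already sense here that Scrapy's architecture is very loosely coupled: every component is replaceable. What does replaceable mean?

For example, if the default scheduler does not do enough for you, you can implement your own scheduler against the interface it defines and register your scheduler module in your own settings file. Scrapy will then use your scheduler at run time. (scrapy-redis achieves distributed crawling by replacing Scrapy modules in exactly this way.)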

Any module that is configured in the default settings file can be replaced.
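What makes this possible is that components are named in the settings by dotted path and resolved at run time. The sketch below is a simplified take on that idea (Scrapy has its own load_object helper; this version is written from scratch for illustration). The SETTINGS dict and the use of collections.OrderedDict as a stand-in class are assumptions, chosen only so the example runs without a project:

```python
from importlib import import_module


def load_object(path):
    """Resolve a dotted path such as 'package.module.Name' to the object it names."""
    module_path, _, name = path.rpartition('.')
    if not module_path:
        raise ValueError("Not a full dotted path: %r" % path)
    return getattr(import_module(module_path), name)


# Stand-in for a setting such as SCHEDULER = 'myproject.scheduler.MyScheduler'
# (that module is hypothetical, so a standard-library class is used here).
SETTINGS = {'SCHEDULER': 'collections.OrderedDict'}
scheduler_cls = load_object(SETTINGS['SCHEDULER'])
print(scheduler_cls)
```

Swapping a component then comes down to changing a string in the settings file, provided the new class follows the interface the framework expects.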

 

Checking whether we are inside a project

 

def inside_project():
    # Check whether this environment variable exists (it was set above)
    scrapy_module = os.environ.get('SCRAPY_SETTINGS_MODULE')
    if scrapy_module is not None:
        try:
            import_module(scrapy_module)
        except ImportError as exc:
            warnings.warn("Cannot import scrapy settings module %s: %s" % (scrapy_module, exc))
        else:
            return True
    # If the variable is not set, look for the closest scrapy.cfg; finding one means we are inside a project
    return bool(closest_scrapy_cfg())

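Some scrapy commands depend on a project, while others are global and work anywhere. The check here mainly comes down to searching for the closest scrapy.cfg file to decide whether we are in a project environment.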
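The "closest scrapy.cfg" search can be pictured as walking up the directory tree until the file is found or the filesystem root is reached. The function below is a sketch of that idea under my own name, closest_cfg; it is not copied from Scrapy's closest_scrapy_cfg, and the temporary directory layout is made up for the demo:

```python
import os
import tempfile


def closest_cfg(path='.', filename='scrapy.cfg'):
    """Return the path of the nearest `filename` in `path` or its parents, or ''."""
    path = os.path.abspath(path)
    while True:
        candidate = os.path.join(path, filename)
        if os.path.exists(candidate):
            return candidate
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root without finding it
            return ''
        path = parent


# Demo in a throwaway tree: <root>/scrapy.cfg and <root>/a/b/
root = os.path.realpath(tempfile.mkdtemp())
open(os.path.join(root, 'scrapy.cfg'), 'w').close()
nested = os.path.join(root, 'a', 'b')
os.makedirs(nested)

found = closest_cfg(nested)
print(found == os.path.join(root, 'scrapy.cfg'))
```

This is why project-only commands work from any subdirectory of a project, not just from its top level.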

 

Collecting the available commands into a name-to-instance dict

 

def _get_commands_dict(settings, inproject):
    # Import every module under the commands package and build a {cmd_name: cmd} dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
    cmds.update(_get_commands_from_entry_points(inproject))

    # If the user's settings define COMMANDS_MODULE, load the custom command classes as well
    cmds_module = settings['COMMANDS_MODULE']
    if cmds_module:
        cmds.update(_get_commands_from_module(cmds_module, inproject))
    return cmds


def _get_commands_from_module(module, inproject):
    d = {}

    # Find every command class (ScrapyCommand subclass) in this module
    for cmd in _iter_command_classes(module):
        if inproject or not cmd.requires_project:
            # Build the {cmd_name: cmd} dict; the name is the last part of the module path
            cmdname = cmd.__module__.split('.')[-1]
            d[cmdname] = cmd()
    return d


def _iter_command_classes(module_name):
    # TODO: add `name` attribute to commands and merge this function with
    # scrapy.utils.spider.iter_spider_classes
    # Walk every module in this package and yield the ScrapyCommand subclasses
    for module in walk_modules(module_name):
        for obj in vars(module).values():
            if inspect.isclass(obj) and \
                    issubclass(obj, ScrapyCommand) and \
                    obj.__module__ == module.__name__ and \
                    not obj == ScrapyCommand:
                yield obj

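This step imports every module under the commands package and builds a {cmd_name: cmd} dict. If the user has configured custom command classes in the settings file, they are added too. In other words, you can write your own command class, register it in the settings file, and then use your custom command.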
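The discovery logic above can be tried out without Scrapy. In the sketch below, BaseCommand stands in for ScrapyCommand, and make_module fabricates in-memory modules; both are assumptions made so the example is self-contained. The filtering conditions themselves follow the listing: keep classes defined in the module itself, skip the base class, and skip project-only commands when outside a project:

```python
import inspect
import types


class BaseCommand:
    """Hypothetical stand-in for ScrapyCommand."""
    requires_project = False


def iter_command_classes(module):
    for obj in vars(module).values():
        if (inspect.isclass(obj)
                and issubclass(obj, BaseCommand)
                and obj.__module__ == module.__name__  # defined here, not imported
                and obj is not BaseCommand):
            yield obj


def commands_from_modules(modules, inproject):
    cmds = {}
    for module in modules:
        for cls in iter_command_classes(module):
            if inproject or not cls.requires_project:
                # The command name is the last component of the module path.
                cmds[cls.__module__.split('.')[-1]] = cls()
    return cmds


def make_module(name, requires_project):
    """Fabricate a module that holds one Command class plus the imported base."""
    module = types.ModuleType(name)
    module.Command = type('Command', (BaseCommand,),
                          {'requires_project': requires_project, '__module__': name})
    module.BaseCommand = BaseCommand
    return module


modules = [make_module('demo.commands.crawl', True),
           make_module('demo.commands.version', False)]
print(sorted(commands_from_modules(modules, inproject=False)))
print(sorted(commands_from_modules(modules, inproject=True)))
```

Outside a project only the global command survives the filter, which matches the shorter command list scrapy shows when it is run outside a project directory.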

 

Parsing the command being run and finding its command instance

 

def _pop_command_name(argv):
    i = 0
    for arg in argv[1:]:
        if not arg.startswith('-'):
            del argv[i]
            return arg
        i += 1

This step parses the command line. For scrapy crawl <spider_name> it extracts crawl, and the command dict built above then leads to the Command instance defined in crawl.py under the commands package.

 

Letting the command instance parse the command-line options

 

Once the command instance is found, its cmd.process_options method is called (for example, in scrapy/commands/crawl.py):

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run a spider"

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        # First call the parent's process_options to parse the common, fixed options
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
        if opts.output:
            if opts.output == '-':
                self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
            else:
                self.settings.set('FEED_URI', opts.output, priority='cmdline')
            feed_exporters = without_none_values(
                self.settings.getwithbase('FEED_EXPORTERS'))
            valid_output_formats = feed_exporters.keys()
            if not opts.output_format:
                opts.output_format = os.path.splitext(opts.output)[1].replace(".", "")
            if opts.output_format not in valid_output_formats:
                raise UsageError("Unrecognized output format '%s', set one"
                                 " using the '-t' switch or as a file extension"
                                 " from the supported list %s" % (opts.output_format,
                                                                  tuple(valid_output_formats)))
            self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        self.crawler_process.crawl(spname, **opts.spargs)
        self.crawler_process.start()

        if self.crawler_process.bootstrap_failed:
            self.exitcode = 1

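This step parses the rest of the command line. The common options shared by all commands (such as logging options) are handled by the parent class, while each command class parses its own options; for crawl these are -a, -o and -t.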
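The -a NAME=VALUE handling relies on arglist_to_dict turning a list of strings into keyword arguments for the spider. A minimal sketch of such a helper, written here for illustration rather than copied from Scrapy, could look like this:

```python
def arglist_to_dict(arglist):
    """Convert ['a=1', 'b=2'] into {'a': '1', 'b': '2'}.

    Each item is split on the first '=' only, so values may contain '='.
    An item without '=' makes dict() raise ValueError.
    """
    return dict(item.split('=', 1) for item in arglist)


# Equivalent to: scrapy crawl myspider -a category=books -a query=a=b
spargs = arglist_to_dict(['category=books', 'query=a=b'])
print(spargs)
```

The ValueError raised for a malformed item is what process_options catches and turns into the "Invalid -a value" usage error.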

 

Initializing CrawlerProcess

 

Finally, a CrawlerProcess instance is created and the run method of the matching command instance is executed.

cmd.crawler_process = CrawlerProcess(settings)
_run_print_help(parser, _run_command, cmd, args, opts)

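If the command is scrapy crawl <spider_name>, what runs is the run method in commands/crawl.py (see the run method in the code above).

run calls crawl and start on the CrawlerProcess instance, and that is how the whole crawl gets going.

First, let's look at how CrawlerProcess is initialized (scrapy/crawler.py):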

class CrawlerProcess(CrawlerRunner):


    def __init__(self, settings=None, install_root_handler=True):
        # Call the parent constructor
        super(CrawlerProcess, self).__init__(settings)

        # Set up signal handlers and logging
        install_shutdown_handlers(self._signal_shutdown)
        configure_logging(self.settings, install_root_handler)
        log_scrapy_info(self.settings)

The constructor calls the constructor of the parent class, CrawlerRunner:

class CrawlerRunner(object):

    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        
        # Get the spider loader
        self.spider_loader = _get_spider_loader(settings)
        self._crawlers = set()
        self._active = set()
        self.bootstrap_failed = False

During initialization, _get_spider_loader is called:

def _get_spider_loader(settings):
    """ Get SpiderLoader instance from settings """

    # Read the (deprecated) SPIDER_MANAGER_CLASS setting
    if settings.get('SPIDER_MANAGER_CLASS'):
        warnings.warn(
            'SPIDER_MANAGER_CLASS option is deprecated. '
            'Please use SPIDER_LOADER_CLASS.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    cls_path = settings.get('SPIDER_MANAGER_CLASS',
                            settings.get('SPIDER_LOADER_CLASS'))
    loader_cls = load_object(cls_path)
    try:
        verifyClass(ISpiderLoader, loader_cls)
    except DoesNotImplement:
        warnings.warn(
            'SPIDER_LOADER_CLASS (previously named SPIDER_MANAGER_CLASS) does '
            'not fully implement scrapy.interfaces.ISpiderLoader interface. '
            'Please add all missing methods to avoid unexpected runtime errors.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    return loader_cls.from_settings(settings.frozencopy())

In the default settings, the spider loader (SPIDER_LOADER_CLASS) is spiderloader.SpiderLoader (scrapy/spiderloader.py):

@implementer(ISpiderLoader)
class SpiderLoader(object):
    """
    SpiderLoader is a class which locates and loads spiders
    in a Scrapy project.
    """
    def __init__(self, settings):
        # Read from the settings the modules where the spiders live
        self.spider_modules = settings.getlist('SPIDER_MODULES')
        self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')
        self._spiders = {}
        self._found = defaultdict(list)

        # Load all spiders
        self._load_all_spiders()

    def _check_name_duplicates(self):
        dupes = ["\n".join("  {cls} named {name!r} (in {module})".format(
                                module=mod, cls=cls, name=name)
                           for (mod, cls) in locations)
                 for name, locations in self._found.items()
                 if len(locations)>1]
        if dupes:
            msg = ("There are several spiders with the same name:\n\n"
                   "{}\n\n  This can cause unexpected behavior.".format(
                        "\n\n".join(dupes)))
            warnings.warn(msg, UserWarning)

    def _load_spiders(self, module):
        for spcls in iter_spider_classes(module):
            self._found[spcls.name].append((module.__name__, spcls.__name__))
            self._spiders[spcls.name] = spcls

    def _load_all_spiders(self):
        # Build the {spider_name: spider_cls} dict
        for name in self.spider_modules:
            try:
                for module in walk_modules(name):
                    self._load_spiders(module)
            except ImportError as e:
                if self.warn_only:
                    msg = ("\n{tb}Could not load spiders from module '{modname}'. "
                           "See above traceback for details.".format(
                                modname=name, tb=traceback.format_exc()))
                    warnings.warn(msg, RuntimeWarning)
                else:
                    raise
        self._check_name_duplicates()

    @classmethod
    def from_settings(cls, settings):
        return cls(settings)

    def load(self, spider_name):
        """
        Return the Spider class for the given spider name. If the spider
        name is not found, raise a KeyError.
        """
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError("Spider not found: {}".format(spider_name))

    def find_by_request(self, request):
        """
        Return the list of spider names that can handle the given request.
        """
        return [name for name, cls in self._spiders.items()
                if cls.handles_request(request)]

    def list(self):
        """
        Return a list with the names of all spiders available in the project.
        """
        return list(self._spiders.keys())

The spider loader loads every spider module and ends up with a {spider_name: spider_cls} dict.
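The two dictionaries kept by SpiderLoader serve different purposes: _spiders maps a name to the class that will be used, while _found remembers every location a name was seen so duplicates can be reported. The sketch below reproduces that bookkeeping with plain classes; the three spider classes and build_registry are hypothetical and exist only for this example:

```python
from collections import defaultdict


class NewsSpider:
    name = 'news'


class BlogSpider:
    name = 'blog'


class OtherNewsSpider:
    name = 'news'  # deliberately reuses the name of NewsSpider


def build_registry(spider_classes):
    spiders = {}
    found = defaultdict(list)
    for cls in spider_classes:
        found[cls.name].append(cls.__name__)
        spiders[cls.name] = cls  # a later class with the same name replaces the earlier one
    # Names that were seen more than once, with the classes that claimed them.
    dupes = {name: classes for name, classes in found.items() if len(classes) > 1}
    return spiders, dupes


spiders, dupes = build_registry([NewsSpider, BlogSpider, OtherNewsSpider])
print(sorted(spiders))
print(dupes)
```

Because the later class silently replaces the earlier one, the loader only warns about duplicates; it does not refuse to start.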

 

Running the crawl and start methods

 

After CrawlerProcess is initialized, its crawl method is called:

class CrawlerRunner(object):

    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        self.spider_loader = _get_spider_loader(settings)
        self._crawlers = set()
        self._active = set()
        self.bootstrap_failed = False

    @property
    def spiders(self):
        warnings.warn("CrawlerRunner.spiders attribute is renamed to "
                      "CrawlerRunner.spider_loader.",
                      category=ScrapyDeprecationWarning, stacklevel=2)
        return self.spider_loader

    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # Create the crawler
        crawler = self.create_crawler(crawler_or_spidercls)
        return self._crawl(crawler, *args, **kwargs)

    def _crawl(self, crawler, *args, **kwargs):
        self.crawlers.add(crawler)
        # Call the Crawler's crawl method
        d = crawler.crawl(*args, **kwargs)
        self._active.add(d)

        def _done(result):
            self.crawlers.discard(crawler)
            self._active.discard(d)
            self.bootstrap_failed |= not getattr(crawler, 'spider', None)
            return result

        return d.addBoth(_done)

    def create_crawler(self, crawler_or_spidercls):
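        # If a Crawler instance was passed in, return it as is
        if isinstance(crawler_or_spidercls, Crawler):
            return crawler_or_spidercls
        # Otherwise create a Crawler (a spider name given as a string is resolved through spider_loader)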
        return self._create_crawler(crawler_or_spidercls)

    def _create_crawler(self, spidercls):
        if isinstance(spidercls, six.string_types):
            spidercls = self.spider_loader.load(spidercls)
        return Crawler(spidercls, self.settings)

    def stop(self):
        """
        Stops simultaneously all the crawling jobs taking place.

        Returns a deferred that is fired when they all have ended.
        """
        return defer.DeferredList([c.stop() for c in list(self.crawlers)])

    @defer.inlineCallbacks
    def join(self):
        """
        join()

        Returns a deferred that is fired when all managed :attr:`crawlers` have
        completed their executions.
        """
        while self._active:
            yield defer.DeferredList(self._active)

This step creates a Crawler instance and then calls its crawl method (class Crawler in scrapy/crawler.py):

    @defer.inlineCallbacks
    def crawl(self, *args, **kwargs):
        assert not self.crawling, "Crawling already taking place"
        self.crawling = True

        try:
            # Only now is the spider actually instantiated
            self.spider = self._create_spider(*args, **kwargs)

            # Create the engine
            self.engine = self._create_engine()

            # Call the spider's start_requests method
            start_requests = iter(self.spider.start_requests())

            # Run the engine's open_spider, passing in the spider instance and the initial requests
            yield self.engine.open_spider(self.spider, start_requests)
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            # In Python 2 reraising an exception after yield discards
            # the original traceback (see https://bugs.python.org/issue7563),
            # so sys.exc_info() workaround is used.
            # This workaround also works in Python 3, but it is not needed,
            # and it is slower, so in Python 3 we use native `raise`.
            if six.PY2:
                exc_info = sys.exc_info()

            self.crawling = False
            if self.engine is not None:
                yield self.engine.close()

            if six.PY2:
                six.reraise(*exc_info)
            raise

Finally, the start method is called:

    def start(self, stop_after_crawl=True):
        """
        This method starts a Twisted `reactor`_, adjusts its pool size to
        :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS cache based
        on :setting:`DNSCACHE_ENABLED` and :setting:`DNSCACHE_SIZE`.

        If `stop_after_crawl` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.

        :param boolean stop_after_crawl: stop or not the reactor when all
            crawlers have finished
        """
        if stop_after_crawl:
            d = self.join()
            # Don't start the reactor if the deferreds are already fired
            if d.called:
                return
            d.addBoth(self._stop_reactor)

        reactor.installResolver(self._get_dns_resolver())

        # Set the reactor's thread pool size (tunable through REACTOR_THREADPOOL_MAXSIZE)
        tp = reactor.getThreadPool()
        tp.adjustPoolsize(maxthreads=self.settings.getint('REACTOR_THREADPOOL_MAXSIZE'))
        reactor.addSystemEventTrigger('before', 'shutdown', self.stop)

        # Start running
        reactor.run(installSignalHandlers=False)  # blocking call

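So what is the reactor? It is the event manager (event loop) of the Twisted library. You register the callbacks you want run with the reactor and call its run method; it then runs them for you, and whenever one is waiting on network I/O it switches to another callback that is ready to run, which is very efficient.

You do not need to worry about how the reactor works internally. You can loosely think of it as a thread pool, except that it runs events through registered callbacks (strictly speaking, it is a single-threaded event loop).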
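To get a feel for "switching while waiting on I/O" without installing Twisted, here is an analogy using asyncio from the standard library. This is deliberately not Twisted code and not how Scrapy is written; asyncio.sleep stands in for a network wait, and the fetch function and its delays are made up. The point is only that a single event loop interleaves two tasks instead of running them one after the other:

```python
import asyncio


async def fetch(name, delay, log):
    log.append("%s start" % name)
    # While this task waits, the event loop is free to run other tasks.
    await asyncio.sleep(delay)
    log.append("%s done" % name)


async def main():
    log = []
    # Both tasks are handed to the same loop; neither blocks the other.
    await asyncio.gather(fetch("slow", 0.05, log), fetch("fast", 0.01, log))
    return log


log = asyncio.run(main())
print(log)
```

Both tasks start before either finishes, and the shorter wait completes first even though it was scheduled second. Twisted's reactor achieves the same effect with registered callbacks and Deferreds.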

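From this point on, the scheduling logic of the crawl is handed to the engine, ExecutionEngine.

To sum up, each run of a scrapy command goes through environment and settings initialization, loads the command classes and spider modules, and finally instantiates the execution engine and hands control to it. The next article explains how the execution engine schedules and manages the individual components.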

 

 

 
