Learning notes on the Scrapy crawling framework, mostly a consolidation of the official Scrapy 1.6 documentation.

Introductory material about Scrapy is not repeated here,
and it is assumed that the framework is already installed on your OS.

# Creating a Project

Following the documentation, we will also scrape information from the pages of Quotes to Scrape.
First, create the corresponding project:

scrapy startproject tutorial

Note that if your machine has multiple Python interpreters, for example several Anaconda environments, remember to switch to the environment where Scrapy is installed before creating the project.
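
For example, with Anaconda (the environment name scrapy_env is hypothetical):

conda activate scrapy_env     # switch to the environment where Scrapy is installed
scrapy startproject tutorial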

The created project structure is:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

# Building the Spider

Create a new file quotes_spider.py under the tutorial/spiders directory:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique identifier of this spider within the project

    def start_requests(self):
        # return an iterable of Requests for the spider to start crawling from
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # callback that handles the Response downloaded for each Request
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Here, name is the identifier of the spider and must be unique within the project. start_requests() must return an iterable of Requests for the spider to crawl; each Request is built from a URL and a callback. parse() is the callback specified in those Requests and handles the Response that comes back for each Request; in short, it processes the returned page, extracting the data into dicts or finding new URLs and creating further Requests from them to keep the crawl going.

In this QuotesSpider class, start_requests() creates a Request object for each of the two URLs and returns them one by one via yield (a generator). parse() then handles the Response returned for each Request: it first derives the current page number (1 and 2) from the URL, builds the HTML filename, writes the contents of response.body into the two HTML files, and finally logs a message.

# Running the Spider

Next, run the spider. From the root directory of the tutorial project, run the following command, where quotes is the spider name we defined above:

scrapy crawl quotes

Part of the terminal output is shown below; it indicates that the two HTML files were saved, and checking the directory confirms that both files exist:

...
2019-07-14 14:30:24 [scrapy.core.engine] INFO: Spider opened
2019-07-14 14:30:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-14 14:30:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-14 14:30:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-07-14 14:30:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2019-07-14 14:30:26 [quotes] DEBUG: Saved file quotes-2.html
2019-07-14 14:30:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2019-07-14 14:30:27 [quotes] DEBUG: Saved file quotes-1.html
2019-07-14 14:30:27 [scrapy.core.engine] INFO: Closing spider (finished)
......
2019-07-14 14:30:27 [scrapy.core.engine] INFO: Spider closed (finished)

# A Shortcut for start_requests

A shorthand for the start_requests() method is shown below:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The start_urls class attribute defines the list of URLs; it is used by the default implementation of start_requests() in the Spider base class to create the corresponding Requests. Because parse() is Scrapy's default callback, it is invoked automatically for each URL's request when no callback is specified explicitly.
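
A simplified sketch of what that default implementation roughly does (not the exact Scrapy source; the method would live inside the Spider class):

def start_requests(self):
    # simplified: yield one Request per URL in start_urls;
    # parse() ends up handling the responses because it is Scrapy's default callback
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse)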

# Extracting Data

Scrapy provides a command-line way to debug data extraction, entered via the scrapy shell command. Let's take a first look:

scrapy shell "http://quotes.toscrape.com/page/1/"

Output:

[··· log here ···]
2019-07-14 17:14:18 [scrapy.core.engine] INFO: Spider opened
2019-07-14 17:14:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-07-14 17:14:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000212DD3EAC50>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x00000212DFB00898>
[s]   spider     <DefaultSpider 'default' at 0x212dfe0b898>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

CSS selectors

In [1]: response.css('title')
Out[1]: [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
In [2]: response.css('title::text').getall()
Out[2]: ['Quotes to Scrape']
In [3]: response.css('title').getall()
Out[3]: ['<title>Quotes to Scrape</title>']
In [4]: response.css('title::text').get()
Out[4]: 'Quotes to Scrape'
In [5]: response.css('title::text')[0].get()
Out[5]: 'Quotes to Scrape'
In [6]: response.css('title::text').re(r'Quotes.*')
Out[6]: ['Quotes to Scrape']
In [7]: response.css('title::text').re(r'Q\w+')
Out[7]: ['Quotes']
In [8]: response.css('title::text').re(r'(\w+) to (\w+)')
Out[8]: ['Quotes', 'Scrape']

XPath

In [9]: response.xpath('//title')
Out[9]: [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
In [10]: response.xpath('//title/text()').get()
Out[10]: 'Quotes to Scrape'

The examples above illustrate basic data extraction; now let's extract data from the pages we actually want to crawl.
A sample of the HTML elements on the original page looks like this:

<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" />
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

Debugging in the shell

scrapy shell "http://quotes.toscrape.com"
In [1]: response.css('div.quote')
Out[1]:
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>]
In [2]: quote = response.css("div.quote")[0]
In [3]: title = quote.css("span.text::text").get()
In [4]: title
Out[4]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
In [7]: author = quote.css("small.author::text").get()
In [8]: author
Out[8]: 'Albert Einstein'
In [9]: tags = quote.css("div.tags a.tag::text").getall()
In [10]: tags
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']
In [11]: for quote in response.css("div.quote"):
    ...:     text = quote.css("span.text::text").get()
    ...:     author = quote.css("small.author::text").get()
    ...:     tags = quote.css("div.tags a.tag::text").getall()
    ...:     print(dict(text=text, author=author, tags=tags))
    ...:
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}

# Extracting Data in the Spider

Moving the selector logic from the shell session into the spider, parse() now yields each quote as a Python dict:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Part of the output is shown below:

2019-07-14 18:44:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2019-07-14 18:44:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}

# Storing the Data

# JSON

One option is to save to a JSON file. The following command generates a JSON file containing all the scraped data:

scrapy crawl quotes -o quotes.json

However, for historical reasons Scrapy appends the scraped data to the JSON file rather than overwriting it, so running the command above twice leaves you with a broken JSON file.
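
A quick way to see the breakage (a minimal sketch; quotes.json is the file produced by the command above):

import json

with open('quotes.json') as f:
    # after a second crawl the file holds two concatenated JSON arrays,
    # so this raises json.decoder.JSONDecodeError ("Extra data")
    data = json.load(f)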

# JSON Lines

Another option is to save to a JSON Lines file. JSON Lines is a JSON-based format that stores each record as a separate JSON object on its own line, so the file can be read line by line and memory pressure stays low.

scrapy crawl quotes -o quotes.jl

Part of the .jl file looks like this:

······
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]}
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": ["life", "love"]}
{"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]}
{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]}
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}

For small projects, storing data as JSON Lines is usually sufficient; for larger projects, however, you will want an Item Pipeline (a stub for which was generated when the project was created), which will be covered later.

# Following Links to the Next Page

On most sites we start from a first-level page to gather information, which includes links to second-level pages where we continue collecting what we need. Alternatively, the data may be spread over many pages, and we have to visit, say, dozens of pages to get it all.
Here we again use http://quotes.toscrape.com/ as the example to walk through this crawling flow.

First, locate the link we need. As the page element below shows, the value of the href attribute is what we are after:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Then debug in the shell:

In [1]: response.css('li.next a').get()
Out[1]: '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
In [2]: response.css('li.next a::attr(href)').get()
Out[2]: '/page/2/'

Finally, put it into the project:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

The first half of parse() extracts the data from the current response; it then reads the relative path of the next page into the next_page variable, turns it into an absolute URL with response.urljoin(next_page) (URL joining), and finally yields a new Request for that absolute URL, which is added to the crawl queue so the crawl continues.
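
For example, back in the scrapy shell session above (the prompt numbers here are illustrative):

In [3]: response.urljoin('/page/2/')
Out[3]: 'http://quotes.toscrape.com/page/2/'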

A shorter way to create Requests is response.follow, which accepts relative URLs directly (no urljoin needed):

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)