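A First Taste of Scrapy Spiders
Scrapy now fully supports Python 3, so all of the later examples use a Python 3 environment. Let's start with a quick taste. The code below is a demo taken from the official Scrapy documentation. In just a few lines it crawls http://quotes.toscrape.com/tag/humor/, parses the pages, and stores the results, which gives a glimpse of how powerful Scrapy is.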
# quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page sits in a <div class="quote"> element.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the "next page" link, if there is one, with the same callback.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
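The pagination step relies on response.urljoin to turn the link's href into an absolute URL before requesting it. As a rough illustration only, the standard library's urljoin resolves links in a similar way for a simple page like this one. The href value below is an assumption made for the example, not something read from the live site.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL used by the spider).
page_url = "http://quotes.toscrape.com/tag/humor/"

# Hypothetical href of the "next" link; the real value depends on the page.
next_href = "/tag/humor/page/2/"

# Resolve the href against the page URL to get an absolute URL.
next_page = urljoin(page_url, next_href)
print(next_page)
```

With an absolute URL in hand, the spider can pass it straight to scrapy.Request, which is what the last line of parse does.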
Run it with: scrapy runspider quotes_spider.py -o quotes.json
After the run, the scraped data is stored in the quotes.json file:
[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]
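The \u201c and \u201d sequences in the output are JSON escapes for the curly quotation marks around each quote, not corrupted text. A minimal sketch of reading the data back, using one record copied from the output above as an inline string so that it runs without the file:

```python
import json

# One record from the output above, kept as raw JSON text.
# The r'' prefix keeps the \u escapes literal so that json.loads decodes them.
raw = r'[{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"}]'

quotes = json.loads(raw)
for item in quotes:
    # After decoding, the text contains real curly quotes.
    print(item["author"], "-", item["text"])
```

To read the real file instead, the same json module can load it with json.load on a file opened with open("quotes.json", encoding="utf-8").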