Scrapy in Practice


Date: 2019-07-07

Author: Sun

1. Debugging Scrapy code in PyCharm

Since PyCharm does not ship with built-in support for running Scrapy, it is normally not easy to debug Scrapy code directly. So what do we do when we want to learn Scrapy and step through it in a debugger?

This section shows one way to handle it:

This section uses a project that crawls http://books.toscrape.com/ as the example.

(1) Create the Scrapy project

​ scrapy startproject books_toscrape
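This command generates the standard Scrapy project skeleton; the layout looks roughly like this (file names may vary slightly between Scrapy versions):

    books_toscrape/
        scrapy.cfg            # deploy configuration
        books_toscrape/       # project Python module
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py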

(2) Create the spider

​ cd books_toscrape

​ scrapy genspider toscrape books.toscrape.com

This creates a spider file toscrape.py under the spiders directory.

(3) Create a debug entry file main.py in the project root

books_toscrape/main.py

Contents:

    # -*- coding: utf-8 -*-
    __author__ = 'sun'
    __date__ = '2019/07/07 9:04 PM'

    import os
    import sys

    from scrapy.cmdline import execute

    # add the folder containing this main.py to the import path
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))

    # the name we gave the spider with: scrapy genspider <spider_name> <domain>
    SPIDER_NAME = "toscrape"

    execute(["scrapy", "crawl", SPIDER_NAME])
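With this file in place, running or debugging main.py from PyCharm is equivalent to running scrapy crawl toscrape in the project root. As a side note not covered in the original post, the same effect can also be achieved with Scrapy's CrawlerProcess API, which is another convenient debugging entry point; a minimal sketch assuming the same project layout:

    # alternative main.py using CrawlerProcess instead of scrapy.cmdline.execute
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())  # load the project's settings.py
    process.crawl("toscrape")                         # spider name created by scrapy genspider
    process.start()                                   # blocks until the crawl finishes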

(4) Modify the settings.py configuration file

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

(5) Start debugging

Open main.py, right-click it and choose Debug to enter debug mode.

Set a breakpoint inside the parse function in spiders/toscrape.py and try using XPath to extract some of the book data from the page.

Once debugging starts, you can step into Scrapy itself.
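Before stepping through the spider, it can also help to verify the XPath expressions interactively with scrapy shell (not covered in the original post); the selectors below mirror the ones used in the case study, and the exact output depends on the live page:

    scrapy shell "http://books.toscrape.com/"
    >>> articles = response.xpath('//article[@class="product_pod"]')
    >>> articles[0].xpath('./h3/a/text()').extract_first()   # a (possibly truncated) book title
    >>> articles[0].xpath('./h3/a/@href').extract_first()    # relative URL of the detail page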

2. Case study

Use Scrapy to analyze and crawl the book information on http://books.toscrape.com/.

(1) Create the project

​ scrapy startproject BookToscrape

(2) Create the spider

Create a spider based on the basic template:

​ scrapy genspider toscrape books.toscrape.com

This creates a spider file toscrape.py under the spiders directory.
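The generated file is based on the basic template and, before any editing, looks roughly like this (exact formatting may differ between Scrapy versions):

    import scrapy


    class ToscrapeSpider(scrapy.Spider):
        name = 'toscrape'
        allowed_domains = ['books.toscrape.com']
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            pass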

(3) Modify the configuration file settings.py

Change two options, USER_AGENT and ROBOTSTXT_OBEY; see day02 for an explanation of the individual configuration options.

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
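As an alternative not shown in the original post, the same two options can also be scoped to a single spider via the custom_settings class attribute instead of editing the global settings.py; a minimal sketch:

    class ToscrapeSpider(scrapy.Spider):
        name = 'toscrape'
        # per-spider overrides applied on top of settings.py
        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
            'ROBOTSTXT_OBEY': False,
        }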

(4) Write the spider logic

​ spiders/toscrape.py

Contents:

    import re

    import scrapy

    # Regex helpers used below; they sit at module level in the original source, so the
    # patterns here are reconstructed from how they are used (an assumption).
    p_book_detail = re.compile(r"^catalogue/")  # hrefs that already carry the catalogue/ prefix
    p_img_pre = re.compile(r"^(\.\./)+")        # image srcs that start with one or more ../
    p_price = re.compile(r"\d+\.\d+")           # numeric part of the price, e.g. 51.77


    class ToscrapeSpider(scrapy.Spider):
        name = 'toscrape'                            # spider name
        allowed_domains = ['books.toscrape.com']     # domains the spider is allowed to crawl
        start_urls = ['http://books.toscrape.com/']  # initial URLs to crawl

        def parse(self, response):
            '''
            The base class scrapy.Spider iterates over start_urls, wraps each URL as
            Request(url, callback=parse) and hands it to the scheduler --> downloader --> parse
            :param response:
            :return:
            '''
            article_list = response.xpath('//article[@class="product_pod"]')
            for article in article_list:
                book_title = article.xpath("./h3/a/text()").extract_first()

                book_detail_url = article.xpath("./h3/a/@href").extract_first()
                if p_book_detail.match(book_detail_url) is None:
                    book_detail = 'http://books.toscrape.com/' + 'catalogue/' + book_detail_url
                else:
                    book_detail = 'http://books.toscrape.com/' + book_detail_url

                book_image = article.xpath("./div[@class='image_container']/a/img/@src").extract_first()
                if p_img_pre.match(book_image) is None:
                    book_image = self.start_urls[0] + book_image
                else:
                    book_image = book_image.split("../")[-1]
                    book_image = self.start_urls[0] + book_image

                book_price = article.xpath("./div[@class='product_price']/p/text()").extract_first()
                book_price = p_price.findall(book_price)[0]

                print(f"book_title:{book_title}, book_detail:{book_detail}, book_image:{book_image},"
                      f" book_price:{book_price}")
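The parse method above only prints the fields. A natural next step, not shown in the original post, is to yield them instead so that Scrapy's feed exports or item pipelines can process them; a minimal sketch of what the end of the loop could look like:

    # inside the for-loop, instead of print(...)
    yield {
        'title': book_title,
        'detail_url': book_detail,
        'image_url': book_image,
        'price': book_price,
    }

With this change, scrapy crawl toscrape -o books.csv would write the extracted records to a CSV file.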

(5) Bring in the debug file books_toscrape/main.py described above

Set breakpoints and run the spider under the debugger.

Reposted from: https://www.cnblogs.com/sunBinary/p/11148341.html
