Python Crawler Framework Scrapy by Example (Scraping Tencent's Experienced-Hire Job Postings and Saving Them to Excel)

2022-05-05

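Preface:

When I was learning to write Python crawlers, I scraped Tencent's experienced-hire careers site. I had not written a crawler for a long time and, on a whim, decided to write one for practice. I remembered that site, opened it, found that the pages had changed, and so rewrote the crawler from scratch. The site is relatively simple, and in my brief testing it had no anti-scraping measures, which makes it a good practice target for beginners.

On the search page, clicking a position in the list opens that position's detail page, and the data on the detail page is what we need to scrape. The results are finally written to an Excel file.

The detailed steps follow, code first. I used PyCharm as the development tool; install the required dependencies yourself (the code below imports scrapy, xlrd, xlwt and xlutils).

1. Create the Scrapy project.

scrapy startproject Tencent

Run the command above in the Terminal and Scrapy creates the project automatically. The directory structure is as follows (ignore the two files outlined in red, which were created by hand):

[Screenshot: project directory structure]

2. In the spiders folder, create the spider file careerDesc.py with the following code: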

# -*- coding: utf-8 -*-
import collections
import json

import scrapy

from ..items import TencentItem


class CareerdescSpider(scrapy.Spider):
    # Spider name
    name = 'careerDesc'
    allowed_domains = ['careers.tencent.com']
    offset = 1
    # Link to the job-list JSON, located with the browser developer tools.
    # We need the postid of each job from it to build the detail link.
    # Each request reads 100 records.
    url = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageSize=100&pageIndex='
    # URL used to fetch the postids
    start_urls = [url + str(offset)]
    # Pipeline for this spider. This is a personal habit: with a single spider
    # in the project it is not needed, with several spiders it is.
    custom_settings = {'ITEM_PIPELINES': {'Tencent.pipelines.TencentPipeline': 300}}
    # URL used to fetch the details of one position
    url_careerDESC = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?postId='

    def parse(self, response):
        '''
        Get the postid of every position on this list page and request its
        detail page, handled by desc_Parse.
        '''
        data = json.loads(response.text)['Data']
        # Total number of positions, used to work out the number of pages
        countNum = data['Count']
        # Build the detail URL from each postid and hand it to desc_Parse
        for each in data['Posts']:
            yield scrapy.Request(self.url_careerDESC + str(each['PostId']),
                                 callback=self.desc_Parse)
        # Page check: 100 records per page, rounded up
        page = (countNum + 99) // 100
        if self.offset < page:
            self.offset += 1
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

    def desc_Parse(self, response):
        '''
        Scrape the detail page and return the item.
        '''
        descjson = json.loads(response.text)['Data']
        # The item behaves like a dict, and a plain dict does not guarantee
        # output order on older Python versions, but we need the fields in the
        # order we define. Converting the item to an OrderedDict keeps that
        # order. An OrderedDict uses roughly twice the memory of a plain dict;
        # I have not thought of a better solution yet.
        item = collections.OrderedDict(TencentItem())
        item['ARecruitPostName'] = descjson['RecruitPostName']
        item['BLocationName'] = descjson['LocationName']
        item['CategoryName'] = descjson['CategoryName']
        item['DResponsibility'] = descjson['Responsibility']
        item['ERequirement'] = descjson['Requirement']
        item['FLastUpdateTime'] = descjson['LastUpdateTime']
        item['GPostURL'] = descjson['PostURL']
        yield item
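The page check in parse amounts to a ceiling division over the total count. Below is a minimal standalone sketch of that arithmetic with no Scrapy involved. The names PAGE_SIZE, total_pages and next_page_url are made up for this illustration and are not part of the spider; only the list URL comes from the code above.

```python
# Hypothetical helpers illustrating the paging arithmetic; not part of the spider.
PAGE_SIZE = 100
LIST_URL = ('https://careers.tencent.com/tencentcareer/api/post/Query'
            '?pageSize={size}&pageIndex={index}')


def total_pages(count, page_size=PAGE_SIZE):
    """Return how many list pages are needed to cover `count` postings."""
    # Integer ceiling division, so the result is the same on Python 2 and 3.
    return (count + page_size - 1) // page_size


def next_page_url(current_page, count, page_size=PAGE_SIZE):
    """Return the URL of the next list page, or None after the last page."""
    if current_page < total_pages(count, page_size):
        return LIST_URL.format(size=page_size, index=current_page + 1)
    return None
```

Returning None after the last page makes the stop condition explicit, so the caller only schedules another list request when there is one left to fetch.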

3. Write the pipeline file

import xlrd
from xlutils.copy import copy


class TencentPipeline(object):
    def process_item(self, item, spider):
        # Open the existing workbook and count the rows already written.
        workbook = xlrd.open_workbook('tencentposition.xls')
        sheets = workbook.sheet_names()
        worksheet = workbook.sheet_by_name(sheets[0])
        rows_count = worksheet.nrows
        # A workbook opened with xlrd is read-only, so make a writable copy.
        new_workbook = copy(workbook)
        new_worksheet = new_workbook.get_sheet(0)
        # Append the item's values as one new row, one column per value.
        for cols, v in enumerate(item.values()):
            new_worksheet.write(rows_count, cols, v)
        new_workbook.save('tencentposition.xls')
        return item
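The pipeline writes item.values() from left to right, so the column order depends entirely on the order of the keys in the item. One alternative, sketched here as an assumption rather than as the author's approach, is to keep an explicit column list and build each row from it. COLUMNS and item_to_row are hypothetical names; the keys are the ones set in desc_Parse.

```python
# Hypothetical helper: build a spreadsheet row in a fixed column order.
COLUMNS = [
    'ARecruitPostName', 'BLocationName', 'CategoryName', 'DResponsibility',
    'ERequirement', 'FLastUpdateTime', 'GPostURL',
]


def item_to_row(item, columns=COLUMNS):
    """Return the item's values ordered by `columns`; missing keys become ''."""
    return [item.get(key, '') for key in columns]


# Illustrative item with made-up values, given in a different key order.
sample = {'GPostURL': 'http://example.com/1', 'ARecruitPostName': 'Engineer'}
row = item_to_row(sample)
```

With a helper like this, process_item could loop over item_to_row(item) instead of item.values(), and the order would no longer depend on how the item was built.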

4. Edit the settings file

ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}

5. For convenience, I wrote a separate script that creates the Excel file, so that it is easy to modify.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#-------------------------------------------------------------------------------
# Name: createExcel
#-------------------------------------------------------------------------------
import xlwt


def CreateExcel():
    file = xlwt.Workbook(encoding='utf-8')
    table = file.add_sheet('TencentPosition', cell_overwrite_ok=True)
    # Header row, in the same order as the fields set by the spider
    table_head = ['Position', 'Location', 'Category', 'Responsibilities',
                  'Requirements', 'LastUpdate', 'PostURL']
    for i in range(len(table_head)):
        table.write(0, i, table_head[i])
    file.save('tencentposition.xls')
    print('created successfully')


if __name__ == '__main__':
    CreateExcel()

6. Add a script that first creates the Excel file and then runs the spider, in that order.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#-------------------------------------------------------------------------------
# Name: entrypoint
#-------------------------------------------------------------------------------
import os
import time

# Create the workbook first, then start the crawl.
os.system('python createExcel.py')
time.sleep(5)
os.system('scrapy crawl careerDesc')
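The script does not check the return value of os.system, so the crawl starts even if the workbook could not be created. Here is a hedged sketch of the same sequence using the standard subprocess module, stopping at the first step that fails. run_steps and PYTHON are hypothetical names; the two commands in the usage comment are the ones from the script above.

```python
import subprocess
import sys

# Use the interpreter that is running this script rather than whatever
# 'python' happens to resolve to on the PATH.
PYTHON = sys.executable


def run_steps(commands):
    """Run each command in order; return False at the first non-zero exit."""
    for cmd in commands:
        # subprocess.call blocks until the command exits and returns its exit code.
        if subprocess.call(cmd) != 0:
            return False
    return True


# Usage, from the project root:
# run_steps([[PYTHON, 'createExcel.py'], ['scrapy', 'crawl', 'careerDesc']])
```

Because each call blocks until its command has finished, no sleep is needed between the two steps in this variant.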

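Running entryPoint.py directly runs the whole program. With that, a simple Scrapy crawler is complete. My knowledge is limited, so comments and discussion are welcome.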
