Learning Crawlers from the Docs (10): Multiprocess Crawlers


The material here is compiled and adapted from the source article on multiprocess crawlers.[1]

Introduction to multiprocessing

A process is essentially one running program. By running a single crawler script as several processes, pages can be fetched in parallel, which improves crawling efficiency.

The difference between processes and threads

- A thread is the smallest unit of program execution, while a process is the smallest unit of resource allocation by the operating system.
- A process consists of one or more threads; threads are different execution paths through the code of a single process.
- Processes are independent of one another, but the threads within one process share that process's memory space (code segment, data, heap, etc.) and some process-level resources (such as open files and signals); the threads of one process are not visible to other processes.
- Scheduling and switching: a thread context switch is much faster than a process context switch.
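A minimal sketch that makes the shared-memory difference visible (the counter variable and bump function are illustrative names invented here, not from the original article): a thread increments the variable inside the parent's own memory, while a child process only increments its private copy.

```python
from multiprocessing import Process
from threading import Thread

counter = 0  # module-level variable in the main process

def bump():
    global counter
    counter += 1

if __name__ == "__main__":
    t = Thread(target=bump)
    t.start(); t.join()
    print('after thread:', counter)   # 1 -- the thread shares this process's memory

    p = Process(target=bump)
    p.start(); p.join()
    print('after process:', counter)  # still 1 -- the child only changed its own copy
```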
How to use multiprocessing

Use the Pool class from the multiprocessing module to build a process pool, then call pool.map(func, iterable), where func is the function to run and iterable is an iterable object such as a list.

```python
from multiprocessing import Pool

pool = Pool(processes=4)
pool.map(func, iterable)
```
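As a minimal, self-contained sketch of this call pattern (the square function and the numbers list are placeholders invented here):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]
    with Pool(processes=4) as pool:          # workers are cleaned up when the block exits
        results = pool.map(square, numbers)  # distributes the calls across worker processes
    print(results)  # [1, 4, 9, 16, 25]
```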

Example

Crawl the pages at https://www.qiushibaike.com/hot/page/ and extract the username, the post content, the number of "funny" votes, and the number of comments for each item. (Note: the original author's link now mostly shows video content, and the way the data has to be extracted has changed considerably.)

Import the modules and set the request headers:
```python
import re
import time
from multiprocessing import Pool

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
}
```

Use regular expressions to find the corresponding content in the HTML:

```python
def re_scraper(url):
    res = requests.get(url, headers=headers)
    names = re.findall('<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # 好笑 / 评论 are the page's own Chinese labels for "funny" and "comments"
    laughs = re.findall('<i class="number">(\d+)</i> 好笑', res.text, re.S)
    comments = re.findall('<i class="number">(\d+)</i> 评论', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name.strip(),
            'content': content.strip(),
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos
```
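To see what these patterns capture without hitting the site (which, as noted above, has changed), here is a small offline sketch; the HTML fragment is made up to mirror the markup the patterns expect:

```python
import re

# Invented fragment shaped like the old markup: a username in <h2> and a vote count
sample = '''
<h2> 某用户 </h2>
<div class="content"><span> 一条段子 </span></div>
<i class="number">123</i> 好笑
'''

names = re.findall('<h2>(.*?)</h2>', sample, re.S)
laughs = re.findall(r'<i class="number">(\d+)</i> 好笑', sample, re.S)
print(names[0].strip(), laughs[0])  # 某用户 123
```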

Correspondingly, the code that finds the same content with CSS selectors is:

```python
from bs4 import BeautifulSoup

def CSS_scraper(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    names = soup.select('.article.block.untagged.mb15.typs_hot > .author.clearfix > a > h2')
    contents = soup.select('.content > span')
    laughs = soup.select('.stats-vote > i')
    comments = soup.select('.stats-comments > a > i')
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        data = {
            'name': name.get_text().strip(),
            'content': content.get_text().strip(),
            'laugh': laugh.get_text(),
            'comment': comment.get_text()
        }
        infos.append(data)
    return infos
```
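The selectors can also be tried offline; the fragment below is invented here and only reuses the class names the selectors above assume:

```python
from bs4 import BeautifulSoup

sample = '''
<div class="content"><span> a short joke </span></div>
<div class="stats-vote"><i class="number">123</i></div>
'''
soup = BeautifulSoup(sample, 'lxml')  # same lxml parser used in the article's code
print(soup.select('.content > span')[0].get_text().strip())  # a short joke
print(soup.select('.stats-vote > i')[0].get_text())          # 123
```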

Generate a list of URLs and it can be fed to Pool.map. The regex version is used below; to switch to the CSS-selector version, just change the function name. The complete code is:

```python
import re
import time
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
}


def re_scraper(url):
    res = requests.get(url, headers=headers)
    names = re.findall('<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    laughs = re.findall('<i class="number">(\d+)</i> 好笑', res.text, re.S)
    comments = re.findall('<i class="number">(\d+)</i> 评论', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name.strip(),
            'content': content.strip(),
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos


def CSS_scraper(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    names = soup.select('.article.block.untagged.mb15.typs_hot > .author.clearfix > a > h2')
    contents = soup.select('.content > span')
    laughs = soup.select('.stats-vote > i')
    comments = soup.select('.stats-comments > a > i')
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        data = {
            'name': name.get_text().strip(),
            'content': content.get_text().strip(),
            'laugh': laugh.get_text(),
            'comment': comment.get_text()
        }
        infos.append(data)
    return infos


if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/hot/page/{}/'.format(str(i)) for i in range(1, 36)]

    # Serial baseline
    start_1 = time.time()
    for url in urls:
        re_scraper(url)
    end_1 = time.time()
    print('Serial crawler time:', end_1 - start_1)

    # Two worker processes
    start_2 = time.time()
    pool = Pool(processes=2)
    pool.map(re_scraper, urls)
    end_2 = time.time()
    print('2-process crawler time:', end_2 - start_2)

    # Four worker processes
    start_3 = time.time()
    pool = Pool(processes=4)
    pool.map(re_scraper, urls)
    end_3 = time.time()
    print('4-process crawler time:', end_3 - start_3)

    """
    for url in urls:
        CSS_scraper(url)
    """
```
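One detail worth tidying: the script above never releases its pools. A variant of the timing loop that closes and joins each pool explicitly might look like this (a sketch that reuses time, Pool and re_scraper from the script above):

```python
if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/hot/page/{}/'.format(i) for i in range(1, 36)]
    for n in (2, 4):
        start = time.time()
        pool = Pool(processes=n)
        pool.map(re_scraper, urls)
        pool.close()   # no more tasks will be submitted to the pool
        pool.join()    # wait for all worker processes to exit
        print('{}-process crawler time:'.format(n), time.time() - start)
```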

The results are:

```
Serial crawler time: 9.279427289962769
2-process crawler time: 5.588379383087158
4-process crawler time: 2.7787296772003174
```

Multiprocessing clearly reduces the running time!


[1] http://f61be319.wiz03.com/share/s/3S6-cp1BIQ952yXKyj02PIM40aYCtH24pQqN2jqDzA055fdG

