Notes on multi-process crawlers, organized from the source below.[1]
A process is essentially a running program (e.g. one running script file). Running several processes at once can improve a crawler's throughput.
Use the `Pool` class from the `multiprocessing` module to build a process pool, then call `pool.map(func, iterable)`, where `func` is the function to apply and `iterable` is an iterable object such as a list.
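As a minimal, self-contained sketch of this pattern (using a stand-in `square` function in place of a scraper):

```python
from multiprocessing import Pool

def square(x):
    # Stand-in for a scraping function such as re_scraper below.
    return x * x

if __name__ == '__main__':
    # map() distributes the items across 4 worker processes and
    # returns the results in the original input order.
    with Pool(processes=4) as pool:
        print(pool.map(square, [1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

The `if __name__ == '__main__'` guard matters on platforms that spawn rather than fork worker processes (e.g. Windows), because each worker re-imports the module.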
```python
from multiprocessing import Pool

pool = Pool(processes=4)
pool.map(func, iterable)
```

We crawl the pages at https://www.qiushibaike.com/hot/page/ and extract each post's username, content, laugh count, and comment count. (Note: the original author's link now mainly serves video content, so the extraction has changed considerably.)
Extracting the corresponding fields from the HTML with regular expressions:
```python
def re_scraper(url):
    res = requests.get(url, headers=headers)
    # Raw strings avoid invalid-escape warnings for \d in the patterns
    names = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    laughs = re.findall(r'<i class="number">(\d+)</i> 好笑', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i> 评论', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name.strip(),
            'content': content.strip(),
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos
```

Correspondingly, the same extraction using CSS selectors:
```python
def CSS_scraper(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    names = soup.select('.article.block.untagged.mb15.typs_hot > .author.clearfix > a > h2')
    contents = soup.select('.content > span')
    laughs = soup.select('.stats-vote > i')
    comments = soup.select('.stats-comments > a > i')
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        data = {
            'name': name.get_text().strip(),
            'content': content.get_text().strip(),
            'laugh': laugh.get_text(),
            'comment': comment.get_text()
        }
        infos.append(data)
    return infos
```

Generate a list of URLs and feed it to `Pool.map`. The timing runs below use the regex version; to switch to the CSS version, just change the function name. The full code:
```python
import re
import time
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
}

def re_scraper(url):
    res = requests.get(url, headers=headers)
    names = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    laughs = re.findall(r'<i class="number">(\d+)</i> 好笑', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i> 评论', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name.strip(),
            'content': content.strip(),
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos

def CSS_scraper(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    names = soup.select('.article.block.untagged.mb15.typs_hot > .author.clearfix > a > h2')
    contents = soup.select('.content > span')
    laughs = soup.select('.stats-vote > i')
    comments = soup.select('.stats-comments > a > i')
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        data = {
            'name': name.get_text().strip(),
            'content': content.get_text().strip(),
            'laugh': laugh.get_text(),
            'comment': comment.get_text()
        }
        infos.append(data)
    return infos

if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/hot/page/{}/'.format(str(i)) for i in range(1, 36)]
    start_1 = time.time()
    for url in urls:
        re_scraper(url)
    end_1 = time.time()
    print('Serial crawler time:', end_1 - start_1)
    start_2 = time.time()
    pool = Pool(processes=2)
    pool.map(re_scraper, urls)
    end_2 = time.time()
    print('2-process crawler time:', end_2 - start_2)
    start_3 = time.time()
    pool = Pool(processes=4)
    pool.map(re_scraper, urls)
    end_3 = time.time()
    print('4-process crawler time:', end_3 - start_3)
    """
    for url in urls:
        CSS_scraper(url)
    """
```

The output:
```
Serial crawler time: 9.279427289962769
2-process crawler time: 5.588379383087158
4-process crawler time: 2.7787296772003174
```

The multi-process runs clearly cut the running time!
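Both extraction approaches can be sanity-checked offline against a hand-written HTML fragment that mimics the old qiushibaike markup (an assumed structure, since the live site has since changed), without any network requests:

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment shaped like the markup the scrapers expect.
html = '''
<div class="article block untagged mb15 typs_hot">
  <div class="author clearfix"><a><h2>
user_a
</h2></a></div>
  <div class="content"><span>
a joke
</span></div>
  <div class="stats-vote"><i class="number">120</i> 好笑</div>
  <div class="stats-comments"><a><i class="number">8</i> 评论</a></div>
</div>
'''

# Regex version: same patterns as re_scraper.
names = re.findall(r'<h2>(.*?)</h2>', html, re.S)
laughs = re.findall(r'<i class="number">(\d+)</i> 好笑', html, re.S)
print(names[0].strip(), laughs[0])  # user_a 120

# CSS-selector version: same selectors as CSS_scraper;
# 'html.parser' is used here to avoid the lxml dependency.
soup = BeautifulSoup(html, 'html.parser')
name = soup.select('.article.block.untagged.mb15.typs_hot > .author.clearfix > a > h2')[0]
laugh = soup.select('.stats-vote > i')[0]
print(name.get_text().strip(), laugh.get_text())  # user_a 120
```

This kind of fixture test is handy when the target site changes: the selectors and patterns can be validated against a saved page before running the full crawl.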
http://f61be319.wiz03.com/share/s/3S6-cp1BIQ952yXKyj02PIM40aYCtH24pQqN2jqDzA055fdG ↩︎