Notes on multi-process crawlers, organized from the source below.[1]
A process is essentially a running program (e.g. one running script file). Running several processes at once can improve a crawler's throughput.
Use the `Pool` class from the `multiprocessing` module to build a process pool, then call `pool.map(func, iterable)`, where `func` is the function to apply and `iterable` is an iterable object such as a list.
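As a minimal, self-contained sketch of this pattern (using a stand-in `square` function in place of a scraper):

```python
from multiprocessing import Pool

def square(x):
    # Stand-in for a scraping function such as re_scraper below.
    return x * x

if __name__ == '__main__':
    # map() distributes the items across 4 worker processes and
    # returns the results in the original input order.
    with Pool(processes=4) as pool:
        print(pool.map(square, [1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

The `if __name__ == '__main__'` guard matters on platforms that spawn rather than fork worker processes (e.g. Windows), because each worker re-imports the module.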
```python
from multiprocessing import Pool

pool = Pool(processes=4)
pool.map(func, iterable)
```

We crawl the pages at https://www.qiushibaike.com/hot/page/ and extract each post's username, content, laugh count, and comment count. (Note: the original author's link now mainly serves video content, so the extraction has changed considerably.)
Extracting the corresponding fields from the HTML with regular expressions:
```python
def re_scraper(url):
    res = requests.get(url, headers=headers)
    # Raw strings avoid invalid-escape warnings for \d in the patterns
    names = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    laughs = re.findall(r'<i class="number">(\d+)</i> 好笑', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i> 评论', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name.strip(),
            'content': content.strip(),
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos
```

Correspondingly, the same extraction using CSS selectors:
```python
def CSS_scraper(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    names = soup.select('.article.block.untagged.mb15.typs_hot > .author.clearfix > a > h2')
    contents = soup.select('.content > span')
    laughs = soup.select('.stats-vote > i')
    comments = soup.select('.stats-comments > a > i')
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        data = {
            'name': name.get_text().strip(),
            'content': content.get_text().strip(),
            'laugh': laugh.get_text(),
            'comment': comment.get_text()
        }
        infos.append(data)
    return infos
```

Generate a list of URLs and feed it to `Pool.map`. The timing runs below use the regex version; to switch to the CSS version, just change the function name. The full code:
```python
import re
import time
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
}

def re_scraper(url):
    res = requests.get(url, headers=headers)
    names = re.findall(r'<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    laughs = re.findall(r'<i class="number">(\d+)</i> 好笑', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i> 评论', res.text, re.S)
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name.strip(),
            'content': content.strip(),
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos

def CSS_scraper(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    names = soup.select('.article.block.untagged.mb15.typs_hot > .author.clearfix > a > h2')
    contents = soup.select('.content > span')
    laughs = soup.select('.stats-vote > i')
    comments = soup.select('.stats-comments > a > i')
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        data = {
            'name': name.get_text().strip(),
            'content': content.get_text().strip(),
            'laugh': laugh.get_text(),
            'comment': comment.get_text()
        }
        infos.append(data)
    return infos

if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/hot/page/{}/'.format(str(i)) for i in range(1, 36)]
    start_1 = time.time()
    for url in urls:
        re_scraper(url)
    end_1 = time.time()
    print('Serial crawler time:', end_1 - start_1)
    start_2 = time.time()
    pool = Pool(processes=2)
    pool.map(re_scraper, urls)
    end_2 = time.time()
    print('2-process crawler time:', end_2 - start_2)
    start_3 = time.time()
    pool = Pool(processes=4)
    pool.map(re_scraper, urls)
    end_3 = time.time()
    print('4-process crawler time:', end_3 - start_3)
    """
    for url in urls:
        CSS_scraper(url)
    """
```

The output:
```
Serial crawler time: 9.279427289962769
2-process crawler time: 5.588379383087158
4-process crawler time: 2.7787296772003174
```

The multi-process runs clearly cut the running time!
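Both extraction approaches can be sanity-checked offline against a hand-written HTML fragment that mimics the old qiushibaike markup (an assumed structure, since the live site has since changed), without any network requests:

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment shaped like the markup the scrapers expect.
html = '''
<div class="article block untagged mb15 typs_hot">
  <div class="author clearfix"><a><h2>
user_a
</h2></a></div>
  <div class="content"><span>
a joke
</span></div>
  <div class="stats-vote"><i class="number">120</i> 好笑</div>
  <div class="stats-comments"><a><i class="number">8</i> 评论</a></div>
</div>
'''

# Regex version: same patterns as re_scraper.
names = re.findall(r'<h2>(.*?)</h2>', html, re.S)
laughs = re.findall(r'<i class="number">(\d+)</i> 好笑', html, re.S)
print(names[0].strip(), laughs[0])  # user_a 120

# CSS-selector version: same selectors as CSS_scraper;
# 'html.parser' is used here to avoid the lxml dependency.
soup = BeautifulSoup(html, 'html.parser')
name = soup.select('.article.block.untagged.mb15.typs_hot > .author.clearfix > a > h2')[0]
laugh = soup.select('.stats-vote > i')[0]
print(name.get_text().strip(), laugh.get_text())  # user_a 120
```

This kind of fixture test is handy when the target site changes: the selectors and patterns can be validated against a saved page before running the full crawl.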
http://f61be319.wiz03.com/share/s/3S6-cp1BIQ952yXKyj02PIM40aYCtH24pQqN2jqDzA055fdG ↩︎