Big thanks to eqblog, I've finally learned to write a crawler too, hahaha
My skills are still pretty basic, haha, but getting a crawler running was a lot of fun. I found a simple site to practice on: http://www.mdyuepai.com/
import re
import os
import requests
from hashlib import md5
from multiprocessing import Pool
from requests.exceptions import RequestException


def get_page_index(offset):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }
    url = 'http://www.mdyuepai.com/?page=' + str(offset)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting index page')
        return None


def parse_index_page(html):
    # NOTE: the original regex was swallowed by the forum's HTML filter.
    # It should capture the relative links to the detail pages on the index page.
    pattern = re.compile('', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield item


def get_page_detail(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting detail page')
        return None


def save_image(content):
    # Name the file after the MD5 of its content to avoid duplicate downloads
    file_path = '{0}/{1}.{2}'.format('/home', md5(content).hexdigest(), 'jpg')
    print(file_path)
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)


def download_image(url):
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        return None


def parse_page_detail(html):
    pattern = re.compile('class="item_infor_img.*?src="(.*?)".*?', re.S)
    images = re.findall(pattern, html)
    for image in images:
        download_image(image)


def main(offset):
    # Crawl a single index page so the pool can hand out one offset per worker
    html = get_page_index(offset)
    if not html:
        return
    for item in parse_index_page(html):
        url = 'http://www.mdyuepai.com/' + item
        html2 = get_page_detail(url)
        if html2:
            parse_page_detail(html2)


if __name__ == '__main__':
    # The original pool.map(main()) raised a TypeError: map needs the function
    # itself plus an iterable of arguments
    pool = Pool()
    pool.map(main, range(20))
    pool.close()
    pool.join()
The rabbit guru? Respect.
The site being crawled loads a bit slowly for me~
Mishaelre posted on 2018-6-3 23:54
The rabbit guru?
What rabbit? Is Rabbit really his nickname?
His handle is Fat Rabbit, because of his avatar~
Multiprocessing doesn't gain you much in a crawler; switch to multithreading instead, as sketched below.
Post #8 is right.
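A minimal sketch of that suggestion, using the standard concurrent.futures module; the crawl_page helper and the 20-page count are assumptions for illustration, not part of the original post:
from concurrent.futures import ThreadPoolExecutor

import requests


def crawl_page(offset):
    # One index page per task; threads overlap the network waits,
    # which is where an I/O-bound crawler actually spends its time
    url = 'http://www.mdyuepai.com/?page=' + str(offset)
    response = requests.get(url, timeout=10)
    return offset, response.status_code


if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=8) as executor:
        for offset, status in executor.map(crawl_page, range(20)):
            print(offset, status)
The per-page parsing and image downloads would slot into crawl_page exactly where main(offset) does that work above.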
Just opened your thread and spotted an actual bug crawling across my screen.
What you've got there isn't a thread pool, though.
How did you learn this, boss?
Also, the photo-shoot-booking theme reminds me of 小鸟酱.
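For what it's worth, the standard library also ships a drop-in thread-pool variant with the same map API, so the switch is a one-line import change; a minimal sketch, where the fetch helper is an assumed placeholder rather than anything from the thread:
from multiprocessing.dummy import Pool  # thread pool, not process pool


def fetch(offset):
    # Placeholder for the per-page work done by main(offset) above
    return offset * 2


if __name__ == '__main__':
    pool = Pool(8)  # 8 worker threads
    print(pool.map(fetch, range(20)))
    pool.close()
    pool.join()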