Thanks to eqblog, I've learned to write a web crawler too, hahaha

流量之神 posted on 2018-6-3 23:53:59

Last edited by 流量之神 on 2018-6-4 01:08

I'm still very much a beginner, haha, but it's fun to see the crawler actually run. I picked a simple site to practice on: http://www.mdyuepai.com/

import os
import re
from hashlib import md5
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException


def get_page_index(offset):
    # Fetch one listing page; return its HTML, or None on any failure.
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }
    url = 'http://www.mdyuepai.com/?page=' + str(offset)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting index page')
        return None


def parse_index_page(html):
    # The regex that captures each detail-page path is empty in the
    # archived post, so it is left empty here; fill it in before running.
    pattern = re.compile('', re.S)
    for item in re.findall(pattern, html):
        yield item


def get_page_detail(url):
    # Fetch one detail page; return its HTML, or None on any failure.
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        # requests' errors do not derive from the built-in ConnectionError,
        # so catch RequestException instead.
        print('Error occurred')
        return None


def save_image(content):
    # Name the file after the MD5 of its bytes so duplicates are skipped.
    file_path = os.path.join('/home/', md5(content).hexdigest() + '.jpg')
    print(file_path)
    if not os.path.exists(file_path):
        # The with block closes the file; no explicit close() is needed.
        with open(file_path, 'wb') as f:
            f.write(content)


def download_image(url):
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
    except RequestException:
        pass


def parse_page_detail(html):
    # Download every image whose tag carries the item_infor_img class.
    pattern = re.compile('class="item_infor_img.*?src="(.*?)"', re.S)
    for image in re.findall(pattern, html):
        download_image(image)


def main(offset):
    # Crawl one listing page and every detail page it links to.
    html = get_page_index(offset)
    if html is None:
        return
    for item in parse_index_page(html):
        url = 'http://www.mdyuepai.com/' + item
        detail_html = get_page_detail(url)
        if detail_html is not None:
            parse_page_detail(detail_html)


if __name__ == '__main__':
    # Pool.map takes the function and an iterable. The original
    # pool.map(main()) ran main serially and then raised TypeError.
    with Pool() as pool:
        pool.map(main, range(20))

Mishaelre posted on 2018-6-3 23:54:49

The expert they call Rabbit?

a1438861827 posted on 2018-6-3 23:55:38

I bow down to you.

cdseoo posted on 2018-6-3 23:57:25

The site being crawled is a bit slow to load~

yugan300 posted on 2018-6-4 00:06:22

Mishaelre posted on 2018-6-3 23:54

The expert they call Rabbit?
What rabbit? Is his nickname Rabbit?

流量之神 posted on 2018-6-4 00:09:03

His nickname is Fat Rabbit (胖兔子), because of that avatar of his~

TozFly posted on 2018-6-3 23:54:00

Multiprocessing does very little for a crawler; switch it to multithreading.
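
A minimal sketch of that suggestion, assuming the per-page work is wrapped in a function that takes a page offset. A crawler spends most of its time waiting on the network, so a pool of threads is usually enough. The names `crawl_pages` and `fake_worker` are hypothetical and are not from the original post; `fake_worker` only stands in for the real fetch-and-parse step so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_pages(worker, offsets, max_workers=8):
    """Run worker(offset) for every offset on a pool of threads.

    executor.map returns results in the same order as offsets.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, offsets))


def fake_worker(offset):
    # Hypothetical stand-in: a real worker would fetch and parse
    # listing page `offset` here.
    return 'page-{}'.format(offset)


if __name__ == '__main__':
    print(crawl_pages(fake_worker, range(3)))
```

With the crawler above, the call would look like `crawl_pages(main, range(20))`, provided `main` accepts a single offset argument.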

wenguonideshou posted on 2018-6-4 00:22:29

Reply #8 is right.

fei2018 posted on 2018-6-4 00:09:00

As soon as I opened your thread, I noticed a bug crawling across my screen.
By the way, what you have there isn't a thread pool.

suantong posted on 2018-6-5 01:12:00

How did you learn this? Also, the photo-shoot booking (约拍) site reminds me of 小鸟酱.


全球主机交流论坛's Archiver
