爬取免费代理IP - outovoの奇妙站点

介绍

每次爬网站的时候总是被一些网站的反爬机制给封IP，所以就需要一些代理IP，但是很多代理IP都要钱，不要钱的很多不能用，所以就写了这么个代码来爬取代理IP

思路

确定爬取的url路径，headers参数
发送请求 – requests 模拟浏览器发送请求，获取响应数据
解析数据 – parsel 转化为Selector对象，Selector对象具有xpath的方法，能够对转化的数据进行处理
保存数据

准备

PYthon3.7
pycharm （其他的编辑器也可以）
模块 requests parsel time(安装模块指令pip install requests && pip install parsel)

目标网站

https://www.kuaidaili.com/free/fps/

步骤

第一步

导入模块，确定爬取的url路径，headers参数

import requests
import parsel
import time

base_url = 'https://www.kuaidaili.com/free/inha/1/'
headers = {
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}

第二步

发送请求 – requests 模拟浏览器发送请求，获取响应数据

response = requests.get(base_url, headers=headers)
data = response.text

第三步

解析数据 – parsel 转化为Selector对象，Selector对象具有xpath的方法，能够对转化的数据进行处理

html_data = parsel.Selector(data)
parse_list = html_data.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')  # 返回Selector对象

第四步

循环遍历，二次提取，构建代理ip字典

	for tr in parse_list:
		proxies_dict = {}
		http_type = tr.xpath('./td[4]/text()').extract_first()
		ip_num = tr.xpath('./td[1]/text()').extract_first()
		port_num = tr.xpath('./td[2]/text()').extract_first()
		proxies_dict[http_type] = ip_num + ':' + port_num
		print(proxies_dict)
		proxies_list.append(proxies_dict)
		time.sleep(0.5)
print(proxies_list)
print("获取到的代理ip数量：", len(proxies_list), '个')

第五步

检测代理ip可用性，用获取到的IP访问百度或者其他网站，就可以检测其可用性

def check_ip(proxies_list):
    """检测ip的方法"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}
 
    can_use = []
    for proxy in proxies_list:
        try:
            response = requests.get('http://www.baidu.com', headers=headers, proxies=proxy, timeout=0.1)  # 超时报错
            if response.status_code == 200:
                can_use.append(proxy)
        except Exception as error:
            print(error)
        finally:
            print("当前ip：", proxy, '检测完成')
 
    return can_use

打印出数据

can_use = check_ip(proxies_list)
print("能用的代理：", can_use)
print("能用的代理数量：", len(can_use))

完整代码

import requests
import parsel
import time
 
 
def check_ip(proxies_list):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}
 
    can_use = []
    for proxy in proxies_list:
        try:
            response = requests.get('http://www.baidu.com', headers=headers, proxies=proxy, timeout=0.1)  # 超时报错
            if response.status_code == 200:
                can_use.append(proxy)
        except Exception as error:
            print(error)
        finally:
            print("当前ip：", proxy, '检测完成')
 
    return can_use
 
 
proxies_list = []
for page in range(1, 10): #更换数字，选择爬取页数
    print('++++++++++++++++++++++++++++正在爬取第{}页数据+++++++++++++++++++++++++++++'.format(page))
    base_url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}

    response = requests.get(base_url, headers=headers)
    data = response.text
    html_data = parsel.Selector(data)
    parse_list = html_data.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')  # 返回Selector对象
    for tr in parse_list:
        proxies_dict = {}
        http_type = tr.xpath('./td[4]/text()').extract_first()
        ip_num = tr.xpath('./td[1]/text()').extract_first()
        port_num = tr.xpath('./td[2]/text()').extract_first()
        proxies_dict[http_type] = ip_num + ':' + port_num
        print(proxies_dict)
        proxies_list.append(proxies_dict)
        time.sleep(0.5)
 
print(proxies_list)
print("获取到的代理ip数量：", len(proxies_list), '个')
can_use = check_ip(proxies_list)
print("能用的代理：", can_use)
print("能用的代理数量：", len(can_use))

食用方法

就拿我们经常使用的 requests 库来说

使用代理 ip 方法如下

定义代理IP

proxies = {
'http' : 'http://xx.xxx.xxx.xxx:xxxx',
'http' : 'http://xxx.xx.xx.xxx:xxx',
....
}

使用代理

response = requests.get(url,proxies=proxies)

和请求头放一起

接下来就可以创建一个属于自己的IP池了

Nangua

1 月前
2024-6-26 3:02:52

这篇博客很实用，它教我们怎样用Python程序找和检查免费的代理IP，这样我们就可以访问那些不让我们进入的网站。对于做数据工作和编程的人来说，这很有帮助，因为它不仅告诉我们怎么做，还展示了怎么用这些代理IP。
但是，用免费的代理IP并不总是最好的办法。这些免费的代理质量不一，有时候用起来不太稳定。在需要很多数据的时候，它们可能不够好用。虽然博客里的方法一时可以解决问题，但对于那些需要稳定和持续收集数据的公司来说，可能不够理想。
这时候，考虑用像123Proxy.cn这样的专业服务可能更合适。这家公司提供很多稳定的代理IP，覆盖180多个国家，可以帮助做大数据分析和跨国电商。用这种服务，公司可以得到更稳定、更可靠的数据。

这篇博客很实用，它教我们怎样用Python程序找和检查免费的代理IP，这样我们就可以访问那些不让我们进入的网站。对于做数据工作和编程的人来说，这很有帮助，因为它不仅告诉我们怎么做，还展示了怎么用这些代理IP。但是，用免费的代理IP并不总是最好的办法。这些免费的代理质量不一，有时候用起来不太稳定。在需要很多数据的时候，它们可能不够好用。虽然博客里的方法一时可以解决问题，但对于那些需要稳定和持续收集数据的公司来说，可能不够理想。这时候，考虑用像[123Proxy.cn](https://www.123proxy.cn)这样的专业服务可能更合适。这家公司提供很多稳定的代理IP，覆盖180多个国家，可以帮助做大数据分析和跨国电商。用这种服务，公司可以得到更稳定、更可靠的数据。

介绍

思路

准备

目标网站

步骤

第一步

第二步

第三步

第四步

第五步

完整代码

食用方法

评论

发送评论编辑评论

介绍

思路

准备

目标网站

步骤

第一步

第二步

第三步

第四步

第五步

完整代码

食用方法

评论

发送评论 编辑评论

发送评论编辑评论