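Contents
- Regular workflow
- Attaching to an existing browser
- The user opens the browser
- The program attaches to the browser
- Notes
- Worked example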
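When scraping data with selenium, the usual workflow has selenium launch the browser itself and carry out every step from start to finish. Sometimes, however, we want the user to open the browser, go to the target page, and complete login and authentication by hand (username, password, SMS code, and the various image CAPTCHAs that are hard to automate), and only then let selenium take over from the logged-in page and scrape the data. How can these two halves be joined together?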
Regular workflow
The regular workflow usually looks like the following: set the initial options, then call the get method to open a page.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class DriverClass:
    def __init__(self):
        self.driver = self._init_driver()

    def _init_driver(self):
        option = webdriver.ChromeOptions()
        # Hide the "controlled by automated test software" hints.
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        # Turn off the password-manager prompts.
        prefs = dict()
        prefs['credentials_enable_service'] = False
        prefs['profile.password_manager_enabled'] = False
        prefs['profile.name'] = "Person 1"
        option.add_experimental_option('prefs', prefs)
        option.add_argument('--disable-gpu')
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
        option.add_argument('--no-sandbox')
        option.add_argument('--ignore-certificate-errors')
        # Selenium 4 takes the chromedriver path through a Service object.
        driver = webdriver.Chrome(service=Service(r"./driver/chromedriver.exe"), options=option)
        driver.implicitly_wait(2)
        driver.maximize_window()
        return driver

    def get_driver(self) -> webdriver.Chrome:
        if isinstance(self.driver, webdriver.Chrome):
            return self.driver
        raise Exception('Failed to initialize the browser')


if __name__ == '__main__':
    dc = DriverClass()
    driver = dc.get_driver()
    print(driver)
    driver.get("https://www.baidu.com")
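Attaching to an existing browser
Attaching works by using the same port on both sides: the user starts the browser with a given remote-debugging port, and selenium connects to that same port (otherwise selenium has no way of knowing which browser instance to take over).
The user opens the browser
When opening the browser manually, the user specifies the port (9527 here) and a user data directory (any directory of your choice).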
C:\Program Files\Google\Chrome\Application>chrome.exe --remote-debugging-port=9527 --user-data-dir="E:\lky_project\tmp_project\handle_qcc_data\chrome_user_data"
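Running the command above opens a new browser window.
Once the browser is open, the user can go to the required page by hand and complete the login, authentication, and any other manual steps.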
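If typing the command by hand is inconvenient, the same launch can be scripted. The following is only a minimal sketch: `build_chrome_command` and `launch_chrome` are hypothetical helper names (not part of selenium), and the Chrome path and data directory are placeholders you would replace with your own.

```python
import subprocess


def build_chrome_command(chrome_path, port, user_data_dir):
    """Return the argument list that starts Chrome with remote debugging enabled."""
    return [
        str(chrome_path),
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


def launch_chrome(chrome_path, port, user_data_dir):
    """Start Chrome in the background and return the process handle."""
    return subprocess.Popen(build_chrome_command(chrome_path, port, user_data_dir))


if __name__ == "__main__":
    # Placeholder locations; adjust both to your own machine.
    command = build_chrome_command(
        r"C:\Program Files\Google\Chrome\Application\chrome.exe",
        9527,
        r"E:\chrome_user_data",
    )
    print(command)
```

Passing the arguments as a list means the path needs no extra quoting even when it contains spaces.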
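The program attaches to the browser
selenium attaches to the browser that the user has already opened on the given port by adding the following option:
option.add_experimental_option("debuggerAddress", "127.0.0.1:9527")
After that, the program can use the browser handle to carry on with the remaining tasks.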
driver_class.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class DriverClass:
    def __init__(self):
        self.driver = self._init_driver()

    def _init_driver(self):
        option = webdriver.ChromeOptions()
        # The options below only apply when chromedriver launches Chrome itself,
        # so they are commented out when attaching to a running browser.
        # option.add_experimental_option('excludeSwitches', ['enable-automation'])
        # option.add_experimental_option('useAutomationExtension', False)
        # prefs = dict()
        # prefs['credentials_enable_service'] = False
        # prefs['profile.password_manager_enabled'] = False
        # prefs['profile.name'] = "Person 1"
        # option.add_experimental_option('prefs', prefs)
        option.add_argument('--disable-gpu')
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
        option.add_argument('--no-sandbox')
        option.add_argument('--ignore-certificate-errors')
        # Attach to the browser the user started on port 9527.
        option.add_experimental_option("debuggerAddress", "127.0.0.1:9527")
        # Selenium 4 takes the chromedriver path through a Service object.
        driver = webdriver.Chrome(service=Service(r"./driver/chromedriver.exe"), options=option)
        driver.implicitly_wait(2)
        driver.maximize_window()
        return driver

    def get_driver(self) -> webdriver.Chrome:
        if isinstance(self.driver, webdriver.Chrome):
            return self.driver
        raise Exception('Failed to initialize the browser')


if __name__ == '__main__':
    dc = DriverClass()
    driver = dc.get_driver()
    print(driver)
    # From here on, the program uses the attached browser handle `driver`.
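Attaching fails if nothing is listening on the debugging port, so it can help to check the port before creating the driver. The sketch below uses only the standard library; `debug_port_is_open` is a hypothetical helper name, and the default host and port simply mirror the 127.0.0.1:9527 address used in this article.

```python
import socket


def debug_port_is_open(host="127.0.0.1", port=9527, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if debug_port_is_open():
        print("Browser found on port 9527, safe to attach.")
    else:
        print("No browser on port 9527; start Chrome with --remote-debugging-port=9527 first.")
```

This only confirms that the port accepts connections; it does not verify that the listener is actually Chrome.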
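Notes
Note that in the attach version of the function above, some of the option settings are commented out. Because attaching takes over a browser that is already running, those settings were fixed when the user launched the browser and can no longer be applied again through the attach call.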
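Worked example
For example, after manually opening a browser on port 9527, log in to Qichacha (企查查) and go to the advanced search page, then use the program to fetch the number of companies holding each qualification (overly frequent requests may trigger verification or an account ban, so proceed with caution!). The final result is written to data.json. The run may be interrupted partway by an exception, so the script below uses data.json to resume: on a later run, only the qualifications that have not yet been queried are looked up.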
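The resume idea can be shown on its own, independent of the browser. This is a minimal sketch under assumptions: `load_progress`, `save_progress`, and `run_with_resume` are hypothetical names, the data is a flat dictionary rather than the nested tree used in the real script, and `query` stands in for whatever function does the actual lookup.

```python
import json
from pathlib import Path


def load_progress(path):
    """Load previously saved results, or start empty if the file is missing or unreadable."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_progress(path, data):
    """Write the results collected so far as readable UTF-8 JSON."""
    Path(path).write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")


def run_with_resume(names, query, path):
    """Query only the names without a stored result; save even if a query fails."""
    data = load_progress(path)
    try:
        for name in names:
            if not data.get(name):
                data[name] = query(name)
    finally:
        # Runs on success and on exceptions, so finished work is never lost.
        save_progress(path, data)
    return data
```

Saving in a `finally` block is what makes the run resumable: whatever was collected before the interruption is on disk, and the next run skips it.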
The driver_class.py shown above can be reused as is.
main.py
import json
import re
import time

from selenium.webdriver.common.by import By

from driver_class import DriverClass

dc = DriverClass()
driver = dc.get_driver()
# The visible page text ("资质证书", "确定", "清除", "取消") is matched literally.
xpath_prefix = '//div/div/div/div/span[text()="资质证书"]/following-sibling::div'


def checkbox_select(element_checkbox):
    """Tick the checkbox if it is not ticked yet."""
    class_attribute = element_checkbox.get_attribute("class")
    if "checked" not in class_attribute:
        element_checkbox.find_element(By.XPATH, './span[@class="qccd-tree-checkbox-inner"]').click()


def checkbox_unselect(element_checkbox):
    """Untick the checkbox if it is currently ticked."""
    class_attribute = element_checkbox.get_attribute("class")
    if "checked" in class_attribute:
        element_checkbox.find_element(By.XPATH, './span[@class="qccd-tree-checkbox-inner"]').click()


def get_amount(element_checkbox):
    """Return the number of companies matching the given checkbox."""
    checkbox_select(element_checkbox)
    xpath_confirm = xpath_prefix + '/div/div/div/div/div[text()="确定"]'
    driver.find_element(By.XPATH, xpath_confirm).click()
    time.sleep(0.5)
    try:
        xpath_result = '//div/div/div[@class="search-btn limit-svip"]'
        result = str(driver.find_element(By.XPATH, xpath_result).text)
    except Exception as e:
        print(f"Error: {e}")
        result = "0"
    result = result.replace(",", "")
    match_object = re.search(r"(\d+)", result)
    # Fall back to "0" when the text contains no digits.
    amount = match_object.group(1) if match_object else "0"
    print(f"Count: {amount}")
    # Clear the result so the next click on the selector does not hit "close" by mistake.
    xpath_clear = '//div/div/a[contains(text(), "清除")]'
    try:
        driver.find_element(By.XPATH, xpath_clear).click()
    except Exception:
        pass
    xpath_select = xpath_prefix + '[@class="trigger-container"]'
    driver.find_element(By.XPATH, xpath_select).click()
    time.sleep(0.2)
    checkbox_unselect(element_checkbox)
    return amount


def extend_options():
    """Expand the collapsed tree items and collect the data, three levels deep at most."""
    try:
        with open("data.json", encoding="utf-8") as f:
            data = json.load(f)
    except Exception:
        data = {}
    try:
        xpath_first_class = xpath_prefix + '//div/ul/li[@role="treeitem"]'
        first_item_list = driver.find_elements(By.XPATH, xpath_first_class)
        for item_li in first_item_list:
            text_dk1 = item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
            data[text_dk1] = data.get(text_dk1, {})
            print(f"{text_dk1}")
            switcher = item_li.find_element(By.XPATH, './span[contains(@class, "qccd-tree-switcher")]')
            class_attribute = switcher.get_attribute("class")
            element_checkbox = item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
            if "close" in class_attribute:
                switcher.click()
                time.sleep(0.1)
            elif "noop" in class_attribute:
                # This node has no children.
                if not data[text_dk1]:
                    amount = get_amount(element_checkbox)
                    data[text_dk1] = amount
                continue
            # Once expanded, the next level of ul/li becomes visible.
            second_item_list = item_li.find_elements(By.XPATH, "./ul/li")
            for second_item_li in second_item_list:
                text_dk2 = second_item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
                data[text_dk1][text_dk2] = data[text_dk1].get(text_dk2, {})
                print(f"--{text_dk2}")
                switcher = second_item_li.find_element(By.XPATH, './span[contains(@class, "qccd-tree-switcher")]')
                class_attribute = switcher.get_attribute("class")
                element_checkbox = second_item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
                if "close" in class_attribute:
                    switcher.click()
                    time.sleep(0.1)
                elif "noop" in class_attribute:
                    # This node has no children.
                    if not data[text_dk1][text_dk2]:
                        amount = get_amount(element_checkbox)
                        data[text_dk1][text_dk2] = amount
                    continue
                # Once expanded, the next level of ul/li becomes visible.
                third_item_list = second_item_li.find_elements(By.XPATH, "./ul/li")
                for third_item_li in third_item_list:
                    text_dk3 = third_item_li.find_element(By.XPATH, './span/span/div/span[@class="text-dk"]').text
                    data[text_dk1][text_dk2][text_dk3] = data[text_dk1][text_dk2].get(text_dk3, {})
                    print(f"----{text_dk3}")
                    # At the third level, stop expanding and tick the checkbox directly.
                    element_checkbox = third_item_li.find_element(By.XPATH, './span[contains(@class, "checkbox")]')
                    if not data[text_dk1][text_dk2][text_dk3]:
                        amount = get_amount(element_checkbox)
                        data[text_dk1][text_dk2][text_dk3] = amount
    finally:
        # Always save progress so an interrupted run can resume later.
        with open("data.json", 'w', encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)


def spider_data():
    # Try to close the qualification selector and clear any selected items.
    xpath_close = xpath_prefix + '/div/div/div/a[@class="nclose"]'
    xpath_clear = '//div/div/a[contains(text(), "清除")]'
    try:
        driver.find_element(By.XPATH, xpath_close).click()
    except Exception:
        pass
    try:
        driver.find_element(By.XPATH, xpath_clear).click()
    except Exception:
        pass
    # Open the qualification selector.
    xpath_select = xpath_prefix + '[@class="trigger-container"]'
    driver.find_element(By.XPATH, xpath_select).click()
    time.sleep(2)
    extend_options()
    # Cancel button (not used here).
    xpath_cancel = xpath_prefix + '/div/div/div/div/div[text()="取消"]'
    # Confirm button.
    xpath_confirm = xpath_prefix + '/div/div/div/div/div[text()="确定"]'
    driver.find_element(By.XPATH, xpath_confirm).click()


if __name__ == '__main__':
    spider_data()
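The number extraction in get_amount can be tried in isolation. The sketch below assumes only that the button text contains a number that may use comma thousands separators; `parse_amount` is a hypothetical helper name, and the sample strings in the usage lines are made up, since the exact wording of the page is not shown in this article.

```python
import re


def parse_amount(text):
    """Extract the first integer from text, ignoring comma thousands separators."""
    match = re.search(r"\d+", text.replace(",", ""))
    # Return "0" when there is no number, instead of failing on a None match.
    return match.group(0) if match else "0"


if __name__ == "__main__":
    print(parse_amount("1,234 results"))
    print(parse_amount("no number here"))
```

The amount is kept as a string to match how the script stores values in data.json.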
The resulting data.json file looks like this:
{
  "建筑业资质": {
    "工程设计资质证书": {
      "工程设计专项资质": "26329",
      "建筑工程设计事务所": "356",
      "工程设计行业资质": "4487",
      "工程设计专业资质": "19902",
      "工程设计综合资质": "98"
    },
    "工程勘察资质证书": {
      "工程勘察综合资质": "377",
      "工程勘察专业资质": "7464",
      "工程勘察劳务资质": "3019"
    },
    …
  },
  "食品农产品认证": {
    "有机产品(OGA)": "49868",
    "良好农业规范(GAP)": "6449",
    "食品质量认证(酒类)": "151",
    "绿色食品认证": "34723",
    "绿色市场认证": "318",
    "无公害农产品": "31067",
    "食品安全管理体系认证": "72075",
    "危害分析与关键控制点认证": "51844",
    "乳制品生产企业良好生产规范认证": "445",
    "乳制品生产企业危害分析与关键控制点(HACCP)体系认证": "570",
    "饲料产品": "85"
  },
  "其他资质": {
    "办学许可证": "192010",
    "代理记账许可证书": "34588",
    "会计师事务所执业证书": "12252",
    "DOC证书": "982",
    "SMC证书": "1886",
    "名特优新农产品证书": "1818",
    "招投标类综合资质": "36317",
    "区块链信息服务备案": "2765",
    "医疗机构执业许可证": "570877",
    "CCC工厂认证": "16154",
    "卫生许可证": "3244"
  }
}
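Because the result is a nested dictionary of up to three levels, a small recursive walk is a convenient way to read it back. This is a sketch that assumes every leaf is a string of digits, as in the file above; `iter_leaves` is a hypothetical helper name, and the sample dictionary reuses two entries from the output shown.

```python
def iter_leaves(tree, path=()):
    """Yield (path, count) pairs for every leaf in a nested result dictionary."""
    for key, value in tree.items():
        if isinstance(value, dict):
            # Descend into sub-categories; empty dicts simply yield nothing.
            yield from iter_leaves(value, path + (key,))
        else:
            yield path + (key,), int(value)


if __name__ == "__main__":
    sample = {"其他资质": {"DOC证书": "982", "SMC证书": "1886"}}
    for leaf_path, count in iter_leaves(sample):
        print(" / ".join(leaf_path), count)
```

To process the real file, load it with `json.load` and pass the resulting dictionary to `iter_leaves`.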