壹影博客.
Playwright Tutorial: A High-Performance Python Web Scraping Framework
  • March 24, 2024

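Playwright is a next-generation browser automation and testing tool that Microsoft open-sourced in early 2020. Like Selenium and Pyppeteer, it drives a real browser to carry out automated operations. It supports the mainstream browser engines and offers an API that is concise yet powerful. Although it arrived relatively late, it has been growing quickly.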

Official website: click here

GitHub: click here

It can of course be used for web scraping as well. Here are its main advantages:

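★ Relatively good performance: it compares well against similar tools.

★ No manual driver installation: there is no need to visit each browser vendor's site and download a driver that matches your browser version.

★ Very simple to use: after adding the dependency, a few lines are enough to run a demo, which lowers the barrier to development considerably.

★ Broad browser coverage: Playwright supports all current mainstream browser engines, namely Chromium (Chrome and Edge), Firefox, and WebKit (the engine behind Safari), and provides a complete automation API for each.

★ Mobile page testing: using device emulation, Playwright can test responsive web applications as they would appear in a mobile web browser.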
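
To make the device emulation point concrete, here is a minimal sketch. It assumes that a device descriptor named "iPhone 13" ships with the installed Playwright version; the helper names `context_options` and `mobile_title` and the `locale` override are hypothetical and chosen only for illustration.

```python
def context_options(descriptor, **overrides):
    """Merge a Playwright device descriptor with caller-supplied overrides."""
    options = dict(descriptor)  # copy so the shared descriptor is not mutated
    options.update(overrides)
    return options


def mobile_title(url, device_name="iPhone 13"):
    """Open url in an emulated mobile browser and return the page title."""
    # Imported here so context_options() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Assumption: device_name is a descriptor bundled with this Playwright version.
        descriptor = p.devices[device_name]  # viewport, user agent, touch support, ...
        browser = p.webkit.launch()
        context = browser.new_context(**context_options(descriptor, locale="en-US"))
        page = context.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title
```

Calling `mobile_title("https://www.baidu.com")` would load the page with the emulated viewport and user agent and return its title.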

1. Installation

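Environment: Python 3.7 or later (newer Playwright releases have raised the minimum supported Python version, so check the requirement of the release you install).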

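Installation takes only the two commands below; wait for them to finish.

pip3 install playwright   # install the Python package
playwright install        # download the browser binaries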
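
A quick way to confirm that the package installed is to ask Python for its version. This is a sketch that assumes Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical. It checks only the Python package, not whether the browser binaries were downloaded.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("playwright")
if version is None:
    print("playwright is not installed; run: pip3 install playwright")
else:
    print(f"playwright {version} is installed")
```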

2. Basic Usage

Playwright can be driven in two ways: synchronously or asynchronously. The synchronous version:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # launch Chromium with a visible window
    page = browser.new_page()                    # open a new tab
    page.goto("https://www.baidu.com")           # navigate to Baidu
    print(page.title())                          # print the current page title
    browser.close()                              # close the browser

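The asynchronous version:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)  # launch Chromium
        page = await browser.new_page()                    # open a new tab
        await page.goto("https://www.baidu.com/")          # navigate to Baidu
        await browser.close()                              # close the browser

asyncio.run(main())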
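
The asynchronous API pays off mainly when several pages are loaded at once. The sketch below shows one possible way to do that while capping how many tabs are open at the same time. The helper names `gather_limited` and `fetch_titles` and the default limit of 2 are hypothetical, not part of Playwright.

```python
import asyncio


async def gather_limited(coroutines, limit):
    """Await all coroutines, at most `limit` at a time; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coroutines))


async def fetch_titles(urls, limit=2):
    """Open every URL in its own tab and return the page titles in input order."""
    # Imported here so gather_limited() can be reused without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()

        async def title_of(url):
            page = await browser.new_page()
            try:
                await page.goto(url)
                return await page.title()
            finally:
                await page.close()  # always release the tab

        titles = await gather_limited([title_of(u) for u in urls], limit)
        await browser.close()
        return titles
```

For example, `asyncio.run(fetch_titles(["https://www.baidu.com", "https://bk.yyge.net"]))` would return the two titles as a list.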

 

3. Executing JavaScript

Use the evaluate method to run JavaScript inside the page:

from playwright.sync_api import sync_playwright
with sync_playwright() as playwright:
  browser = playwright.chromium.launch(headless=False)
  page = browser.new_page()

  page.goto("https://bk.yyge.net")
  # Run JavaScript in the page; the message goes to the browser console, not to Python's stdout
  page.evaluate("console.log('hello playwright')")
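
Besides running statements, evaluate also returns the value of the expression to Python and accepts an argument from Python. As a minimal sketch of using that for scraping, the code below reads every link on a page and filters the result in Python. The helper names `same_site_links` and `collect_links` are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(hrefs, base_url):
    """Resolve hrefs against base_url and keep unique http(s) links on the same host."""
    host = urlparse(base_url).netloc
    links = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host and absolute not in links:
            links.append(absolute)
    return links


def collect_links(url):
    """Return the same-site links found on the page at url."""
    # Imported here so same_site_links() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # The second parameter is passed to the JavaScript function as `sel`;
        # the returned array arrives in Python as a list of strings.
        hrefs = page.evaluate(
            "sel => Array.from(document.querySelectorAll(sel), a => a.getAttribute('href'))",
            "a[href]",
        )
        browser.close()
    return same_site_links(hrefs, url)
```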

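4. Listening for Events

You can watch the resources a page loads and decide, for each one, whether to let the request through or block it.

Use the route method to set this up: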

from playwright.sync_api import sync_playwright

def intercept_request(route, request):
  if request.url.startswith("https://www.baidu.com"):
    route.abort()      # abort the request
  else:
    route.continue_()  # let the request through

with sync_playwright() as playwright:
  browser = playwright.chromium.launch(headless=False)
  page = browser.new_page()

  # Register the handler for every request the page makes
  page.route("**/*", intercept_request)
  page.goto("https://bk.yyge.net")
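
A common use of route in scraping is to skip heavy resources such as images and fonts so that pages load faster. The following is a sketch of that idea under stated assumptions: the set of blocked types, the default blocked host, and the helper names `should_block` and `open_without_heavy_resources` are illustrative choices, not something Playwright prescribes.

```python
from urllib.parse import urlparse

# Assumption: these resource types are not needed for the data being scraped.
BLOCKED_TYPES = {"image", "media", "font"}


def should_block(resource_type, url, blocked_hosts=frozenset()):
    """Return True when a request should be aborted."""
    if resource_type in BLOCKED_TYPES:
        return True
    return urlparse(url).hostname in blocked_hosts


def open_without_heavy_resources(url, blocked_hosts=frozenset({"www.baidu.com"})):
    """Load url while aborting unwanted requests, then return the page HTML."""
    # Imported here so should_block() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        request = route.request
        if should_block(request.resource_type, request.url, blocked_hosts):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.route("**/*", handle)
        page.goto(url)
        html = page.content()
        browser.close()
        return html
```

Keeping the decision in a plain function such as `should_block` makes the rule easy to check without starting a browser.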

5. Interaction Between JavaScript and Playwright

from playwright.sync_api import sync_playwright

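def click_fun(dialog):
    print(dialog.message)  # text shown in the dialog
    dialog.dismiss()       # a handled dialog must be accepted or dismissed, or the page stalls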

with sync_playwright() as playwright:
  browser = playwright.chromium.launch(headless=False)
  page = browser.new_page()

  
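  page.goto("https://bk.yyge.net")

  # Listen for dialogs before anything can raise one
  page.on("dialog", click_fun)

  # Run JavaScript that listens for clicks on the page;
  # each click raises an alert showing the clicked element's innerHTML
  page.evaluate('window.addEventListener("click",(e)=>{alert(e.target.innerHTML)})')

  page.wait_for_timeout(5000)  # keep the page open so that clicks can be made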
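
Routing data through alert dialogs works, but it interrupts the page on every click. A possible alternative is `page.expose_function`, which lets page scripts call a Python callback directly. The sketch below assumes that approach; the names `recordClick`, `record_click`, and `watch_clicks` are hypothetical.

```python
clicked = []  # text of every element clicked so far


def record_click(text):
    """Called from the page; stores the trimmed text and returns the running count."""
    clicked.append(text.strip())
    return len(clicked)


def watch_clicks(url, seconds=5):
    """Open url, record the text of clicked elements for a while, and return it."""
    # Imported here so record_click() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        # Makes window.recordClick(...) callable from scripts in the page.
        page.expose_function("recordClick", record_click)
        page.goto(url)
        page.evaluate(
            'window.addEventListener("click", e => window.recordClick(e.target.textContent || ""))'
        )
        page.wait_for_timeout(seconds * 1000)  # time window for manual clicks
        browser.close()
    return list(clicked)
```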

 

A Simple Demo

from playwright.sync_api import sync_playwright


def click_fun(dialog):
    print(dialog.message)  # print the dialog text
    dialog.dismiss()       # dismiss it so the page does not stall


def intercept_request(route, request):
    if request.url.startswith("https://www.baidu.com"):
        # Replace with route.abort() to block these requests instead
        route.continue_()
    else:
        route.continue_()


with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()

    # Intercept every request the page makes
    page.route("**/*", intercept_request)
    page.goto("https://bk.yyge.net")

    # Handle dialogs before any script can raise one
    page.on("dialog", click_fun)

    # Alert the innerHTML of whichever element gets clicked
    page.evaluate('window.addEventListener("click",(e)=>{alert(e.target.innerHTML)})')

    # Raise an alert directly; click_fun prints its message and dismisses it
    page.evaluate('alert("你好王大锤")')

    page.wait_for_timeout(5000)  # fixed wait so the page can be clicked
    browser.close()

 
