Playwright as a Business Scraping Tool: Beyond E2E Testing
The overwhelming majority of Playwright articles discuss end-to-end testing. That is its most visible use case, but far from its only one. When a business portal exposes no API — or one that is incomplete, poorly documented, or restricted to select partners — Playwright becomes a first-class automation tool. Here is how to use it seriously, outside a testing context.
The Concrete Problem: A Portal With No Usable API
Some business portals offer a rich web interface but a limited or absent API. Data extraction, exports, form submission — everything goes through the browser. BeautifulSoup and requests stop there: they cannot handle JavaScript, single-page applications, or complex authentication flows involving MFA or OAuth2 redirections.
Playwright handles all of this natively.
Scraper Architecture
The goal is a scraper that authenticates reliably, navigates and extracts structured data, can be restarted without human intervention, and runs in a containerised environment.
from playwright.async_api import async_playwright, Browser, Page
from dataclasses import dataclass
@dataclass
class ScraperConfig:
base_url: str
username: str
password: str
headless: bool = True
timeout: int = 30_000 # ms
class BusinessScraper:
def __init__(self, config: ScraperConfig):
self.config = config
self._browser: Browser | None = None
self._page: Page | None = None
async def __aenter__(self):
self._playwright = await async_playwright().start()
self._browser = await self._playwright.chromium.launch(
headless=self.config.headless,
args=["--no-sandbox", "--disable-dev-shm-usage"] # Required in Docker
)
context = await self._browser.new_context(
viewport={"width": 1280, "height": 800},
locale="en-GB"
)
self._page = await context.new_page()
return self
async def __aexit__(self, *args):
await self._browser.close()
await self._playwright.stop()
The context manager ensures the browser closes cleanly even in the event of an exception — essential in production.
Robust Authentication
Authentication is the most fragile part of any scraper. Portals change their UI, introduce additional security steps, or add delays. A few principles for making it reliable:
async def login(self) -> bool:
page = self._page
await page.goto(f"{self.config.base_url}/login", wait_until="networkidle")
# Wait for the specific element, not just page load
await page.wait_for_selector("#username", state="visible", timeout=10_000)
await page.fill("#username", self.config.username)
await page.fill("#password", self.config.password)
# Intercept the login response to detect auth failures precisely
async with page.expect_response(
lambda r: "/api/auth" in r.url and r.status in (200, 401, 403)
) as response_info:
await page.click('[type="submit"]')
response = await response_info.value
if response.status != 200:
raise AuthenticationError(f"Login failed: HTTP {response.status}")
await page.wait_for_url(f"{self.config.base_url}/dashboard", timeout=15_000)
return True
Network response interception (expect_response) is more reliable than waiting for a CSS selector after the click — it detects authentication failures without depending on how the error message happens to be rendered.
Extracting Structured Data
Once authenticated, extraction must be deterministic. Playwright allows combining DOM navigation and network interception, depending on which is more stable:
async def extract_certificates(self, period: str) -> list[dict]:
page = self._page
await page.goto(
f"{self.config.base_url}/certificates?period={period}",
wait_until="networkidle"
)
# Strategy 1: intercept the underlying API call when available
async with page.expect_response(
lambda r: "/api/certificates" in r.url
) as api_response:
await page.click("#load-certificates")
data = await (await api_response.value).json()
return data.get("items", [])
async def extract_table_data(self) -> list[dict]:
"""Strategy 2: extract directly from the DOM."""
rows = await self._page.query_selector_all("table.data-grid tbody tr")
results = []
for row in rows:
cells = await row.query_selector_all("td")
values = [await cell.inner_text() for cell in cells]
results.append({
"id": values[0].strip(),
"date": values[1].strip(),
"volume": float(values[2].replace(",", ".")),
"status": values[3].strip(),
})
return results
Strategy 1 (network interception) is preferable when available: raw JSON data is cleaner and less sensitive to layout changes. Strategy 2 (DOM extraction) is the universal fallback.
Handling File Downloads
Many portals offer Excel or CSV exports via a download button. Playwright handles this natively:
async def download_export(self, output_path: str) -> str:
async with self._page.expect_download() as download_info:
await self._page.click("#export-button")
download = await download_info.value
if download.failure():
raise ExportError(f"Download failed: {download.failure()}")
await download.save_as(output_path)
return output_path
Running in Docker and OpenShift
Playwright in a container requires Chromium's system dependencies:
FROM python:3.12-slim
RUN apt-get update && apt-get install -y \
libnss3 libatk1.0-0 libatk-bridge2.0-0 \
libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
libxdamage1 libxfixes3 libxrandr2 libgbm1 \
libasound2 libpango-1.0-0 libcairo2 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN chown -R 1001:0 /app && chmod -R g=u /app
COPY --chown=1001:0 requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && playwright install chromium
COPY --chown=1001:0 . .
USER 1001
CMD ["python", "scraper.py"]
On OpenShift, --no-sandbox is mandatory: containers do not have the privileges required by Chromium's sandbox. This is not a security concern in this context — the sandbox protects against malicious web content, which does not apply to a scraper targeting a known internal portal.
Orchestrating with a Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
name: business-scraper
spec:
schedule: "0 6 * * 1-5"
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
spec:
containers:
- name: scraper
image: registry.internal/business-scraper:latest
env:
- name: SCRAPER_USERNAME
valueFrom:
secretKeyRef:
name: scraper-credentials
key: username
- name: SCRAPER_PASSWORD
valueFrom:
secretKeyRef:
name: scraper-credentials
key: password
restartPolicy: OnFailure
concurrencyPolicy: Forbid is critical: if one execution takes longer than expected, you do not want two scrapers authenticating simultaneously with the same account.
Playwright vs the Alternatives
| Criterion | requests + BS4 | Selenium | Playwright |
|---|---|---|---|
| SPAs / JavaScript | No | Yes | Yes |
| Network interception | No | Partial | Native |
| Native async | No | No | Yes |
| CI/CD stability | Good | Fragile | Good |
| Docker support | Simple | Complex | Reasonable |
| Modern API | No | No | Yes |
For simple static sites, requests and BeautifulSoup remain faster to set up and lighter to operate. However, as soon as complex authentication, dynamic JavaScript, or user interactions are involved — Playwright is the most robust open-source option available today.