Scraping at scale: The numbers that actually matter - London Business News | Londonlovesbusiness.com
Briefly

Web scraping at scale demands engineering aligned with current web realities: pervasive JavaScript, near-universal HTTPS, large median page sizes, and a significant share of automated traffic. JavaScript appears on about 98% of sites, so headless browsers or robust JS runtimes are required for complex targets. With HTTPS exceeding 90% of page loads, clients need TLS and HTTP behaviour parity to avoid fingerprinting. A median page weight of around 2 MB makes bandwidth and compression major constraints: ten million full-page fetches can reach tens of terabytes uncompressed. And with roughly half of traffic automated, sites deploy defences such as rate limits, fingerprinting, and IP reputation checks. Track capture-weighted success rate, parse fidelity, and freshness lag to detect failures early.
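Those three health metrics can be sketched in code. A minimal version, assuming simple definitions (importance-weighted capture, extracted-versus-expected field counts, and worst-case age of the latest capture); the `FetchRecord` type and the exact metric formulas here are illustrative assumptions, not from the article:

```python
import time
from dataclasses import dataclass

@dataclass
class FetchRecord:
    url: str
    weight: float         # importance weight of this target (assumed scheme)
    captured: bool        # did we get a usable response?
    fields_expected: int  # fields the parser should extract
    fields_parsed: int    # fields actually extracted
    captured_at: float    # unix timestamp of the successful capture

def capture_weighted_success(records):
    """Share of target importance actually captured, not raw URL counts."""
    total = sum(r.weight for r in records)
    ok = sum(r.weight for r in records if r.captured)
    return ok / total if total else 0.0

def parse_fidelity(records):
    """Of pages we did capture, what fraction of expected fields parsed?"""
    expected = sum(r.fields_expected for r in records if r.captured)
    parsed = sum(r.fields_parsed for r in records if r.captured)
    return parsed / expected if expected else 0.0

def freshness_lag(records, now=None):
    """Age of the stalest successful capture, in seconds."""
    now = time.time() if now is None else now
    lags = [now - r.captured_at for r in records if r.captured]
    return max(lags) if lags else float("inf")
```

A drop in capture-weighted success flags blocking; a drop in parse fidelity flags silent layout changes; a rising freshness lag flags a backlog, each visible before downstream data goes bad.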
JavaScript is present on about 98% of sites, so pages are not simply HTML dumps. Encrypted browsing exceeds 90% of page loads in major browsers, which means your client stack must match modern TLS and HTTP behaviour. The median page weight hovers around 2 MB, so bandwidth is a first-order constraint, not an afterthought. And automated traffic is a large slice of the internet, roughly half of requests by volume, so naive patterns get flagged quickly.
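The bandwidth point is simple arithmetic, and worth making explicit. A quick sketch, assuming the roughly 2 MB median page weight cited above and decimal units; the fetch counts are illustrative:

```python
MEDIAN_PAGE_MB = 2.0  # median page weight cited in the article

def fleet_bandwidth_tb(fetches, page_mb=MEDIAN_PAGE_MB):
    """Uncompressed transfer for a batch of full-page fetches, in TB."""
    return fetches * page_mb / 1_000_000  # MB -> TB (decimal units)

# Ten million full-page fetches at the median weight:
print(fleet_bandwidth_tb(10_000_000))  # -> 20.0 TB before compression
```

At that scale, deciding what not to download (images, fonts, third-party scripts) matters as much as raw fetch throughput.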
High JavaScript prevalence means headless browsers or robust JS runtimes are not optional on complex targets. Ubiquitous HTTPS means TLS fingerprinting consistency matters: mismatched ciphers, ALPN negotiation, or HTTP/2 quirks can betray automation. A 2 MB median page weight compounds fast: ten million full-page fetches approach 20 TB uncompressed, and while compression can trim HTML by 30 to 70 percent, images and scripts often dominate the byte count. With automated traffic accounting for roughly half of requests, many sites enforce rate limits, fingerprinting, and IP reputation checks by default.
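On the scraper side, the standard answer to rate limits is to pace requests so the budget is never tripped. A minimal client-side token-bucket throttle sketch; the numbers (2 requests per second, burst of 4) are illustrative assumptions, not any site's actual policy:

```python
import time

class TokenBucket:
    """Client-side throttle: stay under an assumed per-host request budget."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec        # sustained requests per second
        self.capacity = burst           # short bursts allowed up to this size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self):
        """Block until one request token is available, then spend it."""
        while True:
            now = time.monotonic()
            # Refill tokens for elapsed time, capped at the burst capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            # Sleep just long enough to accrue the missing fraction.
            time.sleep((1.0 - self.tokens) / self.rate)

bucket = TokenBucket(rate_per_sec=2.0, burst=4)
for _ in range(4):
    bucket.acquire()  # the initial burst of 4 passes immediately
```

In practice you would keep one bucket per host, since budgets and tolerances differ site by site.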