Web crawling, how does it work?


Michael Misiewicz

17 Jan 2024 - 3 min read

Any system that works with web text data needs to be able to remove the "boilerplate" in order to get at the real essence of the page. MContextual uses modern LLMs to generate targeting lists, which improves the situation a lot, but there are still many good reasons to detect and remove this text.

What is boilerplate?

On many web pages, text in menus, headers, and footers (e.g. copyright notices) appears in large quantities. This boilerplate text can be a real problem... what if you wanted to use contextual targeting to reach restaurant connoisseurs and you used the keyword "menu"?

Simply put, "boilerplate" means all the text on a page that isn't the main content.
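As a toy illustration (the page and the `ArticleText` helper are invented for this example): if every page reliably wrapped its main content in an `<article>` tag, separating content from boilerplate would be trivial. Note the keyword collision from the section above — "Menu" appears in the nav, while "menus" appears in the real content.

```python
from html.parser import HTMLParser

# Toy page: only the <article> body is main content; the nav and
# footer are boilerplate. (Real pages rarely label things this neatly.)
PAGE = """
<html><body>
  <nav>Home | Menu | About</nav>
  <article>Our guide to the best tasting menus in Brooklyn.</article>
  <footer>Copyright 2024 Example Corp</footer>
</body></html>
"""

class ArticleText(HTMLParser):
    """Collects only the text that appears inside <article> tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

parser = ArticleText()
parser.feed(PAGE)
main_text = "".join(parser.chunks).strip()
# main_text == "Our guide to the best tasting menus in Brooklyn."
```

The nav's "Menu" and the footer's copyright line never make it into `main_text` — which is exactly the behavior a boilerplate remover has to achieve without the luxury of a clean `<article>` tag.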

Can't we just write a program to solve it?

In short, no. Webpages - especially modern ones - give authors a million ways to display text. Furthermore, frameworks like React make the problem even more complicated. There's really no single way to put text on a page - even though the rendered result looks the same.
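To make that concrete, here is a small invented example of three fragments that render the identical sentence but use completely different markup. A naive fixed rule - "take whatever sits between `<p>` and `</p>`" - handles only the first one cleanly:

```python
import re

# Same rendered text, three different markups: a plain tag, nested
# spans, and JavaScript injection (common with frameworks like React).
FRAGMENTS = [
    "<p>Best tasting menus in Brooklyn</p>",
    "<p><span>Best tasting menus</span> <span>in Brooklyn</span></p>",
    "<p id='c'></p><script>document.getElementById('c').textContent"
    " = 'Best tasting menus in Brooklyn';</script>",
]

# Naive rule: grab whatever sits between <p> and </p>.
rule = re.compile(r"<p[^>]*>(.*?)</p>", re.S)

for frag in FRAGMENTS:
    print(rule.findall(frag))
# ['Best tasting menus in Brooklyn']
# ['<span>Best tasting menus</span> <span>in Brooklyn</span>']
# ['']
```

The second fragment leaks raw tags into the "extracted" text, and the third yields nothing at all because the text only exists after JavaScript runs - two of the many failure modes a rules-based system has to contend with.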

Boilerplate removal is a famously tricky problem. The best Python library out there to solve the problem is called trafilatura. Helpfully, the author publishes very thorough benchmarks of its performance.

But overall, it remains a challenging problem. The very first software to automate boilerplate removal, Boilerpipe, dates to 2009 and is written in Java. A benchmark dataset was published around the same time, but the academic and open-source communities still haven't been able to solve this hard problem really well.

What is implemented today is primarily rules-based. Trafilatura is currently king of the hill here, having dethroned the others.

How does MContextual solve this problem?

Currently, we use Trafilatura due to its strong performance. That might sound like a scalability challenge - how are you going to scrape 100MM web pages (the approximate size of our database, as of this writing) using Python? Well, happily, many Python extensions written in C scale just about as well as any other compiled language. Under the hood, Trafilatura is based on the lxml HTML parser (rules are applied on top of the parsed DOM), so it works fairly well.

Why not use ChatGPT?

It might seem like an easy decision to use an LLM for boilerplate removal. But there are a few issues with that approach.

  • Web pages are typically many kilobytes or even megabytes. Long texts don't interact well with LLMs. This is in part due to the size of the context window - even with modern approaches to larger context windows, getting good evaluation out of long texts remains tricky. This paper has a really cool method for evaluating this problem, in figure 5 (the rest of the paper is unrelated to this subject, however).
  • LLMs are expensive! Even with the big decreases in cost, 100MM pages × 250k tokens (for all the HTML of a web page, assuming an average of 4 bytes - roughly 4 characters - per token) would cost $750,000,000 to process with GPT-4 as of this writing ($0.03 per 1k prompt tokens). Even with the cheaper models (GPT-3.5 Turbo) you're still looking at a bill > $25MM. So, scale is an issue.
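The arithmetic behind those figures is simple enough to check (prices are the ones cited above, as of this writing):

```python
pages = 100_000_000        # approximate size of the database
tokens_per_page = 250_000  # ~1 MB of HTML at ~4 bytes per token
total_tokens = pages * tokens_per_page

gpt4_cost = total_tokens / 1_000 * 0.03     # $0.03 per 1k prompt tokens
gpt35_cost = total_tokens / 1_000 * 0.001   # $0.001 per 1k prompt tokens

print(f"GPT-4:         ${gpt4_cost:,.0f}")   # $750,000,000
print(f"GPT-3.5 Turbo: ${gpt35_cost:,.0f}")  # $25,000,000
```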

What's coming up next?

One really exciting idea is to use some methods from LLM training to assist with the boilerplate problem.

An interesting paper entitled "An unsupervised perplexity-based method for boilerplate removal" uses word frequency information to predict which tokens on a page are boilerplate. The core idea is that if certain sequences of tokens are very common, then they are likely to be boilerplate. The authors collected statistics from a large set of web pages to estimate the distributions needed for these computations. Since HTML, though very flexible, uses a small vocabulary of "words" compared to natural language, this method works pretty well.
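The paper scores tokens by perplexity under models trained on web text; a much cruder document-frequency stand-in for the same intuition - text that repeats across many pages is probably template - can be sketched as follows (this is an invented simplification, not the paper's algorithm):

```python
from collections import Counter

def boilerplate_lines(pages, threshold=0.5):
    """Flag text lines that recur on more than `threshold` of all pages.

    A crude document-frequency proxy for the paper's perplexity scores:
    lines shared by most pages of a site are likely template boilerplate.
    """
    counts = Counter()
    for page in pages:
        counts.update(set(page.splitlines()))  # dedupe within a page
    cutoff = threshold * len(pages)
    return {line for line, n in counts.items() if n > cutoff}

pages = [
    "Home | Menu | About\nA review of a new omakase bar.\nCopyright 2024 Example Corp",
    "Home | Menu | About\nInterview with a pastry chef.\nCopyright 2024 Example Corp",
    "Home | Menu | About\nThe city's best tasting menus.\nCopyright 2024 Example Corp",
]
print(boilerplate_lines(pages))
# the shared nav and copyright lines are flagged; the unique
# article lines are not
```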

We're currently testing out this algorithm to see if it could improve the quality of the MContextual scraper.
