Publications Details

Publications / SAND Report

An Evaluation of Main Content Extraction Libraries in Java and Python

Reeve, Madeline D.

Main content extraction isolates the relevant content of a web page and removes extraneous material such as advertisements and sidebars. Many Python and Java libraries attempt this task using a variety of algorithms. Because page structure varies widely from site to site, there is no “perfect” way to accomplish it, which motivates an evaluation of the available main content extraction libraries.
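
To make the task concrete, the following is a minimal sketch of one simple heuristic, written with only the Python standard library. It is not the method of any library evaluated in the report. The tag lists, the `min_words` threshold, and the names `MainContentSketch` and `extract_main_content` are all assumptions chosen for illustration: text inside tags that commonly hold navigation or footers is skipped, and short text blocks are dropped as likely boilerplate.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as boilerplate (an assumption,
# not a rule taken from any particular library).
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}
# Tags that mark the start or end of a block of text.
BLOCK_TAGS = {"p", "div", "article", "section", "li", "h1", "h2", "h3", "td"}


class MainContentSketch(HTMLParser):
    """Collects text blocks that sit outside SKIP_TAGS and are long enough."""

    def __init__(self, min_words=5):
        super().__init__()
        self.min_words = min_words  # hypothetical threshold for "real" content
        self.skip_depth = 0         # > 0 while inside a skipped tag
        self.buffer = []            # text fragments of the current block
        self.blocks = []            # blocks kept so far

    def _flush(self):
        # Normalize whitespace, then keep the block only if it is long enough.
        text = " ".join(" ".join(self.buffer).split())
        self.buffer = []
        if len(text.split()) >= self.min_words:
            self.blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)


def extract_main_content(html, min_words=5):
    """Return the kept text blocks, one per line."""
    parser = MainContentSketch(min_words)
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text that no end tag flushed
    return "\n".join(parser.blocks)
```

A heuristic this simple fails on pages that put their main text in short blocks or that use nonstandard markup for sidebars, which illustrates why no single approach works for every page structure.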
