This blog post demonstrates a simple web scraping example using three different tools: HtmlUnit, Selenium, and Jaunt. It closes with a short comparison of the three.
HtmlUnit
HtmlUnit is a “GUI-Less browser for Java programs”. The HtmlUnit browser can simulate Chrome, Firefox or Internet Explorer/Edge behaviour. It is a lightweight solution that does not have too many dependencies. Generally, it supports JavaScript and Cookies. HtmlUnit is used for testing, web scraping, and is the basis for other tools.
Usage
Add the following Maven dependency to your project:
The following example uses the search bar on the INNOQ website to search for all entries that contain the expression scraping:
The above example demonstrates how HtmlUnit can be used with JavaScript. HtmlUnit was originally developed for testing, so by default JavaScript errors result in exceptions. Calling webClient.getOptions().setThrowExceptionOnScriptError(false), as shown in the example above, changes that behaviour (cf. JavaScript HowTo). Moreover, the WebClientOptions object of the WebClient, which represents the browser, allows various other configurations. Besides JavaScript, the example shows how to enable cookies, set a timeout for loading pages, and ignore SSL problems.
The actual work starts with the call to webClient.getPage(), which loads the website. The returned page object is the root of the DOM tree, which can be traversed using XPath. As multiple nodes may match a given XPath expression, the getByXPath() method returns a list of objects, so you need to filter the list and cast the objects you find. With DomNode.querySelector(), HtmlUnit also offers a way to select elements by CSS selector (for example, by CSS class). Of course, it is also possible to select elements by id or name. HtmlUnit provides a class for each HTML tag (e.g. HtmlForm, HtmlInput, HtmlButton, HtmlAnchor).
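The "list of matches, then filter and cast" pattern can be tried out without any extra dependency. The following sketch uses the JDK's built-in javax.xml.xpath API on a small, well-formed XHTML string instead of HtmlUnit itself; the class name XPathLinkExtractor, the sample markup, and the CSS class hit are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: JDK-only stand-in for the XPath lookup described above. */
public class XPathLinkExtractor {

    /** Returns the href of every element matched by the XPath expression. */
    public static List<String> extractLinks(String xhtml, String xpathExpression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // An XPath expression may match many nodes, so we ask for a node set.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpathExpression, doc, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                // Filter and cast: only elements carrying an href are of interest.
                if (node instanceof Element && ((Element) node).hasAttribute("href")) {
                    hrefs.add(((Element) node).getAttribute("href"));
                }
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a class='hit' href='/first'>First</a>"
                + "<a class='hit' href='/second'>Second</a>"
                + "<a href='/imprint'>Imprint</a>"
                + "</body></html>";
        extractLinks(page, "//a[@class='hit']").forEach(System.out::println);
    }
}
```

With HtmlUnit the shape is the same, except that the list comes from getByXPath() and the cast targets classes such as HtmlAnchor.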
In the above example, we load the search page of innoq.com, enter a search string, click the search button, and print the URIs of all content found. The button’s click() method returns the next page once it has finished loading. The HTML of the page can be printed to the screen for debugging purposes. HtmlUnit runs without a GUI; where a GUI is needed, other libraries such as Selenium might be an alternative.
Proxy
Normally, if a Java program is behind a proxy, it is sufficient to configure the JVM to contact the proxy server when it connects to the Internet via HTTP:
In the case of HtmlUnit, a special ProxyConfig object needs to be configured so that the setting is taken into account. Assuming that the proxy has been configured via the command line as shown above, we can configure HtmlUnit’s WebClient like this:
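As a minimal sketch of the first half of this setup, the following JDK-only helper reads the standard http.proxyHost and http.proxyPort system properties (the ones passed with -D on the command line) and returns them as a host/port pair. The class JvmProxySettings and its method are hypothetical and not part of HtmlUnit; the values it returns would then be handed to HtmlUnit’s ProxyConfig.

```java
import java.net.InetSocketAddress;
import java.util.Optional;

/** Hypothetical helper (not part of HtmlUnit): reads the JVM's HTTP proxy settings. */
public class JvmProxySettings {

    /**
     * Returns the proxy configured via -Dhttp.proxyHost / -Dhttp.proxyPort,
     * or an empty Optional if no proxy host is set.
     */
    public static Optional<InetSocketAddress> fromSystemProperties() {
        String host = System.getProperty("http.proxyHost");
        if (host == null || host.trim().isEmpty()) {
            return Optional.empty();
        }
        // The JDK itself falls back to port 80 when http.proxyPort is not set.
        int port = Integer.parseInt(System.getProperty("http.proxyPort", "80"));
        // createUnresolved avoids a DNS lookup; only host name and port are needed.
        return Optional.of(InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        Optional<InetSocketAddress> proxy = fromSystemProperties();
        if (proxy.isPresent()) {
            System.out.println("Proxy: " + proxy.get().getHostString() + ":" + proxy.get().getPort());
            // Assumption: host and port would now be set on HtmlUnit's ProxyConfig,
            // which is registered via webClient.getOptions().setProxyConfig(...).
        } else {
            System.out.println("No HTTP proxy configured");
        }
    }
}
```

Returning an Optional keeps the "no proxy configured" case explicit, so the WebClient is only reconfigured when the JVM was actually started with proxy settings.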
Selenium
Selenium is a set of tools for automating browsers. Its major use case is testing websites, but it can also be used for web scraping. Selenium starts a web browser with a GUI window, which makes headless tests harder. On the other hand, a GUI window makes it easier to trace the cause of a failure during the scraping process. Moreover, the browser allows the full use of JavaScript and CSS. Besides Java, Selenium supports C#, Ruby, Python, JavaScript, and Kotlin. Chromium/Chrome, Firefox, Internet Explorer/Edge, and Safari are supported.
Usage
Once a browser of your choice is installed for your OS, you need to download the matching driver for it. The driver version must match the version of the browser. In the following example, we download ChromeDriver and copy the downloaded executable to a directory of our choice (e.g., /home/martin/Documents/). Add the following Maven dependency to your project:
Like the HtmlUnit example, the following code uses the search bar on the INNOQ website to fetch all links that contain the expression scraping.
The example assumes that the downloaded ChromeDriver ELF file is located at /home/martin/Documents/chromedriver. The WebDriver represents the browser. Its get() method opens the given page in the browser. driver.findElement() returns the WebElement it finds, i.e., a node in the DOM tree. If the element does not exist, a NoSuchElementException is thrown. To avoid this exception, you can call driver.findElements() instead, which returns a list of all elements that match the given search criteria. If the list is empty, nothing has been found.
Selenium’s get() navigates the browser to the given URL and, by default, returns once the page has been fully loaded. (The HTTP POST mentioned in its Javadoc refers to the command sent to the driver; the browser itself requests the page as it would for any navigation.) To wait after a click() on an element, create a WebDriverWait object with a timeout in seconds as a parameter. It lets the driver wait for the existence of an element, as specified by the criteria passed to the WebDriverWait’s until() method. Search criteria are represented by the By object: name, className (a single CSS class name), cssSelector (an arbitrary CSS selector, e.g. one that combines several classes), the id attribute of HTML tags, the HTML tag name, (partial) link text, or an XPath expression. If the returned WebElement belongs to a form (i.e., it is the form or any of its sub-elements), you can call its submit() method to submit the form instead of using click().
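Conceptually, such an explicit wait is plain polling: a condition is evaluated repeatedly until it yields a result or the timeout expires. The following JDK-only sketch illustrates that idea. It is not Selenium’s implementation, and the names PollingWait and until as well as the 50 ms polling interval are assumptions made for this illustration.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical illustration of an explicit wait; not Selenium's WebDriverWait. */
public class PollingWait {
    private final Duration timeout;
    private final Duration interval;

    public PollingWait(Duration timeout, Duration interval) {
        this.timeout = timeout;
        this.interval = interval;
    }

    /**
     * Evaluates the condition until it returns a non-empty Optional.
     * Throws IllegalStateException if the timeout expires first.
     */
    public <T> T until(Supplier<Optional<T>> condition) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<T> result = condition.get();
            if (result.isPresent()) {
                return result.get();
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException("Timed out after " + timeout);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        PollingWait waiter = new PollingWait(Duration.ofSeconds(2), Duration.ofMillis(50));
        // Stand-in for "the result list has appeared": succeeds after about 200 ms.
        String found = waiter.until(() -> System.currentTimeMillis() - start >= 200
                ? Optional.of("results")
                : Optional.empty());
        System.out.println(found);
    }
}
```

In Selenium, the condition passed to until() plays the role of the Supplier here, typically an element lookup built from one of the By criteria.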
To tweak the ChromeDriver, you can make use of ChromeOptions. For example, to execute the code without opening the browser’s UI, you simply set the headless flag as shown in the example. If you would like to test the mobile version of your website, you can set ChromeOptions to emulate a mobile browser. The following call lets you specify the user agent:
chromeOptions.addArguments(List.of("--user-agent=Mozilla/5.0 (iPhone;)..."));
Check out the ChromeDriver documentation for more information on mobile emulation.
In general, more information on how to use Selenium can be found in the Selenium documentation.
Jaunt
Jaunt stands for Java Web Scraping & JSON Querying. It does not support JavaScript, but it is extremely fast. Even though its website states the opposite, it is not a free library. A jar file is provided on its download page, which can be used for free for one month. A jar that can be used for longer costs money. The library cannot be used with a GUI. A detailed tutorial is available. Since Jaunt is not based on a WebKit browser, it allows access at the HTTP level, which eases the handling of REST calls. Its support for parsing JSON payloads is a plus. Instead of relying on XPath or CSS selectors, Jaunt keeps its selectors as short as possible to make them less sensitive to structural changes in the DOM tree. The next section demonstrates that Java code using Jaunt is very concise.
Usage
Add the downloaded jar to your project.
The userAgent represents the browser. Its visit() method opens a site, which is then available through the userAgent’s doc property. The example shows that a form can be retrieved simply by specifying its index. Alternatively, a query like Form form = userAgent.doc.getForm("<form name=pagetreesearchform>"); retrieves it by name. If the submit button is unambiguous, it is sufficient to call submit() on the form without a parameter; otherwise, the label on the button can be passed as a parameter to the submit() method (e.g. submit("Search")). A disadvantage is the heavy use of exceptions: instead of returning an Optional or null when an element cannot be found, Jaunt throws exceptions that need to be handled.
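One way to contain this is a small adapter that turns a throwing lookup into an Optional. The sketch below uses only the JDK; Lookups and attempt are hypothetical names, and the Callable stands in for any lookup that throws when nothing is found, such as a Jaunt query.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

/** Hypothetical adapter: converts "throws if not found" into an Optional. */
public class Lookups {

    /**
     * Runs the lookup and returns its result, or an empty Optional
     * if the lookup throws or returns null.
     */
    public static <T> Optional<T> attempt(Callable<T> lookup) {
        try {
            return Optional.ofNullable(lookup.call());
        } catch (Exception notFound) {
            // Simplification: every exception is treated as "not found".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for a successful and a failing element lookup.
        Optional<String> hit = attempt(() -> "search form");
        Optional<String> miss = attempt(() -> {
            throw new IllegalStateException("no such form");
        });
        System.out.println("hit present: " + hit.isPresent());
        System.out.println("miss present: " + miss.isPresent());
    }
}
```

Catching every Exception keeps the sketch short. In real code it would be safer to catch only the library’s not-found exception so that genuine failures, such as network errors, still surface.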
Comparison
Based on the above observations, the following comparison table can be derived (X supported, - not supported, L limited).
| Tool | GUI | No GUI | JavaScript/CSS | Free | Fast | XPath Selector | CSS Selector | HTML Selector |
|---|---|---|---|---|---|---|---|---|
| HtmlUnit | - | X | X | X | - | X | X | X |
| Selenium | X | L | X | X | - | X | X | X |
| Jaunt | - | X | - | - | X | - | - | X |