(c) 2015-2022 Radek Burget
FitLayout/2 is an extensible framework for web document rendering, modeling and analysis. It provides a Java API as well as a platform-independent command-line interface (CLI) and a web API (REST), which together allow implementing a complete web page processing workflow including the following features:
- A unified model for rendered web page representation in Java. The model describes the page at the level of individual boxes generated by a rendering engine and is independent of the source page format. It is suitable for further analysis, for example by page segmentation algorithms.
- Page renderers for creating page models from source documents. Currently, the following renderers are available:
- Full-featured renderers based on a remotely controlled Chrome browser: the Playwright renderer, which is integrated directly in FitLayout, and the Puppeteer renderer, which requires a separate backend based on Node.js. Both solutions allow rendering any web page, including complex and dynamic web pages with JavaScript, and are suitable for different application scenarios.
- A simple built-in CSSBox-based renderer - it is a pure Java renderer with no additional dependencies that is suitable for a quick rendering of simple web pages. It renders HTML+CSS pages and PDF documents and it may be faster for shorter documents because no external browser needs to be started. However, it does not support dynamic pages with JavaScript and complex CSS layouts.
- A PDF renderer for rendering PDF documents with the possibility of choosing a range of pages and a zoom factor. The implementation is based on Apache PDFBox.
- Page segmentation algorithms for performing page segmentation on the rendered pages, and the corresponding area tree model that describes the segmentation result. Currently, the following page segmentation methods are available:
- VIPS - Vision-based page segmentation
- BCS - Block clustering segmentation
- Visual area grouping - A basic but configurable bottom-up segmentation method
The framework provides all necessary tools and data structures for easily implementing more page segmentation algorithms.
- RDF-based storage for the rendered pages, segmentation results and other artifacts. The rendered pages and the segmentation results are described using the prepared ontologies and may be kept in a common repository or shared. This allows analyzing the rendered pages repeatedly with no need to re-render them. Moreover, any part of a page can easily be annotated with different metadata, and the page contents can be queried using SPARQL.
- Page processing workflow that puts everything together. It allows automation of the web document analysis process and invocation of the individual steps.
Detailed documentation is available from the project Wiki.
To use FitLayout out of the box for web page rendering, segmentation and storage, the easiest way is to use the available Docker images.
To include all artifacts, simply add the following dependency to your pom.xml:
<dependency>
<groupId>cz.vutbr.fit.layout</groupId>
<artifactId>fitlayout-all</artifactId>
<version>${fitlayout.version}</version>
<type>pom</type>
</dependency>
Alternatively, the individual Maven artifacts may be added separately.
Different FitLayout parameters may be configured using Java properties. See Configuration for details.
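As a reminder of how Java properties behave in general, the following self-contained sketch parses properties-file text and lets a JVM system property (`-Dkey=value`) override it. The keys `demo.renderer.width` and `demo.renderer.height` are made-up names for illustration, and the lookup order shown here is an assumption of this sketch, not a description of how FitLayout resolves its settings; the Configuration page lists the real property names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesDemo {
    /** Parses properties-file text into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Looks a key up in JVM system properties first, then in the file, then falls back to a default. */
    static String lookup(Properties fileProps, String key, String defaultValue) {
        String sys = System.getProperty(key);
        return sys != null ? sys : fileProps.getProperty(key, defaultValue);
    }

    public static void main(String[] args) {
        // Hypothetical keys, not real FitLayout settings
        Properties props = parse("demo.renderer.width=1200\n");
        System.out.println(lookup(props, "demo.renderer.width", "1024"));
        System.out.println(lookup(props, "demo.renderer.height", "768"));
    }
}
```

In a real setup the text would typically come from a file such as config.properties rather than from a string literal.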
Page rendering can be performed by a simple CSSBox-based renderer (see below) or alternatively, by a full-featured Chromium-based renderer.
URL url = new URL("http://...");
// setup the renderer
var renderer = new CSSBoxTreeProvider(url, 1200, 800);
renderer.setIncludeScreenshot(false); // we don't need a screenshot in this demo
// perform page rendering
Page page = renderer.getPage();
The resulting Page object represents the rendered page:
System.out.println("Url: " + page.getSourceURL());
System.out.println("Title: " + page.getTitle());
System.out.println("Rendered size: " + page.getWidth() + " x " + page.getHeight() + " px");
Page content is represented as a tree of boxes. Each box is generated by the rendering backend from the page DOM as defined by the CSS visual formatting model, and it has an exact position, size and visual style assigned.
Later, a page segmentation algorithm may be applied to the page (e.g. VIPS):
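To make the box-tree idea concrete, the following self-contained sketch models a page as a tree of positioned boxes and prints it with indentation. The `DemoBox` record and its fields are illustrative assumptions and are not part of the FitLayout API, whose own box model also carries visual style and other information. The sketch needs Java 16 or later because it uses records.

```java
import java.util.List;

public class BoxTreeDemo {
    /** Hypothetical stand-in for a rendered box: a name, absolute bounds in pixels, and child boxes. */
    record DemoBox(String name, int x, int y, int width, int height, List<DemoBox> children) {}

    /** Counts the box itself plus all of its descendants. */
    static int countBoxes(DemoBox box) {
        int count = 1;
        for (DemoBox child : box.children()) {
            count += countBoxes(child);
        }
        return count;
    }

    /** Appends one line per box, indented by its depth in the tree. */
    static void dump(DemoBox box, int depth, StringBuilder out) {
        out.append("  ".repeat(depth))
           .append(box.name())
           .append(" [").append(box.x()).append(',').append(box.y()).append(' ')
           .append(box.width()).append('x').append(box.height()).append("]\n");
        for (DemoBox child : box.children()) {
            dump(child, depth + 1, out);
        }
    }

    /** Builds a tiny example page: a body containing a heading and a paragraph with one text box. */
    static DemoBox samplePage() {
        DemoBox text = new DemoBox("text", 10, 110, 300, 20, List.of());
        DemoBox para = new DemoBox("p", 10, 100, 1180, 40, List.of(text));
        DemoBox header = new DemoBox("h1", 10, 10, 1180, 60, List.of());
        return new DemoBox("body", 0, 0, 1200, 800, List.of(header, para));
    }

    public static void main(String[] args) {
        DemoBox root = samplePage();
        StringBuilder out = new StringBuilder();
        dump(root, 0, out);
        System.out.print(out);
        System.out.println("Boxes: " + countBoxes(root));
    }
}
```

Analysis steps such as page segmentation typically walk a tree like this recursively, using the absolute bounds of each box to reason about visual layout.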
// setup the VIPS segmentation provider
var vips = new VipsProvider();
vips.setPDoC(9); // the preferred degree of coherence
// perform segmentation; produces an area tree
AreaTree atree = vips.createAreaTree(page);
The AreaTree represents the result of page segmentation.
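For intuition about what a segmentation result looks like, here is a deliberately simple, self-contained sketch that groups boxes into areas by vertical proximity. It is a toy heuristic written for this illustration only; it is not VIPS, BCS or FitLayout's visual area grouping, and the `Rect` and `Area` records are hypothetical types, not FitLayout classes. It produces a single level of areas, whereas a real area tree may be nested. Java 16 or later is assumed because of the records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GroupingDemo {
    /** Hypothetical leaf box with absolute bounds in pixels. */
    record Rect(int x, int y, int width, int height) {
        int bottom() { return y + height; }
        int right() { return x + width; }
    }

    /** Hypothetical area: the bounding rectangle of the boxes grouped into it. */
    record Area(Rect bounds, List<Rect> boxes) {}

    /** Returns the smallest rectangle covering all given boxes (the list must not be empty). */
    static Rect union(List<Rect> boxes) {
        int x1 = Integer.MAX_VALUE, y1 = Integer.MAX_VALUE;
        int x2 = Integer.MIN_VALUE, y2 = Integer.MIN_VALUE;
        for (Rect r : boxes) {
            x1 = Math.min(x1, r.x());
            y1 = Math.min(y1, r.y());
            x2 = Math.max(x2, r.right());
            y2 = Math.max(y2, r.bottom());
        }
        return new Rect(x1, y1, x2 - x1, y2 - y1);
    }

    /** Groups boxes top to bottom; a new area starts when the vertical gap exceeds maxGap. */
    static List<Area> group(List<Rect> boxes, int maxGap) {
        List<Rect> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt(Rect::y));
        List<Area> areas = new ArrayList<>();
        List<Rect> current = new ArrayList<>();
        int currentBottom = 0;
        for (Rect r : sorted) {
            if (!current.isEmpty() && r.y() - currentBottom > maxGap) {
                // Gap too large: close the current area and start a new one
                areas.add(new Area(union(current), List.copyOf(current)));
                current.clear();
            }
            currentBottom = current.isEmpty() ? r.bottom() : Math.max(currentBottom, r.bottom());
            current.add(r);
        }
        if (!current.isEmpty()) {
            areas.add(new Area(union(current), List.copyOf(current)));
        }
        return areas;
    }

    public static void main(String[] args) {
        List<Rect> boxes = List.of(
            new Rect(10, 10, 500, 20),    // two lines close together
            new Rect(10, 35, 480, 20),
            new Rect(10, 120, 500, 20));  // a separate block further down
        for (Area a : group(boxes, 10)) {
            System.out.println(a.bounds() + " <- " + a.boxes().size() + " box(es)");
        }
    }
}
```

With a maximum gap of 10 px, the first two boxes end up in one area and the third box forms an area of its own.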
More Java examples that include different rendering and page segmentation methods may be found in Demos.
The command-line interface (CLI) is invoked by running FitLayout.jar (see Compilation from Source above) or by running the corresponding Docker container. In both cases, it accepts a list of commands as explained below.
For the local installation (FitLayout.jar) run:
java -jar FitLayout.jar <commands>
Make sure that FitLayout.jar and, optionally, the config.properties configuration file are in the current working directory.
For the Docker container, get the fitlayout.sh script according to the configuration instructions and then run:
./fitlayout.sh <commands>
The CLI tool understands a small set of commands such as RENDER for rendering the page, SEGMENT for performing page segmentation or EXPORT for exporting the result.
See the Command-line Interface wiki page for a complete reference on commands and their arguments.
Render a page using the puppeteer backend and export the model to an XML file:
./fitlayout.sh \
RENDER -b puppeteer http://cssbox.sf.net \
EXPORT -f xml
Render a page using the puppeteer backend, perform segmentation using VIPS and export to an XML file:
./fitlayout.sh \
RENDER -b puppeteer http://cssbox.sf.net \
SEGMENT -m vips -O pDoC=9 \
EXPORT -f xml
Render a page using the cssbox backend, store a screenshot, perform segmentation using BCS, store a screenshot of the segmented page and export the areas in RDF/Turtle:
./fitlayout.sh \
RENDER -b cssbox http://cssbox.sf.net \
EXPORT -f png -o /tmp/page.png \
SEGMENT -m bcs \
EXPORT -f png -o /tmp/segments.png \
EXPORT -f turtle
See the Command-line Interface wiki page for more examples including the usage of the built-in RDF storage.
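When the CLI needs to be driven from a Java program, the chained commands can be assembled as a plain argument list. The sketch below only composes and prints the command line, using the commands and options that appear in the examples above; the class and method names are hypothetical, and actually launching the process is left as a comment because it requires FitLayout.jar in the working directory.

```java
import java.util.ArrayList;
import java.util.List;

public class CliCommandDemo {
    /** Builds the argument list for: java -jar FitLayout.jar RENDER -b backend url EXPORT -f format. */
    static List<String> renderAndExport(String backend, String url, String format) {
        List<String> cmd = new ArrayList<>(List.of("java", "-jar", "FitLayout.jar"));
        cmd.addAll(List.of("RENDER", "-b", backend, url));
        cmd.addAll(List.of("EXPORT", "-f", format));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = renderAndExport("cssbox", "http://cssbox.sf.net", "turtle");
        System.out.println(String.join(" ", cmd));
        // To actually run it (FitLayout.jar must be in the working directory):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with URLs and file paths.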
The FitLayoutWeb project implements a standalone web service that allows access to all FitLayout features via a REST API.
If you find FitLayout useful for your scientific work, please cite the following publication:
MILIČKA Martin and BURGET Radek. Information Extraction from Web Sources based on Multi-aspect Content Analysis. In: Semantic Web Evaluation Challenges, SemWebEval 2015 at ESWC 2015. Communications in Computer and Information Science, vol. 2015. Portorož: Springer International Publishing, 2015, pp. 81-92. ISBN 978-3-319-25517-0. ISSN 1865-0929.
All the source code of the FIT Layout Analysis Framework is licensed under the GNU Lesser General Public License (LGPL), version 3. A copy of the LGPL can be found in the LICENSE file.
The framework is under development and its API or functionality may change in future versions. See the CHANGELOG for the most important changes to the previous versions.