FitLayout/2 - Web Page Analysis Framework

FitLayout/2 is an extensible framework for web document rendering, modeling and analysis. It provides a Java API, a platform-independent command-line interface (CLI) and a Web API (REST). Together, they allow implementing a complete web page processing workflow with the following features:

A unified model for rendered web page representation in Java. The model describes the page at the level of individual boxes generated by a rendering engine and is independent of the source page format. It is suitable for further analysis, for example by page segmentation algorithms.
Page renderers for creating page models from source documents. Currently, the following renderers are available:
- Full-featured renderers based on the remotely controlled Chrome browser: the Playwright renderer, which is integrated directly in FitLayout, and the Puppeteer renderer, which requires a separate backend based on Node.js. Both solutions can render any web page, including complex and dynamic pages with JavaScript, and they are suitable for different application scenarios.
- A simple built-in CSSBox-based renderer - it is a pure Java renderer with no additional dependencies that is suitable for quick rendering of simple web pages. It renders HTML+CSS pages and PDF documents and it may be faster for shorter documents because no external browser needs to be started. However, it does not support dynamic pages with JavaScript or complex CSS layouts.
- A PDF renderer for rendering PDF documents with the possibility of choosing a range of pages and a zoom factor. The implementation is based on Apache PDFBox.
Page segmentation algorithms for performing page segmentation on the rendered pages and the corresponding area tree model that describes the segmentation result. Currently, the following page segmentation methods are available:
- VIPS - Vision-based page segmentation
- BCS - Block clustering segmentation
- Visual area grouping - A basic but configurable bottom-up segmentation method
  The framework provides all necessary tools and data structures for easily implementing more page segmentation algorithms.
RDF-based storage for storing the rendered pages, segmentation results and other artifacts. The rendered pages and the segmentation results are described using the prepared ontologies and may be stored in a common storage or shared. This allows analyzing the rendered pages repeatedly without re-rendering them. Moreover, it allows easy annotation of any part of the pages with different metadata and querying the page contents using SPARQL.
Page processing workflow that puts everything together. It allows automation of the web document analysis process and invocation of the individual steps.

Documentation

Detailed documentation is available from the project Wiki.

Installation

Docker Images

To use FitLayout out of the box for web page rendering, segmentation and storage, the easiest way is to use the available Docker images.

Maven Artifacts

For including all artifacts, simply add the following dependency to your pom.xml:

<dependency>
    <groupId>cz.vutbr.fit.layout</groupId>
    <artifactId>fitlayout-all</artifactId>
    <version>${fitlayout.version}</version>
    <type>pom</type>
</dependency>

Alternatively, the individual Maven artifacts may be added separately.

Configuration

Different FitLayout parameters may be configured using Java properties. See Configuration for details.
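
As a rough illustration of property-based configuration, the following sketch loads values from a properties source (such as a config.properties file) and lets JVM system properties take precedence. The `ConfigSketch` class, the precedence order and the key `demo.renderer.backend` are hypothetical placeholders chosen for this example; they are not actual FitLayout classes or parameter names, which are listed in Configuration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the FitLayout API.
class ConfigSketch {

    /** Loads properties from a reader, e.g. one opened on a config.properties file. */
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Resolves a key: JVM system property first, then the loaded value, then the fallback. */
    static String resolve(Properties loaded, String key, String fallback) {
        String fromSystem = System.getProperty(key);
        if (fromSystem != null) {
            return fromSystem;
        }
        return loaded.getProperty(key, fallback);
    }

    public static void main(String[] args) {
        // 'demo.renderer.backend' is a made-up key used only in this sketch
        Properties props = load(new StringReader("demo.renderer.backend=cssbox\n"));
        System.out.println(resolve(props, "demo.renderer.backend", "none"));
    }
}
```

With this arrangement, a value passed on the command line as `-Ddemo.renderer.backend=...` would override the one read from the file.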

Web Page Rendering and Processing in Java

Page rendering can be performed by a simple CSSBox-based renderer (see below) or alternatively, by a full-featured Chromium-based renderer.

URL url = new URL("http://...");

// setup the renderer
var renderer = new CSSBoxTreeProvider(url, 1200, 800);
renderer.setIncludeScreenshot(false); // we don't need a screenshot in this demo

// perform page rendering
Page page = renderer.getPage();

The resulting Page object represents the rendered page:

System.out.println("Url: " + page.getSourceURL());
System.out.println("Title: " + page.getTitle());
System.out.println("Rendered size: " + page.getWidth() + " x " + page.getHeight() + " px");

Page content is represented as a tree of boxes. Each box is generated by the rendering backend from the page DOM as defined by the CSS visual formatting model, and it has an exact position, size and visual style assigned.
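
To make the box tree idea more concrete, here is a minimal, self-contained sketch of such a structure. The `BoxNode` class and its fields are hypothetical and far simpler than the actual FitLayout model classes; the sketch only shows how boxes with an absolute position and size nest into a tree that can be traversed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a rendered box; not the FitLayout model API.
class BoxNode {
    final String label;
    final int x, y, width, height; // absolute position and size in pixels
    final List<BoxNode> children = new ArrayList<>();

    BoxNode(String label, int x, int y, int width, int height) {
        this.label = label;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Appends a child box and returns this box to allow chaining. */
    BoxNode add(BoxNode child) {
        children.add(child);
        return this;
    }

    /** Counts this box and all of its descendants. */
    int countBoxes() {
        int count = 1;
        for (BoxNode child : children) {
            count += child.countBoxes();
        }
        return count;
    }

    /** Collects the leaf boxes in document order. */
    List<BoxNode> leaves() {
        List<BoxNode> result = new ArrayList<>();
        collectLeaves(this, result);
        return result;
    }

    private static void collectLeaves(BoxNode node, List<BoxNode> out) {
        if (node.children.isEmpty()) {
            out.add(node);
        } else {
            for (BoxNode child : node.children) {
                collectLeaves(child, out);
            }
        }
    }

    /** Builds a tiny example page: a root box with a heading and a paragraph containing a link. */
    static BoxNode samplePage() {
        BoxNode paragraph = new BoxNode("p", 0, 60, 1200, 40)
                .add(new BoxNode("a", 0, 60, 200, 20));
        return new BoxNode("body", 0, 0, 1200, 800)
                .add(new BoxNode("h1", 0, 0, 1200, 50))
                .add(paragraph);
    }

    public static void main(String[] args) {
        BoxNode page = samplePage();
        System.out.println("Boxes: " + page.countBoxes());
        for (BoxNode leaf : page.leaves()) {
            System.out.println(leaf.label + " at [" + leaf.x + ", " + leaf.y + "] "
                    + leaf.width + " x " + leaf.height + " px");
        }
    }
}
```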

Later, a page segmentation algorithm may be applied to the page (e.g. VIPS):

// setup the VIPS segmentation provider
var vips = new VipsProvider();
vips.setPDoC(9); // the preferred degree of coherence

// perform segmentation; produces an area tree
AreaTree atree = vips.createAreaTree(page);

The AreaTree represents the result of page segmentation.
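
For intuition about what a segmentation step produces, the sketch below groups boxes into larger visual areas bottom-up, based only on vertical proximity. It is a deliberately naive illustration: it is neither VIPS nor any other method shipped with FitLayout, it returns a flat list rather than a tree, and the `AreaRect` and `GapGrouping` names as well as the `maxGap` parameter are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical rectangle type for this sketch; not a FitLayout class.
class AreaRect {
    final int x1, y1, x2, y2; // top-left and bottom-right corners in pixels

    AreaRect(int x1, int y1, int x2, int y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /** Returns the smallest rectangle that encloses both this one and the other one. */
    AreaRect union(AreaRect other) {
        return new AreaRect(Math.min(x1, other.x1), Math.min(y1, other.y1),
                Math.max(x2, other.x2), Math.max(y2, other.y2));
    }

    @Override
    public String toString() {
        return "[" + x1 + ", " + y1 + ", " + x2 + ", " + y2 + "]";
    }
}

class GapGrouping {

    /**
     * Merges boxes from top to bottom: a box joins the current area when the
     * vertical gap between the area's bottom edge and the box's top edge is
     * at most maxGap pixels; otherwise it starts a new area.
     */
    static List<AreaRect> group(List<AreaRect> boxes, int maxGap) {
        List<AreaRect> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt(r -> r.y1));

        List<AreaRect> areas = new ArrayList<>();
        AreaRect current = null;
        for (AreaRect box : sorted) {
            if (current != null && box.y1 - current.y2 <= maxGap) {
                current = current.union(box);
            } else {
                if (current != null) {
                    areas.add(current);
                }
                current = box;
            }
        }
        if (current != null) {
            areas.add(current);
        }
        return areas;
    }

    public static void main(String[] args) {
        List<AreaRect> boxes = new ArrayList<>();
        boxes.add(new AreaRect(0, 0, 100, 20));    // heading line
        boxes.add(new AreaRect(0, 25, 100, 45));   // text right below the heading
        boxes.add(new AreaRect(0, 100, 100, 120)); // distant footer line
        for (AreaRect area : group(boxes, 10)) {
            System.out.println(area);
        }
    }
}
```

Real segmentation methods take many more visual cues into account (separators, fonts, colors, alignment) and build a hierarchy of areas instead of a single level.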

More Java examples that include different rendering and page segmentation methods may be found in Demos.

Command Line Interface

The command line interface (CLI) is invoked by running FitLayout.jar (see Compilation from Source above) or by running the corresponding Docker container. In both cases, it accepts a list of commands as explained below.

For the local installation (FitLayout.jar) run:

java -jar FitLayout.jar <commands>

Make sure that the FitLayout.jar and optionally the config.properties configuration file are in the current working directory.

For the Docker container, get the fitlayout.sh script according to the configuration instructions and then run:

./fitlayout.sh <commands>

Commands

The CLI tool understands a small set of commands such as RENDER for rendering the page, SEGMENT for performing page segmentation or EXPORT for exporting the result.

See the Command-line Interface wiki page for a complete reference on commands and their arguments.
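
The way a single invocation chains several commands can be pictured with the following sketch, which splits an argument list into one group per command. The command names are taken from the text above, but the `CommandSplitter` class and its splitting rule (a new group starts at every known command keyword) are assumptions made for illustration; they do not describe how the FitLayout CLI is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of command chaining; not the FitLayout CLI parser.
class CommandSplitter {

    // Only the commands named in this README; the real CLI knows more.
    static final Set<String> COMMANDS =
            new HashSet<>(Arrays.asList("RENDER", "SEGMENT", "EXPORT"));

    /** Splits the arguments into groups, each starting with a command keyword. */
    static List<List<String>> split(String... args) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = null;
        for (String arg : args) {
            if (COMMANDS.contains(arg)) {
                current = new ArrayList<>();
                groups.add(current);
            }
            if (current == null) {
                throw new IllegalArgumentException("Expected a command but got: " + arg);
            }
            current.add(arg);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<String>> groups = split(
                "RENDER", "-b", "puppeteer", "http://cssbox.sf.net",
                "EXPORT", "-f", "xml");
        for (List<String> group : groups) {
            System.out.println(group);
        }
    }
}
```

Each group would then be handed to the step it names, in order, so that the output of one command becomes the input of the next.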

Usage examples

Render a page using the puppeteer backend and export the model to an XML file:

./fitlayout.sh \
    RENDER -b puppeteer http://cssbox.sf.net \
    EXPORT -f xml

Render a page using the puppeteer backend, perform segmentation using VIPS and export to an XML file:

./fitlayout.sh \
    RENDER -b puppeteer http://cssbox.sf.net \
    SEGMENT -m vips -O pDoC=9 \
    EXPORT -f xml

Render a page using the cssbox backend, store a screenshot, perform segmentation using BCS, store a screenshot of the segmented page and export the areas in RDF/Turtle:

./fitlayout.sh \
    RENDER -b cssbox http://cssbox.sf.net \
    EXPORT -f png -o /tmp/page.png \
    SEGMENT -m bcs \
    EXPORT -f png -o /tmp/segments.png \
    EXPORT -f turtle

See the Command-line Interface wiki page for more examples including the usage of the built-in RDF storage.

Web API (REST)

The FitLayoutWeb project implements a standalone web service that allows access to all FitLayout features via a REST API.

Publication

If you find FitLayout useful for your scientific work, please cite the following publication:

MILIČKA Martin and BURGET Radek. Information Extraction from Web Sources based on Multi-aspect Content Analysis. In: Semantic Web Evaluation Challenges, SemWebEval 2015 at ESWC 2015. Communications in Computer and Information Science, vol. 2015. Portorož: Springer International Publishing, 2015, pp. 81-92. ISBN 978-3-319-25517-0. ISSN 1865-0929.

License

All the source code of the FIT Layout Analysis Framework is licensed under the GNU Lesser General Public License (LGPL), version 3. A copy of the LGPL can be found in the LICENSE file.

The framework is under development and its API or functionality may change in future versions. See the CHANGELOG for the most important changes to the previous versions.
