-
Notifications
You must be signed in to change notification settings - Fork 761
Release Notes 3.0.0
- Release Notes - 3.0.0 (Dec 2009)
These are the project wiki release notes for the 3.0.0 release.
Release 3.0.0 is a major release, the first of the Heritrix3 ("H3") series. It includes new features and issue fixes, and a significant reworking of the configuration system and user interface based on current and expected needs.
Heritrix3 is currently suitable for advanced users and projects that are either customizing Heritrix (with Java or other scripting code) or embedding Heritrix in a larger system. Please review the Current Limitations, below, to help determine if Heritrix3 or a current Heritrix1 (1.14.3 or later) release is best suited for your needs.
The 3.0.0 release is now available for download at the archive-crawler Sourceforge project.
Documentation for Heritrix3 is available via the Heritrix 3.0 User Guide , Heritrix 3.0 API Guide , and other notes on the Heritrix project wiki .
Please discuss questions, problems, and ideas at our project discussion list , and submit bug reports or feature requests via our project JIRA issue tracker .
An XML file that conforms to the
Spring Inversion-of-Control container
framework (version
2.5.6)
is now the mechanism for configuring Heritrix crawl jobs. This file
exists for each crawl job and replaces the previous order.xml file. The
file is called crawler-beans.cxml
(or profile-crawler-beans.cxml
to
signify a template configuration which may be edited and test-built but
not launched).
The Web User Interface has been redesigned to generally follow REST architectural principles, allowing simple HTML GETs and POSTs to control crawler behavior, either with a web browser or other systems.
The Heritrix crawler source code, with the exception of a few files, is
now available under the Apache License, 2.0. See the included file
LICENSE.txt
for details.
Checkpoints may be requested without requiring a full-crawler pause, and complete with a much smaller impact on the rate of progress. See (HER-1516 ).
Heritrix now robustly supports crawls with millions of seeds. See (HER-769 ).
Custom Processors and DecideRules may be implemented in Beanshell, Javascript, or Groovy scripting languages without adding other libraries or restarting Heritrix, and an enhanced Scripting Console in the operator web user-interface allows any kind of inspection or modification of a running crawl.
The new Action directory allows adding seeds and recovery URI data at any point during a crawl. See (HER-1660 ).
An experimental facility for directing the crawler to revisit URIs at a fixed interval after their initial fetch is available. See (HER-1714 ).
Configuring a Heritrix3 crawl requires directly editing a textual XML configuration file. Simple changes are a matter of replacing or un-commenting existing values in a model file, while more extensive changes require understanding the Spring 'beans' container-specification XML format, and specifying implementation Java classes by their full names. An interface with individual fields and selection-lists of common options, as in prior versions of Heritrix, will be restored in a future release.
The new model of relaunching a job in its home directory, rather than creating a new directory for every launch, increases the likelihood logs and other files from later runs will attempt to use the same filenames as earlier runs. In general, Heritrix3 attempts to preserve the earlier files by renaming them aside, with a timestamp suffix added to their filenames.
The current migration utility, executable class
org.archive.crawler.migrate.MigrateH1to3Tool
, only works reliably for
changed basic order.xml
configuration values as reflected in our
bundled default configuration. Also, it makes no effort to convert H1
per-domain/per-host settings overrides. By providing a model H3-based
configuration with some values brought over, and reporting the values it
cannot convert, it may still provide a useful base for other
hand-conversion. Please let us know what advanced configuration in your
Heritrix1 crawls is most important to support in the automated tool.
The operator user interface is now accessible only by HTTPS, by default, on port 8443. A custom self-signed certificate will be generated (if a server certificate is not specified) and saved for reuse if necessary. As before, the interface is also only made available via the 'localhost' interface unless a non-default switch is used to listen on all interfaces.
Secure the operator interface
Important: Via its crawl-configuration and scripting capabilities, the operator web interface grants the ability to run arbitrary Java code, and make arbitrary filesystem changes, on the system running the crawler, with all privileges of the user account used to launch the Java runtime. As a result, securing the operator interface with network policies and strong password is very important. Please review the Security Considerations section of the User Manual for more details.
All pre-crawl configuration is guided by a primary XML configuration
file in the 'beans' format of the Spring framework. (It may import other
files or refer to other paths that are consulted, as well.) By
convention, this file is named 'crawler-beans.cxml
'.As noted above,
this file muse be edited directly -- either outside the web user
interface, or via a file-editor page provided in the web user interface
-- to change configuration parameters or the choice of implementing
classes for crawler functionality.
The bundled default profile contains an initial suggested configuration with recommended values and modules, and should be used as a model to build other configurations. It is extensively commented to describe optional properties that may be altered from default values.
A migration program is also available to help operators migrate 1.x crawl jobs to 3.0 crawl jobs. See (HER-1641 ). It currently only helps map simple configuration values to their new equivalents, but will also display warnings about configuration that requires more attention.
A built crawl may be inspected and changed in new ways, including the Scripting Console and Bean Browser, but these changes will not be automatically propagated to the CXML configuration.
The general lifecycle of a crawl is now:
- Compose the crawl job configuration as desired, perhaps starting from a bundled template or other earlier crawl job
- Build the crawl, which serves to test many aspects of the configuration's validity (such as whether the CXML is legal and specified all necessary components and legal property values). A built crawl has had all its components created and connected, but has not yet started consuming disk resources or begun network activity.
- Launch the crawl, which signals it to begin disk/network activity. Our bundled profile will begin in a paused state, requiring an un-pause for URI-fetching to actually begin, though this may be changed via a configuration property.
- Monitor the running crawl, pausing/unpausing/checkpointing as desired. The job's detail page may always be reloaded in the browser for the latest running statistics.
- Allow the crawl to reach its end or use the Terminate option. Even after termination, the crawl components remain alive in system memory for inspection/reporting.
- Teardown the crawl to discard all crawl components, to free memory or allow the job to be freshly 'built' and 'launched' again.
A crawl job may be relaunched repeatedly in the same job directory. (A new directory is not created for each launch.)
There have been some small but important changes in the syntax by which SURT Prefixes are specified.
First, when deducing an 'implied' SURT prefix from a seed URI, the presence or absence of a trailing '/' after the URI host portion (or more precisely, URI authority portion) is no longer significant. The trailing '/' is always assumed to be present – just as it is in browsers and on HTTP requests, and the deduced SURT prefix is always limited to that exact host (or authority). For example, in H3, either the seed "http://example.com" or "http://example.com/" (note trailing '/') are both treated as if they have the trailing slash. They both imply the SURT prefix "http://(com,example,)/", which is only a prefix of URIs that begin exactly "http://example.com/". If you want to accept all subdomains of 'example.com', you must supply a suitable SURT prefix yourself, for example "http://(com,example,".
Second, they syntax for supplying SURT prefixes is the same whether they are included in your seeds-source text, or a separate surts-source file. If you are going to provide a hand-crafted SURT prefix, it should be on a line that begins with the '+' directive, for example "+http://(com,example,", whether it is in your seeds file or a separate surt-prefixes file specified via the surtsSourceFile setting.
If you want the SURT prefix to be deduced from a normal URI or even naked hostname, these may be supplied in the seeds source text, with the seedsAsSurtPrefixes setting on, or in the surtsSourceFile. For example, consider the following short text file:
images.example.com
http://sounds.example.com
+videos.example.com
+http://(com,exampleresources,
If the above is supplied as seeds source text, then the first two lines will be interpreted as the seeds "http://images.example.com/" and "http://sounds.example.com/". Further, the two lines beginning '+' will be announced to any SURT-prefix-sensitive scoping rules – usually just a single one – and result in that rule including the SURT-prefixes "http://(com,example,videos,)/" (accepting any URLs on the videos.example.com domain) and "http://com.exampleresources," (accepting any URLs on exampleresources.com domain and any subdomains).
If the seedsAsSurtPrefixes setting is true on a particular SURT-prefix-sensitive scope rule (which is the default), then those two seeds will also result in that rule accepting URIs fitting the SURT-prefixes "http://(com,example,images,)/" and "http://(com,example,sounds,)/".
If that same text is supplied directly as SURTs-source-file, all lines supply SURT prefixes. (The 'seedsAsSurtPrefixes' setting is irrelevant.) Those lines beginning '' are considered to already be in SURT form; those without '' are first coerced into URIs (adding "http://" if necessary if a scheme isn't present) and then SURT prefixes are deduced.
The following 111 tracked issues are recorded as addressed in this 3.0 release:
http://webarchive.jira.com/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10160
type
key
summary
status
Unable to locate JIRA server for this macro. It may be due to Application Link configuration.
http://webarchive.jira.com/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10034
type
key
summary
status
Unable to locate JIRA server for this macro. It may be due to Application Link configuration.
http://webarchive.jira.com/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10102
type
key
summary
status
Unable to locate JIRA server for this macro. It may be due to Application Link configuration.
http://webarchive.jira.com/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10101
type
key
summary
status
Unable to locate JIRA server for this macro. It may be due to Application Link configuration.
The 3.0.0 "Create new job from recommended starting configuration" feature will produce a NullPointerException. A fix has been submitted (HER-1725 ) and will be included in the next incremental release. The recommended workaround is to navigate to the default jobs directory starter profile, and use its copy functionality, then revisit the "Engine" page, and press rescan. Your new job should appear in the list of job directories.
Structured Guides:
User Guide
- Introduction
- New Features in 3.0 and 3.1
- Your First Crawl
- Checkpointing
- Main Console Page
- Profiles
- Heritrix Output
- Common Heritrix Use Cases
- Jobs
- Configuring Jobs and Profiles
- Processing Chains
- Credentials
- Creating Jobs and Profiles
- Outside the User Interface
- A Quick Guide to Creating a Profile
- Job Page
- Frontier
- Spring Framework
- Multiple Machine Crawling
- Heritrix3 on Mac OS X
- Heritrix3 on Windows
- Responsible Crawling
- Politeness parameters
- BeanShell Script For Downloading Video
- crawl manifest
- JVM Options
- Frontier queue budgets
- BeanShell User Notes
- Facebook and Twitter Scroll-down
- Deduping (Duplication Reduction)
- Force speculative embed URIs into single queue.
- Heritrix3 Useful Scripts
- How-To Feed URLs in bulk to a crawler
- MatchesListRegexDecideRule vs NotMatchesListRegexDecideRule
- WARC (Web ARChive)
- When taking a snapshot Heritrix renames crawl.log
- YouTube
- H3 Dev Notes for Crawl Operators
- Development Notes
- Spring Crawl Configuration
- Potential Cleanup-Refactorings
- Future Directions Brainstorming
- Documentation Wishlist
- Web Spam Detection for Heritrix
- Style Guide
- HOWTO Ship a Heritrix Release
- Heritrix in Eclipse