Release Notes 1.14.0
These are the project wiki Release Notes for the 1.14.0 release.
Release 1.14.0 adds a number of small features to the Heritrix 1.x line, most notably upgrading support for the WARC archived-web-content format to version 0.17 (ISO Committee Draft). This release also includes 41 bug fixes or other incremental improvements, including several based on community contributions or requests.
The 1.14.0 release is now available at the archive-crawler SourceForge project.
WARC/0.17 support (HER-1180)
The WARC support now matches the 0.17 specification version (ISO Committee Draft). The prefix 'Experimental' has been removed from WARC support class names.
A new TopmostAssignedSurtQueueAssignmentPolicy assigns URIs to queues based on the information from publicsuffix.org. Specifically, the queue name will be based on the SURT form of the topmost domain that may be assigned from a name registry. This tends to group related subdomains in the same queue.
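The grouping described above can be illustrated with a minimal sketch. This is not the Heritrix implementation: the class name `TopmostAssignedSurtSketch`, the method `queueKeyFor`, and the four-entry suffix set are all hypothetical, and the exact key format Heritrix emits may differ. The sketch assumes a queue key made of the public suffix plus one further label, written in SURT order (labels reversed, comma-terminated).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: not the actual TopmostAssignedSurtQueueAssignmentPolicy code.
final class TopmostAssignedSurtSketch {

    // Tiny hand-picked subset (assumption); the real list is published at publicsuffix.org.
    private static final Set<String> PUBLIC_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "uk", "co.uk"));

    // Returns a SURT-style key for the topmost registrable domain of the host.
    static String queueKeyFor(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");

        // Find the longest known public suffix by testing the longest candidates first.
        int suffixStart = labels.length;
        for (int i = 0; i < labels.length; i++) {
            String candidate =
                    String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (PUBLIC_SUFFIXES.contains(candidate)) {
                suffixStart = i;
                break;
            }
        }

        // Keep the suffix plus one label to its left; with no known suffix, keep the whole host.
        int keepFrom = (suffixStart == labels.length) ? 0 : Math.max(0, suffixStart - 1);

        // Emit the kept labels right to left, each followed by a comma (SURT ordering).
        StringBuilder surt = new StringBuilder();
        for (int i = labels.length - 1; i >= keepFrom; i--) {
            surt.append(labels[i]).append(',');
        }
        return surt.toString();
    }
}
```

Under these assumptions, `www.example.co.uk` and `images.example.co.uk` both map to the same key, which is why related subdomains end up in one queue.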
Hosts Report (HER-1254)
The hosts report automatically dumped at the end of a crawl has two additional fields per listed host: number of URIs discovered but not fetched due to robots.txt rules, and number of URIs still pending/queued when the crawl ended.
BdbFrontier "dump-pending-at-close" Option (HER-1255)
BdbFrontier has a new 'expert' setting, "dump-pending-at-close". If true, during crawl termination, all URIs still pending/queued will be logged to the crawl.log with status '0' (untried).
JMX 'dumpUris' operation (HER-1154)
CrawlJob has a new 'dumpUris' JMX operation. It provides options similar to the view URIs option in the web admin UI, but dumps the URIs to a local file.
Two distinct risks of triggering an OutOfMemoryError have been removed. The first was heap memory exhaustion in large crawls requiring many queues. The second was non-heap memory exhaustion in fast crawls needing little heap memory, where garbage-collection-triggered finalization may lag.
Two settings with potentially confusing names have been renamed. 'overly-eager-link-detection' in ExtractorHTML and JerichoExtractorHTML, with a default value of 'true', has been renamed 'extract-value-attributes' to more accurately reflect its effect. 'bind-address' in FetchHTTP, with a default value of the empty string, has been renamed 'http-bind-address', for consistency with 'http-proxy-host' and 'http-proxy-port' and to avoid confusion with the admin web UI bind address.
If you use order.xml configuration files from a previous version that still contain the old setting names, Heritrix will log non-fatal alert warnings about "Unknown attribute". To avoid these warnings, either rename the old settings in the order.xml or, if you are happy with the default values, simply delete the old settings.
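The renaming step can be scripted. The following is a minimal sketch, assuming each setting appears in order.xml as an element carrying a `name="..."` attribute; verify this against your own files before relying on it. The class `OrderXmlSettingRenamer` and its method are hypothetical helpers, not part of Heritrix.

```java
// Illustrative helper (hypothetical): rewrites the two renamed setting names in order.xml text.
final class OrderXmlSettingRenamer {

    // Assumes settings are written as name="..." attributes. Matching the full
    // quoted attribute value leaves an existing name="http-bind-address" untouched.
    static String renameSettings(String orderXml) {
        return orderXml
                .replace("name=\"overly-eager-link-detection\"",
                         "name=\"extract-value-attributes\"")
                .replace("name=\"bind-address\"",
                         "name=\"http-bind-address\"");
    }
}
```

Reading the file into a string, passing it through `renameSettings`, and writing it back (after keeping a backup copy) should be enough for a one-off migration.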
In addition to the usual suspects, this release includes contributed fixes or functionality from:
- Matt Sanford
- Eric C. Jensen
- Kohei TAKEDA
The following tracked issues are recorded as addressed in this 1.14.0 release:
http://webteam.archive.org/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10020