Release Notes 1.14.4
These are the project wiki Release Notes for the 1.14.4 release.
Release 1.14.4 is a 'micro' release with a number of small bugfixes and new requested features.
The 1.14.4 release is now available at TK.
Support for FTP transactions in WARC records (HER-1577)
Heritrix now supports recording full FTP transactions in WARC records. For each FTP URL retrieved, the control conversation is recorded in a WARC metadata record with Content-Type: application/ftp; msgtype=control-conversation, the payload data is recorded in a WARC resource record with Content-Type: application/ftp; msgtype=payload-data, and FTP fetch metadata (as well as outlinks) are recorded in a corresponding WARC metadata record.
Other WARC corrections (HER-1659)
Written WARC files now consistently identify as WARC version "1.0" (HER-1648) and will grow to the 1GB size recommended by the specification.
Several problems that caused errors when running Heritrix on Windows, related to improper quoting or path separators, have been corrected.
Seeds with Internationalized Domain Names (IDN) better supported (HER-1711)
Encoding problems which interfered with specification of some Internationalized Domain Name seeds have been corrected.
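As background for what an IDN seed looks like once encoded, the sketch below converts a Unicode host name to its ASCII (Punycode) form using the standard `java.net.IDN` class. It illustrates the encoding only and is not Heritrix's seed-handling code; the class name `IdnSeedSketch`, the helper `toAsciiHost`, and the sample host are assumptions made for illustration.

```java
import java.net.IDN;

public class IdnSeedSketch {
    /**
     * Converts a Unicode host name to its ASCII-compatible (Punycode) form.
     * Hypothetical helper for illustration; not part of Heritrix.
     */
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    public static void main(String[] args) {
        // The escaped character is a lowercase u with diaeresis, written as an
        // escape so the source file stays pure ASCII.
        String seedHost = "b\u00fccher.example";
        System.out.println(toAsciiHost(seedHost)); // prints xn--bcher-kva.example
    }
}
```

Host names that are already plain ASCII pass through unchanged.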
Hosts report expanded to include novel/duplicate bytes/URLs counts (HER-1650)
When the duplication-reduction/persist-history mechanisms are enabled on a crawl, crawl statistics now collect, and the 'Hosts' report now includes, counts of the URLs and total content bytes deemed either 'novel' or 'duplicate'.
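To make the four new per-host figures concrete, here is a minimal sketch of such a tally. It is not the Heritrix statistics code; the class `HostTallySketch`, its fields, and the `record` method are hypothetical names, and the novel/duplicate decision is assumed to be made elsewhere and passed in as a flag.

```java
import java.util.HashMap;
import java.util.Map;

public class HostTallySketch {
    /** Per-host counters: URL counts and content bytes, split novel vs. duplicate. */
    static final class Tally {
        long novelUrls, novelBytes, dupUrls, dupBytes;
    }

    private final Map<String, Tally> byHost = new HashMap<>();

    /** Records one fetched URL; 'duplicate' is decided by the caller (assumption). */
    void record(String host, long contentBytes, boolean duplicate) {
        Tally t = byHost.computeIfAbsent(host, h -> new Tally());
        if (duplicate) {
            t.dupUrls++;
            t.dupBytes += contentBytes;
        } else {
            t.novelUrls++;
            t.novelBytes += contentBytes;
        }
    }

    /** Returns the tally for a host, or null if nothing was recorded for it. */
    Tally get(String host) {
        return byHost.get(host);
    }
}
```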
Trailing '*' tolerated in robots.txt Disallow/Allow rules (HER-1620)
Heritrix will now tolerate a trailing '*' wildcard sometimes added by webmasters (though not necessary) in their robots.txt Disallow/Allow rules. (Leading or internal wildcards are not yet supported.)
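One way to picture the tolerated case: robots.txt rules are path prefixes, so a trailing '*' adds nothing and can be dropped before the prefix comparison. The sketch below assumes that approach; it is not taken from the Heritrix source, and `RobotsPrefixSketch`, `normalizeRule`, and `ruleMatches` are hypothetical names.

```java
public class RobotsPrefixSketch {
    /** Drops a single trailing '*' so "/private/*" behaves like "/private/". */
    static String normalizeRule(String rulePath) {
        if (rulePath.endsWith("*")) {
            return rulePath.substring(0, rulePath.length() - 1);
        }
        return rulePath;
    }

    /**
     * Plain prefix match against the normalized rule. Leading or internal
     * '*' characters are compared literally, i.e. not treated as wildcards.
     * Handling of empty rules is left to the caller.
     */
    static boolean ruleMatches(String rulePath, String urlPath) {
        return urlPath.startsWith(normalizeRule(rulePath));
    }

    public static void main(String[] args) {
        System.out.println(ruleMatches("/private/*", "/private/page.html")); // true
        System.out.println(ruleMatches("/private/", "/private/page.html"));  // true
        System.out.println(ruleMatches("/a*b", "/ab"));                      // false
    }
}
```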
A number of performance, memory-retention, and deadlock-risk issues occasionally affecting the implementation class CachedBdbMap were identified. Fixes have been applied, and the class has also been replaced with a simpler implementation, ObjectIdentityBdbCache, focused specifically on Heritrix's common use cases.
In addition to the usual suspects, this release includes contributed fixes or functionality from:
- Paul Baclace
- Sergey Khenkin
The following 44 tracked issues are recorded as addressed in this 1.14.4 release:
https://webarchive.jira.com/secure/ReleaseNote.jspa?projectId=10021&version=10105