Release Notes - Heritrix 3.4.0-20200304
Alex Osborne edited this page Mar 5, 2020 · 4 revisions
Summary of changes since Heritrix 3.4.0-20190418 (see the full changelog for more details).
This release updates the Berkeley Database (BDB) from the very old version 4.1.6 to version 7.5.11. This resolves a long-standing bug when recovering from checkpoints multiple times, but it also means that Heritrix state files from previous versions are not compatible with this version. In other words:
Any crawl state folders from previous versions of Heritrix are not compatible with this version! You can only use this new release with new crawls!
- ExtractorYoutubeDL enables the discovery of video URLs using the external tool youtube-dl. #257 (nlevitt)
- WARC writing is now configurable with its own processor chain, making it easier to write extra records. #285 (nlevitt)
- MatchesListDecideRule gained a timeoutPerRegexSeconds option to help debug runaway regular expressions. #290 (csrster)
- Added support for forced queue assignment and parallel queues. #286 (adam-miller)
- JDK 11 is now supported. #269-#273 (ato)
- BDB was upgraded to 7.5.11. See the warning at the top. #281 (anjackson)
- Heritrix now uses Guava's bloom filter and base32 encoder. #300, #304 (hennekey)
- JDK 7 is no longer supported. #269-#273 (ato)
- A number of performance and reliability improvements were made to the unit tests.
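
To illustrate why a per-regex timeout such as timeoutPerRegexSeconds helps when tracking down runaway regular expressions, here is a minimal Java sketch of one common way to bound a match attempt: wrap the input in a CharSequence that checks a deadline on every character read. This is not taken from the Heritrix source and may differ from how the option is actually implemented; the names RegexTimeoutSketch, DeadlineCharSequence, RegexTimeoutException and matchesWithin are hypothetical and exist only for this example.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: NOT the Heritrix implementation.
// All class and method names below are hypothetical.
final class RegexTimeoutSketch {

    /** Thrown when a match attempt exceeds its time budget. */
    public static final class RegexTimeoutException extends RuntimeException {
        public RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** Wraps the input so that every character read checks the deadline. */
    public static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        public DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // Subtraction-based comparison is safe against nanoTime overflow.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new RegexTimeoutException("regex exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // Sub-sequences share the same deadline as the original input.
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs a full match, aborting with an exception once the budget is spent. */
    public static boolean matchesWithin(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("https?://.*\\.mp4");
        // A generous budget: the match completes normally.
        System.out.println(matchesWithin(p, "http://example.org/video.mp4", 1000));
        try {
            // A zero budget is exhausted on the first character read.
            matchesWithin(p, "http://example.org/video.mp4", 0);
        } catch (RegexTimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

With a guard like this, an expression that backtracks excessively on some URI fails with a clear exception instead of stalling the thread, which makes the offending pattern easier to identify.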