Mirroring HTML Files Only
Suppose you want to crawl only URIs that match http://foo.org/bar/*.html, and you want to save the crawled files in a file/directory layout instead of in WARC files. Assume also that the web server is case-sensitive, so http://foo.org/bar/abc.html and http://foo.org/bar/ABC.HTML point to two different resources.
First, create a job with a single seed, http://foo.org/bar/. Configure the warcWriter bean so that its class is org.archive.modules.writer.MirrorWriterProcessor. This Processor will store files in a directory structure that matches the crawled URIs. The files will be stored in the crawl job's mirror directory.
Configure the following DecideRules, in this order, in the rules property of the scope bean:
RejectDecideRule
SurtPrefixedDecideRule
TooManyHopsDecideRule
PathologicalPathDecideRule
TooManyPathSegmentsDecideRule
NotMatchesFilePatternDecideRule
PrerequisiteAcceptDecideRule
Use the NotMatchesFilePatternDecideRule to keep Heritrix from crawling any URIs that don't end with .html. It is important that this DecideRule be placed immediately before the PrerequisiteAcceptDecideRule. Otherwise, the DNS and robots.txt prerequisites will be rejected, since they won't match the regular expression.
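The effect of rule order can be sketched with a toy model. The Java below is not Heritrix code: the `Rule` interface, the three stand-in methods, and the `RECOMMENDED` and `MISPLACED` arrays are hypothetical names made up for illustration. It assumes that a rule sequence starts from REJECT and that the last rule returning a non-neutral decision wins, and it treats `dns:` URIs and `/robots.txt` as the only prerequisites.

```java
class RuleOrderSketch {
    enum Decision { ACCEPT, REJECT, NONE }

    // Hypothetical one-method rule type; not Heritrix's DecideRule class.
    interface Rule { Decision apply(String uri); }

    // Stand-in for SurtPrefixedDecideRule: accept anything under the seed.
    static Decision underSeed(String uri) {
        return uri.startsWith("http://foo.org/bar/") ? Decision.ACCEPT : Decision.NONE;
    }

    // Stand-in for NotMatchesFilePatternDecideRule with decision REJECT.
    static Decision notHtml(String uri) {
        return uri.matches(".*(/|\\.html)$") ? Decision.NONE : Decision.REJECT;
    }

    // Stand-in for PrerequisiteAcceptDecideRule (simplified prerequisite test).
    static Decision prerequisite(String uri) {
        boolean isPrerequisite = uri.startsWith("dns:") || uri.endsWith("/robots.txt");
        return isPrerequisite ? Decision.ACCEPT : Decision.NONE;
    }

    // Order recommended in the text: file pattern rule right before the prerequisite rule.
    static final Rule[] RECOMMENDED = {
        RuleOrderSketch::underSeed, RuleOrderSketch::notHtml, RuleOrderSketch::prerequisite
    };

    // File pattern rule placed last, after the prerequisite rule.
    static final Rule[] MISPLACED = {
        RuleOrderSketch::underSeed, RuleOrderSketch::prerequisite, RuleOrderSketch::notHtml
    };

    // Assumption: start from REJECT; the last non-NONE decision wins.
    static Decision decide(Rule[] rules, String uri) {
        Decision result = Decision.REJECT;
        for (Rule rule : rules) {
            Decision d = rule.apply(uri);
            if (d != Decision.NONE) {
                result = d;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] uris = { "dns:foo.org", "http://foo.org/robots.txt", "http://foo.org/bar/abc.html" };
        for (String uri : uris) {
            System.out.println(uri + " recommended=" + decide(RECOMMENDED, uri)
                    + " misplaced=" + decide(MISPLACED, uri));
        }
    }
}
```

In this model the two prerequisite URIs are accepted only when the file pattern rule comes before the prerequisite rule, while http://foo.org/bar/abc.html is accepted in both orders.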
For the NotMatchesFilePatternDecideRule, set the following property values:
decision: REJECT
usePreset: CUSTOM
regex: .*(/|\.html)$
Note that the regexp accepts URIs that end with a / as well as .html. If we don't accept the / character, the seed URI will be rejected. This also allows us to accept URIs like http://foo.org/bar/dir/, which likely point to an index.html. A stricter regexp would be .*\.html$, but the seed URI must be changed if this regexp is used. Be aware that if Heritrix encounters a URI like http://foo.org/bar/dir, where dir is a directory, the URI will be rejected because it is missing the terminating slash.
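To try the regexp against sample URIs before starting a crawl, a small standalone check such as the following may help. It is only a sketch using `java.util.regex` directly. The class name `HtmlOnlyRegexDemo` and the `kept` helper are hypothetical, and the sketch assumes the rule requires the whole URI string to match the pattern.

```java
import java.util.regex.Pattern;

class HtmlOnlyRegexDemo {
    // The regexp from the property values above.
    static final Pattern KEEP = Pattern.compile(".*(/|\\.html)$");

    // True when the whole URI matches, i.e. the rule would not reject it
    // (assumption: whole-string matching).
    static boolean kept(String uri) {
        return KEEP.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://foo.org/bar/",          // the seed: ends with /
            "http://foo.org/bar/abc.html",  // ends with .html
            "http://foo.org/bar/dir/",      // directory with a trailing slash
            "http://foo.org/bar/dir",       // no trailing slash
            "http://foo.org/bar/ABC.HTML",  // java.util.regex is case-sensitive by default
            "http://foo.org/robots.txt",    // prerequisite
            "dns:foo.org"                   // prerequisite
        };
        for (String uri : uris) {
            System.out.println((kept(uri) ? "kept     " : "rejected ") + uri);
        }
    }
}
```

The first three URIs match and the last four do not. Because the pattern is case-sensitive in this sketch, http://foo.org/bar/ABC.HTML does not match, so a crawl that should also fetch upper-case extensions would need a different regexp.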
Finally, Heritrix must be configured to differentiate between abc.html and ABC.HTML. Do this by removing the LowercaseRule from the canonicalizationPolicy bean.