A Quick Guide to Running Your First Crawl Job
The Main Console page is displayed after you have installed Heritrix and logged into the WUI.
- Enter the name of the new job in the text box with the "Create new
job with recommended starting configuration" label. Then click
"create."
The new job will be displayed in the list of jobs on the Main Console page. In Heritrix 3.0, the job is based on the profile-defaults profile. As of Heritrix 3.1, the profile-defaults profile has been eliminated. See Profiles for more information.
- Click on the name of the new job and you will be taken to the job
page.
The name of the configuration file, crawler-beans.cxml, will be displayed at the top of the page. Next to the name is an "edit" link.
- Click on the "edit" link and the contents of the configuration file will be displayed in an editable text area.
- At this point you must enter several properties to make the job
runnable.
- First, add a valid value to the metadata.operatorContactUrl property, such as http://www.archive.org.
- Next, populate the <prop> element of the longerOverrides bean with the seed values for the crawl. A test seed is configured for reference.
When done, click "save changes" at the top of the page. For more detailed information on configuring jobs, see Configuring Jobs and Profiles.
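As a rough illustration of the two edits described above, the fragment below sketches how the relevant parts of crawler-beans.cxml might look afterwards. It assumes the layout of the default configuration shipped with Heritrix 3, in which metadata.operatorContactUrl is set inside a simpleOverrides bean and the seeds sit under a seeds.textSource.value key. The simpleOverrides bean id, that key, and the seed URL are assumptions for illustration, so check them against your own file before saving.

```xml
<!-- Sketch only: bean ids and keys assume the stock Heritrix 3 crawler-beans.cxml -->
<bean id="simpleOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
# Contact URL for webmasters affected by the crawl (example value from this guide)
metadata.operatorContactUrl=http://www.archive.org
    </value>
  </property>
</bean>

<bean id="longerOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <props>
      <!-- One seed URL per line; the URL below is a placeholder, not a real seed -->
      <prop key="seeds.textSource.value">
http://www.example.org/
      </prop>
    </props>
  </property>
</bean>
```

Only the property value and the seed lines should need changing; the surrounding bean definitions are left as they appear in the file.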
- From the job screen, click "build." This command will build the Spring infrastructure needed to run the job. In the Job Log the following message will display: "INFO JOB instantiated."
- Next, click the "launch" button. This command launches the job in "paused" mode. At this point the job is ready to run.
- To run the job, click the "unpause" button. The job will now begin
sending requests to the seeds of your crawl. The status of the job
will be set to "Running." Refresh the page to see updated
statistics.
Note
- A job will not be modified if the profile or job it was based on is changed.
- Jobs based on the default profile are not ready to run as-is. The metadata.operatorContactUrl must be set to a valid value.
Detailed information about evaluating the progress of a job can be found at Job Analysis.