Users of Heritrix
Alex Osborne edited this page Sep 1, 2022
If your project, institution, or company uses Heritrix, please feel free to add details below.
If you'd like to cite Heritrix you can reference the following paper:
@inproceedings{mohr2004introduction,
title={Introduction to heritrix},
author={Mohr, Gordon and Stack, Michael and Ranitovic, Igor and Avery, Dan and Kimpton, Michele},
booktitle={4th International Web Archiving Workshop},
pages={109--115},
year={2004},
organization={Citeseer}
}
- Internet Archive - leads open-source development of Heritrix; uses Heritrix as the crawler for numerous focused/thematic and broad crawling projects, including the Archive-It service and crawls feeding our Wayback Machine and partner collections. Versions most often used: Heritrix 1.14.4-SNAPSHOT, 3.0.0-SNAPSHOT.
- The British Library - uses Heritrix 1.14.3 as the crawler for our Domain Research Project. Collaborated with the National Library of New Zealand to develop the Web Curator Tool, which also uses Heritrix as its underlying crawler and is presently used to drive the UK Web Archive.
- The Library of Congress - works with the Internet Archive to help build focused and thematic collection Web archives. The Library's Web Archiving Team has also begun in-house crawling, currently using Heritrix 3.0.X, for selected projects.
- California Digital Library, Web Archiving Service
- BNCF, Biblioteca Nazionale Centrale Firenze - uses Heritrix 3 and WARC 1.0 to archive electronic doctoral theses from university repositories.
- Smithsonian Institution Archives - is testing Heritrix 1.14.3 on a Windows machine to capture its numerous websites and social networking sites.
- Netarchive.dk uses Heritrix 1.14.3 integrated within NetarchiveSuite to harvest the Danish internet.
- National and University Library of Iceland - has actively participated in the development of Heritrix since the project's inception. Has used Heritrix 1 to conduct domain and targeted crawls since 2004, and Heritrix 3 since 2010.
- The French National Library (BnF) - has used Heritrix for production crawls since the end of 2006. It performed the first French national domain crawl with Heritrix/NetarchiveSuite in spring 2010.
- The Austrian National Library - has used Heritrix since the beginning of Web@rchive Austria in 2008.
- The Biblioteca de Catalunya (BC), the National Library of Catalonia - initiated a project called PADICAT (Digital Heritage of Catalonia) in June 2005. It uses Heritrix 1.14.4 to crawl the .CAT top-level domain, to selectively compile the web output of Catalan organizations, and for focused harvesting of public events. These crawls feed Wayback 1.4.2 (URL search) and WERA (keyword search) and are available in open access through PADICAT.
- [add yourself here]
- neofonie - uses Heritrix 1.14.4 and 3.0.x to gather unstructured data. After the data has been processed, enriched and classified, it is then used in search engines and web applications.
- Dataclip - uses Heritrix 3 to crawl the top 10 million business websites to offer sales and marketing intelligence based on what web technologies are used by those organizations.
- TNR Global - uses Heritrix to crawl web content. E.g. slide #12 in this presentation: Migration from Fast ESP to Lucene Solr - Michael McIntosh
- York University Libraries - uses Heritrix 3 to crawl local web content for common records schedule.
- [add yourself here]
- CiteSeerX - uses Heritrix 1.14.4 and 3.0.x to crawl open access academic documents online.
- Web Archiving Integration Layer (WAIL) bundles a pre-configured Heritrix 3.2.0 binary with other personal web archiving tools into a native application and provides a graphical user interface for access.
- [add yourself here]