ARC File Format
Original at: http://www.archive.org/web/researcher/ArcFileFormat.php
Authors: Mike Burner and Brewster Kahle
Date: September 15, 1996, Version 1.0
Internet Archive
The Archive stores the data it collects in large (currently 100MB)
aggregate files for ease of storage in a conventional file system. It
is the Archive's experience that it is difficult to manage hundreds of
millions of small files in most existing file systems.
This document describes the format of the aggregate files. The file
format was designed to meet several requirements:
- The file must be self-contained: it must permit the aggregated
objects to be identified and unpacked without the use of a
companion index file.
- The format must be extensible to accommodate files retrieved
via a variety of network protocols, including http, ftp, news,
gopher, and mail.
- The file must be "streamable": it must be possible to
concatenate multiple archive files in a data stream.
- Once written, a record must be viable: the integrity of the
file must not depend on subsequent creation of an in-file index
of the contents.

The reader will quickly recognize, however, that an external
index of the contents and object-offsets will greatly enhance
the retrievability of objects stored in this format. The Archive
maintains such indices, but does not seek to standardize their
format.
The description below uses pseudo-BNF to describe the archive file
format. By convention, archive files are named with a ".arc" extension
(e.g., "IA-000001.arc").
arc_file == <version_block><rest_of_arc_file>
version_block == See definition below
rest_of_arc_file == <doc>|<doc><rest_of_arc_file>
doc == <nl><URL-record><nl><network_doc>
URL-record == See definition below
network_doc == whatever the protocol returned
nl == Unix-newline-delimiter
sp == ' ' (ascii space); a comma is inappropriate as a separator because it can appear in a URL.
The version block identifies the original filename, file version, and URL record fields of the archive file.
version-block == filedesc://<path><sp><version specific data><sp><length><nl>
<version-number><sp><reserved><sp><origin-code><nl>
<URL-record-definition><nl>
<nl>
version-1-block == filedesc://<path><sp><ip_address><sp><date><sp>text/plain<sp><length><nl>
1<sp><reserved><sp><origin-code><nl>
URL<sp>IP-address<sp>Archive-date<sp>Content-type<sp>Archive-length<nl>
<nl>
version-2-block == filedesc://<path><sp><ip_address><sp><date><sp>text/plain<sp>200<sp>
-<sp>-<sp>0<sp><filename><sp><length><nl>
2<sp><reserved><sp><origin-code><nl>
URL<sp>IP-address<sp>Archive-date<sp>Content-type<sp>Result-code<sp>Checksum<sp>Location<sp>Offset<sp>Filename<sp>Archive-length<nl>
<nl>
The "filedesc" line is a special-case URL record (see below). The path
is the original path name of the archive file. The IP address is the
address of the machine that created the archive file. The date is the
date the archive file was created. The content type of "text/plain"
simply refers to the remainder of the version block. The length
specifies the size, in bytes, of the rest of the version block.
version-number == integer in ascii
reserved == string with no white space
origin-code == Name of gathering organization with no white space
URL-record-definition == names of fields in URL records
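As an illustration of the definitions above, the following is a minimal Python sketch that splits the body of a version block (everything after the "filedesc" line) into its parts. The function name parse_version_block_body and the dictionary keys are hypothetical and not part of the format. The sketch assumes the origin code is everything after the second space on the first line, because the example later in this document ("Alexa Internet") contains a space.

```python
def parse_version_block_body(text):
    """Split the two lines that follow the filedesc line into named parts.

    `text` is the decoded version block body, for example
    "1 0 Alexa Internet\nURL IP-address ...\n".
    """
    lines = text.split("\n")
    # First line: <version-number><sp><reserved><sp><origin-code>.
    # maxsplit=2 keeps any spaces inside the origin code (an assumption
    # made here so that "Alexa Internet" survives intact).
    version, reserved, origin_code = lines[0].split(" ", 2)
    # Second line: the space-separated names of the URL record fields.
    field_names = lines[1].split(" ")
    return {
        "version": int(version),
        "reserved": reserved,
        "origin_code": origin_code,
        "field_names": field_names,
    }
```

The number of field names can then serve as a sanity check on each URL record read from the same file.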
The URL record introduces an object in the archive file. It gives the
name and size of the object, as well as several pieces of metadata
about its retrieval.
URL-record-v1 == <url><sp>
<ip-address><sp>
<archive-date><sp>
<content-type><sp>
<length><nl>
URL-record-v2 == <url><sp>
<ip-address><sp>
<archive-date><sp>
<content-type><sp>
<result-code><sp>
<checksum><sp>
<location><sp>
<offset><sp>
<filename><sp>
<length><nl>
url == ascii URL string (e.g., "http://www.alexa.com:80/")
ip_address == dotted-quad (e.g., 192.216.46.98 or 0.0.0.0)
archive-date == date archived
content-type == "no-type"|MIME type of data (e.g., "text/html")
length == ascii representation of size of network doc in bytes
date == YYYYMMDDhhmmss (Greenwich Mean Time)
result-code == result code or response code, (e.g. 200 or 302)
checksum == ascii representation of a checksum of the data. The checksum algorithm is implementation specific.
location == "-"|url of re-direct
offset == offset in bytes from beginning of file to beginning of URL-record
filename == name of arc file
Note that all field values are ascii text. All fields have at least one character. No field value contains a space.
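Because no field value contains a space, a URL record can be split on single spaces. The following Python sketch does so for both record versions. The names parse_url_record, V1_FIELDS and V2_FIELDS are hypothetical. Converting the length and offset to integers and the date to a UTC datetime is a convenience chosen here, not something the format requires.

```python
from datetime import datetime, timezone

# Field order follows the URL-record-v1 and URL-record-v2 definitions.
V1_FIELDS = ["url", "ip_address", "archive_date", "content_type", "length"]
V2_FIELDS = ["url", "ip_address", "archive_date", "content_type",
             "result_code", "checksum", "location", "offset",
             "filename", "length"]


def parse_url_record(line, version):
    """Parse one URL record line (text, with or without its newline)."""
    fields = V1_FIELDS if version == 1 else V2_FIELDS
    values = line.rstrip("\n").split(" ")
    if len(values) != len(fields):
        raise ValueError(
            "expected %d fields, found %d" % (len(fields), len(values)))
    record = dict(zip(fields, values))
    # length is the size of the network doc in bytes.
    record["length"] = int(record["length"])
    if "offset" in record:
        record["offset"] = int(record["offset"])
    # Dates are YYYYMMDDhhmmss in Greenwich Mean Time.
    record["archive_date"] = datetime.strptime(
        record["archive_date"], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return record
```

A record with the wrong number of fields raises ValueError rather than being silently misaligned, which helps detect a reader that has lost its place in the file.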
In the following example, please remember that length includes
carriage returns and line feeds.
filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
URL IP-address Archive-date Content-type Archive-length
http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>
Hello World!!!
</HTML>
filedesc://IA-001102.arc 0.0.0.0 19960923142103 text/plain 200 - - 0
IA-001102.arc 122
2 0 Alexa Internet
URL IP-address Archive-date Content-type Result-code Checksum
Location Offset Filename Archive-length
http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103
text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>
Hello World!!!
</HTML>
As noted above, the best way to retrieve a specific object from an
archive file is to maintain an external database of object names, the
files they are located in, their offsets within the files, and the
sizes of the objects. Then, to retrieve the object, one need only open
the file, seek to the offset, and do a single read of <size> bytes.
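The indexed retrieval described above might look like the following Python sketch. The function name read_object is hypothetical, and the sketch assumes the external index stores the offset of the first byte of the object itself along with its size. An index that instead records the offset of the URL record would need to read and skip that line first.

```python
def read_object(arc_path, offset, size):
    """Return `size` bytes starting at byte `offset` of the archive file."""
    # Binary mode matters: offsets and sizes are counted in bytes.
    with open(arc_path, "rb") as arc_file:
        arc_file.seek(offset)
        data = arc_file.read(size)
    if len(data) != size:
        raise ValueError("archive file is shorter than the index claims")
    return data
```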
Programs that need to read the file without an index (such as to
unpack the whole file) should use buffered I/O. The URL record can
then be read with an fgets(), and the objects can be read with an
fread() of <size> bytes.
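A sequential reader along these lines can be sketched in Python, where readline() and read() on a buffered binary stream play the roles of fgets() and fread(). The function name iter_records is hypothetical. The sketch assumes that each object begins immediately after its URL record's newline (as the examples in this document suggest), that the last field of every record is the length, and that blank lines between records are separators that can be skipped. The "filedesc" line is treated like any other URL record, so the first item yielded is the version block.

```python
import io


def iter_records(stream):
    """Yield (fields, body) pairs from a buffered binary ARC stream."""
    while True:
        line = stream.readline()           # the fgets() step
        if not line:
            return                         # end of file
        if line == b"\n":
            continue                       # separator between records
        fields = line.rstrip(b"\n").decode("ascii").split(" ")
        length = int(fields[-1])           # length is always the last field
        body = stream.read(length)         # the fread() step
        if len(body) != length:
            raise ValueError("truncated record for " + fields[0])
        yield fields, body


if __name__ == "__main__":
    # Tiny in-memory archive; the names and addresses are made up.
    rest = b"1 0 Example\nURL IP-address Archive-date Content-type Archive-length\n"
    data = (b"filedesc://test.arc 0.0.0.0 19960923142103 text/plain %d\n" % len(rest)
            + rest + b"\n"
            + b"http://example.com/ 127.0.0.1 19961104142103 text/html 5\n"
            + b"hello\n")
    for fields, body in iter_records(io.BytesIO(data)):
        print(fields[0], len(body))
```

With a real file, the stream would come from open(path, "rb"), which is buffered by default in Python.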
Since the Archive format uses the standard URL specification to
identify objects, it naturally lends itself to the storage of data
retrieved via protocols other than HTTP. For example, a news article
might appear as follows:
news:[email protected] 127.10.100.3 19960929142103 text/plain 328
Path: news.alexa.com!news1.best.com!news.dryswamp.edu!joebob
From: [email protected]
Newsgroups: alt.food
Subject: Re: I'm hungry
Date: 28 SEP 96 21:02:47 GMT
Organization: Dry Swamp University
Lines: 1
Message-ID: <[email protected]>
NNTP-Posting-Host: alligator.dryswamp.edu
please contact [email protected]