---
output: rmarkdown::github_document
---
`warc` : Tools to Work with the Web Archive Ecosystem
[WARC](http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml) files (and the metadata files that usually follow them) are the de facto method of archiving web content. There are tools in Python & Java to work with this data and there are many "big data" tools that make working with large-scale data from sites like [Common Crawl](http://commoncrawl.org/) and [The Internet Archive](https://archive.org/index.php) very straightforward.
Now there are tools to create and work with the WARC ecosystem in R.
Possible use-cases:
- Scraping data from many URLs when you want the analyses on that data to be reproducible, are concerned that the sites may change format or go offline, and don't want to manage individual HTML (etc) files
- Analyzing Common Crawl data (etc) natively in R
- Saving the entire state of an `httr` request (`warc` can turn `httr` responses into WARC files and turn WARC `response` records into `httr::response` objects)
`warc` can work with WARC files that are composed of individual gzip streams or with plaintext WARC files, and can also read & generate CDX files. Support for more file types (e.g. WET, WAT, etc) is planned.
Since I ended up making some gz file functions for this package, it only seemed appropriate to expose them.
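To see why per-record gzip streams matter, here is a minimal base R sketch. It does not use `warc` at all: `gzip_member()` and `read_member_at()` are hypothetical helpers and the three "records" are made-up strings. Each record is gzipped on its own and the members are concatenated, so a single record can be inflated starting from its byte offset without decompressing the rest of the file.

```r
# Hypothetical helper (not part of warc): gzip some text into its own
# gzip member and return the compressed bytes as a raw vector.
gzip_member <- function(txt) {
  tmp <- tempfile(fileext = ".gz")
  on.exit(unlink(tmp))
  con <- gzfile(tmp, "w")
  writeLines(txt, con)
  close(con)
  readBin(tmp, "raw", n = file.size(tmp))
}

# Made-up stand-ins for WARC records
members <- lapply(c("record one", "record two", "record three"), gzip_member)

# Byte size of each member and the offset at which each one starts
sizes   <- lengths(members)
offsets <- cumsum(c(0, head(sizes, -1)))

# Concatenate the independent gzip members into one file
combined <- tempfile(fileext = ".gz")
writeBin(unlist(members), combined)

# Hypothetical helper: inflate only the member that starts at `offset`
read_member_at <- function(path, offset, size) {
  con <- file(path, "rb")
  on.exit(close(con))
  invisible(seek(con, where = offset))
  bytes <- readBin(con, "raw", n = size)
  tmp <- tempfile(fileext = ".gz")
  on.exit(unlink(tmp), add = TRUE)
  writeBin(bytes, tmp)
  readLines(gzfile(tmp))
}

second <- read_member_at(combined, offsets[2], sizes[2])
second
# "record two"

unlink(combined)
```

An index only needs to store the offset (and, in this sketch, the size) of each member to get this kind of random access.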
The following functions are implemented:
- `as_warc`: Convert an ‘httr::response’ object to WARC response objects
- `create_cdx`: Takes as input an optionally compressed WARC file and creates a
- `create_warc_wget`: Newer versions of ‘wget’ are designed to support capturing of or
- `gz_close`: Close the gz file
- `gz_eof`: Test for end of file
- `gz_flush`: This will flush all zlib output buffers for the current file and
- `gz_fseek`: Sets the starting position for the next ‘gz_read()’ or
- `gz_gets`: Read a line from a gz file
- `gz_gets_raw`: Read a line from a gz file
- `gz_offset`: Return the current raw compressed offset in the file
- `gz_open`: Open a gzip file for reading or writing
- `gz_read_char`: Read from a gz file into a character vector
- `gz_read_raw`: Read from a gz file into a raw vector
- `gz_seek`: Sets the starting position for the next ‘gz_read()’ or
- `gz_tell`: Return the current raw uncompressed offset in the file
- `gz_write_char`: Write an atomic character vector to a file
- `gz_write_raw`: Write a raw vector to a gz file
- `gzip_inflate_from_pos`: Given a gzip file that was built with concatenated individual gzip
- `read_cdx`: CDX files are used to index the content of WARC files.
- `read_warc_entry`: Given the path to a WARC file (compressed or uncompressed) and the
- `warc_headers`: Extract WARC headers from a WARC response object
- `write_warc_record`: Write a WARC record to a file
### Installation
You need `wget` on your system `PATH`. Folks on real operating systems can do the `apt-get`, `yum install` or `brew install` (et al) dance for their particular system. Version 1.18+ is recommended, but any version with support for WARC extensions should do.
Windows folks will need to grab the statically linked [32-bit](https://eternallybored.org/misc/wget/current/wget.exe) or [64-bit](https://eternallybored.org/misc/wget/current/wget64.exe) binaries from [here](https://eternallybored.org/misc/wget/) and put them on your system `PATH` somewhere if you want to create WARC files in bulk using `wget`.
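A quick way to check from R whether `wget` is visible on the `PATH` is sketched below, using base R only. The check for WARC support is an assumption: it relies on WARC-capable builds listing the `--warc-file` option in their `--help` output.

```r
# Look for wget on the PATH; Sys.which() returns "" when it is not found
wget_path <- Sys.which("wget")

if (nzchar(wget_path)) {
  message("Found wget at: ", wget_path)
  # Assumption: builds with WARC support list --warc-file in their help text
  help_txt <- suppressWarnings(
    system2(wget_path, "--help", stdout = TRUE, stderr = TRUE)
  )
  has_warc <- any(grepl("--warc-file", help_txt, fixed = TRUE))
  message("WARC options advertised: ", has_warc)
} else {
  message("wget not found on PATH")
}
```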
```{r eval=FALSE}
devtools::install_git("https://gitlab.com/hrbrmstr/warc.git")
```
```{r echo=FALSE, message=FALSE, warning=FALSE, error=FALSE}
options(width=120)
```
### Usage
```{r message=FALSE, warning=FALSE}
library(warc)
library(httr)
# current version
packageVersion("warc")

# read the CDX index bundled with the package
cdx <- read_cdx(system.file("extdata", "20160901.cdx", package="warc"))

# locate the 5th indexed record: WARC file path and byte offset
i <- 5
path <- file.path(cdx$warc_path[i], cdx$file_name[i])
start <- cdx$compressed_arc_file_offset[i]

# read that record and inspect it with warc and httr helpers
entry <- read_warc_entry(path, start)
print(entry)
print(warc_headers(entry))
print(status_code(entry))
print(http_type(entry))
```
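For orientation on where WARC headers come from: a WARC record starts with a version line, followed by `Name: value` header fields, a blank line, and then the payload. The base R sketch below splits the headers out of one record. It is not how `warc_headers()` is implemented; `parse_warc_header()` is a hypothetical helper and the record is an abbreviated, made-up example.

```r
# Hypothetical helper (not part of warc): parse the header block of one
# plaintext WARC record into a named character vector.
parse_warc_header <- function(lines) {
  # The header block ends at the first empty line
  blank <- which(lines == "")[1]
  if (is.na(blank)) blank <- length(lines) + 1
  hdr <- lines[seq_len(blank - 1)]

  version <- hdr[1]   # e.g. "WARC/1.0"
  fields  <- hdr[-1]  # "Name: value" lines

  # Split each field on its first colon only (values may contain colons)
  keys <- sub(":.*$", "", fields)
  vals <- trimws(sub("^[^:]*:", "", fields))

  c(version = version, setNames(vals, keys))
}

# Abbreviated, made-up record for illustration
record <- c(
  "WARC/1.0",
  "WARC-Type: response",
  "WARC-Target-URI: http://example.com/",
  "WARC-Date: 2016-09-01T00:00:00Z",
  "",
  "HTTP/1.1 200 OK"
)

hdr <- parse_warc_header(record)
hdr[["WARC-Type"]]
# "response"
```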
### Creating + reading
```{r message=FALSE, warning=FALSE}
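library(warc)
library(httr)   # for content()
library(purrr)
library(rvest)

# work in a scratch directory
warc_dir <- file.path(tempdir(), "rfolks")
dir.create(warc_dir)

urls <- c("http://rud.is/",
          "http://hadley.nz/",
          "http://dirk.eddelbuettel.com/",
          "https://jeroenooms.github.io/",
          "https://ironholds.org/")

# fetch the URLs with wget into a WARC file
create_warc_wget(urls, warc_dir, warc_file="rfolks-warc")

# read the CDX index that goes with the WARC file
cdx <- read_cdx(file.path(warc_dir, "rfolks-warc.cdx"))

# read each indexed record back from the WARC file
sites <- map(seq_len(nrow(cdx)),
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]),
                              cdx$compressed_arc_file_offset[.]))

# pull the <title> text out of each page
map(sites, ~read_html(content(., as="text", encoding="UTF-8"))) %>%
  map_chr(~html_text(html_nodes(., "title")))

# unlink() only removes directories when recursive=TRUE
unlink(warc_dir, recursive=TRUE)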
```
### Test Results
```{r message=FALSE, warning=FALSE}
library(warc)
library(testthat)
date()
test_dir("tests/")
```