Commit

Remove s3 examples.

ruebot committed Jul 24, 2024
1 parent 2a954a7 commit d72ed9f
Showing 5 changed files with 4 additions and 231 deletions.
69 changes: 0 additions & 69 deletions docs/aut-at-scale.md
@@ -52,72 +52,3 @@ material, although it may take a while to process. This command then works:
```shell
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --jars /path/to/aut-1.1.0-fatjar.jar
```

## Reading Data from AWS S3

We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/).
This advanced functionality requires that you provide Spark shell with your AWS
Access Key and AWS Secret Key, which you will get when creating your AWS
credentials ([read more
here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).

This script, for example, will find the top ten domains from a set of WARCs
found in an S3 bucket.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Reading Data from an S3-like Endpoint

We also support loading data stored in an Amazon S3-like system such as [Ceph
RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example,
you'll need an access key and secret, and additionally, you'll need to define
your endpoint.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<my-end-point>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Troubleshooting S3

If you run into this `AmazonHttpClient` timeout error:

```shell
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:200)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
```

You can add the following two configuration lines to your script:

```scala
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 100)
```
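
The S3 walkthrough removed above drives the RDD API through `matchbox`. For reference, here is a DataFrame-style sketch of the same top-ten-domains count. It is only a sketch, not the toolkit's documented S3 recipe: it assumes aut 1.x's `webpages()` view, the `extractDomain` UDF from `io.archivesunleashed.udfs`, a hypothetical bucket name, and a spark-shell session where the `$` column syntax is already in scope.

```scala
import io.archivesunleashed._
import io.archivesunleashed.udfs._

// Same fs.s3a settings as the removed example; keys and bucket are placeholders.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

// Count pages per domain and keep the ten most frequent.
RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
  .webpages()
  .groupBy(extractDomain($"url").alias("domain"))
  .count()
  .orderBy($"count".desc)
  .show(10, false)
```

The S3A credential settings are the same ones the removed example used; only the query side changes.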
12 changes: 2 additions & 10 deletions docs/extract-binary-info.md
@@ -23,24 +23,16 @@ import io.archivesunleashed.udfs._

sc.setLogLevel("INFO")

-sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR ACCESS KEY")
-sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR SECRET KEY ")
-
// Local web archive collection.
val warcs = RecordLoader.loadArchives("/local/path/to/warcs", sc)

-// S3 hosted web archive collection.
-val warcsS3 = RecordLoader.loadArchives("s3a://your-data-bucket/", sc)
-
// Choose your format: CSV or Parquet.

// For CSV:
// .write.csv("/path/to/derivatives/csv/audio")
-// .write.csv("s3a://your-derivatives-bucket/parquet/pages")

// For Parquet:
// .write.parquet("/path/to/derivatives/parquet/pages/")
-// .write.parquet("s3a://your-derivatives-bucket/parquet/pages")

// Audio Files.
warcs.audio()
@@ -66,13 +58,13 @@ warcs.pdfs()
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
-.save("s3a://your-derivatives-bucket/csv/pdf")
+.save("/path/to/derivatives/csv/pdf")

// Presentation Program Files.
warcs.presentationProgramFiles()
.select($"crawl_date", $"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1")
.write
-.parquet("s3a://your-derivatives-bucket/parquet/presentation-program")
+.parquet("/path/to/derivatives/parquet/presentation-program")

// Spreadsheets.
warcs.spreadsheets()
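
Since this change points the derivative writes at local paths, a short sketch of loading one of those outputs back for later analysis may be useful. It assumes the same placeholder `/path/to/derivatives` location and the column names selected above, and it uses only standard Spark readers.

```scala
// Reload a locally written Parquet derivative (path is a placeholder).
val presentations = spark.read
  .parquet("/path/to/derivatives/parquet/presentation-program")

// For example, count presentation files by the MIME type the web server reported.
presentations
  .groupBy("mime_type_web_server")
  .count()
  .orderBy($"count".desc)
  .show(10, false)
```

Reading the Parquet output back avoids re-processing the WARCs when all you need is the derivative table.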
69 changes: 0 additions & 69 deletions website/versioned_docs/version-1.0.0/aut-at-scale.md
@@ -53,72 +53,3 @@ material, although it may take a while to process. This command then works:
```shell
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --jars /path/to/aut-1.0.0-fatjar.jar
```

## Reading Data from AWS S3

We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/).
This advanced functionality requires that you provide Spark shell with your AWS
Access Key and AWS Secret Key, which you will get when creating your AWS
credentials ([read more
here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).

This script, for example, will find the top ten domains from a set of WARCs
found in an S3 bucket.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Reading Data from an S3-like Endpoint

We also support loading data stored in an Amazon S3-like system such as [Ceph
RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example,
you'll need an access key and secret, and additionally, you'll need to define
your endpoint.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<my-end-point>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Troubleshooting S3

If you run into this `AmazonHttpClient` timeout error:

```shell
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:200)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
```

You can add the following two configuration lines to your script:

```scala
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 100)
```
16 changes: 2 additions & 14 deletions website/versioned_docs/version-1.0.0/extract-binary-info.md
@@ -9,9 +9,6 @@ processor files, spreadsheet files, and presentation program files to a CSV
file, or into the [Apache Parquet](https://parquet.apache.org/) format
to [work with later](df-results.md#what-to-do-with-dataframe-results)?

-You can also read and write to Amazon S3 by supplying your AWS credentials, and
-using `s3a`.
-
## Scala RDD

**Will not be implemented.**
@@ -24,25 +21,16 @@ import io.archivesunleashed.udfs._

sc.setLogLevel("INFO")

-sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR ACCESS KEY")
-sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR SECRET KEY ")
-
// Local web archive collection.
val warcs = RecordLoader.loadArchives("/local/path/to/warcs", sc)

-// S3 hosted web archive collection.
-val warcsS3 = RecordLoader.loadArchives("s3a://your-data-bucket/", sc)
-
// Choose your format: CSV or Parquet.

// For CSV:
// .write.csv("/path/to/derivatives/csv/audio")
-// .write.csv("s3a://your-derivatives-bucket/parquet/pages")

// For Parquet:
// .write.parquet("/path/to/derivatives/parquet/pages/")
-// .write.parquet("s3a://your-derivatives-bucket/parquet/pages")

// Audio Files.
warcs.audio()
.select($"crawl_date", $"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1")
@@ -67,13 +55,13 @@ warcs.pdfs()
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
-.save("s3a://your-derivatives-bucket/csv/pdf")
+.save("/path/to/derivatives/csv/pdf")

// Presentation Program Files.
warcs.presentationProgramFiles()
.select($"crawl_date", $"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1")
.write
-.parquet("s3a://your-derivatives-bucket/parquet/presentation-program")
+.parquet("/path/to/derivatives/parquet/presentation-program")

// Spreadsheets.
warcs.spreadsheets()
69 changes: 0 additions & 69 deletions website/versioned_docs/version-1.1.1/aut-at-scale.md
@@ -53,72 +53,3 @@ material, although it may take a while to process. This command then works:
```shell
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --jars /path/to/aut-1.1.0-fatjar.jar
```

## Reading Data from AWS S3

We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/).
This advanced functionality requires that you provide Spark shell with your AWS
Access Key and AWS Secret Key, which you will get when creating your AWS
credentials ([read more
here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).

This script, for example, will find the top ten domains from a set of WARCs
found in an S3 bucket.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Reading Data from an S3-like Endpoint

We also support loading data stored in an Amazon S3-like system such as [Ceph
RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example,
you'll need an access key and secret, and additionally, you'll need to define
your endpoint.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<my-end-point>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Troubleshooting S3

If you run into this `AmazonHttpClient` timeout error:

```shell
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:200)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
```

You can add the following two configuration lines to your script:

```scala
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 100)
```
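
The removed sections above (repeated for each docs version) put the AWS keys directly into the script. A commonly used alternative, sketched here as an assumption about the hadoop-aws S3A connector rather than anything aut-specific, is to let S3A pick the keys up from the standard AWS environment variables.

```scala
import io.archivesunleashed._

// Assumption: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are exported in the
// environment that launched spark-shell, so the keys never appear in the script.
sc.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")

// Hypothetical bucket; a simple count confirms the connection works.
RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
  .webpages()
  .count()
```

This keeps credentials out of scripts and shell history; the bucket name and glob are placeholders, as in the removed examples.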
