Commit

Remove s3 examples.

ruebot committed Jul 24, 2024
1 parent 2a954a7 commit d72ed9f
Showing 5 changed files with 4 additions and 231 deletions.
69 changes: 0 additions & 69 deletions docs/aut-at-scale.md
@@ -52,72 +52,3 @@ material, although it may take a while to process. This command then works:
```shell
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --jars /path/to/aut-1.1.0-fatjar.jar
```

## Reading Data from AWS S3

We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/).
This advanced functionality requires that you provide Spark shell with your AWS
Access Key and AWS Secret Key, which you will get when creating your AWS
credentials ([read more
here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).

This script, for example, will find the top ten domains from a set of WARCs
found in an S3 bucket.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Reading Data from an S3-like Endpoint

We also support loading data stored in an Amazon S3-like system such as [Ceph
RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example,
you'll need an access key and secret, and additionally, you'll need to define
your endpoint.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<my-end-point>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Troubleshooting S3

If you run into this `AmazonHttpClient` timeout error:

```shell
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:200)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
```

You can add the following two configuration lines to your script:

```scala
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 100)
```
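
The S3 walkthrough removed above drives the RDD API through `matchbox`. For reference, here is a DataFrame-style sketch of the same top-ten-domains count. It is only a sketch, not the toolkit's documented S3 recipe: it assumes aut 1.x's `webpages()` view, the `extractDomain` UDF from `io.archivesunleashed.udfs`, a hypothetical bucket name, and a spark-shell session where the `$` column syntax is already in scope.

```scala
import io.archivesunleashed._
import io.archivesunleashed.udfs._

// Same fs.s3a settings as the removed example; keys and bucket are placeholders.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

// Count pages per domain and keep the ten most frequent.
RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
  .webpages()
  .groupBy(extractDomain($"url").alias("domain"))
  .count()
  .orderBy($"count".desc)
  .show(10, false)
```

The S3A credential settings are the same ones the removed example used; only the query side changes.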
12 changes: 2 additions & 10 deletions docs/extract-binary-info.md
@@ -23,24 +23,16 @@ import io.archivesunleashed.udfs._

sc.setLogLevel("INFO")

-sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR ACCESS KEY")
-sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR SECRET KEY ")
-
// Local web archive collection.
val warcs = RecordLoader.loadArchives("/local/path/to/warcs", sc)

-// S3 hosted web archive collection.
-val warcsS3 = RecordLoader.loadArchives("s3a://your-data-bucket/", sc)
-
// Choose your format: CSV or Parquet.

// For CSV:
// .write.csv("/path/to/derivatives/csv/audio")
-// .write.csv("s3a://your-derivatives-bucket/parquet/pages")

// For Parquet:
// .write.parquet("/path/to/derivatives/parquet/pages/")
-// .write.parquet("s3a://your-derivatives-bucket/parquet/pages")

// Audio Files.
warcs.audio()
@@ -66,13 +58,13 @@ warcs.pdfs()
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
-.save("s3a://your-derivatives-bucket/csv/pdf")
+.save("/path/to/derivatives/csv/pdf")

// Presentation Program Files.
warcs.presentationProgramFiles()
.select($"crawl_date", $"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1")
.write
-.parquet("s3a://your-derivatives-bucket/parquet/presentation-program")
+.parquet("/path/to/derivatives/parquet/presentation-program")

// Spreadsheets.
warcs.spreadsheets()
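
Since this change points the derivative writes at local paths, a short sketch of loading one of those outputs back for later analysis may be useful. It assumes the same placeholder `/path/to/derivatives` location and the column names selected above, and it uses only standard Spark readers.

```scala
// Reload a locally written Parquet derivative (path is a placeholder).
val presentations = spark.read
  .parquet("/path/to/derivatives/parquet/presentation-program")

// For example, count presentation files by the MIME type the web server reported.
presentations
  .groupBy("mime_type_web_server")
  .count()
  .orderBy($"count".desc)
  .show(10, false)
```

Reading the Parquet output back avoids re-processing the WARCs when all you need is the derivative table.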
69 changes: 0 additions & 69 deletions website/versioned_docs/version-1.0.0/aut-at-scale.md
@@ -53,72 +53,3 @@ material, although it may take a while to process. This command then works:
```shell
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --jars /path/to/aut-1.0.0-fatjar.jar
```

## Reading Data from AWS S3

We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/).
This advanced functionality requires that you provide Spark shell with your AWS
Access Key and AWS Secret Key, which you will get when creating your AWS
credentials ([read more
here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).

This script, for example, will find the top ten domains from a set of WARCs
found in an S3 bucket.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Reading Data from an S3-like Endpoint

We also support loading data stored in an Amazon S3-like system such as [Ceph
RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example,
you'll need an access key and secret, and additionally, you'll need to define
your endpoint.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<my-end-point>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Troubleshooting S3

If you run into this `AmazonHttpClient` timeout error:

```shell
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:200)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
```

You can add the following two configuration lines to your script:

```scala
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 100)
```
16 changes: 2 additions & 14 deletions website/versioned_docs/version-1.0.0/extract-binary-info.md
@@ -9,9 +9,6 @@ processor files, spreadsheet files, and presentation program files to a CSV
file, or into the [Apache Parquet](https://parquet.apache.org/) format
to [work with later](df-results.md#what-to-do-with-dataframe-results)?

-You can also read and write to Amazon S3 by supplying your AWS credentials, and
-using `s3a`.
-
## Scala RDD

**Will not be implemented.**
@@ -24,25 +21,16 @@ import io.archivesunleashed.udfs._

sc.setLogLevel("INFO")

-sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR ACCESS KEY")
-sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR SECRET KEY ")
-
// Local web archive collection.
val warcs = RecordLoader.loadArchives("/local/path/to/warcs", sc)

-// S3 hosted web archive collection.
-val warcsS3 = RecordLoader.loadArchives("s3a://your-data-bucket/", sc)
-
// Choose your format: CSV or Parquet.

// For CSV:
// .write.csv("/path/to/derivatives/csv/audio")
-// .write.csv("s3a://your-derivatives-bucket/parquet/pages")

// For Parquet:
// .write.parquet("/path/to/derivatives/parquet/pages/")
-// .write.parquet("s3a://your-derivatives-bucket/parquet/pages")

// Audio Files.
warcs.audio()
.select($"crawl_date", $"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1")
@@ -67,13 +55,13 @@ warcs.pdfs()
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
-.save("s3a://your-derivatives-bucket/csv/pdf")
+.save("/path/to/derivatives/csv/pdf")

// Presentation Program Files.
warcs.presentationProgramFiles()
.select($"crawl_date", $"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1")
.write
-.parquet("s3a://your-derivatives-bucket/parquet/presentation-program")
+.parquet("/path/to/derivatives/parquet/presentation-program")

// Spreadsheets.
warcs.spreadsheets()
69 changes: 0 additions & 69 deletions website/versioned_docs/version-1.1.1/aut-at-scale.md
@@ -53,72 +53,3 @@ material, although it may take a while to process. This command then works:
```shell
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --jars /path/to/aut-1.1.0-fatjar.jar
```

## Reading Data from AWS S3

We also support loading data stored in [Amazon S3](https://aws.amazon.com/s3/).
This advanced functionality requires that you provide Spark shell with your AWS
Access Key and AWS Secret Key, which you will get when creating your AWS
credentials ([read more
here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/)).

This script, for example, will find the top ten domains from a set of WARCs
found in an S3 bucket.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Reading Data from an S3-like Endpoint

We also support loading data stored in an Amazon S3-like system such as [Ceph
RADOS](https://docs.ceph.com/docs/master/rados/). Similar to the above example,
you'll need an access key and secret, and additionally, you'll need to define
your endpoint.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<my-end-point>")

RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```

### Troubleshooting S3

If you run into this `AmazonHttpClient` timeout error:

```shell
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:200)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
```

You can add the following two configuration lines to your script:

```scala
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 100)
```
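
The removed sections above (repeated for each docs version) put the AWS keys directly into the script. A commonly used alternative, sketched here as an assumption about the hadoop-aws S3A connector rather than anything aut-specific, is to let S3A pick the keys up from the standard AWS environment variables.

```scala
import io.archivesunleashed._

// Assumption: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are exported in the
// environment that launched spark-shell, so the keys never appear in the script.
sc.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")

// Hypothetical bucket; a simple count confirms the connection works.
RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
  .webpages()
  .count()
```

This keeps credentials out of scripts and shell history; the bucket name and glob are placeholders, as in the removed examples.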
