Commit
cimentadaj committed Jan 27, 2024
2 parents (8e2d35a + 7a34e67), commit f64304e
Showing 4 changed files with 14 additions and 320 deletions.
book/slides/README.md: 12 changes (6 additions, 6 deletions)
@@ -7,27 +7,27 @@ Slides:
- [Welcome](./welcome/welcome.html)
- [A Primer on Webscraping](./primer_webscraping/primer_webscraping.html)
- [Data Formats for Webscraping](./data_formats_wscrap/data_formats_wscrap.html)

- Second class
- [Intro to Regular Expressions](./intro_regex/intro_regex.html)
- [Intro to XPath](./intro_xpath/intro_xpath.html)

- Third class
- [Scraping Spanish Schools](./case_study_spanish_schools/case_study_spanish_schools.html)
- [Project Guidelines](./project_guidelines/project_guidelines.html)

- Fourth class
- [Introduction to REST APIs](./intro_apis/intro_apis.html)
- [A Primer on REST APIs](./primer_apis/primer_apis.html)

- Fifth class
- [A Dialogue between computers](./dialogue_between_computers/dialogue_between_computers.html)
- [Intro to JSON](./intro_json/intro_json.html)
- [Reminder project guidelines](./project_guidelines/project_guidelines.html)

- Sixth class
- [Exploring the Amazon API](./cs_exploring_amazon_api/cs_exploring_amazon_api.html)

- Seventh class
- - [Automating Webscraping](./automating_web_scraping/automating_web_scraping.html)
+ - [Automating Webscraping](./automating_web_scraping/automating_data_harvesting.html)
- [Grabbing real time bicycle data](./automating_apis/automating_apis.html)
@@ -290,9 +290,10 @@ Pick `nano`, the easiest one.

## Scheduling our scraper

- ![](images/crontab_schedule_file.png){fig-align="center"}

Here is where we write `* * * * * Rscript ~/newspaper/newspaper_scraper.R`
+ Depending on your setup, you may need to add: `PATH=/usr/local/bin:/usr/bin:/bin` (see the crontab sketch after this diff)

+ ![](images/crontab_schedule_file.png){fig-align="center"}

## Scheduling our scraper

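For context, here is a minimal sketch of the full crontab file the slide above is building. The paths are taken from the slide; the `0 8 * * *` schedule (daily at 08:00) and the log redirection are illustrative assumptions, not part of the slides.

```
# cron starts with a minimal environment, so PATH must include the
# directory that holds Rscript (this is the line the slide mentions).
PATH=/usr/local/bin:/usr/bin:/bin

# min hour day month weekday  command
# The slide's `* * * * *` runs the scraper every minute; a daily run
# at 08:00, with output captured in a log file, would look like this:
0 8 * * * Rscript ~/newspaper/newspaper_scraper.R >> ~/newspaper/scraper.log 2>&1
```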
book/slides/project_guidelines/project_guidelines.qmd: 13 changes (5 additions, 8 deletions)
@@ -7,16 +7,13 @@ editor: visual

## Project guidelines

- 1. A public GitHub repository that I will clone.
+ 1. A **private** GitHub repository that you will share with me.
2. A clear README on how to reproduce the scraper EXACTLY.
- 3. One of the key points in the homework is to make it reproducible: I should be able to clone the repository and execute whatever you need me to run to produce the scraper.
- 4. Document what the output is, where it is saved and what each script in the program does.
+ 3. The entire scraping program should be in an RMarkdown HTML file.
+ 4. The project should not be an R package or a set of scripts (see 3).
+ 5. One of the key points is to make it reproducible: I should be able to clone the repository and execute whatever you need me to run to produce the scraper.

## Project guidelines

1. We want some medium-hard scraping/API projects. This means that I expect you to scrape several sources of information (on the same website or combining several websites) to build a meaningful dataset that you could use for other classes or for your own benefit. Remember, most of the mark is for this project.
- 2. If your project is an API you need to provide clear instructions on how to get a token and where I need to place the token. \*IMPORTANT: TOKENS SHOULD NOT BE POSTED ON YOUR REPOSITORY, THIS IS SENSITIVE INFORMATION\*.
-
- ## Examples
-
- - <https://github.com/sg-peytrignet/MHSDS-pipeline>
+ 2. If your project uses an API, you need to provide clear instructions on how to get a token and where I need to place it. \*IMPORTANT: TOKENS SHOULD NOT BE POSTED ON YOUR REPOSITORY, THIS IS SENSITIVE INFORMATION\* (see the token-handling sketch after this diff).
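A minimal, hypothetical sketch of how these guidelines fit together in practice. The file name `scraper.Rmd`, the environment variable `MY_API_TOKEN`, and the endpoint URL are placeholders for illustration, not part of the official guidelines; `httr` is shown as one common choice for authenticated requests.

```r
# Hypothetical setup chunk of scraper.Rmd. The grader reproduces the
# project by cloning the repo and running:
#   rmarkdown::render("scraper.Rmd", output_format = "html_document")
library(httr)

# Keep the token out of the repository: store it in ~/.Renviron as
#   MY_API_TOKEN=abc123
# and read it at runtime instead of hard-coding it.
token <- Sys.getenv("MY_API_TOKEN")
if (token == "") stop("Set MY_API_TOKEN in ~/.Renviron before rendering.")

resp <- GET(
  "https://api.example.com/v1/data",  # placeholder endpoint
  add_headers(Authorization = paste("Bearer", token))
)
stop_for_status(resp)  # fail loudly if the request was rejected
```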
exercises_1_2.R: 304 changes (0 additions, 304 deletions)

This file was deleted.
