Commit
cimentadaj committed Jan 27, 2024
2 parents (8e2d35a + 7a34e67), commit f64304e
Showing 4 changed files with 14 additions and 320 deletions.
book/slides/README.md: 12 changes (6 additions, 6 deletions)
@@ -7,27 +7,27 @@ Slides:
- [Welcome](./welcome/welcome.html)
- [A Primer on Webscraping](./primer_webscraping/primer_webscraping.html)
- [Data Formats for Webscraping](./data_formats_wscrap/data_formats_wscrap.html)

- Second class
- [Intro to Regular Expressions](./intro_regex/intro_regex.html)
- [Intro to XPath](./intro_xpath/intro_xpath.html)

- Third class
- [Scraping Spanish Schools](./case_study_spanish_schools/case_study_spanish_schools.html)
- [Project Guidelines](./project_guidelines/project_guidelines.html)

- Fourth class
- [Introduction to REST APIs](./intro_apis/intro_apis.html)
- [A Primer on REST APIs](./primer_apis/primer_apis.html)

- Fifth class
- [A Dialogue between computers](./dialogue_between_computers/dialogue_between_computers.html)
- [Intro to JSON](./intro_json/intro_json.html)
- [Reminder project guidelines](./project_guidelines/project_guidelines.html)

- Sixth class
- [Exploring the Amazon API](./cs_exploring_amazon_api/cs_exploring_amazon_api.html)

- Seventh class
- - [Automating Webscraping](./automating_web_scraping/automating_web_scraping.html)
+ - [Automating Webscraping](./automating_web_scraping/automating_data_harvesting.html)
- [Grabbing real time bicycle data](./automating_apis/automating_apis.html)
@@ -290,9 +290,10 @@ Pick `nano`, the easiest one.

## Scheduling our scraper

- ![](images/crontab_schedule_file.png){fig-align="center"}

Here is where we write `* * * * * Rscript ~/newspaper/newspaper_scraper.R`
+ Depending on your setup, you may need to add: `PATH=/usr/local/bin:/usr/bin:/bin` (see the crontab sketch after this diff)

+ ![](images/crontab_schedule_file.png){fig-align="center"}

## Scheduling our scraper

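For context, here is a minimal sketch of the full crontab file the slide above is building. The paths are taken from the slide; the `0 8 * * *` schedule (daily at 08:00) and the log redirection are illustrative assumptions, not part of the slides.

```
# cron starts with a minimal environment, so PATH must include the
# directory that holds Rscript (this is the line the slide mentions).
PATH=/usr/local/bin:/usr/bin:/bin

# min hour day month weekday  command
# The slide's `* * * * *` runs the scraper every minute; a daily run
# at 08:00, with output captured in a log file, would look like this:
0 8 * * * Rscript ~/newspaper/newspaper_scraper.R >> ~/newspaper/scraper.log 2>&1
```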
book/slides/project_guidelines/project_guidelines.qmd: 13 changes (5 additions, 8 deletions)
@@ -7,16 +7,13 @@ editor: visual

## Project guidelines

- 1. A public GitHub repository that I will clone.
+ 1. A **private** GitHub repository that you will share with me.
2. A clear README on how to reproduce the scraper EXACTLY.
- 3. One of the key points in the homework is to make it reproducible: I should be able to clone the repository and execute whatever you need me to run to produce the scraper.
- 4. Document what the output is, where it is saved and what each script in the program does.
+ 3. The entire scraping program should be in an RMarkdown HTML file.
+ 4. The project should not be an R package or a set of scripts (see 3).
+ 5. One of the key points is to make it reproducible: I should be able to clone the repository and execute whatever you need me to run to produce the scraper.

## Project guidelines

1. We want some medium-hard scraping/API projects. This means that I expect you to scrape several sources of information (on the same website or combining several websites) to build a meaningful dataset that you could use for other classes or for your own benefit. Remember, most of the mark is for this project.
- 2. If your project is an API you need to provide clear instructions on how to get a token and where I need to place the token. \*IMPORTANT: TOKENS SHOULD NOT BE POSTED ON YOUR REPOSITORY, THIS IS SENSITIVE INFORMATION\*.
-
- ## Examples
-
- - <https://github.com/sg-peytrignet/MHSDS-pipeline>
+ 2. If your project uses an API, you need to provide clear instructions on how to get a token and where I need to place it. \*IMPORTANT: TOKENS SHOULD NOT BE POSTED ON YOUR REPOSITORY, THIS IS SENSITIVE INFORMATION\* (see the token-handling sketch after this diff).
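A minimal, hypothetical sketch of how these guidelines fit together in practice. The file name `scraper.Rmd`, the environment variable `MY_API_TOKEN`, and the endpoint URL are placeholders for illustration, not part of the official guidelines; `httr` is shown as one common choice for authenticated requests.

```r
# Hypothetical setup chunk of scraper.Rmd. The grader reproduces the
# project by cloning the repo and running:
#   rmarkdown::render("scraper.Rmd", output_format = "html_document")
library(httr)

# Keep the token out of the repository: store it in ~/.Renviron as
#   MY_API_TOKEN=abc123
# and read it at runtime instead of hard-coding it.
token <- Sys.getenv("MY_API_TOKEN")
if (token == "") stop("Set MY_API_TOKEN in ~/.Renviron before rendering.")

resp <- GET(
  "https://api.example.com/v1/data",  # placeholder endpoint
  add_headers(Authorization = paste("Bearer", token))
)
stop_for_status(resp)  # fail loudly if the request was rejected
```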
exercises_1_2.R: 304 changes (0 additions, 304 deletions)

This file was deleted.
