
Web scraping roastery websites #1

Open
emreerhan opened this issue Mar 23, 2018 · 8 comments

Comments

@emreerhan
Contributor

emreerhan commented Mar 23, 2018

Building the data will be the first step, and maybe the most difficult step.

To-do:

  • Pick a web scraping tool (possibly BeautifulSoup).
  • Build an initial data set of a coffee shop and where it gets its coffee.
  • Make a generalizable scraper that can build this kind of data for any coffee shop.
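If we go with BeautifulSoup, the basic extraction step would look something like this. This is just a sketch against a made-up HTML snippet (the tag names and classes are placeholders, not from any real roastery site), assuming `bs4` is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical product-page snippet; real cafe sites will vary wildly.
html = """
<div class="coffee">
  <h2>House Espresso</h2>
  <p class="roaster">Roasted by Matchstick</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Pull out the coffee name and the roaster line by tag/class.
name = soup.find("h2").get_text(strip=True)
roaster = soup.find("p", class_="roaster").get_text(strip=True)
```

The catch, as discussed below, is that the tag/class selectors are different on every site, which is what makes a generalizable scraper hard.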
@landalex

Scrapy has a nicer website so it's obviously superior.

@emreerhan
Contributor Author

Scrapy's documentation looks nicer too.

@emreerhan
Contributor Author

emreerhan commented Mar 23, 2018

https://hexfox.com/p/scrapy-vs-beautifulsoup/

So the difference between the two is actually quite large: Scrapy is a tool specifically created for downloading, cleaning and saving data from the web and will help you end-to-end; whereas BeautifulSoup is a smaller package which will only help you get information out of webpages.

Afterwards the entire article is about why you should use Scrapy lol

@landalex

Looking at Scrapy, the syntax is kind of cumbersome. You need to select the specific page elements you want to extract data from, but we can't feasibly write a different scraper for each site. So either we use some basic heuristic for finding/selecting elements, or we grab all the text and use a more language-based approach to find coffee descriptions/information in it.

The language-based approach seems both easier and more reliable, since web pages vary so wildly. In that case, BeautifulSoup is nice because it has a method to just grab all the text from a page. Maybe we use both: Scrapy to traverse sites and BeautifulSoup to grab the text?
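The "grab all the text" method being referred to is BeautifulSoup's `get_text()`. A minimal sketch on a made-up page (assuming `bs4` is installed):

```python
from bs4 import BeautifulSoup

# Hypothetical cafe page; markup structure doesn't matter for this approach.
html = """
<html><body>
  <nav>Home | Menu | About</nav>
  <h1>Our Coffee</h1>
  <p>We proudly brew beans from Agro Roasters.</p>
</body></html>
"""

# get_text() flattens the whole page into plain text, so a language-based
# pass can search it without caring about per-site tag structure.
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
```

The resulting string contains all visible text ("Our Coffee", "Agro Roasters", etc.) with tags stripped, which is exactly what a downstream text-matching or NLP step would consume.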

@landalex

Now I'm in an NLP rabbit hole: spaCy and Prodigy look interesting, Prodigy especially for allowing faster annotation of data.

@emreerhan
Contributor Author

emreerhan commented Mar 26, 2018

BeautifulSoup is nice because it has a method to just grab all the text from a page

That's not a bad idea. Right now my idea is to search cafe websites for "hits" against a whitelist of roasteries. I don't know if we need any NLP for this; it might be as simple as a regex. Although I'm definitely not opposed to NLP if you can think of something clever.
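The whitelist-plus-regex idea could be as small as this. The roastery names below are placeholders for whatever whitelist we end up building:

```python
import re

# Hypothetical whitelist of local roasteries (stand-ins, not a real list).
ROASTERIES = ["Matchstick", "Agro", "49th Parallel"]

def roastery_hits(page_text):
    """Return whitelisted roasteries mentioned in a page's flattened text."""
    hits = []
    for name in ROASTERIES:
        # Case-insensitive search; re.escape guards names with punctuation.
        if re.search(re.escape(name), page_text, re.IGNORECASE):
            hits.append(name)
    return hits
```

Feeding this the `get_text()` output of each cafe page would give a first cut of the cafe-to-roastery mapping, no NLP required.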

@EChisholm
Contributor

I decided to take a crack at writing a general scraper, using the Requests library to fetch the HTML and Beautiful Soup to crawl the site for the rest of the pages. It's still very much a work in progress, so I won't be pushing my work for a bit. My baseline has been Matchstick's site and Agro's site, but Agro's site is unfortunately not very deep... they don't even list details of their coffees. I haven't looked deeply into the other roasters' sites, but it'll be pretty disappointing to find out that most of them don't even have relevant roast details available.
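The crawling half of that approach boils down to extracting same-domain links from each fetched page. Here's a sketch of just that step (not @EChisholm's actual code), run on a static HTML string so it needs no network; in the real scraper each returned URL would be fetched with `requests.get` and the process repeated:

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def internal_links(base_url, html):
    """Collect same-domain links from a page: the core of a simple crawler."""
    base_host = urlparse(base_url).netloc
    links = set()
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        url = urljoin(base_url, a["href"])  # resolve relative hrefs
        if urlparse(url).netloc == base_host:
            links.add(url.split("#")[0])  # drop fragments to avoid revisits
    return links

# Hypothetical page with one internal and one external link.
sample = '<a href="/coffee">Coffee</a> <a href="https://other.example/x">Out</a>'
found = internal_links("https://example.com/", sample)
```

Keeping the crawl restricted to the starting domain is what stops the scraper from wandering off a roastery's site onto the rest of the web.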

@emreerhan
Contributor Author

@EChisholm I think a good first step is just getting information on which independent cafes host which local roasts. I agree, let's avoid tasting notes and other roast details for now.
