
Web scraping roastery websites #1

Open
emreerhan opened this issue Mar 23, 2018 · 8 comments

Comments

@emreerhan
Contributor

emreerhan commented Mar 23, 2018

Building the data will be the first step, and maybe the most difficult step.

To-do:

  • Pick a web scraping tool (possibly BeautifulSoup).
  • Build an initial data set of a coffee shop and where it gets its coffee.
  • Make a generalizable scraper that can build this kind of data for any coffee shop.
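If we go with BeautifulSoup, the basic extraction step would look something like this. This is just a sketch against a made-up HTML snippet (the tag names and classes are placeholders, not from any real roastery site), assuming `bs4` is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical product-page snippet; real cafe sites will vary wildly.
html = """
<div class="coffee">
  <h2>House Espresso</h2>
  <p class="roaster">Roasted by Matchstick</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Pull out the coffee name and the roaster line by tag/class.
name = soup.find("h2").get_text(strip=True)
roaster = soup.find("p", class_="roaster").get_text(strip=True)
```

The catch, as discussed below, is that the tag/class selectors are different on every site, which is what makes a generalizable scraper hard.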
@landalex

Scrapy has a nicer website so it's obviously superior.

@emreerhan
Contributor Author

Scrapy's documentation looks nicer too.

@emreerhan
Contributor Author

emreerhan commented Mar 23, 2018

https://hexfox.com/p/scrapy-vs-beautifulsoup/

So the difference between the two is actually quite large: Scrapy is a tool specifically created for downloading, cleaning and saving data from the web and will help you end-to-end; whereas BeautifulSoup is a smaller package which will only help you get information out of webpages.

Afterwards the entire article is about why you should use Scrapy lol

@landalex

Looking at Scrapy, the syntax is kind of cumbersome. You need to select the specific page elements you want to extract data from, but we can't feasibly write a different scraper for each site. So either we use some basic heuristic for finding/selecting elements, or we grab all the text and use a more language-based approach to find coffee descriptions/information in it.

The language-based approach seems both easier and more reliable, since web pages vary so wildly. In that case, BeautifulSoup is nice because it has a method to just grab all the text from a page. Maybe we use both: Scrapy to traverse sites and BeautifulSoup to grab the text?
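The "grab all the text" method being referred to is BeautifulSoup's `get_text()`. A minimal sketch on a made-up page (assuming `bs4` is installed):

```python
from bs4 import BeautifulSoup

# Hypothetical cafe page; markup structure doesn't matter for this approach.
html = """
<html><body>
  <nav>Home | Menu | About</nav>
  <h1>Our Coffee</h1>
  <p>We proudly brew beans from Agro Roasters.</p>
</body></html>
"""

# get_text() flattens the whole page into plain text, so a language-based
# pass can search it without caring about per-site tag structure.
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
```

The resulting string contains all visible text ("Our Coffee", "Agro Roasters", etc.) with tags stripped, which is exactly what a downstream text-matching or NLP step would consume.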

@landalex

Now I'm in an NLP rabbit hole: spaCy and Prodigy look interesting, Prodigy especially for allowing faster annotation of data.

@emreerhan
Contributor Author

emreerhan commented Mar 26, 2018

BeautifulSoup is nice because it has a method to just grab all the text from a page

That's not a bad idea. Right now my idea is to search cafe websites for "hits" against a whitelist of roasteries. I don't know if we need any NLP for this; it might be as simple as a regex. Although I'm definitely not opposed to NLP if you can think of something clever.
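The whitelist-plus-regex idea could be as small as this. The roastery names below are placeholders for whatever whitelist we end up building:

```python
import re

# Hypothetical whitelist of local roasteries (stand-ins, not a real list).
ROASTERIES = ["Matchstick", "Agro", "49th Parallel"]

def roastery_hits(page_text):
    """Return whitelisted roasteries mentioned in a page's flattened text."""
    hits = []
    for name in ROASTERIES:
        # Case-insensitive search; re.escape guards names with punctuation.
        if re.search(re.escape(name), page_text, re.IGNORECASE):
            hits.append(name)
    return hits
```

Feeding this the `get_text()` output of each cafe page would give a first cut of the cafe-to-roastery mapping, no NLP required.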

@EChisholm
Contributor

I decided to take a crack at writing a general scraper, using the Requests library to fetch the HTML and Beautiful Soup to crawl the site for the rest of the pages. It's still very much a work in progress, so I won't be pushing my work for a bit. My baseline has been Matchstick's site and Agro's site, but Agro's site is unfortunately not very deep... they don't even list details of their coffees. I haven't looked deeply into the other roasters' sites, but it'll be pretty disappointing to find out that most of them don't even have relevant roast details available.
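The crawling half of that approach boils down to extracting same-domain links from each fetched page. Here's a sketch of just that step (not @EChisholm's actual code), run on a static HTML string so it needs no network; in the real scraper each returned URL would be fetched with `requests.get` and the process repeated:

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def internal_links(base_url, html):
    """Collect same-domain links from a page: the core of a simple crawler."""
    base_host = urlparse(base_url).netloc
    links = set()
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        url = urljoin(base_url, a["href"])  # resolve relative hrefs
        if urlparse(url).netloc == base_host:
            links.add(url.split("#")[0])  # drop fragments to avoid revisits
    return links

# Hypothetical page with one internal and one external link.
sample = '<a href="/coffee">Coffee</a> <a href="https://other.example/x">Out</a>'
found = internal_links("https://example.com/", sample)
```

Keeping the crawl restricted to the starting domain is what stops the scraper from wandering off a roastery's site onto the rest of the web.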

@emreerhan
Contributor Author

@EChisholm I think a good first step is just getting information on which independent cafes host which local roasts. I agree, let's avoid tasting notes and other roast details for now.
