Web scraping roastery websites #1
Comments
Scrapy has a nicer website so it's obviously superior.
Scrapy's documentation looks nicer too.
https://hexfox.com/p/scrapy-vs-beautifulsoup/
Afterwards the entire article is about why you should use Scrapy lol
Looking at Scrapy, the syntax is kind of cumbersome. You need to select the elements of the page you want to extract data from, but we can't feasibly write a different scraper for each site. So either we use some basic, generic approach to finding/selecting elements, or we just grab all the text and use a more language-based approach to find coffee descriptions/information in it. The language-based approach seems both easier and more reliable, since web pages vary so wildly. In that case BeautifulSoup is nice because it has a method (`get_text`) to just grab all the text from a page. Maybe use both: Scrapy as the crawler to traverse sites and BeautifulSoup to grab the text?
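A minimal sketch of the "grab all the text" step described above, assuming the page HTML has already been fetched by whatever crawler ends up being used. `page_text` is a hypothetical helper name and the sample HTML is made up for illustration; `get_text` is the BeautifulSoup method in question.

```python
from bs4 import BeautifulSoup


def page_text(html: str) -> str:
    """Return the visible text of an HTML page as one whitespace-normalized string."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style contents are not visible text, so drop them first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


# Made-up sample page, not taken from any real roastery site.
sample = (
    "<html><body><h1>Our Coffees</h1>"
    "<p>Roasted by Matchstick.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(page_text(sample))
```

The result is a single flat string, which is what a regex or language-based search would then run over.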
That's not a bad idea. Right now my idea is to search cafe websites for "hits" against a whitelist of roasteries. I don't know if we need any NLP for this; it might be as simple as a regex. I'm definitely not opposed to it, though, if you can think of something clever.
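A rough sketch of the regex idea, assuming the whitelist is just a list of roastery names and the cafe page has already been reduced to plain text. The names `ROASTERIES`, `build_pattern`, and `find_roasteries` are hypothetical, and the two-entry whitelist is only a placeholder.

```python
import re

# Placeholder whitelist; the real one would come from the project's data.
ROASTERIES = ["Matchstick", "Agro"]


def build_pattern(names):
    """Compile one case-insensitive, whole-word pattern that matches any name."""
    # Longest names first, so a longer name wins over a shorter one it starts with.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_roasteries(text, names=ROASTERIES):
    """Return the whitelisted names mentioned in text, sorted alphabetically."""
    pattern = build_pattern(names)
    canonical = {n.lower(): n for n in names}
    return sorted({canonical[m.group(1).lower()] for m in pattern.finditer(text)})


print(find_roasteries("We proudly serve MATCHSTICK and agro coffee."))
```

The `\b` word boundaries keep a short name from matching inside a longer word (for example "Agro" inside "agronomy"), which is probably the main source of false hits with a plain substring search.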
I decided to take a crack at writing a general scraper, using the Requests library to fetch the HTML and Beautiful Soup to find links to the rest of the site's pages. It's still very much a work in progress, so I won't be pushing what I have for a bit. My baseline has been Matchstick's site and Agro's site, but Agro's site is unfortunately not very deep... they don't even list details of their coffees. I haven't looked deeply into the other roasters' sites, but it'll be pretty disappointing to find out that most of them don't even have relevant roast details available.
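For reference, a minimal sketch of what a Requests + Beautiful Soup crawler along these lines could look like. This is not the work-in-progress code mentioned above; `same_site_links`, `fetch`, `crawl`, and the `max_pages` limit are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def same_site_links(html, page_url):
    """Return absolute, fragment-free links on the page that stay on page_url's host."""
    host = urlparse(page_url).netloc
    links = set()
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        # Resolve relative hrefs and drop "#fragment" so each page is counted once.
        url, _ = urldefrag(urljoin(page_url, a["href"]))
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.add(url)
    return links


def fetch(url):
    """Download one page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def crawl(start_url, max_pages=20, fetch=fetch):
    """Breadth-first crawl of a single site; returns {url: html}."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except requests.RequestException:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in same_site_links(html, url) - seen:
            seen.add(link)
            queue.append(link)
    return pages
```

Passing `fetch` as a parameter makes it possible to try the crawl logic against a dictionary of canned pages before pointing it at a real site, and `max_pages` keeps a run from wandering through an entire shop catalogue.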
@EChisholm I think a good first step is just getting information on which independent cafes carry which local roasts. I agree, let's avoid tasting notes and other roast details for now.
Building the data will be the first step, and maybe the most difficult step.
To-do:
Pick a web scraping tool (possibly BeautifulSoup).