New/add spider chi_ssa_35 and its test case #991
base: main
Changes from 1 commit
f973966
4ae7193
1c28a54
7ecb840
fcca61f
eb4dfd0
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
from datetime import datetime | ||
from io import BytesIO, StringIO | ||
|
||
import requests | ||
from city_scrapers_core.constants import COMMISSION | ||
from city_scrapers_core.items import Meeting | ||
from city_scrapers_core.spiders import CityScrapersSpider | ||
from pdfminer.high_level import extract_text_to_fp | ||
from pdfminer.layout import LAParams | ||
|
||
|
||
class ChiSsa35Spider(CityScrapersSpider): | ||
name = "chi_ssa_35" | ||
agency = "Chicago Special Service Area #35 Lincoln Ave" | ||
timezone = "America/Chicago" | ||
start_urls = [ | ||
"https://www.lincolnparkchamber.com/" | ||
+ "businesses/special-service-areas/lincoln-avenue-ssa/ssa-administration/" | ||
] | ||
|
||
def parse(self, response): | ||
""" | ||
`parse` should always `yield` Meeting items. | ||
|
||
Change the `_parse_title`, `_parse_start`, etc methods to fit your scraping | ||
needs. | ||
""" | ||
content_div = response.css("div.content_block.content.background_white") | ||
_ = content_div.css("h4::text").getall() | ||
dates = content_div.css("ol").css("li::text").getall() | ||
urls = content_div.css("a") | ||
content = list(zip(dates, urls[: len(dates)])) | ||
|
||
for item in content[:1]: | ||
meeting = Meeting( | ||
title=self._parse_title(item), | ||
description=self._parse_description(item), | ||
classification=self._parse_classification(item), | ||
start=self._parse_start(item), | ||
end=self._parse_end(item), | ||
all_day=self._parse_all_day(item), | ||
time_notes=self._parse_time_notes(item), | ||
location=self._parse_location(item), | ||
links=self._parse_links(item), | ||
source=self._parse_source(response), | ||
) | ||
|
||
meeting["status"] = self._get_status(meeting) | ||
meeting["id"] = self._get_id(meeting) | ||
|
||
yield meeting | ||
|
||
def _parse_title(self, item): | ||
"""Parse or generate meeting title.""" | ||
return "Commission" | ||
|
||
def _parse_description(self, item): | ||
"""Parse or generate meeting description.""" | ||
item_url = item[1].css("::attr(href)").get() | ||
response = requests.get(item_url) | ||
There isn't much meaningful information in the description, so we can drop this. We would want to include PDF parsing within scrapy so we can manage rate-limiting and handling responses in a unified way instead of using |
||
|
||
lp = LAParams(line_margin=0.1) | ||
output_str = StringIO() | ||
extract_text_to_fp(BytesIO(response.content), output_str, laparams=lp) | ||
pdf_text = output_str.getvalue().replace("\n", "") | ||
|
||
return pdf_text | ||
|
||
def _parse_classification(self, item): | ||
"""Parse or generate classification from allowed options.""" | ||
return COMMISSION | ||
|
||
def _clean_date_item(self, date_item): | ||
date_item = date_item.split() | ||
|
||
if len(date_item) == 5: | ||
date_item[-1] = date_item[-1].replace(")", "") | ||
date_item[-2] = date_item[-2].replace("(", "") | ||
date_item.append(f"{datetime.today().year}") | ||
|
||
elif len(date_item) == 3: | ||
date_item.append("09:00 am") | ||
date_item.append(f"{datetime.today().year}") | ||
|
||
return " ".join(date_item) | ||
|
||
def _parse_start(self, item): | ||
"""Parse start datetime as a naive datetime object.""" | ||
date_item = self._clean_date_item(item[0]) | ||
date_obj = datetime.strptime(date_item, "%A, %B %d %H:%M %p %Y") | ||
We've had some issues in the past with agencies putting the wrong day of the week, but the correct date, so we should ignore the weekday. Also, if we're using
Ok, I'll make the changes |
||
|
||
return date_obj | ||
|
||
def _parse_end(self, item): | ||
"""Parse end datetime as a naive datetime object. Added by pipeline if None""" | ||
return None | ||
|
||
def _parse_time_notes(self, item): | ||
"""Parse any additional notes on the timing of the meeting""" | ||
return "" | ||
We might want to add something about confirming at the number they leave on the page
Are you talking about the number at the
|
||
|
||
def _parse_all_day(self, item): | ||
"""Parse or generate all-day status. Defaults to False.""" | ||
return False | ||
|
||
def _parse_location(self, item): | ||
"""Parse or generate location.""" | ||
return { | ||
"address": "Lincoln Park Chamber of Commerce, 2468 N. Lincoln, Chicago", | ||
It looks like this has varied a lot according to their site. We might want to just include "Confirm with agency" in the
👍🏻 |
||
"name": "Lincoln Park Chamber of Commerce", | ||
} | ||
|
||
def _parse_links(self, item): | ||
"""Parse or generate links.""" | ||
href = item[1].css("::attr(href)").get() | ||
title = item[1].css("::text").get() | ||
The date isn't very relevant if we're already associating it with a meeting on a given date, so we should label it
Could you elaborate a little more? I didn't understand @pjsier
Sure, if we're looking at the document in the context of a meeting we already know the meeting date, so it isn't adding information as the document title. It's more helpful to see whether we're looking at an Agenda or Minutes, so ideally we could check which section of links we're in and assign the title that way. Otherwise, it looks like "Agenda" or "Minutes" are often included in the URL string, so we could parse them that way
Oh, I see. I'll check that out and make all the changes. Thank you for the explanation |
||
|
||
return [{"href": href, "title": title}] | ||
|
||
def _parse_source(self, response): | ||
"""Parse or generate source.""" | ||
return response.url |
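
The suggestion in the `_parse_start` thread to ignore the weekday could be sketched roughly as below. This is an illustration, not code from the PR: `parse_start`, `DATE_RE`, and `TIME_RE` are hypothetical names, and the input shape ("Tuesday, January 14 (9:00 am)", with the time optional) is an assumption inferred from `_clean_date_item`. The sketch reads only the month, day, and optional time, and pairs `%I` with `%p` so that "pm" times land in the afternoon.

```python
import re
from datetime import datetime

# Hypothetical patterns (not from the PR). The weekday is deliberately not captured.
DATE_RE = re.compile(r"(?P<month>[A-Z][a-z]+)\s+(?P<day>\d{1,2})")
TIME_RE = re.compile(
    r"(?P<time>\d{1,2}:\d{2})\s*(?P<meridiem>[ap])\.?m\.?", re.IGNORECASE
)


def parse_start(date_text, year, default_time="9:00 am"):
    """Parse text like 'Tuesday, January 14 (9:00 am)' into a naive datetime.

    Returns None when no 'Month day' pair is found. An unknown month name
    still raises ValueError from strptime.
    """
    date_match = DATE_RE.search(date_text)
    if date_match is None:
        return None

    time_match = TIME_RE.search(date_text)
    if time_match:
        time_str = f"{time_match.group('time')} {time_match.group('meridiem')}m"
    else:
        # Assumed default, mirroring the 09:00 am fallback in _clean_date_item
        time_str = default_time

    return datetime.strptime(
        f"{date_match.group('month')} {date_match.group('day')} {year} {time_str}",
        "%B %d %Y %I:%M %p",
    )
```

Because the weekday is never read, a listing such as "Friday, January 14 (9:00 am)" still parses to January 14 whether or not that date falls on a Friday.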
Why are you ignoring this variable?
I was using it for the commission titles in case I needed it later, but since they were all the same I just ignore it.
This can be removed if it isn't being used
Ok, I'll remove
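
The fallback described in the `_parse_links` thread (labelling a document "Agenda" or "Minutes" based on the URL string rather than the date) might look something like this. `link_title`, the default label, and the example URLs in the usage note are all hypothetical; checking which section of links the anchor sits in, as the reviewer prefers, would need the page markup and is not shown.

```python
def link_title(href, default="Link"):
    """Guess a document label from its URL (hypothetical helper, not in the PR)."""
    lowered = href.lower()
    if "agenda" in lowered:
        return "Agenda"
    if "minutes" in lowered:
        return "Minutes"
    # Assumed generic label when the URL gives no hint
    return default
```

With made-up URLs, `link_title("https://example.com/files/2020-01-Agenda.pdf")` gives `"Agenda"` and `link_title("https://example.com/files/jan-minutes.pdf")` gives `"Minutes"`.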