-
-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New/add spider chi_ssa_35 and it's test case #991
base: main
Are you sure you want to change the base?
Conversation
city_scrapers/spiders/chi_ssa_35.py
Outdated
needs. | ||
""" | ||
content_div = response.css("div.content_block.content.background_white") | ||
_ = content_div.css("h4::text").getall() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you ignoring this variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was using it for the commission titles in case i needed it later, but since they were all the same i just ignore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed if it isn't being used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i'll remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Left some comments here
city_scrapers/spiders/chi_ssa_35.py
Outdated
def _parse_description(self, item): | ||
"""Parse or generate meeting description.""" | ||
item_url = item[1].css("::attr(href)").get() | ||
response = requests.get(item_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't much meaningful information in the description, so we can drop this. We would want to include PDF parsing within scrapy so we can manage rate-limiting and handling responses in a unified way instead of using requests
here as well
city_scrapers/spiders/chi_ssa_35.py
Outdated
def _parse_start(self, item): | ||
"""Parse start datetime as a naive datetime object.""" | ||
date_item = self._clean_date_item(item[0]) | ||
date_obj = datetime.strptime(date_item, "%A, %B %d %H:%M %p %Y") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've had some issues in the past with agencies putting the wrong day of the week, but the correct date, so we should ignore the weekday. Also, if we're using %p
we'll need to use %I
for the hour format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i'll make the changes
city_scrapers/spiders/chi_ssa_35.py
Outdated
def _parse_location(self, item): | ||
"""Parse or generate location.""" | ||
return { | ||
"address": "Lincoln Park Chamber of Commerce, 2468 N. Lincoln, Chicago", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this has varied a lot according to their site. We might want to just include "Confirm with agency" in the name
and leave address
blank
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
city_scrapers/spiders/chi_ssa_35.py
Outdated
def _parse_links(self, item): | ||
"""Parse or generate links.""" | ||
href = item[1].css("::attr(href)").get() | ||
title = item[1].css("::text").get() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The date isn't very relevant if we're already associating it with a meeting on a given date, so we should label it Minutes
or Agenda
based off of the section or URL text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate a little more? I didn't understand @pjsier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, if we're looking at the document in the context of a meeting we already know the meeting date, so it isn't adding information as the document title. It's more helpful to see whether we're looking at an Agenda or Minutes, so ideally we could check which section of links we're in and assign the title that way, otherwise it looks like "Agenda" or "Minutes" are often included in the URL string so we could parse them that way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, i see. I'll check that out and make all the changes. Thank you for the explanation
city_scrapers/spiders/chi_ssa_35.py
Outdated
needs. | ||
""" | ||
content_div = response.css("div.content_block.content.background_white") | ||
_ = content_div.css("h4::text").getall() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed if it isn't being used
Hello @pjsier , I was updating some stopped code and I made the corrections suggested by you. I also updated the code to make the year of each item correct, since it was fixed with a |
@sosolidkk thanks for the changes! I mentioned in the comment, but the |
Hey @pjsier , sorry for the delay. I've updated this PR with the changes that you request. Now i'm iterating over all the inner elements of the body and separating the items in groups based on their |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, and thanks for the updates! I had a few general questions I wanted to address first, let me know if I can clarify anything
city_scrapers/spiders/chi_ssa_35.py
Outdated
self._add_year_to_date_item(content) | ||
self._parse_date_to_datetime(content) | ||
|
||
for item in content: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for the two separate looks here? It seems like we could consolidate them and have the individual parse functions handle getting details from a parent element rather than as separate steps of a loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't have a reason for using two for loops. For me it was more simpler to save all data into a list and the just run the data that is into the list, but i can change that too. Anyways, the suggestion that you made about search for the elements and iterating over them was just like the way i did? Or did you have something else in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
|
||
yield meeting | ||
|
||
def _add_year_to_date_item(self, content): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little hard to follow this when it's modifying an object in-place and not returning a value, could this be made more explicit in a function that returns a value instead of modifying one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I could do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these updates, it's coming along! I left a few comments here
|
||
def _parse_links(self, item): | ||
"""Parse or generate links.""" | ||
return [{"href": item.get("url"), "title": item["date"]}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll want to know whether this is an agenda or meeting minutes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can i just replace the title key with the info needed? Agenda or minutes
|
||
def _parse_time_notes(self, item): | ||
"""Parse any additional notes on the timing of the meeting""" | ||
return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to add something about confirming at the number they leave on the page
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about the number at the contact us
section?
Contact Us
Lincoln Park Chamber of Commerce
2468 N. Lincoln
Chicago, IL 60614
773 880 5200
meeting_content["url"] = element.css("a::attr(href)").get() | ||
|
||
meeting_content["description"] = _description | ||
meeting_content["type"] = _type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this isn't currently being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't. I was looking for somewhere to put this info.
meeting_content["type"] = _type | ||
meeting_content["year"] = _year | ||
|
||
if "date" not in meeting_content: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be some meetings that are both listed for agendas and minutes, and we'll need to combine those so that this doesn't return duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
Summary
Issue: #568
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.
Questions
I am having some doubts about this spider, because it has all the meetings time, date and place displayed on the website itself, but the meeting details for the current day that will happen are inside a
.pdf
document. So what i did was to put the.pdf
document content displayed into the description field in the spider. Anyway, i don't know if what i did was the correct approach or if the right way would be to iterate over the.pdf
documents and parse the data inside them as meetings.