New/add spider chi_ssa_35 and its test case #991

Open
wants to merge 6 commits into
base: main

sosolidkk
Contributor

Summary

Issue: #568

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

Questions

I have some doubts about this spider. The website itself lists the time, date, and place of every meeting, but the details for the current day's meeting are inside a .pdf document. So I put the content of the .pdf document into the spider's description field. I don't know whether that was the correct approach, or whether the right way would be to iterate over the .pdf documents and parse the data inside them as meetings.

needs.
"""
content_div = response.css("div.content_block.content.background_white")
_ = content_div.css("h4::text").getall()
Contributor

Why are you ignoring this variable?

Contributor Author

I was using it for the commission titles in case I needed them later, but since they were all the same I just ignored it.

Collaborator

This can be removed if it isn't being used

Contributor Author

OK, I'll remove it.

Collaborator

@pjsier pjsier left a comment

Thanks for the PR! Left some comments here

def _parse_description(self, item):
    """Parse or generate meeting description."""
    item_url = item[1].css("::attr(href)").get()
    response = requests.get(item_url)
Collaborator

There isn't much meaningful information in the description, so we can drop this. We would also want any PDF parsing to go through scrapy instead of requests, so that rate limiting and response handling are managed in a unified way.

def _parse_start(self, item):
    """Parse start datetime as a naive datetime object."""
    date_item = self._clean_date_item(item[0])
    date_obj = datetime.strptime(date_item, "%A, %B %d %H:%M %p %Y")
Collaborator

We've had some issues in the past with agencies putting the wrong day of the week, but the correct date, so we should ignore the weekday. Also, if we're using %p we'll need to use %I for the hour format
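A minimal sketch of both points, assuming date strings shaped like the spider's original format (for example "Tuesday, March 2 9:00 PM 2021") and an English locale. The helper name `parse_start` and the regular expression are illustrative and not taken from the PR.

```python
import re
from datetime import datetime


def parse_start(date_str):
    """Parse a naive datetime, ignoring whatever weekday name the page gives."""
    # Drop a leading "<weekday>," token so a wrong weekday on the site
    # plays no part in parsing (assumed input shape, see the note above).
    without_weekday = re.sub(r"^\s*[A-Za-z]+,\s*", "", date_str)
    # %I is the 12-hour clock directive that %p (AM/PM) applies to.
    return datetime.strptime(without_weekday, "%B %d %I:%M %p %Y")


# With %H, strptime takes the hour as written and %p does not shift it.
wrong = datetime.strptime("9:00 PM", "%H:%M %p")
# With %I, the PM marker moves the hour into the afternoon.
right = datetime.strptime("9:00 PM", "%I:%M %p")
```

Here `wrong.hour` is 9 while `right.hour` is 21, which is why an afternoon meeting parsed with `%H` would silently end up in the morning.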

Contributor Author

OK, I'll make the changes.

def _parse_location(self, item):
    """Parse or generate location."""
    return {
        "address": "Lincoln Park Chamber of Commerce, 2468 N. Lincoln, Chicago",
Collaborator

It looks like this has varied a lot according to their site. We might want to just include "Confirm with agency" in the name and leave address blank
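One way this suggestion might look, as a sketch only. It is written as a standalone function rather than a spider method so it can run on its own, and the `name` key is assumed from the comment's wording.

```python
def parse_location(item=None):
    """Return a placeholder location when the venue varies between meetings."""
    return {
        # Tell readers to check the venue instead of hard-coding one address.
        "name": "Confirm with agency",
        "address": "",
    }
```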

Contributor Author

👍🏻

def _parse_links(self, item):
    """Parse or generate links."""
    href = item[1].css("::attr(href)").get()
    title = item[1].css("::text").get()
Collaborator

The date isn't very relevant if we're already associating it with a meeting on a given date, so we should label it Minutes or Agenda based off of the section or URL text

Contributor Author

Could you elaborate a little more? I didn't understand @pjsier

Collaborator

Sure. If we're looking at the document in the context of a meeting, we already know the meeting date, so the date isn't adding information as the document title. It's more helpful to see whether we're looking at an Agenda or Minutes. Ideally we could check which section of links we're in and assign the title that way. Otherwise, it looks like "Agenda" or "Minutes" are often included in the URL string, so we could parse them that way.
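A rough sketch of that fallback order, preferring the section heading and then the URL text. The function name `link_title`, the "Document" fallback, and the example URLs are hypothetical and only serve the illustration.

```python
def link_title(href, section=None):
    """Pick "Agenda" or "Minutes" for a document link."""
    labels = ("Agenda", "Minutes")
    # Prefer the heading of the section the link was found in, if known.
    for label in labels:
        if section and label.lower() in section.lower():
            return label
    # Otherwise fall back to a case-insensitive search of the URL string.
    for label in labels:
        if label.lower() in (href or "").lower():
            return label
    # Hypothetical default for links that match neither label.
    return "Document"


titles = [
    link_title("https://example.com/docs/2021-03-02-Agenda.pdf"),
    link_title("https://example.com/docs/file.pdf", section="Meeting Minutes"),
    link_title("https://example.com/docs/file.pdf"),
]
```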

Contributor Author

Oh, I see. I'll check that out and make all the changes. Thank you for the explanation.

@sosolidkk
Contributor Author

Hello @pjsier, I was updating some stalled code and I made the corrections you suggested. I also updated the code to make the year of each item correct, since it was previously fixed to datetime.today().year. The only problem I still have is your suggestion to use Minutes and Agenda in the title. I can't think of a way to do this dynamically, since all I have on the page is an <h4> followed by several <p> tags that contain the links. I would have to count them, which makes it a more hard-coded process. Do you have any better suggestions?

@pjsier
Collaborator

pjsier commented Jan 11, 2021

@sosolidkk thanks for the changes! I mentioned in the comment, but the href attribute usually contains "Agenda" or "Minutes" which is one way, and you could also loop through a selector that iterates through the immediate children of .content and updates the document name any time it runs into an h4
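The second approach can be sketched as follows. In the spider this would presumably be a Scrapy selector over the immediate children of `.content`; to keep the example self-contained it uses the standard library's ElementTree on an invented, well-formed snippet, so the markup, the function name, and the dictionary keys are all assumptions.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: an <h4> heading followed by <p> tags holding links.
SAMPLE = """
<div class="content">
  <h4>Agenda</h4>
  <p><a href="/docs/a1.pdf">March 2</a></p>
  <p><a href="/docs/a2.pdf">April 6</a></p>
  <h4>Minutes</h4>
  <p><a href="/docs/m1.pdf">March 2</a></p>
</div>
"""


def group_links_by_heading(container):
    """Label each link with the most recent <h4> heading seen before it."""
    current_title = None
    links = []
    # Iterating an Element yields its immediate children, in document order.
    for child in container:
        if child.tag == "h4":
            current_title = (child.text or "").strip()
        elif child.tag == "p":
            for anchor in child.iter("a"):
                links.append(
                    {
                        "title": current_title,
                        "href": anchor.get("href"),
                        "date": (anchor.text or "").strip(),
                    }
                )
    return links


links = group_links_by_heading(ET.fromstring(SAMPLE.strip()))
```

Because the heading is carried forward as state, no counting of `<p>` tags is needed: each link simply inherits whichever `<h4>` came last.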

@sosolidkk
Contributor Author

Hey @pjsier, sorry for the delay. I've updated this PR with the changes you requested. Now I'm iterating over all the inner elements of the body and separating the items into groups based on their <h4> title value, which can be Agenda, Schedule or Minutes.

@sosolidkk sosolidkk requested a review from pjsier February 23, 2021 17:56
Collaborator

@pjsier pjsier left a comment

Sorry for the delay, and thanks for the updates! I had a few general questions I wanted to address first, let me know if I can clarify anything

self._add_year_to_date_item(content)
self._parse_date_to_datetime(content)

for item in content:
Collaborator

Is there a reason for the two separate loops here? It seems like we could consolidate them and have the individual parse functions handle getting details from a parent element rather than as separate steps of a loop

Contributor Author

I actually don't have a reason for using two for loops. For me it was simpler to save all the data into a list and then just run over the data in the list, but I can change that too. Anyway, was the suggestion you made about searching for the elements and iterating over them the same as what I did? Or did you have something else in mind?

Contributor Author

Done!


yield meeting

def _add_year_to_date_item(self, content):
Collaborator

It's a little hard to follow this when it's modifying an object in place and not returning a value. Could this be made more explicit with a function that returns a value instead of modifying one?
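As an illustration of the difference, here is a sketch of a version that returns new values and leaves its input untouched. The item shape (a dict with a `date` string) and the function name are assumptions made for the example, not the PR's actual data structure.

```python
def add_year_to_items(items, year):
    """Return new item dicts with the year appended to each date string."""
    # Build fresh dicts so the caller's list and its items stay unchanged.
    return [{**item, "date": f"{item['date']} {year}"} for item in items]


original = [{"date": "March 2 9:00 AM", "url": "/docs/a1.pdf"}]
updated = add_year_to_items(original, 2021)
```

The call site then reads `content = add_year_to_items(content, year)`, which makes the data flow visible without opening the helper.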

Contributor Author

Sure, I could do that.

Contributor Author

Done!

@sosolidkk sosolidkk requested a review from pjsier March 2, 2021 14:22
Collaborator

@pjsier pjsier left a comment

Thanks for these updates, it's coming along! I left a few comments here


def _parse_links(self, item):
    """Parse or generate links."""
    return [{"href": item.get("url"), "title": item["date"]}]
Collaborator

We'll want to know whether this is an agenda or meeting minutes

Contributor Author

Can I just replace the title key with the info needed (Agenda or Minutes)?


def _parse_time_notes(self, item):
    """Parse any additional notes on the timing of the meeting"""
    return ""
Collaborator

We might want to add something about confirming at the number they leave on the page

Contributor Author

@sosolidkk sosolidkk Mar 5, 2021

Are you talking about the number at the contact us section?

Contact Us
Lincoln Park Chamber of Commerce
2468 N. Lincoln
Chicago, IL 60614
773 880 5200

meeting_content["url"] = element.css("a::attr(href)").get()

meeting_content["description"] = _description
meeting_content["type"] = _type
Collaborator

It looks like this isn't currently being used?

Contributor Author

It isn't. I was looking for somewhere to put this info.

meeting_content["type"] = _type
meeting_content["year"] = _year

if "date" not in meeting_content:
Collaborator

There may be some meetings that are both listed for agendas and minutes, and we'll need to combine those so that this doesn't return duplicates.
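One possible way to combine them, sketched under the assumption that each scraped entry carries a parsed `start` datetime, an `href`, and a `type` of "Agenda" or "Minutes". Those keys and the function name are hypothetical.

```python
from datetime import datetime


def merge_by_start(entries):
    """Collapse entries that share a start time into one meeting with many links."""
    merged = {}
    for entry in entries:
        # The first entry seen for a start time creates the meeting record.
        meeting = merged.setdefault(
            entry["start"], {"start": entry["start"], "links": []}
        )
        meeting["links"].append({"href": entry["href"], "title": entry["type"]})
    # Dicts keep insertion order, so meetings come out in first-seen order.
    return list(merged.values())


entries = [
    {"start": datetime(2021, 3, 2, 9, 0), "href": "/docs/a1.pdf", "type": "Agenda"},
    {"start": datetime(2021, 4, 6, 9, 0), "href": "/docs/a2.pdf", "type": "Agenda"},
    {"start": datetime(2021, 3, 2, 9, 0), "href": "/docs/m1.pdf", "type": "Minutes"},
]
meetings = merge_by_start(entries)
```

The spider would then yield one meeting per merged record, each carrying both its agenda and minutes links.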

Contributor Author

Ok
