New/add spider chi_ssa_35 and its test case #991

sosolidkk · 2020-11-29T15:23:14Z

Summary

Issue: #568

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

Tests are implemented
All tests are passing
Style checks run (see documentation for more details)
Style checks are passing
Code comments from template removed

Questions

I have some doubts about this spider. The website itself lists the time, date, and place of every meeting, but the details of what will happen at each day's meeting are inside a .pdf document. What I did was put the content of the .pdf document into the description field in the spider. I don't know whether that was the correct approach, or whether the right way would be to iterate over the .pdf documents and parse the data inside them as meetings.

luchiago · 2020-11-29T15:52:11Z

city_scrapers/spiders/chi_ssa_35.py

+        needs.
+        """
+        content_div = response.css("div.content_block.content.background_white")
+        _ = content_div.css("h4::text").getall()


Why are you ignoring this variable?

I was using it for the commission titles in case I needed them later, but since they were all the same I'm just ignoring it.

This can be removed if it isn't being used

OK, I'll remove it.

pjsier

Thanks for the PR! Left some comments here

pjsier · 2020-11-30T14:25:07Z

city_scrapers/spiders/chi_ssa_35.py

+    def _parse_description(self, item):
+        """Parse or generate meeting description."""
+        item_url = item[1].css("::attr(href)").get()
+        response = requests.get(item_url)


There isn't much meaningful information in the description, so we can drop this. We would want to include PDF parsing within scrapy so we can manage rate-limiting and handling responses in a unified way instead of using requests here as well

pjsier · 2020-11-30T14:26:18Z

city_scrapers/spiders/chi_ssa_35.py

+    def _parse_start(self, item):
+        """Parse start datetime as a naive datetime object."""
+        date_item = self._clean_date_item(item[0])
+        date_obj = datetime.strptime(date_item, "%A, %B %d %H:%M %p %Y")


We've had some issues in the past with agencies putting the wrong day of the week, but the correct date, so we should ignore the weekday. Also, if we're using %p we'll need to use %I for the hour format

OK, I'll make the changes.
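A minimal sketch of the two fixes suggested above, assuming the cleaned string always starts with a weekday followed by a comma (for example "Tuesday, November 10 9:00 AM 2020"). The function name `parse_start` and the sample string are illustrative, not taken from the spider. The weekday is discarded before parsing, and `%I` replaces `%H` because `strptime` only applies `%p` to the hour when the hour was parsed with `%I`.

```python
from datetime import datetime


def parse_start(date_str):
    """Parse a string such as 'Tuesday, November 10 9:00 AM 2020'.

    The weekday is dropped before parsing because it may not match the date.
    Raises ValueError if the string has no comma or an unexpected layout.
    """
    # Keep everything after the first comma, discarding the weekday name
    _, _, without_weekday = date_str.partition(",")
    # %I is the 12-hour clock hour, which is what %p (AM/PM) requires
    return datetime.strptime(without_weekday.strip(), "%B %d %I:%M %p %Y")
```

With this approach a wrong weekday on the page no longer causes a parse failure, and afternoon times are shifted to the correct 24-hour value.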

pjsier · 2020-11-30T14:27:54Z

city_scrapers/spiders/chi_ssa_35.py

+    def _parse_location(self, item):
+        """Parse or generate location."""
+        return {
+            "address": "Lincoln Park Chamber of Commerce, 2468 N. Lincoln, Chicago",


It looks like this has varied a lot according to their site. We might want to just include "Confirm with agency" in the name and leave address blank

pjsier · 2020-11-30T14:31:26Z

city_scrapers/spiders/chi_ssa_35.py

+    def _parse_links(self, item):
+        """Parse or generate links."""
+        href = item[1].css("::attr(href)").get()
+        title = item[1].css("::text").get()


The date isn't very relevant if we're already associating it with a meeting on a given date, so we should label it Minutes or Agenda based off of the section or URL text

Could you elaborate a little more? I didn't understand @pjsier

Sure, if we're looking at the document in the context of a meeting we already know the meeting date, so it isn't adding information as the document title. It's more helpful to see whether we're looking at an Agenda or Minutes, so ideally we could check which section of links we're in and assign the title that way, otherwise it looks like "Agenda" or "Minutes" are often included in the URL string so we could parse them that way

Oh, I see. I'll check that out and make all the changes. Thank you for the explanation.
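One way the title could be derived, as a rough sketch: check the section heading first when it is known, then fall back to the URL text. The helper name `link_title`, the `section` parameter, and the "Link" default are all hypothetical and not part of the spider.

```python
def link_title(href, section=None):
    """Return 'Agenda' or 'Minutes' for a document link.

    `section` is the heading text the link was found under, if known.
    The URL is only consulted when the section gives no answer.
    """
    for source in (section or "", href or ""):
        lowered = source.lower()
        if "agenda" in lowered:
            return "Agenda"
        if "minutes" in lowered:
            return "Minutes"
    # Hypothetical default for links where neither word appears
    return "Link"
```

For example, `link_title("https://example.com/SSA35-Minutes-Jan.pdf")` gives "Minutes" from the URL alone, while a link found under an "Agendas" heading gets "Agenda" regardless of its URL.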

pjsier · 2020-11-30T14:31:40Z

city_scrapers/spiders/chi_ssa_35.py

+        needs.
+        """
+        content_div = response.css("div.content_block.content.background_white")
+        _ = content_div.css("h4::text").getall()


This can be removed if it isn't being used

sosolidkk · 2021-01-10T12:54:57Z

Hello @pjsier, I was updating some stalled code and made the corrections you suggested. I also updated the code to make the year of each item correct, since it was previously fixed to datetime.today().year. The only problem I still have is your suggestion to use Minutes and Agenda as the title. I can't think of a way to do this dynamically, since all I have on the page is an <h4> followed by several <p> tags that contain the links. I would have to count elements and make it a more hard-coded process. Do you have any better suggestions?

pjsier · 2021-01-11T14:05:49Z

@sosolidkk thanks for the changes! I mentioned in the comment, but the href attribute usually contains "Agenda" or "Minutes" which is one way, and you could also loop through a selector that iterates through the immediate children of .content and updates the document name any time it runs into an h4
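The second suggestion can be sketched roughly as follows. To keep the example runnable on its own it uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed fragment instead of Scrapy selectors; in the spider the same loop would presumably run over something like `response.css(".content > *")`, which is an assumption here. The markup, the function name `links_by_heading`, and the dictionary keys are illustrative.

```python
import xml.etree.ElementTree as ET

# Made-up markup mirroring the described layout: an <h4> followed by <p> links
SAMPLE = """
<div class="content">
  <h4>Agendas</h4>
  <p><a href="/a1.pdf">January 5</a></p>
  <p><a href="/a2.pdf">February 2</a></p>
  <h4>Minutes</h4>
  <p><a href="/m1.pdf">January 5</a></p>
</div>
"""


def links_by_heading(html):
    """Walk the immediate children in order, tagging each link with the last <h4> seen."""
    root = ET.fromstring(html.strip())
    current = None
    links = []
    for child in root:  # iterating an Element yields its direct children only
        if child.tag == "h4":
            # A new heading changes the section for every link that follows
            current = (child.text or "").strip()
            continue
        for anchor in child.iter("a"):
            links.append(
                {
                    "section": current,
                    "href": anchor.get("href"),
                    "text": (anchor.text or "").strip(),
                }
            )
    return links
```

Because the section is tracked as the loop advances, no counting of `<p>` tags is needed, and adding or removing links on the page does not change the logic.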

sosolidkk · 2021-02-17T19:46:33Z

Hey @pjsier, sorry for the delay. I've updated this PR with the changes you requested. Now I'm iterating over all the inner elements of the body and separating the items into groups based on their <h4> title, which can be Agenda, Schedule or Minutes.

pjsier

Sorry for the delay, and thanks for the updates! I had a few general questions I wanted to address first, let me know if I can clarify anything

pjsier · 2021-03-01T20:38:26Z

city_scrapers/spiders/chi_ssa_35.py

+        self._add_year_to_date_item(content)
+        self._parse_date_to_datetime(content)
+
+        for item in content:


Is there a reason for the two separate loops here? It seems like we could consolidate them and have the individual parse functions handle getting details from a parent element rather than as separate steps of a loop

I actually don't have a reason for using two for loops. For me it was simpler to save all the data into a list and then just run through that list, but I can change that too. Anyway, is the way I did it what you had in mind when you suggested searching for the elements and iterating over them? Or did you have something else in mind?

pjsier · 2021-03-01T20:39:13Z

city_scrapers/spiders/chi_ssa_35.py

+
+            yield meeting
+
+    def _add_year_to_date_item(self, content):


It's a little hard to follow this when it's modifying an object in-place and not returning a value, could this be made more explicit in a function that returns a value instead of modifying one?

Sure, I could do that.
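As an illustration of the difference being asked for, here is a small sketch of a helper that returns new values instead of mutating its argument. The function name `add_year`, the `year` parameter, and the shape of the items (dicts with a "date" string) are assumptions, since the actual data structure is not shown in the diff excerpt.

```python
def add_year(items, year):
    """Return new dicts with the year appended to each date string.

    The input list and its dicts are left untouched, so the caller can see
    exactly where the updated data comes from.
    """
    return [{**item, "date": f"{item['date']} {year}"} for item in items]
```

A caller would then write `content = add_year(content, 2020)`, which makes the data flow visible at the call site.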

pjsier

Thanks for these updates, it's coming along! I left a few comments here

pjsier · 2021-03-05T18:45:36Z

city_scrapers/spiders/chi_ssa_35.py

+
+    def _parse_links(self, item):
+        """Parse or generate links."""
+        return [{"href": item.get("url"), "title": item["date"]}]


We'll want to know whether this is an agenda or meeting minutes

Can I just replace the title key with the info needed (Agenda or Minutes)?

pjsier · 2021-03-05T18:45:59Z

city_scrapers/spiders/chi_ssa_35.py

+
+    def _parse_time_notes(self, item):
+        """Parse any additional notes on the timing of the meeting"""
+        return ""


We might want to add something about confirming at the number they leave on the page

Are you talking about the number in the Contact Us section?

Contact Us: Lincoln Park Chamber of Commerce, 2468 N. Lincoln, Chicago, IL 60614, 773 880 5200

pjsier · 2021-03-05T19:01:22Z

city_scrapers/spiders/chi_ssa_35.py

+                meeting_content["url"] = element.css("a::attr(href)").get()
+
+            meeting_content["description"] = _description
+            meeting_content["type"] = _type


It looks like this isn't currently being used?

It isn't. I was looking for somewhere to put this info.

pjsier · 2021-03-05T19:01:53Z

city_scrapers/spiders/chi_ssa_35.py

+            meeting_content["type"] = _type
+            meeting_content["year"] = _year
+
+            if "date" not in meeting_content:


There may be some meetings that are both listed for agendas and minutes, and we'll need to combine those so that this doesn't return duplicates.
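A possible way to combine such entries, sketched under the assumption that each scraped entry is a dict with "date", "url", and "type" keys (the names are taken loosely from the diff excerpts, and the function name `merge_by_date` is hypothetical). Entries sharing a date collapse into one meeting that carries every link.

```python
def merge_by_date(entries):
    """Group entries by date so each meeting appears once with all of its links."""
    merged = {}
    for entry in entries:
        # Create the meeting the first time a date is seen, reuse it afterwards
        meeting = merged.setdefault(
            entry["date"], {"date": entry["date"], "links": []}
        )
        meeting["links"].append({"href": entry["url"], "title": entry["type"]})
    # Dicts preserve insertion order, so meetings come back in first-seen order
    return list(merged.values())
```

A meeting listed under both Agendas and Minutes then yields a single item with two links rather than two duplicate meetings.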

add spider chi_ssa_35 and it's test case

f973966

luchiago reviewed Nov 29, 2020

pjsier requested changes Nov 30, 2020

sosolidkk added 2 commits January 10, 2021 09:50

Fixes based on @pjsier suggestions

4ae7193

Removal of unused imports

1c28a54

sosolidkk added 2 commits February 17, 2021 16:38

Changes based on @pjsier suggestions and copy of new html file for test

7ecb840

Remove of an extra comma

fcca61f

sosolidkk requested a review from pjsier February 23, 2021 17:56

pjsier requested changes Mar 1, 2021

Remove unecessary for loop and add return to make functions more explicit

eb4dfd0

sosolidkk requested a review from pjsier March 2, 2021 14:22

pjsier requested changes Mar 5, 2021

simran-2501 approved these changes Oct 20, 2023
