New/add spider chi_ssa_35 and its test case #991

Open · wants to merge 6 commits into main
122 changes: 122 additions & 0 deletions city_scrapers/spiders/chi_ssa_35.py
@@ -0,0 +1,122 @@
from datetime import datetime
from io import BytesIO, StringIO

import requests
from city_scrapers_core.constants import COMMISSION
from city_scrapers_core.items import Meeting
from city_scrapers_core.spiders import CityScrapersSpider
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams


class ChiSsa35Spider(CityScrapersSpider):
    name = "chi_ssa_35"
    agency = "Chicago Special Service Area #35 Lincoln Ave"
    timezone = "America/Chicago"
    start_urls = [
        "https://www.lincolnparkchamber.com/"
        + "businesses/special-service-areas/lincoln-avenue-ssa/ssa-administration/"
    ]

    def parse(self, response):
        """
        `parse` should always `yield` Meeting items.

        Change the `_parse_title`, `_parse_start`, etc methods to fit your scraping
        needs.
        """
        content_div = response.css("div.content_block.content.background_white")
        _ = content_div.css("h4::text").getall()
Contributor: Why are you ignoring this variable?

Contributor Author: I was using it for the commission titles in case I needed it later, but since they were all the same I just ignored it.

Collaborator: This can be removed if it isn't being used.

Contributor Author: Ok, I'll remove it.

        dates = content_div.css("ol").css("li::text").getall()
        urls = content_div.css("a")
        content = list(zip(dates, urls[: len(dates)]))

        for item in content[:1]:
            meeting = Meeting(
                title=self._parse_title(item),
                description=self._parse_description(item),
                classification=self._parse_classification(item),
                start=self._parse_start(item),
                end=self._parse_end(item),
                all_day=self._parse_all_day(item),
                time_notes=self._parse_time_notes(item),
                location=self._parse_location(item),
                links=self._parse_links(item),
                source=self._parse_source(response),
            )

            meeting["status"] = self._get_status(meeting)
            meeting["id"] = self._get_id(meeting)

            yield meeting

    def _parse_title(self, item):
        """Parse or generate meeting title."""
        return "Commission"

    def _parse_description(self, item):
        """Parse or generate meeting description."""
        item_url = item[1].css("::attr(href)").get()
        response = requests.get(item_url)
Collaborator: There isn't much meaningful information in the description, so we can drop this. We'd want to include PDF parsing within Scrapy so we can manage rate limiting and handle responses in a unified way, instead of also using requests here (see the sketch after this method).


        lp = LAParams(line_margin=0.1)
        output_str = StringIO()
        extract_text_to_fp(BytesIO(response.content), output_str, laparams=lp)
        pdf_text = output_str.getvalue().replace("\n", "")

        return pdf_text
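A minimal sketch of the Scrapy-based approach the reviewer suggests (hypothetical: the `_parse_pdf` callback name and the `cb_kwargs` wiring are illustrative, not part of this PR, and `import scrapy` would be needed at the top of the module):

    # In parse(), let Scrapy fetch the PDF instead of calling requests.get,
    # so rate limiting and retries are handled by the framework
    yield scrapy.Request(
        item_url,
        callback=self._parse_pdf,
        cb_kwargs={"meeting": meeting},
    )

    def _parse_pdf(self, response, meeting):
        # response.body holds the raw PDF bytes; extract its text with pdfminer
        lp = LAParams(line_margin=0.1)
        output_str = StringIO()
        extract_text_to_fp(BytesIO(response.body), output_str, laparams=lp)
        meeting["description"] = output_str.getvalue().replace("\n", "")
        yield meeting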

    def _parse_classification(self, item):
        """Parse or generate classification from allowed options."""
        return COMMISSION

    def _clean_date_item(self, date_item):
        date_item = date_item.split()

        # Five tokens, e.g. "Wednesday, January 8 (9:00 am)": strip the
        # parentheses around the time, then append the current year
        if len(date_item) == 5:
            date_item[-1] = date_item[-1].replace(")", "")
            date_item[-2] = date_item[-2].replace("(", "")
            date_item.append(f"{datetime.today().year}")

        # Three tokens, e.g. "Wednesday, January 8": no time listed, so
        # default to 9:00 am and append the current year
        elif len(date_item) == 3:
            date_item.append("09:00 am")
            date_item.append(f"{datetime.today().year}")

        return " ".join(date_item)

    def _parse_start(self, item):
        """Parse start datetime as a naive datetime object."""
        date_item = self._clean_date_item(item[0])
        date_obj = datetime.strptime(date_item, "%A, %B %d %H:%M %p %Y")
Collaborator: We've had some issues in the past with agencies putting the wrong day of the week but the correct date, so we should ignore the weekday. Also, if we're using %p we'll need to use %I for the hour format (see the sketch after this method).

Contributor Author: Ok, I'll make the changes.


        return date_obj
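A sketch of the suggested fix, assuming the cleaned string looks like "Wednesday, January 8 09:00 am 2021" (hypothetical):

    def _parse_start(self, item):
        """Parse start datetime as a naive datetime object."""
        date_item = self._clean_date_item(item[0])
        # Drop the weekday token so a wrong weekday on the site can't break
        # parsing, and use %I (12-hour clock) so %p is applied correctly
        _, date_part = date_item.split(", ", 1)
        return datetime.strptime(date_part, "%B %d %I:%M %p %Y")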

    def _parse_end(self, item):
        """Parse end datetime as a naive datetime object. Added by pipeline if None"""
        return None

    def _parse_time_notes(self, item):
        """Parse any additional notes on the timing of the meeting"""
        return ""
Collaborator: We might want to add something about confirming at the number they leave on the page.

Contributor Author (@sosolidkk, Mar 5, 2021): Are you talking about the number in the Contact Us section?

Contact Us
Lincoln Park Chamber of Commerce
2468 N. Lincoln
Chicago, IL 60614
773 880 5200
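One possible wording for that note, using the number quoted above (hypothetical; the exact text is up to the reviewers):

    def _parse_time_notes(self, item):
        """Parse any additional notes on the timing of the meeting"""
        return "Confirm details with the Lincoln Park Chamber of Commerce at 773 880 5200"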


    def _parse_all_day(self, item):
        """Parse or generate all-day status. Defaults to False."""
        return False

    def _parse_location(self, item):
        """Parse or generate location."""
        return {
            "address": "Lincoln Park Chamber of Commerce, 2468 N. Lincoln, Chicago",
Collaborator: It looks like this has varied a lot according to their site. We might want to just include "Confirm with agency" in the name and leave the address blank (see the sketch after this method).

Contributor Author: 👍🏻

"name": "Lincoln Park Chamber of Commerce",
}
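A sketch of the change the review suggests, with the name and blank address taken from the comment above:

    def _parse_location(self, item):
        """Parse or generate location."""
        return {
            "address": "",
            "name": "Confirm with agency",
        }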

    def _parse_links(self, item):
        """Parse or generate links."""
        href = item[1].css("::attr(href)").get()
        title = item[1].css("::text").get()
Collaborator: The date isn't very relevant if we're already associating it with a meeting on a given date, so we should label it Minutes or Agenda based on the section or URL text.

Contributor Author: Could you elaborate a little more? I didn't understand, @pjsier.

Collaborator: Sure: if we're looking at the document in the context of a meeting, we already know the meeting date, so it adds no information as the document title. It's more helpful to see whether we're looking at an Agenda or Minutes, so ideally we could check which section of links we're in and assign the title that way; otherwise, "Agenda" or "Minutes" often appear in the URL string, so we could parse them from there (see the sketch after this method).

Contributor Author: Oh, I see. I'll check that out and make all the changes. Thank you for the explanation.


        return [{"href": href, "title": title}]
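A sketch of the URL-based labeling described above (hypothetical; assumes "agenda" or "minutes" appears in the document URLs, falling back to the link text otherwise):

    def _parse_links(self, item):
        """Parse or generate links."""
        href = item[1].css("::attr(href)").get()
        if "agenda" in href.lower():
            title = "Agenda"
        elif "minutes" in href.lower():
            title = "Minutes"
        else:
            title = item[1].css("::text").get()
        return [{"href": href, "title": title}]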

    def _parse_source(self, response):
        """Parse or generate source."""
        return response.url