Dataset Collector #1, Bykova Ekaterina - 19FPL2 #38

Open · wants to merge 36 commits into main
Conversation


@ffmiil ffmiil commented Mar 6, 2021

No description provided.

@ffmiil ffmiil changed the title from 'ao' to 'Dataset Collector #1, Bykova Ekaterina - 19FPL2' Mar 6, 2021
@dmitry-uraev dmitry-uraev self-assigned this Mar 6, 2021
@dmitry-uraev dmitry-uraev left a comment (Contributor)

It is the right time to implement the scrapper. The deadline is coming.

@dmitry-uraev dmitry-uraev added the 'Changes required' label (reviewer has comments you need to apply; once you are ready, replace it with 'Review Required') Mar 9, 2021
@ffmiil ffmiil requested a review from dmitry-uraev March 11, 2021 12:22
@dmitry-uraev dmitry-uraev left a comment (Contributor)

Good for now. Waiting for a green PR.

constants.py
@@ -7,3 +7,6 @@
PROJECT_ROOT = os.path.dirname(os.path.realpath(__file__))
ASSETS_PATH = os.path.join(PROJECT_ROOT, 'tmp', 'articles')
CRAWLER_CONFIG_PATH = os.path.join(PROJECT_ROOT, 'crawler_config.json')
HEADERS = {
Contributor: good
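A side note for readers: the hunk cuts off at HEADERS = {. In crawlers like this one, such a dict usually supplies a browser-like User-Agent so the target site serves full pages. The value below is a hypothetical placeholder, not the author's actual entry:

HEADERS = {
    # hypothetical placeholder; the real dict is truncated in the diff above
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}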

"max_number_articles_to_get_from_one_seed": 0
"base_urls": ["https://express-kamchatka1.ru/sobytiya.html"],
"total_articles_to_find_and_parse": 15,
"max_number_articles_to_get_from_one_seed": 15
Contributor: accepted
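Assembled, the new crawler_config.json would read roughly as follows; this sketch assumes the three keys shown in the hunk are the whole file:

{
    "base_urls": ["https://express-kamchatka1.ru/sobytiya.html"],
    "total_articles_to_find_and_parse": 15,
    "max_number_articles_to_get_from_one_seed": 15
}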

scrapper.py Outdated
from datetime import datetime
from bs4 import BeautifulSoup
from article import Article
from constants import CRAWLER_CONFIG_PATH
Contributor: you may import them in one line.
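One plausible reading of the suggestion is to pull the constants in on a single line; a sketch (importing HEADERS alongside CRAWLER_CONFIG_PATH is an assumption here, made because the request snippet below uses it):

from constants import CRAWLER_CONFIG_PATH, HEADERS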

scrapper.py Outdated
response = requests.get(url, headers=HEADERS)
if response:
    content = response.text
    links = self._extract_url(BeautifulSoup(content, 'html.parser'))
Contributor: you may try the "lxml" option here.
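The suggested variant as a sketch: BeautifulSoup accepts 'lxml' as a parser name once the third-party lxml package is installed (pip install lxml), and it is typically faster than the built-in 'html.parser':

links = self._extract_url(BeautifulSoup(content, 'lxml'))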

@dmitry-uraev (Contributor): Nice commit names, BTW)

@dmitry-uraev (Contributor): 'puk'

@ffmiil ffmiil removed the 'Changes required' label Mar 21, 2021
@ffmiil ffmiil added the 'Review Required' label (you are ready for the next iteration of review) Apr 2, 2021
@ffmiil ffmiil requested a review from dmitry-uraev April 2, 2021 14:13
@dmitry-uraev dmitry-uraev added the 'Changes required' label and removed the 'Review Required' label Apr 2, 2021
@dmitry-uraev dmitry-uraev left a comment (Contributor)

I see you have different sites specified in the crawler and in our table: http://express-kamchatka1.ru/ and https://www.e1.ru/news/. Is this correct?

@dmitry-uraev dmitry-uraev added the '🏆 Pipeline accepted' and '🕷️ Crawler accepted' labels and removed the 'Changes required' and 'Missed crawler deadline' labels Apr 5, 2021