Dataset Collector #1, Kazyulina Marina - 19FPL1 #42

Open · wants to merge 55 commits into base: main
Conversation

marina-kaz

No description provided.

@demid5111 (Member) left a comment

To this moment, it is the first more or less recursive parser. However, there is spaghetti code, and the single responsibility principle is broken.

article.py (resolved)
"""
Useful constant variables
"""

import os
@demid5111 (Member):

all constants should be left in constants.py in the root of the project

config/raw_metadata_score_four_test.py (outdated, resolved)
constants.py (resolved)
@@ -7,3 +7,5 @@
 PROJECT_ROOT = os.path.dirname(os.path.realpath(__file__))
 ASSETS_PATH = os.path.join(PROJECT_ROOT, 'tmp', 'articles')
 CRAWLER_CONFIG_PATH = os.path.join(PROJECT_ROOT, 'crawler_config.json')
+LINKS_STORAGE = os.path.join(PROJECT_ROOT, 'links')
+URL_START = 'https://burunen.ru'
@demid5111 (Member):

If it is a seed URL or any other crawler configuration, then it belongs in the configuration file.
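A minimal sketch of what the reviewer's suggestion could look like, assuming the crawler config is the JSON file pointed to by CRAWLER_CONFIG_PATH; the 'base_urls' key is an assumed name, not confirmed by this thread:

    import json

    from constants import CRAWLER_CONFIG_PATH

    with open(CRAWLER_CONFIG_PATH, encoding='utf-8') as file:
        config = json.load(file)

    # 'base_urls' is an assumed key name; use whatever the config actually defines
    seed_urls = config['base_urls']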

@marina-kaz (Author):

Nope, it is the beginning of every URL in my source.

You see, as on many webpages, the href attributes in my source contain only the continuation of the URL, omitting the 'https://burunen.ru' part, so every time I try to request a link I found automatically, I first have to concatenate it with this beginning.

I thought about making an additional attribute on the Crawler class (something like self.url_start), but then I decided that if it is a constant, it should be placed among the other constants... am I wrong?
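For reference, the standard library already handles this concatenation: urllib.parse.urljoin builds an absolute URL from a base and a site-relative href, and leaves already-absolute links untouched. A minimal sketch, assuming URL_START stays in constants.py as in the diff above:

    from urllib.parse import urljoin

    from constants import URL_START

    # site-relative href -> 'https://burunen.ru/news/item'
    full_url = urljoin(URL_START, '/news/item')

    # an already-absolute href passes through unchanged
    same_url = urljoin(URL_START, 'https://burunen.ru/news/item')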

scrapper.py (5 review threads: outdated, resolved)
@demid5111 added the "Changes required" label (Reviewer has comments you need to apply. Once you are ready, replace it with Review Required) on Mar 12, 2021
@marina-kaz requested a review from demid5111 on March 30, 2021 at 12:12
@dmitry-uraev added the "Review Required" label (You are ready for next iteration of review) on Mar 30, 2021
@demid5111 (Member) left a comment

very well! need to polish

pipeline.py (3 review threads: outdated, resolved)

     def _scan_dataset(self):
         """
         Register each dataset entry
         """
-        pass
+        path = Path(ASSETS_PATH)
+        for file in path.iterdir():
@demid5111 (Member):
Can you find a way to get the files directly by the given name template?
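One way to do this is Path.glob, which applies a name template in a single call instead of manually filtering iterdir() results; the '*_raw.txt' pattern below is only an assumed example of the dataset's naming scheme:

    from pathlib import Path

    from constants import ASSETS_PATH

    # glob matches the name template directly; no manual filtering needed
    for file in Path(ASSETS_PATH).glob('*_raw.txt'):
        print(file.name)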

pipeline.py (4 review threads: outdated, resolved)
pos_frequency_pipeline.py (2 review threads: resolved)
@demid5111 added the "Changes required" label and removed the "Review Required" label on Apr 1, 2021
@marina-kaz requested a review from demid5111 on April 4, 2021 at 12:11
@demid5111 (Member) left a comment

improve

pipeline.py (outdated)

            raise NotADirectoryError
        if not list(path.iterdir()):
            raise EmptyDirectoryError
        files = [str(file.relative_to(path)) for file in path.iterdir()]
@demid5111 (Member):

No strings; keep Path objects instead of converting them with str().
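That is, the listing can keep pathlib.Path objects all the way through; Path supports joining, comparison, and name access natively, so the str() round-trip is unnecessary. A sketch based on the line under review:

    # Path objects throughout; no str() conversion
    files = [file.relative_to(path) for file in path.iterdir()]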

pos_frequency_pipeline.py (resolved)
@demid5111 (Member) left a comment

one more

pipeline.py (outdated)

     def get_articles(self):
         """
         Returns storage params
         """
-        pass
+        self._scan_dataset()
@demid5111 (Member):
It cannot be in a plain getter; read the requirements.
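A sketch of one way to satisfy this: scan the dataset once at construction time, so the getter carries no side effects. The class and attribute names here are assumptions based on the snippets above, not the PR's actual code:

    class CorpusManager:
        def __init__(self, path_to_raw_txt_data):
            self.path = path_to_raw_txt_data
            self._storage = {}
            self._scan_dataset()  # registration happens once, up front

        def get_articles(self):
            """
            Returns storage params
            """
            return self._storage  # plain getter: no side effects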

@demid5111 added the "🏆 Pipeline accepted 🥇" label and removed the "Changes required" label on Apr 5, 2021