
Dataset Collector #1, Zelekson Daniil - 19FPL1 #43

Open · wants to merge 54 commits into main

Conversation

@daniilzelekson (Author)

No description provided.

@dmitry-uraev (Contributor) left a comment:

It's time to write some code and lint & test it.

@@ -1,32 +1,35 @@
"""
@dmitry-uraev (Contributor):

Oh, finally. I was waiting for this one to be created.

scrapper.py Outdated

class IncorrectURLError(Exception):
"""
Custom error
"""
pass
# def __init__(self, ):
@dmitry-uraev (Contributor):

nice constructor initialization
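For reference, if the commented-out `__init__` were ever filled in, a minimal sketch might look like the following; the `url` argument and the message are illustrative, not from the PR:

```python
class IncorrectURLError(Exception):
    """Raised when a seed URL is malformed."""

    def __init__(self, url=''):
        # Keep the offending URL so callers can report it
        self.url = url
        super().__init__(f'Incorrect URL: {url!r}')
```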



class NumberOfArticlesOutOfRangeError(Exception):
"""
Custom error
"""

@dmitry-uraev (Contributor):

what is this line targeted for?


class IncorrectNumberOfArticlesError(Exception):
"""
Custom error
"""

@dmitry-uraev (Contributor):

same

scrapper.py Outdated
@@ -36,13 +39,15 @@ def find_articles(self):
"""
Finds articles
"""
pass
raise IncorrectURLError
@dmitry-uraev (Contributor):

?

@daniilzelekson (Author):

!
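The reviewer's "?" presumably points at `find_articles` raising `IncorrectURLError` unconditionally. A sketch of conditional validation instead; `validate_seed_url` is a hypothetical helper, not code from the PR:

```python
import re

class IncorrectURLError(Exception):
    """Custom error, as declared in the PR."""

def validate_seed_url(url):
    # Raise IncorrectURLError only for malformed input,
    # not unconditionally as the diff above does
    if not isinstance(url, str) or not re.match(r'https?://', url):
        raise IncorrectURLError(url)
    return url
```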

scrapper.py Outdated

def get_search_urls(self):
"""
Returns seed_urls param
"""
pass
return seed_urls
@dmitry-uraev (Contributor):

??
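The "??" likely targets the bare `return seed_urls`, which is an undefined name inside the method. A minimal sketch of the presumable intent, assuming `seed_urls` is an instance attribute (the class shape below is an assumption, not the PR's code):

```python
class Crawler:
    """Minimal sketch; the real class has more attributes."""

    def __init__(self, seed_urls):
        self.seed_urls = seed_urls

    def get_search_urls(self):
        """
        Returns seed_urls param
        """
        # A bare `seed_urls` is undefined in method scope;
        # the instance attribute is presumably what was meant
        return self.seed_urls
```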

@dmitry-uraev self-assigned this on Mar 9, 2021
@dmitry-uraev added the "Changes required" label (reviewer has comments you need to apply; once ready, replace it with "Review Required") on Mar 9, 2021
@dmitry-uraev added the "Missed crawler deadline" and "Changes required" labels and removed the "Changes required" label on Mar 16, 2021
@daniilzelekson (Author):

Oh! It was an accident!

@dmitry-uraev (Contributor) left a comment:

I do not see crawler for chosen link: https://znamia29.ru/

@dmitry-uraev (Contributor) left a comment:

  1. I see different links in crawler config and our table. Can you explain?
  2. Please move on to pipeline.

@@ -0,0 +1,86 @@
argon2-cffi==20.1.0
@dmitry-uraev (Contributor):

????

@daniilzelekson (Author):

from pip freeze

wrapt==1.12.1
xlrd==1.2.0
xlwt==1.3.0
zipp==3.4.1
@dmitry-uraev (Contributor):

You used all these? Will you share with me on Monday?

@@ -0,0 +1,26 @@
def get_month(m):
@dmitry-uraev (Contributor):

Nice framework, but it is not quite a framework :) It is just one module with one function.
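Given the Russian dates on the source site, `get_month` plausibly maps a genitive month name to a zero-padded number. A hypothetical reconstruction; the key set is an assumption inferred from the date-parsing code later in the PR:

```python
# Hypothetical reconstruction of get_month: keys assume Russian
# genitive month names as they appear in znamia29.ru article dates.
MONTHS = {
    'января': '01', 'февраля': '02', 'марта': '03', 'апреля': '04',
    'мая': '05', 'июня': '06', 'июля': '07', 'августа': '08',
    'сентября': '09', 'октября': '10', 'ноября': '11', 'декабря': '12',
}

def get_month(m):
    # KeyError on an unknown name is deliberate: it surfaces bad input
    return MONTHS[m.lower()]
```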

@@ -0,0 +1,3 @@
beautifulsoup4
@dmitry-uraev (Contributor):

better specify version here for consistency
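Pinning could look like the line below; the version number is illustrative (a release current at the time of the PR), not taken from the PR itself:

```
beautifulsoup4==4.9.3
```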

scrapper.py Outdated
@@ -27,71 +53,164 @@ class UnknownConfigError(Exception):
"""


lw = LinkWorker('', '')
@dmitry-uraev (Contributor):

???

@daniilzelekson (Author):

we need it to get absolute link from relative

Suggested change:
lw = LinkWorker('', '')
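A module-level instance kept only for link resolution can be avoided: the standard library's `urljoin` already converts a relative href to an absolute URL. A sketch; the base URL below is illustrative:

```python
from urllib.parse import urljoin

# urljoin resolves a relative href against the page it was found on,
# which is what the module-level LinkWorker instance is used for here
base = 'https://znamia29.ru/news/'  # illustrative page URL
absolute = urljoin(base, '../article/123')
print(absolute)  # https://znamia29.ru/article/123
```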

scrapper.py Outdated
arr = date_str.split(" ")
arr[0] = arr[0][0:len(arr[0]) - 1]
return arr[3] + '-' + get_month(arr[2]) + '-' + arr[1] + ' ' + arr[0] + ':00'
except Exception:
@dmitry-uraev (Contributor):

you may specify particular error here
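A sketch of the date normaliser with narrower exception handling, as the reviewer suggests. The input shape ('HH:MM, day month year') and the 'NOT FOUND' fallback are assumptions inferred from the surrounding diff; the `get_month` stand-in below is not the project's real helper:

```python
MONTHS = {'марта': '03'}  # stand-in for the project's get_month mapping

def get_month(m):
    return MONTHS[m]

def unify_date_format(date_str):
    # Input like '12:30, 5 марта 2021' is assumed from the indexing below
    try:
        arr = date_str.split(' ')
        arr[0] = arr[0][:-1]  # drop the trailing comma from the time part
        return arr[3] + '-' + get_month(arr[2]) + '-' + arr[1] + ' ' + arr[0] + ':00'
    except (IndexError, KeyError):
        # Narrower than bare Exception: an unexpected token count or an
        # unknown month name is handled, while real bugs still propagate
        return 'NOT FOUND'
```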

self.article.title = article_soup.find('h1').text
except Exception:
self.article.title = 'NOT FOUND'
self.article.topics.append(self.article.title)

@staticmethod
def unify_date_format(date_str):
@dmitry-uraev (Contributor):

good method

@dmitry-uraev added the "Missed pipeline deadline" and "🕷️Crawler accepted" labels and removed the "Changes required" and "Missed crawler deadline" labels on Apr 5, 2021
2 participants