Dataset Collector #1, Kazyulina Marina - 19FPL1 #42

marina-kaz · 2021-03-08T14:34:25Z

No description provided.


demid5111

So far, this is the first more or less recursive parser. However, the code is spaghetti-like and violates the single responsibility principle.

article.py

demid5111 · 2021-03-12T11:52:55Z

config/constants.py

+Useful constant variables
+"""
+
+import os


All constants should stay in constants.py in the root of the project.

config/raw_metadata_score_four_test.py

constants.py

demid5111 · 2021-03-12T11:53:47Z

constants.py

@@ -7,3 +7,5 @@
 PROJECT_ROOT = os.path.dirname(os.path.realpath(__file__))
 ASSETS_PATH = os.path.join(PROJECT_ROOT, 'tmp', 'articles')
 CRAWLER_CONFIG_PATH = os.path.join(PROJECT_ROOT, 'crawler_config.json')
+LINKS_STORAGE = os.path.join(PROJECT_ROOT, 'links')
+URL_START = 'https://burunen.ru'


If this is the seed URL or any other crawler setting, put it in the configuration file.

marina-kaz · 2021-03-13

Nope, it is the beginning of every URL in my source.

You see, as on many web pages, the href attributes in my source contain only the continuation of the URL, omitting the 'https://burunen.ru' part, so every time I request a link that I found automatically, I first have to concatenate it with this beginning.

I thought about adding an extra attribute to the Crawler class (something like self.url_start), but then decided that, since it is a constant, it should be placed among the other constants... Am I wrong?
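
For illustration, the joining step described above could be sketched with the standard library's `urljoin` instead of plain string concatenation. This is only a sketch: `make_absolute` is a hypothetical helper and the example path is made up; only `URL_START` comes from the diff.

```python
from urllib.parse import urljoin

# Same value as the URL_START constant added in the diff above.
URL_START = 'https://burunen.ru'


def make_absolute(href, base=URL_START):
    """Resolve an href from the page against the site root.

    Hypothetical helper, not part of the reviewed code. Hrefs that are
    already absolute are returned unchanged by urljoin.
    """
    return urljoin(base, href)


# '/news/example' is an invented path used only for the demonstration.
print(make_absolute('/news/example'))
```

Compared with `URL_START + href`, this also tolerates links that already include the scheme and host.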

scrapper.py

demid5111

Very good! It needs some polishing.

pipeline.py

demid5111 · 2021-04-01T04:56:28Z

pipeline.py


    def _scan_dataset(self):
        """
        Register each dataset entry
        """
-        pass
+        path = Path(ASSETS_PATH)
+        for file in path.iterdir():


can you find a way to directly get files by the given name template?
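
One possible way to do this is `Path.glob`, which filters by a name pattern while iterating. The sketch below assumes that raw article files follow a `<id>_raw.txt` naming scheme; that pattern and the `find_raw_files` name are assumptions, not something shown in the diff.

```python
from pathlib import Path


def find_raw_files(assets_dir, pattern='*_raw.txt'):
    """Return files in assets_dir whose names match the pattern.

    Hypothetical helper: the '*_raw.txt' template is an assumed
    naming convention for raw article files.
    """
    # glob() yields only matching entries, so no manual name checks
    # are needed inside the loop; sorted() makes the order stable.
    return sorted(Path(assets_dir).glob(pattern))
```

With this, `_scan_dataset` would iterate over `find_raw_files(ASSETS_PATH)` rather than over every entry returned by `iterdir()`.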

pipeline.py

pos_frequency_pipeline.py

pipeline.py

pos_frequency_pipeline.py

demid5111

improve

demid5111 · 2021-04-05T11:55:05Z

pipeline.py

+        raise NotADirectoryError
+    if not list(path.iterdir()):
+        raise EmptyDirectoryError
+    files = [str(file.relative_to(path)) for file in path.iterdir()]


pos_frequency_pipeline.py

demid5111

one more

demid5111 · 2021-04-05T13:48:33Z

pipeline.py


    def get_articles(self):
        """
        Returns storage params
        """
-        pass
+        self._scan_dataset()


it cannot be in the plain getter - read the requirements
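
One reading of this comment is that the scan should run once, outside the getter, so that `get_articles` only returns what is already stored. The sketch below shows that split under several assumptions: the class name, the constructor argument, the `_storage` attribute and the `*_raw.txt` template are all hypothetical, and the actual requirements may prescribe something different.

```python
from pathlib import Path


class CorpusManager:
    """Hypothetical holder of dataset entries (name is an assumption)."""

    def __init__(self, path_to_raw_txt_data):
        self.path = Path(path_to_raw_txt_data)
        self._storage = {}
        # Scanning happens once, at construction time.
        self._scan_dataset()

    def _scan_dataset(self):
        """Register each dataset entry."""
        # Assumed naming scheme: '<id>_raw.txt'.
        for file in self.path.glob('*_raw.txt'):
            article_id = file.name.split('_')[0]
            self._storage[article_id] = file

    def get_articles(self):
        """Return storage params without any side effects."""
        return self._storage
```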

marina-kaz added 22 commits February 28, 2021 15:52

target score change

3261adb

completed stages 1 and 2

f471f4a

uhm whatever i'm doing here i want to get upded branch that'all i ask…

742558c

… for thanks

build for four

dbcaa4d

build for four, fixed target score

0c35e77

build for idk what score

8a09da7

fixed linting a little

6fee8b3

major link work

aafdefd

added requirements

0cc3a91

changed target score

e3e8eda

i dont understand why test fails

1355f14

oooohh

56fbb38

i have questions

cfeae30

didn't work

39e8923

working on

6e5ec03

there is no pleasing you

4733964

i'm experimenting

ead1f17

added stuff for 10, tests will fail?

cf7e97b

fixed my favorite lint

54a49ed

optimized a few things

2854e58

improved text formation

358f74b

uhm

f7aa7f6

demid5111 added the Demidovskij A.V. label Mar 12, 2021

demid5111 self-assigned this Mar 12, 2021

demid5111 requested changes Mar 12, 2021


demid5111 added the Changes required label Mar 12, 2021

dmitry-uraev and others added 4 commits March 12, 2021 19:21

Merge remote-tracking branch 'upstream/main' into HEAD

9f45627

fixed some problems from review

e75381c

fixed lint

225e72a

fixed config valid

dc7c08a

marina-kaz requested a review from demid5111 March 30, 2021 12:12

marina-kaz added 6 commits March 30, 2021 15:16

Merge remote-tracking branch 'upstream/main' into main

7a7a01f

fixed everything

19ff11d

now everything

c30d85a

i don't understand

c58526b

fixed regular expression

0a9c9fe

added COM to tags

db17bf6

dmitry-uraev added the Review Required label Mar 30, 2021

demid5111 requested changes Apr 1, 2021


demid5111 added Changes required and removed Review Required labels Apr 1, 2021

demid5111 reviewed Apr 1, 2021


pos_frequency_pipeline.py (comment resolved)

marina-kaz added 6 commits April 4, 2021 14:40

added requested changes, ready to fight linter

7ea4755

fought linter

7e85982

fixed dataset validation

a83469e

adjusted ds validator

27280a0

i am sorry for my terrible commit history

4efc05d

adjusted ds validator

b45ee40

marina-kaz requested a review from demid5111 April 4, 2021 12:11

demid5111 requested changes Apr 5, 2021


marina-kaz added 4 commits April 5, 2021 15:59

fixed several drawbacks

5c3a245

fixed several drawbacks

f6c7034

fixed several drawbacks

b8399cc

oh well I noticed smt else

b8dc298

demid5111 requested changes Apr 5, 2021


marina-kaz added 2 commits April 5, 2021 16:58

turned get articles into authentic getter

96f7e8e

fixed lintering

ff570e9

demid5111 added 🏆 Pipeline accepted 🥇 and removed Changes required labels Apr 5, 2021
