This technical track is part of the compulsory course Computer Tools for Linguistic Research at the National Research University Higher School of Economics.
It is aimed at building basic skills for retrieving data from external WWW resources and processing it for future linguistic research. The idea is to automatically obtain a dataset that has a certain structure and appropriate content, and then to perform morphological analysis using various natural language processing (NLP) libraries. See the dataset requirements.
Instructors:
- Khomenko Anna Yurievna - linguistic track lecturer
- Lyashevskaya Olga Nikolaevna - linguistic track lecturer
- Demidovskij Alexander Vladimirovich - technical track lecturer
- Uraev Dmitry Yurievich - technical track practice lecturer
- Kashchikhin Andrei Nikolaevich - technical track expert
- Kazyulina Marina Sergeevna - technical track assistant
- Zharikov Egor Igorevich - technical track assistant
- Novikova Irina Alekseevna - technical track assistant
- Blyudova Vasilisa Mikhailovna - technical track assistant
- Zaytseva Vita Vyacheslavovna - technical track assistant
- Scrapper:
  - Short summary: Your code can automatically parse a media website of your choice and save the texts and their metadata in a proper format.
  - Deadline: April 14
  - Format: each student works in their own PR.
  - Dataset volume: 5-7 articles.
  - Design document: `./lab_5_scrapper/README.md`.
  - List of media websites to select from: see the Resources section on this page.
- Pipeline:
  - Short summary: Your code can automatically process raw texts from the previous step and perform part-of-speech tagging and basic morphological analysis.
  - Deadline: May 12
  - Format: each student works in their own PR.
  - Dataset volume: 5-7 articles.
  - Design document: `./lab_6_pipeline/README.md`
Date | Lecture topic | Important links |
---|---|---|
13.03.2023 | Lecture: Introduction to technical track. | Lab no. 5 description |
17.03.2023 | Seminar: 3rd party libraries. | N/A |
20.03.2023 | Lecture: Requests and HTML. | Listing |
24.03.2023 | Seminar: Headers and introduction to bs4. | Listing |
27.03.2023 | EXAM WEEK: skipping lecture and seminars. | N/A |
03.04.2023 | Lecture: Access file system via pathlib. | Listing, Listing |
07.04.2023 | Seminar: Early version of HTMLParser. | Listing |
10.04.2023 | Lecture: Working with dates via datetime. | Listing |
14.04.2023 | First deadline: crawler assignment. | N/A |
17.04.2023 | Lecture: Assignment no. 6: concept and details. | N/A |
21.04.2023 | Seminar: CorpusManager implementation. | N/A |
24.04.2023 | Lecture: Automated morphological analysis. | Listing, Listing |
28.04.2023 | Seminar: pymystem3 API. | Listing, Listing |
01.05.2023 | HOLIDAYS: skipping lecture and seminars. | N/A |
05.05.2023 | HOLIDAYS: skipping lecture and seminars. | N/A |
08.05.2023 | HOLIDAYS: skipping lecture and seminars. | N/A |
12.05.2023 | Second deadline: pipeline assignment. | N/A |
A more complete summary of the lectures is available as a list of topics.
Module | Description | Component | Needed for mark |
---|---|---|---|
pathlib | working with file paths | scrapper | 4 |
requests | downloading web pages | scrapper | 4 |
BeautifulSoup4 | finding information on web pages | scrapper | 4 |
lxml | parsing HTML (optional) | scrapper | 6 |
datetime | working with dates | scrapper | 6 |
json | working with the JSON text format | scrapper, pipeline | 4 |
pymystem3 | module for morphological analysis | pipeline | 6 |
pymorphy2 | module for morphological analysis | pipeline | 10 |
The software solution is built on top of three components:
- `scrapper.py` - a module for finding articles from the given media, extracting text and dumping it to the file system. Students need to implement it.
- `pipeline.py` - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.
- `article.py` - a module for article abstraction to encapsulate low-level manipulations with the article.
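To make the "dumping it to the file system" step more concrete, here is a minimal sketch that uses only `pathlib`, `json` and `datetime` from the module table above. The function names, the `<id>_raw.txt` / `<id>_meta.json` naming scheme and the metadata keys are assumptions made for illustration. The actual requirements are defined in the design documents of the labs.

```python
import json
from datetime import datetime
from pathlib import Path


def parse_date(raw: str) -> str:
    """Convert a date such as '14.04.2023' to ISO format ('2023-04-14')."""
    return datetime.strptime(raw, "%d.%m.%Y").date().isoformat()


def save_article(base_dir: Path, article_id: int, text: str, meta: dict) -> tuple[Path, Path]:
    """Write the raw text and its metadata side by side and return both paths."""
    base_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme, not taken from the course materials.
    text_path = base_dir / f"{article_id}_raw.txt"
    meta_path = base_dir / f"{article_id}_meta.json"
    text_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps non-ASCII (e.g. Cyrillic) text readable in the file.
    meta_path.write_text(
        json.dumps(meta, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return text_path, meta_path
```

A call such as `save_article(Path("articles"), 1, text, {"title": title, "date": parse_date("14.04.2023")})` would then produce one text file and one JSON file per article.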
Order of handing over:
1. Lab work is accepted for oral presentation.
2. A student has explained the work of the program and shown it in action.
3. A student has completed the mini-task from a mentor that requires some slight code modifications.
4. A student receives a mark:
   1. That corresponds to the expected one, if all the steps above are completed and the mentor is satisfied with the answer.
   2. One point higher than the expected one, if all the steps above are completed and the mentor is very satisfied with the answer.
   3. One point lower than the expected one, if the lab is handed over no more than one week after the deadline and the criteria from item 4.1 are satisfied.
   4. Two points lower than the expected one, if the lab is handed over more than one week after the deadline and the criteria from item 4.1 are satisfied.
NOTE: A student may improve their mark for a lab by completing tasks of the next level after handing the lab over.
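The marking rules above can be read as simple arithmetic. The sketch below is one possible reading, not an official calculator: the function name and parameters are hypothetical, lateness is measured in days, the bonus point is applied only to labs handed over on time, and the result is not capped. None of these details are stated in the rules themselves.

```python
def final_mark(expected: int, days_late: int = 0, very_satisfied: bool = False) -> int:
    """Return the mark for a lab that passed all handing-over steps."""
    if days_late > 7:
        # More than one week after the deadline: two points lower.
        return expected - 2
    if days_late > 0:
        # No more than one week after the deadline: one point lower.
        return expected - 1
    # On time: the expected mark, plus one if the mentor is very satisfied.
    return expected + 1 if very_satisfied else expected
```

For example, under these assumptions a lab with an expected mark of 8 that is handed over three days late yields `final_mark(8, days_late=3)`, which is 7.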
A lab work is accepted for oral presentation if all the criteria below are satisfied:
- There is a Pull Request (PR) with a correctly formatted name: `Scrapper, <NAME> <SURNAME> - <UNIVERSITY GROUP NAME>`. Example: `Scrapper, Valeriya Kuznetsova - 19FPL1`.
- It has a filled file `target_score.txt` with an expected mark. Acceptable values: 4, 6, 8, 10.
- It has green status.
- It has a label `done`, set by the mentor.
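Before asking for a review, the first two criteria can be checked locally. The following is a rough sketch: the helper names and the regular expression are assumptions, and it only covers the `Scrapper` prefix shown in the example above, with the name, surname and group each treated as a single word without spaces.

```python
import re

# Assumed pattern: "Scrapper, <NAME> <SURNAME> - <UNIVERSITY GROUP NAME>".
PR_NAME_PATTERN = re.compile(r"Scrapper, \S+ \S+ - \S+")
ACCEPTABLE_SCORES = {4, 6, 8, 10}


def is_valid_pr_name(title: str) -> bool:
    """Check that the whole PR title follows the expected format."""
    return PR_NAME_PATTERN.fullmatch(title) is not None


def is_valid_target_score(content: str) -> bool:
    """Check that the content of target_score.txt is an acceptable mark."""
    try:
        return int(content.strip()) in ACCEPTABLE_SCORES
    except ValueError:
        # Empty file or non-numeric content.
        return False
```

The content of `target_score.txt` can be passed in with `Path("target_score.txt").read_text(encoding="utf-8")`. The exact location of the file depends on the lab folder.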
- Academic performance: link
- Media websites list: link
- Python programming course from previous semester: link
- Scraping tutorials: YouTube series (in Russian)
- HOWTO: Set up your fork
- HOWTO: Running tests
- HOWTO: Running assignments in terminal