Technical Track of Computer Tools for Linguistic Research (2022/2023)

This track is part of the compulsory course Computer Tools for Linguistic Research at the National Research University Higher School of Economics.

This technical track is aimed at building basic skills for retrieving data from external WWW resources and processing it for future linguistic research. The idea is to automatically obtain a dataset that has a certain structure and appropriate content, and then to perform morphological analysis on it using various natural language processing (NLP) libraries. See the dataset requirements.

Instructors:

Khomenko Anna Yurievna - linguistic track lecturer
Lyashevskaya Olga Nikolaevna - linguistic track lecturer
Demidovskij Alexander Vladimirovich - technical track lecturer
Uraev Dmitry Yurievich - technical track practice lecturer
Kashchikhin Andrei Nikolaevich - technical track expert
Kazyulina Marina Sergeevna - technical track assistant
Zharikov Egor Igorevich - technical track assistant
Novikova Irina Alekseevna - technical track assistant
Blyudova Vasilisa Mikhailovna - technical track assistant
Zaytseva Vita Vyacheslavovna - technical track assistant

Project Timeline

Scrapper:
1. Short summary: Your code can automatically parse a media website of your choice and save the texts and their metadata in a proper format.
2. Deadline: April 14
3. Format: each student works in their own PR.
4. Dataset volume: 5-7 articles.
5. Design document: ./lab_5_scrapper/README.md
6. List of media websites to select from: in the Resources section on this page.

Pipeline:
1. Short summary: Your code can automatically process the raw texts from the previous step and perform part-of-speech tagging and basic morphological analysis.
2. Deadline: May 12
3. Format: each student works in their own PR.
4. Dataset volume: 5-7 articles.
5. Design document: ./lab_6_pipeline/README.md

Lectures history

Date	Lecture topic	Important links
13.03.2023	Lecture: Introduction to technical track.	Lab no. 5 description
17.03.2023	Seminar: 3rd party libraries.	N/A
20.03.2023	Lecture: Requests and `HTML`.	Listing
24.03.2023	Seminar: Headers and introduction to `bs4`.	Listing
27.03.2023	EXAM WEEK: skipping lecture and seminars.	N/A
03.04.2023	Lecture: Access file system via `pathlib`.	Listing, Listing
07.04.2023	Seminar: Early version of `HTMLParser`.	Listing
10.04.2023	Lecture: Working with dates via `datetime`.	Listing
14.04.2023	First deadline: crawler assignment.	N/A
17.04.2023	Lecture: Assignment no. 6: concept and details.	N/A
21.04.2023	Seminar: `CorpusManager` implementation.	N/A
24.04.2023	Lecture: Automated morphological analysis.	Listing, Listing
28.04.2023	Seminar: `pymystem3` API.	Listing, Listing
01.05.2023	HOLIDAYS: skipping lecture and seminars.	N/A
05.05.2023	HOLIDAYS: skipping lecture and seminars.	N/A
08.05.2023	HOLIDAYS: skipping lecture and seminars.	N/A
12.05.2023	Second deadline: pipeline assignment.	N/A

A more complete summary of the lectures is available as a list of topics.

Technical solution

Module	Description	Component	Needed for mark
`pathlib`	working with file paths	scrapper	4
`requests`	downloading web pages	scrapper	4
`BeautifulSoup4`	finding information on web pages	scrapper	4
`lxml`	parsing HTML (optional)	scrapper	6
`datetime`	working with dates	scrapper	6
`json`	working with the JSON text format	scrapper, pipeline	4
`pymystem3`	module for morphological analysis	pipeline	6
`pymorphy2`	module for morphological analysis	pipeline	10
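
To illustrate how `pathlib`, `json` and `datetime` from the table can fit together in the scrapper, here is a minimal sketch that stores one article as a raw text file plus a JSON metadata file. The helper name `save_article`, the file naming scheme (`1_raw.txt`, `1_meta.json`), the metadata keys and the date formats are assumptions made for illustration only; the actual requirements are described in the lab design document.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_article(base_dir: Path, article_id: int, text: str, meta: dict) -> tuple:
    """Store the raw text and the metadata of one article; return both paths."""
    base_dir.mkdir(parents=True, exist_ok=True)
    raw_path = base_dir / f"{article_id}_raw.txt"      # hypothetical naming scheme
    meta_path = base_dir / f"{article_id}_meta.json"   # hypothetical naming scheme
    raw_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps Cyrillic text readable inside the JSON file
    meta_path.write_text(
        json.dumps(meta, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return raw_path, meta_path


# Hypothetical date string in a format a media website might use
published = datetime.strptime("14.04.2023 10:30", "%d.%m.%Y %H:%M")

with tempfile.TemporaryDirectory() as tmp:
    raw_file, meta_file = save_article(
        Path(tmp) / "articles",
        1,
        "Текст статьи.",
        {"id": 1, "title": "Пример", "date": published.strftime("%Y-%m-%d %H:%M:%S")},
    )
    # Read the files back to check that the round trip preserves the content
    loaded_text = raw_file.read_text(encoding="utf-8")
    loaded_meta = json.loads(meta_file.read_text(encoding="utf-8"))
```

A temporary directory is used here so that the sketch leaves no files behind; a real scrapper would write to a permanent dataset folder instead.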

The software solution is built on top of three components:

scrapper.py - a module for finding articles from the given media, extracting text and dumping it to the file system. Students need to implement it.
pipeline.py - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.
article.py - a module for article abstraction to encapsulate low-level manipulations with the article.
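
As a rough illustration of what the tagging step in pipeline.py might do, the sketch below turns a morphological analysis into `lemma<POS>` tokens and counts part-of-speech frequencies. It does not call any NLP library: `SAMPLE_ANALYSIS` is a hand-written sample that imitates the list-of-dicts shape produced by `pymystem3`'s `Mystem().analyze()`, and the function name `tag_tokens` and the `lemma<POS>` output format are assumptions, not the format required by the lab.

```python
import re
from collections import Counter

# Hand-written sample (an assumption) imitating the shape of
# Mystem().analyze() output: dicts with "text" and, for words,
# an "analysis" list whose items carry "lex" (lemma) and "gr" (grammar tags).
SAMPLE_ANALYSIS = [
    {"text": "мама", "analysis": [{"lex": "мама", "gr": "S,жен,од=им,ед"}]},
    {"text": " "},
    {"text": "мыла", "analysis": [{"lex": "мыть", "gr": "V,несов,пе=прош,ед,изъяв,жен"}]},
    {"text": " "},
    {"text": "раму", "analysis": [{"lex": "рама", "gr": "S,жен,неод=вин,ед"}]},
    {"text": "\n"},
]


def tag_tokens(analysis: list) -> list:
    """Return 'lemma<POS>' strings for every token that has an analysis."""
    tagged = []
    for token in analysis:
        variants = token.get("analysis")
        if not variants:  # whitespace, punctuation or unrecognised tokens
            continue
        first = variants[0]  # take the first (most probable) variant
        # The part of speech is the first tag, before any "," or "="
        pos = re.split(r"[,=]", first["gr"], maxsplit=1)[0]
        tagged.append(f"{first['lex']}<{pos}>")
    return tagged


tagged_tokens = tag_tokens(SAMPLE_ANALYSIS)
# Extract the tag between "<" and the closing ">" to count POS frequencies
pos_frequencies = Counter(item[item.index("<") + 1:-1] for item in tagged_tokens)
print(" ".join(tagged_tokens))
print(dict(pos_frequencies))
```

Working on plain dictionaries keeps the sketch independent of any installed analyzer; in the real pipeline the sample would be replaced by the output of the chosen library.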

Handing over your work

Order of handing over:

1. The lab work is accepted for oral presentation.
2. The student explains how the program works and shows it in action.
3. The student completes a mini-task from a mentor that requires some slight code modifications.
4. The student receives a mark:
   1. The expected one, if all the steps above are completed and the mentor is satisfied with the answer.
   2. One point higher than the expected one, if all the steps above are completed and the mentor is very satisfied with the answer.
   3. One point lower than the expected one, if the lab is handed over no more than one week after the deadline and the criteria from 4.1 are satisfied.
   4. Two points lower than the expected one, if the lab is handed over more than one week after the deadline and the criteria from 4.1 are satisfied.

NOTE: A student might improve their mark for the lab, if they complete tasks of the next level after handing over the lab.
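
The marking rules above can be read as simple arithmetic. The sketch below is one possible interpretation, not an official grading script: the function name `final_mark` is hypothetical, "one week" is taken to mean 1 to 7 days after the deadline, and the bonus point is assumed to apply only to work handed over on time, since the text does not say how a bonus and a lateness penalty combine.

```python
from datetime import date


def final_mark(expected: int, deadline: date, handed_over: date,
               very_satisfied: bool = False) -> int:
    """Compute a mark from the expected one under the assumptions stated above."""
    days_late = (handed_over - deadline).days
    if days_late <= 0:
        # On time: the expected mark, plus one if the mentor is very satisfied
        return expected + (1 if very_satisfied else 0)
    if days_late <= 7:
        # Up to one week late: one point lower
        return expected - 1
    # More than one week late: two points lower
    return expected - 2


scrapper_deadline = date(2023, 4, 14)
print(final_mark(8, scrapper_deadline, date(2023, 4, 14)))
print(final_mark(8, scrapper_deadline, date(2023, 4, 20)))
print(final_mark(8, scrapper_deadline, date(2023, 4, 28)))
```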

A lab work is accepted for oral presentation if all the criteria below are satisfied:

There is a Pull Request (PR) with a correctly formatted name: Scrapper, <NAME> <SURNAME> - <UNIVERSITY GROUP NAME>. Example: Scrapper, Valeriya Kuznetsova - 19FPL1.
The PR has a filled-in target_score.txt file with the expected mark. Acceptable values: 4, 6, 8, 10.
The PR has a green status.
The PR has the label done, set by a mentor.
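
Two of these criteria can be checked locally before asking a mentor for a review. The sketch below is only an illustration: the function names and the regular expression are hypothetical, the name and surname are assumed to contain no spaces, and target_score.txt is assumed to contain nothing but the number, which may differ from the real file layout in the repository.

```python
import re

ACCEPTABLE_SCORES = {4, 6, 8, 10}

# Pattern for "<LAB>, <NAME> <SURNAME> - <UNIVERSITY GROUP NAME>"
PR_NAME_PATTERN = re.compile(
    r"^(?P<lab>\w+), (?P<name>\S+) (?P<surname>\S+) - (?P<group>\S+)$"
)


def is_valid_pr_name(title: str, lab: str = "Scrapper") -> bool:
    """Check that a PR title follows the naming format for the given lab."""
    match = PR_NAME_PATTERN.match(title)
    return bool(match) and match.group("lab") == lab


def parse_target_score(content: str) -> int:
    """Parse the expected mark, assuming the file holds only the number."""
    score = int(content.strip())
    if score not in ACCEPTABLE_SCORES:
        raise ValueError(f"Unacceptable target score: {score}")
    return score


print(is_valid_pr_name("Scrapper, Valeriya Kuznetsova - 19FPL1"))
print(parse_target_score("8\n"))
```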

Resources

Academic performance: link
Media websites list: link
Python programming course from previous semester: link
Scraping tutorials: YouTube series (in Russian)
HOWTO: Set up your fork
HOWTO: Running tests
HOWTO: Running assignments in terminal


fipl-hse/2022-2-level-ctlr

Folders and files

Latest commit

History

Repository files navigation

Technical Track of Computer Tools for Linguistic Research (2022/2023)

Project Timeline

Lectures history

Technical solution

Handing over your work

Resources

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages