Python competencies required to complete this tutorial:
- working with external dependencies, going beyond the Python standard library;
- working with external modules: local and downloaded from PyPI;
- working with files: create/read/update;
- applying basic cleaning techniques to the raw text: tokenization, lemmatization, etc.;
- extracting linguistic features from the raw text: part of speech, case, etc.
Processing data breaks down into the following steps:
- Loading raw data.
- Tokenizing the text.
- Performing necessary transformations, such as lemmatization or stemming.
- Extracting valuable information from the text, such as detecting the part of speech of each word.
- Saving the necessary information.
As a part of the second milestone, you need to implement the processing logic as a pipeline.py module.
When it is run as a standalone Python program, it should perform all the aforementioned stages.
During this assignment you will be working with the UD text description format (.conllu extension).
Refer to the corresponding document for the explanation of linguistic information extraction,
mappings to pymystem3 and pymorphy2, as well as the description of the fields.
Example execution (Windows):
python pipeline.py
Expected result:
- N raw texts previously collected by scrapper are processed.
- Each article has a processed version (or versions) saved in the tmp/articles directory.
An example of the tmp directory content for mark 6:
+-- 2022-2-level-ctlr
+-- tmp
+-- articles
+-- 1_raw.txt <- the paper with the ID (from scrapper.py run)
+-- 1_meta.json <- the paper meta-information (from scrapper.py run)
+-- 1_cleaned.txt <- processed text with no punctuation (by pipeline.py run)
+-- 1_pos_conllu.conllu <- processed text in the UD format (by pipeline.py run)
NOTE: When using CI (Continuous Integration), the generated processed-dataset.zip is available
in build artifacts. Go to the Actions tab in the GitHub UI of your fork, open the last job,
and if there is an artifact, you can download it.
Processing behavior must follow several steps:
- Pipeline takes a raw dataset that is collected by scrapper.py and placed at ASSETS_PATH (see constants.py for the particular place).
- Pipeline goes through each raw file, for example 1_raw.txt.
- Pipeline processes each text, sentence, and word token, and extracts linguistic information.
- Pipeline saves the extracted information into a file with the same ID for each processed article, for example 1_pos_conllu.conllu.
You state your ambition for the mark by editing the file target_score.txt. For example, 6
would mean that you have completed the tasks for mark 6 and request mentors to check whether you can get it.
- Desired mark 4:
  - pylint level: 5/10.
  - Pipeline validates that the raw dataset has a proper structure and fails appropriately if the latter is incorrect. Criteria:
    - Dataset exists (there is a folder).
    - Dataset is not empty (there are files inside).
    - Dataset is balanced - there are only files that follow the naming convention N_raw.txt, where N is a valid number.
    - Numbers of articles are from 1 to N without any gaps.
  - Pipeline tokenizes the text in each file, removes punctuation, and casts it to lowercase (no lemmatization or tagging).
  - Pipeline produces N_cleaned.txt files in the tmp/articles directory.
- Desired mark 6:
  - pylint level: 7/10.
  - All the requirements for mark 4.
  - Pipeline uses the pymystem3 library to perform lemmatization and POS tagging.
  - Pipeline defines ID, FORM, LEMMA, and POS information in the resulting .conllu file.
  - Pipeline produces N_pos_conllu.conllu files with text tagging for each article.
- Desired mark 8:
  - pylint level: 10/10.
  - All the requirements for mark 6.
  - Pipeline uses the pymystem3 library to extract morphological tags (pymystem3 tags are represented in angle brackets).
  - Pipeline defines the FEATS field alongside the ID, FORM, LEMMA, and POS fields in the resulting .conllu file.
  - Pipeline produces N_morphological_conllu.conllu files with extended morphological information for each article, e.g. word animacy.
- Desired mark 10:
  - pylint level: 10/10.
  - All the requirements for mark 8.
  - An additional pipeline is introduced that:
    - Uses a backup pymorphy2 analyzer and a backup tag converter for processing NOUN tags.
    - Produces N_full_conllu.conllu files with extended morphological information for NOUN provided by pymorphy2.
  - An additional pipeline pos_frequency_pipeline.py is introduced that:
    - Collects the frequencies of POS in each text.
    - Extends N_meta.json files with this information.
    - Visualizes this distribution as .png files that are created for each article and saved into N_image.png files.
All logic for instantiating and using the needed abstractions should be implemented
in a special block of the pipeline.py module:

def main():
    print('Your code goes here')


if __name__ == '__main__':
    main()
scrapper.py implementation

You will not be able to start your implementation if there is no scrapper.py implementation.
Make sure you have implemented and passed the scrapper.py assignment first.
- Ensure you only use pathlib to work with file paths.

As we discussed during lectures, it is always better to use something designed specifically
for the given task. Comparing the os and pathlib modules, the latter is the one designed
for most file-system-related operations. Make sure you use only pathlib in the code you write.
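For instance, a minimal illustration of the pathlib style expected throughout this lab (the file name used here is purely illustrative):

```python
from pathlib import Path

# Purely illustrative: build and check a path the pathlib way,
# instead of os.path.join()/os.path.exists().
assets = Path(__file__).parent / 'tmp' / 'articles'
raw_file = assets / '1_raw.txt'
if raw_file.exists():
    text = raw_file.read_text(encoding='utf-8')
```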
NOTE: Do not change modules external to your code, for example core_utils/article/article.py - consider them as not available for modification. If you see a way to improve external modules, propose it in a separate PR - mentors will review it separately and give you bonuses, as any improvements are appreciated.
As we discussed multiple times, when our Python programs work with real-world entities,
we need to emulate their behavior with new abstractions.
If we think of the Pipeline and consider the Single Responsibility Principle,
we will quickly realize that it is not the responsibility of the Pipeline to know
where the dataset files are located and how to read or write them, etc.
Therefore, we need a new abstraction to be responsible for such tasks.
We call it CorpusManager.

CorpusManager is an entity that knows where the dataset is placed and what files
are available in this dataset.
It should be instantiated with the following instruction:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)

A CorpusManager instance validates the dataset provided and saves all the
constructor arguments in attributes with corresponding names. Each instance should also
have an additional attribute self._storage of a dictionary type, filled with information about
the files. Read about the filling instructions in Stage 1.3.
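A minimal constructor sketch consistent with this description; the two helper methods it calls are described in the following stages:

```python
from pathlib import Path


class CorpusManager:
    """Sketch: knows where the dataset lives and which articles it contains."""

    def __init__(self, path_to_raw_txt_data: Path) -> None:
        # The constructor argument is saved in an attribute with the corresponding name
        self.path_to_raw_txt_data = path_to_raw_txt_data
        # Registry of the dataset files: {article_id: Article}
        self._storage = {}
        # Both helpers are called during initialization, as described below
        self._validate_dataset()
        self._scan_dataset()
```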
NOTE: Remember to use pathlib to create file path objects.
The pipeline expects that the dataset is collected by the scrapper.
It must not start working if the dataset is invalid.
The very first thing that should happen after a CorpusManager is instantiated is dataset validation.
Interface to implement:
def _validate_dataset(self) -> None:
pass
NOTE: The path to the dataset is stored in the path_to_raw_txt_data attribute. Remember to use the pathlib module in order to operate with paths.
NOTE: Call this method during initialization.
When the dataset is valid, the method returns None. Otherwise:
- One of the following errors is thrown:
  - FileNotFoundError: file does not exist;
  - NotADirectoryError: path does not lead to a directory;
  - InconsistentDatasetError: IDs contain gaps, the number of meta and raw files is not equal, or files are empty (for mark 4, check that the dataset contains no gaps in the IDs of raw files and that the files are not empty);
  - EmptyDirectoryError: directory is empty.
- The script immediately finishes execution.

A sketch of this validation flow is shown below.
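A minimal sketch of the validation order, assuming InconsistentDatasetError and EmptyDirectoryError are importable from the project's core utilities (check their actual location in your repository); this is not the only valid layout:

```python
def _validate_dataset(self) -> None:
    """Sketch of the validation checks described above."""
    path = self.path_to_raw_txt_data
    if not path.exists():
        raise FileNotFoundError
    if not path.is_dir():
        raise NotADirectoryError
    if not any(path.iterdir()):
        raise EmptyDirectoryError  # assumed importable from the project's core utilities
    # Only files matching the N_raw.txt convention are considered; others are ignored
    raw_files = [f for f in path.glob('*_raw.txt') if f.name.split('_')[0].isdigit()]
    ids = sorted(int(f.name.split('_')[0]) for f in raw_files)
    if not ids or ids != list(range(1, len(ids) + 1)):
        raise InconsistentDatasetError  # assumed importable from the project's core utilities
    if any(f.stat().st_size == 0 for f in raw_files):
        raise InconsistentDatasetError
```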
NOTE: While validating the dataset, ignore files whose names do not follow the naming conventions.
During initialization, the CorpusManager should scan the provided folder path and register each dataset entry.
The whole storage is represented by the self._storage attribute.
Filling the storage should be done by executing this method:
def _scan_dataset(self) -> None:
pass
NOTE: Call this method during initialization and save the results in the self._storage attribute.
SELF CHECK: Can you explain why the name of the method starts with an underscore?
The method should contain logic for iterating over the content of the folder,
finding all N_raw.txt files, and creating an Article instance for each file.
NOTE: The Article constructor expects a URL as its first argument. It is safe to pass None instead of the real URL: the pipeline does not need to know where the article was downloaded from. See the article package description.
As stated before, the self._storage attribute is just a dictionary.
Keys are the IDs of the files, values are instances of the Article class.
For example, when the pipeline finds a file 1_raw.txt, we put a new pair into the storage:
self._storage[1] = Article(url=None, article_id=1)
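A minimal sketch of _scan_dataset built on this description (how exactly you derive the ID from the file name is up to you):

```python
def _scan_dataset(self) -> None:
    """Register an Article instance for every N_raw.txt file found in the dataset."""
    for file in self.path_to_raw_txt_data.glob('*_raw.txt'):
        article_id = int(file.name.split('_')[0])
        # The URL is unknown at this stage, so None is passed on purpose
        self._storage[article_id] = Article(url=None, article_id=article_id)
```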
The self._storage attribute is not a part of the CorpusManager interface,
therefore we need a special getter - a method that simply returns the storage value:
def get_articles(self) -> dict:
pass
SELF CHECK: Can you explain why we might need getters?
Eventually, CorpusManager should return a dictionary of Article instances via the get_articles() method.
NOTE: CorpusManager knows where the files are and can easily find them by ID, but it is not its responsibility to perform the actual file reads and writes. See the core_utils/article/io.py module for article save/read functionality.
NB: Stages 0-2 are required to get the mark 4.
To get a mark not lower than 4, your pipeline must perform basic text preprocessing:
- Tokenize sentences (split into words).
- Lowercase each token.
- Remove punctuation.
After implementing preprocessing, your pipeline must save the results in files with names
following the pattern N_cleaned.txt. See the examples for a better understanding:
Raw text - Desired output.
You need to define abstractions responsible for managing data.
We start with the ConlluToken abstraction.
Interface to implement:
class ConlluToken:
pass
The ConlluToken abstraction should be initialized with the token string as a parameter:

conllu_token = ConlluToken(text='мама')

Abstraction field:
- _text - original token text from the article raw text.
The ConlluToken abstraction should implement returning the lowercased original form of
a token with punctuation removed via this method:
def get_cleaned(self) -> str:
pass
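A minimal sketch of the cleaning logic, assuming the standard library string.punctuation set is enough for your texts (you may need to extend it, e.g. for Russian quotation marks and dashes):

```python
import string

def get_cleaned(self) -> str:
    """Return the lowercased token with punctuation characters stripped."""
    return ''.join(
        char for char in self._text.lower()
        if char not in string.punctuation
    )
```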
NOTE: For mark 4 you are not required to fill in or implement the methods related to morphological parameters.
The ConlluSentence abstraction stores the representation of the sentence.
Interface to implement:
class ConlluSentence(SentenceProtocol):
pass
Abstraction fields:
- _position - sentence position in the article text;
- _text - original sentence text from the article raw text;
- _tokens - list of ConlluToken instances, one for each token in the sentence.
The ConlluSentence should define a method for getting the lowercased sentence with no punctuation:
def get_cleaned_sentence(self) -> str:
pass
NOTE: In this method it is mandatory to call the get_cleaned() method of the ConlluToken abstraction.
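A minimal sketch, assuming tokens that become empty after cleaning (pure punctuation) are skipped:

```python
def get_cleaned_sentence(self) -> str:
    """Join the cleaned forms of all tokens, dropping the ones that end up empty."""
    cleaned_tokens = (token.get_cleaned() for token in self._tokens)
    return ' '.join(token for token in cleaned_tokens if token)
```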
All of the above stages are necessary for implementing the
simplified MorphologicalAnalysisPipeline abstraction:
class MorphologicalAnalysisPipeline:
pass
The simplified MorphologicalAnalysisPipeline takes the raw text of the article and
saves the processed (lowercased, with no punctuation) text to a file N_cleaned.txt.
The abstraction should have a _corpus attribute which holds your CorpusManager instance.
It is executed with a simple interface method that you need to implement:
def run(self) -> None:
pass
Once executed, pipeline.run() iterates through the available articles taken from the CorpusManager,
reads each file, performs basic preprocessing, and writes the processed text to files.
NOTE: It is mandatory to get articles with the CorpusManager.get_articles() method.
NOTE: It is mandatory to read an article with the from_raw(path, article) function from the core_utils/article/io.py module.
HEALTH CHECK: Try to implement pipeline.run() in such a way that it goes through the articles collected by CorpusManager.get_articles(), reads each of them using the from_raw() function, stores the sentences in the article using Article.set_conllu_sentences(), and then writes the processed article to a file using the to_cleaned() function. That way you will at least see that everything works up to this point, and you can proceed to implementing the core logic of the pipeline. A sketch of this flow follows below.
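A minimal run() sketch of the flow described in the health check; the accessors used to obtain the raw file path and raw text are assumptions (consult the Article API in core_utils), and _process() is defined in the next stage:

```python
def run(self) -> None:
    """Read, preprocess and save every article registered by the CorpusManager."""
    for article in self._corpus.get_articles().values():
        # from_raw() loads the raw text into the Article instance;
        # get_raw_text_path() and get_raw_text() are hypothetical accessors here
        article = from_raw(article.get_raw_text_path(), article)
        # _process() (next stage) returns a list of ConlluSentence instances
        article.set_conllu_sentences(self._process(article.get_raw_text()))
        # Persist the cleaned text as N_cleaned.txt
        to_cleaned(article)
```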
All preprocessing logic is encapsulated in the following protected method:

def _process(self, text: str) -> List[ConlluSentence]:
    pass

It takes the text of the article, splits it into sentences, and returns a list of ConlluSentence instances.
NOTE: The _process(text) method should be called in the run() method.
HINT: You can use the split_by_sentence(text) function from the core_utils/article/article.py module for splitting text into sentences.
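A minimal mark-4 sketch of _process(), assuming whitespace splitting is enough for tokenization at this stage and that ConlluSentence accepts position, text and tokens in its constructor (its exact signature is your design decision):

```python
def _process(self, text: str) -> List[ConlluSentence]:
    """Split the raw text into ConlluSentence instances holding ConlluToken children."""
    sentences = []
    for position, sentence in enumerate(split_by_sentence(text)):
        # Whitespace tokenization is an assumption sufficient for the cleaned output
        tokens = [ConlluToken(text=token) for token in sentence.split()]
        sentences.append(ConlluSentence(position, sentence, tokens))
    return sentences
```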
In order to save each article to its separate file, inspect the core_utils/article/io.py module.
Use the to_cleaned(article) function to save the cleaned text to the appropriate folder.
Call this function with the article instance you want to save the text for.
The function generates a file with the name N_cleaned.txt, where N is the index of your article,
in the tmp/articles directory.
NOTE: It is mandatory to save the generated text to a file in the run() method.
NB: Stages 0-3 are required to get the mark 6.
To get a mark not lower than 6, your pipeline, in addition to the mark 4 requirements,
must perform morphological analysis of the text of each article using the pymystem3 library and
save the result in a file with a name following the pattern N_pos_conllu.conllu.
See the examples for a better understanding: Raw text - Desired output.

A file with the .conllu extension means that it corresponds to the UD format.
Starting with mark 6 you are required to save the results of morphological text analysis in the UD format.
For a better understanding of the format fields, see the UD description document.
NOTE: Specifically for mark 6, your pipeline should define the ID, FORM, LEMMA, and POS information in the resulting .conllu file.
As all storing and managing of article text information is done by the Article abstraction,
see the Article description document before proceeding to the next stages.
The MorphologicalTokenDTO abstraction stores the following information for each token:
- lemma
- part of speech
- morphological tags
Interface to implement:
class MorphologicalTokenDTO:
pass
A MorphologicalTokenDTO instance should have the following attributes:
- lemma - the token's lemma (base form);
- pos - the token's POS tag in the UD format;
- tags - the token's morphological tags in the UD format.

All class fields should be optional at initialisation. That means if there is no morphological tag information, the field should be left as an empty string.
NOTE: For mark 6 you only need lemma and POS information, so the tags field should be left empty.
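A minimal sketch consistent with the description above (all fields default to empty strings):

```python
class MorphologicalTokenDTO:
    """Container for the lemma, UD POS tag and UD morphological tags of a token."""

    def __init__(self, lemma: str = '', pos: str = '', tags: str = '') -> None:
        self.lemma = lemma
        self.pos = pos
        self.tags = tags
```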
After implementing MorphologicalTokenDTO, which is responsible for storing
morphological information, you need to add this information to the ConlluToken abstraction.
The following attributes should be added in the ConlluToken constructor:
- _morphological_parameters - stores morphological information as MorphologicalTokenDTO;
- position - stores the position of the token in the sentence as int.
After that, define getter and setter methods for the morphological information:
def set_morphological_parameters(self, parameters: MorphologicalTokenDTO) -> None:
pass
def get_morphological_parameters(self) -> MorphologicalTokenDTO:
pass
In the ConlluToken abstraction you also have to define a method
for the token's string representation in .conllu files:
def get_conllu_text(self, include_morphological_tags: bool) -> str:
pass
NOTE: The include_morphological_tags parameter is responsible for displaying morphological tags. For the mark 6 implementation you do not need to display them.
In this method you have to create a string with the following fields,
which must be joined with a tabulation character \t:
- position
- text
- lemma - will be filled with the lemma; you are going to do that in Stage 3.5
- pos
- xpos - filled with _
- feats - filled with _ so far (you will need it for mark 8)
- head - filled with 0
- deprel - filled with root
- deps - filled with _
- misc - filled with _

NOTE: The last four fields are filled in a fixed way because the UD format demands it, but they will not be needed in our laboratory work. A sketch of this method follows below.
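A minimal sketch of the token representation, assuming the attribute names introduced earlier; for mark 6 feats stays '_':

```python
def get_conllu_text(self, include_morphological_tags: bool) -> str:
    """Compose one token line of the .conllu output."""
    parameters = self._morphological_parameters
    feats = '_'
    if include_morphological_tags and parameters.tags:
        feats = parameters.tags  # used starting from mark 8
    fields = [
        str(self.position),   # ID
        self._text,           # FORM
        parameters.lemma,     # LEMMA
        parameters.pos,       # POS
        '_',                  # XPOS
        feats,                # FEATS
        '0',                  # HEAD
        'root',               # DEPREL
        '_',                  # DEPS
        '_',                  # MISC
    ]
    return '\t'.join(fields)
```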
Also, in the ConlluSentence abstraction you have to create the sentence's string representation for .conllu files.
For this aim you need to define the following methods:
def _format_tokens(self, include_morphological_tags: bool) -> str:
pass
def get_conllu_text(self, include_morphological_tags: bool) -> str:
pass
NOTE: The include_morphological_tags parameter is responsible for displaying morphological tags. For the mark 6 implementation you do not need to display them.
The _format_tokens() method formats the tokens, one per line.
The get_conllu_text() method creates the sentence's string representation that includes:
- sent_id
- text
- tokens

sent_id and text should look the following way:
f'# sent_id = {self._position}\n'
f'# text = {self._text}\n'
NOTE: To write the tokens you have to call the _format_tokens() method. A sketch of both methods follows below.
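A minimal sketch of both methods; check the reference .conllu outputs for the exact spacing between sentences:

```python
def _format_tokens(self, include_morphological_tags: bool) -> str:
    """One token line per token, joined with newlines."""
    return '\n'.join(
        token.get_conllu_text(include_morphological_tags) for token in self._tokens
    )

def get_conllu_text(self, include_morphological_tags: bool) -> str:
    """Sentence header (sent_id, text) followed by the token lines."""
    return (
        f'# sent_id = {self._position}\n'
        f'# text = {self._text}\n'
        f'{self._format_tokens(include_morphological_tags)}\n'
    )
```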
After implementing the abstractions for storing and managing morphological data,
you need to define the overall processing logic that fills the defined abstractions with data and
saves the processing result in the UD format. All processing and filling actions are
the responsibility of the pipeline, so you need to extend the MorphologicalAnalysisPipeline.

For mark 6, apart from tokenization, punctuation removal and casting to lowercase, you must implement the following processing:
- Setting the token id (the ID field in the UD-formatted file).
- Setting the token text (the FORM field in the UD-formatted file).
- Lemmatization (the LEMMA field in the UD-formatted file).
- POS tagging (the POS field in the UD-formatted file).
NOTE: For more information about these fields see UD format description.
MorphologicalAnalysisPipeline is executed with the same interface method run()
as described in the previous stages. The only difference is inside the processing logic.
Once executed:
- The pipeline goes over each Article instance and gets its raw text:
  - for each sentence in the raw text the pipeline fills a ConlluSentence;
  - for each token in the sentence the pipeline fills a ConlluToken;
  - for each ConlluToken instance the pipeline fills a MorphologicalTokenDTO;
  - to get the morphological information for MorphologicalTokenDTO the pipeline uses the pymystem3 library.
- The pipeline sets the Article conllu sentences field using the set_conllu_sentences(sentences) method.
- The pipeline saves the processing result using the to_conllu(article, include_morphological_tags: bool, include_pymorphy_tags: bool) function.
NOTE: It is mandatory to use the split_by_sentence function from the article module to split sentences.
To extract lemma and POS information you need to use the pymystem3 library.
NOTE: It is recommended to rely on the pymystem3 ability to process the text as a whole and perform tokenization, lemmatization and morphological analysis at once. There are several reasons to do that, but from the linguistic perspective it is worth remembering that context-aware lemmatization works better than lemmatization of each word separately.
Use the following way to analyze the text:
result = Mystem().analyze(text)
Here, text is the text that you want to process, e.g. the raw text of the article,
and result is the result of the morphological analysis.
Inspect result as you need - it stores all the information required for the assignment.
NOTE: Use a debugger to inspect result - you will find everything you need there.
HINT: result['text'] is likely to have the original word. Use the same approach to find the POS information and the normalized form.
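For illustration, the analysis result returned by pymystem3 is a list of dictionaries, and the relevant fields can typically be pulled out as shown below (the exact keys are worth verifying in a debugger):

```python
from pymystem3 import Mystem

result = Mystem().analyze('красивая мама')
for item in result:
    # Items for spaces and punctuation have no 'analysis' key - skip them
    if not item.get('analysis'):
        continue
    original = item['text']                # FORM
    lemma = item['analysis'][0]['lex']     # LEMMA
    grammar = item['analysis'][0]['gr']    # e.g. 'A=им,ед,полн,жен' - source of POS and FEATS
    print(original, lemma, grammar)
```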
Keep in mind that all processing logic is encapsulated in the protected _process(text) pipeline method, as described previously.
NOTE: The only punctuation mark used in the resulting .conllu files is the dot at the end of a sentence. Make sure you remove or ignore other punctuation marks.
NOTE: Since conversion to UD requires providing a POS tag for every entity, and pymystem3 and pymorphy2 do not give any tag to numbers and punctuation, you need to identify such tokens as numbers and punctuation and map them to the appropriate UD tags: PUNCT and NUM.
When you get the POS information for a word, it is not presented in the appropriate format. You need to convert it into the UD format. For more information about the UD format and tag conversion, see the UD format description.
The TagConverter abstraction is responsible for tag conversion between different formats.
Its interface is defined inside the core_utils/article/ud.py file.
You need to inherit the given interface and implement the following abstraction
inside the pipeline.py file:
class MystemTagConverter(TagConverter):
pass
The MystemTagConverter instance should be initialised with the path to the tag mappings file,
i.e. information about the correspondence between tags in one format and tags in another.
You need to define the tag mappings in JSON format under the lab_6_pipeline/data directory.
Again, see the UD format description and the UD mapping description for more information
on the UD format and its mapping with other formats.
NOTE: The JSON file with pymystem3 tag mappings should be named mystem_tags_mapping.json.
After initialising, the MystemTagConverter instance should extract the mapping information from
the file provided to its constructor. All the mapping information should be filled into the class
attribute field.
Specifically for mark 6 you need to convert the POS information from the pymystem3 format into
the UD format. For example, the pymystem3 analysis result string A=им,ед,полн,жен should be
converted into the ADJ UD POS tag. Method to implement:
def convert_pos(self, tags: str) -> str:
pass
The method extracts the POS tag from the pymystem3 analysis string and converts it into the UD format.
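A minimal sketch of the conversion, assuming the mapping loaded from mystem_tags_mapping.json contains a POS section such as {"A": "ADJ", "S": "NOUN", "V": "VERB", ...}; the exact file layout and the _pos_mapping attribute name used here are assumptions:

```python
import re

def convert_pos(self, tags: str) -> str:
    """Extract the leading pymystem3 POS tag (e.g. 'A' from 'A=им,ед,полн,жен') and map it to UD."""
    # The POS tag is the run of Latin capitals at the start of the analysis string
    pos = re.match(r'[A-Z]+', tags)
    # self._pos_mapping is an assumed attribute filled from the JSON mapping file;
    # 'X' (UD tag for "other") is an assumed fallback
    return self._pos_mapping.get(pos.group(), 'X') if pos else 'X'
```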
NOTE: Make sure you convert the pymystem3 POS tag right after getting it as a result of the text analysis. The pos field of the MorphologicalTokenDTO instance should be filled with the UD-formatted POS tag. The MystemTagConverter instance should be initialised as a field of MorphologicalAnalysisPipeline.
In order to save each article to its separate file, inspect the core_utils/article/io.py module.
Use the to_conllu(article, include_morphological_tags: bool, include_pymorphy_tags: bool) function
to save the result of the text POS tagging to the appropriate folder.
Call this function with the article instance you want to save the text for.
NOTE: The include_morphological_tags and include_pymorphy_tags parameters should be False.
The function generates a file with the name N_pos_conllu.conllu, where N is the index of your article,
in the tmp/articles directory.
NOTE: It is mandatory to save the generated text to a file in the run() method.
NB: Stages 0-4 are required to get the mark 8.
To get a mark not lower than 8, your MorphologicalAnalysisPipeline should also produce
files with extended morphological information for each article, e.g. word animacy,
and save the result in a file with a name following the pattern N_morphological_conllu.conllu.
See the examples for a better understanding: Raw text - Desired output.
NOTE: For mark 8 your pipeline should fill the FEATS field alongside the ID, FORM, LEMMA, and POS fields in the resulting .conllu file.
As you already know, the pymystem3 library allows you to get a morphological analysis of a particular word.
During the previous stages you extracted only the lemma and POS tag from the pymystem3 analysis.
For example, analyzing the word красивая, you got the lemmatized version - красивый - and its
POS tag - A - which you subsequently converted into the UD format.

For mark 8 you need to implement a deeper morphological analysis with the pymystem3 library
by obtaining the morphological features of the word. For example, when analysing the word красивая,
pymystem3 gives the following result: A=им,ед,полн,жен. You need to take the им,ед,полн,жен tags,
convert them into a UD-formatted string and save them alongside the other morphological information
in the UD format (the FEATS field in the UD-formatted file).
NOTE: It is still the _process(text) method that contains all the processing logic, including the additional analysis done with pymystem3.
NOTE: The pos, lemma, and tags fields of the MorphologicalTokenDTO abstraction should be initialized during processing with the word's morphological features converted into the UD format. The tags field is used later to fill the FEATS field in the UD-formatted file.
First of all, you need to extend the JSON file with mapping information,
as you are now working with additional morphological attributes.
Think about the correspondence between pymystem3 and UD format attributes and create the mappings accordingly.
As you now retrieve extended morphological information, you need to extend the
MystemTagConverter abstraction with the following method:
def convert_morphological_tags(self, tags: str) -> str:
pass
The method takes a pymystem3 word attributes string, extracts all morphological
features and converts them into the UD format according to the mapping information.
For example, the pymystem3 word attributes string A=им,ед,полн,жен should
be converted into the Case=Nom|Gender=Fem|Number=Sing string according
to the UD format. A sketch of this conversion follows below.
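A minimal sketch of the idea, assuming the JSON mapping groups pymystem3 values by UD category, e.g. {"Case": {"им": "Nom", ...}, "Gender": {"жен": "Fem", ...}, "Number": {"ед": "Sing", ...}}; the exact file layout and the _tag_mapping attribute name are assumptions:

```python
import re

def convert_morphological_tags(self, tags: str) -> str:
    """Convert 'A=им,ед,полн,жен' into a UD FEATS string such as 'Case=Nom|Gender=Fem|Number=Sing'."""
    # Split the whole analysis string into individual pymystem3 values;
    # the POS tag itself simply will not be found in the feature mappings
    values = [value for value in re.split(r'[=,|()]', tags) if value]
    ud_features = {}
    # self._tag_mapping is an assumed attribute: {ud_category: {mystem_value: ud_value}}
    for category, mapping in self._tag_mapping.items():
        for value in values:
            if value in mapping:
                ud_features[category] = mapping[value]
    # UD joins FEATS with '|' and orders them alphabetically by category name
    return '|'.join(f'{category}={value}' for category, value in sorted(ud_features.items()))
```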
NOTE: To have a better understanding of the conversion logic, see the UD document description.
In order to save each article to its separate N_morphological_conllu.conllu file,
call the to_conllu(article, include_morphological_tags: bool, include_pymorphy_tags: bool) function
from the core_utils/article/io.py module with each of your article instances as the parameter.
NOTE: The include_morphological_tags parameter should be True and the include_pymorphy_tags parameter should be False. Moreover, if the include_morphological_tags parameter of the get_conllu_text() method is False, the morphological tags will not be saved.
NOTE: It is mandatory to save the generated text to a file in the run() method.
NB: Stages 0-6 are required to get the mark 10.
For mark 10 you need to refine the logic of the existing MorphologicalAnalysisPipeline
by making a more flexible AdvancedMorphologicalAnalysisPipeline version.
Mark 10 implies not just extracting information from the text, but also statistical processing
and visualisation of the resulting information. For this purpose you are going to introduce
POSFrequencyPipeline. Let us start with improving the morphological pipeline processing logic.
You have used the pymystem3 analyzer before, and you may have noticed that in some places the analyzer
provides quite accurate analysis of tokens, while in other places it is inferior
and does not always give a correct analysis, for example confusing VERB with ADV. In fact,
there are various analyzer libraries providing functions for text analysis. Some analyzers
handle certain tasks better than others and vice versa.
You need to make your pipeline more flexible by adding a secondary backup analyzer,
pymorphy2, which will handle only NOUN tokens. The other parts of speech will be handled
using pymystem3 as before.
This way, by adding the possibility to use several analyzers, you can improve the quality of the program, using the best features of each analyzer.
Before adding pymorphy2 analyzer support, it is necessary to define a converter which will
convert the pymorphy2 tags into the UD format we are working with. Interface to implement:
class OpenCorporaTagConverter(TagConverter):
pass
The OpenCorporaTagConverter class should inherit from TagConverter and use its methods.
The OpenCorporaTagConverter instance should be initialised with the path to the tag mappings file.
You need to define the tag mappings in JSON format under the lab_6_pipeline/data directory.
Again, see the UD format description and the UD mapping description for more information
on the UD format and its mapping with other formats.
NOTE: The JSON file with pymorphy2 tag mappings should be named opencorpora_tags_mapping.json.
def convert_pos(self, tags: OpencorporaTagProtocol) -> str:
pass
def convert_morphological_tags(self, tags: OpencorporaTagProtocol) -> str:
pass
NOTE: Both the convert_pos() and convert_morphological_tags() methods require a special tags parameter of the OpencorporaTagProtocol type. When you analyse each token using pymorphy2, you get the result of the analysis as an OpencorporaTagProtocol token instance. Inspect this object to get all the morphological information required.
NOTE: As we handle only NOUN tokens, you have to parse the gender, number, animacy and case tags in the convert_morphological_tags() method.
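For illustration, this is how a pymorphy2 analysis is typically obtained and inspected; the attribute names on the tag object (POS, gender, number, animacy, case) come from the pymorphy2 OpencorporaTag API:

```python
import pymorphy2

analyzer = pymorphy2.MorphAnalyzer()
parsed = analyzer.parse('мама')[0]   # take the most probable analysis
tag = parsed.tag                     # OpencorporaTag instance
print(tag.POS)                       # e.g. 'NOUN'
print(tag.gender, tag.number, tag.animacy, tag.case)  # e.g. femn sing anim nomn
print(parsed.normal_form)            # lemma, e.g. 'мама'
```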
When you are done defining the pymorphy2-to-UD mappings and the relevant converter,
it is the right time to implement the AdvancedMorphologicalAnalysisPipeline class with
pymorphy2 as the backup analyser and OpenCorporaTagConverter as the backup converter.
Interface to implement:
class AdvancedMorphologicalAnalysisPipeline(MorphologicalAnalysisPipeline):
pass
You need to redefine the _process(text) method as you are inheriting from the
MorphologicalAnalysisPipeline class.
Method to redefine:
def _process(self, text: str) -> List[ConlluSentence]:
pass
NOTE: You also need to define your own initializer for this class to create two more attributes alongside the parent's attributes: pymorphy2 as the _backup_analyzer attribute and OpenCorporaTagConverter as the _backup_tag_converter attribute. Keep in mind that both fields are to be used only when you are working with NOUN tokens inside the _process(text) pipeline method.
NOTE: pymystem3 may define some words as NOUN, but when you use the backup pymorphy2 analyzer it may define the same words as not NOUN or even provide no analysis at all. In this case you should still use the result of the backup pymorphy2 analyzer.
You will also need to redefine the run() method as you are inheriting from
the MorphologicalAnalysisPipeline class and need to call the
to_conllu(article, include_morphological_tags: bool, include_pymorphy_tags: bool)
function with different parameters.
Method to redefine:
def run(self) -> None:
pass
Use the to_conllu(article, include_morphological_tags: bool, include_pymorphy_tags: bool)
function to save the result to the appropriate folder.
Call this function with the article instance you want to save the text for.
NOTE: The include_morphological_tags and include_pymorphy_tags parameters should be True.
The function generates a file with the name N_full_conllu.conllu, where N is the index of your article,
in the tmp/articles directory.
NOTE: It is mandatory to save the generated text to a file in the run() method.
We have just made several text processing pipelines with basic and advanced processing logic. However, this is only the beginning of your linguistic research: you have the data and now need to start analyzing it, gaining insights, understanding it and finding hidden meanings. During this stage we will make a small pipeline that computes the distribution of various parts of speech in our texts and visualizes it; perhaps it will give a better understanding of the texts.
This is a sample result we are going to obtain:
Now we are going to work with the pos_frequency_pipeline.py file and with the POSFrequencyPipeline class.
All code should be written in the main() function. POSFrequencyPipeline is instantiated in a
similar manner to MorphologicalAnalysisPipeline or AdvancedMorphologicalAnalysisPipeline:
corpus_manager = CorpusManager(...)
...
pipeline = POSFrequencyPipeline(corpus_manager=corpus_manager)
Since you are going to calculate POS frequencies of the tokens,
you will need access to the ConlluToken instances in ConlluSentence.
In order to do that, implement the following method in the ConlluSentence class:
def get_tokens(self) -> list[ConlluToken]:
pass
In order to work with .conllu files which have already been written (e.g.
N_full_conllu.conllu), you need not only to be able to open those files, but also to
represent the information from them using the abstractions you have previously written that are
responsible for representing .conllu information in your program - for example, the Article abstraction.
You need to implement a service function that reads the information from the .conllu file
into the program and populates the Article abstraction with all the information from the source file.
Interface to implement inside the pos_frequency_pipeline.py module:
def from_conllu(path: Union[Path, str],
                article: Optional[Article] = None) -> Article:
pass
The function takes the path to the requested article and, optionally, an instance of Article.
The function reads and extracts all the .conllu processing information from the relevant file and
fills a new Article instance, if no instance is provided, or fills the Article instance
provided within the article parameter.
HINT: Inspect the core_utils/article/ud.py module for service functionality that can be helpful in the current task, especially the extract_sentences_from_raw_conllu() function.
POSFrequencyPipeline is executed with the same interface method that you need to implement:
pipeline.run()
Once executed, pipeline.run():
- Iterates through the available articles taken from CorpusManager.
- Reads each file (any .conllu file type can be read, as each has POS information).
- Calculates the frequency of each part of speech (a sketch of this computation follows after this list).
- Writes the frequencies to the meta file.
- Visualizes the frequencies in the form of images with names following this convention: N_image.png.
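A minimal sketch of the frequency computation itself, assuming the sentences have already been loaded into the article via from_conllu(); the helper name and the sentence accessor are hypothetical:

```python
from collections import Counter

def _count_pos_frequencies(article) -> dict:
    """Illustrative helper: count how many times each UD POS tag occurs in the article."""
    frequencies = Counter()
    for sentence in article.get_conllu_sentences():  # accessor name is an assumption
        for token in sentence.get_tokens():
            frequencies[token.get_morphological_parameters().pos] += 1
    return dict(frequencies)
```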
NOTE: It is mandatory to get articles with the CorpusManager.get_articles() method.
NOTE: It is mandatory to use the Article.get_file_path() and Article.set_pos_info() methods. It is mandatory to use the to_meta() and from_meta() functions.
NOTE: You have to create an EmptyFileError exception class and raise it when an article file is empty.
NOTE: Make sure that the resulting meta files are valid: they must contain no more than one dictionary-like object.
HINT: To speed up pymystem3 when processing large texts, you should delete all line breaks. You can do that with regular expressions.
For visualization, use the visualize(article, path_to_save) function
from the visualizer.py module available under the core_utils folder of the project.
Sample usage:
visualize(article=article, path_to_save=ASSETS_PATH / '1_image.png')