From a6f9753dea8bbcb9279200510949dc4a73af8553 Mon Sep 17 00:00:00 2001
From: Rajdeep Das
Date: Tue, 7 Feb 2023 20:12:47 +0530
Subject: [PATCH] Created using Colaboratory

---
 .../Bengali_POS_Tagging_project.ipynb | 86 ++++++++++++++++++-
 1 file changed, 82 insertions(+), 4 deletions(-)

diff --git a/Bengali POS Tagging/Bengali_POS_Tagging_project.ipynb b/Bengali POS Tagging/Bengali_POS_Tagging_project.ipynb
index 16955a3..6804d59 100644
--- a/Bengali POS Tagging/Bengali_POS_Tagging_project.ipynb
+++ b/Bengali POS Tagging/Bengali_POS_Tagging_project.ipynb
@@ -7,9 +7,19 @@
         "colab_type": "text"
       },
       "source": [
-        "\"Open"
+        "\"Open"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We first use the Natural Language Toolkit (NLTK) library to apply basic POS tagging to an English corpus."
+      ],
+      "metadata": {
+        "id": "nhT_ER8KSARN"
+      },
+      "id": "nhT_ER8KSARN"
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -228,6 +238,16 @@
         "nltk.pos_tag(text)"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Next, we install the bnlp_toolkit package."
+      ],
+      "metadata": {
+        "id": "YBLBPoj-SPeY"
+      },
+      "id": "YBLBPoj-SPeY"
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -281,6 +301,16 @@
         "pip install bnlp_toolkit"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "In the next step, we take a small Bengali corpus and tokenize each Bengali word in its sentences using BNLP's rule-based BasicTokenizer. We then apply the same tokenizer to two larger Bengali corpora.\n"
+      ],
+      "metadata": {
+        "id": "K85canWmSkNT"
+      },
+      "id": "K85canWmSkNT"
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -331,6 +361,16 @@
         "print(tokens)"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "In the next step, we use BNLP's rule-based NLTKTokenizer on the small Bengali corpus in two phases: word tokenization and sentence tokenization. The word tokenizer splits the text into individual Bengali words, while the sentence tokenizer splits it into separate sentences. We then apply both to the two larger Bengali corpora."
+      ],
+      "metadata": {
+        "id": "pQVUZkemSqKR"
+      },
+      "id": "pQVUZkemSqKR"
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -393,6 +433,16 @@
         "print(sentence_tokens)"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "In the next step, we use BNLP's POS tagging function with its pre-trained model on a small Bengali corpus to tag each Bengali word and categorize it into a part of speech, a Conditional Random Field (CRF) based approach."
+      ],
+      "metadata": {
+        "id": "64DBokj_S-Ru"
+      },
+      "id": "64DBokj_S-Ru"
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -445,6 +495,16 @@
         "print(res)"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Next, we use the SentencePieceTokenizer, an unsupervised subword tokenizer, on the two Bengali corpora."
+      ],
+      "metadata": {
+        "id": "Cwg77f3-UZqa"
+      },
+      "id": "Cwg77f3-UZqa"
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -499,6 +559,16 @@
         "print(tokens)"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Next, we embed the Bengali words of a corpus with BengaliWord2Vec, using a pre-trained model from BNLP, to obtain each word's vector and its shape, a deep-learning-based approach."
+      ],
+      "metadata": {
+        "id": "nCm5xh8aUgzm"
+      },
+      "id": "nCm5xh8aUgzm"
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -544,6 +614,16 @@
         "print(vector)"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Finally, we run the BasicTokenizer, the NLTKTokenizer, and the POS tagging function again on different corpora to test them further."
+      ],
+      "metadata": {
+        "id": "MbYg1_xvUmZY"
+      },
+      "id": "MbYg1_xvUmZY"
+    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -634,9 +714,7 @@
         "id": "7ef58636"
       },
       "outputs": [],
-      "source": [
-        ""
-      ]
+      "source": []
     }
   ],
   "metadata": {