diff --git a/Bengali POS Tagging/Bengali_POS_Tagging_project.ipynb b/Bengali POS Tagging/Bengali_POS_Tagging_project.ipynb
index 16955a3..6804d59 100644
--- a/Bengali POS Tagging/Bengali_POS_Tagging_project.ipynb
+++ b/Bengali POS Tagging/Bengali_POS_Tagging_project.ipynb
@@ -7,9 +7,19 @@
"colab_type": "text"
},
"source": [
- "
"
+ "
"
]
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "First, we use the Natural Language Toolkit (NLTK) library to apply basic POS tagging to an English corpus."
+ ],
+ "metadata": {
+ "id": "nhT_ER8KSARN"
+ },
+ "id": "nhT_ER8KSARN"
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -228,6 +238,16 @@
"nltk.pos_tag(text)"
]
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next, we install the `bnlp_toolkit` package."
+ ],
+ "metadata": {
+ "id": "YBLBPoj-SPeY"
+ },
+ "id": "YBLBPoj-SPeY"
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -281,6 +301,16 @@
"pip install bnlp_toolkit"
]
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In the next step, we take a small Bengali corpus and split its sentences into individual words using the rule-based BasicTokenizer from BNLP. We then apply the same tokenizer to two larger Bengali corpora.\n"
+ ],
+ "metadata": {
+ "id": "K85canWmSkNT"
+ },
+ "id": "K85canWmSkNT"
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -331,6 +361,16 @@
"print(tokens)"
]
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next, we use the NLTKTokenizer from BNLP, which is also rule-based, to tokenize the small Bengali corpus in two ways: word tokenization, which splits the text into individual words, and sentence tokenization, which splits it into individual sentences. We then apply the same steps to the two larger Bengali corpora."
+ ],
+ "metadata": {
+ "id": "pQVUZkemSqKR"
+ },
+ "id": "pQVUZkemSqKR"
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -393,6 +433,16 @@
"print(sentence_tokens)"
]
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In the next step, we use the POS tagger from BNLP with a pre-trained Conditional Random Field (CRF) model to tag each word of a small Bengali corpus with its part of speech."
+ ],
+ "metadata": {
+ "id": "64DBokj_S-Ru"
+ },
+ "id": "64DBokj_S-Ru"
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -445,6 +495,16 @@
"print(res)"
]
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next, we use SentencePieceTokenizer, an unsupervised tokenizer, on two Bengali corpora."
+ ],
+ "metadata": {
+ "id": "Cwg77f3-UZqa"
+ },
+ "id": "Cwg77f3-UZqa"
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -499,6 +559,16 @@
"print(tokens)"
]
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next, we embed the words of a Bengali corpus using BengaliWord2Vec with a pre-trained model from BNLP, a deep learning approach, and inspect the shape and values of the resulting word vectors."
+ ],
+ "metadata": {
+ "id": "nCm5xh8aUgzm"
+ },
+ "id": "nCm5xh8aUgzm"
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -544,6 +614,16 @@
"print(vector)"
]
},
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, we apply BasicTokenizer, NLTKTokenizer, and the POS tagger again to different corpora to test them."
+ ],
+ "metadata": {
+ "id": "MbYg1_xvUmZY"
+ },
+ "id": "MbYg1_xvUmZY"
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -634,9 +714,7 @@
"id": "7ef58636"
},
"outputs": [],
- "source": [
- ""
- ]
+ "source": []
}
],
"metadata": {