LICENSE_CC_BY_4.0.txt
LICENSE_CC_BY_NC_SA_4.0.txt
LICENSE_MIT.txt
README.md
dev
test_asr
test_bn.txt
test_news
test_ref
train

README.md

Punctuation Restoration Dataset

We used the dataset reported in Punctuation Restoration using Transformer Models for High-and Low-Resource Languages. The experimental scripts are available in the git repo: https://github.com/xashru/punctuation-restoration. The dataset also contains newspaper articles curated from https://data.mendeley.com/datasets/xp92jxr8wn/2.

Dataset

The dataset consists of train, development, and test splits prepared from a publicly available corpus of Bangla newspaper articles. Additionally, the authors prepared two test datasets from manually transcribed and ASR-transcribed texts. These were collected from 65 minutes of speech excerpts extracted from four Bangla short stories. Each token carries one of four labels, three of which mark punctuation: (i) Comma: commas, colons, and dashes, (ii) Period: full stops, exclamation marks, and semicolons, (iii) Question: question marks only, and (iv) O: any other token.
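The label scheme above can be illustrated with a minimal Python sketch that assigns each whitespace-separated token the label of the punctuation mark that follows it. This is not the authors' preprocessing script. The function name `label_tokens`, the label strings (`COMMA`, `PERIOD`, `QUESTION`, `O`), and the treatment of the Bangla dari (U+0964) as a full stop are assumptions made for illustration.

```python
# Hypothetical mapping from punctuation marks to the four classes.
# The label strings and the inclusion of the Bangla dari (U+0964)
# are assumptions, not taken from the dataset files.
PUNCT_TO_LABEL = {
    ",": "COMMA", ":": "COMMA", "-": "COMMA",
    ".": "PERIOD", "\u0964": "PERIOD", "!": "PERIOD", ";": "PERIOD",
    "?": "QUESTION",
}
PUNCT_CHARS = "".join(PUNCT_TO_LABEL)


def label_tokens(text):
    """Return (token, label) pairs for whitespace-separated text."""
    pairs = []
    for raw in text.split():
        # The last character of the raw token decides the label.
        label = PUNCT_TO_LABEL.get(raw[-1], "O")
        # Remove leading and trailing punctuation from the token itself.
        token = raw.strip(PUNCT_CHARS)
        if token:
            pairs.append((token, label))
        elif pairs:
            # A standalone punctuation mark labels the previous token.
            pairs[-1] = (pairs[-1][0], label)
    return pairs


if __name__ == "__main__":
    for token, label in label_tokens("yes, it works. really?"):
        print(token, label)
```

Colons and dashes fall into the same class as commas, and exclamation marks and semicolons into the same class as full stops, so a model trained on these labels cannot tell the merged marks apart.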

Directory Structure:

train: newspaper articles for training
dev: newspaper articles for development
test_news: newspaper articles for testing
test_ref: manual transcriptions
test_asr: ASR transcriptions
test_bn.txt:
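
This README does not describe the on-disk format of the splits listed above. The Python sketch below assumes one token and its label per line, separated by a tab. That layout and the helper name `parse_labeled_lines` are assumptions, so check them against the actual files before relying on this.

```python
def parse_labeled_lines(lines):
    """Parse 'token<TAB>label' lines into (token, label) pairs.

    The tab-separated layout is an assumption about the data files.
    """
    pairs = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(
                f"line {number}: expected 'token<TAB>label', got {line!r}"
            )
        pairs.append((parts[0], parts[1]))
    return pairs
```

The function accepts any iterable of strings, so an open file handle can be passed directly, for example `parse_labeled_lines(open(path, encoding="utf-8"))`, where `path` points to one of the split files.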

Licensing

The newspaper articles are licensed under CC BY 4.0. The manual and automatic transcriptions and the data splits are released under the MIT License.

Citation

@article{alam2021review,
  title={A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author={Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal={arXiv preprint arXiv:2107.03844},
  year={2021}
}

@inproceedings{alam-etal-2020-punctuation,
    title = "Punctuation Restoration using Transformer Models for High-and Low-Resource Languages",
    author = "Alam, Tanvirul  and Khan, Akib  and Alam, Firoj",
    booktitle = "Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.wnut-1.18",
    pages = "132--142",
}
@article{DBLP:journals/corr/abs-1911-07613,
 archiveprefix = {arXiv},
 author = {Aisha Khatun and
Anisur Rahman and
Hemayet Ahmed Chowdhury and
Md. Saiful Islam and
Ayesha Tasnim},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/journals/corr/abs-1911-07613.bib},
 eprint = {1911.07613},
 journal = {CoRR},
 timestamp = {Mon, 02 Dec 2019 17:48:37 +0100},
 title = {A Subword Level Language Model for Bangla Language},
 url = {http://arxiv.org/abs/1911.07613},
 volume = {abs/1911.07613},
 year = {2019}
}
