IndicNLPParser

Indic NLP Unique Words

This program builds a list of the unique words found in Wikipedia for a language, along with the frequency of each word. It also includes a Bloom filter implementation that returns a flag indicating whether a word is valid in a language. The current implementation has Bloom filters for Tamil, Telugu, Malayalam and Bengali.
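To make the Bloom filter idea concrete, here is a minimal sketch in Python. It is not the code in bloomfilter.py: the class name, the use of SHA-256 and the default false-positive rate are all assumptions made for illustration. The point it shows is the trade-off: a word that was added is always reported as present, while a word that was never added is reported as present only with a small, tunable probability.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Illustrative Bloom filter (hypothetical; not the repository's class)."""

    def __init__(self, items_count, fp_prob=0.01):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        self.size = max(8, math.ceil(-items_count * math.log(fp_prob) / math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / items_count * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Derive k bit positions by salting a single hash function with an index.
        data = word.encode("utf-8")
        for i in range(self.hash_count):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))


words = ["தமிழ்", "மொழி", "நூல்"]
bloom = SimpleBloomFilter(items_count=len(words))
for w in words:
    bloom.add(w)

print("தமிழ்" in bloom)  # True: a word that was added is always found
```

The items_count argument plays the same role as the items_count setting mentioned for bloomservice.py: the filter is sized for the expected number of words, so a count that is too low raises the false-positive rate.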

For Tamil, get the latest dump from http://dumps.wikimedia.org/tawiki/latest/ . For Telugu, the corresponding URL is http://dumps.wikimedia.org/tewiki/latest/
The file to download is tawiki-latest-pages-articles.xml.bz2
Extract the file with the following command: bunzip2 tawiki-latest-pages-articles.xml.bz2
Clone Attardi's WikiExtractor tool: https://github.com/attardi/wikiextractor
An example of how to run WikiExtractor: python3 WikiExtractor.py -o /Users/malaikannan/Documents/Work/opensource/TamilData /Users/malaikannan/Documents/Work/opensource/tawiki-latest-pages-articles.xml
Clone the IndicNLPParser repo
To run the parser: python3 wikiparser.py --wiki_dump_path "/home/ANANT/msankarasubbu/Documents/Work/opensource/Data" --csv_file_path "/home/ANANT/msankarasubbu/Documents/Work/opensource/Data/tamil_words.csv" --bloomfilter_file_path "/home/ANANT/msankarasubbu/Documents/Work/opensource/Data/tamil_words_filter.txt" --lower_unicode_value 2944 --upper_unicode_value 3071
lower_unicode_value and upper_unicode_value are decimal values: the first and last code points of the language's Unicode block.
For Tamil, lower_unicode_value = 2944 and upper_unicode_value = 3071
For Telugu, lower_unicode_value = 3072 and upper_unicode_value = 3199
For Malayalam, lower_unicode_value = 3328 and upper_unicode_value = 3455
The program writes its output to a CSV file and a Bloom filter file.
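The role of the lower_unicode_value and upper_unicode_value arguments can be pictured with a short sketch. This is not taken from wikiparser.py: the function name and the rule that a "word" is a maximal run of characters inside the range are assumptions for illustration. The decimal bounds are simply the script's Unicode block (for example, 2944 to 3071 is U+0B80 to U+0BFF for Tamil). By the same pattern, the Bengali block U+0980 to U+09FF would be 2432 to 2559, though the source does not say which values were used for the Bengali run.

```python
from collections import Counter


def count_words_in_range(text, lower_unicode_value, upper_unicode_value):
    """Count maximal runs of characters whose code points fall in the range.

    Hypothetical helper: treating each in-range run as one word is an
    assumption, not necessarily what wikiparser.py does.
    """
    counts = Counter()
    current = []
    for ch in text:
        if lower_unicode_value <= ord(ch) <= upper_unicode_value:
            current.append(ch)
        elif current:
            # An out-of-range character ends the current word.
            counts["".join(current)] += 1
            current = []
    if current:
        counts["".join(current)] += 1
    return counts


sample = "தமிழ் is a language. தமிழ் மொழி and తెలుగు (Telugu)."
counts = count_words_in_range(sample, 2944, 3071)  # Tamil block
print(counts.most_common())
```

With the Tamil bounds, the English and Telugu words in the sample are dropped and only the Tamil words are counted. One limitation of such a simple range check is that joiner characters outside the block (zero-width joiner and non-joiner) would split a word in two.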
Tamil Bloom filter file: https://www.dropbox.com/s/3bibyzccjkdkh86/tamil_words_filter.txt?dl=0
Telugu Bloom filter file: https://www.dropbox.com/s/qdc0a7ueqowyw2z/telugu_words_filter.txt?dl=0
Malayalam Bloom filter file: https://www.dropbox.com/s/aqienzy351i1420/malayalam_words_filter.txt?dl=0
Bengali Bloom filter file: https://www.dropbox.com/s/okskp4skl2tbsqn/bengali_words_filter.txt?dl=0
Update bloomservice.py with the correct file path and items_count for each language. The values currently in the file are from the runs that I did.
Run gunicorn --bind 0.0.0.0:5000 bloomservice:app to serve the REST service.
Tamil can be accessed at http://localhost:5000/indicnlp/tamil/v1.0/ followed by the actual Tamil word.
Telugu can be accessed at http://localhost:5000/indicnlp/telugu/v1.0/ followed by the actual Telugu word.
Malayalam can be accessed at http://localhost:5000/indicnlp/malayalam/v1.0/ followed by the actual Malayalam word.
Bengali can be accessed at http://localhost:5000/indicnlp/bengali/v1.0/ followed by the actual Bengali word.
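Because the words being looked up are not ASCII, it can help to build the lookup URL in code. The sketch below assumes the word is appended as the last path segment of the URLs listed above; the helper name is hypothetical, and the response format of the service is not described here, so the example only constructs the URL.

```python
from urllib.parse import quote

# Host and port taken from the gunicorn command above.
BASE_URL = "http://localhost:5000/indicnlp"


def lookup_url(language, word):
    """Build a lookup URL, assuming the word is the final path segment."""
    # quote() percent-encodes the UTF-8 bytes of the word.
    return "{}/{}/v1.0/{}".format(BASE_URL, language, quote(word, safe=""))


url = lookup_url("tamil", "தமிழ்")
print(url)
```

Most HTTP clients and browsers perform this encoding automatically, so pasting the word directly into the address bar usually works as well.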

Current URL

Previous Work

T Shrinivasan has done some earlier work in this area: https://github.com/tshrinivasan/tamil-wikipedia-word-list

Muthunedumaran of Murasu Anjal fame has a bash script that does it in one line:

bzcat archive.bz2 | grep -v '<[a-z]\s' | grep -v '&[a-z0-9];' | tr '[:punct:][:blank:][:digit:]' '\n' | tr 'A-Z' 'a-z' | tr 'ÆØÅŜĴĤĜŬ' 'æøåŝĵĥĝŭ' | uniq | sort -f | uniq -c | sort -nr | head -50000 | tail -n +2 | awk '{print "<w f=\""$1"\">"$2"</w>"}' > dict.xml

Or, for output without accented characters:

bzcat archive.bz2 | grep -v '<[a-z]\s' | grep -v '&[a-z0-9];' | tr '[:punct:][:blank:][:digit:]' '\n' | tr 'A-Z' 'a-z' | uniq | grep -o '^[a-z]*$' | sort -f | uniq -c | sort -nr | head -50000 | awk '{print "<w f=\""$1"\">"$2"</w>"}' > en.xml

https://code.google.com/archive/p/softkeyboard/wikis/BinaryDictionaries.wiki?fbclid=IwAR0efckl3qpqeRBuAK9pc7MGZx1ZcFjgcdHa_FIRStLf46fEiAYo3pl8kjg

malaikannan/IndicNLPUniqueWords
