Extract word embeddings from PDF papers (in Bahasa) using BERT, check similarity with FAISS, and schedule updates of the language model and the FAISS index

ranggarppb/bert-bahasa-plagiarism-checker

HUBLA Paper (Indonesian Lang.) Plagiarism Detector using BERT

The structure of this project is as follows:

  • app.py : main application
  • config.ini : configuration of the project (model, etc.)
  • global_var.py : initialization of the global variables used throughout the project
  • text_preprocessing.py : converts PDF files into trainable text
  • model (folder) : the model used for feature extraction
  • sample_database_embeddings : sample database of features extracted from various PDFs
  • vector_comparison.py : compares features extracted from the input file against the features stored in the database
  • utils.py : helper functions
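
To illustrate the comparison step that vector_comparison.py is described as performing, here is a minimal sketch of a nearest-neighbour lookup over stored embeddings. It is not the project's code: it uses brute-force cosine similarity in plain Python as a stand-in for a FAISS index search, and the function names, document IDs, and 3-dimensional toy vectors are all hypothetical (real BERT embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, database, top_k=1):
    """Return the top_k (doc_id, score) pairs, highest similarity first.

    Hypothetical helper: a brute-force scan standing in for a FAISS search.
    """
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "database" of document embeddings (made-up values for illustration only).
database = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.7, 0.0],
}

# Embedding of the input document (also made up).
query = [0.9, 0.1, 0.0]

results = most_similar(query, database, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A brute-force scan like this is only practical for small collections; an approximate or indexed search such as FAISS is the usual choice once the database grows.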
