PDF to NER Update

The files recorded in the table below correspond to key stages of the process described subsequently.

| Stage | File | Notes |
| --- | --- | --- |
| 1 | SAED100.pdf.zip | Original PDF files |
| 2 | SAED100.txt.zip | Text files after conversion from PDF |
| 4 | SAED100.out.zip | Annotated files with named entities and all sentences |
| 13 | baseline_non_entities.csv | List of non-entities |
| 14 | SAED100.conll.uncorrected | Single annotated file in CoNLL format with only the sentences needing correction |
| 14 | SAED100.conll | The same, with the actual corrections for retraining |

The steps below are needed to update Named Entity Recognition (NER) after either new data from an updated collection of PDFs (i.e., data changes) or updated procedures for extracting named entities (i.e., code changes):

  1. Collect the PDFs and place them in ../corpora/SAED100/pdf, for example.
  2. Convert them to text with ScienceParseApp from the pdf2txt project. The most recent update was performed using code from commit 70e559. The arguments used were org.clulab.pdf2txt.apps.ScienceParseApp -in ../corpora/SAED100/pdf -out ../corpora/SAED100/txt -case false. With -case false, case is not corrected here; that happens in the next step.
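     A typical invocation might look like the sketch below; the class name and arguments come from this step, but exactly how the command is launched (here, sbt from the pdf2txt project directory) is an assumption:

     ```
     # Run from the pdf2txt project directory.
     sbt "runMain org.clulab.pdf2txt.apps.ScienceParseApp -in ../corpora/SAED100/pdf -out ../corpora/SAED100/txt -case false"
     ```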
  3. On the text files, run ConditionalCaseMultipleFileApp from this habitus project, possibly from commit 5b843b*. (For example, start sbt and use the runMain command with the name of the main class followed by any other arguments.) The full command is org.clulab.habitus.apps.ConditionalCaseMultipleFileApp ../corpora/SAED100/txt. This converts each text file into two files, one with a name ending in .txt.preserved and another ending in .txt.restored. We're interested in the latter. So far these are all still text files, but they have been tokenized. *This commit used a processors 8.5.2-SNAPSHOT that was locally built (sbt publishLocal) from c546da9. To get the same results in the next step, one would need to use the same version of processors.
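     Following the same pattern, this step's invocation might be:

     ```
     # Run from the habitus project directory.
     sbt "runMain org.clulab.habitus.apps.ConditionalCaseMultipleFileApp ../corpora/SAED100/txt"
     ```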
  4. From this same project, run LexiconNerMultipleFile: org.clulab.habitus.entitynetworks.LexiconNerMultipleFile ../corpora/SAED100/txt .txt.restored. This annotates all the sentences from the case-corrected text files and outputs them to tab-separated files, with the extension .txt.restored.out, containing the words and entities (in BIO notation). Sentences are separated by blank lines.
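     For orientation, a .txt.restored.out file should look roughly like the sketch below. The words and labels here are invented; only the two-column, tab-separated BIO layout with blank lines between sentences is taken from the description above:

     ```
     The	O
     farmers	O
     met	O
     FAO	B-ORG
     staff	O
     .	O

     ...
     ```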
  5. The next steps use code from the Habitus-SRE project, which can be downloaded with git clone https://github.com/picsolab/Habitus-SRE. Move or copy the folder ../corpora/SAED100/txt to ./report_v5, because the programs expect files to be in a subdirectory of the project and some names in the scripts are hard-coded. The project includes some files that will be regenerated by these instructions, so it is best to either move or remove them first so that you know they have been reproduced by the end of the procedure (see the sketch after this list):
    • rm data/*_report_v5.csv
    • rm -r metrics_graph_json_report_v5
    • rm error_analysis/*.csv
    • rm coranking/*.csv
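     A sketch of this setup, assuming the corpora directory sits next to the clone:

     ```
     git clone https://github.com/picsolab/Habitus-SRE
     cd Habitus-SRE
     # Copy the annotated text files into the subdirectory the scripts expect.
     cp -r ../corpora/SAED100/txt ./report_v5
     # Remove the files that this procedure regenerates.
     rm data/*_report_v5.csv
     rm -r metrics_graph_json_report_v5
     rm error_analysis/*.csv
     rm coranking/*.csv
     ```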
  6. You will probably need to install an executable program called mallet, plus a few Python packages with pip and a few R libraries with RStudio, as they are found to be missing; a consolidated sketch follows this list. Note that Python may be installed as python3 on your system.
    • Install mallet.
      • Download and unzip. These instructions assume that the executables are available at ./mallet-2.0.8/bin.
      • You may need to set the environment variable MALLET_HOME.
    • Use pip install for these Python libraries:
      • networkx
      • nltk
      • pyvis
      • spacy
      • gensim (use pip install gensim==3.8.3)
      • matplotlib
      • pandas
      • tqdm
      • sklearn
    • From within Python, download additional parts of nltk.
      • import nltk; nltk.download('popular'); exit()
    • From the command line, download additional parts of spacy.
      • python -m spacy download en_core_web_lg
    • RStudio should offer to install packages automatically when you open these files manually:
      • animate_importance.Rmd
      • R/compute_metrics.R
      • R/get_wrong_pred.R
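     A consolidated sketch of the Python-side setup (the package names are those listed above; substitute python3/pip3 if that is how Python is installed on your system):

     ```
     pip install networkx nltk pyvis spacy matplotlib pandas tqdm sklearn
     # Note: newer pip releases reject the deprecated sklearn alias; scikit-learn is the current name.
     pip install gensim==3.8.3
     # Download the additional parts of nltk and spacy.
     python -c "import nltk; nltk.download('popular')"
     python -m spacy download en_core_web_lg
     ```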
  7. Run get_sent_ner.py: python get_sent_ner.py --folder report_v5. It converts the individual .txt.restored.out files into a single file, data/sent_ner_report_v5.csv, with these changes: a header line is added; the tabs are converted to commas appropriate for a csv file; sentences are no longer separated by blank lines; and the specific named entity labels PER, ORG, and ACRONYM are replaced with a generic ANIMATE.
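     Purely as an illustration, the relabeling collapses the animate entity types into one (whether the B-/I- prefixes are preserved is determined by the script):

     ```
     PER     -> ANIMATE
     ORG     -> ANIMATE
     ACRONYM -> ANIMATE
     ```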
  8. The next step involves get_graph.py. Run python get_graph.py --file data/sent_ner_report_v5.csv. This will produce files data/entityID_text_report_v5.csv and data/sentID_text_report_v5.csv as well as directories graph_json_report_v5 and graph_viz_report_v5 containing 12 files each.
  9. topic_model.py trains a model with the command python topic_model.py --file data/sentID_text_report_v5.csv --mallet ./mallet-2.0.8/bin/mallet. This step may not work under Windows. The file data/topic_model_report_v5.csv is produced, as well as the image elbow_chart.png.
  10. Next run hetero-network-embedding.py with the command python hetero-network-embedding.py --folder report_v5 to produce coranking/weight_Wvs_report_v5.csv and coranking/Wvs_report_v5.csv.
  11. Run the program get_coranking_graph.py with the command python get_coranking_graph.py --folder report_v5 to produce four reports in the directory graph_json_report_v5 with the .json extension and four more in the directory graph_viz_report_v5 with the .html extension: lda_ordered_top10_all, lda_ordered_top1_all, lda_top10_all, and lda_top1_all.
  12. Metrics are calculated with get_metrics.py. The command is python get_metrics.py --folder graph_json_report_v5. However, first increase the value of max_iter in one line of code to 5000, as in eig_cen = pd.DataFrame.from_dict(nx.eigenvector_centrality(G, max_iter=5000).items()), so that the eigenvector centrality computation converges. The command creates a new directory, metrics_graph_json_report_v5, containing a large number of files (96).
  13. In RStudio, open animate_importance.Rmd and run all cells. An error analysis should be produced in the files error_analysis/baseline_non_entities.csv, error_analysis/baseline_pred_all.csv, error_analysis/coranking_non_entities.csv, and error_analysis/coranking_pred_all.csv. For our purposes, the first is the pertinent one.
  14. Finally, switch back to this habitus project and run ExportNamedEntities2App on the file error_analysis/baseline_non_entities.csv and the directory of .txt.restored.out files to produce SAED100.conll for training. Depending on your directory structure, the command arguments to the app may be ../corpora/SAED100/baseline_non_entities.csv ../corpora/SAED100/txt ../corpora/SAED100/SAED100.conll. The files SAED100.conll and SAED100.conll.uncorrected should be produced.
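     Following the sbt pattern from step 3, the invocation might look like the sketch below. The fully qualified class name is an assumption (the other apps live under org.clulab.habitus); check the actual package in the project:

     ```
     # Run from the habitus project directory; adjust paths to your layout.
     sbt "runMain org.clulab.habitus.apps.ExportNamedEntities2App ../corpora/SAED100/baseline_non_entities.csv ../corpora/SAED100/txt ../corpora/SAED100/SAED100.conll"
     ```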