PDF to NER Update
The files recorded in the table below correspond to key stages of the process described subsequently.
Stage | File | Notes |
---|---|---|
1 | SAED100.pdf.zip | Original PDF files |
2 | SAED100.txt.zip | Text files after conversion from PDF |
4 | SAED100.out.zip | Annotated files with named entities and all sentences |
13 | baseline_non_entities.csv | List of non-entities |
14 | SAED100.conll.uncorrected | Single annotated file in CONLL format with only sentences needing correction |
14 | SAED100.conll | The same with the actual corrections for retraining |
The steps below are needed to update Named Entity Recognition (NER) because of either new data from an updated collection of PDFs (i.e., data changes) or updated procedures for extracting named entities (i.e., code changes):
1. Collect the PDFs and place them in `../corpora/SAED100/pdf`, for example.
2. Convert them to text with ScienceParseApp from the pdf2txt project. The most recent update was performed using code from commit 70e559. The arguments used were `org.clulab.pdf2txt.apps.ScienceParseApp -in ../corpora/SAED100/pdf -out ../corpora/SAED100/txt -case false`. Case is not corrected here. A full invocation is sketched below.
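   The full command might look like this (a sketch, assuming the app is run with `sbt` from a pdf2txt checkout; the paths are the examples from above):

   ```
   sbt "runMain org.clulab.pdf2txt.apps.ScienceParseApp -in ../corpora/SAED100/pdf -out ../corpora/SAED100/txt -case false"
   ```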
3. On the text files, run ConditionalCaseMultipleFileApp from this habitus project, possibly from commit 5b843b* (e.g., start `sbt` and use the `runMain` command with the name of the main class and any other arguments following): `org.clulab.habitus.apps.ConditionalCaseMultipleFileApp ../corpora/SAED100/txt`. This converts each text file to two files, one with a name ending in `.txt.preserved` and another in `.txt.restored`. We're interested in the latter. So far these are all still text files, but they have been tokenized. A sample invocation follows the footnote below.

   *This commit made use of a processors 8.5.2-SNAPSHOT, which was locally built (`sbt publishLocal`) from c546da9. To get the same results as in the next step, one would need to use the same version of processors.
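   Concretely, the `runMain` invocation for step 3 might look like this (a sketch, run from a habitus checkout):

   ```
   sbt "runMain org.clulab.habitus.apps.ConditionalCaseMultipleFileApp ../corpora/SAED100/txt"
   ```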
4. From this same project, run LexiconNerMultipleFile: `org.clulab.habitus.entitynetworks.LexiconNerMultipleFile ../corpora/SAED100/txt .txt.restored`. This annotates all the sentences from the case-corrected text files and outputs them in tab-separated files containing the words and entities (in BIO notation). Sentences are separated by blank lines in files with the extension `.txt.restored.out`. A made-up excerpt is shown below.
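   For orientation only, a hypothetical fragment of a `.txt.restored.out` file; the words and labels here are invented, and the actual label inventory depends on the lexicons:

   ```
   Mr.	B-PER
   Diallo	I-PER
   works	O
   with	O
   FAO	B-ORG
   .	O
   ```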
5. The next steps use code from the Habitus-SRE project, which can be downloaded with `git clone https://github.com/picsolab/Habitus-SRE`. Move or copy the folder `../corpora/SAED100/txt` to `./report_v5`, because the programs expect files to be in a subdirectory of the project and some names in scripts are hard-coded. The project includes some files that will be regenerated with these instructions, so it is best to either move or remove them so that you know they have been reproduced by the end of the procedure:

   ```
   rm data/*_report_v5.csv
   rm -r metrics_graph_json_report_v5
   rm error_analysis/*.csv
   rm coranking/*.csv
   ```
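   The setup might look like this (a sketch; the relative path assumes the same directory layout as in the earlier steps):

   ```
   git clone https://github.com/picsolab/Habitus-SRE
   cd Habitus-SRE
   cp -r ../corpora/SAED100/txt ./report_v5
   ```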
6. You will probably need to install an executable program called `mallet`, plus a few Python packages with `pip`, as well as R libraries with `RStudio`, when they are found to be missing. Note that Python may be installed as `python3` on your system. A consolidated sketch of the installs follows this list.
   - Install mallet.
     - Download and unzip it. These instructions assume that the executables are available at `./mallet-2.0.8/bin`.
     - You may need to set the environment variable `MALLET_HOME`.
   - Use `pip install` for these Python libraries:
     - networkx
     - nltk
     - pyvis
     - spacy
     - gensim (use `pip install gensim==3.8.3`)
     - matplotlib
     - pandas
     - tqdm
     - sklearn
   - From within Python, download additional parts of nltk: `import nltk; nltk.download('popular')`; `exit()`.
   - From the command line, download additional parts of spacy: `python -m spacy download en_core_web_lg`.
   - RStudio should offer to install packages automatically when you open these files manually:
     - animate_importance.Rmd
     - R/compute_metrics.R
     - R/get_wrong_pred.R
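   A consolidated sketch of the installs above (substitute `pip3`/`python3` if that is how Python is installed on your system):

   ```
   pip install networkx nltk pyvis spacy gensim==3.8.3 matplotlib pandas tqdm sklearn   # 'sklearn' may need to be 'scikit-learn' on newer pip
   python -c "import nltk; nltk.download('popular')"
   python -m spacy download en_core_web_lg
   export MALLET_HOME="$PWD/mallet-2.0.8"   # only if needed; the path assumes the unzip location above
   ./mallet-2.0.8/bin/mallet                # with no arguments, prints usage, which serves as a smoke test
   ```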
7. Run get_sent_ner.py: `python get_sent_ner.py --folder report_v5`. It converts the individual `.txt.restored.out` files to a single file, `data/sent_ner_report_v5.csv`, with these changes: a header line is added; the tabs are converted to commas appropriate for a csv file; sentences are not separated by blank lines; the specific named entity labels PER, ORG, and ACRONYM are replaced with a generic ANIMATE. A hypothetical fragment is shown below.
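   For orientation only, a hypothetical fragment of `data/sent_ner_report_v5.csv`; both the header names and the rows here are invented, and the script determines the real ones:

   ```
   word,entity
   Mr.,B-ANIMATE
   Diallo,I-ANIMATE
   works,O
   ```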
8. The next step involves get_graph.py. Run `python get_graph.py --file data/sent_ner_report_v5.csv`. This will produce files `data/entityID_text_report_v5.csv` and `data/sentID_text_report_v5.csv` as well as directories `graph_json_report_v5` and `graph_viz_report_v5` containing 12 files each. A quick check is sketched below.
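   A quick way to confirm the outputs appeared (the expected counts come from the step above):

   ```
   ls data/entityID_text_report_v5.csv data/sentID_text_report_v5.csv
   ls graph_json_report_v5 | wc -l   # expect 12
   ls graph_viz_report_v5 | wc -l    # expect 12
   ```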
9. topic_model.py trains a model with the command `python topic_model.py --file data/sentID_text_report_v5.csv --mallet ./mallet-2.0.8/bin/mallet`. This step may not work under Windows. The file `data/topic_model_report_v5.csv` is produced as well as the image `elbow_chart.png`.
10. Next run hetero-network-embedding.py with the command `python hetero-network-embedding.py --folder report_v5` to produce `coranking/weight_Wvs_report_v5.csv` and `coranking/Wvs_report_v5.csv`.
11. Run the program get_coranking_graph.py with the command `python get_coranking_graph.py --folder report_v5` to produce four reports in the directory `graph_json_report_v5` with the `.json` extension and four more in the directory `graph_viz_report_v5` with the `.html` extension: `lda_ordered_top10_all`, `lda_ordered_top1_all`, `lda_top10_all`, and `lda_top1_all`.
12. Metrics are calculated with get_metrics.py. The command is `python get_metrics.py --folder graph_json_report_v5`. However, first increase the value for `max_iter` in one line of code to 5000, like `eig_cen = pd.DataFrame.from_dict(nx.eigenvector_centrality(G, max_iter=5000).items())`. The command creates a new directory, `metrics_graph_json_report_v5`, containing a large number of files (96). One way to locate the line to edit is sketched below.
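    The source file containing that line is not named above; assuming it is get_metrics.py itself, one way to find it:

    ```
    grep -n "eigenvector_centrality" get_metrics.py
    ```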
13. In `RStudio`, open animate_importance.Rmd and run all cells. An error analysis should be produced in the files `error_analysis/baseline_non_entities.csv`, `error_analysis/baseline_pred_all.csv`, `error_analysis/coranking_non_entities.csv`, and `error_analysis/coranking_pred_all.csv`. For us, the first is the pertinent one.
14. Finally, switch back to this habitus project and run ExportNamedEntities2App on the file `error_analysis/baseline_non_entities.csv` and the directory of `.txt.restored.out` files to produce `SAED100.conll` for training. Depending on your directory structure, the command arguments to the App may be `../corpora/SAED100/baseline_non_entities.csv ../corpora/SAED100/txt ../corpora/SAED100/SAED100.conll`. The files `SAED100.conll` and `SAED100.conll.uncorrected` should be produced. A made-up excerpt of the CONLL layout appears below.
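    For orientation only, a hypothetical fragment in two-column CONLL layout (token and BIO tag, one token per line, with a blank line between sentences); the tokens, tags, and exact columns here are invented, and ExportNamedEntities2App determines the real ones:

    ```
    Mr.	B-PER
    Diallo	I-PER
    farms	O
    .	O
    ```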