Summary:
-
How did they do it?
- 13,739 papers
- Used PDFBox to extract information
-
What they did - novelty?
- Semi-manual annotation of references
- Provide citation contexts: the sentences around the point where a paper is cited
- Manual gender annotation
- Related phrases for every author, computed from the text of the papers they have authored (basic tf-idf)
- Many graph/network statistics, such as out-degree analysis
-
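The "related phrases per author" idea can be sketched roughly as follows. This is only a minimal illustration of basic tf-idf over single words, not the authors' actual pipeline: the function name `top_terms_per_author`, the whitespace tokenization, and the toy author texts are all assumptions made for the example.

```python
import math
from collections import Counter


def top_terms_per_author(author_texts, k=3):
    """Rank each author's terms by a basic tf-idf score.

    author_texts: dict mapping an author id to the concatenated text of
    the papers they authored (hypothetical toy input).
    """
    tokenized = {a: t.lower().split() for a, t in author_texts.items()}
    n_authors = len(tokenized)

    # Document frequency: in how many authors' texts does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    result = {}
    for author, tokens in tokenized.items():
        if not tokens:
            result[author] = []
            continue
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_authors / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[author] = ranked[:k]
    return result


toy = {
    "author_a": "parsing dependency parsing treebank parsing",
    "author_b": "translation alignment translation treebank",
}
related = top_terms_per_author(toy, k=1)
```

Terms shared by every author (here "treebank") get an idf of zero, so they drop out of the ranking, which is the point of using tf-idf rather than raw counts.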
How did they do it?
- 10,921 PDFs - 2007 version
- ???? PDFs - 2016 version
- PDFBox
-
What they did - novelty?
- inter and intra reference linking
- proposed a task for benchmarking the above
- Corpling@GU (Georgetown University) has the ACL Anthology from 1985-2022, but it is behind a firewall
Other work using these datasets / similar work:
-
Citation Analysis, Centrality, and the ACL Anthology. Detailed citation network analysis. They even list which paper has the most citations inside the network, which would be good to see. The work also calculates the impact factor of the ACL Anthology, which is interesting.
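As a rough sketch of the kind of network statistics mentioned here (out-degree, and the most-cited paper inside the network), assuming the citation network is available as a list of (citing, cited) pairs. The paper ids and the `degree_stats` helper are made up for illustration and do not come from any of the datasets.

```python
from collections import Counter


def degree_stats(citations):
    """Compute out-degree and in-degree from (citing, cited) pairs."""
    # Out-degree: how many in-network papers each paper cites.
    out_degree = Counter(citing for citing, _ in citations)
    # In-degree: how many in-network papers cite each paper.
    in_degree = Counter(cited for _, cited in citations)
    return out_degree, in_degree


# Hypothetical toy edge list: P3 cites P1 and P2, and so on.
edges = [("P3", "P1"), ("P3", "P2"), ("P2", "P1"), ("P4", "P1")]
out_degree, in_degree = degree_stats(edges)

# The paper with the most citations inside the network.
most_cited, n_citations = in_degree.most_common(1)[0]
```

Only citations between papers inside the collection are counted, so these numbers will be lower than global citation counts.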
-
Purpose and Polarity of Citation: Towards NLP-based Bibliometrics. This might be of interest to folks working on citation context classification.
-
CORD-19: The COVID-19 Open Research Dataset. This paper can be a template for the work we are doing. It is very similar to what we are doing, except that our work is in the ACL domain. We can take inspiration from it and do things tailored for the linguistics community. The tasks mentioned in Section 4, "Research directions", are especially interesting for us.
Most tasks built around AAN are summarization tasks that use the citation contexts provided by the dataset.
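To make "citation context" concrete, here is a minimal sketch that pulls out the sentences around each mention of a cited paper. It assumes a naive regex sentence splitter and a literal citation marker string; the `citation_contexts` function and the sample text are hypothetical, and the dataset's own extraction is likely more careful.

```python
import re


def citation_contexts(text, marker, window=1):
    """Return the sentence containing `marker` plus `window` sentences on each side."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    contexts = []
    for i, sentence in enumerate(sentences):
        if marker in sentence:
            lo = max(0, i - window)
            hi = min(len(sentences), i + window + 1)
            contexts.append(" ".join(sentences[lo:hi]))
    return contexts


text = (
    "Parsing is hard. We follow Smith (2005) for tokenization. "
    "Results improve. A final remark."
)
contexts = citation_contexts(text, "Smith (2005)")
```

Concatenating the contexts collected for one paper across all citing papers gives the raw input that citation-based summarization tasks typically start from.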