FAQ2016
Hi,
In the /data folder are the ten training sets (or training "topics").
Within each /annotation folder for each topic, there are two versions of the annotation files available.
There is a new folder in every topic, called Reference_XML. The sentence IDs in the *annv3.txt files reference the XML version of the reference document, available in the Reference_XML folder of every topic.
This is the legend to understand each row of the file:
a. Reference article means the "topic" article which all the other articles in this folder are citing.
b. Citing article refers to the citing paper by its EXACT filename.
c. The citances are in no particular order.
1. FOR THE 2016 TASK, DO NOT USE the *ann.txt files. These use character offsets to identify citing and reference sentences; those Citation/Reference offsets are meaningless and should be ignored. Evaluation should be done through text matching.
Citance Number: 1 | Reference Article: X96-1048.txt | Citing Article: A97-1028.txt | Citation Marker Offset: 1751-1759 | Citation Marker: Sundheim ... 1995b | Citation Offset: 1596-1767 | Citation Text: Named Entity evaluation began as a part of recent Message Understanding Conferences (MUC), whose objective was to standardize the evaluation of IE tasks (Sundheim, 1995b). | Reference Offset: ['8715-8768'] | Reference Text: In addition, there are plans to put evaluations on line, with public access, starting with the NE evaluation; this is intended to make the NE task familiar to new sites and to give them a convenient and low-pressure way to try their hand at following a standardized test procedure | Discourse Facet: Implication_Citation | Annotator: Muthu Kumar Chandrasekaran, NUS |
2. FOR THE 2016 TASK, PLEASE USE THE *annv3.txt files. These give the Reference offsets and Citation offsets as SENTENCE IDs of the reference document (a parsing sketch for these rows follows the example below).
Citance Number: 1 | Reference Article: X96-1048.txt | Citing Article: A97-1028.txt | Citation Marker Offset: ['10'] | Citation Marker: Sundheim, 1995b | Citation Offset: ['10'] | Citation Text: Named Entity evaluation began as a part of recent Message Understanding Conferences (MUC), whose objective was to standardize the evaluation of IE tasks (Sundheim, 1995b). | Reference Offset: ['357'] | Reference Text: In addition, there are plans to put evaluations on line, with public access, starting with the NE evaluation; this is intended to make the NE task familiar to new sites and to give them a convenient and low-pressure way to try their hand at following a standardized test procedure | Discourse Facet: Implication_Citation | Annotator: Muthu Kumar Chandrasekaran, NUS |
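For readers who want to load these rows programmatically, the following is a minimal Python sketch, not an official reader: it splits one *annv3.txt row into a field-to-value dictionary. The field names come from the examples above; the delimiter handling (splitting on " | " and on the first ":") is an assumption.

    import ast

    # Minimal sketch: split one pipe-separated annotation row into named fields.
    # Assumes fields are separated by " | " and each field is "Name: value".
    def parse_annotation_row(row):
        fields = {}
        for chunk in row.strip().strip("|").split(" | "):
            if ":" in chunk:
                name, value = chunk.split(":", 1)
                fields[name.strip()] = value.strip()
        return fields

    row = parse_annotation_row(
        "Citance Number: 1 | Reference Article: X96-1048.txt | "
        "Citing Article: A97-1028.txt | Citation Offset: ['10'] | "
        "Reference Offset: ['357'] | Discourse Facet: Implication_Citation |"
    )
    # The offsets in *annv3.txt look like Python list literals, so they can be
    # converted into real lists of sentence IDs with ast.literal_eval.
    print(ast.literal_eval(row["Reference Offset"]))  # ['357']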
Below, we have addressed some FAQs. Please feel free to contact us with follow-up questions.
Thank you for all your feedback, which has helped us to improve our corpus, and we sincerely apologize for the trouble you've faced. We hope you can submit some results to us in time for presentation at the TAC workshop (Deadline: 26th October 2014), or at least in time for the final paper submission (Deadline: mid-January, 2015).
Regards
Kokil Jaidka
Coordinator, SciSumm
____________________________________________________________________
FAQ
Q. I noticed "..." in the annotations, what does that mean?
"..." followed the BioMedSumm standard practice of indicating discontiguous texts.
In Citation Text and Reference Text fields, the "..." means that there is a gap between two text spans (citation spans or reference spans). They may be on different pages, so the gap might be a text. There might be a formula or a figure there, or some text encoding which is not a part of the annotation.
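If you need the individual spans, the following tiny sketch splits a text field on the gap marker; it assumes "..." is used only as the gap marker and not as ordinary punctuation inside a span.

    # Split a Citation/Reference Text field on the "..." gap marker to recover
    # the individual discontiguous spans.
    def split_spans(text_field):
        return [span.strip() for span in text_field.split("...") if span.strip()]

    print(split_spans("first span of the citance ... second span of the citance"))
    # ['first span of the citance', 'second span of the citance']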
Q. I notice there are 10 training topics (folders), where is the evaluation set?
The development and test sets will be released in the following months. Meanwhile, you can pilot your system on this training set.
Q. There are errors in parsing the file, such as misspelt words, spaces within words, sentences in the wrong place, and so on.
Unfortunately, the errors you see are OCR parsing errors in the "txt" version of the documents, available in the Documents_TXT folder, and in the XML version in the Reference_XML folder. This is not within our control. We recommend that your string matching be lenient enough to cope with such problems.
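One possible way to keep the matching lenient, shown here only as an illustration, is to score candidate sentences by character-level similarity rather than requiring an exact match; the 0.8 threshold below is an arbitrary assumption to tune on the training topics.

    import difflib

    # Rough sketch of lenient matching: return the sentence ID whose text is
    # most similar to the annotated span, so small OCR errors (misspelt words,
    # stray spaces) do not break an exact string match.
    def best_matching_sentence(span, sentences, threshold=0.8):
        best_sid, best_ratio = None, 0.0
        for sid, text in sentences:
            ratio = difflib.SequenceMatcher(None, span.lower(), text.lower()).ratio()
            if ratio > best_ratio:
                best_sid, best_ratio = sid, ratio
        return best_sid if best_ratio >= threshold else None

    # "sentences" is a list of (sentence ID, sentence text) pairs taken from a
    # Reference_XML file (see the sketch in the next answer).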
Q. The citation/reference offset numbers do not match the actual values in the txt files.
Please refer to the sentence IDs in the XML files, which precede every sentence. We will use those IDs to evaluate how accurate the citation and reference matching is.
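To make the sentence-ID lookup concrete, here is a small sketch that collects (sentence ID, text) pairs from a Reference_XML file. It assumes each sentence is stored as an <S> element with a sid attribute; check the markup in your copy of the corpus and adjust the tag, attribute, and file names if they differ.

    import xml.etree.ElementTree as ET

    # Minimal sketch: collect (sentence ID, sentence text) pairs from a
    # Reference_XML file, assuming sentences are <S sid="..."> elements.
    def load_reference_sentences(xml_path):
        root = ET.parse(xml_path).getroot()
        return [(s.get("sid"), (s.text or "").strip()) for s in root.iter("S")]

    # Illustrative path; point this at an actual Reference_XML file in a topic folder.
    for sid, text in load_reference_sentences("Reference_XML/X96-1048.xml")[:3]:
        print(sid, text)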