- SubTask-1 [Subtask1_Bert.ipynb, Subtask1_BiLSTM.ipynb & preProcess.ipynb(preprocessing data)]
- SubTask-2 [T2_BERT.ipynb & T2_NER_bilstm_CRf.ipynb] are in "code" folder.
EVALUATION DATA Available in "SemEval_eval_input" folder for all 4 SubTasks
This folder contains the datasets that constitute MalwareTextDB-V2.0. The dataset is used in SemEval-2018 Task 8: Semantic Extraction from CybersecUrity REports using Natural Language Processing (SecureNLP)
It contains 5 subfolders:
- train: contains the training materials released to the participants
- dev: contains the development data used in the Practice phase
- test_1: contains the gold data for SubTask 1 and 2
- test_2: contains the gold data for SubTask 3
- test_3: contains the gold data for SubTask 4
The following are the explanation of content inside each of the subfolder
plaintext/
(train only)
- contains 65 plaintext files after the APT PDF reports are processed with PDFMiner
tokenized/
-
Contains the tokenized reports with the annotated labels in IOB format
-
3 different types of labels are used: Entity, Action, Modifier
-
If a token is not fallen under any label type, it is labeled as O(means the outside of the labels)
-
If a token is the first word of a label, then it is labeled as B-<Label_Type> (means the beginning of a label)
-
If a token is the subsequent word of the text span of a label, then it is labeled as I-<Label_Type> (means the inside of a label)
-
Example, to O direct B-Action site B-Entity visitors I-Entity
-
For more details about the IOB format, please refer http://www.nltk.org/book/ch07.html section 2.6
annotations/
-
contains the plaintext files with XML tags denoting nonsentence sections such as headings and covers
-
contains the annotations files (.ann) for each plaintext file; the positions of the annotations are based on character counts
-
In .ann files, 3 different annotation ID types are used : T(text-bound annotation), R(relation), A(attribute)
-
Different attribute labels are ActionName, Capability, StrategicObjectives, TacticalObjectives
-
Example,
Sentence : The dynamic analysis showed the malware sample contacted the C&C server, but wasn't sending any URL parameters (id1, id2).
Annotations : T34 Action 47 56 contacted T28 Subject 28 46 the malware sample R23 SubjAction Subject:T28 Action:T34 T30 Object 57 71 the C&C server R24 ActionObj Action:T34 Object:T30
- more information regarding the annotation files format can be seen in http://brat.nlplab.org/standoff.html
additional_plaintext/
(train only)
- contains additional 84 plaintext files after the APT PDF reports are processed with PDFMiner
- If the participants need to generate special features such brown clusters, these textfiles can be used
- Tokenized versions of this set of files are not provided in the tokenized folder