UMUC: UP Multilayer UNSC Corpus

The UP Multilayer UNSC Corpus (UMUC) is a corpus for the analysis of diplomatic speeches given in the UN Security Council (UNSC). It contains a small subset of speeches selected from the original UN Security Council Debates Corpus, which covers over 25 years of digitized meeting notes.

We preprocessed the speeches by deleting unnecessary line breaks and removing text that is not part of the speech, and segmented the texts into either Elementary Discourse Units (EDUs) or sentences. In addition to the raw texts, we present annotations for different phenomena: verbal Conflicts, discourse structures using Rhetorical Structure Theory, and automatic average Sentiments using a dictionary-based approach.

The section "Other Projects" links to other projects (Argumentation Mining, NER, Knowledge Graph) that use other parts of the UNSC Debates Corpus and were developed at, or in cooperation with, our AngCL group at Potsdam University.

For more information about this work, please see our papers.

Karolina Zaczynska, Peter Bourgonje, and Manfred Stede. How Diplomats Dispute: The UN Security Council Conflict Corpus. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Turin, Italy, 2024. GitHub, Bibtex, PDF
Karolina Zaczynska and Manfred Stede. Rhetorical Strategies in the UN Security Council: Rhetorical Structure Theory and Conflicts. In Proceedings of the SIGDIAL 2024 conference. Kyoto, 2024. GitHub, Bibtex, PDF

Corpus Structure and Speeches Selection for UMUC

The dataset contains 87 speeches taken from the UN Security Debates Corpus and preprocessed. We organize our corpus into three sub-corpora, stored as two csv files and a set of rs3 files.

We selected two topics with different expected potential for conflicts. The first agenda is the Ukraine conflict in 2014 after the annexation of Crimea (and before the Minsk II agreement). The second agenda is the Women, Peace and Security (WPS) agenda. For both topics we selected debates so as to further maximize the probability of finding expressions of conflict in the speeches. We focused on speeches from permanent members of the UNSC, and for some debates additionally included speeches from countries that contributed more than once to the debate.

Raw Data

Raw

Directories containing selected raw speeches (one .txt file per speech) from the original UN Security Debates Corpus.

Preproc_Text

Directories containing one .txt file per speech preprocessed with 02_preprocess.py.

EDUs and Sentences

Directories containing one .txt file per speech with preprocessed, newline-separated EDUs / sentences.

main_edus.csv

CSV-Table with one EDU per row.

filename: filename in /EDUs directory with countryname
fileid: Basename of file (without .txt) as given in the original UN Security Debates Corpus.
char_start_offset_edu: Character offset start of EDU
char_end_offset_edu: Character offset end of EDU
speech_sentence_id: Counter ID for sentence inside the speech
paragraph_id: Counter ID for paragraph inside the speech
speech_edu_id: Counter ID for EDU inside the speech
text_edu: EDU string from speech
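
The offset columns can be used to check an EDU against the speech text it was taken from. The following is a minimal sketch using only the Python standard library. It assumes that offsets are 0-based with an exclusive end and that they refer to the preprocessed speech text; neither is stated here, so verify this against the data. The sample rows, the file names and the speech text are invented for illustration, and `check_offsets` is a hypothetical helper.

```python
import csv
import io

# Invented sample standing in for main_edus.csv (same column names as listed above).
SAMPLE_CSV = """filename,fileid,char_start_offset_edu,char_end_offset_edu,speech_sentence_id,paragraph_id,speech_edu_id,text_edu
Example_spch001.txt,spch001,0,23,0,0,0,We thank the presidency
Example_spch001.txt,spch001,24,51,0,0,1,for convening this meeting.
"""

# Invented speech text that the sample offsets point into.
SPEECH = "We thank the presidency for convening this meeting."


def check_offsets(rows, speech_text):
    """Return the rows whose text_edu differs from the slice given by their offsets."""
    mismatches = []
    for row in rows:
        start = int(row["char_start_offset_edu"])
        end = int(row["char_end_offset_edu"])  # assumed to be exclusive
        if speech_text[start:end] != row["text_edu"]:
            mismatches.append(row)
    return mismatches


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
mismatches = check_offsets(rows, SPEECH)
print(len(rows), "rows checked,", len(mismatches), "mismatches")
```

With the real table, replace `io.StringIO(SAMPLE_CSV)` by an opened main_edus.csv and read the matching speech file for each `filename`. If many rows mismatch, the offset convention assumed here is probably not the one used in the corpus.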

main_sents.csv

CSV-Table with one sentence per row.

filename and fileid same as in main_edus.csv
char_start_offset: Character offset start of sentence
char_end_offset: Character offset end of sentence
text: Sentence string from speech
tokenized: Tokenized sentence from speech

main_para.csv

CSV-Table with one paragraph per row.

filename: Filename in /Preprocess folder with countryname
fileid: Basename of file (without .txt) as given in the original UN Security Debates Corpus
char_start_offset: Character Offset Start of Paragraph
char_end_offset: Character Offset End of Paragraph
paragraph_id: Counter ID for paragraph inside the speech
text: Paragraph string from speech

Annotated Data

UNSCon: UNSC Conflicts Corpus: Conflicts

A dataset of 87 speeches given in the UNSC with annotations for (verbal) conflicts, specifically tailored to diplomatic language. We define a conflict as an expression of critique or distancing from the positions or actions of another country present at the Council during the debate. There are four main types of conflicts annotated:

Direct Negative Evaluation: Describes Conflicts where the speaker addresses the critique directly to another country.
Indirect Negative Evaluation: Describes Conflicts where an intermediate entity serving as a proxy is criticized instead of the other country directly.
Challenge: Challenging statements accuse another country of not telling the truth.
Correction: Corrections rectify the allegedly false statement.

For more information on the annotation guidelines, see our paper. This repository includes a corrected version of the original UNSC Conflicts corpus UNSCon (GitHub).

main_conflicts_not_preprocessed.csv

Table containing the speeches with Conflict annotations and metadata, with separate label columns, derived from the original annotation output. A more concise representation of the dataset is in main_conflicts.csv.

Metadata:

filename, fileid, char_start_offset_edu, char_end_offset_edu, speech_edu_id, text_edu same as in main_edus.csv

Conflict Annotations:

A0_Negative_Evaluation: Conflict labels Indirect_Negeval or Direct_NegEval
A2_Target_Council: Council Target Types (Speaker or Speech, Country, Group of Countries, UNSC, Self-targeting, Underspecified)
A3_Target_Intermediate: Intermediate Target Types (Policy or Law, Person, UN-Organization, NGO, Other)
A4_Country_Name: Name of Target Country
B1_ChallengeType: Conflict labels Challenge or Correction
B2_Target_Challenge: Council Target Types
B3_Country_Name: Name of Target Country

Taken from main_edus.csv:

char_start_offset_edu_original and char_end_offset_edu_original: Character offset taken from main_edus.csv
speech_sentence_id and paragraph_id: taken from main_edus.csv

main_conflicts.csv

Table with speeches, Conflict labels and metadata, with summarized and renamed columns for labels. The Table is the output from 04_conflicts_table_preprocessing.py taking main_conflicts_not_preprocessed.csv as input.

Metadata:

filename, fileid, char_start_offset_edu, char_end_offset_edu, speech_edu_id, text_edu same as in main_edus.csv

Conflict Annotations:

Conflict_Type: Conflict labels: Indirect_Negeval, Direct_NegEval, Challenge or Correction
Conflict_Target: Council Target Types (Speaker or Speech, Country, Group of Countries, UNSC, Self-targeting, Underspecified)
Target_Country_Name: Name of Target Country
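
The relation between the two conflict tables can be illustrated with a short sketch. This is not the code of 04_conflicts_table_preprocessing.py. It assumes that at most one of A0_Negative_Evaluation and B1_ChallengeType is filled per row, that empty cells are empty strings, and that the target and country columns pair up in the same way. A3_Target_Intermediate is left out because it has no counterpart in the column list above. The function name `condense_row` and the sample values are hypothetical.

```python
def condense_row(row):
    """Collapse the A*/B* label columns of one row into the three summary columns."""

    def first_filled(*names):
        # Return the first non-empty value among the given columns, else "".
        for name in names:
            value = (row.get(name) or "").strip()
            if value:
                return value
        return ""

    return {
        "Conflict_Type": first_filled("A0_Negative_Evaluation", "B1_ChallengeType"),
        "Conflict_Target": first_filled("A2_Target_Council", "B2_Target_Challenge"),
        "Target_Country_Name": first_filled("A4_Country_Name", "B3_Country_Name"),
    }


# Invented example rows: one negative evaluation, one challenge.
negeval = condense_row({
    "A0_Negative_Evaluation": "Direct_NegEval",
    "A2_Target_Council": "Country",
    "A4_Country_Name": "Country X",
    "B1_ChallengeType": "",
})
challenge = condense_row({
    "A0_Negative_Evaluation": "",
    "B1_ChallengeType": "Challenge",
    "B2_Target_Challenge": "Speaker or Speech",
    "B3_Country_Name": "Country Y",
})
print(negeval)
print(challenge)
```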

main_conflicts_sents.csv

EDU-based Conflict annotations mapped to sentences. For overlapping labels, we simply used the first label for the sentence.
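
The first-label rule can be sketched as follows. The sketch assumes that each EDU row carries both `Conflict_Type` and `speech_sentence_id` (for example after joining with main_edus.csv), that rows are in document order, and that "first label" means the first non-empty label among the EDUs of a sentence. These are assumptions for illustration, and `labels_per_sentence` is a hypothetical helper, not part of the repository.

```python
def labels_per_sentence(edu_rows):
    """Map (fileid, speech_sentence_id) to the first non-empty Conflict_Type of its EDUs."""
    result = {}
    for row in edu_rows:  # rows are assumed to be in document order
        key = (row["fileid"], row["speech_sentence_id"])
        label = (row.get("Conflict_Type") or "").strip()
        if key not in result:
            result[key] = label
        elif not result[key] and label:
            # The sentence had no label so far; keep the first one that appears.
            result[key] = label
    return result


# Invented EDU rows: sentence 0 has two labelled EDUs, sentence 1 has none.
edu_rows = [
    {"fileid": "f1", "speech_sentence_id": "0", "Conflict_Type": ""},
    {"fileid": "f1", "speech_sentence_id": "0", "Conflict_Type": "Challenge"},
    {"fileid": "f1", "speech_sentence_id": "0", "Conflict_Type": "Correction"},
    {"fileid": "f1", "speech_sentence_id": "1", "Conflict_Type": ""},
]
sentence_labels = labels_per_sentence(edu_rows)
print(sentence_labels)
```

Note that this rule discards any later label in the same sentence, so sentence-level counts can be lower than EDU-level counts.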

UNSC-RST: Rhetorical Structures

The corpus contains 87 speeches given in the UNSC analyzed from the perspective of Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) to study rhetorical style in diplomatic speech. RST aims to capture the structure of a text by combining its elementary discourse units (EDUs) into one single, hierarchical tree structure.

For more information on the annotation guidelines, see our paper. This repository includes a corrected version of the original RST corpus (GitHub), as well as versions mapped to other RST relation sets, namely to the RST-DT and GUM relation classes.

RST_original

Folder with one rs3 file per speech, using annotation labels as described in our paper.

RST_RSTDT-relations and RST_GUM-relations

Folder with rs3 file per speech, automatically mapped to RST-DT classes and RST-GUM relations using mapping_relations.py.
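
To illustrate what such a mapping involves, here is a minimal sketch that renames relations in an rs3 document with the standard library XML parser. It is not the code of mapping_relations.py. The sample document is invented, the single mapping entry is only an illustrative placeholder, and the function name `map_relations` is hypothetical. The sketch assumes the usual rs3 layout with `rel` declarations in the header and `relname` attributes on `segment` and `group` elements.

```python
import xml.etree.ElementTree as ET

# Illustrative placeholder; the actual mapping tables live in mapping_relations.py.
EXAMPLE_MAPPING = {"circumstance": "background"}

# Invented two-EDU rs3 document.
SAMPLE_RS3 = """<rst>
  <header>
    <relations>
      <rel name="circumstance" type="rst"/>
      <rel name="background" type="rst"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="2" relname="circumstance">When the Council met,</segment>
    <segment id="2" parent="3" relname="span">the delegation spoke.</segment>
    <group id="3" type="span"/>
  </body>
</rst>"""


def map_relations(rs3_text, mapping):
    """Return the rs3 document with relation names replaced according to mapping."""
    root = ET.fromstring(rs3_text)
    # Rename the declarations in the header and drop duplicates created by merging.
    relations = root.find("header/relations")
    if relations is not None:
        seen = set()
        for rel in list(relations):
            name = mapping.get(rel.get("name"), rel.get("name"))
            key = (name, rel.get("type"))
            if key in seen:
                relations.remove(rel)
            else:
                seen.add(key)
                rel.set("name", name)
    # Rename the relation labels on segments and groups.
    for node in root.iter():
        if node.tag in ("segment", "group"):
            relname = node.get("relname")
            if relname in mapping:
                node.set("relname", mapping[relname])
    return ET.tostring(root, encoding="unicode")


mapped = map_relations(SAMPLE_RS3, EXAMPLE_MAPPING)
print(mapped)
```

Structural labels such as `span` are left untouched because they are not in the mapping.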

Merged UNSC-RST and UNSCon table: conflicts_rst_aligned.csv

Table with aligned Conflicts and simplified RST label annotations.

Metadata:

filename, fileid, char_start_offset_edu, char_end_offset_edu, speech_edu_id, text_edu same as in main_edus.csv

Conflict Annotations:

Conflict_Type, Conflict_Target, Target_Country_Name same as in main_conflicts.csv

RST:

rstree_nodeid_chain: A list of node IDs extracted from rs3 files. The node IDs lists are organized starting from the leaf nodes and going up to the root node.
rstree_relation_leave: The leaf node relation for the EDU. For example, if the relation is 'Circumstance', the EDU is annotated as describing the circumstances related to the content of the EDU to which the relation points.
rstree_relation_chain: A list of relations extracted from rs3 files. The relations lists are organized starting from the leaf nodes and going up to the root node.
rstree_edges: Number of edges starting from the leaf node, going up to the root node.
sat_value_rstree: Number of satellites starting from the leaf node, going up to the root node.
rstree_nodeid_chain_subtree and rstree_relation_chain_subtree and rstree_edges_subtree and sat_value_subtree: same as the columns before, but only for the paragraph subtree.
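
The leaf-to-root columns can be pictured with a small traversal over an rs3 file. The sketch below is an illustration under assumptions, not the code that produced the table: it follows the `parent` attribute from a segment up to the root, collects node IDs and `relname` values on the way, counts edges as the number of parent links, and counts as satellites those nodes whose relation is declared with `type="rst"` in the header. How the corpus scripts treat `span` and multinuclear relations may differ. The sample document and the function name `leaf_to_root_chain` are invented.

```python
import xml.etree.ElementTree as ET

# Invented two-EDU rs3 document.
SAMPLE_RS3 = """<rst>
  <header>
    <relations>
      <rel name="circumstance" type="rst"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="2" relname="circumstance">When the Council met,</segment>
    <segment id="2" parent="3" relname="span">the delegation spoke.</segment>
    <group id="3" type="span"/>
  </body>
</rst>"""


def leaf_to_root_chain(rs3_text, leaf_id):
    """Walk from a segment up to the root and summarize the path."""
    root = ET.fromstring(rs3_text)
    # Relation name -> declared type ("rst" for nucleus-satellite, "multinuc" otherwise).
    rel_types = {rel.get("name"): rel.get("type") for rel in root.iter("rel")}
    nodes = {n.get("id"): n for n in root.iter() if n.tag in ("segment", "group")}

    node_ids, relations = [], []
    visited = set()  # guards against cycles in malformed files
    current = leaf_id
    while current in nodes and current not in visited:
        visited.add(current)
        node = nodes[current]
        node_ids.append(current)
        if node.get("relname") is not None:
            relations.append(node.get("relname"))
        current = node.get("parent")

    satellites = sum(1 for rel in relations if rel_types.get(rel) == "rst")
    return {
        "node_ids": node_ids,
        "relations": relations,
        "edges": len(node_ids) - 1,
        "satellites": satellites,
    }


chain = leaf_to_root_chain(SAMPLE_RS3, "1")
print(chain)
```

For a paragraph-level variant, the same walk could stop at the node that spans the paragraph instead of continuing to the root.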

Other:

tokenized_edus: EDUs tokenized using SpaCy.
len_tokens_edus: Number of tokens in EDU.
paragraph_id_consecutive: Since the original paragraph IDs also counted double newlines, which were then later removed as they did not contain any text, the original IDs have gaps. This column provides consecutive IDs without gaps.
paragraph_id_consecutive_per_file: Same as the column before, but per file.
sentence_id_consecutive: Consecutive IDs without gaps, starting at the first EDU and ending at the last EDU of the corpus.
sentence_id_consecutive_per_file: Same as the column before, but per file.
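
Closing the gaps amounts to renumbering IDs in order of first appearance. A minimal sketch, assuming that numbering starts at 0 and that rows are in document order; both are assumptions, and `make_consecutive` is a hypothetical helper. The ID values are invented.

```python
def make_consecutive(ids):
    """Renumber IDs as 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    renumbered = []
    for original in ids:
        if original not in mapping:
            mapping[original] = len(mapping)
        renumbered.append(mapping[original])
    return renumbered


# Invented paragraph IDs with gaps, one entry per EDU.
original_ids = [0, 0, 2, 2, 2, 5]
consecutive_ids = make_consecutive(original_ids)
print(consecutive_ids)
```

For the per-file variant, the same function can be applied separately to the rows of each file, so that numbering restarts with every speech. For corpus-wide numbering, the key passed in would need to combine the file and the original ID, since original IDs repeat across files.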

Code

Prerequisites

To reproduce the corpus preprocessing steps, install the requirements by typing in your terminal: pip install -r requirements.txt

For SpaCy, download the language model: python -m spacy download en_core_web_lg

03_corpus_structure.py

Script that takes the raw texts and outputs the files main_sents.csv and main_para.csv. main_edus.csv was created using output from the Inception annotation tool.

rst

Scripts to map the RST relations used in our project to RST-DT classes and GUM relations.

04_conflicts_table_preprocessing.py

Script to preprocess the conflict table, condensing it by summarizing Conflict label columns. Takes main_conflicts_not_preprocessed.csv as input and gives main_conflicts.csv as output.

Other Projects on the UNSC Debates corpus

UNSC-NE

UNSC-NE is a Named Entity (NE) add-on to the UNSC Debates corpus using DBpedia-spotlight. The code and dataset are described in the article:

Luis Glaser, Ronny Patz, and Manfred Stede (2022). UNSC-NE: A Named Entity Extension to the UN Security Council Debates Corpus. In: Journal for Language Technology and Computational Linguistics 35.2, pp. 51–67.

UNSC-Graph

With the UNSC-Graph we presented an extensible knowledge graph for the UNSC corpus. It was created with SWI-Prolog and currently consists of the sets of facts described in:

Stian Rødven-Eide et al. (Sept. 2023). The UNSC-Graph: An Extensible Knowledge Graph for the UNSC Corpus. In: Proceedings of the 3rd Workshop on Computational Linguistics for the Political and Social Sciences.

The code and dataset are available here.

The graph combines previously disconnected data sources, including the UNSC Repertoire, the UN Library, Wikidata, and metadata extracted from the speeches themselves, such as topics and participants. The graph also includes country mentions in a speech, geographical neighbours of the countries mentioned, and sentiment scores. By linking the graph to Wikidata, we include additional geopolitical information and extract various country name aliases to extend the coverage of country mentions beyond existing NER-based approaches.

Political Argument Mining

This project focuses on argumentation mining, framed through the tasks of argument detection (predict whether the utterance is an argument or not) and argument component identification (predict whether the argumentative utterance is a claim or a premise). As part of the project, a novel corpus of argument annotations was created based on diplomatic speeches given during meetings of the UNSC. The corpus, named UC(Ukraine Conflict)-UNSC, contains 144 speeches from 2014 to 2018 dedicated to the conflict in Ukraine. The speeches were annotated analogously to USElecDeb, and the labels are claim, premise, or neither.

The dataset and code are available here.
