discourse-lab/UMUC

UMUC: UP Multilayer UNSC Corpus

The UP Multilayer UNSC Corpus (UMUC) is a corpus for the analysis of diplomatic speeches given in the UN Security Council (UNSC). It contains a small subset of speeches selected from the original UN Security Council Debates Corpus, which covers over 25 years of digitized meeting notes.

We preprocessed the speeches by deleting unnecessary line breaks and removing text that is not part of the speech, and segmented the texts into either Elementary Discourse Units (EDUs) or sentences. In addition to the raw texts, we provide annotations for different phenomena: verbal Conflicts, discourse structures based on Rhetorical Structure Theory, and automatically computed average Sentiments using a dictionary-based approach.

The section "Other Projects" links to other projects (Argumentation Mining, NER, Knowledge Graph) that use other parts of the UNSC Debates Corpus and were developed at, or in cooperation with, our AngCL group at Potsdam University.

For more information about this work, please see our papers.

Corpus Structure and Speeches Selection for UMUC

The dataset contains 87 speeches taken and preprocessed from the UN Security Debates Corpus. The corpus is organized into three sub-corpora, stored in two csv files and a set of rs3 files.

We selected two topics with different expected potential for conflicts. The first agenda is the Ukraine conflict in 2014 after the annexation of Crimea (and before the Minsk II agreement). The second agenda is the Women, Peace and Security (WPS) agenda. For both topics we selected debates that further maximize the probability of finding expressions of conflict in the speeches. We focused on speeches from permanent members of the UNSC, and for some debates additionally included speeches from countries with more than one contribution to the debate.

Raw Data

Directories containing selected raw speeches (one .txt file per speech) from the original UN Security Debates Corpus.

Directories containing one .txt file per speech preprocessed with 02_preprocess.py.

Directories containing one .txt file per speech with preprocessed, newline-separated EDUs / sentences.

main_edus.csv: CSV table with one EDU per row.

  • filename: filename in /EDUs directory with countryname
  • fileid: Basename of file (without .txt) as given in the original UN Security Debates Corpus.
  • char_start_offset_edu: Character offset start of EDU
  • char_end_offset_edu: Character offset end of EDU
  • speech_sentence_id: Counter ID for sentence inside the speech
  • paragraph_id: Counter ID for paragraph inside the speech
  • speech_edu_id: Counter ID for EDU inside the speech
  • text_edu: EDU string from speech
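
As a minimal sketch of how the offset columns can be used, the following Python snippet reads rows in the column layout listed above and checks that slicing the speech text with the offsets reproduces text_edu. The sample speech and rows are invented for illustration, and the snippet assumes 0-based offsets with an exclusive end; this convention is an assumption and should be checked against the actual files.

```python
import csv
import io

# Invented sample speech; real speeches are in the preprocessed .txt files.
speech = "We regret the decision. It undermines trust."

# Invented rows using a subset of the columns described above.
sample_csv = """filename,char_start_offset_edu,char_end_offset_edu,speech_edu_id,text_edu
example.txt,0,23,1,We regret the decision.
example.txt,24,44,2,It undermines trust.
"""


def check_offsets(text, rows):
    """Return the IDs of EDUs whose offsets do not reproduce text_edu."""
    mismatches = []
    for row in rows:
        start = int(row["char_start_offset_edu"])
        end = int(row["char_end_offset_edu"])
        # Assumption: 0-based start offset, exclusive end offset.
        if text[start:end] != row["text_edu"]:
            mismatches.append(row["speech_edu_id"])
    return mismatches


rows = list(csv.DictReader(io.StringIO(sample_csv)))
mismatched_ids = check_offsets(speech, rows)
print(mismatched_ids)
```

If the list is not empty for real data, the offset convention assumed here (for example, the exclusive end) probably differs from the one used in the corpus.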

main_sents.csv: CSV table with one sentence per row.

  • filename and fileid same as in main_edus.csv
  • char_start_offset: Character offset start of sentence
  • char_end_offset: Character offset end of sentence
  • text: Sentence string from speech
  • tokenized: Tokenized sentence from speech

main_para.csv: CSV table with one paragraph per row.

  • filename: Filename in /Preprocess folder with countryname
  • fileid: Basename of file (without .txt) as given in the original UN Security Debates Corpus
  • char_start_offset: Character Offset Start of Paragraph
  • char_end_offset: Character Offset End of Paragraph
  • paragraph_id: Counter ID for paragraph inside the speech
  • text: Paragraph string from speech

Annotated Data

UNSCon: UNSC Conflicts Corpus: Conflicts

A dataset of 87 speeches given in the UNSC with annotations for (verbal) conflicts, specifically tailored to diplomatic language. We define a conflict as an expression of critique or distancing from the positions or actions of another country present at the Council during the debate. There are four main types of conflicts annotated:

  • Direct Negative Evaluation: Describes Conflicts where the speaker directs the critique at another country directly.
  • Indirect Negative Evaluation: Describes Conflicts where an intermediate entity serving as a proxy is criticized instead of the other country directly.
  • Challenge: Challenging statements accuse another country of not telling the truth.
  • Correction: Corrections rectify the allegedly false statement.

For more information on the annotation guidelines, see our paper. This repository includes a corrected version of the original UNSC Conflicts corpus UNSCon (GitHub).

main_conflicts_not_preprocessed.csv: Table containing the speeches with Conflict annotations and metadata, with separate label columns, derived from the original annotation output. A more concise representation of the dataset is in main_conflicts.csv.

Metadata:

  • filename, fileid, char_start_offset_edu, char_end_offset_edu, speech_edu_id, text_edu same as in main_edus.csv

Conflict Annotations:

  • A0_Negative_Evaluation: Conflict labels Indirect_Negeval or Direct_NegEval
  • A2_Target_Council: Council Target Types (Speaker or Speech, Country, Group of Countries, UNSC, Self-targeting, Underspecified)
  • A3_Target_Intermediate: Intermediate Target Types (Policy or Law, Person, UN-Organization, NGO, Other)
  • A4_Country_Name: Name of Target Country
  • B1_ChallengeType: Conflict labels Challenge or Correction
  • B2_Target_Challenge: Council Target Types
  • B3_Country_Name: Name of Target Country

Taken from main_edus.csv:

  • char_start_offset_edu_original and char_end_offset_edu_original: Character offset taken from main_edus.csv
  • speech_sentence_id and paragraph_id: taken from main_edus.csv

main_conflicts.csv: Table with speeches, Conflict labels, and metadata, with summarized and renamed label columns. The table is the output of 04_conflicts_table_preprocessing.py, which takes main_conflicts_not_preprocessed.csv as input.

Metadata:

  • filename, fileid, char_start_offset_edu, char_end_offset_edu, speech_edu_id, text_edu same as in main_edus.csv

Conflict Annotations:

  • Conflict_Type: Conflict labels: Indirect_Negeval, Direct_NegEval, Challenge or Correction
  • Conflict_Target: Council Target Types (Speaker or Speech, Country, Group of Countries, UNSC, Self-targeting, Underspecified)
  • Target_Country_Name: Name of Target Country
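
The condensing step can be pictured with the following Python sketch. It is not the code of 04_conflicts_table_preprocessing.py; it only illustrates one plausible way to collapse the A*/B* columns into the three summary columns, assuming that for each EDU at most one of the two label groups is filled. The example rows and the country name are invented.

```python
def condense_row(row):
    """Collapse the A*/B* label columns of one row into three summary columns."""

    def first_filled(*keys):
        # Return the first non-empty value among the given columns, else "".
        for key in keys:
            value = (row.get(key) or "").strip()
            if value:
                return value
        return ""

    return {
        "Conflict_Type": first_filled("A0_Negative_Evaluation", "B1_ChallengeType"),
        "Conflict_Target": first_filled("A2_Target_Council", "B2_Target_Challenge"),
        "Target_Country_Name": first_filled("A4_Country_Name", "B3_Country_Name"),
    }


# Invented example rows (hypothetical values, not taken from the corpus).
negeval_row = {
    "A0_Negative_Evaluation": "Direct_NegEval",
    "A2_Target_Council": "Country",
    "A4_Country_Name": "ExampleCountry",
    "B1_ChallengeType": "",
    "B2_Target_Challenge": "",
    "B3_Country_Name": "",
}
challenge_row = {
    "A0_Negative_Evaluation": "",
    "A2_Target_Council": "",
    "A4_Country_Name": "",
    "B1_ChallengeType": "Challenge",
    "B2_Target_Challenge": "Speaker or Speech",
    "B3_Country_Name": "ExampleCountry",
}

print(condense_row(negeval_row))
print(condense_row(challenge_row))
```

In this sketch the A* columns take precedence if both groups are filled; whether the actual script resolves such cases the same way is not stated here.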

EDU-based Conflict annotations mapped to sentences. For overlapping labels, we simply used the first label for the sentence.
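
A minimal Python sketch of this EDU-to-sentence mapping follows. It assumes that rows are given in document order, that unlabeled EDUs have an empty Conflict_Type, and that "first label" means the first non-empty label among the EDUs of a sentence; the function name and the sample rows are invented for illustration.

```python
def sentence_labels(edu_rows):
    """Map (fileid, speech_sentence_id) to the first non-empty EDU label."""
    labels = {}
    for row in edu_rows:  # assumption: rows are in document order
        key = (row["fileid"], row["speech_sentence_id"])
        label = row.get("Conflict_Type", "")
        if key not in labels:
            labels[key] = label
        elif not labels[key] and label:
            # The sentence had no label so far; keep the first one found.
            labels[key] = label
    return labels


# Invented sample: sentence 1 has three EDUs, two of them labeled.
edus = [
    {"fileid": "speech_a", "speech_sentence_id": 1, "Conflict_Type": ""},
    {"fileid": "speech_a", "speech_sentence_id": 1, "Conflict_Type": "Direct_NegEval"},
    {"fileid": "speech_a", "speech_sentence_id": 1, "Conflict_Type": "Challenge"},
    {"fileid": "speech_a", "speech_sentence_id": 2, "Conflict_Type": ""},
]

labels = sentence_labels(edus)
print(labels)
```

Here sentence 1 receives Direct_NegEval because it precedes Challenge, and sentence 2 stays unlabeled.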

UNSC-RST: Rhetorical Structures

The corpus contains 87 speeches given in the UNSC analyzed from the perspective of Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) to study rhetorical style in diplomatic speech. RST aims to capture the structure of a text by combining its elementary discourse units (EDUs) into one single, hierarchical tree structure.

For more information on the annotation guidelines, see our paper. This repository includes a corrected version of the original RST corpus (GitHub), as well as versions mapped to other RST relation sets, namely to the RST-DT and GUM relation classes.

Folder with rs3 file per speech, using annotation labels as described in our paper.

Folder with rs3 file per speech, automatically mapped to RST-DT classes and RST-GUM relations using mapping_relations.py.

Merged UNSC-RST and UNSCon table: conflicts_rst_aligned.csv

Table with aligned Conflicts and simplified RST label annotations.

Metadata:

  • filename, fileid, char_start_offset_edu, char_end_offset_edu, speech_edu_id, text_edu same as in main_edus.csv

Conflict Annotations:

  • Conflict_Type, Conflict_Target, Target_Country_Name same as in main_conflicts.csv

RST:

  • rstree_nodeid_chain: A list of node IDs extracted from rs3 files. The node IDs lists are organized starting from the leaf nodes and going up to the root node.
  • rstree_relation_leave: The leaf node relation for the EDU. For example, if the relation is 'Circumstance', the EDU is annotated as describing the circumstances related to the content of the EDU to which the relation points.
  • rstree_relation_chain: A list of relations extracted from rs3 files. The relations lists are organized starting from the leaf nodes and going up to the root node.
  • rstree_edges: Number of edges starting from the leaf node, going up to the root node.
  • sat_value_rstree: Number of satellites starting from the leaf node, going up to the root node.
  • rstree_nodeid_chain_subtree and rstree_relation_chain_subtree and rstree_edges_subtree and sat_value_subtree: same as the columns before, but only for the paragraph subtree.
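
To illustrate what a leaf-to-root chain looks like, the following Python sketch walks the parent links of a small, invented rs3 document. It is only one plausible reading of the columns above, not the code used to build the table. It assumes the usual rs3 layout (segment and group elements with id, parent and relname attributes, relation types declared in the header) and counts a node as a satellite when its relation is declared with type "rst".

```python
import xml.etree.ElementTree as ET

# Invented rs3 document: EDU 1 is a satellite of EDU 2, and the span
# over EDUs 1-2 (group 4) is a satellite of EDU 3. Group 5 is the root.
RS3 = """<rst>
  <header>
    <relations>
      <rel name="circumstance" type="rst"/>
      <rel name="evidence" type="rst"/>
      <rel name="joint" type="multinuc"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="2" relname="circumstance">When the vote was called,</segment>
    <segment id="2" parent="4" relname="span">the delegation abstained.</segment>
    <segment id="3" parent="5" relname="span">No member objected.</segment>
    <group id="4" type="span" parent="3" relname="evidence"/>
    <group id="5" type="span"/>
  </body>
</rst>"""


def leaf_to_root(rs3_text, leaf_id):
    """Collect node IDs and relations from one leaf up to the root."""
    root = ET.fromstring(rs3_text)
    rel_types = {rel.get("name"): rel.get("type") for rel in root.iter("rel")}
    nodes = {node.get("id"): node for node in root.find("body")}
    node_ids, relations = [], []
    current = leaf_id
    while current is not None:
        node = nodes[current]
        node_ids.append(current)
        parent = node.get("parent")  # None for the root node
        if parent is not None:
            relations.append(node.get("relname"))
        current = parent
    # Assumption: relations declared with type "rst" mark satellites.
    satellites = sum(1 for rel in relations if rel_types.get(rel) == "rst")
    return {
        "nodeid_chain": node_ids,
        "relation_chain": relations,
        "edges": len(relations),
        "satellites": satellites,
    }


chain = leaf_to_root(RS3, "1")
print(chain)
```

For EDU 1 the walk visits nodes 1, 2, 4, 3 and 5, crossing four edges, two of which (circumstance and evidence) are satellite links.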

Other:

  • tokenized_edus: EDUs tokenized using SpaCy.
  • len_tokens_edus: Number of tokens in EDU.
  • paragraph_id_consecutive: Since the original paragraph IDs also counted double newlines, which were then later removed as they did not contain any text, the original IDs have gaps. This column provides consecutive IDs without gaps.
  • paragraph_id_consecutive_per_file: Same as the column before, but per file.
  • sentence_id_consecutive: Consecutive IDs without gaps, starting at the first EDU and ending at the last EDU of the corpus.
  • sentence_id_consecutive_per_file: Same as the column before, but per file.
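
The gap-free renumbering behind the consecutive ID columns can be sketched in a few lines of Python. The function name and the sample IDs are invented, and starting the new IDs at 1 is an assumption; the corpus may use a different start value.

```python
def consecutive_ids(ids):
    """Renumber IDs in order of first appearance as 1, 2, 3, ... without gaps."""
    mapping = {}
    result = []
    for original in ids:
        if original not in mapping:
            # Assumption: new IDs start at 1.
            mapping[original] = len(mapping) + 1
        result.append(mapping[original])
    return result


# Invented paragraph IDs with gaps left by removed empty paragraphs.
paragraph_ids = [1, 1, 3, 3, 3, 6]
renumbered = consecutive_ids(paragraph_ids)
print(renumbered)
```

A per-file variant follows from applying the same function to the rows of each file separately, so that the numbering restarts for every speech.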

Code

Prerequisites

To reproduce the corpus preprocessing steps, install the requirements by typing in your terminal: pip install -r requirements.txt

For SpaCy, download the language model: python -m spacy download en_core_web_lg

Script that takes the raw texts and outputs the files main_sents.csv and main_para.csv. main_edus.csv was created using output from the INCEpTION annotation tool.

Scripts to map the RST relations used in our project to RST-DT classes and GUM relations.

Script to preprocess the conflict table, condensing it by summarizing Conflict label columns. Takes main_conflicts_not_preprocessed.csv as input and gives main_conflicts.csv as output.

Other Projects on the UNSC Debates corpus

UNSC-NE

UNSC-NE is a Named Entity (NE) add-on to the UNSC Debates corpus using DBpedia-spotlight. The code and dataset are described in the article:

Luis Glaser, Ronny Patz, and Manfred Stede (2022). UNSC-NE: A Named Entity Extension to the UN Security Council Debates Corpus. In: Journal for Language Technology and Computational Linguistics 35.2, pp. 51–67.

UNSC-Graph

With the UNSC-Graph we presented an extensible knowledge graph for the UNSC corpus. It was created with SWI-Prolog and currently consists of the sets of facts described in:

Stian Rødven-Eide et al. (Sept. 2023). The UNSC-Graph: An Extensible Knowledge Graph for the UNSC Corpus. In: Proceedings of the 3rd Workshop on Computational Linguistics for the Political and Social Sciences.

The code and dataset are available here.

The graph combines previously disconnected data sources, including the UNSC Repertoire, the UN Library, Wikidata, and metadata extracted from the speeches themselves, such as topics and participants. The graph also includes country mentions in a speech, geographical neighbours of the countries mentioned, and sentiment scores. By linking to Wikidata, the graph includes additional geopolitical information and extracts various country name aliases to extend the coverage of country mentions beyond existing NER-based approaches.

Political Argument Mining

This project focuses on argumentation mining, framed through the tasks of argument detection (predict whether the utterance is an argument or not) and argument component identification (predict whether the argumentative utterance is a claim or a premise). As part of the project, a novel corpus of argument annotations was created based on diplomatic speeches given during meetings of the UNSC. The corpus, named UC(Ukraine Conflict)-UNSC, contains 144 speeches from 2014 to 2018 dedicated to the conflict in Ukraine. The speeches were annotated analogously to USElecDeb, and the labels include claims, premises, or none of these.

The dataset and code are available here.
