Large Ghanaian Dataset

Keith Alcock edited this page Sep 22, 2023 · 2 revisions

Introduction

A corpus of news articles has been collected and processed in a way that may be useful for TPI or PWLWP tasks: both causal assertions and beliefs have been extracted. The resulting dataset (now with causal assertions, beliefs, locations, and dates) can be analyzed further, and this documentation, particularly the descriptions of the columns, is intended to aid in that endeavor. The remaining information should help if the dataset needs to be updated, enhanced, or simply recreated.

This dataset has also been the subject of two PowerPoint presentations: datasetv3.pptx and PWLWP.pptx. All of the files are kept together in a single folder.

Pipeline

Several pieces of software need to work together to get articles from their source, through various analyses, and to the resulting dataset. Here is a brief roadmap of what happens to get that done.

  • Scrape articles - This habitus subproject is used to download news articles. There are six stages, which involve downloading and then scraping HTML files for an initial search, the returned index of articles, and then the articles themselves. The result of this stage is a collection of JSON files in which the URL, title, dateline, byline, and text of each article have been captured.
  • Write causes - Causal relations are identified by Eidos, which is run on all of the scraped JSON files. The process is fairly lengthy, so descriptions of the relations are written out to JSONLD files for use in the next stage, which is fairly flexible and can present the results in many ways when re-run.
  • Read causes - The causes recorded in all the JSONLD files are read back in and reorganized into table form with the hierarchy of documents, sentences, and causal events flattened so that the texts and other properties of causes and effects can be enumerated.
  • Add beliefs and locations - Each sentence in the table is then run through belief and location stages which classify sentences as expressing a belief or not and identify any locations involved. The results are added to the final version of the dataset.
  • Interpret dates - Some dates are extracted from articles in their written-out form (e.g., January 14, 2021). In this stage, which was added after the others, dates are converted to numeric form (e.g., 2021-01-14). See the canonicalDate column below for formatting details.
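
The date interpretation stage can be illustrated with a minimal Python sketch. It assumes a dateline written exactly like the example above (full month name, day, comma, year); the function name `to_canonical_date` is hypothetical, and the actual stage presumably copes with more dateline layouts than this one.

```python
from datetime import datetime

def to_canonical_date(written: str) -> str:
    """Convert a date such as 'January 14, 2021' to '2021-01-14'."""
    # %B matches a full English month name (default locale assumed).
    # Supporting only this one layout is an assumption made for the sketch.
    parsed = datetime.strptime(written.strip(), "%B %d, %Y")
    return parsed.date().isoformat()

print(to_canonical_date("January 14, 2021"))  # 2021-01-14
```

A dateline in any other layout raises ValueError here, which a real stage would need to handle, for example by trying further patterns.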

Sources

News articles were collected from eight sources. Each source first presents an index of search results, with some number of articles listed on each index page. The number of articles per index page differs between sources, as does the maximum number of index pages that will be shown. That maximum may be limited by the number of articles, the quality of the matches, or simply a fixed cap. Up to 100 pages of index results were collected from each source for this dataset. It is not clear how the articles are sorted. It can't be ruled out that date affects the order, so having more recent articles in the dataset does not necessarily mean that interest in a topic is more recent. Here are the sources:

  • 3News
  • Adom Online
  • The Chronicle
  • CITI FM
  • e.tv ghana
  • Ghana News Agency
  • GhanaWeb
  • Happy FM

Search Terms

Only very simple search terms were used. It is not known to what extent search operators are supported, or even what happens if a search contains spaces. GhanaWeb does offer some settings, but they didn't seem to work at the time. The term gold often matched articles about sporting events, but there wasn't an obvious way to prevent that. These search terms were used:

  • galamsey
  • gold
  • mining
  • price

Counts

These counts describe the dataset size in various dimensions:

Description Count
Search matches 36284
Articles downloaded* 34065
Articles processed by Eidos+ 34057
Sentences 636966
Causal sentences 44631
Belief sentences 57579
Sentences both causal and belief 5431

*The number of articles downloaded is smaller mainly because of deduplication, but also because some articles with special characters in their titles could not be retrieved. Some could be downloaded but not scraped because of formatting issues.
+Some articles could not be read because of a bug in the processing of holidays.

The articles identified, before deduplication, are distributed across search terms as follows:

Search Term Count
galamsey 5233
gold 10853
mining 9566
price 10632

The search matches are distributed across sources like this:

Source Count
3News 3907
Adom Online 12473
The Chronicle 1778
CITI FM 3606
e.tv ghana 1467
Ghana News Agency 6488
GhanaWeb 3720
Happy FM 2845
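
As a quick consistency check, both breakdowns should add up to the number of search matches. The sketch below copies the counts from the tables above; the variable names are only illustrative.

```python
# Counts copied from the tables above.
search_matches = 36284

by_term = {"galamsey": 5233, "gold": 10853, "mining": 9566, "price": 10632}

by_source = {
    "3News": 3907,
    "Adom Online": 12473,
    "The Chronicle": 1778,
    "CITI FM": 3606,
    "e.tv ghana": 1467,
    "Ghana News Agency": 6488,
    "GhanaWeb": 3720,
    "Happy FM": 2845,
}

# Both breakdowns partition the same set of search matches.
term_total = sum(by_term.values())
source_total = sum(by_source.values())
print(term_total == search_matches, source_total == search_matches)  # True True
```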

Finally, sentences are distributed across article publication years like this:

Year Count
2012 8
2013 74
2014 2957
2015 3794
2016 6521
2017 79729
2018 39293
2019 34071
2020 53172
2021 76139
2022 172252
2023 181385

Columns

The dataset contains quite a few columns. Several are intended to address the TPI use case in which causes and effects can be increasing or decreasing.

  • url - The URL from which the article was downloaded.
  • terms - The term (or terms separated by a space) which led to the page. Articles are deduplicated per source on matching URLs.
  • date - The dateline from the article, if found, verbatim. A canonicalized version is now available in a later column.
  • sentenceIndex - The index of each sentence per article as tokenized by Eidos.
  • sentence - The text of the sentence. Line feeds and tabs have been replaced with spaces to make the tsv file easier to read.
  • causal - A Boolean to indicate whether the sentence includes a causal relation.
  • causalIndex - A sentence can contain multiple causal relations, so each is numbered and listed separately, with each of the preceding columns duplicated.
  • negationCount - A Boolean to indicate whether the relation is negated. That would mean, for example, that it didn't apply or happen.
  • causeIncCount - The number of phrases indicating that the cause increased or that there is more of it. They include words such as accelerate, boost, promote, and strengthen. They are called increase_triggers.
  • causeDecCount - The number of phrases indicating that the cause decreased or that there is less of it. They include words such as abate, curtail, decrease, and prohibit. They are called decrease_triggers.
  • causePosCount - The number of phrases indicating that the cause is positive. They include words such as aide, ease, relieve, better, and good. They are called positive_affect_triggers.
  • causeNegCount - The number of phrases indicating that the cause is negative. They include words such as challenge, threaten, and worse. They are called negative_affect_triggers.
  • effectIncCount - As causeIncCount, but for the effect.
  • effectDecCount - As causeDecCount, but for the effect.
  • effectPosCount - As causePosCount, but for the effect.
  • effectNegCount - As causeNegCount, but for the effect.
  • causeText - The text of the cause.
  • effectText - The text of the effect.
  • belief - A Boolean indicating whether the sentence (with possible help from the previous sentence if something like "they" or "this" needs to be resolved) contains a belief.
  • sent_locs - The geographic locations mentioned in the sentence, if any. This column and the next contain comma-separated lists in which each entry is a place name followed, in parentheses, by its latitude and longitude: for example, Abronye (7.69381, -1.9091), Efuanta (5.28527, -2.00557).
  • context_locs - Locations mentioned within the previous or next three sentences around the belief.
  • canonicalDate - The article's date from the date column is converted here to one of three formats (so far): YYYY-MM-DDTHH:MM:SS or YYYY-MM-DDTHH:MM if a time was specified at all for the article, or YYYY-MM-DD if there was only a date.
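
Because canonicalDate can take any of three layouts, a reader of the dataset has to try each in turn. The following is a minimal sketch in Python; the function name `parse_canonical_date` is hypothetical, and returning None for an empty or unrecognized value is an assumption about how missing dates might be handled.

```python
from datetime import datetime

# The three layouts described for canonicalDate, most specific first.
CANONICAL_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M", "%Y-%m-%d")

def parse_canonical_date(value: str):
    """Return a datetime for a canonicalDate value, or None if none of the layouts fit."""
    value = value.strip()
    for fmt in CANONICAL_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # Try the next, less specific layout.
    return None

print(parse_canonical_date("2021-01-14T09:30"))
print(parse_canonical_date("2021-01-14"))
```

Date-only values come back with the time set to midnight, so code that needs to tell the two cases apart should check the length of the original string.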

Boolean values are written "True" and "False".
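
When the tsv file is read, these strings have to be converted explicitly, since any non-empty string (including "False") is truthy in Python. The sketch below uses the standard csv module; the helper names are hypothetical, the sample row is invented for illustration, and disabling quote handling is an assumption about how the file was written.

```python
import csv
import io

# Columns documented as Booleans that are converted in this sketch.
BOOLEAN_COLUMNS = ("causal", "belief")

def parse_bool(value):
    """Map 'True'/'False' to bool; anything else (e.g. an empty cell) to None."""
    if value == "True":
        return True
    if value == "False":
        return False
    return None

def read_rows(handle):
    """Yield one dict per tsv row with the Boolean columns converted."""
    # QUOTE_NONE is an assumption: sentences may contain bare double quotes.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for column in BOOLEAN_COLUMNS:
            row[column] = parse_bool(row.get(column))
        yield row

# Invented two-line sample standing in for the real file.
sample = io.StringIO("sentence\tcausal\tbelief\nRain increases yield.\tTrue\tFalse\n")
rows = list(read_rows(sample))
print(rows[0]["causal"], rows[0]["belief"])  # True False
```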

If causal is false, then columns for causalIndex, negationCount, causeIncCount, causeDecCount, causePosCount, causeNegCount, effectIncCount, effectDecCount, effectPosCount, and effectNegCount are empty. If there are no locations for sent_locs or context_locs, those columns are also empty.
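
The location columns cannot simply be split on commas, because each entry also has a comma between its latitude and longitude. A minimal sketch of one way to parse them follows; the function name `parse_locations` is hypothetical, and the pattern assumes that place names contain neither commas nor parentheses.

```python
import re

# One entry looks like: Abronye (7.69381, -1.9091)
LOCATION_PATTERN = re.compile(
    r"([^,()]+?)\s*\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)

def parse_locations(cell: str):
    """Return (name, latitude, longitude) tuples; an empty cell gives []."""
    return [
        (match.group(1).strip(), float(match.group(2)), float(match.group(3)))
        for match in LOCATION_PATTERN.finditer(cell)
    ]

print(parse_locations("Abronye (7.69381, -1.9091), Efuanta (5.28527, -2.00557)"))
# [('Abronye', 7.69381, -1.9091), ('Efuanta', 5.28527, -2.00557)]
print(parse_locations(""))
# []
```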

Materials

Some files are borrowed from other projects:

  • belief model - The model should be downloaded automatically when needed, but it can also be done in advance.
  • locations file - This is small enough to include with the source code.