Extracting Table of Contents (TOCs) for Articles #3

mgns · 2018-01-23T13:10:08Z

Description

Each Wikipedia article is structured by headings and subheadings. These structures indicate which aspects are relevant for the described entity. Extracting this data can help in categorizing entities and the facts about them. For example, cities usually have sections on History, Geography and Demographics, while soccer clubs have sections on Honours, Players and Stadiums. There are pitfalls: these sections are not uniformly captioned, so an alignment (ideally to DBpedia resources) between variations would be helpful. The newly created dataset should follow Linked Data principles, e.g. a sufficiently expressive vocabulary should be used to describe TOCs (ideally as resources), the order of TOC entries, etc.
Optionally, it would be interesting to use the dataset in a meaningful application, e.g. generating missing types.

Goals

Extract TOCs from article pages and produce an RDF dataset describing the article TOCs in a comprehensive way.

Impact

A new dataset which can be used in various ways. Insights into aspects of DBpedia entities.

Warm up tasks

Run Extraction Framework #8

Mentors

Magnus

Keywords

extraction

pratyusha972 · 2018-02-14T18:30:38Z

@mgns, I am interested in working on this project. Could you please guide me on how to get started?

mgns · 2018-02-15T15:27:39Z

I added a warm-up task to this idea. There are two main approaches to solving this task:

write a new extractor that extracts the TOCs from wikitext
write a script which processes the latest NIF dataset

As the first one is the more straightforward solution, you should familiarize yourself with the extraction framework.
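To make the first approach concrete, here is a minimal sketch in Python of pulling headings out of raw wikitext. It is not code from the extraction framework (which is a separate code base), and the names `TocEntry` and `extract_toc` are made up for illustration. It assumes headings are written on their own line as `== Label ==` with matching runs of `=` on both sides, and it ignores templates, comments and `<nowiki>` blocks that a real extractor would have to handle.

```python
import re
from typing import List, NamedTuple


class TocEntry(NamedTuple):
    """One TOC entry (hypothetical structure, for illustration only)."""
    order: int   # 1-based position among all headings in the article
    level: int   # 2 for "== H ==", 3 for "=== H ===", and so on
    label: str   # heading text with surrounding whitespace removed


# A heading line: the same run of 2 to 6 '=' characters on both sides.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def extract_toc(wikitext: str) -> List[TocEntry]:
    """Return all headings of an article in document order."""
    entries: List[TocEntry] = []
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            entries.append(TocEntry(len(entries) + 1, level, match.group(2)))
    return entries


if __name__ == "__main__":
    sample = (
        "Munich is a city.\n"
        "== History ==\n"
        "=== Early years ===\n"
        "== Geography ==\n"
    )
    for entry in extract_toc(sample):
        print(entry.order, entry.level, entry.label)
```

Keeping both `order` and `level` is enough to rebuild the nesting of the TOC later, so the sketch does not build a tree itself.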

When writing your proposal, you will have to describe your suggested solution for the problem.

icemc · 2018-02-24T21:37:06Z

Hello @mgns I'm also interested in working on this. When I am done with the warm up task, how should I let you know about my progress?

mgns · 2018-03-13T10:29:35Z

Simply summarize your findings in a Google Doc and share it with me.

khikmatullaev · 2018-03-14T14:22:07Z

@mgns could you give me your Gmail address? I want to share the results of the warm-up task.
By the way, I could not find out how to join the DBpedia Slack chat. Could you give me instructions?

hrishikeshh · 2018-03-14T14:40:34Z

Hi @khikmatullaev,
To join the DBpedia Slack, go to this link.
Enter your e-mail address and verify it.

mgns · 2018-03-15T16:54:52Z

Just share it to: [email protected] Thanks!


mgns · 2018-03-16T09:30:48Z

Just invite me: [email protected] Thanks!


shubhamvb · 2018-03-24T02:44:21Z

Hi @mgns, I read the project description and I am interested in working on it. I just wanted to clarify a few questions I had.
I am not sure I understand what you mean when you say the sections are not uniformly captioned and hence an alignment is necessary. What does alignment refer to in this case? Could you please elaborate a bit?

Thanks!

mgns · 2018-03-26T09:51:33Z

The project should first extract a TOC for each article in Wikipedia. The TOC should contain all headings and subheadings of the article with their respective labels and order.
To make these TOC entries easier to compare, it would be nice if they were mapped to some common vocabulary. Take for example the entry "Life" in the article on Vincent van Gogh and "Biography" in the article on Aline Charigot. Both entries denote similar or equal concepts. It would be cool to capture this in the dataset, e.g. one could map both to a DBpedia resource such as http://dbpedia.org/resource/Biography. Sometimes a specific resource is used as a heading, e.g. "Munich International Airport" in the article on Munich.
Such a mapping might be partial.
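The mapping idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the `http://example.org/toc#` vocabulary and its properties (`hasEntry`, `order`, `level`, `alignedWith`), the entry IRI scheme, and the tiny hand-written `ALIGNMENT` table are all hypothetical and not part of DBpedia. Only the `http://dbpedia.org/resource/` namespace and the `rdfs:label` and `xsd:integer` IRIs are real. Headings without a known mapping simply get no alignment triple, which reflects the partial mapping described above.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import quote

DBR = "http://dbpedia.org/resource/"
TOC = "http://example.org/toc#"          # hypothetical vocabulary
ENTRY_BASE = "http://example.org/toc/"   # hypothetical IRI scheme for entries
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

# Hand-written toy alignment: normalized heading -> DBpedia resource name.
ALIGNMENT: Dict[str, str] = {"life": "Biography", "biography": "Biography"}


def align(label: str) -> Optional[str]:
    """Return the DBpedia resource IRI for a heading, or None if unmapped."""
    name = ALIGNMENT.get(label.strip().lower())
    return DBR + name if name else None


def literal(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def toc_triples(article: str, headings: List[Tuple[int, str]]) -> List[str]:
    """Turn (level, label) pairs in document order into N-Triples lines."""
    name = quote(article.replace(" ", "_"))
    article_iri = DBR + name
    lines: List[str] = []
    for order, (level, label) in enumerate(headings, start=1):
        entry = f"{ENTRY_BASE}{name}/{order}"
        lines.append(f"<{article_iri}> <{TOC}hasEntry> <{entry}> .")
        lines.append(f'<{entry}> <{TOC}order> "{order}"^^<{XSD_INT}> .')
        lines.append(f'<{entry}> <{TOC}level> "{level}"^^<{XSD_INT}> .')
        lines.append(f'<{entry}> <{RDFS_LABEL}> "{literal(label)}"@en .')
        target = align(label)
        if target is not None:  # the mapping may be partial
            lines.append(f"<{entry}> <{TOC}alignedWith> <{target}> .")
    return lines


if __name__ == "__main__":
    for line in toc_triples("Vincent van Gogh", [(2, "Life"), (2, "Legacy")]):
        print(line)
```

With this layout, "Life" in one article and "Biography" in another end up pointing at the same resource, so entries can be compared across articles without relying on the exact caption.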

mgns added the gsoc-2018 (Google Summer of Code 2018) and project (This project has been accepted for the GSoC) labels on Jan 23, 2018

mommi84 closed this as completed Dec 2, 2018
