Extracting Table of Contents (TOCs) for Articles #3

Closed
mgns opened this issue Jan 23, 2018 · 10 comments
Labels
gsoc-2018 Google Summer of Code 2018. project This project has been accepted for the GSoC.

Comments

@mgns
Member

mgns commented Jan 23, 2018

Description

Each Wikipedia article is structured by headings and subheadings. These structures indicate which aspects are relevant for the described entity. Extracting such data can help in categorizing entities and the facts about them. For example, cities usually have sections on History, Geography and Demographics, while soccer clubs have sections on Honours, Players and Stadiums. There are pitfalls: these sections are not uniformly captioned, so an alignment (ideally to DBpedia resources) between variations would be helpful. The newly created dataset should follow Linked Data principles, e.g. a sufficiently expressive vocabulary should be used to describe TOCs (ideally as resources), the order of TOC entries, etc.
Optionally, it would be interesting to use the dataset in a meaningful application, e.g. generating missing types.

Goals

Extract TOCs from article pages and produce an RDF dataset describing the article TOCs in a comprehensive way.
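As a rough illustration of what "an RDF dataset describing the article TOCs" could look like, the following sketch serializes an ordered list of headings as N-Triples. The `http://example.org/toc#` namespace, its `hasEntry`, `position` and `level` properties, and the `#toc-entry-N` URI pattern are all hypothetical placeholders, not an existing DBpedia vocabulary; choosing the actual vocabulary is part of the project.

```python
from typing import List, Tuple

# Hypothetical vocabulary namespace, used for illustration only.
TOC_NS = "http://example.org/toc#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def toc_to_ntriples(article_uri: str, headings: List[Tuple[int, str]]) -> List[str]:
    """Turn (level, label) pairs, given in document order, into N-Triples lines.

    Each TOC entry becomes its own resource so that label, position and
    nesting level can be attached to it.
    """
    lines = []
    for position, (level, label) in enumerate(headings, start=1):
        # Hypothetical URI pattern for a TOC entry resource.
        entry = f"<{article_uri}#toc-entry-{position}>"
        lines.append(f"<{article_uri}> <{TOC_NS}hasEntry> {entry} .")
        lines.append(f'{entry} <{RDFS_LABEL}> "{escape_literal(label)}" .')
        lines.append(f'{entry} <{TOC_NS}position> "{position}"^^<{XSD_INTEGER}> .')
        lines.append(f'{entry} <{TOC_NS}level> "{level}"^^<{XSD_INTEGER}> .')
    return lines


if __name__ == "__main__":
    for line in toc_to_ntriples(
        "http://dbpedia.org/resource/Munich",
        [(2, "History"), (2, "Geography")],
    ):
        print(line)
```

Modelling each entry as a resource with an explicit position keeps the order of TOC entries queryable, which a plain list of label literals would not.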

Impact

A new dataset that can be used in various ways, and insights into aspects of DBpedia entities.

Warm up tasks

Mentors

Magnus

Keywords

extraction

@mgns mgns added gsoc-2018 Google Summer of Code 2018. project This project has been accepted for the GSoC. labels Jan 23, 2018
@pratyusha972

@mgns, I am interested in working on this project. Could you please guide me on how to get started?

@mgns
Member Author

mgns commented Feb 15, 2018

I added a warm-up task to this idea. There are two main approaches to solving this task:

  1. write a new extractor that extracts the TOCs from wikitext
  2. write a script which processes the latest NIF dataset

As the first one is the more straightforward solution, you should familiarize yourself with the extraction framework.
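To give a feel for the first approach, here is a minimal sketch that pulls headings out of raw wikitext with a regular expression. It is not the extraction framework's API: `TocEntry` and `extract_toc` are hypothetical names, and the sketch ignores complications such as headings inside comments or `<nowiki>` blocks.

```python
import re
from typing import List, NamedTuple


class TocEntry(NamedTuple):
    order: int  # position among all headings of the article, starting at 1
    level: int  # 2 for "== H ==", 3 for "=== H ===", and so on
    label: str  # heading text without the surrounding "=" markers


# A wikitext heading is a line such as "== Label ==", where the same run of
# "=" characters opens and closes the line. Level 1 is skipped on purpose.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def extract_toc(wikitext: str) -> List[TocEntry]:
    """Return all headings of the given wikitext in document order."""
    return [
        TocEntry(order, len(match.group(1)), match.group(2))
        for order, match in enumerate(HEADING_RE.finditer(wikitext), start=1)
    ]


if __name__ == "__main__":
    sample = (
        "Intro text.\n"
        "== History ==\n"
        "Some text.\n"
        "=== Early history ===\n"
        "More text.\n"
        "== Geography ==\n"
    )
    for entry in extract_toc(sample):
        print(entry.order, entry.level, entry.label)
```

The heading level is kept alongside the order so that the nesting of subheadings can be rebuilt later.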

When writing your proposal, you will have to describe your suggested solution for the problem.

@icemc
Copy link

icemc commented Feb 24, 2018

Hello @mgns I'm also interested in working on this. When I am done with the warm up task, how should I let you know about my progress?

@mgns
Member Author

mgns commented Mar 13, 2018

Simply summarize your findings in a Google Doc and share it with me.

@khikmatullaev

@mgns, could you give me your Gmail address? I want to share the results of the warm-up task.
By the way, I could not find out how to join the DBpedia Slack chat. Could you give me instructions?

@hrishikeshh

Hi @khikmatullaev,
To join the DBpedia Slack, go to this link.
Enter your e-mail address and verify it.

@mgns
Member Author

mgns commented Mar 15, 2018 via email

@mgns
Member Author

mgns commented Mar 16, 2018 via email

@shubhamvb

shubhamvb commented Mar 24, 2018

Hi @mgns, I read the project description and I am interested in working on it. I just wanted to clarify a few questions I had.
I am not sure I understand what you mean when you say the articles are not uniformly captioned and hence an alignment is necessary. What does alignment refer to in this case? Could you please elaborate a bit?

Thanks!

@mgns
Member Author

mgns commented Mar 26, 2018

The project should first extract a TOC for each article in Wikipedia. The TOC should contain all headings and subheadings of the article with their respective labels and order.
To make these TOC entries easier to compare, it would be nice if they were mapped to some common vocabulary. Take for example the entry "Life" in the article on Vincent van Gogh and "Biography" in the article on Aline Charigot. Both entries denote similar or equal concepts. It would be good to capture this in the dataset, e.g. one could map both to a DBpedia resource such as http://dbpedia.org/resource/Biography. Sometimes a specific resource is used as a heading, e.g. "Munich International Airport" in the article on Munich.
Such a mapping might be partial.
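The alignment idea can be sketched as a lookup from normalized heading labels to DBpedia resources. The alias table below is hand-written for illustration and only covers the "Life"/"Biography" example from this comment; `HEADING_ALIASES` and `align_heading` are hypothetical names, and a real solution would need a much larger or learned mapping.

```python
from typing import Dict, Optional

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"

# Illustrative alias table: normalized heading label -> DBpedia resource name.
# Only the example from the discussion is covered; everything else is unmapped.
HEADING_ALIASES: Dict[str, str] = {
    "life": "Biography",
    "biography": "Biography",
}


def align_heading(label: str) -> Optional[str]:
    """Map a heading label to a DBpedia resource URI, or None if unknown.

    Returning None for unknown labels reflects that the mapping may be partial.
    """
    # Normalize case and collapse whitespace before the lookup.
    key = " ".join(label.lower().split())
    target = HEADING_ALIASES.get(key)
    if target is None:
        return None
    return DBPEDIA_RESOURCE + target


if __name__ == "__main__":
    for heading in ["Life", "Biography", "Honours"]:
        print(heading, "->", align_heading(heading))
```

Headings that name a specific resource, such as "Munich International Airport", are not handled by this table and would need a separate entity lookup.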

@mommi84 mommi84 closed this as completed Dec 2, 2018