Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NikkanGeadai #689

Open
wants to merge 1 commit into
base: add-sankei-shimbun
Choose a base branch
from
Open

Add NikkanGeadai #689

wants to merge 1 commit into from

Conversation

MaxDall
Copy link
Collaborator

@MaxDall MaxDall commented Jan 22, 2025

This PR also adds a new utility function to transform <br> into <p> elements. Maybe @addie9800 wants to give it a shot for TheNamibian and also TimesOfIndia.

@MaxDall MaxDall requested a review from addie9800 January 22, 2025 13:49
@MaxDall MaxDall changed the base branch from master to add-sankei-shimbun January 22, 2025 13:49
Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this addition 👍


@attribute
def topics(self) -> List[str]:
return generic_topic_parsing(self.precomputed.meta.get("keywords"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the 'keywords' aren't really useful, since they're the same for all articles I've seen:
['Nikkan Gendai DIGITAL', 'Nikkan Gendai', 'Politics', 'Society', 'Entertainment', 'Entertainment', 'Celebrities', 'Sports', 'Column', 'Life', 'Health', 'Business', 'Gendai', 'Gendai', 'gendai', 'Nikkan Gendai']
Instead, I would use the related keywords list, that exists for some articles, e.g. https://www.nikkan-gendai.com/articles/view/news/366688 (search for 関連キーワード)

for child in element:
element.remove(child)
element.tail = None
element.text = "Test"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
element.text = "Test"

I think this is leftover :)

@@ -271,6 +271,55 @@ def get_meta_content(root: lxml.html.HtmlElement) -> Dict[str, str]:
return metadata


def transform_breaks_to_paragraphs(element: lxml.html.HtmlElement, **attribs: str) -> lxml.html.HtmlElement:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good idea, but it isn't (yet) viable for TimesOfIndia. Not necessarily all paragraphs are separated with <br> blocks, but also follow <div> blocks for example, which is currently producing issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants