Add `NikkanGeadai` #689

MaxDall · 2025-01-22T13:49:23Z

This PR also adds a new utility function to transform <br> into <p> elements. Maybe @addie9800 wants to give it a shot for TheNamibian and also TimesOfIndia.
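
For readers who have not opened the diff, here is a minimal sketch of the general idea only, not the code added in this PR. It assumes the standard library's `xml.etree.ElementTree` instead of `lxml`, a well-formed input fragment, and a hypothetical function name `breaks_to_paragraphs`. Only `<br>` elements that are direct children of the given element are treated as separators.

```python
import xml.etree.ElementTree as ET


def breaks_to_paragraphs(element: ET.Element, **attribs: str) -> ET.Element:
    """Regroup the content of `element` into <p> children, splitting at <br>.

    Hypothetical helper for illustration; not the function from this PR.
    """
    # Each run holds the text that opens it and the inline children that follow.
    runs = [[element.text or "", []]]
    for child in list(element):
        if child.tag == "br":
            # A <br> closes the current run; its tail text opens the next one.
            runs.append([child.tail or "", []])
        else:
            # Inline children keep their own tail text, so order is preserved.
            runs[-1][1].append(child)

    # Empty the element before rebuilding it from the collected runs.
    for child in list(element):
        element.remove(child)
    element.text = None

    for text, children in runs:
        if not text.strip() and not children:
            continue  # consecutive <br> tags produce empty runs; skip them
        paragraph = ET.SubElement(element, "p", dict(attribs))
        paragraph.text = text
        paragraph.extend(children)
    return element


if __name__ == "__main__":
    root = ET.fromstring("<div>First<br/>Second <b>bold</b> tail<br/><br/>Third</div>")
    print(ET.tostring(breaks_to_paragraphs(root), encoding="unicode"))
```

Because the sketch only looks at direct `<br>` children, content that mixes `<br>` separators with nested block elements such as `<div>` would need additional handling.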

addie9800

Thanks for this addition 👍

addie9800 · 2025-01-23T18:59:28Z

src/fundus/publishers/jp/nikkan_geadai.py

+
+        @attribute
+        def topics(self) -> List[str]:
+            return generic_topic_parsing(self.precomputed.meta.get("keywords"))


I don't think the 'keywords' meta tag is useful here, since its value is the same for every article I've seen:
['Nikkan Gendai DIGITAL', 'Nikkan Gendai', 'Politics', 'Society', 'Entertainment', 'Entertainment', 'Celebrities', 'Sports', 'Column', 'Life', 'Health', 'Business', 'Gendai', 'Gendai', 'gendai', 'Nikkan Gendai']
Instead, I would use the related keywords list that exists for some articles, e.g. https://www.nikkan-gendai.com/articles/view/news/366688 (search for 関連キーワード, "related keywords").

addie9800 · 2025-01-23T19:11:26Z

src/fundus/parser/utility.py

+        for child in element:
+            element.remove(child)
+        element.tail = None
+        element.text = "Test"


Suggested change

element.text = "Test"

I think this is a leftover :)

addie9800 · 2025-01-23T20:14:38Z

src/fundus/parser/utility.py

@@ -271,6 +271,55 @@ def get_meta_content(root: lxml.html.HtmlElement) -> Dict[str, str]:
    return metadata


+def transform_breaks_to_paragraphs(element: lxml.html.HtmlElement, **attribs: str) -> lxml.html.HtmlElement:


This is a really good idea, but it isn't (yet) viable for TimesOfIndia. Not all paragraphs there are separated by <br> elements; some follow <div> blocks, for example, which currently causes issues.

add NikkanGeadai

cf0d958

MaxDall requested a review from addie9800 January 22, 2025 13:49

MaxDall changed the base branch from master to add-sankei-shimbun January 22, 2025 13:49

addie9800 requested changes Jan 23, 2025
