Add support for alternate links with hreflang (#55)
* Add support for alternate hreflang link

* add docs

* lint

* Improve docs

* Improve tests

* Add changelog entry
freddyheppell authored Jan 20, 2025
1 parent 32c6478 commit 8da4ed9
Showing 6 changed files with 217 additions and 3 deletions.
7 changes: 7 additions & 0 deletions docs/changelog.rst
@@ -1,6 +1,13 @@
Changelog
=========

Upcoming
--------

**New Features**

* Added support for :ref:`alternate localised pages <sitemap-extra-localisation>` with ``hreflang``.

v1.0.0 (2025-01-13)
-------------------

23 changes: 23 additions & 0 deletions docs/reference/formats.rst
@@ -132,6 +132,8 @@ The Google News extension provides additional information to describe the news s

If the page contains Google News data, it is stored as a :class:`~usp.objects.page.SitemapNewsStory` object in :attr:`SitemapPage.news_story <usp.objects.page.SitemapPage.news_story>`.

.. _google-image-ext:

Google Image
""""""""""""

@@ -150,6 +152,27 @@ If the page contains Google Image data, it is stored as a list of :class:`~usp.o

.. _xml date:

Additional Features
^^^^^^^^^^^^^^^^^^^

Beyond the Sitemap specification, USP also supports some non-standard features used by large sitemap consumers (e.g. Google).

.. _sitemap-extra-localisation:

Alternate Localised Pages
"""""""""""""""""""""""""

- `Google documentation <https://developers.google.com/search/docs/specialty/international/localized-versions#sitemap>`__

.. dropdown:: Example
    :class-container: flush

    .. literalinclude:: formats_examples/hreflang.xml
        :emphasize-lines: 3,7-10,15-18
        :language: xml

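Alternate localised pages specified with ``<xhtml:link rel="alternate">`` elements are stored as a list of ``(language code, URL)`` tuples in :attr:`SitemapPage.alternates <usp.objects.page.SitemapPage.alternates>`, which is ``None`` if the page has no valid alternates. Links missing the ``hreflang`` or ``href`` attribute are skipped with a warning. Language codes are not validated.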

Date Time Parsing
^^^^^^^^^^^^^^^^^

20 changes: 20 additions & 0 deletions docs/reference/formats_examples/hreflang.xml
@@ -0,0 +1,20 @@
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <url>
        <loc>https://example.org/en/page</loc>
        <lastmod>2024-01-01</lastmod>
        <xhtml:link
            rel="alternate"
            hreflang="fr-FR"
            href="https://example.org/fr/page"/>
    </url>
    <url>
        <loc>https://example.org/fr/page</loc>
        <lastmod>2024-01-02</lastmod>
        <xhtml:link
            rel="alternate"
            hreflang="en-GB"
            href="https://example.org/en/page"/>
    </url>
</urlset>
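
As an illustration of what the example file encodes, the following standalone sketch reads a sitemap of the same shape using only Python's standard library and collects the ``(hreflang, href)`` pairs for each page. It does not use USP and is not how USP parses sitemaps internally; the names ``EXAMPLE_SITEMAP``, ``local_name`` and ``alternates_by_page`` are hypothetical, and matching elements by local name (ignoring the namespace URI) is an assumption made for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# A trimmed copy of the documentation example (hypothetical constant name).
EXAMPLE_SITEMAP = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.org/en/page</loc>
    <xhtml:link rel="alternate" hreflang="fr-FR"
                href="https://example.org/fr/page"/>
  </url>
  <url>
    <loc>https://example.org/fr/page</loc>
    <xhtml:link rel="alternate" hreflang="en-GB"
                href="https://example.org/en/page"/>
  </url>
</urlset>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alternates_by_page(xml_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each <loc> URL to its list of (hreflang, href) alternates."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    result: Dict[str, List[Tuple[str, str]]] = {}
    for url in root:
        if local_name(url.tag) != "url":
            continue
        loc = None
        alternates: List[Tuple[str, str]] = []
        for child in url:
            name = local_name(child.tag)
            if name == "loc":
                loc = (child.text or "").strip()
            elif name == "link" and child.get("rel") == "alternate":
                hreflang, href = child.get("hreflang"), child.get("href")
                # Keep only links that carry both attributes.
                if hreflang is not None and href is not None:
                    alternates.append((hreflang, href))
        if loc:
            result[loc] = alternates
    return result


pages = alternates_by_page(EXAMPLE_SITEMAP)
for page_url, alternates in pages.items():
    print(page_url, alternates)
```

Unlike ``SitemapPage.alternates``, this sketch returns an empty list rather than ``None`` for a page without alternates.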
118 changes: 118 additions & 0 deletions tests/tree/test_xml_exts.py
@@ -105,3 +105,121 @@ def test_xml_image(self, requests_mock):
        print(tree)

        assert tree == expected_sitemap_tree


class TestXMLHrefLang(TreeTestBase):
    def test_hreflang(self, requests_mock):
        requests_mock.add_matcher(TreeTestBase.fallback_to_404_not_found_matcher)

        requests_mock.get(
            self.TEST_BASE_URL + "/robots.txt",
            headers={"Content-Type": "text/plain"},
            text=textwrap.dedent(
                f"""
                User-agent: *
                Disallow: /whatever
                Sitemap: {self.TEST_BASE_URL}/sitemap.xml
                """
            ).strip(),
        )

        requests_mock.get(
            self.TEST_BASE_URL + "/sitemap.xml",
            headers={"Content-Type": "text/xml"},
            text=textwrap.dedent(
                f"""
                <?xml version="1.0" encoding="UTF-8"?>
                <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
                    <url>
                        <loc>{self.TEST_BASE_URL}/en/page</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link rel="alternate" hreflang="fr-FR" href="{self.TEST_BASE_URL}/fr/page"/>
                    </url>
                    <url>
                        <loc>{self.TEST_BASE_URL}/fr/page</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link rel="alternate" hreflang="en-GB" href="{self.TEST_BASE_URL}/en/page"/>
                    </url>
                </urlset>
                """
            ).strip(),
        )

        tree = sitemap_tree_for_homepage(self.TEST_BASE_URL)

        pages = list(tree.all_pages())
        assert pages[0].alternates == [
            ("fr-FR", f"{self.TEST_BASE_URL}/fr/page"),
        ]
        assert pages[1].alternates == [
            ("en-GB", f"{self.TEST_BASE_URL}/en/page"),
        ]

    def test_missing_attrs(self, requests_mock):
        requests_mock.add_matcher(TreeTestBase.fallback_to_404_not_found_matcher)

        requests_mock.get(
            self.TEST_BASE_URL + "/robots.txt",
            headers={"Content-Type": "text/plain"},
            text=textwrap.dedent(
                f"""
                User-agent: *
                Disallow: /whatever
                Sitemap: {self.TEST_BASE_URL}/sitemap.xml
                """
            ).strip(),
        )

        requests_mock.get(
            self.TEST_BASE_URL + "/sitemap.xml",
            headers={"Content-Type": "text/xml"},
            text=textwrap.dedent(
                f"""
                <?xml version="1.0" encoding="UTF-8"?>
                <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
                    <url>
                        <loc>{self.TEST_BASE_URL}/en/page</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link rel="alternate" href="{self.TEST_BASE_URL}/fr/page"/>
                    </url>
                    <url>
                        <loc>{self.TEST_BASE_URL}/en/page2</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link hreflang="fr-FR" href="{self.TEST_BASE_URL}/fr/page2"/>
                    </url>
                    <url>
                        <loc>{self.TEST_BASE_URL}/fr/page</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link rel="alternate" hreflang="en-GB"/>
                    </url>
                    <url>
                        <loc>{self.TEST_BASE_URL}/fr/page2</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link hreflang="en-GB" href="{self.TEST_BASE_URL}/en/page2"/>
                    </url>
                </urlset>
                """
            ).strip(),
        )

        tree = sitemap_tree_for_homepage(self.TEST_BASE_URL)

        pages = list(tree.all_pages())
        assert pages[0].alternates is None
        assert pages[1].alternates is None
        assert pages[2].alternates is None
        assert pages[3].alternates is None
assert pages[3].alternates is None
20 changes: 20 additions & 0 deletions usp/fetch_parse.py
@@ -643,6 +643,7 @@ class Page:
            "news_keywords",
            "news_stock_tickers",
            "images",
            "alternates",
        ]

        def __init__(self):
@@ -659,6 +660,7 @@ def __init__(self):
            self.news_keywords = None
            self.news_stock_tickers = None
            self.images = []
            self.alternates = []

        def __hash__(self):
            return hash(
@@ -763,13 +765,18 @@ def page(self) -> Optional[SitemapPage]:
                        for image in self.images
                    ]

            alternates = None
            if len(self.alternates) > 0:
                alternates = self.alternates

            return SitemapPage(
                url=url,
                last_modified=last_modified,
                change_frequency=change_frequency,
                priority=priority,
                news_story=sitemap_news_story,
                images=sitemap_images,
                alternates=alternates,
            )

    __slots__ = ["_current_page", "_pages", "_page_urls", "_current_image"]
@@ -801,6 +808,19 @@ def xml_element_start(self, name: str, attrs: Dict[str, str]) -> None:
                    "Page is expected to be set before <image:image>."
                )
            self._current_image = self.Image()
        elif name == "link":
            if not self._current_page:
                raise SitemapXMLParsingException(
                    "Page is expected to be set before <link>."
                )
            # This branch also catches a rel attribute that is present but
            # not equal to "alternate", although the message says "missing".
            if "rel" not in attrs or attrs["rel"] != "alternate":
                log.warning(f"<link> element is missing rel attribute: {attrs}.")
            # Both hreflang and href are required; incomplete links are skipped.
            elif "hreflang" not in attrs or "href" not in attrs:
                log.warning(
                    f"<link> element is missing hreflang or href attributes: {attrs}."
                )
            else:
                # Store the alternate as a (language code, URL) tuple.
                self._current_page.alternates.append((attrs["hreflang"], attrs["href"]))

    def __require_last_char_data_to_be_set(self, name: str) -> None:
        if not self._last_char_data:
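
The ``<link>`` handling in this hunk can be read as a small pure function over the element's attributes. The sketch below restates that decision logic outside the parser so it can be tried in isolation; ``alternate_from_link_attrs`` is a hypothetical helper that does not exist in USP, and the warning wording is this sketch's own.

```python
import logging
from typing import Dict, Optional, Tuple

log = logging.getLogger(__name__)


def alternate_from_link_attrs(attrs: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (hreflang, href) for a valid alternate link, else None.

    Hypothetical helper mirroring the three branches in the hunk above.
    """
    # Reject links whose rel is absent or is anything other than "alternate".
    if attrs.get("rel") != "alternate":
        log.warning("<link> element does not have rel=\"alternate\": %s", attrs)
        return None
    # Both hreflang and href must be present.
    if "hreflang" not in attrs or "href" not in attrs:
        log.warning("<link> element is missing hreflang or href: %s", attrs)
        return None
    # The language code is passed through as-is, without validation.
    return attrs["hreflang"], attrs["href"]


valid = alternate_from_link_attrs(
    {"rel": "alternate", "hreflang": "fr-FR", "href": "https://example.org/fr/page"}
)
no_hreflang = alternate_from_link_attrs(
    {"rel": "alternate", "href": "https://example.org/fr/page"}
)
no_rel = alternate_from_link_attrs(
    {"hreflang": "fr-FR", "href": "https://example.org/fr/page2"}
)
no_href = alternate_from_link_attrs({"rel": "alternate", "hreflang": "en-GB"})
```

The four calls correspond to the valid case in ``test_hreflang`` and the three kinds of incomplete link exercised by ``test_missing_attrs``.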
32 changes: 29 additions & 3 deletions usp/objects/page.py
@@ -3,7 +3,7 @@
import datetime
from decimal import Decimal
from enum import Enum, unique
-from typing import List, Optional
+from typing import List, Optional, Tuple

SITEMAP_PAGE_DEFAULT_PRIORITY = Decimal("0.5")
"""Default sitemap page priority, as per the spec."""
@@ -331,6 +331,7 @@ class SitemapPage:
        "__change_frequency",
        "__news_story",
        "__images",
        "__alternates",
    ]

    def __init__(
@@ -341,6 +342,7 @@ def __init__(
        change_frequency: Optional[SitemapPageChangeFrequency] = None,
        news_story: Optional[SitemapNewsStory] = None,
        images: Optional[List[SitemapImage]] = None,
        alternates: Optional[List[Tuple[str, str]]] = None,
    ):
        """
        Initialize a new sitemap-derived page.
@@ -357,6 +359,7 @@ def __init__(
        self.__change_frequency = change_frequency
        self.__news_story = news_story
        self.__images = images
        self.__alternates = alternates

    def __eq__(self, other) -> bool:
        if not isinstance(other, SitemapPage):
@@ -380,6 +383,9 @@ def __eq__(self, other) -> bool:
        if self.images != other.images:
            return False

        if self.alternates != other.alternates:
            return False

        return True

    def __hash__(self):
@@ -442,10 +448,30 @@ def change_frequency(self) -> Optional[SitemapPageChangeFrequency]:

    @property
    def news_story(self) -> Optional[SitemapNewsStory]:
-        """Get the Google News story attached to the URL."""
+        """Get the Google News story attached to the URL.
+
+        See :ref:`google-news-ext` reference
+        """
        return self.__news_story

    @property
    def images(self) -> Optional[List[SitemapImage]]:
-        """Get the images attached to the URL."""
+        """Get the images attached to the URL.
+
+        See :ref:`google-image-ext` reference
+        """
        return self.__images

    @property
    def alternates(self) -> Optional[List[Tuple[str, str]]]:
        """Get the alternate URLs for the URL.

        A tuple of (language code, URL) for each ``<xhtml:link>`` element with ``rel="alternate"`` attribute.

        See :ref:`sitemap-extra-localisation` reference

        Example::

            [('fr', 'https://www.example.com/fr/page'), ('de', 'https://www.example.com/de/page')]
        """
        return self.__alternates
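
Because ``alternates`` is either ``None`` or a list of ``(language code, URL)`` tuples, calling code has to handle both cases. A minimal sketch of one way to do that, assuming a lookup by language code is what is wanted; ``alternates_to_dict`` is a hypothetical helper, and the sample list stands in for the value of ``SitemapPage.alternates``:

```python
from typing import Dict, List, Optional, Tuple


def alternates_to_dict(
    alternates: Optional[List[Tuple[str, str]]],
) -> Dict[str, str]:
    """Turn the (language code, URL) list into a lookup table.

    None is treated as "no alternates". If a language code appears more
    than once, the last URL wins; codes are used exactly as given.
    """
    return dict(alternates or [])


# Stand-in for page.alternates, shaped like the docstring example.
sample = [
    ("fr", "https://www.example.com/fr/page"),
    ("de", "https://www.example.com/de/page"),
]

by_language = alternates_to_dict(sample)
print(by_language.get("fr"))        # URL of the French alternate
print(alternates_to_dict(None))     # empty dict when there are no alternates
```

Since language codes are not validated or normalised, a lookup for ``"fr"`` will not match an entry stored as ``"fr-FR"``; any such matching is left to the caller.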
