
Support ShadowDOM #555

Open · mantou132 wants to merge 2 commits into master from patch-1
Conversation

@mantou132 commented Feb 23, 2022

@bodinsamuel (Contributor) left a comment:

Thanks for your PR 🚀

Can you provide a test, and a way to enable/disable this behavior via the API?
We probably won't use this feature on our side, so we most likely want it disabled by default.

If you prefer, I can take over this PR, but it will probably take longer ☺️

const rootAttr = [...document.documentElement.attributes]
.map(({ name, value }) => `${name}="${value}"`)
.join(' ');
const innerContent = (document.documentElement as any).getInnerHTML();
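The snippet above collects the root element's attributes and its serialized inner HTML as two separate pieces; presumably they are then recombined into a full document string. A minimal sketch of that recombination, assuming plain `{ name, value }` attribute objects (`wrapHtml` is a hypothetical helper, not the PR's actual code):

```javascript
// Hypothetical sketch: rebuild a full <html> document string from the root
// element's attributes plus its serialized inner HTML, so attributes such
// as lang/dir on <html> survive alongside the serialized shadow roots.
function wrapHtml(attributes, innerContent) {
  const rootAttr = attributes
    .map(({ name, value }) => `${name}="${value}"`)
    .join(' ');
  return `<html ${rootAttr}>${innerContent}</html>`;
}
```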
@bodinsamuel (Contributor):
Shouldn't it need `{ includeShadowRoots: true }`?

@mantou132 (Author):

An open-mode ShadowDOM can be obtained without passing this parameter:

In order to preserve encapsulation semantics, any closed shadow roots within an element will not be serialized by default.

The default behavior seems to be what we want.

@mantou132 (Author):

@bodinsamuel I'd love to have ShadowDOM support in DocSearch, since my docs sites use WebComponents.

If ShadowDOM is not supported by default, would it be possible to enable this feature in the Crawler config?

I'd be glad for you to take over this PR; I don't have much time to polish it.

Thank you for your team's work.

@bodinsamuel (Contributor):

Ah, it's for DocSearch; I wasn't aware. In that case we might indeed want to use it, ahah.
Can you share your website? ☺️

@mantou132 commented Feb 24, 2022

I have several websites using WebComponents.

I used to use a fork of docsearch-scraper.

@mantou132 mantou132 force-pushed the patch-1 branch 2 times, most recently from baacb1e to b724543 Compare January 10, 2024 18:49
@mantou132 commented Jan 10, 2024

I'm not sure how DocSearch retrieves information from the HTML string. If it uses a parser to analyze the HTML and then queries through the DOM API, we need to remove the `<template shadowrootmode>` wrappers to prevent them from being re-parsed as ShadowDOM, e.g.:

document.body.innerHTML
  .replaceAll('<template shadowrootmode="open">', '')
  .replaceAll('</template>', '')

Update:

It seems that the Crawler uses Cheerio, so there is no need to remove the `<template>` tags.
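For parsers that would re-attach declarative shadow roots, the stripping step discussed above can be sketched as a small helper (`stripShadowRootWrappers` is a hypothetical name; note that the blanket `</template>` removal would also strip the closing tags of ordinary `<template>` elements, so this is only a sketch):

```javascript
// Remove declarative shadow DOM wrappers from serialized HTML so that a
// DOM parser does not re-attach them as shadow roots on re-parse.
// Caveat: the naive '</template>' removal also drops closing tags of
// ordinary <template> elements.
function stripShadowRootWrappers(html) {
  return html
    .replaceAll('<template shadowrootmode="open">', '')
    .replaceAll('</template>', '');
}
```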

Use `getInnerHTML`
@@ -271,7 +271,23 @@ export class BrowserPage {
return await promiseWithTimeout(
(async (): Promise<string | null> => {
const start = Date.now();
const content = await this.#ref?.content();

@mantou132 (Author):

@bodinsamuel It now uses the standard method `getHTML`.
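Per the HTML spec, the standard `getHTML()` (the successor to the non-standard `getInnerHTML()`) includes a shadow root only when that root is marked serializable, or when it is passed explicitly via the `shadowRoots` option. A minimal sketch of the call shape, using a mock root so it can run outside a browser (`serializePage` and `mockRoot` are assumptions, not the PR's code):

```javascript
// Call shape of the standard getHTML() serialization.
// serializePage is a hypothetical helper.
function serializePage(root) {
  // Include shadow roots whose `serializable` flag is set (declaratively,
  // via the shadowrootserializable attribute).
  return root.getHTML({ serializableShadowRoots: true });
}

// Mock standing in for document.documentElement, since getHTML() only
// exists in a real browser DOM:
const mockRoot = {
  getHTML(options = {}) {
    return options.serializableShadowRoots
      ? '<x-app><template shadowrootmode="open"><p>hi</p></template></x-app>'
      : '<x-app></x-app>';
  },
};
```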
