
Support ShadowDOM #555

Open · mantou132 wants to merge 2 commits into master from patch-1
Conversation

@mantou132 commented Feb 23, 2022

@bodinsamuel (Contributor) left a comment:

Thanks for your PR 🚀

Can you provide a test, and a way to enable/disable this behavior via the API?
We probably won't use this feature on our side, so we most likely want it disabled by default.

If you prefer, I can take over this PR, but it will probably take longer ☺️

const rootAttr = [...document.documentElement.attributes]
.map(({ name, value }) => `${name}="${value}"`)
.join(' ');
const innerContent = (document.documentElement as any).getInnerHTML();
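The snippet above collects the root element's attributes and its serialized inner HTML as two separate pieces; presumably they are then recombined into a full document string. A minimal sketch of that recombination, assuming plain `{ name, value }` attribute objects (`wrapHtml` is a hypothetical helper, not the PR's actual code):

```javascript
// Hypothetical sketch: rebuild a full <html> document string from the root
// element's attributes plus its serialized inner HTML, so attributes such
// as lang/dir on <html> survive alongside the serialized shadow roots.
function wrapHtml(attributes, innerContent) {
  const rootAttr = attributes
    .map(({ name, value }) => `${name}="${value}"`)
    .join(' ');
  return `<html ${rootAttr}>${innerContent}</html>`;
}
```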
@bodinsamuel (Contributor):
Shouldn't it need `{ includeShadowRoots: true }`?

@mantou132 (Author):

An open-mode ShadowDOM can be obtained without passing this parameter:

In order to preserve encapsulation semantics, any closed shadow roots within an element will not be serialized by default.

The default behavior seems to be what we want.

@mantou132 (Author):

@bodinsamuel I'd love to have ShadowDOM support in DocSearch, since my docs sites use WebComponents.

If ShadowDOM is not supported by default, would it be possible to enable this feature in the Crawler config?

I'd be glad for you to take over this PR; I don't have much time to polish it.

Thank you for your team's work.

@bodinsamuel (Contributor):

Ah, it's for DocSearch; I wasn't aware. In that case we might indeed want to use it, ahah.
Can you share your website? ☺️

@mantou132 commented Feb 24, 2022

I have several websites using WebComponents.

I used to use a fork of docsearch-scraper.

@mantou132 mantou132 force-pushed the patch-1 branch 2 times, most recently from baacb1e to b724543 Compare January 10, 2024 18:49
@mantou132 commented Jan 10, 2024

I'm not sure how DocSearch retrieves information from the HTML string. If it uses a parser to analyze the HTML and then queries through the DOM API, we need to remove the `<template shadowrootmode>` wrappers to prevent them from being re-parsed as ShadowDOM, e.g.:

document.body.innerHTML
  .replaceAll('<template shadowrootmode="open">', '')
  .replaceAll('</template>', '')

Update:

It seems that the Crawler uses Cheerio, so there is no need to remove the `<template>` tags.
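For parsers that would re-attach declarative shadow roots, the stripping step discussed above can be sketched as a small helper (`stripShadowRootWrappers` is a hypothetical name; note that the blanket `</template>` removal would also strip the closing tags of ordinary `<template>` elements, so this is only a sketch):

```javascript
// Remove declarative shadow DOM wrappers from serialized HTML so that a
// DOM parser does not re-attach them as shadow roots on re-parse.
// Caveat: the naive '</template>' removal also drops closing tags of
// ordinary <template> elements.
function stripShadowRootWrappers(html) {
  return html
    .replaceAll('<template shadowrootmode="open">', '')
    .replaceAll('</template>', '');
}
```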

Use `getInnerHTML`
@@ -271,7 +271,23 @@ export class BrowserPage {
return await promiseWithTimeout(
(async (): Promise<string | null> => {
const start = Date.now();
const content = await this.#ref?.content();

@mantou132 (Author):

@bodinsamuel It now uses the standard method `getHTML`.
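Per the HTML spec, the standard `getHTML()` (the successor to the non-standard `getInnerHTML()`) includes a shadow root only when that root is marked serializable, or when it is passed explicitly via the `shadowRoots` option. A minimal sketch of the call shape, using a mock root so it can run outside a browser (`serializePage` and `mockRoot` are assumptions, not the PR's code):

```javascript
// Call shape of the standard getHTML() serialization.
// serializePage is a hypothetical helper.
function serializePage(root) {
  // Include shadow roots whose `serializable` flag is set (declaratively,
  // via the shadowrootserializable attribute).
  return root.getHTML({ serializableShadowRoots: true });
}

// Mock standing in for document.documentElement, since getHTML() only
// exists in a real browser DOM:
const mockRoot = {
  getHTML(options = {}) {
    return options.serializableShadowRoots
      ? '<x-app><template shadowrootmode="open"><p>hi</p></template></x-app>'
      : '<x-app></x-app>';
  },
};
```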
