Does this support waitUntil: networkidle? #36
Comments
I'm not exactly sure if I understood you correctly. But I believe a special |
Sorry, I should've been more specific. I was asking about something akin to the

Although I don't particularly appreciate the closed set of events in puppeteer, they're good enough. Take

If at all possible, however, what I'm really after is the ability to wait until there are at most X active requests for Y amount of time. If there's a way to get how many network connections there are, I could implement the wait myself.

As a huge plus, it would be really awesome if one could get a list of active network requests instead of just a count, with e.g. their URL and/or the request type, if known (for example XHR). This way you could also decide whether waiting for e.g. a stylesheet is really necessary. In my particular use case I'd ignore it, because I only need to wait for dynamic content, i.e. ajax requests. I have no idea if this is a possibility at all though.

Sidenote: I couldn't find |
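The "at most X active requests for Y amount of time" wait described above can be sketched independently of any browser library. This is only a sketch under stated assumptions: `NetworkIdleTracker`, `TrackedRequest`, `waitForIdle`, and all field names are hypothetical and not part of this library or of puppeteer. Something else has to call `requestStarted` and `requestFinished` from whatever network events are available.

```typescript
// Hypothetical shape of one in-flight request; the field names are assumptions.
interface TrackedRequest {
  id: string;
  url: string;
  type: string; // e.g. "XHR", "Stylesheet", "Document"
}

// Tracks in-flight requests and reports "idle" once at most `maxActive`
// relevant requests have been in flight for at least `quietMs` milliseconds.
class NetworkIdleTracker {
  private inflight = new Map<string, TrackedRequest>();
  private quietSince: number | null;
  private maxActive: number;
  private quietMs: number;
  private relevant: (r: TrackedRequest) => boolean;

  constructor(
    maxActive: number,
    quietMs: number,
    relevant: (r: TrackedRequest) => boolean = () => true,
    now: number = Date.now(),
  ) {
    this.maxActive = maxActive;
    this.quietMs = quietMs;
    this.relevant = relevant;
    this.quietSince = now; // nothing is in flight yet, so the quiet period starts now
  }

  // Call when a request is sent. Requests rejected by the filter are ignored.
  requestStarted(req: TrackedRequest, now: number = Date.now()): void {
    if (!this.relevant(req)) return;
    this.inflight.set(req.id, req);
    this.update(now);
  }

  // Call when a request completes or fails.
  requestFinished(id: string, now: number = Date.now()): void {
    if (this.inflight.delete(id)) this.update(now);
  }

  // The list of active requests, not just a count.
  active(): TrackedRequest[] {
    return Array.from(this.inflight.values());
  }

  isIdle(now: number = Date.now()): boolean {
    return this.quietSince !== null && now - this.quietSince >= this.quietMs;
  }

  private update(now: number): void {
    if (this.inflight.size <= this.maxActive) {
      if (this.quietSince === null) this.quietSince = now;
    } else {
      this.quietSince = null; // too many active requests, restart the quiet period later
    }
  }
}

// Polls the tracker until it is idle; resolves to false if `timeoutMs` elapses first.
function waitForIdle(
  tracker: NetworkIdleTracker,
  pollMs: number = 50,
  timeoutMs: number = 30000,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  return new Promise<boolean>((resolve) => {
    const tick = (): void => {
      if (tracker.isIdle()) {
        resolve(true);
      } else if (Date.now() >= deadline) {
        resolve(false);
      } else {
        setTimeout(tick, pollMs);
      }
    };
    tick();
  });
}
```

For the ajax-only case, a filter such as `(r) => r.type === "XHR"` makes the tracker ignore stylesheets and other static resources. The `now` parameters exist only so the timing logic can be exercised without real delays.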
Right now
If I read the puppeteer source code right, these are just aliases
While it would be possible to get access to all the current active
you could subscribe to the events you're interested in on the
So if I understood you correctly, you would like similar options in the You're visiting a page that uses a lot of dynamic content, so the page would be considered |
Informative response, thank you!
If I understood this right, I could subscribe to a "network" event and catch all network requests as they complete? Or is there also a way to catch them when they fire rather than when they finish? If so, that already solves my use case. I'd only suggest adding a note about this to the docs, because I do believe it's a great feature.
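On the "fire versus finish" question: in the Chrome DevTools Protocol itself, the Network domain reports `Network.requestWillBeSent` when a request starts and `Network.loadingFinished` or `Network.loadingFailed` when it ends. How this library exposes those events is not shown in this thread, so the sketch below only assumes that raw event messages with a `method` and `params` are available somewhere. `CdpMessage`, `NetworkNotice`, and `classify` are hypothetical names.

```typescript
// Minimal shape of a raw CDP event message; only the fields used here are typed.
interface CdpMessage {
  method: string;
  params: {
    requestId: string;
    type?: string; // resource type such as "XHR", present on requestWillBeSent
    request?: { url: string };
  };
}

interface NetworkNotice {
  phase: "started" | "finished" | "failed";
  requestId: string;
  url: string | null;
  type: string | null;
}

// Translate a raw message into a start/finish notice, or null for unrelated events.
function classify(msg: CdpMessage): NetworkNotice | null {
  const id = msg.params.requestId;
  switch (msg.method) {
    case "Network.requestWillBeSent":
      return {
        phase: "started",
        requestId: id,
        url: msg.params.request?.url ?? null,
        type: msg.params.type ?? null,
      };
    case "Network.loadingFinished":
      return { phase: "finished", requestId: id, url: null, type: null };
    case "Network.loadingFailed":
      return { phase: "failed", requestId: id, url: null, type: null };
    default:
      return null;
  }
}
```

The end events carry only the `requestId`, so the URL and type have to be remembered from the matching start event if they are needed later.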
Yes, pretty close, but I'll try to explain why exactly.

Basically, to build a fully functional web crawler today, you need to be able to run JavaScript for some sites. Otherwise you might miss certain dynamic content, or be unable to crawl at all because the website runs purely on dynamic content. One way would be to implement a JavaScript engine, but then you also need the DOM for such JavaScript to function, like Google has done internally. I've decided against this because it's quite a bit of effort.

Another way is to control a webdriver, but here's the problem: the time you need to wait for dynamic content is unknown. You can't just tell the webdriver to visit the page and start grabbing content immediately, because you'll lose out on a lot of content. To solve this, you need to force a wait of X amount of time and hope it catches all dynamic content. Some pages take over 3-5 seconds to load everything, some even more, though most of course take far less. But because the time is unknown, you need to wait the same amount for every page. As you can imagine, this makes crawling JavaScript-based websites extremely slow. There is a mediocre workaround, inserting a MutationObserver script, but it has its own flaws.

The alternative would then be to use the Chrome DevTools Protocol, as it gives more insight into when exactly the page, including dynamic content, has finished loading. So, to speed up crawling as much as possible, you'd want to be able to tell when all important dynamic content has been fully loaded. This way, some pages may take 3 seconds to crawl, some only 100ms, etc. It speeds it up a goooood amount.

I hope that makes it clearer, and I'd definitely be interested in hacking together a solution if it's possible, which could be implemented into this library in the end!
I kinda have the same problem. I am grabbing a screenshot after doing a |
I saw references to NetworkIdle in the source, and I wonder whether waiting until the network has been idle for X amount of time is supported yet. This is a huge benefit of puppeteer vs webdriver, especially for JS web crawlers, or just to simplify actions on dynamic content. If it is supported and I'm not blind/stupid, examples or documentation on this particular feature would really help boost this library, I believe! :)