Unable to reuse local Chrome user dir/cookies #39
An update... after running the above, it appears that the cookies for the wiki site at the target URI of the crawl have been removed and I needed to log in again. This is a case of crawling-considered-harmful and an unfortunate side effect. EDIT: It appears to have affected the retention of other sites' cookies (e.g., facebook.com) as well.
@N0taN3rd Per your suggestion, I pulled 9bbc461 and re-installed with the same (above) config.json. The crawl finished within a couple of minutes with an error, which I did not mention in the ticket description but which may be relevant for debugging:

```
Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
A Fatal Error Occurred
Error: options.stripFragment is renamed to options.stripHash
- index.js:35 module.exports
[Squidwarc]/[normalize-url]/index.js:35:9
- _createHybrid.js:87 wrapper
[Squidwarc]/[lodash]/_createHybrid.js:87:15
- puppeteer.js:155 PuppeteerCrawler.navigate
/private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11
Please Inform The Maintainer Of This Project About It. Information In package.json
```

Upon re-launching Chrome, some sites where I would have had a cookie (including the WS-DL wiki site) showed that I was no longer logged in. Viewing the WARC showed that the URI specified for archiving was not present, but a capture of the wiki login page was present and replayable.
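The fatal error above appears to come from the normalize-url package, which renamed its stripFragment option to stripHash in version 4, so code passing the old option name to a newer normalize-url throws exactly this error. A minimal illustration of the renamed option (not Squidwarc's actual call site):

```js
const normalizeUrl = require('normalize-url')

// normalize-url >= 4 expects stripHash where older releases took stripFragment.
console.log(
  normalizeUrl('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one', {
    stripHash: true
  })
)
// -> https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
```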
As discussed via Slack, making a duplicate of my profile might help resolve this issue. I did so via:
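The exact command is not shown here; a plausible reconstruction on macOS, assuming the default profile location and an arbitrary destination:

```sh
# Quit Chrome first, then copy the default macOS profile so the crawl
# touches the copy rather than the live profile:
cp -R "$HOME/Library/Application Support/Google/Chrome" /tmp/chrome-profile-copy
```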
...then ran the crawl again. My macOS 10.14.2 Chrome reports version 72.0.3626.81. wsdlwiki.config is the same as above but with the path changed to the duplicated profile directory.
It's interesting and potentially problematic that Squidwarc/puppeteer is trying to use Chrome 73.0.3679.0 per the error. Do you think the version difference is the issue, @N0taN3rd, or something else?
TBH I am completely unsure at this point. I have had success on Linux using the same browser (though I do have to re-sign in every time). The best bet I can think of now is to use a completely new user data dir: initially launch the version of Chrome you want with the --user-data-dir flag pointing at a fresh directory. That way, when you start the crawl, that completely new user data dir is unique to that browser.
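On macOS that first launch would look something like this (the binary path is the standard install location; the profile directory name is just an example):

```sh
# Launch the desired Chrome build once against a brand-new profile dir,
# sign in to the sites you need, then quit before starting the crawl:
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
  --user-data-dir=/tmp/squidwarc-profile
```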
This will require some additional changes to Squidwarc, but I suspect that the issue is with setting the user data dir itself rather than letting the browser's normal resolution of that directory's path take place. So if there were a config option to not do anything data-dir/password related, the browser would figure it out correctly. //cc @N0taN3rd
Having to re-sign-in somewhat defeats the purpose of reusing the user data directory. It's reminiscent of the Webrecorder approach (:P) and is not nearly as powerful as reusing existing cookies/logins, if possible. With regard to the delta of Chrome versions for the system vs. what is used in Squidwarc, is there currently a way to tell Squidwarc to use a certain version of Chrome(ium)? Having that match up and reusing the data dir might be one needed test to see if the issue persists.
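For reference, the underlying Puppeteer API can already be pointed at a specific browser binary via its executablePath launch option; whether Squidwarc exposes this in its config is the open question. A sketch using Puppeteer directly (the paths are illustrative assumptions):

```js
const puppeteer = require('puppeteer')

;(async () => {
  // Launch the system Chrome instead of Puppeteer's bundled Chromium.
  const browser = await puppeteer.launch({
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    userDataDir: '/tmp/chrome-profile-copy' // e.g., the duplicated profile from above
  })
  const page = await browser.newPage()
  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly')
  await browser.close()
})()
```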
For the truly desperate, I was able to load some cookies by doing the following:
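The original snippet is not preserved here; below is a hypothetical reconstruction using Puppeteer's page.setCookie, assuming cookies have been exported to a cookies.json file:

```js
// Hypothetical reconstruction -- the original snippet is not shown.
// Loads cookies exported to JSON (e.g., via a browser extension) and
// injects them into the page before navigating.
const fs = require('fs')
const puppeteer = require('puppeteer')

;(async () => {
  const cookies = JSON.parse(fs.readFileSync('cookies.json', 'utf8'))
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  // Each cookie object needs at least { name, value, domain } per the
  // Puppeteer page.setCookie API.
  await page.setCookie(...cookies)
  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly')
  await browser.close()
})()
```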
This lets you "load" cookies into the session.
Are you submitting a bug report or a feature request?
Bug report.
What is the current behavior?
https://github.com/N0taN3rd/Squidwarc/blob/master/manual/configuration.md#userdatadir states that a `userDataDir` attribute can be specified to reuse the user directory of the system's Chrome. I use a logged-in instance of Chrome on my system, so I wanted to leverage my logged-in cookies to crawl content behind authentication using Squidwarc. I specify a config file for Squidwarc (see the sketch following the log below) and get this output:

```
Crawler Operating In page-all-links mode
Crawler Will Be Preserving 1 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At /private/tmp/Squidwarc
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Running user script
Crawler Generating WARC
Crawler Has 18 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Running user script
Crawler Generating WARC
Crawler Has 17 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Running user script
Crawler Generating WARC
Crawler Has 16 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Running user script
Crawler Generating WARC
Crawler Has 15 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Running user script
Crawler Generating WARC
Crawler Has 14 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Running user script
Crawler Generating WARC
Crawler Has 13 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Running user script
Crawler Generating WARC
Crawler Has 12 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Running user script
Crawler Generating WARC
Crawler Has 11 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Running user script
Crawler Generating WARC
Crawler Has 10 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Running user script
Crawler Generating WARC
Crawler Has 9 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Running user script
Crawler Generating WARC
Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Crawler Navigated To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Running user script
Crawler Generating WARC
Crawler Has 7 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Running user script
Crawler Generating WARC
Crawler Has 6 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Running user script
Crawler Generating WARC
Crawler Has 5 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Running user script
Crawler Generating WARC
Crawler Has 4 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/
A Fatal Error Occurred
Error: options.stripFragment is renamed to options.stripHash
index.js:35 module.exports
[Squidwarc]/[normalize-url]/index.js:35:9
_createHybrid.js:87 wrapper
[Squidwarc]/[lodash]/_createHybrid.js:87:15
puppeteer.js:155 PuppeteerCrawler.navigate
/private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11
Please Inform The Maintainer Of This Project About It. Information In package.json
```
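The config referenced above would look something like the following. This is a hypothetical sketch only: the mode, WARC naming, and seed are taken from the log output, but the key names and nesting other than `userDataDir` are assumptions; consult the linked manual page for the authoritative schema.

```jsonc
// Hypothetical sketch -- key names besides userDataDir are assumptions;
// see manual/configuration.md for the real schema.
{
  "use": "puppeteer",
  "mode": "page-all-links",
  "seeds": ["https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly"],
  "warc": { "naming": "url" },
  "connect": {
    "launch": true,
    "userDataDir": "/path/to/local/chrome/profile"
  }
}
```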
What is the expected behavior?
Squidwarc uses my local Chrome's cookies and captures the page behind authentication, per the manual.
What's your environment?
macOS 10.14.2
Squidwarc a402335 (current master)
node v10.12.0
Other information
We discussed this informally via Slack. Previously, I experienced this config script borking my Chrome's user directory (i.e., conventionally using Chrome would no longer allow creds to "stick") but can no longer replicate this.