Unable to reuse local Chrome user dir/cookies #39
An update... after running the above, it appears that the cookies for the wiki site at the target URI of the crawl have been removed and I needed to log in again. This is a case of crawling-considered-harmful and an unfortunate side effect. EDIT: It appears to have affected the retention of other sites' cookies (e.g., facebook.com) as well.
@N0taN3rd Per your suggestion, I pulled 9bbc461 and re-installed with the same (above) config.json. The crawl finished within a couple of minutes with an error, which I did not mention in the ticket description but which may be relevant for debugging:

```
Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
A Fatal Error Occurred
Error: options.stripFragment is renamed to options.stripHash
- index.js:35 module.exports
[Squidwarc]/[normalize-url]/index.js:35:9
- _createHybrid.js:87 wrapper
[Squidwarc]/[lodash]/_createHybrid.js:87:15
- puppeteer.js:155 PuppeteerCrawler.navigate
/private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11
Please Inform The Maintainer Of This Project About It. Information In package.json
```

Upon re-launching Chrome, some sites where I would have had a cookie (including the WS-DL wiki site) showed that I was no longer logged in. Viewing the WARC showed that the URI specified for archiving was not present, but a capture of the wiki login page was present and replayable.
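The fatal error above appears to come from the normalize-url package, which renamed its stripFragment option to stripHash in version 4, so code passing the old option name to a newer normalize-url throws exactly this error. A minimal illustration of the renamed option (not Squidwarc's actual call site):

```js
const normalizeUrl = require('normalize-url')

// normalize-url >= 4 expects stripHash where older releases took stripFragment.
console.log(
  normalizeUrl('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one', {
    stripHash: true
  })
)
// -> https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
```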
As discussed via Slack, making a duplicate of my profile might help resolve this issue. I did so via:
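The exact command is not shown here; a plausible reconstruction on macOS, assuming the default profile location and an arbitrary destination:

```sh
# Quit Chrome first, then copy the default macOS profile so the crawl
# touches the copy rather than the live profile:
cp -R "$HOME/Library/Application Support/Google/Chrome" /tmp/chrome-profile-copy
```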
...then ran the crawl again. My macOS 10.14.2 Chrome reports version 72.0.3626.81. wsdlwiki.config is the same as above but with the path changed to the duplicated profile directory.
It's interesting and potentially problematic that Squidwarc/puppeteer is trying to use Chrome 73.0.3679.0 per the error. Do you think the version difference is the issue, @N0taN3rd, or something else?
TBH I am completely unsure at this point. I have had success on Linux using the same browser (though I do have to re-sign in every time). The best bet I can think of now is to use a completely new user data dir: initially launch the version of Chrome you want with the --user-data-dir flag pointing at a fresh directory. That way, when you start the crawl, that completely new user data dir is unique to that browser.
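On macOS that first launch would look something like this (the binary path is the standard install location; the profile directory name is just an example):

```sh
# Launch the desired Chrome build once against a brand-new profile dir,
# sign in to the sites you need, then quit before starting the crawl:
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
  --user-data-dir=/tmp/squidwarc-profile
```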
This will require some additional changes to Squidwarc, but I suspect that the issue is with setting the user data dir itself rather than letting the browser's normal resolution of that directory's path take place. So if there were a config option to not do anything data-dir/password related, the browser would figure it out correctly. //cc @N0taN3rd
Having to re-sign-in somewhat defeats the purpose of reusing the user data directory. It's reminiscent of the Webrecorder approach (:P) and is not nearly as powerful as reusing existing cookies/logins, if possible. With regard to the delta of Chrome versions for the system vs. what is used in Squidwarc, is there currently a way to tell Squidwarc to use a certain version of Chrome(ium)? Having that match up and reusing the data dir might be one needed test to see if the issue persists.
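For reference, the underlying Puppeteer API can already be pointed at a specific browser binary via its executablePath launch option; whether Squidwarc exposes this in its config is the open question. A sketch using Puppeteer directly (the paths are illustrative assumptions):

```js
const puppeteer = require('puppeteer')

;(async () => {
  // Launch the system Chrome instead of Puppeteer's bundled Chromium.
  const browser = await puppeteer.launch({
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    userDataDir: '/tmp/chrome-profile-copy' // e.g., the duplicated profile from above
  })
  const page = await browser.newPage()
  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly')
  await browser.close()
})()
```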
For the truly desperate, I was able to load some cookies by doing the following:
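The original snippet is not preserved here; below is a hypothetical reconstruction using Puppeteer's page.setCookie, assuming cookies have been exported to a cookies.json file:

```js
// Hypothetical reconstruction -- the original snippet is not shown.
// Loads cookies exported to JSON (e.g., via a browser extension) and
// injects them into the page before navigating.
const fs = require('fs')
const puppeteer = require('puppeteer')

;(async () => {
  const cookies = JSON.parse(fs.readFileSync('cookies.json', 'utf8'))
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  // Each cookie object needs at least { name, value, domain } per the
  // Puppeteer page.setCookie API.
  await page.setCookie(...cookies)
  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly')
  await browser.close()
})()
```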
This lets you "load" cookies into the session.
Are you submitting a bug report or a feature request?
Bug report.
What is the current behavior?
https://github.com/N0taN3rd/Squidwarc/blob/master/manual/configuration.md#userdatadir states that a `userDataDir` attribute can be specified to reuse the user directory of the system's Chrome. I use a logged-in instance of Chrome on my system, so I wanted to leverage my logged-in cookies to crawl content behind authentication using Squidwarc. I specify a config file for Squidwarc (see the sketch following the log below) and get this output:

```
Crawler Operating In page-all-links mode
Crawler Will Be Preserving 1 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At /private/tmp/Squidwarc
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Running user script
Crawler Generating WARC
Crawler Has 18 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Running user script
Crawler Generating WARC
Crawler Has 17 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Running user script
Crawler Generating WARC
Crawler Has 16 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Running user script
Crawler Generating WARC
Crawler Has 15 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Running user script
Crawler Generating WARC
Crawler Has 14 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Running user script
Crawler Generating WARC
Crawler Has 13 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Running user script
Crawler Generating WARC
Crawler Has 12 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Running user script
Crawler Generating WARC
Crawler Has 11 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Running user script
Crawler Generating WARC
Crawler Has 10 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Running user script
Crawler Generating WARC
Crawler Has 9 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Running user script
Crawler Generating WARC
Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Crawler Navigated To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Running user script
Crawler Generating WARC
Crawler Has 7 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Running user script
Crawler Generating WARC
Crawler Has 6 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Running user script
Crawler Generating WARC
Crawler Has 5 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Running user script
Crawler Generating WARC
Crawler Has 4 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/
A Fatal Error Occurred
Error: options.stripFragment is renamed to options.stripHash
index.js:35 module.exports
[Squidwarc]/[normalize-url]/index.js:35:9
_createHybrid.js:87 wrapper
[Squidwarc]/[lodash]/_createHybrid.js:87:15
puppeteer.js:155 PuppeteerCrawler.navigate
/private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11
Please Inform The Maintainer Of This Project About It. Information In package.json
```
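The config referenced above would look something like the following. This is a hypothetical sketch only: the mode, WARC naming, and seed are taken from the log output, but the key names and nesting other than `userDataDir` are assumptions; consult the linked manual page for the authoritative schema.

```jsonc
// Hypothetical sketch -- key names besides userDataDir are assumptions;
// see manual/configuration.md for the real schema.
{
  "use": "puppeteer",
  "mode": "page-all-links",
  "seeds": ["https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly"],
  "warc": { "naming": "url" },
  "connect": {
    "launch": true,
    "userDataDir": "/path/to/local/chrome/profile"
  }
}
```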
What is the expected behavior?
Squidwarc uses my local Chrome's cookies and captures the page behind authentication, per the manual.
What's your environment?
macOS 10.14.2
Squidwarc a402335 (current master)
node v10.12.0
Other information
We discussed this informally via Slack. Previously, I experienced this config script borking my Chrome's user directory (i.e., conventionally using Chrome would no longer allow creds to "stick") but can no longer replicate this.