Issue with decoding .html file #11

Open
vgzhn opened this issue Feb 25, 2021 · 6 comments

@vgzhn

vgzhn commented Feb 25, 2021

Welcome! Extracting video urls from Takeout.
Traceback (most recent call last):
  File "youtube_history.py", line 369, in <module>
    analysis.run()
  File "youtube_history.py", line 348, in run
    self.download_data()
  File "youtube_history.py", line 155, in download_data
    soup = BeautifulSoup(watch_history.read_text(), 'html.parser')
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2288.0_x64__qbz5n2kfra8p0\lib\pathlib.py", line 1236, in read_text
    return f.read()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2288.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3321: character maps to <undefined>

I've tried locating position 3321 and couldn't find anything obvious to remove. Inserting file = open(filename, errors="ignore") also didn't work for me.

I'm an absolute beginner with Python.
Maybe that could be avoided by using the .json takeout?
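
For anyone else trying to locate the offending byte: opening the file in a text editor tends to hide it, because the editor has already decoded the file. A minimal sketch for looking at the raw bytes instead is below. It assumes the position in the error message is a byte offset from the start of the file, which should hold when the whole file is read in one call; `bytes_around` and the `width` parameter are hypothetical names used only for illustration.

```python
from pathlib import Path


def bytes_around(path, position, width=8):
    """Return the raw bytes surrounding a byte offset, for inspection."""
    data = Path(path).read_bytes()          # no decoding happens here
    start = max(0, position - width)        # clamp at the start of the file
    return data[start:position + width + 1]
```

Calling something like print(bytes_around("watch-history.html", 3321)) would show the bytes on either side of the reported position; a run such as \xe2\x80\x9d there would suggest UTF-8 encoded punctuation rather than a corrupt file.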

@Jessime
Owner

Jessime commented Feb 27, 2021

Hey, @vgzhn, I'd assume that the .json file would have the same character somewhere. Is there any chance you could share the file/contents? I'm not sure how to debug without playing around with the data.

@Luminoxis

I had the same error, but passing encoding='utf-8' to read_text() seemed to fix it:
soup = BeautifulSoup(watch_history.read_text(encoding='utf-8'), 'html.parser')
After this the code got about 10000 results deep before it hit a very similar error (which I unfortunately lost), but at a different position (not 3321).
Some googling implied that decoding the readline() output as latin-1 would help:
line = p.stdout.readline().decode('latin-1').strip()
That gave me "Matplotlib is building the font cache using fc-list. This may take a moment."
I restarted the shell, and now it gives "OSError: invalid face handle".
This may be unrelated, since changing the encodings back still leaves the error, which it wasn't showing before. Honestly, I'm not sure what it's doing anymore, and I may just have messed something up.
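
A note on the latin-1 change above: latin-1 never raises because every byte maps to some character, but it can silently produce the wrong characters if the output is really UTF-8. One alternative, sketched below, is to decode as UTF-8 with errors="replace" so that undecodable bytes become U+FFFD instead of raising. This assumes p is a subprocess.Popen opened with stdout=subprocess.PIPE, as the readline() call suggests; `iter_output_lines` is a hypothetical helper, not a function from youtube_history.py.

```python
import subprocess


def iter_output_lines(cmd, encoding="utf-8"):
    """Yield decoded, stripped lines from a child process's stdout."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as p:
        for raw in p.stdout:
            # errors="replace" turns undecodable bytes into U+FFFD
            # rather than raising UnicodeDecodeError mid-run.
            yield raw.decode(encoding, errors="replace").strip()
```

The trade-off is that a bad byte shows up as a replacement character in the affected line rather than stopping the whole run.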

@rstebee

rstebee commented Jan 4, 2022

I'm having the same exact problem and I have no idea what to do

@Jessime
Owner

Jessime commented Jan 6, 2022

> I'm having the same exact problem and I have no idea what to do

@rstebee if you can post some or all of the data that's causing trouble, that'll help a lot.

@barbatoz0220

Hi @Jessime,

First of all, thank you very much for this awesome work.

I was trying out this project last night and ran into the exact error posted here. After a bit of looking around, I found 2 threads on StackOverflow that helped me find a workaround.

So, in the file youtube_history.py, I went to the line soup = BeautifulSoup(watch_history.read_text(), 'html.parser') and modified it as follows:

with open(watch_history, encoding='utf8') as history:
    # The file is opened in text mode and already decoded, so from_encoding is not needed here
    soup = BeautifulSoup(history, 'html.parser')

It seems like watch-history.html is encoded in UTF-8, and, as the error says, the default Windows encoding used here (cp1252) cannot decode the byte 0x9d. That byte is most likely the last byte of the UTF-8 encoding of ” (right double quotation mark, U+201D), which is E2 80 9D.
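
The explanation above can be checked without the Takeout file. The sketch below encodes a right double quotation mark as UTF-8 and then tries to decode the result as cp1252, which is the codec named in the traceback. The variable names are only illustrative.

```python
quote = "\u201d"                      # right double quotation mark
utf8_bytes = quote.encode("utf-8")    # three bytes: e2 80 9d

# Decoding with the matching codec round-trips cleanly.
assert utf8_bytes.decode("utf-8") == quote

# cp1252 has no character assigned to byte 0x9d, so decoding fails there.
try:
    utf8_bytes.decode("cp1252")
except UnicodeDecodeError as exc:
    failing_byte = utf8_bytes[exc.start]
    print(hex(failing_byte))          # 0x9d
```

This is why passing encoding='utf8' explicitly avoids the error: the file is read with the codec it was written in instead of the platform default.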

Hope this helps with your problem. I'm also a Python noob so please feel free to propose a better solution 👏 .

@Jessime
Owner

Jessime commented Feb 12, 2022

Hey all, this commit should fix things up:

615b48f

Thanks for reporting the issues!
