Issue with decoding .html file #11

vgzhn · 2021-02-25T12:57:05Z

Welcome! Extracting video urls from Takeout.
Traceback (most recent call last):
  File "youtube_history.py", line 369, in <module>
    analysis.run()
  File "youtube_history.py", line 348, in run
    self.download_data()
  File "youtube_history.py", line 155, in download_data
    soup = BeautifulSoup(watch_history.read_text(), 'html.parser')
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2288.0_x64__qbz5n2kfra8p0\lib\pathlib.py", line 1236, in read_text
    return f.read()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2288.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3321: character maps to <undefined>

I've tried locating position 3321 and couldn't find anything obvious to remove. Inserting file = open(filename, errors="ignore") didn't work for me either.

I'm an absolute beginner with Python.
Maybe this could be avoided by using the .json Takeout?
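
For anyone else trying to find the offending spot: the position in the error message should be a byte offset into the file, not a character count, so searching for it in a text editor may not land on anything obvious. A minimal sketch for looking at the raw bytes around that offset follows. The helper name `bytes_around` and the file name `watch-history.html` are assumptions for illustration, not part of youtube_history.py.

```python
from pathlib import Path


def bytes_around(path, position, context=20):
    """Return the raw bytes surrounding a byte offset in a file."""
    data = Path(path).read_bytes()
    start = max(0, position - context)
    return data[start:position + context + 1]


if __name__ == "__main__":
    # Assumed file name; point this at your own Takeout file.
    history = Path("watch-history.html")
    if history.exists():
        # 3321 is the position reported in the traceback above.
        print(bytes_around(history, 3321))
```

Printing the bytes rather than decoded text avoids triggering the same decoding error while inspecting the file.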


Jessime · 2021-02-27T23:29:37Z

Hey, @vgzhn, I'd assume that the .json file would have the same character somewhere. Is there any chance you could share the file/contents? I'm not sure how to debug without playing around with the data.

Luminoxis · 2021-03-06T03:48:51Z

I had the same error, but changing the encoding passed to read_text to utf-8 seemed to fix it:
soup = BeautifulSoup(watch_history.read_text(encoding='utf-8'), 'html.parser')
After this the code got about 10000 results deep before it hit a very similar error (which I unfortunately lost), but at a different position (not 3321).
Some googling implied that changing the readline decoding to latin-1 would help:
line = p.stdout.readline().decode('latin-1').strip()
which gave me: Matplotlib is building the font cache using fc-list. This may take a moment.
I restarted the shell, and now it gives: OSError: invalid face handle
This may be unrelated, since changing the encodings back still leaves the error, which it wasn't giving before. Honestly I'm not sure what it's doing anymore, and I may just have messed it up somehow.
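
A note on the latin-1 change: latin-1 maps every byte to some character, so it never raises, but any multi-byte UTF-8 text in the output comes back garbled. One alternative, sketched below, is to keep decoding as UTF-8 and replace only the bytes that cannot be decoded. The helper name `decode_line` is hypothetical and does not exist in youtube_history.py; whether the subprocess output is actually UTF-8 is an assumption.

```python
def decode_line(raw: bytes) -> str:
    """Decode one line of subprocess output, tolerating stray non-UTF-8 bytes."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode("utf-8", errors="replace").strip()
```

With this, the line from the comment above would become something like line = decode_line(p.stdout.readline()).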

rstebee · 2022-01-04T23:17:58Z

I'm having the same exact problem and I have no idea what to do

Jessime · 2022-01-06T05:39:28Z

> I'm having the same exact problem and I have no idea what to do

@rstebee if you can post some or all of the data that's causing trouble, that'll help a lot.

barbatoz0220 · 2022-02-12T03:25:54Z

Hi @Jessime,

First of all, thank you very much for this awesome work.

I was trying out this project last night and ran into the exact error posted here. After a bit of looking around I found 2 threads on StackOverflow that helped me find a workaround:

So, in the file youtube_history.py, I went to the line soup = BeautifulSoup(watch_history.read_text(), 'html.parser') and modified it as follows:

with open(watch_history, encoding='utf8') as history:
   soup = BeautifulSoup(history, 'html.parser', from_encoding="utf8")

It seems like watch-history.html is encoded in UTF-8 and, as the error says, the default encoding on Windows machines (cp1252) cannot decode the byte 0x9d. That byte is most likely the last byte of a ” (right double quotation mark), which UTF-8 encodes as the three bytes E2 80 9D.
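
To illustrate the explanation above, here is a small sketch that reproduces the failure on a single character and reads a file with an explicit encoding. The names `RIGHT_QUOTE`, `fails_in_cp1252`, and `read_history` are made up for this example; that the Takeout file is UTF-8 is an assumption based on the workaround working.

```python
from pathlib import Path

RIGHT_QUOTE = "\u201d"  # right double quotation mark


def fails_in_cp1252(text: str) -> bool:
    """Return True if the UTF-8 bytes of `text` cannot be decoded as Windows-1252."""
    try:
        text.encode("utf-8").decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False


def read_history(path) -> str:
    """Read the Takeout HTML as UTF-8 regardless of the platform default encoding."""
    return Path(path).read_text(encoding="utf-8")


# The third byte of the encoded quote is 0x9d, which cp1252 leaves unassigned.
print(RIGHT_QUOTE.encode("utf-8"))
print(fails_in_cp1252(RIGHT_QUOTE))
```

Passing the result of `read_history(watch_history)` to BeautifulSoup should behave the same as the with-block above, since in both cases the text is already decoded before parsing.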

Hope this helps with your problem. I'm also a Python noob so please feel free to propose a better solution 👏 .

Jessime · 2022-02-12T18:26:22Z

Hey all, this commit should fix things up:

615b48f

Thanks for reporting the issues!
