From de57ac1d878ef1e8944741e891288c144b8eff0c Mon Sep 17 00:00:00 2001
From: Vincent Barbaresi
Date: Tue, 2 Jan 2024 14:01:51 +0100
Subject: [PATCH] Drop invalid XML element attributes (#462)

* drop invalid XML element attributes

Fixes issue #375

The bug happened when an element attribute contained a `:` that didn't match any XML namespace (invalid XML).
In the example it was `padding:1px=""; margin:15px=""`

We can work around it by manually dropping those invalid attributes.
I hope it doesn't impact performance too much.

To reproduce:
`trafilatura -u https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G --xml`

Minimal reproduction example:
```
echo 'Testing