Newlines not collapsed from HTML #31

Open
Numerlor opened this issue Jan 24, 2021 · 6 comments

Comments

@Numerlor

After 97c78ef was merged, the newlines in the parsed HTML are no longer collapsed into normal spaces, resulting in erroneous line breaks in the output.

import markdownify

print(repr(markdownify.markdownify("""\
continuous
line of
text
""")))

In 0.6.1 the above code outputs 'continuous line of text', matching how the text would look when rendered in a browser,
while 0.6.3 preserves the newlines and outputs 'continuous\nline of\ntext'.
This causes problems when the HTML is hard-wrapped to a fixed width or when line breaks are used to separate tags, for example:

text before link
<a href="link">link text</a>
continued text
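
As a caller-side stopgap, the browser-style collapsing described above can be approximated before the HTML is handed to the converter. This is only a minimal sketch and not markdownify's own behaviour: `collapse_whitespace` is a hypothetical helper, and it assumes the input contains no `<pre>` blocks or other whitespace-sensitive content.

```python
import re


def collapse_whitespace(html: str) -> str:
    """Collapse each run of spaces, tabs and newlines into a single space.

    Hypothetical helper, not part of markdownify. Assumes no <pre> blocks.
    """
    return re.sub(r"[ \t\r\n]+", " ", html).strip()


wrapped = 'text before link\n<a href="link">link text</a>\ncontinued text\n'
print(repr(collapse_whitespace(wrapped)))
# 'text before link <a href="link">link text</a> continued text'
```

The collapsed string could then be passed to `markdownify.markdownify(...)` as usual.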
Numerlor changed the title from "Newlines not collapsed in HTML" to "Newlines not collapsed from HTML" on Jan 24, 2021
@AlexVonB
Collaborator

Hi! When rendered with markdown

continuous
line of
text

could render as a single line (see https://www.markdownguide.org/basic-syntax#line-breaks and https://github.github.com/gfm/#soft-line-breaks). It is up to the Markdown parser to decide how to handle it:

A conforming parser may render a soft line break in HTML either as a line break or as a space.

A quick test here shows that the GitHub renderer decides that it should be a hard linebreak:

continuous
line of
text

I am not sure if we are up to spec here, but it seems like we are. I'm open to all feedback on this issue!

Best,
Alex

@Numerlor
Author

Numerlor commented Feb 2, 2021

I'm not very familiar with the spec here; maybe a switch to trigger the line break behaviour would be most fitting.
I noticed this issue when parsing autogenerated HTML from the Python docs. For example, in the description tag at https://docs.python.org/3/library/stdtypes.html#str there are newlines inside the strings, which with the new handling results in quite a few additional (and unnecessary) newlines.

@AlexVonB
Collaborator

AlexVonB commented Feb 7, 2021

I'll look into that. On a related note, did you try the source of the generated docs? It's rst, which could easily be converted to Markdown using pandoc or something similar: https://github.com/python/cpython/blob/master/Doc/library/stdtypes.rst Right now you are converting HTML that was itself generated from rst into Markdown.

@IlyaBizyaev

Related is how headings and paragraphs are handled.

Example 1

md("<h2>Some Heading</h2>\n<p>Some text</p>", heading_style='ATX')

Expected:

## Some Heading\n\nSome text\n\n

Actual:

## Some Heading\n\n\nSome text\n\n

Example 2

md("<p>Paragraph 1</p>\n<p>Paragraph 2</p>")

Expected:

Paragraph 1\n\nParagraph 2\n\n

Actual:

Paragraph 1\n\n\nParagraph 2\n\n
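
Until the converter itself handles this, the extra blank line in both examples could be removed in a post-processing step. The sketch below is an assumption about one possible workaround, not something markdownify provides: `squeeze_blank_lines` is a hypothetical helper, and it would also squeeze blank lines inside code blocks in the generated Markdown.

```python
import re


def squeeze_blank_lines(markdown: str) -> str:
    """Reduce any run of three or more newlines to exactly two.

    Hypothetical post-processing helper, not part of markdownify.
    """
    return re.sub(r"\n{3,}", "\n\n", markdown)


actual = "## Some Heading\n\n\nSome text\n\n"
print(repr(squeeze_blank_lines(actual)))
# '## Some Heading\n\nSome text\n\n'
```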

@mirabilos

mirabilos commented May 30, 2023

I ran into similar issues and now use it like this in my local Feediverse clone:

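import re
from markdownify import markdownify

def cleanup(text):
    # Pre-process: normalise CR/CRLF and trim spaces around newlines
    text = re.sub('\r+\n?', '\n', text)
    text = re.sub(' *\n *', '\n', text)
    # '\1' is the control character U+0001, used as a temporary newline marker
    text = text.replace('\n', '\1')
    text = re.sub('\1\1\1+', '\n\n', text)  # 3+ newlines become a paragraph break
    text = re.sub('\1+ *', ' ', text).strip()  # remaining newlines become one space
    text = markdownify(text, strip=['img']).strip()
    # Post-process the Markdown output
    text = re.sub('  \n  \n', '\n\n', text)
    text = re.sub(' *\n\n+', '\n\n', text)
    return text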

This somewhat normalises newlines in the input before handing it to markdownify (assuming no <pre> tags are present), then post-processes the output to fix further whitespace issues. It assumes a renderer that creates a hard line break for each newline within a paragraph, which is natural for most comment forms and also for Fediverse clients, as it allows one to post mostly naturally.


Update: the snippet above breaks whitespace in pre tags, though. I have a more complex wrapper around Markdownify now; I guess I’ll have to put my patched Feediverse online some day.
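
One way the `<pre>` problem could be avoided is to collapse whitespace only outside `<pre>` blocks. The following is a rough sketch of that idea and not the wrapper mentioned above: `collapse_outside_pre` and `PRE_BLOCK` are hypothetical names, and the regular expression assumes well-formed, non-nested `<pre>` elements (an HTML parser would be more robust).

```python
import re

# The capturing group makes re.split keep the <pre> blocks in the result list.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.DOTALL | re.IGNORECASE)


def collapse_outside_pre(html: str) -> str:
    """Collapse whitespace runs to one space, leaving <pre> blocks untouched."""
    parts = PRE_BLOCK.split(html)
    # With a single capturing group, odd indices hold the matched <pre> blocks.
    return "".join(
        part if i % 2 else re.sub(r"\s+", " ", part)
        for i, part in enumerate(parts)
    )


sample = "some\nwrapped text\n<pre>keep\n  this</pre>\nmore\ntext"
print(repr(collapse_outside_pre(sample)))
# 'some wrapped text <pre>keep\n  this</pre> more text'
```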

@chrispy-snps
Collaborator

@Numerlor - the reflowing enhancement from #169 will allow you to set a wrap width of None:

print(
    repr(
        markdownify.markdownify("""
continuous
line of
text
""", wrap=True, wrap_width=None)
    )
)

which results in this (I see the leading/trailing spaces do survive though):

' continuous line of text '

Would you consider this issue fixed as a result?
