Hadron inserts extra newlines in output #7

Open
mightybyte opened this issue Oct 8, 2017 · 2 comments

@mightybyte
Contributor

I'm playing around with hadron and seeing some weird behavior where it puts extra newlines in the output files. Instead of emitting a LF, it emits a CRLFLF. I have a fix that solves the problem for local runs, but I'm unsure about opening a PR because it seems like there was probably a reason it was done this way.

TaktInc/hadron@master...TaktInc:fix-newlines

Any ideas what's going on here?
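To make the symptom concrete, here is a minimal sketch of how a line-oriented reader would see the two terminators. It is not hadron code: the `split_records` helper and the tab-separated sample records are hypothetical, and it only assumes that the reader splits on LF.

```python
from typing import List


def split_records(raw: bytes) -> List[bytes]:
    """Split a byte stream into records on LF only (hypothetical helper)."""
    parts = raw.split(b"\n")
    # A stream that ends in LF leaves one trailing empty piece; drop it.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts


# Made-up sample output: two key/value records, terminated two different ways.
lf_output = b"k1\tv1\n" + b"k2\tv2\n"
crlflf_output = b"k1\tv1\r\n\n" + b"k2\tv2\r\n\n"

print(split_records(lf_output))      # two clean records
print(split_records(crlflf_output))  # records keep a trailing CR, with an empty record after each
```

With the CRLFLF terminator, every real record is followed by an empty one and carries a stray CR, which matches the "extra newlines" seen in the output files.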

@ozataman
Member

ozataman commented Oct 8, 2017

Yeah, the newlines were definitely put there to comply with hadoop's expectations. At least when hadron was in heavy use and development, hadoop-streaming had lots of weird expectations regarding the data passing through it with fairly little in the way of documentation. The entire Protocol machinery (and the Controller way of writing chains of MR jobs) was born out of trying to put a layer of abstraction/separation over hadoop's warts. As a side note, the low level core module is not really meant to be used in stand-alone fashion.

Re: this specific fix: I can't remember the exact details around newlines, but I do remember that your MR job would crash if hadron did not get it exactly right. I think the overall point was that hadoop streaming expected your data to be newline-separated, since it needed a way to separate records as the fundamental unit of data. Note that this also means your records can't contain any newlines themselves, hence the need for the Protocol abstraction.
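As a rough illustration of the constraint described above, the sketch below shows one generic way to keep arbitrary records newline-free when the transport is line-delimited. It is not hadron's Protocol implementation; the base64 framing and the `encode_records` / `decode_records` names are assumptions chosen only for the example.

```python
import base64
from typing import Iterable, Iterator


def encode_records(records: Iterable[bytes]) -> bytes:
    """Frame each record as one LF-terminated line (hypothetical framing).

    base64.b64encode never emits line breaks, so the encoded payload
    cannot collide with the LF record separator.
    """
    return b"".join(base64.b64encode(r) + b"\n" for r in records)


def decode_records(stream: bytes) -> Iterator[bytes]:
    """Recover the original records from the line-delimited stream."""
    for line in stream.splitlines():
        yield base64.b64decode(line)


# Records that contain LF and CRLF themselves still round-trip intact.
records = [b"line one\nline two", b"tab\tand\r\nCRLF"]
encoded = encode_records(records)
decoded = list(decode_records(encoded))
```

The trade-off in this sketch is size and readability: base64 inflates the payload by about a third and makes intermediate files opaque, so an escaping scheme may be preferable when records are mostly text.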

All that said, hadron has not been actively developed in some time; I highly recommend deploying changes on a real hadoop-streaming instance and testing there before deciding on permanent changes.

@mightybyte
Contributor Author

I got everything working with Amazon EMR again. PR #8 is for the minimum number of changes needed to accomplish that. But I went further and tried it with the fix-newlines branch I mentioned above and it looks like everything works. So I think we should be able to merge the fix-newlines branch and remove the annoying newline behavior pending third-party verification.
