I'm playing around with hadron and seeing some weird behavior where it puts extra newlines in the output files. Instead of emitting an LF, it emits CRLFLF. I have a fix that solves the problem for local runs, but I'm unsure about opening a PR because there was probably a reason it was done this way.
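For anyone trying to confirm the same symptom, here is a minimal sketch for counting the kinds of line endings in raw output bytes. It is not part of hadron; the function name `count_line_endings` and the sample data are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare LFs, and bare CRs in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contributes one CR and one LF, so subtract the pairs.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


if __name__ == "__main__":
    # Hypothetical output where each record ends in CRLFLF instead of LF.
    sample = b"key1\tvalue1\r\n\nkey2\tvalue2\r\n\n"
    print(count_line_endings(sample))
```

Reading the output file in binary mode (`open(path, "rb")`) matters here, since text mode may translate line endings before you get to count them.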
Yeah, the newlines were definitely put there to comply with Hadoop's expectations. At least when hadron was in heavy use and development, hadoop-streaming had lots of weird, largely undocumented expectations about the data passing through it. The entire Protocol machinery (and the Controller way of writing chains of MR jobs) was born out of trying to put a layer of abstraction/separation over Hadoop's warts. As a side note, the low-level core module is not really meant to be used in a stand-alone fashion.
Re: this specific fix: I can't remember the exact details around newlines, but I do remember that your MR job would crash if hadron did not get it exactly right. I think the overall point was that hadoop-streaming expected your data to be newline-separated, since it needed a way to delimit records as the fundamental unit of data. (Note that this also means your records can't contain any newlines themselves, hence the need for the Protocol abstraction.)
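To illustrate the constraint, here is a rough sketch of one way to keep records newline-free when the transport treats LF as the record delimiter. This is not hadron's actual Protocol implementation; the base64 encoding and the names `write_records` and `read_records` are assumptions chosen purely for illustration.

```python
import base64


def write_records(records):
    """Frame each record as one base64 line terminated by a single LF."""
    # Standard base64 output contains no LF, CR, or tab bytes, so the
    # encoded payload can never be mistaken for a record boundary.
    return b"".join(base64.b64encode(r) + b"\n" for r in records)


def read_records(stream: bytes):
    """Split on LF and decode each line back into the original bytes."""
    lines = stream.split(b"\n")
    # A well-formed stream ends with LF, which leaves one empty trailing piece.
    if lines and lines[-1] == b"":
        lines.pop()
    return [base64.b64decode(line) for line in lines]


if __name__ == "__main__":
    records = [b"line one\nline two", b"", b"with\r\ncarriage return"]
    framed = write_records(records)
    assert read_records(framed) == records
```

With framing like this, any extra LF or CRLF emitted between records would show up as a spurious empty record on the reading side, which may be related to why getting the delimiter exactly right mattered so much.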
All that said, hadron has not been actively developed in some time; I highly recommend deploying changes on a real hadoop-streaming instance and testing there before deciding on permanent changes.
I got everything working with Amazon EMR again. PR #8 contains the minimum changes needed to accomplish that. But I went further and tried it with the fix-newlines branch I mentioned above, and it looks like everything works. So I think we should be able to merge the fix-newlines branch and remove the annoying newline behavior, pending third-party verification.
TaktInc/hadron@master...TaktInc:fix-newlines
Any ideas what's going on here?