Cannot work with newlines with latest parsing update #3

Open
LogicalChaos opened this issue Jan 22, 2015 · 7 comments

@LogicalChaos
Contributor

I forked to implement a Perforce parser. I have it completed, but when I went to merge with your latest change, which splits the parsing into chunks, I was unable to get it functioning again. I've found that if I remove all newlines except those between change sets, I can get it parsing again. But that involves quite a bit of data massaging to clean up the Perforce log. Do you have any suggestions? If you look on my perforce branch, you can see my grammar.

@adamtornhill
Owner

Sounds cool with a Perforce parser - would definitely be a good addition.
First some background on my latest change: Instaparse is quite memory hungry. When I parsed the complete log in one pass, Code Maat ran out of memory on larger logfiles. That's why I chose to split the log into smaller parts and feed those to Instaparse one by one (you'd run into the same problem with your current Perforce parser).

I see that you re-used the hiccup-based-parser. As you probably noticed, that's the one that does the chunking. In the current version, I split the log on each blank line (see function extend-when-complete). That works fine for both Git and Mercurial, which don't have any blank lines within their entries. But it won't work for Perforce, which includes several blank lines in each entry.
I'd suggest that you identify a different criterion that's capable of identifying the end of a Perforce entry. Then you have to parameterize the hiccup-based-parser with that criterion (end-of-log-entry? perhaps).
Does that sound reasonable?
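
To make the idea concrete, here is a minimal sketch of chunking with a pluggable end-of-entry criterion. It is written in Python purely for illustration and is not Code Maat's actual implementation. The names `split_entries`, `blank_line_ends_entry` and `change_header_ends_entry` are hypothetical, and the Perforce criterion assumes that every entry starts with an unindented `Change <number> by ...` header while description lines are tab-indented.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Hypothetical criterion type: given the entry collected so far and the
# upcoming line, decide whether the entry is complete.
EndOfEntry = Callable[[List[str], str], bool]


def blank_line_ends_entry(entry: List[str], line: str) -> bool:
    """Suits logs that have no blank lines inside an entry."""
    return line.strip() == ""


# Assumption: a Perforce entry begins with an unindented "Change <n> by " line.
_CHANGE_HEADER = re.compile(r"^Change \d+ by ")


def change_header_ends_entry(entry: List[str], line: str) -> bool:
    """Suits logs where blank lines also appear inside an entry."""
    return _CHANGE_HEADER.match(line) is not None


def split_entries(lines: Iterable[str], end_of_entry: EndOfEntry) -> Iterator[List[str]]:
    """Yield one entry at a time so each can be handed to the parser alone."""
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if entry and end_of_entry(entry, line):
            yield entry
            entry = []
        if line.strip():  # assumption: blank lines carry no data, so drop them
            entry.append(line)
    if entry:  # the last entry has no following line to close it
        yield entry
```

With this shape, the chunking loop stays the same for every version-control system and only the criterion changes, for example `split_entries(log_lines, change_header_ends_entry)` for a Perforce log.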

@LogicalChaos
Contributor Author

What you said makes perfect sense, but is beyond me :-) I changed the log generation to make it consistent with the others regarding blank lines ... | xargs -I commitid -n1 sh -c 'p4 describe -s commitid | grep -v "^\s*$" && echo "'. If you're up for it (I'd need major help), I'd like to add churn capabilities. The output Perforce spits out adds the following to each change set described.

Differences ...
==== //depot/project/Command.cpp#9 (text) ====
add 1 chunks 10 lines
deleted 0 chunks 0 lines
changed 0 chunks 0 / 0 lines

Thoughts?
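
As a starting point for discussion, here is a rough sketch of how the "Differences ..." block above could be turned into per-file churn numbers. It is in Python for illustration only and is not part of Code Maat. The function name `parse_churn` and the output shape are hypothetical. It also assumes that in a `changed N chunks A / B lines` line, A counts the replaced (old) lines and B the replacing (new) lines, so A is treated as deleted and B as added.

```python
import re
from typing import Dict, Iterable

# Patterns modelled on the sample output quoted above.
FILE_RE = re.compile(r"^==== (?P<path>.+)#(?P<rev>\d+) \((?P<type>[^)]*)\) ====$")
ADD_RE = re.compile(r"^add (\d+) chunks (\d+) lines$")
DEL_RE = re.compile(r"^deleted (\d+) chunks (\d+) lines$")
CHG_RE = re.compile(r"^changed (\d+) chunks (\d+) / (\d+) lines$")


def parse_churn(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map each depot path to its added and deleted line counts."""
    churn: Dict[str, Dict[str, int]] = {}
    current = None  # counters of the file whose summary lines we are reading
    for raw in lines:
        line = raw.strip()
        header = FILE_RE.match(line)
        if header:
            current = churn.setdefault(header.group("path"), {"added": 0, "deleted": 0})
            continue
        if current is None:
            continue  # ignore everything before the first file header
        added = ADD_RE.match(line)
        deleted = DEL_RE.match(line)
        changed = CHG_RE.match(line)
        if added:
            current["added"] += int(added.group(2))
        elif deleted:
            current["deleted"] += int(deleted.group(2))
        elif changed:
            # Assumption: "A / B" means A old lines were replaced by B new lines.
            current["deleted"] += int(changed.group(2))
            current["added"] += int(changed.group(3))
    return churn
```

Fed the five sample lines above, this would report 10 added and 0 deleted lines for `//depot/project/Command.cpp`.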

@LogicalChaos
Contributor Author

Hmmm... With the new parser, I'm getting Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, which is obviously different from the previous OOM problems. This is with change logs that failed previously. I'm looking to see if I can pinpoint what in the data is causing this. I suspect it's a change set with ~26000 files associated with it from a copy/merge operation.
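
For pinpointing oversized change sets, something along these lines might help. It is a minimal Python sketch, separate from Code Maat, with hypothetical names (`files_per_change`, `largest_changes`). It assumes `p4 describe -s` style output in which each change starts with a `Change <number> by ...` line and each affected file is listed on a line starting with `... //`. The default threshold of 1000 files is arbitrary.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: each change set starts with an unindented "Change <n> by " line.
CHANGE_RE = re.compile(r"^Change (\d+) by ")


def files_per_change(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (change number, affected file count) in log order."""
    counts: List[List[int]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        header = CHANGE_RE.match(line)
        if header:
            counts.append([int(header.group(1)), 0])
        elif line.startswith("... //") and counts:
            # Assumption: affected files are listed as "... //depot/path#rev action".
            counts[-1][1] += 1
    return [(change, n) for change, n in counts]


def largest_changes(lines: Iterable[str], min_files: int = 1000) -> List[Tuple[int, int]]:
    """Change sets touching at least min_files files, biggest first."""
    big = [pair for pair in files_per_change(lines) if pair[1] >= min_files]
    return sorted(big, key=lambda pair: pair[1], reverse=True)
```

Running `largest_changes(open("p4.log"))` on the generated log (the file name is only an example) would list the change numbers worth inspecting or excluding first.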

@adamtornhill
Owner

I've seen the GC overhead limit exception as well on the earlier version of Code Maat before the memory optimization. Did you manage to get the chunking working now? That should solve this issue as well.
I'll have a look at your pull request during next week - thanks for the contribution!

@LogicalChaos
Contributor Author

Yes, I got the chunking working. The problem occurs with a log of ~1400 change lists (35k lines) when any individual change list goes over ~50 files. I can privately send you a problem file if you want.

@adamtornhill
Owner

Yes, please do that and I'll have a look. You can contact me at adam at adamtornhill dot com

@LogicalChaos
Contributor Author

I've sent two files, ~1MB compressed total.
