Indefero

Issue 627: Regression in the Diff parser

Reported by Thomas Keller, Mar 8, 2011

A regression originating from commit 336faa4503 breaks some commit 
views for the monotone backend (for interested parties, monotone 
commit 3d96895e9ff2ea7d2eaa5ad2c46f60cfbf146461).

The issue is that the parser won't recognize any next file diff in a 
multi-file diff output as long as the variable $indiff stays `true`. 
Reverting the previously mentioned change fixes the issue for mtn, 
but possibly breaks again with svn.

Maybe we should rework the parser to ultimatively recognize 
"+++" as start of a new file diff and not any of the 
ever-changing header lines of all the different SCMs. And we should 
really add some unit tests for the parser to ensure we do not break 
with any SCM's diff output again in the future.

Comment 1 by Patrick Georgi, Mar 8, 2011

Attached patch is a rewrite of the diff parser.

It parses diffs more directly:

1. skip non-diff data
2. read header (--- and +++ lines)
2a. if header malformed, go to 1
3. while hunk header is at the current position, parse chunk
3a. if hunk malformed, try if a new hunk starts at the current 
location
4. go to 1

So it tries to be robust in light of misformatted diffs, but I 
didn't test that yet. For common errors (eg. extra newlines in the 
text) it will drop some hunks that it could recover with some extra 
complexity, but the patch would be broken regardless.

Tested with a couple of mtn and git repos, but I don't have svn or 
hg around for testing.

Status: Started

Comment 2 by Patrick Georgi, Mar 8, 2011

Actually, I found an svn repo - looks good.
So only mercurial is still untested.

Comment 3 by Thomas Keller, Mar 9, 2011

We definitely need more tests for this change. There is at least one 
cornercase I found where the hunk parser will fail - for single line 
adds / deletes. GNU diff leaves ",1" optional, so lines 
like "@@ -l -l,s @@" or "@@ -l,s -l @@" are 
valid and must be parsable as well. In fact, if you create a new 
single line file in monotone and add it, it will return you such a 
hunk.

Other than that very good work. To get this into 1.1 we just need 
more confidence that it really works for at least all the cases it 
worked previously :)

Comment 4 by Thomas Keller, Mar 9, 2011

...and of course I meant 

"@@ -l +l,s @@" and "@@ -l,s +l @@"

Sorry for the noise.

Comment 5 by Patrick Georgi, Mar 9, 2011

Ah, wasn't aware of that quirk - the new patch has support for it.

I tested the issue with an ad-hoc test suite, but that requires more 
work before it's ready for public consumption.

Comment 6 by Patrick Georgi, Mar 16, 2011

Attached:
- Updated parser (slightly more compatible, though it shouldn't 
matter in real life)
- Patch to hopefully reduce memory use (by discarding references to 
string representations early)
- Test suite: Unpack to indefero toplevel directory, run with 
"php testdiff.php". Compares src/IDF/Diff.php to the 
original version (added as Diff.original.php)

Result: Identical behaviour except for svn, where some additional 
bits are added to the internal data structures (which are not used 
later). These additional bits are misinterpretations 
("empty" files for empty file additions which svn emits by 
its svn-header only, without a diff; svn metadata is added to the 
diff data, where it definitely doesn't belong)
 
I refuse to implement _that_ quirk in the new parser, too.

Comment 7 by Thomas Keller, Mar 16, 2011

The first two patches have been committed in revision baa8841. I 
added the missing method to set the strip path level for 
IDF_Scm_Mercurial as well and fixed a couple of whitespace issues.

I'll add some phpunit test infrastructure and the Diff tests in a 
few.

Comment 8 by Thomas Keller, Mar 17, 2011

Test infrastructure and test itself have been added in commit 61d7b4a
.
Status: Fixed

Created: 2 years 2 months ago by Thomas Keller

Updated: 2 years 2 months ago

Status: Fixed

Owner: Patrick Georgi

Labels:
Type:Defect
Milestone:Release1.1
Priority:Critical