Issue with NUL characters being appended to EAFs

Hi there, just wanted to report a minor issue. Not certain that this is a problem with ELAN itself, but it may be, so better safe than sorry.

While doing a Multiple Files Search across a fairly large domain (about 150 EAF files), the search failed with parsing errors on 4 specific files. Looking at the log, it turned out to be a “Content is not allowed in trailing section” error. From my understanding, that’s an error from Java and/or the XML parsing library, indicating that the XML is not well-formed because there is content after the closing tag of the root element.
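The same class of failure can be reproduced outside ELAN. The sketch below uses Python's standard `xml.etree.ElementTree` rather than the Java parser ELAN uses, so the error text will differ, but any conforming XML parser should reject bytes that follow the root element's closing tag. The helper name `is_well_formed` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if data parses as a single well-formed XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        # Raised for anything the parser rejects, including stray
        # bytes after the closing tag of the root element.
        return False


clean = b"<ANNOTATION_DOCUMENT></ANNOTATION_DOCUMENT>"
broken = clean + b"\x00" * 4  # simulate the trailing NUL characters

print(is_well_formed(clean))
print(is_well_formed(broken))
```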

On inspecting the offending files, it turned out they each had a string of NUL characters (i.e. ASCII 0 or U+0000) appended after the closing </ANNOTATION_DOCUMENT> tag. In one case it was only a handful; in the other three cases it was hundreds.
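For a corpus of this size, the affected files can be found without opening each one. This is a minimal sketch, assuming the EAFs are stored as UTF-8 (where a NUL character is a single 0x00 byte) and use the `.eaf` extension; the function names and the example path are hypothetical.

```python
from pathlib import Path


def count_trailing_nuls(data: bytes) -> int:
    """Return how many NUL bytes (0x00) sit at the very end of data."""
    return len(data) - len(data.rstrip(b"\x00"))


def find_affected(root: str) -> dict:
    """Map each .eaf file under root to its trailing-NUL count, if non-zero."""
    affected = {}
    for path in sorted(Path(root).rglob("*.eaf")):
        n = count_trailing_nuls(path.read_bytes())
        if n:
            affected[str(path)] = n
    return affected


if __name__ == "__main__":
    # "corpus" is a placeholder directory name.
    for name, n in find_affected("corpus").items():
        print(f"{name}: {n} trailing NUL byte(s)")
```

The scan only reads the files, so it is safe to run on the live corpus.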

I couldn’t find any references to this exact issue here on these forums. But looking around elsewhere, I found people using Java XML libraries reporting similar issues, especially when writing buffers that had undergone some kind of encoding conversion. That might be a coincidence of symptoms, though. I thought I’d report it here, to be safe.

Any advice about manually fixing these files would be appreciated. It’s straightforward enough to open the EAFs in a text editor and remove the trailing NUL characters; testing this on copies of the files didn’t seem to cause any issues, and it did fix the search problem. I just wondered whether that’s considered safe, or whether there is a better way.
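The text-editor fix could also be scripted. The following is only a sketch under the same assumptions as the manual edit: the file is UTF-8 and the only damage is a run of 0x00 bytes at the very end. It keeps a `.bak` copy before touching anything; `strip_trailing_nuls` is a made-up name.

```python
import shutil
from pathlib import Path


def strip_trailing_nuls(path: str) -> int:
    """Remove NUL bytes from the end of a file; return how many were removed.

    A backup named <file>.bak is written first. Files without trailing
    NULs are left untouched and no backup is created.
    """
    p = Path(path)
    data = p.read_bytes()
    cleaned = data.rstrip(b"\x00")
    removed = len(data) - len(cleaned)
    if removed:
        shutil.copy2(p, p.with_name(p.name + ".bak"))
        p.write_bytes(cleaned)
    return removed
```

Only bytes after the last non-NUL byte are dropped, so everything up to and including the closing tag is preserved exactly as it was.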

Update: after doing some tests on copies of our corpus files, I discovered you can fix the files by opening one, making some arbitrary change, saving the file, reverting the change by hand, and then saving again. This re-writes all the XML without any of the trailing NUL characters.

Our team are using 5.1 currently. We will likely be updating to 5.3 at a future date. I can’t be sure that it was ELAN that did this, although no-one is supposed to be using any other script or editor on the files. The “last modified” dates of the four affected files were all different.

Any advice or theories about how this occurred gratefully received.

I’m not familiar with this problem. As to the solution, I guess both suggestions are fine: saving again with ELAN, or editing in a Unicode-aware text editor.

I wouldn’t know how ELAN could create such files with trailing NUL characters, or why it would do so sometimes and not every time. If you have a theory (e.g. based on what those 4 files have in common), it would be something we could test.

-Han

Thanks Han. The files don’t seem to have anything in particular in common - they’re stored in different places, represent different participants and different tasks, and have different “last modified” dates …

If this is caused by ELAN (I’m not certain it is) and I find a way of reliably reproducing it, I’ll feed that back.

Cheers

A bit more data: it seems that different team members may be using different versions of ELAN - most are on 4.9.4 but there may also be instances of 5.1.

The method I described for fixing a broken file (i.e. one with long strings of NUL characters appended) does not work if you use 4.9.4. For example, if you add a single character to an annotation and save the file, the string of NUL characters shrinks by one. Remove the extra character and save again, and it grows by one again. However, if you make any change to an affected file in 5.3 and save, all the bad NUL characters disappear.

It may be that these problem files were edited by someone using 5.1 and then later edited by someone with 4.9.4. It is almost as though the Mac 4.9.4 version is saving the entire loaded buffer, even when the data has been edited to be shorter, which results in the appended NUL characters.
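The behaviour described above can be illustrated with a toy model. This is purely hypothetical and not taken from ELAN's source: it assumes a writer that always emits a fixed-size buffer and pads unused space with NUL bytes, which would keep the file size constant while the NUL run shrinks or grows with each edit. `save_whole_buffer` and `buffer_size` are invented names.

```python
def save_whole_buffer(content: bytes, buffer_size: int) -> bytes:
    """Model a save that writes the whole buffer, not just the content.

    Any space in the buffer beyond the content is filled with NUL bytes.
    If the content outgrows the buffer, the buffer grows to fit.
    """
    size = max(buffer_size, len(content))
    return content.ljust(size, b"\x00")


original = b"<A>text</A>"
size_on_disk = len(original) + 5  # pretend 5 NULs are already appended

longer = save_whole_buffer(b"<A>textX</A>", size_on_disk)  # one character added
same = save_whole_buffer(original, size_on_disk)           # character removed again

# Total size stays constant; only the NUL run changes length.
print(len(longer), longer.count(b"\x00"))
print(len(same), same.count(b"\x00"))
```

A writer that truncates the file to the length of the content on every save would not show this pattern, which would be consistent with the NULs disappearing after a save in the newer version.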

We need to upgrade to 5.3 anyway to make use of the new controlled vocab features so hopefully that will solve this problem as well.

I can’t explain this. As far as I can see, the same XML libraries are used in those versions of ELAN, and I believe the way a file is saved has not changed since 4.9.4.
I’m now wondering if multiple file search in 4.9.4 had a problem with those 4 files too (but no need to test this).