Adding media files to a local corpus

Hi, I have a lot of files in MP3 format, and it seems Arbil doesn't allow this extension. If it doesn't, how can I solve this?

Hi Yago. A list of supported file types can be found in Appendix A of the LAMUS manual. You can also override the type checker's decision for a particular resource by right-clicking the resource in the working directories and selecting that option from the context menu. Be aware that if you do this, your corpus may be rejected at a later stage if your goal is to upload it into the archive.

Will this appendix be updated with better specifications of the allowed file formats at some point?

For the more complex formats, such as video, it has been somewhat hit and miss whether the IMDI specification will allow a certain file or not.

In the video section of the supported file types list only containers are listed, which is not sufficient. I know from first-hand experience that not all MPEG2 files will be allowed, depending on how they were encoded (MPEG is, as far as I know, unfortunately the name of both a codec and a container).

Most video containers allow for several variations of the audio and video streams. If I remember correctly, the problem when trying to upload MPEG seemed to be the audio stream: MP3 audio would not work whereas MP2 worked fine, even though MP3 is the more efficient of the two.

Since the supported file check in LAMUS is only done after the upload has finished (or when the files are test-linked in Arbil), preparing larger batches can turn into a somewhat frustrating process of trial and error with smaller test files.

By the way, what does “not accepted for new data, tolerated for legacy data for the moment” in the list mean for the end user or a data manager? Does it mean that the MPI might not accept these files if deposited, whereas external projects using their own corpus server with MPI software could opt to accept them by overriding server settings? (Sorry, this should probably have been asked in the LAMUS forum.)

Hello Jens, for MPEG2 (and MPEG1), in principle any audio encoding that is allowed according to the specification should be accepted, so also “layer 3” (MP3). What is not accepted are encodings that are not part of the MPEG2 standard, such as Dolby AC3, which is often used in camcorders. Files are also sometimes rejected if the header does not conform to the standard. You are right, though, that the list requires further elaboration, particularly for MPEG4, where we only accept a subset of what is allowed according to the spec. In any case, these type checker rules are indeed valid for the MPI archive, and other archives may have different requirements. In such cases you can override the type checker decision in Arbil, as Twan mentioned. The legacy data remark refers to the fact that we do have such formats already in the archive but do not allow any further deposits of them.

The file type check in LAMUS can unfortunately only be done on the server side after upload, but checking in Arbil is of course local on your machine and should not take too long for a small number of files.
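To see which codecs are actually inside a container before test-linking, one option is to inspect the streams locally. The following is only a rough sketch and not part of Arbil or LAMUS: it assumes the separate tool ffprobe (from FFmpeg) is installed, and the set ASSUMED_MPEG_AUDIO is an illustrative guess based on the description above, not the archive's real type checker rules.

```python
import json
import subprocess

# Assumption for illustration only: audio codec names (as reported by ffprobe)
# that belong to the MPEG audio layers. NOT the archive's actual rule set.
ASSUMED_MPEG_AUDIO = {"mp1", "mp2", "mp3"}


def probe_streams(path):
    """Run ffprobe on one file and return its JSON report as text."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "stream=codec_type,codec_name",
         "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarize_streams(ffprobe_json):
    """Turn ffprobe's JSON report into a list of (type, codec) pairs."""
    data = json.loads(ffprobe_json)
    return [(s.get("codec_type"), s.get("codec_name"))
            for s in data.get("streams", [])]


def unexpected_audio(streams):
    """Return the audio codecs that are not in the assumed MPEG audio set."""
    return [name for kind, name in streams
            if kind == "audio" and name not in ASSUMED_MPEG_AUDIO]
```

For a single file, unexpected_audio(summarize_streams(probe_streams("recording.mpg"))) would list audio codecs such as ac3 that deserve a closer look; the file name here is just a placeholder. The final decision still lies with the type checker itself.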

Yes, for MP4 I think Handbrake’s headers caused some problems in our case when trying out h264. We’ll have to check again later as we’re currently waiting for software upgrades.

I wonder whether it would be possible to create an offline batch file checker for larger data sets (and for those who manage the data on an entire corpus server using the MPI software), and whether a validation scheme similar to the one for the IMDI spec is available somewhere. That’s probably a request to a colleague of mine rather than the MPI, though. Arbil could possibly benefit from such a feature, but I realize it might be a bit niche.

Anyway, thank you for the information.

Hi Jens,

The part that does the validation in Arbil and LAMUS is a separate library that can also be run from the command line, so you could relatively easily use it to build your own batch file checker. If you’re interested in obtaining this library, send me an email at Alexander.Koenig@mpi.nl and I will send it to you.
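A batch checker built around such a command-line tool could look roughly like the sketch below. The library's actual command line is not described in this thread, so checker_cmd is a hypothetical placeholder, and the sketch assumes the tool takes one file path as its last argument and signals acceptance with exit code 0.

```python
import subprocess
from pathlib import Path


def check_tree(root, checker_cmd):
    """Run the checker once per file under root.

    checker_cmd is a list such as ["java", "-jar", "checker.jar"]; this is a
    hypothetical placeholder, the real invocation depends on the library.
    Returns {path: (exit_code, output)} for every regular file found.
    """
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        proc = subprocess.run([*checker_cmd, str(path)],
                              capture_output=True, text=True)
        results[str(path)] = (proc.returncode, proc.stdout.strip())
    return results


def rejected_files(results):
    """Assumption: a non-zero exit code means the file was rejected."""
    return [path for path, (code, _output) in results.items() if code != 0]
```

Calling rejected_files(check_tree("my_corpus", checker_cmd)) would then give the list of files to re-encode or look at by hand before uploading, so the trial and error happens offline rather than after an upload.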