regex/capturing parentheses in multi-file, multi-layer search

August 22, 2013 at 04:38 #8796

Tom Honeyman

Hi,

I can see that for regular expression matching in the multi-layer search I can use capturing parentheses to match a previous match within the same annotation. I can’t seem to figure out how this might work across columns or layers. Is it possible to capture the match from one annotation and match it in another? If not, it would be a very useful addition!

For instance, I would like to find an annotation ending in (\w+)$ in the context of that same word occurring at, say, the end of the next annotation \1$ (or a million other useful searches I can think of!).

If, in the single-layer search, it were possible to treat all annotations as being separated by a newline character, then such a search would be easy to implement.
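
To illustrate the newline idea outside ELAN, here is a minimal Python sketch. It is not an ELAN feature: the helper name `repeated_final_words` and the sample tier are made up, and it assumes that no annotation contains a newline itself and that annotations end in a word character rather than punctuation.

```python
import re

# Each annotation becomes one line, so "$" in MULTILINE mode marks the end of
# an annotation. The lookahead checks the next line without consuming it, so
# that line can itself start the next match.
PATTERN = re.compile(r"\b(\w+)$(?=\n.*\b\1$)", re.MULTILINE)


def repeated_final_words(annotations):
    """Return (index, word) for each annotation whose last word also ends the next one."""
    text = "\n".join(annotations)
    hits = []
    for m in PATTERN.finditer(text):
        # The number of newlines before the match is the annotation index.
        index = text.count("\n", 0, m.start())
        hits.append((index, m.group(1)))
    return hits


tier = ["he ties the rope", "then he tests the rope", "he climbs down", "and walks down"]
print(repeated_final_words(tier))  # [(0, 'rope'), (2, 'down')]
```

The backreference `\1` only works here because both annotations end up in one search string, which is exactly what the current per-annotation matching does not offer.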

Cheers,
Tom Honeyman

September 3, 2013 at 16:10 #8831

Han

Yes, this sounds like a useful extension. I don’t know how difficult it would be to build that in, but we’ll look into that. Some of what you want could maybe be achieved with the new variable match mode, if that could be combined with regular expressions.
I’ll add it to the wish list.

-Han

September 9, 2013 at 16:59 #8841

Eric

Hi Tom, you could consider using tiers with one word per annotation or use word-boundary regexp metacharacters to implement your “search as if there were line breaks between annotations” suggestion. I am tempted to also suggest N-gram search (with # as word wildcard) but cannot fit that into your specific example, so maybe my intuition is wrong here. However, there is another interesting new function that could help you:

Since the multicon project, complex multi layer search can use variables. If I assume that you have a word wise annotation tier where the final punctuation also is a separate annotation, you could search for:

$1 directly followed by “.” anywhere before $1 directly followed by “.”

In other words, two sentences ending with the same word. While you cannot do this in ELAN or Trova yet, the limitation is only in the user interface: at the moment, you cannot mix field types (variable, regexp, exact, substring) within one query because that would clutter your screen with buttons next to each field. The engine would allow it, though. So if you have an idea for the user interface, we could enable it. Note that using a variable in the first field of a query can make it much slower, because all annotations then have to be considered as values for that variable.

Another trick could be using both a word tier and a sentence tier, matching two arbitrary but adjacent sentences in the latter and any but identical words in the former. Add time overlap constraints to say that the words have to end when the sentences end, then you match “last words in sentences”. This works with all fields set to variable, already with the current version of ELAN and Trova: “In any sentence tier, $1 directly followed by $2; in any word tier $3 ends when $1 ends and later $3 ends when $2 ends”.

This query uses 2×2 fields and 3 different variable names, with $3 used twice. “Directly followed” means “0 annotations between”, and “later” means “at any later point”. You can also add a constraint saying that the sentence and word tiers have to be connected, as siblings or parent-child.
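
For readers who want to see the logic of that last query spelled out, here is a rough Python sketch. It is not how the ELAN or Trova engine works internally: the `Annotation` class, the function names and the sample times are all invented, and tiers are simplified to plain lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # begin time in milliseconds
    end: int    # end time in milliseconds
    value: str


def final_word(sentence, words):
    """Return the word annotation that ends exactly when the sentence ends, if any."""
    for word in words:
        if word.end == sentence.end:
            return word
    return None


def same_final_word_pairs(sentences, words):
    """Adjacent sentences ($1 directly followed by $2) whose final words ($3) are equal."""
    ordered = sorted(sentences, key=lambda s: s.start)
    pairs = []
    for first, second in zip(ordered, ordered[1:]):
        w1 = final_word(first, words)
        w2 = final_word(second, words)
        if w1 is not None and w2 is not None and w1.value == w2.value:
            pairs.append((first.value, second.value, w1.value))
    return pairs


sentences = [
    Annotation(0, 2000, "he ties the rope"),
    Annotation(2000, 4500, "then he tests the rope"),
    Annotation(4500, 6000, "he climbs down"),
]
words = [
    Annotation(1500, 2000, "rope"),
    Annotation(4000, 4500, "rope"),
    Annotation(5500, 6000, "down"),
]
print(same_final_word_pairs(sentences, words))
```

The time comparison `word.end == sentence.end` plays the role of the “ends when” overlap constraint, and reusing one value for both final words corresponds to using $3 twice.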

September 10, 2013 at 01:08 #8846

Tom Honeyman

Hi Eric,

Thanks very much (and thanks to Han for the original response too!).

I’ll have to play around with variable match. I hadn’t noticed that new feature (that seems to be happening a lot lately! Many new features). Is it greedy? i.e. if I search across > N multiple annotations, can I make it stop searching for any subsequent $1 if something else comes up in the meantime (e.g. a ‘.’)?

The actual search I am trying to do is a little more complex. I just simplified it for the question, because I was primarily concerned about the functionality. I am actually looking for instances of what’s called ‘tail-head linkage’. So, for instance, in a narrative text, the end of one sentence starts the next, e.g.:

“He ties the rope and tests it. Having tested it, he climbs down and…”

The words I am matching might be slightly different (e.g. ‘tests’ versus ‘tested’) but predictable in the language I am matching. And so I’m not just blind matching words at the end of sentences, and in fact I may be looking for portions of words with predictable mutations. And what gets repeated and where is a little variable too, so actually it could be a number of complex searches.

So your second trick doesn’t quite work for what I’m after either, but I can see how it works and might think of uses for it, thanks. Am I right in thinking that variable mode basically allows me equivalence/non-equivalence matching, and then I can fine tune this using the options in the drop down boxes? I can’t at any point also search for a string? The simplest version of the search that I actually want is that the last word from one sentence is the second last word from the next. Even better if I can actually specify that last word as it’s always the same. I can’t see how to do this using this trick.
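
To pin down that simplest version, here is a small Python sketch of the comparison itself. It works on plain word lists rather than on ELAN tiers, and the function name, its parameters and the sample sentences are hypothetical.

```python
def tail_head_candidates(sentences, offset_from_end=2, fixed_word=None):
    """Return indexes i where the last word of sentence i is repeated in sentence i + 1.

    sentences: list of sentences, each a list of words.
    offset_from_end: position in the next sentence, counted from its end
        (2 means the second-last word).
    fixed_word: if given, only report hits where the repeated word is this one.
    """
    hits = []
    for i in range(len(sentences) - 1):
        current, following = sentences[i], sentences[i + 1]
        if not current or len(following) < offset_from_end:
            continue
        tail = current[-1]
        if fixed_word is not None and tail != fixed_word:
            continue
        if following[-offset_from_end] == tail:
            hits.append(i)
    return hits


sentences = [
    ["he", "ties", "it"],
    ["then", "it", "falls"],
    ["he", "climbs", "down"],
]
print(tail_head_candidates(sentences))                     # [0]
print(tail_head_candidates(sentences, fixed_word="down"))  # []
```

This only covers exact equality of words. Predictable mutations such as ‘tests’ versus ‘tested’ would need a pattern comparison instead of `==`.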

As for the interface, that’s a tricky one. It is already quite busy. And I’m no fan of right-clicking to turn features on and off. It would be a bit of a kludge, but what about changing the input fields to combo boxes with text input where the drop down menu gave you options to set the match type?

Being a regex user, I’m happy enough only using regexps! But I realise that it’s not for everyone. It would be better in my opinion if ‘variable’ matches were incorporated into regexps, but I admit it would make simple equivalence matching pretty complex for the average user.

I can see that carrying over regex matched variables to other fields would be a nightmare to program too. e.g. if there are capturing parentheses in multiple fields then where do you start numbering the matches? And could a match in one field apply in another while a match in that other field applies in the first? Hence my suggestion of treating a single tier as being separated by newline characters (or null characters?) – that way it’s a non-complex search domain and you wouldn’t need to extract portions of one match and insert them into others. Of course, it would be a problem if people added newlines to their annotations… and you wouldn’t be able to search for aligned annotations on multiple tiers (or it would be a pain to code matching up the annotations again).

But what about using regex named capturing groups? e.g. (?<NAME>X). That way the user could explicitly design the search with respect to matching between fields. And only named groups would be accessible across fields? I think named groups are only available from Java 7 onwards, though. I think you’d probably have to restrict named variables to one field and make the match available in others, or it would be too complex to program, and at any rate it could be very, very slow.
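
As a sketch of how named groups could carry a match from one field to the next, here is a Python version of the idea. It is only an illustration, not a proposal for ELAN's implementation. Python spells a named group `(?P<NAME>X)` rather than Java's `(?<NAME>X)`, and the `\k<NAME>` placeholder in the second field is borrowed from Java's named backreference syntax and filled in by hand before compiling. The function names and sample annotations are invented.

```python
import re

# Java-style named backreference, used here purely as a placeholder syntax.
BACKREF = re.compile(r"\\k<(\w+)>")


def fill_template(template, groups):
    """Replace each \\k<NAME> with the escaped text captured under NAME.

    Raises KeyError if the template names a group the first field lacks.
    """
    return BACKREF.sub(lambda m: re.escape(groups[m.group(1)] or ""), template)


def match_across_fields(annotations, first_field, second_field_template):
    """Match first_field on annotation i, then the filled template on annotation i + 1."""
    first = re.compile(first_field)
    hits = []
    for i in range(len(annotations) - 1):
        m = first.search(annotations[i])
        if m is None:
            continue
        second = re.compile(fill_template(second_field_template, m.groupdict()))
        if second.search(annotations[i + 1]):
            hits.append((i, m.groupdict()))
    return hits


annotations = [
    "he ties the rope and tests it",
    "having tested it he climbs down",
]
hits = match_across_fields(
    annotations,
    r"\b(?P<stem>\w+?)s it$",   # field 1: capture the verb stem before "s it"
    r"^having \k<stem>ed\b",    # field 2: the same stem with a different ending
)
print(hits)  # [(0, {'stem': 'test'})]
```

Restricting the named groups to the first field, as suggested above, keeps the numbering question out of the picture: the second field is compiled only after the first has matched.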

August 1, 2014 at 05:41 #9684

Tom Honeyman

On a different but related note to this, what about being able to define re-useable portions of regexp for use in a search pattern?

For instance, say I wanted to define a set of vowels or consonants relevant to what I’m searching for, and then re-use that portion of regexp multiple times across a search. It’s much cleaner to say define them:

$v = “[aeiou]”
$c = “[ptkmnb]”

and then reuse the variables in a match:

\b$c$c?$v$c$c?$v$c?\b

to look for words with two syllables.

Basically the setup would swap in the contents of the variable before matching the regexp. This would greatly simplify building complex regular expressions. The swapped in version would then be:

\b[ptkmnb][ptkmnb]?[aeiou][ptkmnb][ptkmnb]?[aeiou][ptkmnb]?\b

Of course, this is just a simple example, but the possibilities could be a lot more complex.
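
The swapping-in step is easy to prototype outside ELAN. A minimal Python sketch, assuming variables are written as `$name` and that a `$` not followed by a known name is an ordinary end-of-line anchor (the `expand` function and `definitions` dictionary are hypothetical names):

```python
import re

VARIABLE = re.compile(r"\$(\w+)")


def expand(pattern, definitions):
    """Swap each known $name for its definition; leave anything else untouched."""
    def swap(m):
        return definitions.get(m.group(1), m.group(0))
    return VARIABLE.sub(swap, pattern)


definitions = {"v": "[aeiou]", "c": "[ptkmnb]"}
expanded = expand(r"\b$c$c?$v$c$c?$v$c?\b", definitions)
print(expanded)
# \b[ptkmnb][ptkmnb]?[aeiou][ptkmnb][ptkmnb]?[aeiou][ptkmnb]?\b

disyllabic = re.compile(expanded)
print(bool(disyllabic.fullmatch("pata")))  # True
print(bool(disyllabic.search("pa")))       # False
```

Because the substitution happens before the regexp is compiled, the regex engine itself needs no changes.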

If the variables were stored against the search domain, then over time, a user could build a complex set of variables to help with searching within that search domain. Personally I would use it for all sorts of things from building custom character classes to defining verbal morphology, or more simply just for matching against controlled vocabularies for small classes of words (e.g. all the $pronouns, $demonstratives, etc).

August 1, 2014 at 05:53 #9686

Tom Honeyman

If you like the idea, then I guess I’d also stretch it to include recursion, such that variables can exist in variables as well:

$v = “[aeiou]”
$c = “[ptkmnb]”
$disyllabic_word = “\b$c$c?$v$c$c?$v$c?\b”
$monosyllabic_word = “\b$c$c?$v$c?\b”

and so on…
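
A sketch of how the recursive version might behave, again in Python and again only an assumption about the design: definitions are expanded on demand, and a definition that refers back to itself is reported as an error instead of looping forever. The names `expand_recursive` and `definitions` are made up.

```python
import re

VARIABLE = re.compile(r"\$(\w+)")


def expand_recursive(pattern, definitions, _active=()):
    """Expand $name references, including references nested inside definitions."""
    def swap(m):
        name = m.group(1)
        if name not in definitions:
            return m.group(0)  # unknown name: leave it as written
        if name in _active:
            raise ValueError(f"circular definition involving ${name}")
        # Expand the definition itself before swapping it in.
        return expand_recursive(definitions[name], definitions, _active + (name,))
    return VARIABLE.sub(swap, pattern)


definitions = {
    "v": "[aeiou]",
    "c": "[ptkmnb]",
    "disyllabic_word": r"\b$c$c?$v$c$c?$v$c?\b",
    "monosyllabic_word": r"\b$c$c?$v$c?\b",
}
print(expand_recursive("$monosyllabic_word", definitions))
# \b[ptkmnb][ptkmnb]?[aeiou][ptkmnb]?\b
```

The `_active` tuple tracks which variables are currently being expanded, which is what makes the circular case detectable.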

August 3, 2014 at 22:49 #9690

Han

There is already an item on the wish list about storing regular expressions for later re-use, so that the same expressions don’t have to be entered over and over again. Your suggestion sounds like a more sophisticated version of this request. And it sounds useful, so we’ll add this to the existing request (without being able to tell if or when this can be implemented).

-Han

November 30, 2018 at 05:48 #12536

Christian Döhler

Has this been implemented in one of the ELAN versions since this thread?

What I am looking for seems to be related. For example: I want to replace a with ä, but only word-finally or before t.

1. I search for a(\b|t)
2. In other software, I would then invoke my capture group when doing the replace. I would replace with: ä\1 or with ä$1

The second step does not seem to work in ELAN, neither in the single-file search nor in the multi-file search and replace.
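
For comparison, this is what the two steps look like in Python's `re` module, which writes the backreference in the replacement as `\1` (Java's own `Matcher.replaceAll` uses `$1`). The function name and sample text are made up, and this shows the behaviour of other software, not of ELAN.

```python
import re


def replace_a(text):
    # a(\b|t): an "a" at a word boundary or directly before "t".
    # Group 1 holds whatever followed (empty at a boundary, "t" otherwise)
    # and is put back after the new vowel.
    return re.sub(r"a(\b|t)", r"ä\1", text)


print(replace_a("mana mat"))  # manä mät
```

The first a in “mana” is left alone because it is neither word-final nor followed by t.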

December 3, 2018 at 10:04 #12539

Han

I’m afraid we haven’t been able to dedicate any time/resources to implementing either of these wishes (storing regex and supporting backreferences in search/replace) for the past few years. But these items are still on the list, together with a few other improvements to the search (and replace) engine.
