Ticket #1882 (closed enhancement: fixed)
PCRE search: escape sequence support in replacements, UTF8 support (just a flag for libPCRE)
Reported by: | vitalif | Owned by: | slavazanko |
---|---|---|---|
Priority: | minor | Milestone: | 4.8.0-pre1 |
Component: | mc-search | Version: | master |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Branch state: | merged | Votes for changeset: | committed-master |
Description
mc uses glib for PCRE search by default.
that's incorrect, because glib does not support PCRE_UTF8 compile flag, so \w and other "localizable" regex instructions don't work as expected.
mc should use libpcre by default + PCRE_UTF8 flag.
Also, mc does not support escape sequences in replacement strings ("\n", "\t" and etc), but it should.
Attaching a patch implementing these features.
Attachments
Change History
Changed 15 years ago by vitalif
- Attachment mc-4.7.0pre-searchPCRE-escape-sequences-and-UTF8.patch added
comment:1 follow-up: ↓ 4 Changed 15 years ago by zaytsev
How important is to remove these comments???
comment:2 Changed 15 years ago by slavazanko
that's incorrect, because glib does not support PCRE_UTF8 compile flag, so \w and other "localizable" regex instructions don't work as expected.
What you mean? perl-compatible regexp'es of Glib library always turn on UTF-8 mode.
From http://library.gnome.org/devel/glib/stable/glib-Perl-compatible-regular-expressions.html:
Note that, unless you set the G_REGEX_RAW flag, all the strings passed to these functions must be encoded in UTF-8. The lengths and the positions inside the strings are in bytes and not in characters, so, for instance, "\xc3\xa0" (i.e. "à") is two bytes long but it is treated as a single character. If you set G_REGEX_RAW the strings can be non-valid UTF-8 strings and a byte is treated as a character, so "\xc3\xa0" is two bytes and two characters long.
Also, mc does not support escape sequences in replacement strings ("\n", "\t" and etc), but it should.
Show version of your mc, please:
mc -V
I have worked special sequences. Test case:
- run mcedit
- press SPACE key twice, type dddd; press ENTER key
- press TAB key twice, type dddd; press ENTER key
- go at top of editing file
- press F7 key
- enter '\t\tddd' into search string
- select 'Regular expression' search type
- press ENTER key
If no any matches found at this point - we will continue discuss.
comment:3 Changed 15 years ago by vitalif
Я имел ввиду escape-последовательности в строке ЗАМЕНЫ, а не в строке, которую ищем - там-то они работают.
Тест - ввести в строку поиска например "\x0d", а в строку замены например "\n". Везде в тексте \r будет заменён на просто символ "n", что не есть логично.
Сейчас обновился до 4.7.0.1-1 из Debian'а, баг до сих пор присутствует.
comment:4 in reply to: ↑ 1 Changed 15 years ago by vitalif
Replying to zaytsev:
How important is to remove these comments???
Why do you need to remove them?
Is documented code worse than undocumented?
Changed 15 years ago by vitalif
- Attachment mc-4.7.0.1-regex-replace-escape-sequences.diff added
New version of patch
comment:5 Changed 15 years ago by zaytsev
You seemed to remove them, that's why I asked. But it looks like your previous patch was reversed, so you actually added them. I think that the new patch is OK.
Changed 14 years ago by vitalif
- Attachment mc-git20110403-regex-replace-escape-sequences.diff added
This same patch, ported to git version 12eb8b62...
comment:6 Changed 14 years ago by vitalif
Any progress here? Ticket is actual for the latest (git) version, so I posted a correct patch for it.
Comment 3 in english:
I mean escape sequence support inside _replacement_ strings, not inside the _search_ regex - of course they work inside regex.
Test is simple - enter, for example, "\x0d" as search string and "\n" as the replacement string, and \r will be replaced by just "n" character.
comment:7 Changed 14 years ago by slavazanko
- Status changed from new to accepted
- Owner set to slavazanko
comment:8 Changed 14 years ago by slavazanko
- Version changed from 4.7.0-pre4 to master
- Type changed from defect to enhancement
- severity changed from no branch to on review
- Milestone changed from 4.7 to 4.8.0-pre1
Created branch changeset:1882_esc_seq_in_replace_field
Initial changeset:be964ddc1ff1eb181c2e28655211607c1202691d
Review, please.
comment:9 Changed 14 years ago by slavazanko
- severity changed from on review to on rework
vitalif, can you switch to branch and see my changes? Can you run 'make check' in branch?
I have created some unit tests for your algorithm... but I don't sure that algorithm is clean... or may be tests is wrong :)
comment:10 Changed 14 years ago by vitalif
Despite of the fact that algorithm now ignores the absence of closing curly brace '}' (which probably isn't 100% correct), this should be checked in tests for replace_handle_esc_seq function, not process_escape_sequence - it is the replace_handle_esq_seq who decides whether it is an escape sequence or not.
Also, \x{4344} is usually a code for wide character (UTF-8), and not for "CD". So we can either ignore the higher bits, or generate wide character codes... The second would be convenient, but would also introduce a hard-coded UTF-8 charset.
comment:11 Changed 14 years ago by vitalif
I've attached a patch which fixes addressing these issues...
Also it removes *next_char parameter from handle_esc_seq, I think it's pointless there and just adds a non-trivial interface.
Changed 14 years ago by vitalif
- Attachment checkcurlybrace-widechars-tests.diff added
Check } curly braces, fix tests, add \x{4344} utf-8 wide chars in utf-8 match mode
comment:12 Changed 14 years ago by slavazanko
- severity changed from on rework to on review
Review branch, please
comment:14 Changed 14 years ago by slavazanko
- severity changed from on review to no branch
- Branch state set to on review
comment:15 Changed 14 years ago by slavazanko
Branch was rebased. New initial changeset:aa2d0b07edcfc884f0d778eaa89c201094c43458
comment:16 Changed 14 years ago by angel_il
- Votes for changeset changed from andrew_b to andrew_b angel_il
- Branch state changed from on review to approved
comment:17 Changed 14 years ago by slavazanko
- Status changed from accepted to testing
- Votes for changeset changed from andrew_b angel_il to committed-master
- Resolution set to fixed
- Branch state changed from approved to merged
Merge changeset:ba09144ce5210c1ae5a07389d4de5085938601c3
Show commits in branch:
git log --pretty=oneline bb12a12..e1edfd7
comment:19 Changed 13 years ago by vitalif
Sorry for silence, and big thanks for approving the patch :)
patch for src/search/regex.c