Regex - problem with [A-Z]*

OsunSeyi · 15 Dec. 2007

Hi!
I keep finding, with both egrep and sed, that the expression
[A-Z] does not work correctly, and I am forced to use
[ABCDEFGHIJKLMNOPQRSTUVWXYZ] instead.
How can that be?
Best regards,
tom

gameboy · 15 Dec. 2007

Hello OsunSeyi,

do you have a concrete example that causes problems?

Best regards,
gameboy.

haveaniceday · 16 Dec. 2007

Try setting:

export LC_COLLATE=C

and then try again.
With, for example, LC_COLLATE="de_DE.UTF-8" the collation order is [aAbBcC...zZ].
So if you specify [A-Z], you get pretty much every letter except lowercase "a".
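
To check the effect without changing the whole session, the locale can be overridden for a single command. This is only a minimal sketch: the input string and the variable name `result` are made up for illustration. It uses LC_ALL rather than LC_COLLATE because LC_ALL, when set, takes precedence over the individual LC_* variables.

```shell
# Run sed in the C locale for this one invocation only.
# In the C locale the range A-Z covers exactly the 26 ASCII capitals.
result=$(printf 'AaBbZz\n' | LC_ALL=C sed 's/[A-Z]//g')
echo "$result"    # abz
```

If `locale` shows LC_ALL set to something else, exporting LC_COLLATE=C alone will have no visible effect until LC_ALL is unset or changed.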

Haveaniceday.

PS: It is especially nasty when you try rm ./[A-Z]* and only the "a*" files are left.
Edit: with "upper" and "lower" you are more on the safe side:
Example: echo "AaBb" | sed 's/[[:upper:]]//g'
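
Spelled out a bit further, as a rough sketch (the sample words and the variable names `stripped` and `kept` are invented for the demo): the class name goes inside an outer pair of brackets, so the full form is [[:upper:]], and it works the same way in grep -E.

```shell
# [[:upper:]] is a character class inside a bracket expression,
# so it does not rely on the collation order of a range like A-Z.
stripped=$(echo "AaBb" | sed 's/[[:upper:]]//g')
echo "$stripped"    # ab

# Keep only the lines that start with an uppercase letter.
kept=$(printf 'Alpha\nbeta\nGamma\n' | grep -E '^[[:upper:]]')
echo "$kept"        # Alpha and Gamma, one per line
```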

Code:

From: http://cnswww.cns.cwru.edu/~chet/bash/COMPAT
...
13. The behavior of range specifiers within bracket matching expressions
in the pattern matcher (e.g., [A-Z]) depends on the current locale,
specifically the value of the LC_COLLATE environment variable. Setting
this variable to C or POSIX will result in the traditional ASCII behavior
for range comparisons. If the locale is set to something else, e.g.,
en_US (specified by the LANG or LC_ALL variables), collation order is
locale-dependent. For example, the en_US locale sorts the upper and
lower case letters like this:

AaBb...Zz

so a range specification like [A-Z] will match every letter except `z'.
Other locales collate like

aAbBcC...zZ

which means that [A-Z] matches every letter except `a'.

The portable way to specify upper case letters is [:upper:] instead of
A-Z; lower case may be specified as [:lower:] instead of a-z.

Look at the manual pages for setlocale(3), strcoll(3), and, if it is
present, locale(1).

You can find your current locale information by running locale(1):

caleb.ins.cwru.edu(2)$ locale
LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_ALL=en_US

My advice is to put

export LC_COLLATE=C

into /etc/profile and inspect any shell scripts run from cron for
constructs like [A-Z]. This will prevent things like

rm [A-Z]*

from removing every file in the current directory except those beginning
with `z' and still allow individual users to change the collation order.
Users may put the above command into their own profiles as well, of course. 
...
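
For file name patterns, one point is easy to miss: the shell expands the glob itself, so prefixing a single command with LC_ALL=C does not change which files match. The sketch below sets the locale inside a subshell instead and previews the match with echo rather than rm. The directory and file names are hypothetical and are created in a throwaway temp directory.

```shell
# Work in a throwaway directory so nothing real is touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/Alpha" "$demo_dir/beta" "$demo_dir/Zulu" "$demo_dir/zeta"

# Variant 1: switch the subshell to the C locale, then expand [A-Z]*.
by_range=$(
  cd "$demo_dir" || exit 1
  LC_ALL=C
  export LC_ALL
  echo [A-Z]*
)

# Variant 2: use the character class, which does not depend on collation.
by_class=$(
  cd "$demo_dir" || exit 1
  echo [[:upper:]]*
)

echo "$by_range"    # Alpha Zulu
echo "$by_class"    # Alpha Zulu

rm -r "$demo_dir"
```

Previewing with echo (or ls) first costs nothing and shows exactly what an rm with the same pattern would remove.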

Anonymous · 16 Dec. 2007

OsunSeyi wrote:
I keep finding, with both egrep and sed, that the expression
[A-Z] does not work correctly, and I am forced to use
[ABCDEFGHIJKLMNOPQRSTUVWXYZ] instead.
How can that be ?????

Have a look at this old thread as well. It is not just grep and sed; Bash also seems to do whatever it wants at times. :wink:

http://www.linux-club.de/viewtopic.php?t=29854

robi

OsunSeyi · 18 Dec. 2007

Many thanks :!:
I will let that sink in first (and see in which cases the problem occurs and in which it does not).
I will report back,
tom
