RfD: String comparison words version 1 (was version 0)

forth

    Sponsored Links

Re: RfD: String comparison words version 1 (was version 0)

Postby stephenXXX » Wed, 24 Nov 2010 07:49:15 GMT

On Mon, 22 Nov 2010 13:35:13 -0800 (PST), Alex McDonald




RTFM

The manual defines these in the order
  S=
  STR=
  IS=
  ISTR=

Stephen


-- 
Stephen Pelc,  XXXX@XXXXX.COM 
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web:  http://www.**--****.com/  - free VFX Forth downloads

Re: RfD: String comparison words version 1 (was version 0)

Postby Alex McDonald » Wed, 24 Nov 2010 08:54:33 GMT




IDHTFM (I don't have etc...) in an attempt to avoid polluting
Win32Forth with your stuff. We had a clf conversation about this years
ago.

> S=
>> STR=>
> IS>
> IST>=
>

Thank you.
>

> Stephen
>

>>
> >-
> Stephen Pelc,  XXXX@XXXXX.COM >m
> MicroProcessor Engineering Ltd - More Real, Less Ti>e
> 133 Hill Lane, Southampton SO15 5AF, Engla>d
> tel:+44 (0)23 8063 1441begin_of_the_skype_highlighting+44 (0)23 8063 1441end_of_the_skype_highlighting, fax: +44 (0)23 8033 96>1
> web: http://www.**--****.com/  free VFX Forth downloads


Re: RfD: String comparison words version 1 (was version 0)

Postby Coos Haak » Wed, 24 Nov 2010 09:43:30 GMT

Op Mon, 22 Nov 2010 13:35:13 -0800 (PST) schreef Alex McDonald:

<snip>

Mine differ too: COMPARE-CI  and  SEARCH-CI


: STR< COMPARE 0< ;
: STR> COMPARE 0> ;
: STR>= COMPARE -1 > ;

-- 
Coos

CHForth, 16 bit DOS applications
 http://www.**--****.com/ 

Re: RfD: String comparison words version 1 (was version 0)

Postby BruceMcF » Wed, 24 Nov 2010 11:11:03 GMT



Why not:

: STRING-PREFIX? ( c-addr1 u1 c-addr2 u2 -- flag )
   >>R R@ MIN 2>> STR= ;

Re: RfD: String comparison words version 1 (was version 0)

Postby Alex McDonald » Wed, 24 Nov 2010 13:51:47 GMT



Thanks; I got them the wrong way round. Noted for the next revision.



Re: RfD: String comparison words version 1 (was version 0)

Postby Alex McDonald » Wed, 24 Nov 2010 14:09:11 GMT





The original was from gforth as reported by Coos Haak.

Re: RfD: String comparison words version 1 (was version 0)

Postby Alex McDonald » Wed, 24 Nov 2010 15:23:42 GMT




> OT>OVER >> IF
> DROP DR>P FALSE
> LSE OVER +>SWAP ?DO
> >P C@ 32 OR
> >  >@ 32 >R
> > IF
> NL>OP DROP FALSE EX>T
> HEN CHAR+
> OOP DROP TRUE THEN ;

has obvious bug; changed to

: ISTR-LOWERC    ( char -- char' )
  DUP [CHAR] A [ CHAR Z 1+ ] LITERAL WITHIN IF 32 + THEN ;

: ISTR=          ( c-addr1 u1 c-addr2 ><2 flag)
   ROT OVER <> IF
     2DROP DROP FALSE
   ELSE OVER + SWAP ?DO
     DUP C@ ISTR-LOWERC
    ><I C@ ISTR-LOWERC
     <> IF
       UNLOOP DROP FALSE EXIT
     THEN CHAR+
   LOOP DROP TRUE THEN ;



Re: RfD: String comparison words version 1 (was version 0)

Postby Alex McDonald » Wed, 24 Nov 2010 15:23:42 GMT




> OT>OVER >> IF
> DROP DR>P FALSE
> LSE OVER +>SWAP ?DO
> >P C@ 32 OR
> >  >@ 32 >R
> > IF
> NL>OP DROP FALSE EX>T
> HEN CHAR+
> OOP DROP TRUE THEN ;

has obvious bug; changed to

: ISTR-LOWERC    ( char -- char' )
  DUP [CHAR] A [ CHAR Z 1+ ] LITERAL WITHIN IF 32 + THEN ;

: ISTR=          ( c-addr1 u1 c-addr2 ><2 flag)
   ROT OVER <> IF
     2DROP DROP FALSE
   ELSE OVER + SWAP ?DO
     DUP C@ ISTR-LOWERC
    ><I C@ ISTR-LOWERC
     <> IF
       UNLOOP DROP FALSE EXIT
     THEN CHAR+
   LOOP DROP TRUE THEN ;



Re: RfD: String comparison words version 1 (was version 0)

Postby Andrew Haley » Wed, 24 Nov 2010 18:58:27 GMT



You didn't address
https://groups.google.com/group/comp.lang.forth/msg/b8778bb21192d42f?hl=en

Summary: all we need is a case-insemsitive COMPARE.  We don't need
extra words.

Andrew.

Re: RfD: String comparison words version 1 (was version 0)

Postby Andrew Haley » Wed, 24 Nov 2010 18:58:27 GMT



You didn't address
https://groups.google.com/group/comp.lang.forth/msg/b8778bb21192d42f?hl=en

Summary: all we need is a case-insemsitive COMPARE.  We don't need
extra words.

Andrew.

Re: RfD: String comparison words version 1 (was version 0)

Postby Andreas » Wed, 24 Nov 2010 19:24:29 GMT

schrieb Andrew Haley:



I second that. And I don't like seeing Forth's word zoo increased.

When I want comfortable string processing, I don't use Forth. It would 
perhaps be different if modules/libraries were a standard Forth feature 
- but that's a different story.

Andreas

Re: RfD: String comparison words version 1 (was version 0)

Postby Alex McDonald » Wed, 24 Nov 2010 20:21:41 GMT

On Nov 23, 9:58m, Andrew Haley < XXXX@XXXXX.COM >





A case insensitive COMPARE can (and probably should) be addressed.

But this proposal is to significantly reduce the overhead of COMPARE
when the only result desired is equality (along with a case
insensitive equivalent). c-addr u1 c-addr2 u2 COMPARE 0= always
returns false when the lengths differ (u1<>u2), but we've still had to
access and compare min(u1,u2) characters of data. A more efficient
method is to check the lengths for equality prior to the test;

: ifpart if compare 0= else 2drop 2drop false then ;
: STR= dup 2 pick = ifpart ;
: STR= >r over r@ = r> swap ifpart ;
: STR= 2>r dup r@ = 2r> rot ifpart ;

Better, but probably only marginally so in most non-optimizing Forths
for lengths in the 10s and upwards of bytes, since there is much stack
juggling, probably unoptimised, required to get at u1. Ignore, for the
moment, the case-insensitive test ISTR= . Is there value in STR= ?
Having a case-insensitive COMPARE fixes a different problem.


Re: RfD: String comparison words version 1 (was version 0)

Postby Andrew Haley » Wed, 24 Nov 2010 22:54:18 GMT






I don't believe so.  As soon as you find a difference, you know which
string is lexicographically before the other.  As the spec says, "n is
minus-one (-1) if the first non-matching character in the string
specified by c-addr1 u1 has a lesser numeric value than the
corresponding character in the string specified by c-addr2 u2 and one
(1) otherwise."  So, as soon as you find a difference you can stop.


It's only significantly faster when you have two strings that differ
in length but share a common prefix.  How often does that happen?

Andrew.

Re: RfD: String comparison words version 1 (was version 0)

Postby stephenXXX » Thu, 25 Nov 2010 01:00:50 GMT

On Tue, 23 Nov 2010 07:54:18 -0600, Andrew Haley




Remember Ulrich Drepper's articles on loading shared objects?
   http://www.**--****.com/ 

There really are occasions when you compare vast numbers of strings
with a common prefix. Open Office suffers badly at start up.

Stephen


-- 
Stephen Pelc,  XXXX@XXXXX.COM 
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web:  http://www.**--****.com/  - free VFX Forth downloads

Re: RfD: String comparison words version 1 (was version 0)

Postby Andrew Haley » Thu, 25 Nov 2010 01:30:32 GMT




Okay, but does the standard Forth language really need a string
comparison operator that optimizes this particular case?  Really? ...

Andrew.

Similar Threads:

1.RfD: String comparison words version 1 (was version 0)

2.RfD: String comparison words version 1

Alex McDonald < XXXX@XXXXX.COM > writes:

> RfD: String comparison words (Draft version 1)
>
> Change history
> 2010-11-05 Initial proposal, incomplete
> 2010-11-22 Expanded Remarks section
>            Expanded Experience section
>            Correction of errors
>
> Problem
> -------
>
> Although ANS Forth provides COMPARE for string comparisons, it has two
> attributes that make it problematic; it is case sensitive, and
> expensive to execute for equality or inequality (the common cases) due
> to the need to complete the comparison to return greater-than or less-
> than return values.
>
> Add to the existing functionality of COMPARE to provide comparisons
> that are case-insensitive and and that only test for equality.
>
> STR= ( c-addr1 u1 c-addr2 u2 -- flag )  STRING-EXT

I still don't like standard words being contractions when there's no
problem using full form. "STRING=" is much better by all means except
length, and it is only 3 letters longer. You don't know if programmer
wants to use "STR" prefix for dynamic strings or not.

> Compare the string specified by c-addr1 u1 to the string specified by
> c-addr2 u2. c-addr1 and c-addr2 point at read-only areas, which must
> not be modified. If the strings are of different lengths (u1<>u2),
> flag is FALSE. Otherwise the strings are compared, beginning at the
> given addresses, character by character, up to the equal length of the
> strings or until a difference is found. If the two strings are
> identical, flag is TRUE.

"Identical" or "equal"?

"Identical" means that objects are the same, "equal" means that objects
are allowed to be different still equal. Look up "object identity"
in programming literature.

> ISTR= ( c-addr1 u1 c-addr2 u2 -- flag )  STRING-EXT
>
> Compare the string specified by c-addr1 u1 to the string specified by
> c-addr2 u2. c-addr1 and c-addr2 point at read-only areas, which must
> not be modified. If the strings are of different lengths, flag is
> FALSE. Otherwise the strings are fetched, beginning at the given
> addresses, character by character. Characters are considered to match
> if they have the same numeric value, or, if the characters fall
> between ASCII values 'A' thru 'Z', they are considered to be identical
> to the corresponding character values in the range 'a' thru 'z'.

This is unacceptible since it ignores natural string comparison rules.

> If
> the two strings are identical, flag is TRUE, otherwise FALSE.

Same as above.

> Remarks
> -------
>
> Why standardize these words? They can be defined in ANS Forth, for
> example;
>
> : STR= COMPARE 0= ;
>
> For the following reasons:
>
> Most uses of COMPARE are for string equality or inequality for string
> prefices.
>
> Using COMPARE to test for inequality is inefficient, as strings with
> unequal lengths can immediately be declared unequal; but COMPARE must
> continue to fetch and check characters to determine whether it should
> return greater-than or less-than, even though the result of this
> additional work will be discarded.
>
> For strings of equal length, the overhead is less significant, but the
> result of the comparison must still be adjusted to indicate the
> required result.

These are very weak arguments since all these cases are eliminated with
primitive peephole optimiser.

> Although string manipulation and handling is not employed extensively,
> text processing applications benefit significantly. Letting the
> compiler optimize uses of COMPARE 0= into a more efficient word is
> possible, but the programmer must employ an expensive COMPARE followed
> by tests to reduce the range of the result on systems that do not
> synthesize more efficient tests for equality.
>
> Case insensitive Forths require words to search the dictionary in a
> case-insensitive manner. These tests and tests for prefixes require
> that the tested argument is either converted to all upper case (or all
> lower case), which generally requires copying the original string to a
> transient area and performing a suitable case translation, followed by
> an expensive COMPARE operation.
>
> Why no case-insensitive COMPARE?
>
> There are a wide variety of case-insensitive words employed by Forths
> for this function; ICOMPARE, COMPARE(NC), UCOMPARE amongst others.
> Standardising such widely varying words would be problematic.

Standardising STRING= STR= S= $= and so on is still problematic, yet
you're writing this proposal somehow.

Standardising case insensitive COMPARE would solve remaining string
comparison problems.

> Why no STR<, STR>, STR>= and so on?
>
> The implementation of any test beyond equality requires inspecting all
> the characters for the length of the shortest.

Proper implementation of equality test for strings requires inspecting
all the characters anyway.

> The differentiation
> between greater-than and less-than is trivial for implementations of
> COMPARE to determine, as it is set on meeting the first non-equal
> character, or on exhausting one or other of the strings. All of these
> variants can be efficently written using COMPARE.
>
> : STR< COMPARE 0 > ;
> : STR> COMPARE 0 < ;
> : STR>= COMPARE 1 < ;
>
> and so on.
>
> The current proposal does not allow the synthesizing of case
> insensitive comparisons due to a lack of appropriate ICOMPARE (or
> COMPARE(NC) etc).

And this is major drawback of this proposal.

> Experience
> ----------
>
> As a case insensitive Forth, Win32Forth exposes ISTR= , used to search
> wordlists, as defined here, and supplies a STR= not based on COMPARE.
>
> MPE's VFX Forth supplies STR= S= and IS=. S= is a buffer compare with
> the signature ( c-addr1 c-addr2 u -- flag ); IS= is the case
> insensitive equivalent. S= and IS= can be efficiently synthesized from
> STR= and ISTR= respectively;
>
> : S= ( c-addr1 c-addr2 u -- flag ) TUCK STR= ;
> : IS= ( c-addr1 c-addr2 u -- flag ) TUCK ISTR= ;
>
> [ Does VFX Forth provide an equivalent to ISTR=? ]
>
> Gforth supplies STR= STR< and STRING-PREFIX?.
>
> STRING-PREFIX? can be synthesized from STR= ;
>
> : STRING-PREFIX? ( c-addr1 u1 c-addr2 u2 -- flag )
>   TUCK 2>R MIN 2R> STR= ;
>
> [ Information on other Forths required here ]

Create it. There're more freely available Forths than Win32Forth and Gforth,
some of them are portable.

> Comments
> --------
>
> The ANS definition of COMPARE does not explicitly declare whether the
> input strings are read-only. Since COMPARE states that characters are
> "compared", the assumption is that they are read-only since no
> reasonable implementation needs to employ a destructive test.

This makes grounds to review the practice and amend standard to require
non-destructive comparison.

> With
> case-sensitive string comparisons, this RfD makes it clear that they
> are read-only, as implementors might be tempted to lower- or upper-
> case one or both of the strings prior to comparison.
>
> Note that the implementations do not assign a meaning to the values of
> c-addr1 or c-addr2 when u1<>u2 (unequal length strings), or when
> u1=u2=0 (null strings which always return TRUE). Given that different
> implementations may address these in their own way, supplying invalid
> values of c-addr1 and c-addr2 in those cases (those that would cause
> an error if a single character was fetched from either of those
> addresses) is an ambiguous condition.
>
> Case-insensitivity only considers ASCII 'A' thru 'Z' to be equal to
> the corresponding ASCII characters 'a' thru 'z'. No other characters
> outside that range are considered equal.

This is definitly wrong. If your words are not useful for anything except
internal problems of your implementation and your programs, they should
not be standardised at all, let alone take useful names.


-- 
HE CE3OH...

3.RfD: String comparison words version 0

RfD: String comparison words

Change history
2010-11-05 Initial proposal, incomplete

Problem
-------

Although ANS Forth provides COMPARE for string comparisons, it has two
attributes that make it problematic; it is case sensitive and
expensive to execute due to the range of return values. Extend the
functionality of COMPARE to provide comparisons that are case-
insensitive, and comparisons that test for only equality.

STR= ( c-addr1 u1 c-addr2 u2 -- n )  STRING-EXT

Compare the string specified by c-addr1 u1 to the string specified by
c-addr2 u2. c-addr1 and c-addr2 point at read-only areas, which must
not be modified. If the strings are of different lengths (u1 is not
equal to u2), n is zero (0). Otherwise the strings are compared,
beginning at the given addresses, character by character, up to the
equal length of the strings or until a difference is found. Characters
are considered identical if they have the same numeric value. If the
two strings are identical, n is zero.

ISTR= ( c-addr1 u1 c-addr2 u2 -- n )  STRING-EXT

Compare the string specified by c-addr1 u1 to the string specified by
c-addr2 u2. c-addr1 and c-addr2 point at read-only areas, which must
not be modified. If the strings are of different lengths, n is zero
(0). If both strings are null (u1 and u2 are both zero), n is one (1).
Otherwise the strings are fetched, beginning at the given addresses,
character by character. Characters are considered identical if they
have the same numeric value, or if the characters fall between ASCII
values 'A' thru 'Z' they are considered to be identical to the
corresponding character values in the range 'a' thru 'z'. If the two
strings are identical, n is one (1); otherwise n is zero (0).

Remarks
-------

Why standardize these words? They can be defined in ANS Forth;

: STR= COMPARE 0= ;
: ISTR= <definition required> ;

For the following reasons:

    * Many systems define STR= or ISTR= (or the equiavlent with
another name)
    * Case insensitive Forths require such a word to search the
dictionary in a case-insensitive manner, and many expose these words
or their equivalents to the user.
    * The commonest use of COMPARE is in the form COMPARE 0=.
    * They will be implemented more efficiently on many systems.
Although string manipulation and handling is not employed extensively,
text processing applications benefit significantly.

Why no case-insensitive COMPARE?

While 'a' and 'A' can be considered equal, it is problematic to assign
a meaning to a comparison of 'a' against 'B' in terms of 'greater
than' or 'less than'. Numerically, 'B' (65 decimal) is less than
'a' (96 decimal), but collation sequences are normally defined in
terms of case-insensitive tests; 'A' precedes 'ab', which precedes
'B'. This RfD does not attempt to address these issues.

Note that the implementation of STR= and ISTR= does not describe the
values of c-addr1 or c-addr2 when u1 <> u2 (unequal length strings),
or when u1=u2=0 (null strings). Given that different implementations
may address these in their own way, supplying invalid values of c-
addr1 and c-addr2 in those cases (those that would cause an error if a
single character was fetched from either of those addresses) is an
ambigous condition.

Experience
----------

As a case insensitive Forth, Win32Forth exposes ISTR= , used to search
wordlists, as defined here.

<others>

4.RfD: Escaped Strings (Version 6.2)

5.RfD: Escaped Strings (Version 6)

6. RfD: Escaped Strings S\" (version 5)

7. RfD: Escaped Strings version 4

8. RfD: Ambiguity in FATAN2 version 1.2



Return to forth

 

Who is online

Users browsing this forum: No registered users and 78 guest