RfD: Escaped Strings version 4

Postby Peter Knaggs » Sat, 11 Aug 2007 03:22:30 GMT

RfD: Escaped Strings S\"
19 July 2007, Stephen Pelc

20070719 Modified ambiguous condition
Added ambiguous conditions to definition of S\"
Added test cases
Corrected Reference Implementation
20070712 Redrafted non-normative portions.
20060822 Updated solution section.
20060821 First draft.

Rationale
=========

Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as discussed
in the Forth 200x internationalisation and XCHAR proposals.

Current practice
----------------
At least SwiftForth, gForth and VFX Forth support S\" with very
similar operations. S\" behaves like S", but uses the '\' character
as an escape character for the entry of characters that cannot be
used with S".

This technique is widespread in languages other than Forth.

It has benefit in areas such as

1) construction of multiline strings for display by operating
system services,
2) construction of HTTP headers,
3) generation of GSM modem and Telnet control strings.

The majority of current Forth systems contain code, either in the
kernel or in application code, that assumes char=byte=au. To avoid
breaking existing code, we have to live with this practice.

The following list describes what is currently available in the
surveyed Forth systems that support escaped strings.

\a           BEL (alert, ASCII 7)
\b           BS (backspace, ASCII 8)
\e           ESC (not in C99, ASCII 27)
\f           FF (form feed, ASCII 12)
\l           LF (ASCII 10)
\m           CR/LF pair (ASCII 13, 10) - for HTML etc.
\n           newline - CRLF for Windows/DOS, LF for Unices
\q           double-quote (ASCII 34)
\r           CR (ASCII 13)
\t           HT (tab, ASCII 9)
\v           VT (ASCII 11)
\z           NUL (ASCII 0)
\"           "
\[0-7]+      Octal numerical character value, finishes at the
             first non-octal character
\x[0-9a-f]+  Hex numerical character value, finishes at the
             first non-hex character
\\           backslash itself
\            before any other character represents that character
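
As an illustration only, the single-character escapes in the list above can be modelled as a lookup table. The sketch below is in Python rather than Forth, and it is not taken from any of the surveyed systems. The names SIMPLE_ESCAPES and expand_simple are hypothetical, \n is assumed to produce LF (the list says it is platform dependent), one character is assumed to be one byte, and the numeric escapes are deliberately left out.

```python
# Hypothetical table for the single-character escapes in the list above.
SIMPLE_ESCAPES = {
    "a": b"\x07",      # BEL
    "b": b"\x08",      # BS
    "e": b"\x1b",      # ESC
    "f": b"\x0c",      # FF
    "l": b"\x0a",      # LF
    "m": b"\x0d\x0a",  # CR/LF pair
    "n": b"\x0a",      # assumption: newline is LF on this platform
    "q": b'"',         # double quote
    "r": b"\x0d",      # CR
    "t": b"\x09",      # HT
    "v": b"\x0b",      # VT
    "z": b"\x00",      # NUL
    '"': b'"',
    "\\": b"\\",
}


def expand_simple(text: str) -> bytes:
    """Expand single-character escapes; assumes char=byte (Latin-1 range)."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "x" or nxt in "01234567":
                raise ValueError("numeric escapes are outside this sketch")
            # Any escape not in the table stands for the character itself.
            out += SIMPLE_ESCAPES.get(nxt, nxt.encode("latin-1"))
            i += 2
        else:
            out += ch.encode("latin-1")
            i += 1
    return bytes(out)


print(expand_simple(r"GET / HTTP/1.0\m\m"))  # b'GET / HTTP/1.0\r\n\r\n'
```

The usage line builds the kind of CR/LF terminated header text mentioned under "Current practice".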

Considerations
--------------
We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the
impact of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel
with 1..4 byte characters, e.g. UTF-8
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.

The basis of the current approach is to use the terminology of
primitive characters and extended characters. A primitive character
(called a pchar here) is a fixed-width unit handled by EMIT and
friends as well as C@, C! and friends. A pchar corresponds to the
current ANS definition of a character. Characters that may be
wider than a pchar are called "extended characters" or xchars.
The xchars are an integer multiple of pchars. An xchar consists
of one or more primitive characters and represents the encoding
for a "display unit". A string is represented by caddr/len
in terms of primitive characters.

The consequences of this are:

1) No existing cod

Re: RfD: Escaped Strings version 4

Postby helmwo » Sat, 11 Aug 2007 05:09:46 GMT




They are missing HEX and something like
TESTING S\"

Regards,
-Helmar


Re: RfD: Escaped Strings version 4

Postby Peter Knaggs » Sat, 11 Aug 2007 17:17:59 GMT





The entire test suite is in HEX.

The test cases appear in the rationale for each individual word being 
tested, in a "Testing" section. I see no need for the TESTING heading. 
Anyhow this would be folded into the character tests (CHAR [CHAR] [ ] BL S")

Re: RfD: Escaped Strings version 4

Postby Peter Fälth » Sat, 11 Aug 2007 23:44:28 GMT

> Translation rules:
I suggest also defining \u and \U for entering Unicode code points
as 4 and 8 hex digits respectively. In my system \u20AC (the euro sign)
inserts the UTF-8 sequence E282AC into the string.

Peter Fälth


Re: RfD: Escaped Strings version 4

Postby anton » Sun, 12 Aug 2007 18:12:13 GMT

Peter Knaggs < XXXX@XXXXX.COM > writes:

Ok.


I have now changed the development version of Gforth so that it passes
the tests.


There were still some non-standard words in there.  I have
eliminated/defined all non-standard words and put the result on

 http://www.**--****.com/ 

This runs on the current development Gforth (not on Gforth-0.6.2 due
to the use of the # number prefix).

Concerning the question about the case sensitivity of the escapes,
both Gforth and the reference implementation treat them
case-sensitively.


That's ok, but most of the justifications are nonsense.  A much
better justification is that this allows any sequence of bytes to be
generated with S\" even if that sequence is not a proper xchar string;
and one needs such binary strings in various applications.


If there are byte order issues when transmitting xchars (e.g., for
UTF-32), that has to be dealt with at transmission, not at generation
of strings containing xchars.


Since S\" is generating a string, the cell size is irrelevant, and
this is not an issue.


That's a very good justification.


\0 seems to be a better candidate, because it is more in line with the
usage in other languages (in particular, C and its children, which
inspired this approach).


As in "3.2.1.2 Digit conversion" (i.e. only upper case is standard at
the moment) or as in the X:number-prefixes (case-insensitive).


You might also add

{ S\" \x0F0" SWAP DUP C@ SWAP CHAR+ C@ -> 2 0F 30 }

which might catch some non-conformant implementations that the test
above doesn't catch.
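
To illustrate what this test appears to distinguish, here is a small Python sketch (not Forth, and not from the RfD). The expected result 2 0F 30 corresponds to \x consuming exactly two hex digits, whereas the rule in the survey list ("finishes at the first non-hex character") would swallow all three digits. Both function names are hypothetical, and values above FF are not handled.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def hex_escape_greedy(body: str) -> bytes:
    """Consume hex digits up to the first non-hex character."""
    digits = ""
    for ch in body:
        if ch not in HEX_DIGITS:
            break
        digits += ch
    rest = body[len(digits):]
    return bytes([int(digits, 16)]) + rest.encode("ascii")


def hex_escape_two_digits(body: str) -> bytes:
    """Consume exactly two hex digits; the rest is ordinary text."""
    return bytes([int(body[:2], 16)]) + body[2:].encode("ascii")


body = "0F0"  # the text that follows \x in the test string
print(hex_escape_greedy(body).hex())      # f0   (one character)
print(hex_escape_two_digits(body).hex())  # 0f30 (two characters: 0F and 30)
```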

- anton
-- 
M. Anton Ertl   http://www.**--****.com/ 
comp.lang.forth FAQs:  http://www.**--****.com/ 
     New standard:  http://www.**--****.com/ 
   EuroForth 2007:  http://www.**--****.com/ 

Re: RfD: Escaped Strings version 4

Postby stephenXXX » Tue, 14 Aug 2007 23:15:06 GMT

On Fri, 10 Aug 2007 07:44:28 -0700, Peter Fälth wrote:




That suggestion leads to six forms, which is why I gave up and
defined extended characters as a stream of primitive characters.
  UTF-8   encoding or char number?
  UTF-16  little or big-endian?
  UTF-32  little or big-endian?

Stephen


-- 
Stephen Pelc,  XXXX@XXXXX.COM 
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web:  http://www.**--****.com/  - free VFX Forth downloads

Re: RfD: Escaped Strings version 4

Postby Peter Fälth » Tue, 14 Aug 2007 23:58:15 GMT




No it does not! What follows the \u is the 4-digit hex number of
the Unicode code point. This is always the same and independent of
encoding or endianness. S\" will then translate this to the encoding
and endianness used in the specific system. If I write the string
S\" Please pay me 10\u20AC" this will be portable to whatever your
system uses for Unicode encoding. On my system it stores E282AC in the
byte stream. On a Windows system using UTF-16 it will store AC20
at the character position. It is when I input individual bytes with
\x that I need to keep track of the 6 cases. I want to avoid this.
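
A minimal Python sketch of this idea (not Forth, and not taken from any Forth system): the escape names a code point, and the system's own encoding decides which bytes are stored. The function name expand_unicode_escapes is hypothetical, and the target encoding is passed in as a parameter purely for illustration.

```python
import re

# \u followed by 4 hex digits, or \U followed by 8 hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(text: str, encoding: str) -> bytes:
    """Replace \\u and \\U escapes by code points, then encode the result."""
    def to_char(match):
        # Exactly one of the two groups matched; chr() raises ValueError
        # for values above 10FFFF.
        return chr(int(match.group(1) or match.group(2), 16))
    return _U_ESCAPE.sub(to_char, text).encode(encoding)


for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(enc, expand_unicode_escapes(r"\u20AC", enc).hex())
# utf-8 e282ac
# utf-16-le ac20
# utf-16-be 20ac
# utf-32-le ac200000
# utf-32-be 000020ac
```

The source text stays the same in every case; only the stored bytes differ, which matches the E282AC and AC20 examples above.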

Peter




Re: RfD: Escaped Strings version 4

Postby stephenXXX » Wed, 15 Aug 2007 01:23:32 GMT

On Mon, 13 Aug 2007 07:58:15 -0700, Peter Fälth wrote:




My bad! Thanks for the explanation. I assume that \U is followed by an
8 digit hex number. Although this notation solves the problems of the
host, is it enough when the string is sent across a comms channel to
another box of the other endianness?

Stephen

-- 
Stephen Pelc,  XXXX@XXXXX.COM 
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web:  http://www.**--****.com/  - free VFX Forth downloads

Re: RfD: Escaped Strings version 4

Postby Peter Knaggs » Wed, 15 Aug 2007 05:34:59 GMT



This assumes the system is using Unicode. There is nothing to mandate
that at present. If you were providing a non-Unicode system, would \u
and \U reflect the native encoding or would you insist on a full Unicode
conversion?

Re: RfD: Escaped Strings version 4

Postby Peter Fälth » Wed, 15 Aug 2007 05:50:59 GMT




Yes, the \U is for 8 digits.

If the string is in UTF-8 there would be no problems with endianness.
I assume that for communicating there would be a protocol that
specifies how strings are sent.

Peter




Re: RfD: Escaped Strings version 4

Postby Peter Fälth » Wed, 15 Aug 2007 06:11:38 GMT





No, \u and \U should always reflect the Unicode code point. The system
should then try to convert this to the encoding in use. If this
fails, a predefined character would be inserted to show a failed
conversion. This could be a ? or box character. In a system with
Latin-1, all codes above $FF will fail; all below will be a direct
translation. For other encodings the translation will require more
work. There are libraries in both Linux and Windows that can handle
this.
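
The fallback described here can be sketched in Python (again not Forth), using the standard codec machinery as a stand-in for the conversion libraries mentioned. The function name encode_code_point is hypothetical, and "?" as the marker is simply what Python's errors="replace" handler emits when encoding.

```python
def encode_code_point(code_point: int, encoding: str = "latin-1") -> bytes:
    """Convert one Unicode code point to the encoding in use."""
    # errors="replace" substitutes "?" for characters that the target
    # encoding cannot represent.
    return chr(code_point).encode(encoding, errors="replace")


print(encode_code_point(0xE9))    # b'\xe9' (direct translation)
print(encode_code_point(0x20AC))  # b'?'    (no euro sign in Latin-1)
```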

Peter


Re: RfD: Escaped Strings version 4

Postby Peter Knaggs » Thu, 16 Aug 2007 07:47:37 GMT



Thanks, I have now removed PLACE and $, as they are no longer required.


The use of DECIMAL means that the # prefix is not required, I have 
removed it.



Agreed. Although if we are to consider \u and \U then escapes would have 
to be case sensitive.



This is an argument in favour of \u.



\0 is a side effect of allowing octal values. This naturally leads on to
the ability to specify characters in decimal \ddd or hex \0xhh. If we
are going to use number prefixes then we should use \#0 or \$00. I am
not suggesting this, as there is little point in having multiple methods
for entering specific character codes.



Good question, I would say upper-case as per the current document. If we 
change this to allow lower-case then so be it.



Good idea, done.


