regex: parsing out varying length substr from varying string


Post by Matt » Thu, 08 Dec 2005 12:02:03 GMT

I am trying to parse some HTML with a Perl script. The script is not
working correctly. The parsing portion of the script reads:

getDealers {
	my $URL = shift;
	my $fullString = get($URL);
	my $startIndex = 0;
	my $endIndex = 0;

	my $thisDealerURL;
	my $thisDealerTitle;

	# dealer URL
	my $urlMatch = 'Bookseller: <a href=\"/servlet/BookDetailsPL\?bi=\d\d\d\d\d\d\d\d\d&tab=1&searchurl=bx%3Doff%26sts%3Dt%26ds%3D100%26bi%3D0%26isbn%3D0805074562';
	my $endUrlMatch = "\"";
	my $urlPrepend = $baseURL . '/servlet/BookDetailsPL\?bi=\d\d\d\d\d\d\d\d\d&tab=1&searchurl=bx%3Doff%26sts%3Dt%26ds%3D100%26bi%3D0%26isbn%3D0805074562';

# dealer name
	my  $dealerMatch = ">";
	my  $endDealerMatch = "</A>";


	while (1) {
		print ".";
		# get the URL
		$startIndex = index($fullString,$urlMatch,$startIndex) + length($urlMatch);
		if ($startIndex == (-1 + length($urlMatch)) ) {
			print "\n";
			last; # break!
		}
		$endIndex = index($fullString,$endUrlMatch,$startIndex);
		$thisDealerURL = $urlPrepend . substr($fullString,$startIndex,($endIndex - $startIndex));
		$startIndex = $endIndex; # advance the starting index to where we stopped

		# get the title
		$startIndex = index($fullString,$dealerMatch,$startIndex) + length($dealerMatch);
		$endIndex = index($fullString,$endDealerMatch,$startIndex);
		$thisDealerTitle = substr($fullString,$startIndex,($endIndex - $startIndex));
		$startIndex = $endIndex;

}

---------------------------------------------

Here's a sample of the HTML:

<td class="bookseller" width="25%">Bookseller: <a
href="/servlet/BookDetailsPL?bi=602397062&tab=1&searchurl=bx%3Doff%26sts%3Dt%26ds%3D100%26bi%3D0%26isbn%3D%252F0805074562">Snowbound
Books</a><br/>
      <span class="scndInfo">(Marquette, MI, U.S.A.)</span></td>

<td class="bookseller" width="25%">Bookseller: <a
href="/servlet/BookDetailsPL?bi=527331473&tab=1&searchurl=bx%3Doff%26sts%3Dt%26ds%3D100%26bi%3D0%26isbn%3D%252F0805074562">LDS
Heritage Books</a><br/>
      <span class="scndInfo">(Bountiful, UT, U.S.A.)</span></td>

I want the script to put the dealer names (in this case, "Snowbound
Books" and "LDS Heritage Books") into the variable $thisDealerTitle. I
am pulling the HTML fine; when I ask it to print the source, it does
so. But running the above script returns "0 dealers loaded." Can anyone
see where I've gone wrong? Mere hints are appreciated.

thanks,
matt
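
One likely cause of the "0 dealers loaded" result, sketched under assumptions: index() searches for literal text and does not interpret regex escapes, so the `\d` and `\?` in a single-quoted search string are looked for as a backslash followed by a letter or question mark. Those never occur in the page, index() returns -1 on the first pass, and the loop exits at once. index() is also case-sensitive, so "</A>" will not find "</a>". The href in the example below is shortened for readability, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# One link from the sample HTML in the question (href shortened; an assumption).
my $html = 'Bookseller: <a href="/servlet/BookDetailsPL?bi=602397062&tab=1">Snowbound Books</a>';

# As a plain string, \? and \d are literal backslash sequences,
# and index() finds no such characters in the page.
my $literal = 'BookDetailsPL\?bi=\d\d\d';
my $pos = index($html, $literal);

# The same text used as a regex pattern does match, because \d now means "a digit".
my $matched = $html =~ /BookDetailsPL\?bi=\d\d\d/ ? 1 : 0;

print "index: $pos, regex: $matched\n";   # prints "index: -1, regex: 1"
```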


Re: regex: parsing out varying length substr from varying string

Post by Purl Gurl » Thu, 08 Dec 2005 12:51:05 GMT

Matt wrote:


(snipped - see original article)

Your code is performing a lot more work than needed. You can reduce
your code to just a few lines, maybe four to seven lines, at the most.

Below is a conceptual example; that is, an example which displays a
methodology, not a solution.

Parsing html is very challenging. For simple needs, you can often be
successful. For complex pages, you will rarely be successful.

The general rule is: do not mix regex matching and substring / index functions.
Use one, but not both. Regex matching for html documents is not a good
practice, unless needs are very simple. Substring and indexing are well
suited for pulling data out of an html page when there is a consistent format.
Pages of the type you are working with are often very consistent
in format. Markup and coding are of a template nature: predictable.

A frequently successful approach is to look for "common flags" throughout
your data. A flag, a marker, a begin and an end: these are pieces of identical
text at or near your data of interest. Look at these snippets of your data:

">Snowbound Books</a>

">LDS Heritage Books</a>

What is both common and predictable for both data sets?

"> </a>

Additionally, your data sets are separated by a single extra newline (\n);
in other words, what some, in error, call a "blank line" between data sets.

You have two markers, a begin and an end, for your data. Use those to find
your data by numerical position, then start taking slices. In this example,
I slice out what is not needed, then slice out what is needed. You will note I
start at the end of the data, then move backwards. Sometimes it is easier to start
at the beginning of the data and move forward.

The basic concept is that I take out what is not needed until the target data is easy to grab.
This involves taking out identical flags which are "in the way" of data recovery.
Put differently, peel the onion until you reach the heart.

$/ = "";

Set the record separator to paragraph mode; two or more consecutive newlines end a record.

$_ =~ tr/\n/ /;

Replace each newline with a space, so "words" do not run together. Extra spaces can be removed later.

substr ($_, rindex ($_, "</a>"), length ($_), "");

Slice off a big chunk from the end of the data to clean up. I am reverse indexing (rindex) to the last </a>,
which is the first one found when searching backwards, and setting a length greater than needed to slice off
everything from there to the _end_ of the data, regardless of actual length. This leaves the end of each
dealer name exposed; in other words, the last letter of each dealer name is now the end of the data, which
is my "true beginning point" for moving backwards.

print substr ($_, rindex ($_, ">") + 1), "\n";

Move backwards through data until the first > is found, move forward one character, capture and print.

This is a conceptual example. Rather than look for what you can "match" with a regex, look for
what is common, what is repetitive, what is predictable. For your data samples, there is a very
clear predictable pattern. Take advantage of that.

Only rule is, you must have uniform predictable data, or data which can be made so.

Piece of cake, yes? Just a matter of a little cleaning up, if needed; frosting for your cake.

Purl Gurl


#!perl

$/ = "";

while (<DATA>)
{
$_ =~ tr/\n/ /;
substr ($_, rindex ($_, "</a>"), length ($_), "");
print substr ($_, rindex ($_, ">
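
The listing above breaks off mid-statement. A runnable sketch of the same four steps follows, assuming the two sample records from the question as input. The function name `dealer_names`, the in-memory split on blank lines (standing in for paragraph mode on a filehandle), and the shortened href values are assumptions for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the last <a>...</a> link
# in each blank-line-separated record of $html.
sub dealer_names {
    my ($html) = @_;
    my @names;
    # Splitting on blank lines mimics reading with $/ = "" (paragraph mode).
    for my $record (split /\n{2,}/, $html) {
        $record =~ tr/\n/ /;                           # newlines become spaces
        my $end = rindex($record, "</a>");             # last closing tag
        next if $end < 0;                              # record holds no link
        substr($record, $end, length($record), "");    # cut "</a>" and all after it
        push @names, substr($record, rindex($record, ">") + 1);
    }
    return @names;
}

# Sample records from the question, with the href values shortened.
my $sample = <<'HTML';
<td class="bookseller" width="25%">Bookseller: <a
href="/servlet/BookDetailsPL?bi=602397062&tab=1">Snowbound
Books</a><br/>
      <span class="scndInfo">(Marquette, MI, U.S.A.)</span></td>

<td class="bookseller" width="25%">Bookseller: <a
href="/servlet/BookDetailsPL?bi=527331473&tab=1">LDS
Heritage Books</a><br/>
      <span class="scndInfo">(Bountiful, UT, U.S.A.)</span></td>
HTML

print "$_\n" for dealer_names($sample);
```

This only holds while each record ends its dealer link with the last </a> in the record; a record with a later link would yield that link's text instead.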

Re: regex: parsing out varying length substr from varying string

Post by Matt » Thu, 08 Dec 2005 13:19:33 GMT

Thank you--the "no mixing of regex/substr" guideline is very helpful.
The full HTML is a bit more complex than the above, but I think I can
work it out. I'll let you know what I come up with.


Re: regex: parsing out varying length substr from varying string

Post by Purl Gurl » Thu, 08 Dec 2005 13:31:26 GMT




No problem. I enjoy these types of articles. Solving this type of common
problem with little used methods presents a chance for all readers to
learn something, or affirm what they already know. All benefit.

Easy to remember trick for this methodology; look for what is predictable.

There is a gotcha! In time, the author of a page may change his format.
No doubt, that will toss a greasy monkey-wrench in your spinning gears.
If you are planning to pull this information on a regular basis, be sure to
check returns periodically.

This is a highly related article of mine, which employs the same methodology
but is very subtle; there are no "visual" flags with which to work. Looking for
the predictable in data, often is not what you see, but what you discern.

 http://www.**--****.com/ 

You could use the method exemplified in my cited article to "narrow" down
your data to single lines, very easily.

Look for the predictable, which is often quite invisible.

Purl Gurl

Re: regex: parsing out varying length substr from varying string

Post by usenet » Thu, 08 Dec 2005 14:16:06 GMT



Before you completely reinvent the wheel, you might check out:

   perldoc -q "How do I remove HTML from a string?"

and be sure to check out HTML::Parser on CPAN.


Re: regex: parsing out varying length substr from varying string

Post by Matt » Fri, 09 Dec 2005 06:42:13 GMT

Here's what I've got...the next step is to make it loop through the
entire page. Right now, I'm counting from the beginning of the HTML; I
need to make it advance $startindex to the next "Bookseller:"
occurrence (and likewise for $endindex). Or is the below code
loop-unfriendly?

Unlike the original HTML that I posted, there is lots of junk in
between each <td>. The junk is a repeating pattern, but the patterns I
use below are not repeated in the junk.

my $startsellertext = "Bookseller:";
my $startindex = index($fullString,$startsellertext);
my $endsellertext = "</a>";
my $endindex = index($fullString,$endsellertext);
my $sellertext = substr($fullString,$startindex,($endindex - $startindex));
my $sellername = substr($sellertext,138,(length($sellertext)-138));
print "seller name is $sellername";
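
A minimal sketch of the loop described above, assuming each dealer link follows a "Bookseller:" marker and that the name is the text between the last ">" and the next "</a>". The third argument to index() is the position to start searching from, which is what lets the search advance through the page. The helper name `seller_names` and the test page are hypothetical, and the fixed offset of 138 is swapped for an rindex() lookup so that links of differing lengths still work.

```perl
use strict;
use warnings;

# Hypothetical helper: collects the link text after every "Bookseller:" marker.
sub seller_names {
    my ($html) = @_;
    my @names;
    my $pos = 0;
    while (1) {
        my $start = index($html, "Bookseller:", $pos);
        last if $start < 0;                       # no further marker
        my $end = index($html, "</a>", $start);
        last if $end < 0;                         # marker without a closing tag
        # The name begins just after the last ">" that precedes "</a>".
        my $name_start = rindex($html, ">", $end) + 1;
        my $name = substr($html, $name_start, $end - $name_start);
        $name =~ s/\s+/ /g;                       # collapse line breaks inside the name
        push @names, $name;
        $pos = $end + length("</a>");             # resume searching after this link
    }
    return @names;
}

# Made-up page with unrelated markup between the dealer cells.
my $page = join "\n",
    '<p>unrelated markup</p>',
    '<td>Bookseller: <a',
    'href="/servlet/BookDetailsPL?bi=602397062">Snowbound',
    'Books</a><br/></td>',
    '<div>more unrelated markup</div>',
    '<td>Bookseller: <a href="/servlet/BookDetailsPL?bi=527331473">LDS Heritage Books</a></td>';

print "seller name is $_\n" for seller_names($page);
```

If the page format changes so that the marker or the closing tag is no longer found, the function returns an empty list, which is easy to check for before using the results.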


