Best way to parse HTML



Postby JF Mezei » Thu, 31 Dec 2009 15:46:28 GMT

I obtain an HTML file (via curl) that has (somewhere in it) the following:

From that, I need to have a variable that has:


On VMS, I have a command procedure that reads every line until it hits
both "<img" and "size50" on the same line. Then it reads until it finds
"src", and then it extracts the string based on the location of the
first "=" and "&".
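The line-by-line procedure described above can be sketched in Python roughly as follows. This is only one reading of the description: the function name `extract_src_fragment` and the sample markup in the usage note are made up for illustration, and the code assumes the wanted string sits between the first "=" after "src" and the next "&".

```python
def extract_src_fragment(lines):
    """Mimic the described scan: find the <img ... size50 line, then 'src'."""
    found_img = False
    for line in lines:
        if not found_img:
            # Step 1: wait for a line holding both "<img" and "size50".
            if "<img" in line and "size50" in line:
                found_img = True
            else:
                continue
        # Step 2: from that line onward, look for "src".
        pos = line.find("src")
        if pos == -1:
            continue
        # Step 3: take the text between the first "=" and the next "&".
        eq = line.find("=", pos)
        if eq == -1:
            return None
        amp = line.find("&", eq + 1)
        if amp == -1:
            return None
        # Drop surrounding quotes and spaces (an assumption about the markup).
        return line[eq + 1:amp].strip("\"' ")
    return None
```

Given a hypothetical line such as `src="http://example.invalid/pic.gif&w=50">` following the matching `<img` line, this returns `http://example.invalid/pic.gif`. It returns `None` when nothing matches, so a background job can skip that cycle instead of failing.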

What would be the recommended tools to use on Unix to perform this?
(This is to be a job that runs in the background 24/7 on an Xserve.)

Is AWK the proper (only?) tool to do this ?

This job will use curl to obtain the HTML and the .gif file from a remote
source, then use ImageMagick to process the image (crop and add text to it).
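As a rough sketch of how such a job might drive curl and ImageMagick from a script, the Python below only builds and runs the command lines. The helper names, the crop geometry, the caption placement, and the file names are all hypothetical; the flags used (`curl --silent --fail --output`, and `convert -crop`, `+repage`, `-gravity`, `-annotate`) are standard ones, but the values would need tuning for the real image.

```python
import subprocess

def curl_command(url, dest):
    """Command line that downloads url to dest, failing on HTTP errors."""
    return ["curl", "--silent", "--fail", "--output", dest, url]

def convert_command(src, dest, crop, caption):
    """Command line that crops src and writes caption near the bottom edge."""
    # Newer ImageMagick releases (7.x) prefer "magick" over "convert".
    return [
        "convert", src,
        "-crop", crop, "+repage",      # crop, then reset the canvas offset
        "-gravity", "South",           # anchor the caption at the bottom
        "-annotate", "+0+5", caption,  # draw the text 5 pixels up from the edge
        dest,
    ]

def run_all(commands, run=subprocess.run):
    """Run each command in order; check=True raises if one of them fails."""
    for cmd in commands:
        run(cmd, check=True)
```

A single pass might then look like `run_all([curl_command(gif_url, "in.gif"), convert_command("in.gif", "out.gif", "200x100+0+0", "caption text")])`, repeated on whatever schedule the job needs (for example from cron or launchd).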

Re: Best way to parse HTML

Postby VAXman- » Thu, 31 Dec 2009 21:09:23 GMT

In article <00993738$0$16793$ XXXX@XXXXX.COM >, JF Mezei < XXXX@XXXXX.COM > writes:


I have some PHP code on a web site that needs to get such information upon
request from other web sites.  I used an HTTPRequest.class.php library I'd
found on the web (with a few mods as it was buggy) to access and pass data
through to other PHP functions for parsing.  If you're open to using PHP,
I can let you have that class lib.

VAXman- A Bored Certified VMS Kernel Mode Hacker    VAXman(at)TMESIS(dot)ORG

  "Well my son, life is like a beanstalk, isn't it?"

Re: Best way to parse HTML

Postby Tom Harrington » Fri, 01 Jan 2010 02:25:14 GMT

In article <00993738$0$16793$ XXXX@XXXXX.COM >,

No, Mac OS X comes with many scripting languages that would be up to the 
task.  If it were me I'd use Perl, but only because it's the scripting 
language I know best.  Python, Ruby, and others would work just as well.

Most, maybe all, of these languages have add-ons that will simplify 
parsing the HTML, so your script can be a little more intelligent than 
just looking at one line after another until it sees something that 
looks like it's probably right.  If the HTML is proper XHTML then you 
may be able to just look up the URL directly via DOM-style XML access.
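As an illustration of the parser-based approach, here is a minimal sketch using Python's standard `html.parser` module. It assumes "size50" is a value in the image's `class` attribute, which the thread does not confirm. The class name `ImgSrcFinder`, the helpers `find_img_src` and `query_value`, and the sample URL are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs

class ImgSrcFinder(HTMLParser):
    """Remember the src of the first <img> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag != "img" or self.src is not None:
            return
        attr = dict(attrs)
        # A class attribute may list several space-separated names.
        classes = (attr.get("class") or "").split()
        if self.css_class in classes:
            self.src = attr.get("src")

def find_img_src(html, css_class="size50"):
    """Return the matching image's src, or None if there is no match."""
    finder = ImgSrcFinder(css_class)
    finder.feed(html)
    return finder.src

def query_value(url, name):
    """Return the first value of a query-string parameter, or None."""
    values = parse_qs(urlsplit(url).query).get(name)
    return values[0] if values else None
```

For a tag like `<img class="size50" src="http://example.invalid/view.cgi?base=abc123&amp;w=50">`, `find_img_src` returns the URL with `&amp;` decoded, and `query_value(url, "base")` gives `abc123`. Unlike line scanning, this does not depend on how the markup is split across lines.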

Tom "Tom" Harrington
Independent Mac OS X developer since 2002

Re: Best way to parse HTML

Postby AES » Fri, 01 Jan 2010 10:10:25 GMT

In article <tph-43AD81.10251430122009@localhost>,

Would BBEdit, working on the text file, be a candidate?  (Seems to have 
good GREP and HTML tools and macro capabilities)

Re: Best way to parse HTML

Postby Warren Oates » Fri, 01 Jan 2010 10:30:28 GMT

In article <tph-43AD81.10251430122009@localhost>,

Exactly. And you can do it in Javascript, which is built into every 
browser on every platform (modern platforms, modern browsers, of course).


Take a look at the DOM examples; getElementById is your friend.

Very old woody beets will never cook tender.
  -- Fannie Farmer

Re: Best way to parse HTML

Postby Tom Harrington » Sat, 02 Jan 2010 02:56:04 GMT

In article < XXXX@XXXXX.COM >,

Maybe, I've never tried to use it for scripting.

Tom "Tom" Harrington
Independent Mac OS X developer since 2002

Re: Best way to parse HTML

Postby Richard.Williams.20 » Tue, 12 Jan 2010 23:43:54 GMT

You want to extract

from a web page ? Are you allowed to use scripting ? This would be
very quick in biterscripting.

var str html
cat " http://www.**--****.com/ ">> $html
stex -c -r "^base=&\&^" $html

You can try it by saving the script in some file and executing the
following in biterscripting.

script "/path/to/saved script"
