Google Blogoscoped

Screen scraping with PHP

Justin Flavin [PersonRank 10]

Thursday, June 24, 2004

If you are on PHP 4 and don't want to move to PHP 5 just yet, you can grab news headlines from RSS feeds for display on your site using the following two files. It's a snap to set up and run:

---------------------------------------------------------------------
rdf_parse.inc :

<?php
   $_item = array();
   $_depth = array();
   $_tags = array("dummy");
   # "dummy" prevents unnecessary subtraction in the $_depth indexes

   function initArray() {
   global $_item;

   $_item = array ("TITLE"=>"", "LINK"=>"", "DESCRIPTION"=>"", "URL"=>"");
   }

   function startElement($parser, $name, $attrs) {
   global $_depth, $_tags, $_item;

   if (($name=="ITEM")||($name=="CHANNEL")||($name=="IMAGE")) {
   initArray();
   }
   $_depth[$parser]++;
   array_push($_tags, $name);
   }

   function endElement($parser, $name) {
   global $_depth, $_tags, $_item;
  
   array_pop($_tags);
   $_depth[$parser]--;
   switch ($name) {
   case "ITEM":
   echo "<br><A HREF=\"$_item[LINK]\">$_item[TITLE]</A>";
   echo "<br>".substr($_item["DESCRIPTION"],0,250)."...<br/>\n";
   initArray();
   break;

   case "IMAGE":
   # echo "<A HREF=\"$_item[LINK]\"><IMG SRC=\"$_item[URL]\" ALT=\"$_item[TITLE]\" BORDER=0></A>\n<BR>\n";
   initArray();
   break;

   case "CHANNEL":
   # echo "<h3>$_item[TITLE]</h3>\n";
   initArray();
   break;
   }
   }

   function parseData($parser, $text) {
   global $_depth, $_tags, $_item;

   $crap = preg_replace("/\s/", "", $text);
   # is the data just whitespace?
   # if so, we don't want it!

   # compare against "" so that a text of "0" is not thrown away
   if ($crap !== "") {
   $text = preg_replace("/^\s+/", "", $text);
   # get rid of leading whitespace
   if ($_item[$_tags[$_depth[$parser]]]) {
   $_item[$_tags[$_depth[$parser]]] .= $text;
   } else {
   $_item[$_tags[$_depth[$parser]]] = $text;
   }
   }
   }

   function parseRDF($file) {
   global $_depth, $_tags, $_item;

   $xml_parser = xml_parser_create();
   initArray();

   # Set up event handlers
   xml_set_element_handler($xml_parser, "startElement", "endElement");
   xml_set_character_data_handler($xml_parser, "parseData");

   # Open up the file
   $fp = fopen($file, "r") or die("Could not open $file for input");

   while ($data = fread($fp, 4096)) {
   if (!xml_parse($xml_parser, $data, feof($fp))) {
   die(sprintf("XML error: %s at line %d",
   xml_error_string(xml_get_error_code($xml_parser)),
   xml_get_current_line_number($xml_parser)));
   }
   }

   fclose($fp);
   xml_parser_free($xml_parser);
   }

function write_rdf ($remote,$rdffile) {
// check if file exists - if not, create it
if (!file_exists($rdffile))
{
$handle=fopen ($rdffile,'w') or die("Cannot create $rdffile");
fclose($handle);
}

// get the date of the file; "Ymd" values compare correctly in date order
clearstatcache();
$rdf_file_date=date("Ymd",filemtime($rdffile));
$current_date=date("Ymd",time());

if (($rdf_file_date<$current_date) || (filesize($rdffile) == 0)) {
   $RDF = fopen($rdffile, "w") or die("Cannot open $rdffile");
   $FILE = fopen($remote, "r") or die("Cannot open $remote");
   while (!feof($FILE)) {
   fwrite($RDF, fgets($FILE, 1024));
   }
   fclose($RDF);
   fclose($FILE);
}
}

?>
-----------------------------------------------------------
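The handlers above follow PHP's event-based XML parser model: one callback fires when an element opens, one when it closes, and one for the text in between. A minimal sketch of the same idea that works on a string and returns the items instead of echoing them might look like this. It is written for PHP 7.1 or later rather than PHP 4, and collectItems() is a hypothetical name used only for illustration.

```php
<?php
// Sketch: collect item titles and links from an RSS string with PHP's
// event-based XML parser. collectItems() is a hypothetical helper.
function collectItems(string $xml): array
{
    $items = [];      // finished items
    $current = null;  // item being built, or null outside <item>
    $tag = '';        // name of the element currently open

    $parser = xml_parser_create();
    // Case folding is on by default, so element names arrive upper-cased.
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$current, &$tag) {
            if ($name === 'ITEM') {
                $current = ['TITLE' => '', 'LINK' => ''];
            }
            $tag = $name;
        },
        function ($p, $name) use (&$items, &$current, &$tag) {
            if ($name === 'ITEM' && $current !== null) {
                // Trim once the whole item has been read.
                $items[] = array_map('trim', $current);
                $current = null;
            }
            $tag = '';
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $text) use (&$current, &$tag) {
            // Character data can arrive in several chunks, so append.
            if ($current !== null && isset($current[$tag])) {
                $current[$tag] .= $text;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        $error = xml_error_string(xml_get_error_code($parser));
        throw new RuntimeException("XML error: $error");
    }
    return $items;
}
```

Calling collectItems() on the contents of a feed file gives an array of items with TITLE and LINK keys, which can then be printed however the page needs.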

news.php :
<?php
$newsfeedArray[0]=array(
"title"=>"Yahoo Science",
"localfile"=>"yahoo-science.rss",
"remotefile"=>"http://rss.news.yahoo.com/rss/science"
);

$newsfeedArray[1]=array(
"title"=>"Slashdot",
"localfile"=>"slashdot.rss",
"remotefile"=>"http://slashdot.org/slashdot.rdf"
);
include("rdf_parse.inc");

foreach ($newsfeedArray as $news)
{
print "<hr>";
print "<b>".$news["title"]."</b><br>";
write_rdf($news["remotefile"],$news["localfile"]);
parseRDF($news["localfile"]);
}

?>

---------------------------------------------------------------------
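write_rdf only fetches a feed again when the local copy is empty or was last written on an earlier day. If a feed should be refreshed more often than that, an age-based check is one option. The sketch below assumes PHP 7.1 or later; isStale() and the one-hour default are illustrative assumptions, not part of the two files above.

```php
<?php
// Sketch: decide whether a cached feed file needs refreshing.
// isStale() and the one-hour default are illustrative assumptions.
function isStale(string $file, int $maxAgeSeconds = 3600, ?int $now = null): bool
{
    $now = $now ?? time();
    clearstatcache(true, $file);   // avoid cached stat results
    if (!file_exists($file) || filesize($file) === 0) {
        return true;               // missing or empty cache
    }
    // Stale when the file was last modified longer ago than allowed.
    return ($now - filemtime($file)) > $maxAgeSeconds;
}
```

The optional $now parameter is only there so the check can be tried out with a fixed clock; in normal use it can be left out.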

If you get permission problems, do this:

touch slashdot.rss (creates a blank file)
chown apache rss

(If your server runs as a different user, say www-server, just do chown www-server rss.)
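To see which file is causing the permission problem, a small check run from the web server can help. This is only a sketch for PHP 7.1 or later; cacheProblem() is a hypothetical helper, not part of the two files above.

```php
<?php
// Sketch: report why a cache file cannot be written, or null if it can.
// cacheProblem() is a hypothetical helper used for illustration.
function cacheProblem(string $file): ?string
{
    if (file_exists($file)) {
        return is_writable($file)
            ? null
            : "$file is not writable by the web server user";
    }
    // The file does not exist yet, so the directory must allow creating it.
    $dir = dirname($file);
    return is_writable($dir)
        ? null
        : "cannot create $file: directory $dir is not writable";
}
```

Printing the result for each "localfile" entry in news.php shows which files still need touch and chown.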

Now run news.php in your browser again; you should see the hyperlinked news headlines appear.

Regards,
justin
http://linuxnotes.blogspot.com
