fluidthoughts developers' guild

Intro to web scraping with PHP

Imitation is the sincerest form of flattery

In this fast-paced modern world where information is power, being able to use constantly changing content can be essential. I have used PHP to read in content from many dynamic external sources, and it's not terribly difficult.

On the other hand, there are some restrictions, and legal opinions conflict on whether you have the right to use other people's content.

Step A: reading in foreign content

For the simplest example, I pull in the daily weather forecast while ignoring all the other junk on the page. I use fopen to open an HTTP 1.0 connection to the server, read the page line by line, find where the important stuff begins and ends, then print out the pertinent content.

Examples:

wunderground forecast: 48103
My stripped, lynx friendly rendition

Sample Code:

<html>
<head>
    <title>Ann Arbor Weather</title>
    <link rel="stylesheet" href="styles.css" type="text/css">
</head>
<body bgcolor="#ffffff">

<p><a href="http://www.crh.noaa.gov/forecasts/MIZ075.php">
http://www.crh.noaa.gov/forecasts/MIZ075.php</a></p>

<table cellpadding="7" cellspacing="0" border="0">
<tr>
    <td valign="top">
        <img src="http://www.bmcmedia.net/webcam/bmccam.jpg"
            width="352" height="288" alt="" border="0" />
        <br /><br />
        <img src="http://weather.yahoo.com/images/northeast_sat_440x297.jpg"
            width="440" height="297" alt="" border="0" />
    </td>
    <td>
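<?php
    // Forecast page to scrape; the ZIP code is passed in the query string.
    $src = 'http://www.wunderground.com/cgi-bin/findweather/getForecast?query=48103';

    $stop  = 0;  // set to 1 once we are past the content we want
    $start = 1;  // inverted sense: 1 = still waiting, 0 = inside the forecast

    $fp = fopen($src, "r");
    while ($fp && !feof($fp) && !$stop)
    {
        $line = fgets($fp, 4096);

        // Either marker signals the beginning of the forecast text.
        if (preg_match("/Nowcast as of/", $line)) { $start = 0; }
        if (preg_match("/Forecast for Washtenaw/", $line)) { $start = 0; }

        // Markers that signal the end of the forecast text.
        if (preg_match("/Air Pollution/", $line)) { $stop = 1; }
        if (!$start && preg_match("/smalltableheader/", $line))
        { $stop = 1; }

        if (!$start)
        {
            if (preg_match("/<table /", $line))
            { $stop = 1; }
            elseif (preg_match("/<\/?table[^>]*>/", $line)) { ; }
            else
            {
                // Drop images, turn table cells into paragraphs,
                // and remove leftover layout tags.
                $line = preg_replace("/<img src[^>]*>/", '', $line);
                $line = preg_replace("/<(\/)?td[^>]*>/", '<$1p>', $line);
                $line = preg_replace("/<\/?(tr[^>]*|font|center)>/", '', $line);
                $line = preg_replace("/<p><\/p>/", '', $line);
                echo $line;
            }
        }
    }
    if ($fp) { fclose($fp); }
?>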
    </td>
</tr>
</table>

</body>
</html>
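
The start/stop marker logic can also be separated from the network read, which makes it easier to try out against a saved copy of a page. The following is a minimal sketch, not the code used on this site: the function name extract_between and the sample lines are hypothetical, chosen only for illustration, and this version skips the marker lines themselves.

```php
<?php
// Hypothetical helper: return the lines that follow the first line matching
// $startPattern, stopping before the first later line matching $stopPattern.
function extract_between(array $lines, $startPattern, $stopPattern)
{
    $inside = false;
    $kept = array();

    foreach ($lines as $line) {
        if (!$inside) {
            // Still scanning for the start marker; the marker line is skipped.
            if (preg_match($startPattern, $line)) { $inside = true; }
            continue;
        }
        if (preg_match($stopPattern, $line)) { break; }
        $kept[] = $line;
    }
    return $kept;
}

// Made-up input standing in for lines read with fgets().
$page = array(
    "<p>site navigation</p>",
    "<b>Nowcast as of 9:00 AM</b>",
    "<td>Cloudy with a chance of snow.</td>",
    "<td>Highs near 30.</td>",
    "<b>Air Pollution</b>",
    "<p>footer</p>",
);

foreach (extract_between($page, "/Nowcast as of/", "/Air Pollution/") as $line) {
    echo strip_tags($line), "\n";
}
```

Keeping the markers as parameters means that when the remote page changes its layout, only the two patterns need to be updated.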

Step B: processing what you need

A similar strategy can be applied to stock quotes. The task here is to get a pure number for a specific ticker symbol. Again, regular expressions are your friend. I've been pulling my own reports from Yahoo's finance pages.

The next example reads the ticker symbol from a form field named symbol, submitted via POST:


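<?php
    // Ticker symbol from the submitted form; urlencode() keeps it safe inside the URL.
    $symbol = isset($_POST['symbol']) ? $_POST['symbol'] : '';
    $src = "http://finance.yahoo.com/q?s=" . urlencode($symbol) . "&d=v1";

    $found = 0;

    echo "<p>\n";
    $fp = fopen($src, "r");
    while ($fp && !feof($fp) && !$found)
    {
        $line = fgets($fp, 4096);

        // The quote row is identified by its link back to the symbol's page.
        if (preg_match("/<font face=arial size=-1><a href=\"\/q\?s=/", $line))
        {
            $found = 1;
            // Split the row into cells, then strip the remaining tags from each.
            $Pieces = preg_split("/<\/td>/", $line);
            $Pieces = preg_replace("/<[^>]*>/", "", $Pieces);
            echo "<p>$Pieces[0]: $Pieces[2]</p>\n";
        }
    }
    if ($fp) { fclose($fp); }

    if (!$found) { echo "<p>Ticker symbol not found</p>\n"; }
    echo "</p>\n";

?>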
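
Once the matching row is found, the "pure number" still has to be pulled out of the markup. A minimal sketch of that step on its own follows. The function name parse_quote_row and the sample row are hypothetical, and the cell layout (symbol in the first cell, price in the third) is an assumption taken from the indexes used in the example, not from Yahoo's current markup.

```php
<?php
// Hypothetical helper: split a table row into cells, strip the markup,
// and return array(symbol, price), or null if the row does not fit.
function parse_quote_row($row)
{
    $cells = preg_split("/<\/td>/", $row);
    $cells = array_map('trim', preg_replace("/<[^>]*>/", "", $cells));

    if (count($cells) < 3) { return null; }

    // Remove thousands separators before checking for a number.
    $price = str_replace(",", "", $cells[2]);
    if (!is_numeric($price)) { return null; }

    return array($cells[0], (float) $price);
}

// Made-up row for illustration only.
$row = '<td><font face=arial size=-1><a href="/q?s=IBM">IBM</a></font></td>'
     . '<td>4:01PM</td><td><b>1,093.25</b></td><td>+0.57</td>';

$quote = parse_quote_row($row);
if ($quote !== null) {
    printf("%s: %.2f\n", $quote[0], $quote[1]);
}
```

Returning null for rows that do not fit lets the caller fall back to a "not found" message instead of printing half-parsed markup.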

Step C: Circumventing protection schemes

Google frowns upon "Automated Querying" and will block attempts to use fopen as demonstrated above. One way to get around this is to use cURL, PHP's Client URL Library functions. The search in the navigation strip on this site was implemented with a cURL call.

The functions used were:

  • curl_init
  • curl_setopt with parameters CURLOPT_URL, CURLOPT_HEADER, and CURLOPT_RETURNTRANSFER
  • curl_exec
  • curl_close
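
Put together, the calls listed above look roughly like this. It is a sketch of the general pattern, not the code that ran on this site: the function name fetch_page is hypothetical, and it assumes PHP's cURL extension is installed.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the response body
// as a string, or false on failure.
function fetch_page($url)
{
    $ch = curl_init();
    if ($ch === false) { return false; }

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);         // leave response headers out of the result
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it

    $body = curl_exec($ch);
    curl_close($ch);

    return $body;
}
```

Calling fetch_page with a page's URL gives back its HTML as one string, which can then be split into lines and run through the same preg_match and preg_replace steps used with fopen.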

Unfortunately, Google has since changed its blocking strategy to defeat the above procedure, and I've once again had to disable the search on this site. Oh well, I guess you can always find what you want here.

 

$Id: content_syndication.html,v 1.7 2005/02/17 20:01:18 willn Exp $