How to extract URLs from Delicious: Web Scraping or Data Extraction Techniques using PHP



There are numerous techniques for extracting data from the web. One powerful tool is Web-Harvest (http://web-harvest.sourceforge.net/), a Java- and XML-based extraction system: you write XML configuration files that describe exactly what you want to extract and how. Its documentation is thin, though, so I'll probably write about it some other time in another post. For now, I'll discuss how I often extract data from search engines and sites like Delicious using PHP.

In this example, I extract the link and title of every Delicious bookmark tagged with a given keyword, say, all the links tagged with "PHP". I started this project as an attempt to pick interesting links at random from Delicious, Google, etc. and then display their content after removing the HTML tags.

I'll go step by step here:

Step 1: This function fetches the complete Delicious page of links tagged with a given keyword.

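<?php

// Fetch one page of Delicious search results for $keyword and return the
// extracted "URL----title" lines, or "" if nothing could be fetched or parsed.
function getDelicious($keyword, $page_num){

    // Encode only the keyword; encoding the whole URL would break it.
    $odpurl = "http://del.icio.us/search/?all=".urlencode($keyword)."&page=".(int)$page_num;

    // file_get_contents() reads the whole page in one call (it needs
    // allow_url_fopen), so a separate fopen()/fclose() pair is not required.
    $string1 = @file_get_contents($odpurl);

    if ($string1 === false) { return ""; }

    $result = parseDelicious($string1);

    if (($result == "") || ($result == NULL)) { return ""; } else { return $result; }

}
?>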

The function above fetches the complete Delicious page and stores it in $string1, so I can pull pages from http://del.icio.us/ for any keyword, say PHP, Digg, Bush or Humour. Next, I can move on to extracting the URLs from the fetched page.
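Keywords that contain spaces or ampersands need escaping before they go into the query string, but only the keyword should be encoded: encoding the whole URL would also escape the "://" and "?" that must stay intact. Here is a minimal sketch of building the request URL, assuming the same search endpoint as in Step 1. buildDeliciousUrl is a hypothetical helper name used only for illustration:

```php
<?php

// Hypothetical helper: build the search URL requested in Step 1.
// Only the keyword is URL-encoded; the page number is cast to an integer.
function buildDeliciousUrl($keyword, $page_num){
    return "http://del.icio.us/search/?all=".urlencode($keyword)."&page=".(int)$page_num;
}

echo buildDeliciousUrl("php tutorial", 2), "\n";
// prints http://del.icio.us/search/?all=php+tutorial&page=2
```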

Step 2: This function extracts the URLs and titles from the fetched Delicious page.

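<?php

// Extract "URL----title" pairs from a fetched Delicious page.
// Prints each pair and returns all of them as one string.
function parseDelicious($string1){

    $listUrl = "";

    // Convert the HTML page (fetched from the web) into an array of tokens by splitting it on spaces.
    $ArrayText = explode(" ", $string1);
    $total = count($ArrayText);

    // Iterate through each token of the HTML page.
    for($i=0; $i<$total; $i++){

        // Ignore internal URLs containing "http://del.icio.us" or "http://blog.del.icio.us"
        // and advertisement links containing "overture.com"; keep tokens containing href="http://
        if ( !(strstr($ArrayText[$i], "http://del.icio.us")) && !(strstr($ArrayText[$i], "overture.com")) && !(strstr($ArrayText[$i], "http://blog.del.icio.us")) &&
            (strstr($ArrayText[$i],"href=\"http://")))
        {

            // Skip past the 6 characters of href=" and keep everything up to the closing quote.
            $piece = substr($ArrayText[$i], strpos($ArrayText[$i], "href=\"") + 6);
            $piece1 = substr($piece, 0, strpos($piece, "\""));
            // $piece1 now has the URL extracted. Now, we shall extract the title also.

            $j=1;
            $url_title = "";
            $end_found = false;
            // Collect tokens up to the one containing </a>; isset() stops the loop
            // at the end of the page instead of running forever.
            while (!$end_found && isset($ArrayText[$i+$j])){
                $end_found = strstr($ArrayText[$i+$j], "</a>");

                if ($j == 1){
                    // The token after the URL starts with 15 characters of markup
                    // (presumably rel="nofollow">), which are skipped.
                    $url_title_rel = substr($ArrayText[$i+1], 15);
                }else {
                    $url_title_rel = $ArrayText[$i+$j];
                }
                if ($end_found){
                    // Last token of the title: keep the text before the closing tag.
                    $url_title .= substr($url_title_rel, 0, strpos($url_title_rel ,"<"));
                }else{
                    $url_title .= $url_title_rel." ";
                }
                $j++;
            }

            // $url_title now has the title of the URL

            $line = $piece1."----".$url_title."<br> ";
            echo $line; // printing extracted URL and title from delicious
            $listUrl .= $line; // append, so the return value holds every link on the page
        }

    }
    return $listUrl;

}

?>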

In the function above, I explode (that is, split) the fetched Delicious HTML page on spaces and store the pieces in an array. I can then walk through the tokens one by one to find the relevant information, in this case the URL and the title.
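To see why the fixed offsets in Step 2 work, it helps to look at the tokens for a single link. The sketch below runs explode() on one hand-written anchor tag. The markup is an assumption about what a Delicious result link looked like (href first, then rel="nofollow"), not a copy of a real page:

```php
<?php

// Assumed shape of one result link; the real page markup may differ.
$sample = '<a href="http://www.php.net/" rel="nofollow">PHP: Hypertext Preprocessor</a>';

$tokens = explode(" ", $sample);
// $tokens is now:
//   [0] <a
//   [1] href="http://www.php.net/"
//   [2] rel="nofollow">PHP:
//   [3] Hypertext
//   [4] Preprocessor</a>

// Token 1 carries the URL: drop the 6 characters of href=" and cut at the closing quote.
$piece = substr($tokens[1], 6);
$url = substr($piece, 0, strpos($piece, '"'));

// Token 2 starts with the 15 characters of rel="nofollow">, followed by the first title word.
$first_word = substr($tokens[2], 15);

echo $url, "\n";        // http://www.php.net/
echo $first_word, "\n"; // PHP:
```

Because the offsets are tied to this exact attribute order, a page that puts rel before href, or uses a different rel value, would need different numbers.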

Step 3: To extract all the links on Delicious that are tagged with PHP, I can simply iterate over the result pages as follows:

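<?php

// Fetch result pages 1, 2, 3, ... until a page yields no links.
$i = 1;

while (true){

    $cont = getDelicious('php', $i);

    // An empty result means there are no more pages (or the fetch failed).
    if ($cont == "") { break; }

    $i++;

}

?>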
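
The loop above stops only when a page yields no links, so it could run for a long time on a popular tag. One possible variation, sketched under assumptions, caps the number of pages. collectPages and $max_pages are hypothetical names, and the fetcher is passed in as a callable so that getDelicious from Step 1 (or a stub, as here) can be plugged in:

```php
<?php

// Hypothetical helper: call $fetch_page($keyword, $page) for pages 1, 2, ...
// until it returns "" or $max_pages pages have been read.
// Returns the non-empty page results in order.
function collectPages($fetch_page, $keyword, $max_pages){
    $pages = array();
    for ($i = 1; $i <= $max_pages; $i++){
        $cont = call_user_func($fetch_page, $keyword, $i);
        if ($cont == "") { break; }
        $pages[] = $cont;
    }
    return $pages;
}

// Stub fetcher standing in for getDelicious(): pretends only two pages exist.
function fakeFetch($keyword, $page_num){
    return ($page_num <= 2) ? "results for ".$keyword." page ".$page_num : "";
}

$pages = collectPages('fakeFetch', 'php', 10);
echo count($pages), "\n"; // 2
```

With the real fetcher, the call would be collectPages('getDelicious', 'php', 10).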


Finally, the output for the URLs from Delicious tagged with PHP on the first page is:

http://www.goodphptutorials.com/---GoodPHPTutorials.com - 40 PHP Tutorials
http://www.php.net/---PHP: Hypertext Preprocessor
http://www.phpfreaks.com/---PHP Help: PHP Freaks!
http://www.symfony-project.com/---symfony - open-source PHP5 web framework
http://www.phpbuilder.com/columns/vaska20050722.php3?aid=948---PHPBuilder.com, the best resource for PHP tutorials, templates, PHP manuals, content management systems, scripts, classes and more.
http://www.cakephp.org/---CakePHP : the rapid development php framework
http://www.phpit.net/article/ten-different-php-frameworks/---PHPit - Totally PHP » Taking a look at ten different PHP frameworks
http://www.modernmethod.com/sajax/---SAJAX - Simple Ajax Toolkit by ModernMethod - XMLHTTPRequest Toolkit for PHP
http://www.php.net/manual/en/---PHP: PHP Manual - Manual
http://cakephp.org/---CakePHP : the rapid development php framework
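
Splitting on spaces depends on the exact attribute order and spacing of the page, so it can break when the markup changes. As an alternative technique (not the method used in Step 2), PHP's DOM extension can parse the HTML and hand back every anchor's href and text. This is a minimal sketch, assuming the DOM extension is enabled. extractLinks is a hypothetical name and the sample HTML is made up for illustration:

```php
<?php

// Hypothetical helper: return array('url' => ..., 'title' => ...) entries for every
// external http:// link in $html. Links whose URL contains del.icio.us or
// overture.com are skipped (a slightly broader filter than the one in Step 2).
function extractLinks($html){
    $links = array();
    if ($html === "") { return $links; } // loadHTML() rejects an empty string

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; @ suppresses the parser warnings.
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('a') as $a){
        $url = $a->getAttribute('href');
        if (strpos($url, "http://") !== 0) { continue; }
        if (strpos($url, "del.icio.us") !== false || strpos($url, "overture.com") !== false) { continue; }
        $links[] = array('url' => $url, 'title' => trim($a->textContent));
    }
    return $links;
}

$html = '<html><body>'
      . '<a href="http://del.icio.us/help">help</a>'
      . '<a rel="nofollow" href="http://www.php.net/">PHP: Hypertext Preprocessor</a>'
      . '</body></html>';

foreach (extractLinks($html) as $link){
    echo $link['url'], "----", $link['title'], "\n";
}
// prints http://www.php.net/----PHP: Hypertext Preprocessor
```

Note that the second sample link puts rel before href, a case the fixed offsets of the space-splitting approach would not handle.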





Hi sir,
The method you implemented to extract the URLs is very useful. However, we also want to print the tags related to each URL. Could you please help us extract the related tags along with the URLs?