Sunday, March 4, 2018

Extract URL Contents With PHP and jQuery

How do you extract the contents of a URL? This tutorial shows how to fetch the title, description, and image of any URL, the way sites such as Facebook, Twitter, and Google do when they preview a link.

We will create the following files:

  1. index.php: contains the HTML form used to submit a URL for extraction.
  2. extract-contents.php: contains the code that fetches the required data from the submitted URL.
  3. javascript.js (in the js folder): contains the code that sends the AJAX request to extract-contents.php.
  4. style.css (in the css folder): contains the styles for the HTML page and the URL data box.
 

index.php

<!DOCTYPE html>
<html>
    <head>
        <title>Extract URL Contents with PHP and jQuery Demo</title>
        <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
        <script type="text/javascript" src="js/jquery-3.1.1.min.js"></script>
        <script type="text/javascript" src="js/javascript.js"></script>
        <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
        <link rel="stylesheet" href="css/style.css" />
    </head>
    <body>
        <div class="main-container">
            <div class="extract-wrapper section">
                <label>Enter an absolute URL like http://www.codestacked.info</label>
                <form class="url-form">
                    <div class="fields-container">
                        <div class="loader">
                            <i class="fa fa-spinner fa-spin"></i>
                        </div>
                        <input type="url" class="form-control url-input" value="" required="required" placeholder="Enter a URL to extract contents" />
                        <button type="submit">Extract</button>
                    </div>
                </form>
                <div class="content-wrapper" id="content-wrapper"></div>
            </div>
        </div>
    </body>
</html>

In extract-contents.php we first validate the submitted URL against a regular expression. If the URL is valid, we fetch its contents, create a new DOMDocument, and load the fetched content into it as HTML. The title, description, and image all start out empty. We also collect the images on the page into an array, so that if the document has no Open Graph image we can fall back to the first image on the page.

For each of the three values we first look for the Open Graph meta tags (og:title, og:description, og:image) and use them if they exist. Otherwise we fall back to the document's title element and description meta tag for the title and description, and to the first image on the page for the image. A DOMXPath object lets us query elements in the loaded DOM document with XPath expressions.
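
The fallback order described above does not depend on PHP. As a minimal sketch, assuming the values have already been read from the page into a plain object, the selection logic could look like this in plain JavaScript. The function name pickPreview and the field names are hypothetical and are not part of the tutorial files.

```javascript
// Hypothetical helper: picks the preview values in the fallback order described above.
// The field names (ogTitle, pageTitle, ...) are assumptions made for illustration.
function pickPreview(found) {
    return {
        // Open Graph title first, then the document's title element
        title: found.ogTitle || found.pageTitle || "",
        // Open Graph description first, then the description meta tag
        description: found.ogDescription || found.metaDescription || "",
        // Open Graph image first, then the first image found on the page
        image: found.ogImage || (found.images && found.images[0]) || ""
    };
}

// Usage: a page without Open Graph tags falls back to its own title and first image.
var preview = pickPreview({
    pageTitle: "Example Domain",
    metaDescription: "An example page",
    images: ["/img/logo.png", "/img/banner.png"]
});
console.log(preview.title, preview.image);
```

Each value is chosen independently, so a page that only defines og:image still gets its title and description from the regular document tags.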

extract-contents.php

<?php
if($_POST){
    $post = $_POST;
    $url = isset($post["url"]) ? trim($post["url"]) : "";
    //=== Prepend a scheme if the url does not start with one
    $url = preg_match("/^(https?|ftp):\/\//i",$url) ? $url : "http://$url";

    //=== regular expression to validate url (case-insensitive, so paths keep their case)
    $regEx = "/^((https?|ftp):\/\/)(www\.)?[\w\-]+\.[a-z]{2,4}\/?[\w\/\-]*(\.[a-z]{2,4})?$/i";

    //=== Check if url is a valid url
    if(preg_match($regEx,$url)){
        //=== Scheme and host of the submitted url, used to resolve relative image paths
        $parts = parse_url($url);
        $hostname = $parts["scheme"]."://".$parts["host"];

        //=== Get contents of url
        $content = @file_get_contents($url);

        //=== If failed to get contents show an error
        if(!$content){
            die('<div class="error">Error parsing the submitted URL.</div>');
        }
        $title = $description = $image = "";

        $images_arr = [];

        //=== Open new dom document object
        $dom = new DOMDocument("1.0", "UTF-8");

        //=== Load url content to dom document object
        @$dom->loadHTML($content);

        //=== Get images from dom document
        $images = $dom->getElementsByTagName("img");

        //=== Loop through images and push them to images array
        foreach ($images as $img)
        {
            $src = parse_url($img->getAttribute("src"));
            if(!empty($src["path"]))
                $images_arr[] = $img->getAttribute("src");
        }

        //=== Open xpath object for current dom document
        $xPath = new DOMXPath($dom);
        $og_title = $xPath->query("//meta[@property='og:title']");
        $og_description = $xPath->query("//meta[@property='og:description']");
        $og_image = $xPath->query("//meta[@property='og:image']");

        $meta_description = $xPath->query("//meta[@name='description']");
        $meta_title = $xPath->query("//title");

        //=== Prepare title of document
        if($og_title->length){
            $title = $og_title->item(0)->getAttribute("content");
        }elseif($meta_title->length){
            $title = $meta_title->item(0)->textContent;
        }

        //=== Prepare description of document
        if($og_description->length){
            $description = $og_description->item(0)->getAttribute("content");
        }elseif($meta_description->length){
            $description = $meta_description->item(0)->getAttribute("content");
        }

        //=== Prepare image of document, falling back to the first image on the page
        if($og_image->length){
            $image = $og_image->item(0)->getAttribute("content");
        }elseif($images_arr){
            $image = reset($images_arr);
        }?>
        <div class="url-info-box">
            <?php
            if(!empty($image)){
                //=== Keep absolute and protocol-relative image urls, prefix relative ones with scheme and host
                $image = preg_match("/^(https?:)?\/\//i",$image) ? $image : $hostname."/".ltrim($image,"/");
            ?>
            <div class="image">
                <img src="<?php echo htmlspecialchars($image);?>" class="img-responsive" />
            </div>
            <?php } ?>
            <div class="data">
                <div class="title">
                    <?php echo htmlspecialchars($title);?>
                </div>
                <div class="description"><?php echo htmlspecialchars($description); ?></div>
            </div>
        </div>
<?php
    }else{
        echo '<div class="error">Invalid URL submitted.</div>';
    }
}
?>
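
The image step in extract-contents.php has to turn relative image paths into absolute ones before they can be shown in the preview box. For comparison, here is a minimal sketch of the same idea in plain JavaScript, using the standard URL constructor to resolve a path against a base URL. The function name resolveImageUrl and the example.com addresses are hypothetical.

```javascript
// Hypothetical helper: resolves an image src against the URL of the page it came from.
function resolveImageUrl(src, pageUrl) {
    try {
        // new URL(relative, base) handles absolute, protocol-relative and relative paths
        return new URL(src, pageUrl).href;
    } catch (e) {
        // Unparseable input: return an empty string so the caller can skip the image
        return "";
    }
}

console.log(resolveImageUrl("/img/logo.png", "http://example.com/blog/post.html"));
console.log(resolveImageUrl("//cdn.example.com/a.png", "https://example.com/"));
console.log(resolveImageUrl("https://example.org/b.png", "http://example.com/"));
```

Unlike simple string prefixing, the URL constructor also resolves a path such as img/a.png relative to the directory of the page rather than the site root.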

javascript.js

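$(document).ready(function(){
    $(".url-form").on("submit",function(e){
        e.preventDefault();
        var url = $(".url-input").val();
        $(".content-wrapper").hide();
        if(url != ''){
            $(".loader").fadeIn();
            $.ajax({
                url: "extract-contents.php",
                type: "POST",
                data:{
                    url: url
                },
                success: function(data){
                    // Show the markup returned by extract-contents.php
                    $(".content-wrapper").html(data).slideDown();
                },
                error: function(){
                    // Show a message if the request itself fails
                    $(".content-wrapper").html('<div class="error">Request failed. Please try again.</div>').slideDown();
                },
                complete: function(){
                    // Hide the loader whether the request succeeded or failed
                    $(".loader").fadeOut();
                }
            });
        }
    });
});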
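
As an optional extra, the submitted value could be checked on the client before the AJAX request is sent. The following is a minimal sketch, not part of the tutorial files: it mirrors the server's step of adding http:// when no scheme is present and then validates the result with the standard URL constructor. The function name normalizeUrl is hypothetical, and this sketch only accepts http and https, which is stricter than the server-side regular expression.

```javascript
// Hypothetical helper: returns a normalized http(s) URL string, or null if the input is unusable.
function normalizeUrl(input) {
    var value = String(input || "").trim();
    if (value === "") {
        return null;
    }
    // Add a default scheme when none is present, as extract-contents.php does
    if (!/^[a-z][a-z0-9+.\-]*:\/\//i.test(value)) {
        value = "http://" + value;
    }
    try {
        var parsed = new URL(value);
        // Only allow web addresses in this sketch
        if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
            return null;
        }
        return parsed.href;
    } catch (e) {
        // The URL constructor throws on input it cannot parse
        return null;
    }
}

console.log(normalizeUrl("example.com/page"));
console.log(normalizeUrl("ftp://example.com"));
```

A check like this is only a convenience for the user. The server still has to validate the URL itself, because requests can be sent to extract-contents.php without going through the form.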

style.css

*{
    box-sizing: border-box;
}
html,body{
    margin: 0px;
    padding: 0px;
}
body{
    background: #f0f0f0;
    font: normal normal 14px "Open Sans", Verdana, Arial;
}
.main-container{
    max-width: 1024px;
    margin: 0px auto;
}
.extract-wrapper{
    margin-bottom: 20px;
}
.fields-container{
    position: relative;
    margin-bottom: 20px;
    display: flex;
}
.loader{
    position: absolute;
    font-size: 30px;
    background: rgba(150,150,150,0.5);
    width: 100%;
    height: 100%;
    z-index: 5;
    padding: 0px 10px;
    display: none;
    color: #006699;
    text-align: center;
}
.url-form button{
    background: teal;
    color: #fff;
    border: none;
    padding: 11px 15px;
    font-weight: bold;
    cursor: pointer;
}
input.form-control{
    border: 1px solid #ddd;
    padding: 10px;
    color: #444;
    font-size: 15px;
    width: 100%;
}
.content-wrapper .error{
    padding: 10px;
    background: #e95454;
    color: #fff;
}
.url-info-box{
    background: #fefefe;
    border: 1px solid #fefefe;
    overflow: hidden;
    font-size: 13px;
    max-width: 300px;
}
.img-responsive{
    max-width: 100%;
    height: auto;
    display: block;
    margin: 0px auto;
}
.url-info-box .data{
    padding: 15px;
    background: #efefef;
}
.url-info-box .title{
    font-weight: bold;
    max-height: 35px;
    overflow: hidden;
    color: #3778cd;
}