News Scrapper

This library extracts article/news information from a webpage, including the title, main image, description, author, keywords, publish date and, where possible, the body.

It supports scraping from standard structured metadata, such as Microdata and the hAtom Microformat, as well as custom selectors that you can specify for unstructured webpages.

News-Scrapper requires PHP >= 5.4

How to Install

You can install this library with Composer. Drop this into your composer.json manifest file:

    {
        "require": {
            "zrashwani/news-scrapper": "1.*"
        }
    }

Then run composer install.

How to Use

Here is a quick example of how to scrape news data from a webpage:

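    require 'vendor/autoload.php';

    // URL of the article to scrape
    $url = "http://example.com/your-news-uri";

    // Initiate the scraper client and print the extracted data
    $scrap_client = new \Zrashwani\NewsScrapper\Client();
    print_r($scrap_client->getLinkData($url));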

By default, the scraper tries to guess the best structured data adapter and applies it.
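To make the idea of guessing an adapter more concrete, here is a minimal sketch of one possible detection heuristic, written with PHP's built-in DOM extension. It is not the library's actual logic: the function name detectMode(), the returned labels and the order of the checks are all assumptions made for illustration. It relies only on two well-known markers: Microdata items carry an itemscope attribute, and hAtom entries carry the hentry class.

```php
<?php
// Illustrative sketch only; the library's own detection logic may differ.

/**
 * Guess which kind of structured data an HTML page carries.
 * detectMode() and its return labels are hypothetical.
 */
function detectMode($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Microdata marks items with the "itemscope" attribute.
    if ($xpath->query('//*[@itemscope]')->length > 0) {
        return 'microdata';
    }

    // hAtom marks entries with the "hentry" class token.
    $hentry = '//*[contains(concat(" ", normalize-space(@class), " "), " hentry ")]';
    if ($xpath->query($hentry)->length > 0) {
        return 'hatom';
    }

    // Nothing recognised: fall back to plain meta tags.
    return 'default';
}

// Made-up sample pages.
$microdataPage = '<html><body><div itemscope itemtype="http://schema.org/NewsArticle">'
               . '<h1 itemprop="headline">Title</h1></div></body></html>';
$hatomPage     = '<html><body><div class="post hentry"><h1 class="entry-title">Title</h1></div></body></html>';
$plainPage     = '<html><head><title>Title</title></head><body><p>Plain page</p></body></html>';

echo detectMode($microdataPage), "\n";
echo detectMode($hatomPage), "\n";
echo detectMode($plainPage), "\n";
```

If the guess turns out to be wrong for a particular site, passing the adapter name to the Client constructor, as described in the next section, avoids the guessing step.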

Scraping Structured data

You can select a specific adapter for extracting the data, as follows:

    $url = "http://example.com/your-news-uri";
    // Use the Microdata standard for scraping
    $scrap_client = new \Zrashwani\NewsScrapper\Client('Microdata');
    print_r($scrap_client->getLinkData($url));
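For readers unfamiliar with Microdata, the sketch below shows what such markup looks like and how its itemprop values could be read with PHP's built-in DOM extension. It is only an illustration of the standard, not the library's Microdata adapter; the helper name readItemprops() and the sample HTML are made up for this example.

```php
<?php
// Illustrative sketch only; not the library's Microdata adapter.

/**
 * Collect itemprop name/value pairs from an HTML string.
 * readItemprops() is a hypothetical helper name.
 */
function readItemprops($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about HTML5 tags
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $props = [];
    foreach ($xpath->query('//*[@itemprop]') as $node) {
        $name = $node->getAttribute('itemprop');
        // <meta> carries its value in "content", <img> in "src";
        // other elements use their text content.
        if ($node->hasAttribute('content')) {
            $value = $node->getAttribute('content');
        } elseif ($node->nodeName === 'img') {
            $value = $node->getAttribute('src');
        } else {
            $value = trim($node->textContent);
        }
        $props[$name] = $value;
    }
    return $props;
}

// Made-up sample markup using schema.org property names.
$html = '<html><body><article itemscope itemtype="http://schema.org/NewsArticle">'
      . '<h1 itemprop="headline">Sample headline</h1>'
      . '<meta itemprop="datePublished" content="2015-01-31">'
      . '<img itemprop="image" src="/img/lead.jpg">'
      . '<span itemprop="author">Jane Doe</span>'
      . '</article></body></html>';

$props = readItemprops($html);
print_r($props);
```

The sketch keeps only the last value seen for each property name and ignores nested items, which a real adapter would need to handle.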

Here is the list of supported structured data adapters (scraping modes):

Scraping Unstructured data

If the webpage doesn't follow any structured data standard, you can still scrape news information by specifying XPath or CSS selectors for the different article parts, such as the title, description, image and body, as follows:

    $scrapClient = new \Zrashwani\NewsScrapper\Client('Custom');

    /** @var \Zrashwani\NewsScrapper\Adapters\CustomAdapter $adapter */
    $adapter = $scrapClient->getAdapter();
    $adapter
        ->setTitleSelector('.single-post h1') // selectors can be either CSS or XPath
        ->setImageSelector(".sidebar img")
        ->setAuthorSelector('//a[@rel="author"]')
        ->setPublishDateSelector('//span[@class="published_data"]')
        ->setBodySelector('//div[@class="contents"]');

    $newsData = $scrapClient->getLinkData("http://example.com/your-news-uri");
    print_r($newsData);

The custom scraping adapter, CustomAdapter, supports method chaining for setting the selectors. If a selector is not specified, the adapter uses the default selector from DefaultAdapter (an HTML adapter that depends on standard meta tags).
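The fallback rule described above can be sketched with plain XPath queries. This is not the code of CustomAdapter or DefaultAdapter: the helpers firstMatch() and extractParts() are hypothetical, and the default expressions (the title element, the description meta tag and the og:image meta tag) are assumptions about what "standard meta tags" could mean.

```php
<?php
// Illustrative sketch only; not the library's CustomAdapter or DefaultAdapter.

/**
 * Return the trimmed value of the first node matching an XPath
 * expression, or null when nothing matches.
 */
function firstMatch(DOMXPath $xpath, $expression)
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

/**
 * Resolve each article part with its custom selector when one is set,
 * otherwise with the default selector for that part.
 */
function extractParts(DOMXPath $xpath, array $defaults, array $custom)
{
    $result = [];
    foreach ($defaults as $part => $defaultExpression) {
        $expression = isset($custom[$part]) ? $custom[$part] : $defaultExpression;
        $result[$part] = firstMatch($xpath, $expression);
    }
    return $result;
}

// Made-up sample page.
$html = '<html><head><title>Page title</title>'
      . '<meta name="description" content="Short summary">'
      . '<meta property="og:image" content="http://example.com/lead.jpg">'
      . '</head><body><div class="single-post"><h1>Custom headline</h1></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Assumed defaults based on common meta tags.
$defaults = [
    'title'       => '//title',
    'description' => '//meta[@name="description"]/@content',
    'image'       => '//meta[@property="og:image"]/@content',
];

// Only the title selector is customised; the other parts use the defaults.
$custom = ['title' => '//div[@class="single-post"]//h1'];

$parts = extractParts($xpath, $defaults, $custom);
print_r($parts);
```

Here the title comes from the custom selector, while the description and image come from the assumed defaults.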

Scraping Group of Links

To scrape a group of news articles from a page containing news links, use the scrapLinkGroup method:

    $listingPageUrl = 'https://www.readability.com/topreads/'; // URL of the page listing the news
    $linksSelector = '.entry-title a'; // CSS or XPath selector for news links inside the listing page
    $numberOfArticles = 3; // number of links to scrape; use null to get all links matching the selector

    $scrapClient = new \Zrashwani\NewsScrapper\Client();
    $newsGroupData = $scrapClient->scrapLinkGroup($listingPageUrl, $linksSelector, $numberOfArticles);
    foreach ($newsGroupData as $singleNews) {
        print_r($singleNews);
    }
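The first step of such a group scrape, collecting the article links from the listing page, might look roughly like the sketch below. It is not the implementation of scrapLinkGroup: collectLinks() is a hypothetical helper, the listing HTML is made up, and the sketch accepts XPath only. It does mirror the documented meaning of the limit, where null returns every matching link.

```php
<?php
// Illustrative sketch only; not the library's scrapLinkGroup implementation.

/**
 * Collect href values matched by an XPath expression.
 * A null $limit returns every match; otherwise at most $limit links.
 * collectLinks() is a hypothetical helper.
 */
function collectLinks($html, $linksXpath, $limit = null)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query($linksXpath) as $anchor) {
        // Stop as soon as the requested number of links is reached.
        if ($limit !== null && count($links) >= $limit) {
            break;
        }
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up listing page.
$html = '<html><body>'
      . '<a href="/about">About</a>'
      . '<h2 class="entry-title"><a href="/news/1">One</a></h2>'
      . '<h2 class="entry-title"><a href="/news/2">Two</a></h2>'
      . '<h2 class="entry-title"><a href="/news/3">Three</a></h2>'
      . '</body></html>';

// XPath counterpart of the CSS selector ".entry-title a" for this sample markup.
$selector = '//*[@class="entry-title"]//a';

$firstTwo = collectLinks($html, $selector, 2);
$all = collectLinks($html, $selector);

print_r($firstTwo);
print_r($all);
```

Each collected link would then be scraped individually, and relative links such as the ones in this sample would first need to be resolved against the listing page URL.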

How to Contribute

1. Fork this repository
2. Create a new branch for each feature or improvement
3. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 standard.

System Requirements

PHP 5.4.0+

License

MIT Public License
