Aldo

Simple, yet relatively advanced HTML scraper.

Most page scrapers built in PHP are tedious to use and often return unexpected results. They iterate through the HTML again for each independent "DOM extraction", which makes them slow. Once you have the results, you still need to manipulate and sort the data yourself, which can be difficult without knowledge of JavaScript.

Aldo aims to make it almost effortless to fetch results from a remote website.

Installation

Unfortunately, this project is not yet available through Composer. For now, install it by forking the repository or downloading the ZIP file.

How to Use

use Aldo\Lexer\Lexer;
use Aldo\Http\Request;

// Create an HTTP request. Can be done with Aldo's own Request class, Guzzle, or your own library
$request = new Request('http://localhost/Aldo/test.html');
$html = $request->fetch();

// Transform the HTML into an array of elements and return the element manager
$lexer = new Lexer();
$elementManager = $lexer->transform($html);

Elements

This is what a typical element looks like:

Element => {
    [id] => index of elements array,
    [tag] => HTML tag name,
    [value] => inner text or value attribute, if the element has either,
    [attributes] =>
        [class] => array|string depending on number of classes
        [id] => ID of element,
        any other attributes can be found here
    [parent] => index of parent in elements array
}
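
To make the layout above concrete, here is a minimal sketch of a flat elements array in plain PHP. It is not Aldo's internal code: the sample data, the helper names `childrenOf()` and `ancestorTags()`, and the use of `null` as the root's parent are all assumptions made for illustration.

```php
<?php
// Hypothetical flat elements array that mirrors the layout described above.
// Closing tags are left out to keep the sketch short.
$elements = [
    0 => ['id' => 0, 'tag' => 'html', 'value' => null, 'attributes' => [], 'parent' => null],
    1 => ['id' => 1, 'tag' => 'body', 'value' => null, 'attributes' => [], 'parent' => 0],
    2 => ['id' => 2, 'tag' => 'a', 'value' => 'Bob',
          'attributes' => ['id' => 'bob', 'class' => 'class-here', 'href' => '/bob'], 'parent' => 1],
    3 => ['id' => 3, 'tag' => 'p', 'value' => 'Hello',
          'attributes' => ['class' => ['intro', 'lead']], 'parent' => 1],
];

// Hypothetical helper: collect the direct children of the element at $index.
function childrenOf(array $elements, int $index): array
{
    return array_values(array_filter(
        $elements,
        fn(array $el): bool => $el['parent'] === $index
    ));
}

// Hypothetical helper: follow the parent indices up to the root.
function ancestorTags(array $elements, int $index): array
{
    $tags = [];
    $parent = $elements[$index]['parent'];
    while ($parent !== null) {
        $tags[] = $elements[$parent]['tag'];
        $parent = $elements[$parent]['parent'];
    }
    return $tags;
}
```

Because every element stores only the index of its parent, finding children means scanning the array, while walking upwards is a simple loop.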

Element Management

// Getting an element
$elementManager->getElement('a#bob.class-here'); // Uses a selector; only tag name, id and class are supported. Optional second parameter: an elements array returned from getChildren()
$elementManager->getElementWithAttributes(['tag' => 'a', 'id' => 'bob', 'class' => ['class-here']]); // class can also be a string if there is only one class
$elementManager->getElementByIndex(0); // Gets the element by its index in the elements array. An opening tag and its closing tag count as two separate elements
$elementManager->getElementById('bob'); // Gets the element by HTML id
$elementManager->getElementsByClass('class-here'); // Gets the elements by class; accepts either a string or an array

// Getting parent of element
$elementManager->getParent($element); // Retrieve parent from already fetched element
$elementManager->getParentByIndex(1); // Retrieve parent from index in the elements array. In this case the parent would be <html>
$element->getParent(); // Retrieve parent directly from element

// Getting children of element
$elementManager->getChildren($element); // Retrieve children from already fetched element
$elementManager->getChildrenByIndex(0); // Retrieve children from index in the elements array. This would return everything inside <html>
$element->getChildren(); // Retrieve children directly from element
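
The selector form accepted by getElement() (tag name, id and class only) can be broken down with a few lines of PHP. The sketch below is an assumption about how such a selector might be parsed, not Aldo's actual parser; `parseSelector()` is a hypothetical name. It returns the same array shape that getElementWithAttributes() is shown taking above.

```php
<?php
// Hypothetical parser for selectors such as "a#bob.class-here".
function parseSelector(string $selector): array
{
    $criteria = [];
    // Split before every "#" or "." so each token keeps its prefix character.
    $parts = preg_split('/(?=[#.])/', $selector, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $part) {
        if ($part[0] === '#') {
            $criteria['id'] = substr($part, 1);
        } elseif ($part[0] === '.') {
            $criteria['class'][] = substr($part, 1);
        } else {
            $criteria['tag'] = $part;
        }
    }
    return $criteria;
}

// parseSelector('a#bob.class-here') gives
// ['tag' => 'a', 'id' => 'bob', 'class' => ['class-here']]
```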

Element Filtering

As of right now, there are only three ways of filtering, but more will be added in the future.

use Aldo\Element\ElementFilter;

$emails = ElementFilter::getEmails($elements, $attribute); // Retrieve all emails from the elements array. Optional second parameter for searching within a specific attribute

$words = ElementFilter::getElementsWithWord($elements, $word, $attribute); // Retrieve all elements that contain a specific word. Optional third parameter for searching within an attribute

$urls = ElementFilter::getUrls($elements, $attribute); // Retrieve all URLs from the elements array. Optional second parameter for searching within an attribute
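
For readers curious how a filter of this kind can work, here is a rough, self-contained sketch of an email filter over plain element arrays. It is not Aldo's implementation: `findEmails()`, the regular expression, the sample data, and the choice to search the element's value when no attribute is given are all assumptions.

```php
<?php
// Hypothetical stand-in for an email filter over the element layout shown earlier.
function findEmails(array $elements, ?string $attribute = null): array
{
    // Deliberately simple pattern; it does not cover every valid address.
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
    $emails = [];
    foreach ($elements as $element) {
        // Search one attribute if given, otherwise the element's value.
        $haystack = $attribute === null
            ? ($element['value'] ?? '')
            : ($element['attributes'][$attribute] ?? '');
        if (!is_string($haystack)) {
            continue; // e.g. a class list stored as an array
        }
        if (preg_match_all($pattern, $haystack, $matches)) {
            $emails = array_merge($emails, $matches[0]);
        }
    }
    return array_values(array_unique($emails));
}

// Hypothetical sample data.
$sample = [
    ['tag' => 'p', 'value' => 'Write to bob@example.com today', 'attributes' => []],
    ['tag' => 'a', 'value' => 'Contact', 'attributes' => ['href' => 'mailto:alice@example.org']],
    ['tag' => 'div', 'value' => null, 'attributes' => []],
];

$fromText  = findEmails($sample);         // searches element values
$fromHrefs = findEmails($sample, 'href'); // searches the href attribute only
```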

Element Sorting

use Aldo\Element\ElementSort;

$sortedElements = ElementSort::orderBy($elements, $attribute, 'asc'); // Sort the elements by the given attribute. The third parameter is the direction, either 'asc' or 'desc'; it defaults to 'asc'.

// Please be aware that the closing tags are included within the elements array. Sorting by tag may result in the closing tags appearing first.
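
As an illustration of sorting elements by an attribute, the sketch below uses PHP's built-in usort(). It is an assumption about the general approach, not Aldo's code: `sortByAttribute()` and the sample data are hypothetical, and attribute values are assumed to be scalars. In this sketch, elements that lack the attribute compare as an empty string, so they come first in ascending order.

```php
<?php
// Hypothetical sorter over the element layout shown earlier.
function sortByAttribute(array $elements, string $attribute, string $direction = 'asc'): array
{
    usort($elements, function (array $a, array $b) use ($attribute, $direction): int {
        // Missing attributes are treated as empty strings.
        $left  = (string) ($a['attributes'][$attribute] ?? '');
        $right = (string) ($b['attributes'][$attribute] ?? '');
        $cmp = strcmp($left, $right);
        return $direction === 'desc' ? -$cmp : $cmp;
    });
    return $elements;
}

// Hypothetical sample data.
$links = [
    ['tag' => 'a', 'attributes' => ['href' => '/contact']],
    ['tag' => 'a', 'attributes' => ['href' => '/about']],
    ['tag' => 'p', 'attributes' => []],
];

$ascending  = sortByAttribute($links, 'href');
$descending = sortByAttribute($links, 'href', 'desc');
```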

Element Aliases

There are a few aliases to help retrieve certain attributes:

$element->link(); // Retrieves the href attribute, assuming it's an anchor tag

$element->source(); // Retrieves the src attribute

$element->val(); // Retrieves the value attribute or the inner text of the element

Rebuilding HTML

If for some reason you need to rebuild the HTML, you can do so using this:

use Aldo\Lexer\Lexer;

$lexer = new Lexer;
$elements = $lexer->transform($html); // based on How to Use guide
$lexer->rebuild($elements, $name); // Optional second parameter: the name of the new HTML file, which is written to the base directory. The default name is "rebuild". Omit the .html extension
