Aldo

Simple, yet relatively advanced HTML scraper.

Most page scrapers built in PHP are tedious to use and often return unintended results. They iterate through the HTML once for each independent "DOM extraction", which makes them slow. Once the results arrive, you still need to manipulate and sort the data yourself, which can be difficult without knowledge of JavaScript.

Aldo aims to make it almost effortless to fetch results from a remote website.

Installation

This project is not yet available on Composer. For now, install it by forking the repository or downloading the ZIP file.

How to Use

use Aldo\Lexer\Lexer;
use Aldo\Http\Request;

// Create an HTTP request. Can be done with Aldo's own Request class, Guzzle, or your own library
$request = new Request('http://localhost/Aldo/test.html');
$html = $request->fetch();

// Transform the HTML into an array of elements and return the element manager
$lexer = new Lexer();
$elementManager = $lexer->transform($html);

Elements

This is what a typical element looks like:

Element => {
    [id] => index of elements array,
    [tag] => HTML tag name,
    [value] => inner text or value attribute, if the element has one,
    [attributes] =>
        [class] => array|string depending on number of classes
        [id] => ID of element,
        any other attributes can be found here
    [parent] => index of parent in elements array
}
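
To make the role of the parent index concrete, here is a minimal sketch that uses plain PHP arrays shaped like the structure above. The sample data and the `childrenOf()` helper are hypothetical and are not part of Aldo; closing tags are left out of the sample for brevity, and only direct children are returned.

```php
<?php
// Hypothetical flat element list mirroring the structure above (not produced by Aldo).
$elements = [
    ['id' => 0, 'tag' => 'html', 'value' => null, 'attributes' => [], 'parent' => null],
    ['id' => 1, 'tag' => 'body', 'value' => null, 'attributes' => [], 'parent' => 0],
    ['id' => 2, 'tag' => 'a', 'value' => 'Bob', 'attributes' => ['id' => 'bob', 'class' => 'class-here', 'href' => '/bob'], 'parent' => 1],
    ['id' => 3, 'tag' => 'p', 'value' => 'Hello', 'attributes' => [], 'parent' => 1],
];

/** Return the direct children of the element stored at $parentIndex. */
function childrenOf(array $elements, int $parentIndex): array
{
    return array_values(array_filter(
        $elements,
        fn (array $el): bool => $el['parent'] === $parentIndex
    ));
}

$children = childrenOf($elements, 1);
echo implode(', ', array_column($children, 'tag')), PHP_EOL; // prints "a, p"
```

Because every element only stores the index of its parent, a single pass over the flat array is enough to collect the children of any element.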

Element Management

// Getting an element
$elementManager->getElement('a#bob.class-here'); // Uses a selector; only tag name, id and class are supported. Optional second parameter: an elements array returned from getChildren()
$elementManager->getElementWithAttributes(['tag' => 'a', 'id' => 'bob', 'class' => ['class-here']]); // class can also be a string if there is only one class
$elementManager->getElementByIndex(0); // Gets the element by its index in the elements array. Opening and closing tags each count as a separate element
$elementManager->getElementById('bob'); // Gets the element by HTML id
$elementManager->getElementsByClass('class-here'); // Gets the elements that have the given class or classes; accepts a string or an array

// Getting parent of element
$elementManager->getParent($element); // Retrieve parent from already fetched element
$elementManager->getParentByIndex(1); // Retrieve parent from index in the elements array. In this case the parent would be <html>
$element->getParent(); // Retrieve parent directly from element

// Getting children of element
$elementManager->getChildren($element); // Retrieve children from already fetched element
$elementManager->getChildrenByIndex(0); // Retrieve children from index in the elements array. This would return everything inside <html>
$element->getChildren(); // Retrieve children directly from element
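
The selector form used by getElement() and the attribute array used by getElementWithAttributes() describe the same thing. The sketch below shows one way such a selector could be split into that array. `parseSelector()` is a hypothetical helper, not Aldo's own parser, and it assumes the parts appear in the order tag, id, classes.

```php
<?php
/**
 * Hypothetical helper: split a selector such as "a#bob.class-here"
 * into tag, id and classes. Not part of Aldo.
 */
function parseSelector(string $selector): array
{
    // Optional tag name, optional "#id", then zero or more ".class" parts.
    $pattern = '/^([a-zA-Z][a-zA-Z0-9]*)?(?:#([\w-]+))?((?:\.[\w-]+)*)$/';
    if (!preg_match($pattern, $selector, $m)) {
        throw new InvalidArgumentException("Unsupported selector: $selector");
    }

    $attributes = [];
    if (($m[1] ?? '') !== '') {
        $attributes['tag'] = $m[1];
    }
    if (($m[2] ?? '') !== '') {
        $attributes['id'] = $m[2];
    }
    if (($m[3] ?? '') !== '') {
        // ".one.two" becomes ['one', 'two']
        $attributes['class'] = explode('.', ltrim($m[3], '.'));
    }

    return $attributes;
}

print_r(parseSelector('a#bob.class-here'));
```

For 'a#bob.class-here' the result has the same shape as the array passed to getElementWithAttributes() above. Selectors that put the id after a class are rejected by this sketch.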

Element Filtering

As of right now, there are only three ways of filtering, but more will be added in the future.

use Aldo\Element\ElementFilter;

$emails = ElementFilter::getEmails($elements, $attribute); // Retrieve all emails from the elements array. Optional second parameter for searching within a specific attribute

$words = ElementFilter::getElementsWithWord($elements, $word, $attribute); // Retrieve all elements that contain a specific word. Optional third parameter for searching within a specific attribute

$urls = ElementFilter::getUrls($elements, $attribute); // Retrieve all URLs from the elements array. Optional second parameter for searching within a specific attribute
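
As an illustration of what an e-mail filter of this kind has to do, here is a minimal sketch over plain arrays shaped like the element structure in this README. `findEmails()` is a hypothetical stand-in and makes no claim about how ElementFilter::getEmails() is implemented. It searches the element value by default, or a named attribute when one is given.

```php
<?php
/**
 * Hypothetical sketch: collect e-mail addresses from element arrays.
 * Searches 'value' by default, or the named attribute when $attribute is given.
 */
function findEmails(array $elements, ?string $attribute = null): array
{
    $emails = [];
    foreach ($elements as $element) {
        $text = $attribute === null
            ? ($element['value'] ?? '')
            : ($element['attributes'][$attribute] ?? '');
        if (!is_string($text)) {
            continue; // e.g. a class list stored as an array
        }
        // Loose pattern to find candidates, then validate each one.
        preg_match_all('/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/', $text, $matches);
        foreach ($matches[0] as $candidate) {
            if (filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false) {
                $emails[] = $candidate;
            }
        }
    }

    return array_values(array_unique($emails));
}

// Sample data is made up for this example.
$sample = [
    ['tag' => 'p', 'value' => 'Contact bob@example.com today', 'attributes' => []],
    ['tag' => 'a', 'value' => 'Mail Alice', 'attributes' => ['href' => 'mailto:alice@example.com']],
];

print_r(findEmails($sample));         // searches the element values
print_r(findEmails($sample, 'href')); // searches the href attribute only
```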

Element Sorting

use Aldo\Element\ElementSort;

$sortedElements = ElementSort::orderBy($elements, $attribute, $direction); // Sorts the elements by the given attribute. $direction is 'asc' or 'desc' and defaults to 'asc'

// Please be aware that the closing tags are included within the elements array. Sorting by tag may result in the closing tags appearing first.
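
One way to sidestep the closing-tag caveat is to drop closing tags before sorting. The sketch below works on plain arrays and is not Aldo's ElementSort. It assumes, without confirmation from the source, that a closing tag is stored with a leading "/" in its 'tag' key; `sortOpeningElements()` is a hypothetical name.

```php
<?php
/**
 * Hypothetical sketch: sort element arrays by a top-level key, skipping closing tags.
 * Assumption: closing tags carry a leading "/" in 'tag' (not confirmed by Aldo).
 */
function sortOpeningElements(array $elements, string $key, string $direction = 'asc'): array
{
    // Keep only elements whose tag does not start with "/".
    $opening = array_filter(
        $elements,
        fn (array $el): bool => strpos($el['tag'], '/') !== 0
    );

    // usort() reindexes the array and compares values with the spaceship operator.
    usort($opening, function (array $a, array $b) use ($key, $direction): int {
        $result = ($a[$key] ?? '') <=> ($b[$key] ?? '');
        return $direction === 'desc' ? -$result : $result;
    });

    return $opening;
}

// Sample data is made up for this example.
$tags = [['tag' => 'p'], ['tag' => '/p'], ['tag' => 'a'], ['tag' => '/a']];

echo implode(', ', array_column(sortOpeningElements($tags, 'tag'), 'tag')), PHP_EOL; // prints "a, p"
```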

Element Aliases

There are a few aliases to help retrieve certain attributes:

$element->link(); // Retrieves the href attribute, assuming it's an anchor tag

$element->source(); // Retrieves the src attribute

$element->val(); // Retrieves the value attribute or the inner text of the element

Rebuilding HTML

If for some reason you need to rebuild the HTML, you can do so using this:

use Aldo\Lexer\Lexer;

$lexer = new Lexer;
$elements = $lexer->transform($html); // based on How to Use guide
$lexer->rebuild($elements, $name); // Optional second parameter: the name of the new HTML file, which is written to the base directory. Defaults to "rebuild". Omit the .html extension

TODO

  • HTTP Requests
  • Element Manager
  • Selectors for ID, class, and tag name
  • Sorting
  • Filtering (getting emails)
  • Rebuild HTML
  • Parent/children
  • Set value of element, instead of creating a new array for value
  • Handle HTML empty elements: input, br, etc
  • Do not include comments in sequence
  • Alias functions for certain attributes; href => link(), src => source(), value => val(), etc
  • Support multiple classes in element
  • Turn arrays into objects

About

Revamp of the previous page scraper, "Damon".
