Differences between the PHP port and the original
-------------------------------------------------

Arc90's Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page's CSS styles have been applied and Javascript code executed. This PHP port does not run inside a browser. We use PHP's ability to parse HTML to build our DOM tree, but we cannot rely on CSS or Javascript support. As such, the results will not always match Arc90's Readability. (For example, if a web page contains CSS style rules or Javascript code which hide certain HTML elements from display, Arc90's Readability will dismiss those elements from consideration, but our PHP port, unable to understand CSS or Javascript, will still treat them as candidate content.)

Another significant difference is that the aim of Arc90's Readability is to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, clean-up, and separation of the content block is only a part of this process. This PHP port is concerned only with this part; it does not include code that relates to presentation in the browser. Arc90 already do that extremely well, and for PDF output there's FiveFilters.org's PDF Newspaper: http://fivefilters.org/pdf-newspaper/.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don't want to do because it makes debugging and updating more difficult), I've tried to make it a little more developer friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.
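Because the port cannot interpret CSS, one possible workaround is to strip elements that are hidden with an inline style attribute before handing the HTML to the class. The following is only a minimal sketch under that assumption: removeInlineHiddenElements() is a hypothetical helper, not part of this port, and it uses only PHP's built-in DOM extension. It cannot detect elements hidden by stylesheet rules or by Javascript.

```php
<?php
/**
 * Remove elements hidden with an inline style (display:none or visibility:hidden).
 * Hypothetical helper, not part of php-readability. Rules defined in stylesheets
 * or applied by Javascript are NOT detected.
 *
 * @return int number of elements removed
 */
function removeInlineHiddenElements(DOMDocument $dom)
{
    $xpath = new DOMXPath($dom);
    $removed = 0;
    // Copy the matches into an array so removals cannot affect the iteration
    $nodes = iterator_to_array($xpath->query('//*[@style]'));
    foreach ($nodes as $node) {
        // Normalise the style value: drop whitespace, lower-case
        $style = strtolower(preg_replace('/\s+/', '', $node->getAttribute('style')));
        $hidden = strpos($style, 'display:none') !== false
            || strpos($style, 'visibility:hidden') !== false;
        if ($hidden && $node->parentNode !== null) {
            $node->parentNode->removeChild($node);
            $removed++;
        }
    }
    return $removed;
}

// Sample input (made up for illustration)
$html = '<html><body><div style="display: none">Hidden promo</div>'
      . '<p>Visible article text.</p></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings caused by imperfect markup
$removed = removeInlineHiddenElements($dom);
$cleanHtml = $dom->saveHTML();
```

The cleaned markup in $cleanHtml could then be passed to the Readability constructor in place of the raw page source.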
Example #1
/** Extract the article title and text from a page using php-readability. */
function getArticle($url)
{
    $html = file_get_contents($url);
    if ($html === false) {
        return null; // the page could not be fetched
    }
    $readability = new Readability($html, $url);
    // init() returns false when no article content could be identified
    if (!$readability->init()) {
        return null;
    }
    return array(
        'title'   => $readability->getTitle()->textContent,
        'content' => $readability->getContent()->textContent,
    );
}
/** Return the page title, trying Google's cached copy first and then the live page. */
function getTitle($url)
{
    // Only http:// URLs are rewritten to the cache address; for any other
    // scheme $cachedURL equals $url, so array_unique() avoids a second fetch.
    $cachedURL = str_replace('http://', 'http://webcache.googleusercontent.com/search?q=cache:', $url);
    foreach (array_unique(array($cachedURL, $url)) as $source) {
        $html = file_get_contents($source);
        if ($html === false) {
            continue; // fetch failed, try the next source
        }
        // The fourth argument (false) disables Tidy pre-processing
        $readability = new Readability($html, $url, 'libxml', false);
        $readability->init();
        $titleNode = $readability->getTitle();
        if ($titleNode !== null && $titleNode->textContent !== '') {
            return $titleNode->textContent;
        }
    }
    return 'This link has no title';
}