Differences between the PHP port and the original
------------------------------------------------------
Arc90's Readability is designed to run in the browser. It works on the DOM
tree (the parsed HTML) after the page's CSS styles have been applied and
its JavaScript code has been executed. This PHP port does not run inside a
browser. We use PHP's ability to parse HTML to build our DOM tree, but we
cannot rely on CSS or JavaScript support. As such, the results will not
always match Arc90's Readability. (For example, if a web page contains CSS
style rules or JavaScript code which hide certain HTML elements from
display, Arc90's Readability will dismiss those elements from consideration,
but our PHP port, unable to understand CSS or JavaScript, will still treat
them as candidate content.)
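
The limitation described above can be illustrated with PHP's DOM extension
alone. This is only a minimal sketch, not code taken from the port: the
function name paragraphTexts() and the sample markup are assumptions made
for illustration. It shows that a paragraph hidden with an inline style is
still present in the parsed tree, because the parser does not apply CSS.

```php
<?php
// Hypothetical helper (not part of the port): returns the text of every
// <p> element that PHP's HTML parser finds in the given markup.
function paragraphTexts(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world markup is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $texts = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $texts[] = $p->textContent;
    }
    return $texts;
}

// Sample markup (an assumption): a browser would not display the second
// paragraph, but the parser has no notion of CSS and keeps it in the tree.
$html = '<html><body>'
      . '<p>Visible paragraph.</p>'
      . '<p style="display:none">Hidden paragraph.</p>'
      . '</body></html>';

foreach (paragraphTexts($html) as $text) {
    echo $text, "\n";
}
```

Both paragraphs are returned, which is why content that a browser would
hide can still be considered when the page is processed outside a browser.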

Another significant difference is that the aim of Arc90's Readability is
to re-present the main content block of a given web page so users can
read it more easily in their browsers. Correct identification, clean-up,
and separation of the content block is only one part of this process.
This PHP port is concerned only with that part. It does not include code
that relates to presentation in the browser, because Arc90 already do
that extremely well, and for PDF output there's FiveFilters.org's
PDF Newspaper: http://fivefilters.org/pdf-newspaper/.

Finally, this class contains methods that might be useful for developers
working on HTML document fragments. So without deviating too much from
the original code (which I don't want to do because it makes debugging
and updating more difficult), I've tried to make it a little more
developer-friendly. You should be able to use the methods here on
existing DOMElement objects without passing an entire HTML document to
be parsed.
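
As a rough sketch of what working on an existing DOMElement looks like: the
function innerText() below is a hypothetical stand-in for a method of the
class (this section does not list the actual method names), and the markup
and the "story" id are assumptions. The point is only that the element is
picked out of a tree you already have and handed over directly.

```php
<?php
// Hypothetical stand-in for a method of the class: it accepts any
// DOMElement and returns its text with runs of whitespace collapsed.
function innerText(DOMElement $e): string
{
    return trim(preg_replace('/\s+/', ' ', $e->textContent));
}

// A DOM tree that the calling application has already built.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(
    '<html><body>'
    . '<div id="sidebar">Unrelated links</div>'
    . '<div id="story"><p>Main   story   text.</p></div>'
    . '</body></html>'
);
libxml_clear_errors();

// Select just the fragment of interest; the id "story" is an assumption.
$xpath = new DOMXPath($dom);
$story = $xpath->query('//div[@id="story"]')->item(0);

// Only the selected element is passed along, not the whole document.
echo innerText($story), "\n";
```

Accepting a DOMElement rather than an HTML string means a caller that has
already parsed a page does not have to serialize it and parse it again.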