Example #1
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Detect the linebreak for output ("\n" in CLI-mode, otherwise "<br />")
        $lb = (PHP_SAPI == "cli") ? "\n" : "<br />";
        // Print whether the content of the document was received
        if ($DocInfo->received == true) {
            echo "Content received: " . $DocInfo->bytes_received . " bytes" . $lb;
        } else {
            echo "Content not received" . $lb;
        }
        // Now you should do something with the content of the actual
        // received page or file ($DocInfo->source), we skip it in this example
        echo $lb;
        flush();
    }
}
// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.
$crawler = new MyCrawler();
// URL to crawl
$crawler->setURL("localhost.p2.gta.charlie");
$crawler->obeyNoFollowTags(TRUE);
$crawler->obeyRobotsTxt(TRUE);
$crawler->enableAggressiveLinkSearch(FALSE);
// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");
// Ignore links to pictures, stylesheets and scripts; don't even request them
$crawler->addURLFilterRule("#\\.(jpg|jpeg|gif|png|css|js)([?].*)?\$# i");
// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);
// Set the traffic-limit to 1 MB (in bytes;
// for testing we don't want to "suck" in the whole site)
$crawler->setTrafficLimit(1000 * 1024);
// That's enough, now here we go
$crawler->go();
// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
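// A minimal sketch of the report-output the comment above refers to,
// using the documented properties of PHPCrawlerProcessReport:
$report = $crawler->getProcessReport();
$lb = (PHP_SAPI == "cli") ? "\n" : "<br />";
echo "Summary:" . $lb;
echo "Links followed: " . $report->links_followed . $lb;
echo "Documents received: " . $report->files_received . $lb;
echo "Bytes received: " . $report->bytes_received . " bytes" . $lb;
echo "Process runtime: " . $report->process_runtime . " sec" . $lb;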
Example #2
        // Finally, re-index the collected links array with consecutive keys
        $this->links = array_values($this->links);
        return $this->links;
    }
}
// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.
$crawler = new MyCrawler($_SESSION['crawler']['domain']);
// Follow-mode 2: only follow links that lead to the same host
$crawler->setFollowMode(2);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\\.(jpg|jpeg|gif|png)\$# i");
$crawler->enableCookieHandling(true);
if ($_SESSION['crawler']['respect_robots_txt'] == true) {
    $crawler->obeyRobotsTxt(true, $_SESSION['crawler']['domain'] . '/robots.txt');
    $crawler->obeyNoFollowTags(true);
}
$crawler->enableAggressiveLinkSearch(false);
// Skip the link-search in special document-sections (script-code, HTML-comments etc.)
$crawler->excludeLinkSearchDocumentSections(PHPCrawlerLinkSearchDocumentSections::ALL_SPECIAL_SECTIONS);
$crawler->addLinkSearchContentType("#text/html# i");
$crawler->setLinkExtractionTags(array('href'));
$crawler->setUserAgentString('Crawl_Scrape_Solr_Index/1.0');
// no data on page yet
if ($_SESSION['crawler']['auth'] == true) {
    // set_url_test_auth() is presumably a custom helper defined in MyCrawler (not shown here)
    $crawler->set_url_test_auth($_SESSION['crawler']['user'], $_SESSION['crawler']['pass']);
    // Use "#" as the regex-delimiter so the "://" doesn't terminate the pattern
    $pattern = "#https?://" . str_replace('.', '\\.', $_SESSION['crawler']['silo']) . "#is";
    $crawler->addBasicAuthentication($pattern, $_SESSION['crawler']['user'], $_SESSION['crawler']['pass']);
}
// That's enough, now here we go
$crawler->go();
// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
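// As in Example #1, a minimal sketch of the report-output; "<br />" is
// assumed here since this example runs in a web-context (it uses sessions):
$report = $crawler->getProcessReport();
echo "Links followed: " . $report->links_followed . "<br />";
echo "Documents received: " . $report->files_received . "<br />";
echo "Bytes received: " . $report->bytes_received . " bytes<br />";
echo "Process runtime: " . $report->process_runtime . " sec<br />";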