PHP MyCrawler::obeyRobotsTxt Examples

Programming language: PHP

Class/Type: MyCrawler

Method/Function: obeyRobotsTxt

Examples at hotexamples.com: 3

PHP MyCrawler::obeyRobotsTxt - 3 examples found. These are the top-rated real-world examples of MyCrawler::obeyRobotsTxt extracted from open source projects. You can rate examples to help us improve their quality.

Frequently used methods

setURL(23)

addURLFilterRule(5)

setTrafficLimit(3)

obeyRobotsTxt(3)

setPageLimit(3)

addContentTypeReceiveRule(2)

goMultiProcessed(2)

go(2)

obeyNoFollowTags(2)

enableAggressiveLinkSearch(2)

addURLFollowRule(2)

setFollowMode(2)

setCrawlingDepthLimit(1)

setUrlCacheType(1)

setLinkExtractionTags(1)

setUserAgentString(1)

setWorkingDirectory(1)

addBasicAuthentication(1)

resume(1)

processLinks(1)

getProcessReport(1)

getCrawlerId(1)

excludeLinkSearchDocumentSections(1)

enableResumption(1)

enableCookieHandling(1)

addReceiveContentType(1)

addLinkSearchContentType(1)

set_url_test_auth(1)

Example #1

File: crawler.php Project: suvash23/app-demo

/**
 * crawl method
 * Create the crawler class object and set the options for crawling
 * @param string $u URL to crawl
 */
function crawl($u)
{
    $C = new MyCrawler();
    $C->setURL($u);
    $C->addContentTypeReceiveRule("#text/html#"); /* Only receive HTML pages */
    $C->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)\$# i"); /* We don't want to crawl non-HTML pages */
    $C->setTrafficLimit(2000 * 1024);
    $C->obeyRobotsTxt(true); /* Obey the rules in robots.txt */
    $C->go();
}

Example #2

File: example.php Project: anselmbradford/OpenSanMateo

            echo "Content not received" . $lb;
        }

        // Now you should do something with the content of the actual
        // received page or file ($DocInfo->source), we skip it in this example
        echo $lb;
        flush();
    }
}

// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.
$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("localhost.p2.gta.charlie");
$crawler->obeyNoFollowTags(TRUE);
$crawler->obeyRobotsTxt(TRUE);
$crawler->enableAggressiveLinkSearch(FALSE);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\\.(jpg|jpeg|gif|png|css|js)([?].*)?\$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we don't want to "suck" the whole site)
$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

Example #3

File: crawl.php Project: JamesRichard-son/whyte-dwarf

        }

        // Final re-order
        $this->links = array_values($this->links);
        return $this->links;
    }
}

// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.
$crawler = new MyCrawler($_SESSION['crawler']['domain']);
$crawler->setFollowMode(2);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\\.(jpg|jpeg|gif|png)\$# i");
$crawler->enableCookieHandling(true);

if ($_SESSION['crawler']['respect_robots_txt'] == true) {
    $crawler->obeyRobotsTxt(true, $_SESSION['crawler']['domain'] . '/robots.txt');
    $crawler->obeyNoFollowTags(true);
}

$crawler->enableAggressiveLinkSearch(false);
$crawler->excludeLinkSearchDocumentSections(PHPCrawlerLinkSearchDocumentSections::ALL_SPECIAL_SECTIONS);
$crawler->addLinkSearchContentType("#text/html# i");
$crawler->setLinkExtractionTags(array('href'));
$crawler->setUserAgentString('Crawl_Scrape_Solr_Index/1.0)');

// no data on page yet
if ($_SESSION['crawler']['auth'] == true) {
    $crawler->set_url_test_auth($_SESSION['crawler']['user'], $_SESSION['crawler']['pass']);
    // '#' is used as the regex delimiter because the pattern itself contains slashes
    $pattern = "#https?://" . str_replace('.', '\\.', $_SESSION['crawler']['silo']) . "#is";
    $crawler->addBasicAuthentication($pattern, $_SESSION['crawler']['user'], $_SESSION['crawler']['pass']);
}

// That's enough, now here we go
$crawler->go();