Crawler Python API
Getting started with Crawler is easy. The main class you need to care about is Crawler.
crawler.main

class crawler.main.Crawler(url, delay, ignore)

    Main Crawler object.

    Example:

        c = Crawler('http://example.com')
        c.crawl()

    Parameters:
        delay – Number of seconds to wait between searches
        ignore – Paths to ignore
    crawl()

        Crawl the URL set up in the crawler.

        This is the main entry point, and will block while it runs.
    get(url)

        Get a specific URL, log its response, and return its content.

        Parameters:
            url – The fully qualified URL to retrieve
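To make the interface above concrete, here is a minimal, self-contained sketch of a Crawler-like class. This is illustrative only, not the real implementation: get is stubbed out (it performs no HTTP request) so the example runs offline, and the ignore-matching behavior is an assumption.

```python
import re
import time


class Crawler:
    """Sketch of the Crawler interface documented above (not the real code)."""

    def __init__(self, url, delay=0, ignore=None):
        self.url = url
        self.delay = delay          # seconds to wait between searches
        self.ignore = ignore or []  # regex patterns for paths to skip

    def get(self, url):
        # The real method would issue an HTTP request and log the response;
        # stubbed here so the sketch is runnable offline.
        return '<html>stub for %s</html>' % url

    def crawl(self):
        # Main entry point; blocks while it runs. Here it only visits the
        # start URL, honoring the configured delay and ignore patterns.
        if any(re.search(pattern, self.url) for pattern in self.ignore):
            return None
        time.sleep(self.delay)
        return self.get(self.url)


c = Crawler('http://example.com', delay=0, ignore=['logout/$'])
content = c.crawl()
```

A real crawler would also parse links out of each page and recurse; the sketch stops after one fetch to stay short.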
crawler.main.run_main()

    A small wrapper that is used for running as a CLI script.
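The documentation does not specify run_main's command-line interface, so the following is a purely hypothetical sketch of such a wrapper using argparse. The flag names and defaults are assumptions, and a real version would construct a Crawler and call crawl().

```python
import argparse


def run_main(argv=None):
    # Hypothetical CLI wrapper: the flag names below are assumptions,
    # not the real crawler's documented interface.
    parser = argparse.ArgumentParser(description='Crawl a site from the command line.')
    parser.add_argument('url', help='URL to start crawling from')
    parser.add_argument('--delay', type=float, default=0,
                        help='seconds to wait between searches')
    parser.add_argument('--ignore', action='append',
                        help='regex of paths to ignore (repeatable)')
    args = parser.parse_args(argv)
    args.ignore = args.ignore or []
    # A real wrapper would now do something like:
    #     Crawler(args.url, args.delay, args.ignore).crawl()
    return args


args = run_main(['http://example.com', '--delay', '2', '--ignore', 'logout/$'])
```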
crawler.utils
utils.should_ignore(ignore_list, url)

    Returns True if the URL should be ignored.

    Parameters:
        ignore_list – The list of regexes to ignore.
        url – The fully qualified URL to compare against.

    >>> should_ignore(['blog/$'], 'http://ericholscher.com/blog/')
    True
    >>> should_ignore(['home'], 'http://ericholscher.com/blog/')
    False
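A sketch of how should_ignore might be implemented, consistent with the doctests above. Assuming re.search semantics (a pattern may match anywhere in the URL) is a guess about the real code:

```python
import re


def should_ignore(ignore_list, url):
    # True if any ignore pattern matches the URL. Sketch only; the real
    # implementation's details (e.g. anchoring) may differ.
    return any(re.search(pattern, url) for pattern in ignore_list)
```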
utils.log(url, status)

    Log information about a response to the console.

    Parameters:
        url – The URL that was retrieved.
        status – The status code of the response.

    >>> log('http://ericholscher.com/blog/', 200)
    OK: 200 http://ericholscher.com/blog/
    >>> log('http://ericholscher.com/blog/', 500)
    ERR: 500 http://ericholscher.com/blog/
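A sketch of log consistent with the output shown above. Treating every non-2xx status as an error is an assumption; the doctests only pin down 200 and 500:

```python
def log(url, status):
    # Print one line per response: 2xx codes are OK, anything else ERR.
    # (Sketch; the real function's exact rules are not documented.)
    prefix = 'OK' if 200 <= status < 300 else 'ERR'
    print('%s: %s %s' % (prefix, status, url))


log('http://ericholscher.com/blog/', 200)
# prints: OK: 200 http://ericholscher.com/blog/
```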