Crawler Python API
Getting started with Crawler is easy. The main class you need to care about is Crawler.
crawler.main

class crawler.main.Crawler(url, delay, ignore)

    Main Crawler object.

    Example:

        c = Crawler('http://example.com')
        c.crawl()

    Parameters:
        delay – Number of seconds to wait between searches
        ignore – Paths to ignore
    crawl()

        Crawl the URL set up in the crawler.

        This is the main entry point, and will block while it runs.
    get(url)

        Get a specific URL, log its response, and return its content.

        Parameters:
            url – The fully qualified URL to retrieve
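To make the interface above concrete, here is a minimal, self-contained sketch of a Crawler-like class. This is illustrative only, not the real implementation: get is stubbed out (it performs no HTTP request) so the example runs offline, and the ignore-matching behavior is an assumption.

```python
import re
import time


class Crawler:
    """Sketch of the Crawler interface documented above (not the real code)."""

    def __init__(self, url, delay=0, ignore=None):
        self.url = url
        self.delay = delay          # seconds to wait between searches
        self.ignore = ignore or []  # regex patterns for paths to skip

    def get(self, url):
        # The real method would issue an HTTP request and log the response;
        # stubbed here so the sketch is runnable offline.
        return '<html>stub for %s</html>' % url

    def crawl(self):
        # Main entry point; blocks while it runs. Here it only visits the
        # start URL, honoring the configured delay and ignore patterns.
        if any(re.search(pattern, self.url) for pattern in self.ignore):
            return None
        time.sleep(self.delay)
        return self.get(self.url)


c = Crawler('http://example.com', delay=0, ignore=['logout/$'])
content = c.crawl()
```

A real crawler would also parse links out of each page and recurse; the sketch stops after one fetch to stay short.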
crawler.main.run_main()

    A small wrapper that is used for running as a CLI script.
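The documentation does not specify run_main's command-line interface, so the following is a purely hypothetical sketch of such a wrapper using argparse. The flag names and defaults are assumptions, and a real version would construct a Crawler and call crawl().

```python
import argparse


def run_main(argv=None):
    # Hypothetical CLI wrapper: the flag names below are assumptions,
    # not the real crawler's documented interface.
    parser = argparse.ArgumentParser(description='Crawl a site from the command line.')
    parser.add_argument('url', help='URL to start crawling from')
    parser.add_argument('--delay', type=float, default=0,
                        help='seconds to wait between searches')
    parser.add_argument('--ignore', action='append',
                        help='regex of paths to ignore (repeatable)')
    args = parser.parse_args(argv)
    args.ignore = args.ignore or []
    # A real wrapper would now do something like:
    #     Crawler(args.url, args.delay, args.ignore).crawl()
    return args


args = run_main(['http://example.com', '--delay', '2', '--ignore', 'logout/$'])
```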
crawler.utils
utils.should_ignore(ignore_list, url)

    Returns True if the URL should be ignored.

    Parameters:
        ignore_list – The list of regexes to ignore.
        url – The fully qualified URL to compare against.

    >>> should_ignore(['blog/$'], 'http://ericholscher.com/blog/')
    True
    >>> should_ignore(['home'], 'http://ericholscher.com/blog/')
    False
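A sketch of how should_ignore might be implemented, consistent with the doctests above. Assuming re.search semantics (a pattern may match anywhere in the URL) is a guess about the real code:

```python
import re


def should_ignore(ignore_list, url):
    # True if any ignore pattern matches the URL. Sketch only; the real
    # implementation's details (e.g. anchoring) may differ.
    return any(re.search(pattern, url) for pattern in ignore_list)
```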
utils.log(url, status)

    Log information about a response to the console.

    Parameters:
        url – The URL that was retrieved.
        status – The status code of the response.

    >>> log('http://ericholscher.com/blog/', 200)
    OK: 200 http://ericholscher.com/blog/
    >>> log('http://ericholscher.com/blog/', 500)
    ERR: 500 http://ericholscher.com/blog/
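A sketch of log consistent with the output shown above. Treating every non-2xx status as an error is an assumption; the doctests only pin down 200 and 500:

```python
def log(url, status):
    # Print one line per response: 2xx codes are OK, anything else ERR.
    # (Sketch; the real function's exact rules are not documented.)
    prefix = 'OK' if 200 <= status < 300 else 'ERR'
    print('%s: %s %s' % (prefix, status, url))


log('http://ericholscher.com/blog/', 200)
# prints: OK: 200 http://ericholscher.com/blog/
```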