Python modules you should know: Scrapy
April 22, 2012 at 10:50 AM | categories: Python, PyMYSK, Howto

Next in our series of Python modules you should know is Scrapy. Do you want to be the next Google? Well, read on.
Home page
Use
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
You can use Scrapy to extract any kind of data from a web page, in HTML, XML, CSV and other formats. I recently used it to automate the extraction of domains and emails on the ISPA Spam Hall of Shame list, for use in a DNSBL.
Installation
pip install scrapy
Usage
Scrapy is a very extensive package, and it is not possible to describe its full usage in a single blog post. There is a tutorial on the Scrapy website, as well as extensive documentation.
For this post I will describe how I used it to extract the listed domains from the ISPA Hall of Shame website.
The page is http://ispa.org.za/spam/hall-of-shame/; looking at the page source, you find that the domains are displayed in lists, with the bold text "Domains: " before the actual list of domains:
<ul>
<li><strong>Domains: </strong>
dfemail.co.za, extremedeals.co.za, hospitalcoverza.co.za,
lifeinsuranceza.co.za, portablebreathalyzer.co.za
</li>
<li><strong>Addresses: </strong>bounce@dfemail.co.za, bounce@extremedeals.co.za,
bounce@hospitalcoverza.co.za, bounce@lifeinsuranceza.co.za,
bounce@portablebreathalyzer.co.za, info@dfemail.co.za, info@extremedeals.co.za,
info@gmarketing.co.za, info@hospitalcoverza.co.za, info@lifeinsuranceza.co.za,
sales@portablebreathalyzer.co.za
</li>
</ul>
The XPath expression to extract this is:
'//li/strong[text()="Domains: "]/following-sibling::text()'
For more information about XPath see the XPath reference.
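Before writing the spider, you can sanity-check the expression against the sample markup. This sketch uses lxml, which is an assumption on my part (it is not part of the original post), but it implements the same XPath 1.0 semantics that Scrapy's selectors rely on:

```python
from lxml import html

# Sample markup copied from the page source shown above.
snippet = """
<ul>
<li><strong>Domains: </strong>
dfemail.co.za, extremedeals.co.za, hospitalcoverza.co.za,
lifeinsuranceza.co.za, portablebreathalyzer.co.za
</li>
</ul>
"""

tree = html.fromstring(snippet)
# Select the text nodes that follow the bold "Domains: " label.
lines = tree.xpath('//li/strong[text()="Domains: "]/following-sibling::text()')
# Split on commas and strip whitespace, dropping empty pieces.
domains = [d.strip() for line in lines for d in line.split(',') if d.strip()]
print(domains)
```

Note that the predicate matches the label text exactly, including the trailing space, so it must be quoted as it appears in the source.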
With the XPath expression in hand, we can now write a spider to download the web page and extract the data we want.
Create a Python file crawl-ispa-domains.py with the following contents:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# crawl-ispa-domains.py
# Copyright (C) 2012 Andrew Colin Kissa <andrew@topdog.za.net>
# vim: ai ts=4 sts=4 et sw=4
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class ISPASpider(BaseSpider):
    name = "ispa-domains"
    allowed_domains = ["ispa.org.za"]
    start_urls = [
        "http://ispa.org.za/spam/hall-of-shame/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        lines = hxs.select(
            '//li/strong[text()="Domains: "]/following-sibling::text()'
        ).extract()
        for line in lines:
            domains = line.split(',')
            domains = [domain.strip() for domain in domains if domain.strip()]
            for domain in domains:
                print domain
You can then run the spider from the command line, and it should provide you with the list of extracted domains.
scrapy runspider --nolog crawl-ispa-domains.py
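For the DNSBL use mentioned earlier, the printed list can then be massaged into zone entries. The record format in this sketch is purely illustrative (an assumption, not from the original post); adapt it to whatever your DNSBL software actually expects:

```python
# Turn an extracted domain list into simple DNSBL-style zone entries.
# The "domain :127.0.0.2" record format is illustrative only -- check
# the format your DNSBL software (e.g. rbldnsd) requires.
domains = ["extremedeals.co.za", "dfemail.co.za", "dfemail.co.za"]

# Deduplicate and sort before emitting the entries.
entries = ["%s :127.0.0.2" % domain for domain in sorted(set(domains))]
for entry in entries:
    print(entry)
```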
And there is more
This post just touches the tip of what Scrapy can do; see the documentation for details on what else can be done with this package.