How to Deploy Custom Docker Images for Your Web Crawlers

What if you could have complete control over your environment? Your crawling environment, that is… One of the many benefits of our upgraded production environment, Scrapy Cloud 2.0, is that you can customize your crawler runtime environment via Docker images. It’s like a superpower that allows you to use specific versions of Python, Scrapy and the rest of your stack, deciding if and when to upgrade.

With this new feature, you can tailor a Docker image to include any dependency your crawler might have. For instance, if you wanted to crawl JavaScript-based pages using Selenium and PhantomJS, you would have to include the PhantomJS executable somewhere in the PATH of your crawler’s runtime environment.

And guess what, we’ll be walking you through how to do just that in this post.

Heads up, while we have a forever free account, this feature is only available for paid Scrapy Cloud users. The good news is that it’s easy to upgrade your account. Just head over to the Billing page on Scrapy Cloud.

Upgrade Your Account

Using a custom image to run a headless browser

Download the sample project or clone the GitHub repo to follow along.

Imagine you created a crawler to handle website content that is rendered client-side via JavaScript. You decide to use Selenium and PhantomJS. However, since PhantomJS is not installed by default on Scrapy Cloud, trying to deploy your crawler the usual way would result in this message showing up in the job logs:

selenium.common.exceptions.WebDriverException: Message: 'phantomjs' executable needs to be in PATH.

PhantomJS, which is a C++ application, needs to be installed in the runtime environment. You can do this by creating a custom Docker image that downloads and installs the PhantomJS executable.
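
To make things concrete, here's a minimal sketch of what such a spider might look like (the spider name, URL, and selectors are illustrative, not necessarily the sample project's exact code):

import scrapy
from selenium import webdriver


class QuotesJSSpider(scrapy.Spider):
    name = 'quotesjs'
    start_urls = ['http://quotes.toscrape.com/js/']

    def parse(self, response):
        # Selenium looks up the 'phantomjs' executable on the PATH, which is
        # why the runtime environment has to provide it.
        driver = webdriver.PhantomJS()
        driver.get(response.url)
        rendered = scrapy.Selector(text=driver.page_source)
        driver.quit()
        for quote in rendered.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }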

Building a custom Docker image

First, you have to install the command line tool that will help you build and deploy the image:

$ pip install shub-image

Before using shub-image, you have to include scrapinghub-entrypoint-scrapy, a runtime dependency of Scrapy Cloud, in your project’s requirements file.

$ echo scrapinghub-entrypoint-scrapy >> ./requirements.txt

Once you have done that, run the following command to generate an initial Dockerfile for your custom image:

$ shub-image init --requirements ./requirements.txt

It will ask you whether you want to save the Dockerfile, so confirm by answering Y.

Now it’s time to include the installation steps for the PhantomJS binary in the generated Dockerfile. All you need to do is copy the PhantomJS download-and-install step shown below (the second RUN instruction) and put it in the proper place inside your Dockerfile:

FROM python:2.7
RUN apt-get update -qq && \
    apt-get install -qy htop iputils-ping lsof ltrace strace telnet vim && \
    rm -rf /var/lib/apt/lists/*
RUN wget -q https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/bin && \
    rm -rf phantomjs-2.1.1-linux-x86_64.tar.bz2 phantomjs-2.1.1-linux-x86_64
ENV TERM xterm
ENV PYTHONPATH $PYTHONPATH:/app
ENV SCRAPY_SETTINGS_MODULE demo.settings
RUN mkdir -p /app
WORKDIR /app
COPY ./requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app

The Docker image you’re going to build with shub-image has to be uploaded to a Docker registry. I used Docker Hub, the default Docker registry, to create a repository under my user account.

Once this is done, you have to define the images setting in your project’s scrapinghub.yml (replace stummjr/demo with your own):

projects:
    default: PUT_YOUR_PROJECT_ID_HERE
requirements_file: requirements.txt
images:
    default: stummjr/demo

This will tell shub-image where to push the image once it’s built and also where Scrapy Cloud should pull the image from when deploying.

Now that you have everything configured as expected, you can build, push, and deploy the Docker image to Scrapy Cloud. This step may take a couple of minutes, so now might be a good time to go grab a cup of coffee. 🙂

$ shub-image upload --username stummjr --password NotSoEasy
The image stummjr/demo:1.0 build is completed.
Pushing stummjr/demo:1.0 to the registry.
The image stummjr/demo:1.0 pushed successfully.
Deploy task results:
You can check deploy results later with 'shub-image check --id 1'.
Deploy results:
{u'status': u'progress', u'last_step': u'pulling'}
{u'status': u'ok', u'project': 98162, u'version': u'1.0', u'spiders': 1}

If everything went well, you should now be able to run your PhantomJS spider on Scrapy Cloud. If you followed along with the sample project from the GitHub repo, your crawler should have collected 300 quotes scraped from the page that was rendered with PhantomJS.

Wrap Up

You now officially know how to use custom Docker images with Scrapy Cloud to supercharge your crawling projects. For example, you might want to do OCR using Tesseract in your crawler. Now you can: it’s just a matter of creating a Docker image with the Tesseract command line tool and pytesseract installed. You can also install tools from apt repositories and even compile the libraries and tools you need.

Warning: this feature is still in beta, so be aware that some Scrapy Cloud features, such as addons, dependencies and settings, still don’t work with custom images.

For further information, check out the shub-image documentation.

Feel free to comment below with any other ideas or tips you’d like to hear more about!

This feature is a perk of paid accounts, so painlessly upgrade to unlock custom Docker images for your projects. Just head over to the Billing page on Scrapy Cloud.

Upgrade Your Account

Improved Frontera: Web Crawling at Scale with Python 3 Support

Python is our go-to language, and Python 2 is losing traction. In order to survive, older programs need to be Python 3 compatible.

And so we’re pleased to announce that Frontera will remain alive and kicking because it now supports Python 3 in full! With Frontera joining the ranks of Scrapy and Scrapy Cloud, you can officially continue to quickly create and scale fully formed crawlers without any issues in your Python 3-ready stack.

As a key web crawling toolbox that works with Scrapy, along with other web crawling systems, Frontera provides a crawl frontier framework that is ideal for broad crawls. Frontera manages when and what to crawl next, and checks whether the crawling goals have been accomplished. This is especially useful for building a distributed architecture with multiple web spider processes consuming URLs from a frontier.

Once you’re done cheering with joy, read on to see how you can use this upgrade in your stack.

Python 3 Perks and Frontera Installation

This move to Python 3 covers all run modes, workers, and message buses, as well as the backends, including the HBase, ZeroMQ, and Kafka clients. The development process is now a lot more reliable since we have tests covering all major components, as well as integration tests running HBase and Kafka.

Frontera is already available on PyPI. All you need to do is pip install --upgrade frontera. Then run it with a Python 3 interpreter and you’re ready to scale your crawls!

Shiny New Features

The request object is now propagated throughout the whole pipeline, allowing you to schedule requests with custom methods, headers, cookies and body parameters.

HBaseQueue now supports delayed requests. Setting a ‘crawl_at’ field in meta to a timestamp makes a request available to spiders only after that moment has passed.
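
For instance, in a Scrapy spider running on top of Frontera, scheduling a request for an hour from now might look like the sketch below (the ‘crawl_at’ value is a timestamp as described above; the URL and the exact integration details are assumptions):

import time

import scrapy

# Hedged sketch: ask the HBase-backed frontier to hold this request back
# for roughly an hour by stamping 'crawl_at' into the request meta.
delayed_request = scrapy.Request(
    'http://example.com/check-back-later',
    meta={'crawl_at': int(time.time()) + 3600},
)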

There is also a new MESSAGE_BUS_CODEC option that lets you swap the default message bus codec (MsgPack) for another one, or plug in a custom codec.

Upgrades from the Original

Now, Frontera guarantees the exclusive assignment of extracted links to strategy workers based on links’ hostname.

So links from a specific host will always be assigned to the same strategy worker instance, which prevents errors and greatly simplifies the design.

Upcoming Improvements

In the near to distant future, we want Frontera and Frontera-based crawlers to be the number one software for large-scale web crawling. Our next step in this process is to ease the deployment of Frontera in Docker environments, including scaling and management.

We’re aiming for Frontera to be easily deployable to major cloud providers’ infrastructures, such as Google Cloud Platform and AWS. It’s quite likely we will choose Kubernetes as our orchestration platform. Along with this goal, we will develop a solid web UI to manage and monitor Frontera-based crawlers. So stay tuned!

Wrap Up

Have we piqued your interest? Here’s a quick guide to get started.

Well, what are you waiting for? Take full advantage of Frontera with Python 3 support and start scaling your crawls. Check out our use cases to see what’s possible.

How to Crawl the Web Politely with Scrapy

The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of the website’s owners.

In this post we’re sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers.

Whether you call them spiders, crawlers, or robots, let’s work together to create a world of Baymaxs, WALL-Es, and R2-D2s rather than an apocalyptic wasteland of HAL 9000s, T-1000s, and Megatrons.

What Makes a Crawler Polite?

A polite crawler respects robots.txt
A polite crawler never degrades a website’s performance
A polite crawler identifies its creator with contact information
A polite crawler is not a pain in the buttocks of system administrators

robots.txt

Always make sure that your crawler follows the rules defined in the website’s robots.txt file. This file is usually available at the root of a website (www.example.com/robots.txt) and it describes what a crawler should or shouldn’t crawl according to the Robots Exclusion Standard. Some websites even use the crawlers’ user agent to specify separate rules for different web crawlers:

User-agent: Some_Annoying_Bot
Disallow: /

User-Agent: *
Disallow: /*.json
Disallow: /api
Disallow: /post
Disallow: /submit
Allow: /

Crawl-Delay

Mission critical to having a polite crawler is making sure it doesn’t hit a website too hard. Respect the Crawl-Delay directive in robots.txt, which specifies how long crawlers should wait between requests.

When a website gets overloaded with more requests than its web server can handle, it might become unresponsive. Don’t be the person who causes a headache for the website administrators.
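
If you want to check a site’s rules by hand before unleashing your spider, Python’s standard library can help. Here’s a small sketch (crawl_delay() is available in newer Python 3 versions and returns None when the directive is absent):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('http://www.example.com/robots.txt')
rp.read()

# How long this user agent is asked to wait between requests, if specified.
print(rp.crawl_delay('MyCompany-MyCrawler'))
# Whether this user agent may fetch a given URL at all.
print(rp.can_fetch('MyCompany-MyCrawler', 'http://www.example.com/api'))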

User-Agent

However, if you have ignored the cardinal rules above (or your crawler has achieved aggressive sentience), there needs to be a way for the website owners to contact you. You can do this by including your company name and an email address or website in the request’s User-Agent header. For example, Google’s crawler user agent is “Googlebot”.

Scrapinghub Abuse Report Form

Hey folks using our Scrapy Cloud platform! We trust you will crawl responsibly, but to support website administrators, we provide an abuse report form where they can report any misbehaviour from crawlers running on our platform. We’ll kindly pass the message along so that you can modify your crawls and avoid ruining a sysadmin’s day. If your crawlers are turning into Skynet and running roughshod over human law, we reserve the right to halt their crawling activities and thus avert the robot apocalypse.

How to be Polite using Scrapy

Scrapy is a bit like Optimus Prime: friendly, fast, and capable of getting the job done no matter what. However, much like Optimus Prime and his fellow Autobots, Scrapy occasionally needs to be kept in check. So here’s the nitty gritty for ensuring that Scrapy is as polite as can be.

Robots.txt

Crawlers created using Scrapy 1.1+ already respect robots.txt by default. If your crawlers have been generated using a previous version of Scrapy, you can enable this feature by adding this in the project’s settings.py:

ROBOTSTXT_OBEY = True

Then, every time your crawler tries to download a page from a disallowed URL, you’ll see a message like this:

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>

Identifying your Crawler

It’s important to provide a way for sysadmins to easily contact you if they have any trouble with your crawler. If you don’t, they’ll have to dig into their logs and look for the offending IPs.

Be nice to the friendly sysadmins in your life and identify your crawler via the Scrapy USER_AGENT setting. Share your crawler name, company name and a contact email:

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

Introducing Delays

Scrapy spiders are blazingly fast. They can handle many concurrent requests and they make the most of your bandwidth and computing power. However, with great power comes great responsibility.

To avoid hitting the web servers too frequently, you need to use the DOWNLOAD_DELAY setting in your project (or in your spiders). Scrapy will then introduce a random delay ranging from 0.5 * DOWNLOAD_DELAY to 1.5 * DOWNLOAD_DELAY seconds between consecutive requests to the same domain. If you want to stick to the exact DOWNLOAD_DELAY that you defined, you have to disable RANDOMIZE_DOWNLOAD_DELAY.

By default, DOWNLOAD_DELAY is set to 0. To introduce a 5 second delay between requests from your crawler, add this to your settings.py:

DOWNLOAD_DELAY = 5.0
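
And if you want your crawler to wait exactly the 5 seconds you defined every time, disable the randomization as well:

RANDOMIZE_DOWNLOAD_DELAY = False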

If you have a multi-spider project crawling multiple sites, you can define a different delay for each spider with the download_delay (yes, it’s lowercase) spider attribute:

class MySpider(scrapy.Spider):
    name = 'myspider'
    download_delay = 5.0
    ...

Concurrent Requests Per Domain

Another setting you might want to tweak to make your spider more polite is the number of concurrent requests it will do for each domain. By default, Scrapy will dispatch at most 8 requests simultaneously to any given domain, but you can change this value by updating the CONCURRENT_REQUESTS_PER_DOMAIN setting.

Heads up, the CONCURRENT_REQUESTS setting defines the maximum number of simultaneous requests that Scrapy’s downloader will make across all your spiders. Tweaking this setting is more about your own server’s performance and bandwidth than your target’s when you’re crawling multiple domains at the same time.
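
For example, to be extra gentle with each website while keeping a sane overall cap, you could add something like this to your settings.py (the values are just an illustration):

CONCURRENT_REQUESTS_PER_DOMAIN = 2   # at most 2 simultaneous requests per domain
CONCURRENT_REQUESTS = 16             # overall cap for the downloader (Scrapy's default)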

AutoThrottle to Save the Day

Websites vary drastically in the number of requests they can handle. Adjusting this manually for every website that you are crawling is about as much fun as watching paint dry. To save your sanity, Scrapy provides an extension called AutoThrottle.

AutoThrottle automatically adjusts the delays between requests according to the current web server load. It starts by measuring the latency of a single request and then adjusts the delay between requests to the same domain so that no more than AUTOTHROTTLE_TARGET_CONCURRENCY requests are active at once. It also ensures that requests are evenly distributed over a given timespan.

To enable AutoThrottle, just include this in your project’s settings.py:

AUTOTHROTTLE_ENABLED = True

Scrapy Cloud users don’t have to worry about enabling it, because it’s already enabled by default.

There’s a wide range of settings to help you tweak the throttle mechanism, so have fun playing around!
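
As a starting point, a settings.py tweak might look like this (the values here are just something to play with, not a recommendation):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay, used before any latency is measured
AUTOTHROTTLE_MAX_DELAY = 60.0          # never wait longer than this between requests
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for about one request in flight per server
AUTOTHROTTLE_DEBUG = True              # log every throttling decision while you experiment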

Use an HTTP Cache for Development

Developing a web crawler is an iterative process. However, running a crawler to check if it’s working means hitting the server multiple times for each test. To help you to avoid this impolite activity, Scrapy provides a built-in middleware called HttpCacheMiddleware. You can enable it by including this in your project’s settings.py:

HTTPCACHE_ENABLED = True

Once enabled, it caches every request made by your spider along with the related response. So the next time you run your spider, it will not hit the server for requests already done. It’s a win-win: your tests will run much faster and the website will save resources.
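
If you’re iterating on a crawler for several days, you may also want cached responses to expire after a while instead of living forever (0, the default, means they never expire):

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400   # re-fetch anything cached more than a day ago
HTTPCACHE_DIR = 'httpcache'         # stored inside the project's .scrapy directory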

Don’t Crawl, use the API

Many websites provide HTTP APIs so that third parties can consume their data without having to crawl their web pages. Before building a web scraper, check if the target website already provides an HTTP API that you can use. If it does, go with the API. Again, it’s a win-win: you avoid digging into the page’s HTML and your crawler gets more robust because it doesn’t need to depend on the website’s layout.

Wrap Up

Let’s all do our part to keep the peace between sysadmins, website owners, and developers by making sure that our web crawling projects are as noninvasive as possible. Remember, we need to band together to delay the rise of our robot overlords, so let’s keep our crawlers, spiders, and bots polite.

To all website owners, help a crawler out and ensure your site has an HTTP API. And remember, if someone using our platform is overstepping their bounds, please fill out an Abuse Report form and we’ll take care of the issue.

For those new to our platform, Scrapy Cloud is forever free and is the peanut butter to Scrapy’s jelly. For our existing Scrapy and Scrapy Cloud users, hopefully you learned a few tips for how to both speed up your crawls and prevent abuse complaints. Let us know if you have any further suggestions in the comment section below!

Sign up for free

Introducing Scrapy Cloud with Python 3 Support

It’s the end of an era. Python 2 is on its way out with only a few security and bug fixes forthcoming from now until its official retirement in 2020. Given this withdrawal of support and the fact that Python 3 has snazzier features, we are thrilled to announce that Scrapy Cloud now officially supports Python 3.

If you are new to Scrapinghub, Scrapy Cloud is our production platform that allows you to deploy, monitor, and scale your web scraping projects. It pairs with Scrapy, the open source web scraping framework, and Portia, our open source visual web scraper.

Scrapy + Scrapy Cloud with Python 3

I’m sure you Scrapy users are breathing a huge sigh of relief! While Scrapy with official Python 3 support has been around since May, you can now deploy Scrapy spiders that use the fancy new features introduced with Python 3 straight to Scrapy Cloud. You’ll have the beloved extended tuple unpacking, function annotations, keyword-only arguments and much more at your fingertips.

Fear not if you are a Python 2 developer and can’t port your spiders’ codebase to Python 3, because Scrapy Cloud will continue supporting Python 2. In fact, Python 2 remains the default unless you explicitly set your environment to Python 3.

Deploying your Python 3 Spiders

Docker support was one of the new features that came along with the Scrapy Cloud 2.0 release in May. It brings more flexibility to your spiders, allowing you to define which kind of runtime environment (AKA stack) they will be executed in.

This configuration is done in your local project’s scrapinghub.yml. There, you have to include a stacks section with scrapy:1.1-py3 as the stack for your Scrapy Cloud project:

projects:
    default: 99999
stacks:
    default: scrapy:1.1-py3

After doing that, you just have to deploy your project using shub:

$ shub deploy

Note: make sure you are using shub 2.3+ by upgrading it:

$ pip install shub --upgrade

And you’re all done! The next time you run your spiders on Scrapy Cloud, they will run on Scrapy 1.1 + Python 3.

Multi-target Deployment File

If you have a multi-target deployment file, you can define a separate stack for each project ID:

projects:
    default:
        id: 55555
        stack: scrapy:1.1
    py3:
        id: 99999
        stack: scrapy:1.1-py3

This allows you to deploy your local project to whichever Scrapy Cloud project you want, using a different stack for each one:

$ shub deploy py3

This deploys your crawler to project 99999 and uses Scrapy 1.1 + Python 3 as the execution environment.

You can find different versions of the Scrapy stack here.

Wrap Up

We hope that you’re as excited as we are for this newest upgrade to Python 3. If you have further questions or are interested in learning more about the souped up Scrapy Cloud, take a look at our Knowledge Base article.

For those new to our platform, Scrapy Cloud has a forever free subscription, so sign up and give us a try.

Sign up for free

What the Suicide Squad Tells Us About Web Data

Web data is a bit like the Matrix. It’s all around us, but not everyone knows how to use it meaningfully. So here’s a brief overview of the many ways that web data can benefit you as a researcher, marketer, entrepreneur, or even multinational business owner.

Since web scraping and web data extraction are sometimes viewed a bit like antiheroes, I’m introducing each of the use cases through characters from the Suicide Squad film. I did my best to pair according to character traits and real world web data uses, so hopefully this isn’t too much of a stretch.

This should be spoiler free, with nothing revealed that you can’t get from the trailers! Fair warning, you’re going to have Ballroom Blitz stuck in your head all day. And if you haven’t seen Suicide Squad yet, hopefully we get you pumped up for this popcorn movie.

Market Research and Predictions: Deadshot

Deadshot’s claim to fame is accurate aim. He can predict bullet trajectories and he never misses a shot. So I paired him with using web data for market research and trend prediction. You can scrape multiple websites for price fluctuation, new products, reviews, and consumer trends. This is an automated process that allows you to quickly and accurately analyze data without needing to manually monitor websites.

Social Media Monitoring: Harley Quinn

Harley Quinn has a sunny personality that remains chipper even when faced with death, destruction, torture, and mayhem. She also always has a witty comeback no matter the situation. These traits go hand-in-hand with how brands should approach social media channels. Extracting web data from social media interactions helps you understand consumer opinions. You can monitor ongoing chatter about your company or your competition and respond in the most positive way possible.

Lead Generation and HR Recruitment: Amanda Waller

This is probably the most obvious pairing since Amanda Waller (played by the wonderful Viola Davis) is the one responsible for assembling the Suicide Squad. She carefully researched and compiled intimate details on all the criminals-turned-reluctant-heroes. This aspect of web data benefits sales, marketing, recruitment, and HR professionals alike. With a pre-vetted pool, you’ll have access to qualified leads and decision-makers without needing to wade through the worst of the worst.

Tracking Criminal Activity in the Dark Web: Killer Croc

This sewer-dwelling villain thrives in dark and hidden spaces. He’s used to working underground and in places most people don’t even know exist. This makes Killer Croc the perfect backdrop for the type of web data located in the deep/dark web. The dark web is the part of the internet that is not indexed by search engines (Google, Bing, etc.) and is often a haven for criminal activity. Data scraped from this part of the web is commonly used by law enforcement agencies.

Competitive Pricing: Captain Boomerang

This jewelry thief goes around the world stealing from banks and committing acts of burglary – with a boomerang… Captain Boomerang knows all about pricing and the comparative value of products so he can get the largest bang for his buck. Similarly, web data is a great resource for new companies looking to research their industry and how their prices match up to the competition. And if you are an established company, this is a great way for you to keep track of newcomers and potential market disruptors.

Machine Learning Models: Enchantress

In her 6313 years of existence, the Enchantress has had to cope with changing times, customs, and civilizations. The ability to learn quickly and adapt to new situations is definitely an important part of her continued survival. Likewise, machine learning is a form of artificial intelligence that can learn when given new information. Train your machine learning models using datasets for conducting sentiment analysis, making predictions, and even automating web scraping. Whether you are an SaaS company specializing in developing machine learning technology or someone who needs machine learning analysis, you need to ensure you have up-to-date datasets.

Monitoring Resellers: Colonel Rick Flag

Colonel Rick Flag is a “good guy” whose job is to keep track of the Suicide Squad and kill them if they get out of line. Now obviously your relationship with resellers is not a life-and-death situation, but it can be good to know how your brand is being represented across the internet. Web scraping can help you keep track of reseller customer reviews and any contract violations that might be occurring.

Monitoring Legal Matters and Government Corruption: Katana

Katana the samurai is the enforcer of the Suicide Squad. She is there as an additional check to keep the criminal members in line. Similarly, web data allows reporters, lawyers, and concerned citizens to keep track of government officials, potential corruption charges, and changing legal matters. You can scrape obscure or poorly presented public records and then use that information to create accessible interfaces for easy reference and research.

Web Scraping for Fun: the Joker

I believe the Joker needs no introduction, whether you know this character from Jack Nicholson, Heath Ledger, or the new Jared Leto incarnation. He is unpredictable, has eclectic tastes, and is capable of doing anything. And honestly, this is what web scraping is all about. Whether you want to build a bike sharing app or monitor government corruption, web data provides the backbone for all of your creative endeavors.

Wrap Up

I hope you enjoyed this unorthodox tour of the world of web data! If you’re looking for some mindless fun, Suicide Squad ain’t half bad (it ain’t half good either). If you’re looking to explore how web data fits within your business or personal projects, feel free to reach out to us. And if you’re looking to hate on or defend Suicide Squad, comment below.

P.S. There is no way this movie is worse than Batman v Superman: Dawn of Justice

This Month in Open Source at Scrapinghub August 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by emailing opensource@scrapinghub.com or on Twitter @scrapinghub.

Scrapy

This past May, Scrapy 1.1 (with Python 3 support) was a big milestone for our Python web scraping community. And 2 weeks ago, Scrapy reached 15k stars on GitHub, making it the 10th most starred Python project on GitHub! We are very proud of this and want to thank all our users, stargazers and contributors!

What’s coming in Scrapy 1.2 (in a couple of weeks)?

  • The ability to specify the encoding of items in your JSON, CSV or XML output files
  • Creating Scrapy projects in any folder you want (not only the current one)

Scrapy Plugins

We’re moving various Scrapy middleware and helpers to their own repository under scrapy-plugins home on GitHub. They are all available on PyPI.
Many of these were previously found wrapped inside scrapylib (which will not see a new release).

Here are some of the newly released ones:

  • scrapy-querycleaner: used for cleaning up query parameters in URLs; helpful when some of them are not relevant (you get the same page with or without them), thus avoiding duplicate page fetches.
  • scrapy-magicfields: automagically add special fields in your scraped items such as timestamps, response attributes, etc.

Libraries

Dateparser

In mid-June we released version 0.4 of Dateparser with quite a few parsing improvements and new features (as well as several bug fixes). For example, this version introduces its own parser, replacing the one from dateutil. However, we may converge back at some point in the future.

It also handles relative dates in the future, e.g. “tomorrow”, “in two weeks”, etc. We also replaced PyYAML with one of its active forks, ruamel.yaml. We hope you enjoy it!
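
Here’s a tiny taste of the future-dates handling (a quick sketch; the exact datetimes you get back depend on when you run it):

import dateparser

# Relative dates in the future are now understood out of the box.
print(dateparser.parse(u'tomorrow'))
print(dateparser.parse(u'in two weeks'))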

Fun fact: we caught the attention of Kenneth Reitz with dateparser. And although dateparser didn’t quite solve his issue, “[he] like[s] it a lot” so it made our day 😉

w3lib

w3lib v1.15 now has a canonicalize_url(), extracted from Scrapy helpers. You may find it handy when walking in the jungle of non-ASCII URLs in Python 3!
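
For example, here’s a small sketch of what it does (assuming the usual normalization, such as sorting query arguments):

from w3lib.url import canonicalize_url

# Normalizes the URL, e.g. by sorting query arguments, which is handy
# when deduplicating URLs that only differ in parameter order.
print(canonicalize_url('http://www.example.com/do?b=2&c=3&a=1'))
# should print something like: http://www.example.com/do?a=1&b=2&c=3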

Wrap Up

And that’s it for This Month in Open Source at Scrapinghub August 2016. Open Source is in our DNA and so we’re always working on new projects and improving pre-existing ones. Keep up with us and explore our GitHub. We welcome contributors and we are also hiring, so check out our jobs page!

Meet Parsel: the Selector Library behind Scrapy

We eat our own spider food since Scrapy is our go-to workhorse on a daily basis. However, there are certain situations where Scrapy can be overkill and that’s when we use Parsel. Parsel is a Python library for extracting data from XML/HTML text using CSS or XPath selectors. It powers the scraping API of the Scrapy framework.

Not to be confused with Parseltongue/Parselmouth

We extracted Parsel from Scrapy during Europython 2015 as a part of porting Scrapy to Python 3. As a library, it’s lighter than Scrapy (it relies on lxml and cssselect) and also more flexible, allowing you to use it within any Python program.

Using Parsel

Install Parsel using pip:

pip install parsel

And here’s how you use it. Say you have this HTML snippet in a variable:

>>> html = u'''
... <ul>
...     <li><a href="http://blog.scrapinghub.com">Blog</a></li>
...     <li><a href="http://www.scrapinghub.com">Scrapinghub</a></li>
...     <li class="external"><a href="http://www.scrapy.org">Scrapy</a></li>
... </ul>
... '''

You then import the Parsel library, load it into a Parsel Selector and extract links with an XPath expression:

>>> import parsel
>>> sel = parsel.Selector(html)
>>> sel.xpath("//a/@href").extract()
[u'http://blog.scrapinghub.com', u'http://www.scrapinghub.com', u'http://www.scrapy.org']

Note: Parsel works both in Python 3 and Python 2. If you’re using Python 2, remember to pass the HTML in a unicode object.

Sweet Parsel Features

One of the nicest features of Parsel is the ability to chain selectors. This allows you to chain CSS and XPath selectors however you wish, such as in this example:

>>> sel.css('li.external').xpath('./a/@href').extract()
[u'http://www.scrapy.org']

You can also iterate through the results of the .css() and .xpath() methods since each element will be another selector:

>>> for li in sel.css('ul li'):
...     print(li.xpath('./a/@href').extract_first())
...
http://blog.scrapinghub.com
http://www.scrapinghub.com
http://www.scrapy.org

You can find more examples of this in the documentation.

When to use Parsel

The beauty of Parsel is in its wide applicability. It is useful for a range of situations including:

  • Processing XML/HTML data in an IPython notebook
  • Writing end-to-end tests for your website or app
  • Simple web scraping projects with the Python Requests library
  • Simple automation tasks at the command-line

And now, you can also run Parsel with the command-line tool for simple extraction tasks in your terminal. This new development is thanks to our very own Rolando who created parsel-cli.

Install parsel-cli with pip install parsel-cli and play around using the examples below (you need to have curl installed).

The following command will download and extract the list of Academy Award-winning films from Wikipedia:

curl -s https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films |\
    parsel-cli 'table.wikitable tr td i a::text'

You can also get the current top 5 news items from Hacker News using:

curl -s https://news.ycombinator.com |\
    parsel-cli 'a.storylink::attr(href)' | head -n 5

And how about obtaining a list of the latest YouTube videos from a specific channel?

curl -s https://www.youtube.com/user/crashcourse/videos |\
    parsel-cli 'h3 a::attr(href), h3 a::text' |\
    paste -s -d' \n' - | sed 's|^|http://youtube.com|'

Wrap Up

I hope that you enjoyed this little tour of Parsel, and I’m looking forward to seeing how these examples spark your imagination when you’re looking for solutions to your HTML parsing needs.

The next time you find yourself wanting to extract data from HTML/XML and don’t need Scrapy and its crawling capabilities, you know what to do: just Parsel it!

Feel free to reach out to us on Twitter and let us know how you use Parsel in your projects.
