How to make awesome datasets fast with Scrapy in Python

--

Photo by Mika Baumeister on Unsplash

Scrapy is a highly customizable and developer-friendly crawling framework for Python. It helps you build a capable crawler in just a few lines of code to scrape data from a website. The best part is its flexibility: you can use it in any program with a simple import, or generate a boilerplate project with a single command. You can even extend its projects with other libraries, such as Beautiful Soup 4.
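As a quick illustration of that flexibility, here is a minimal sketch of driving Scrapy from a plain script with CrawlerProcess instead of a generated project. The spider name, URL, and selector below are placeholders for illustration only, not part of today's project.

import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    # Hypothetical throwaway spider, only here to show the script-based usage
    name = "title"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield the page title as a scraped item
        yield {"title": response.xpath("//title/text()").get()}

process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
process.crawl(TitleSpider)
process.start()  # blocks until the crawl is finished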

Today’s goal is to get product details from an e-commerce site at blazing speed. Let’s take our usual example website for this experiment. I will assume you have already installed Scrapy using pip or conda. If not, just use one of the following commands.

pip install Scrapy
# OR
conda install -c conda-forge scrapy

Setting up the project

First, let’s set up our boilerplate project using this simple command.

scrapy startproject figurinesmaniac

Once the command completes, Scrapy will have generated a ready-to-use crawler project for you. What’s left is to tell our crawler where to find the data. The best way is still to use the browser’s DevTools inspector to figure out which tags, classes, or ids you need to look for. If you don’t know how to do this, I explain the process in detail in this article about Selenium.
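For reference, the generated project should look roughly like this (the exact files can vary slightly between Scrapy versions):

figurinesmaniac/
    scrapy.cfg            # deploy configuration
    figurinesmaniac/
        __init__.py
        items.py          # item definitions (optional)
        middlewares.py    # custom middlewares (optional)
        pipelines.py      # item pipelines (optional)
        settings.py       # project settings
        spiders/          # our spiders will live here
            __init__.py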

Creating the spider

We are looking for every product’s name, dimensions, and price. So our crawler needs to visit every product’s page and extract the information from it. Moreover, the products are spread across multiple pages, so we will also have to handle pagination.

To start, we need to create a spider for our crawler to use. The spider describes where to find the information we need and how to handle multiple pages and pagination. Inside the generated Scrapy project there is a folder called spiders; that is exactly where we are going to create products.py, which will be our spider. Here is an empty template for a basic spider, with the class name and name attribute already renamed for our project.

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"

    def start_requests(self):
        urls = ["https://example.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.url)

The first question is: where is our crawler going to start? Let’s make things easy and start directly from the all-products page. Set it inside urls as follows.

urls = ["https://www.figurines-maniac.com/toutes-les-figurines/"]

Next, we need to collect the URLs of every product currently visible on the page; the remaining ones live on the other pages.

product_links = response.xpath("//article/ul/li/a[1]")

Once we have every product link, and because they are “a” tags, we can directly tell our crawler to follow each of them and collect data from the page. But it doesn’t yet know what to scrape from those pages, so we need to add a function it can use for that.

def parse_product(self, response):
    product_name = response.xpath("//h2[contains(@class, 'product_title')]/text()").get()
    price = response.xpath("//bdi/text()").get()
    dimensions = response.xpath("//table[contains(@class, 'shop_attributes')]//tr[contains(@class, 'dimensions')]/td/p/text()").get()
    return {
        "product_name": product_name,
        "price": price,
        "dimensions": dimensions,
    }

I won’t go into detail on how to use XPath to extract data from an HTML structure; as stated above, you can find more information about this in my article about using Selenium in Python.
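By the way, if you want to try these XPath expressions before wiring them into the spider, Scrapy ships with an interactive shell. A quick session could look like this:

scrapy shell "https://www.figurines-maniac.com/toutes-les-figurines/"
# Inside the shell, response is already populated:
>>> response.xpath("//article/ul/li/a[1]")
>>> response.xpath("//h2[contains(@class, 'product_title')]/text()").get()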

With this function added to our spider class, we can tell our crawler how to extract data from every product page. Here is how we hook it into our parse function.

yield from response.follow_all(product_links, callback=self.parse_product)
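Put together, our parse method now looks like this (the pagination part comes next):

def parse(self, response):
    # Grab every visible product link on the listing page
    product_links = response.xpath("//article/ul/li/a[1]")
    # Visit each product page and scrape it with parse_product
    yield from response.follow_all(product_links, callback=self.parse_product)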

Simple and easy, isn’t it? We still need to handle pagination, but that task is pretty trivial too: we just add a few lines to our parse function to follow the next page. There are multiple approaches; my favorite is to look for the “next page” button and follow it, because the pagination numbers can sometimes be misleading. In our case, we check whether the “next page” link exists (using attrib.get so the last page simply returns None) and follow it, letting parse call itself recursively.

nav_next = response.xpath("//nav/ul/li/a[contains(@class, 'next')]").attrib.get("href")
if nav_next is not None:
    yield response.follow(nav_next, callback=self.parse)

Enjoy the data

That’s it for our crawler code. The last step is to launch the crawler from the CLI and collect that juicy data! Depending on the file extension you choose, Scrapy automatically writes the output in the matching format. Here I chose CSV, but we could just as well replace .csv with .json to get JSON output.

scrapy crawl products -O products.csv
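If you would rather have JSON, the same run simply becomes:

scrapy crawl products -O products.json

Note that -O overwrites the output file if it already exists, while the lowercase -o appends to it.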

If you enjoyed the article or found it useful, it would be kind of you to support me by following me here (Jonathan Mondaut). More articles are coming very soon! In the meantime, you can find the complete spider code below.

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"

    def start_requests(self):
        # Entry point: start from the all-products page
        urls = ["https://www.figurines-maniac.com/toutes-les-figurines/"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_product(self, response):
        # Extract the details we want from a single product page
        product_name = response.xpath("//h2[contains(@class, 'product_title')]/text()").get()
        price = response.xpath("//bdi/text()").get()
        dimensions = response.xpath("//table[contains(@class, 'shop_attributes')]//tr[contains(@class, 'dimensions')]/td/p/text()").get()
        return {
            "product_name": product_name,
            "price": price,
            "dimensions": dimensions,
        }

    def parse(self, response):
        # Follow every visible product link, then move on to the next page
        product_links = response.xpath("//article/ul/li/a[1]")
        if product_links:
            yield from response.follow_all(product_links, callback=self.parse_product)
        nav_next = response.xpath("//nav/ul/li/a[contains(@class, 'next')]").attrib.get("href")
        if nav_next is not None:
            yield response.follow(nav_next, callback=self.parse)
