Building a web scraper in Python using Scrapy is a powerful way to gather data from websites.
Scrapy is an open-source web scraping framework for Python that provides a fast and efficient way to extract structured data from websites. Here's a step-by-step guide on how to create a web scraper using Scrapy.
1. Install Scrapy
To get started, you need to install Scrapy. You can install it using pip:
pip install scrapy
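You can verify that the installation succeeded by checking the installed version:
scrapy version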
2. Create a New Scrapy Project
After installing Scrapy, create a new project to structure your scraper.
scrapy startproject myscraper
This will create a project folder called myscraper with the following structure:
myscraper/
    scrapy.cfg
    myscraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
3. Define the Item
In Scrapy, you define the structure of the data you want to scrape in the items.py file. This will represent a structured model for your scraped data.
Open myscraper/items.py and define an item:
import scrapy

class MyscraperItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()
Here, we’re defining an item with three fields: title, url, and description.
4. Create a Spider
Now, let's create a spider that will crawl the target website. Spiders are classes that define how a website should be scraped. You can create a spider inside the spiders/ directory.
As an example, let's create a simple spider that scrapes blog posts. Inside the spiders/ directory, create a new file called blog_spider.py.
cd myscraper/spiders
touch blog_spider.py
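Alternatively, Scrapy can generate a spider template for you with its genspider command (run from the project root; the domain argument pre-fills the spider's allowed_domains and start_urls):
scrapy genspider blogspider example.com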
Then, open blog_spider.py and add the following code:
import scrapy
from myscraper.items import MyscraperItem

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://example.com/blog']  # Replace with the website you want to scrape

    def parse(self, response):
        # Extract blog post data
        for post in response.css('div.post'):
            item = MyscraperItem()
            item['title'] = post.css('h2.title a::text').get()
            item['url'] = post.css('h2.title a::attr(href)').get()
            item['description'] = post.css('div.description::text').get()
            yield item

        # Follow pagination (if it exists)
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Explanation:
- name: A unique name for the spider.
- start_urls: A list of URLs to begin scraping from.
- parse: The method that is called to process the response from the server. It extracts the data you want and yields it as an Item (defined in items.py).
- response.css(): This is used to extract data using CSS selectors.
- response.follow(): This is used for pagination or following links to scrape multiple pages.
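A convenient way to test these selectors before running the full spider is the Scrapy shell, which fetches a page and opens an interactive session:
scrapy shell 'https://example.com/blog'
>>> response.css('div.post h2.title a::text').getall()
>>> response.css('a.next::attr(href)').get()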
5. Run the Spider
Now that we have our spider, we can run it and start scraping.
To run the spider, go back to the root directory of your project (myscraper/) and execute the following command:
scrapy crawl blogspider
This will start the spider and begin scraping the data from the start URL (https://example.com/blog in this case).
If you want to save the scraped data to a file (such as output.json), you can run:
scrapy crawl blogspider -o output.json
This will save the scraped data as a JSON file.
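Scrapy's feed exports support other formats as well; the format is inferred from the file extension:
scrapy crawl blogspider -o output.csv  # CSV
scrapy crawl blogspider -o output.jl   # JSON Lines (one item per line)
In recent Scrapy versions, -o appends to an existing file, while -O overwrites it.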
6. Handling Data Pipelines
Scrapy has built-in support for processing the data you scrape via pipelines. You can process the scraped data by enabling and configuring pipelines in the pipelines.py file.
Here's the skeleton of a simple pipeline; the process_item method is where you would clean each item or save it to a database or file.
In pipelines.py:
class MyscraperPipeline:
    def process_item(self, item, spider):
        # Process the scraped item, e.g., clean data or save it
        return item
In settings.py, enable your pipeline:
ITEM_PIPELINES = {
    'myscraper.pipelines.MyscraperPipeline': 1,
}
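As a slightly more concrete sketch, here is a hypothetical ValidateItemPipeline that drops items scraped without a title and trims whitespace from the description (the class name and rules are illustrative, not part of the generated project):
from scrapy.exceptions import DropItem

class ValidateItemPipeline:
    def process_item(self, item, spider):
        # Discard items that were scraped without a title
        if not item.get('title'):
            raise DropItem('Missing title')
        # Trim surrounding whitespace from the description, if present
        if item.get('description'):
            item['description'] = item['description'].strip()
        return item
It would be enabled the same way, e.g. 'myscraper.pipelines.ValidateItemPipeline': 100 in ITEM_PIPELINES; pipelines run in ascending order of their numbers.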
7. Handle User-Agent and Throttling (Optional)
To avoid being blocked by websites, you may need to set a User-Agent and adjust the crawling speed.
You can configure these settings in settings.py:
# User-Agent to mimic a real browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
# Enable auto-throttling to prevent overloading the website
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 3
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
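Two related settings worth knowing about are robots.txt compliance and a fixed delay between requests; both are standard Scrapy settings:
# Respect the target site's robots.txt rules (enabled by default in new projects)
ROBOTSTXT_OBEY = True
# Wait at least this many seconds between requests to the same site
DOWNLOAD_DELAY = 0.5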
8. Handling Errors and Logging
Scrapy automatically handles errors like timeouts and retries. However, you can configure it further in settings.py:
# Enable retry on failed requests
RETRY_ENABLED = True
RETRY_TIMES = 3 # Number of retries on failure
# Logging configuration (optional)
LOG_LEVEL = 'INFO' # Set to 'DEBUG' for more detailed logs
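If you want to react to individual request failures yourself, you can also attach an errback to a request. Here is a minimal sketch (handle_error is an illustrative name):
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'

    def start_requests(self):
        # Attach an errback so failed requests are handled explicitly
        yield scrapy.Request('https://example.com/blog',
                             callback=self.parse,
                             errback=self.handle_error)

    def handle_error(self, failure):
        # failure is a Twisted Failure describing what went wrong
        self.logger.error(repr(failure))

    def parse(self, response):
        # ... extract data as shown earlier ...
        pass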
9. Advanced Scrapy Features
Handling Cookies
Scrapy can handle cookies by default, but if you need to disable or manage cookies, you can adjust the settings:
COOKIES_ENABLED = False # Disable cookies
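Conversely, if you need to send specific cookies with a request, you can pass them per request from your spider (the cookie shown is just a placeholder):
yield scrapy.Request('https://example.com/blog',
                     cookies={'currency': 'USD'},
                     callback=self.parse)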
Scraping Dynamic Content
If the website loads content dynamically with JavaScript (using frameworks like React, Vue, etc.), Scrapy alone might not be enough. You can use Splash (a headless browser) with Scrapy to render JavaScript content.
pip install scrapy-splash
You can then integrate Scrapy-Splash to handle JavaScript rendering.
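As a rough sketch of what that integration looks like, assuming a Splash instance is running locally and the scrapy-splash middlewares and SPLASH_URL have been configured in settings.py as described in the scrapy-splash documentation:
import scrapy
from scrapy_splash import SplashRequest

class JsBlogSpider(scrapy.Spider):
    name = 'jsblogspider'

    def start_requests(self):
        # Ask Splash to render the page, waiting briefly for JavaScript to finish
        yield SplashRequest('https://example.com/blog', self.parse,
                            args={'wait': 2})

    def parse(self, response):
        # response now contains the JavaScript-rendered HTML
        for title in response.css('h2.title a::text').getall():
            yield {'title': title}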
10. Conclusion
You’ve now created a basic web scraper using Scrapy. Scrapy is a powerful framework for scraping websites at scale: it can crawl across multiple pages, retry failed requests, export data in several formats, and manage request scheduling for you, which makes it well suited to large scraping projects.
To further enhance your scraper, you can:
- Use XPath selectors instead of (or alongside) CSS selectors (see the short example after this list).
- Use middlewares to handle advanced functionality like rotating proxies or user agents.
- Set up crawling rules to follow links based on patterns.
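For example, the CSS selectors used in the blog spider above could be written with XPath instead (equivalent result, different selector syntax; this assumes the class attribute is exactly "title"):
# CSS version (from the spider above)
title = post.css('h2.title a::text').get()
# XPath equivalent
title = post.xpath('.//h2[@class="title"]/a/text()').get()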
Scrapy also provides rich documentation and an active developer community if you need help with advanced features or troubleshooting.