
ScrapyStartUpBestPractice

2023/11/22 scrapy

This blog mainly explains how to build your first Scrapy project.

Installation

Skip this part

First project

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

  • name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

  • parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

    The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them. A sketch of such a parse() method is shown right after this list.
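As an illustrative sketch (not the tutorial spider above), a parse() method that yields each quote as a dict and then follows the “Next” pagination link could look like this; the CSS selectors match the markup of quotes.toscrape.com:

def parse(self, response):
    # This method is meant to live inside a scrapy.Spider subclass.
    for quote in response.css("div.quote"):
        # Yield each quote on the page as a plain dict.
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
            "tags": quote.css("div.tags a.tag::text").getall(),
        }

    # Follow the "Next" link, if present, and parse it with the same callback.
    next_page = response.css("li.next a::attr(href)").get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)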

How to run our spider

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl quotes

Extracting data

The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell. Run:

scrapy shell 'https://quotes.toscrape.com/page/1/'
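Inside the shell, response is already populated, so you can try selectors interactively. A short illustrative session (the selectors assume the markup of quotes.toscrape.com; outputs are omitted except the page title):

>>> response.css("title::text").get()
'Quotes to Scrape'
>>> quote = response.css("div.quote")[0]        # first quote block on the page
>>> quote.css("span.text::text").get()          # the quote text
>>> quote.css("small.author::text").get()       # the author name
>>> quote.css("div.tags a.tag::text").getall()  # the list of tag strings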

Scrapy Course

Scrapy Item
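An Item declares the fields a spider is expected to fill in. As a minimal sketch (the class name QuoteItem and the field names text, author and tags are illustrative choices for the quotes site), items.py could contain:

import scrapy


class QuoteItem(scrapy.Item):
    # One scraped quote; field names are illustrative.
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()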

Pipeline

Pipelines clean and convert the scraped data, process it further, and store it (e.g. into a database).

Once written, these classes have to be activated in settings.py (via ITEM_PIPELINES).
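A minimal sketch (the class name CleanTextPipeline and the clean-up step are illustrative): the pipeline goes in pipelines.py and is then activated in settings.py.

# pipelines.py
class CleanTextPipeline:
    def process_item(self, item, spider):
        # Illustrative clean-up: strip surrounding whitespace from the quote text.
        if item.get("text"):
            item["text"] = item["text"].strip()
        return item


# settings.py -- the number sets the order in which pipelines run (lower runs first).
ITEM_PIPELINES = {
    "tutorial.pipelines.CleanTextPipeline": 300,
}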

Save data setting

FEEDS = {
    "xxx.json": {"format":"json"}
}

You can also override the settings from settings.py inside a spider class, via its custom_settings attribute.

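For example, a per-spider override could look like this (a sketch reusing the QuotesSpider above; the file name quotes.json is illustrative):

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    # custom_settings overrides the project-wide values from settings.py
    # for this spider only.
    custom_settings = {
        "FEEDS": {
            "quotes.json": {"format": "json"},
        },
    }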

Save the data into the SQL database


TO BE CONTINUED


User agent and headers

The user agent is a string that a browser sends to a web server to identify itself. This string often includes details about the browser, its version, and the operating system it’s running on. For example, a user agent string might look like “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36”. This helps the server understand what kind of browser is making the request, so it can tailor its response appropriately, such as formatting the web page for a specific browser. The user agent is just one of the request headers.

There are several ways to use different user agents in your spiders.

In settings.py, define USER_AGENT:

USER_AGENT = "..."


When creating new requests with response.follow() (or scrapy.Request) in your callback functions, you can set the User-Agent header explicitly to override the default user agent for that request.
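A minimal sketch (the URL and the user-agent string are placeholders):

# Inside a spider callback -- the per-request header wins over the USER_AGENT setting.
def parse(self, response):
    yield response.follow(
        "/page/2/",
        callback=self.parse,
        headers={"User-Agent": "Mozilla/5.0 (placeholder user-agent string)"},
    )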

In middlewares.py we can generate fake user agents and attach them to outgoing requests. You can use this website:

ScrapeOps - The DevOps Tool For Web Scraping. | ScrapeOps

It provides an API that returns fake User-Agent strings, which you can then attach to your requests.

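A rough sketch of such a downloader middleware (the class name and the hard-coded fallback list are my own; in practice the list would be filled from the ScrapeOps fake-user-agent API):

# middlewares.py -- illustrative sketch
import random


class FakeUserAgentMiddleware:
    # In practice this list would be fetched from the ScrapeOps API;
    # a small hard-coded list keeps the sketch self-contained.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen user agent to every outgoing request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None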

Once the middleware is defined, it needs to be enabled by adding it to DOWNLOADER_MIDDLEWARES in settings.py:

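For example (assuming the sketch above lives in tutorial/middlewares.py):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "tutorial.middlewares.FakeUserAgentMiddleware": 543,
}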