This blog post mainly covers how to build your first Scrapy project.
Installation
Skip this part.
First project
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider
and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders.

start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it. The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

How to run our spider
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl quotes
Extracting data
The best way to learn how to extract data with Scrapy is by trying selectors in the Scrapy shell. Run:
scrapy shell 'https://quotes.toscrape.com/page/1/'
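Inside the shell you can then experiment with CSS selectors against the loaded response. A quick sketch of what that might look like (the CSS classes follow the markup of quotes.toscrape.com):

>>> response.css("title::text").get()                       # the page title as a string
>>> response.css("div.quote span.text::text").getall()      # all quote texts on the page
>>> response.css("div.quote small.author::text").getall()   # all author names on the page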
Scrapy Course
Scrapy Item Pipeline
An item pipeline cleans and converts the scraped data, then processes it and stores it (for example, into a database). These pipeline classes must be activated in settings.py once the conversion logic is finished.
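A minimal sketch of such a pipeline is shown below; the class name and the "text" field are illustrative, not from the original post, and the item is assumed to be dict-like:

# pipelines.py
class CleanTextPipeline:
    """Example pipeline: strips surrounding whitespace from the scraped text field."""

    def process_item(self, item, spider):
        item["text"] = item["text"].strip()
        return item

It is then activated in settings.py; the number controls the order in which pipelines run (lower runs first):

ITEM_PIPELINES = {
    "tutorial.pipelines.CleanTextPipeline": 300,
}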
Save data setting
FEEDS = {
    "xxx.json": {"format": "json"},
}
You can also override the settings from settings.py inside the spider class itself.
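A sketch of a per-spider override using the custom_settings attribute (the output filename quotes.json is just an example):

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Per-spider settings that override the project-wide values in settings.py.
    custom_settings = {
        "FEEDS": {
            "quotes.json": {"format": "json"},
        },
    }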
Save the data into an SQL database
TO BE CONTINUED
User agent and headers
The user agent is a string that a browser sends to a web server to identify itself. It often includes details about the browser, its version, and the operating system it's running on. For example, a user agent string might look like "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36". This helps the server understand what kind of browser is making the request, so it can tailor its response appropriately, such as formatting the web page for a specific browser. The user agent is just one of the request headers.
Use a different user agent in the spiders
In settings.py, define the USER_AGENT setting:
USER_AGENT = "..."
In a response.follow call, you can pass custom headers from your callback to override the user agent for that single request.
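A minimal sketch of such an override inside a callback; the target path and the user-agent string are placeholders:

def parse(self, response):
    # Override the User-Agent header for this particular request only.
    yield response.follow(
        "/page/2/",
        callback=self.parse,
        headers={"User-Agent": "my-custom-agent/1.0"},
    )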
In middlewares.py we can generate a fake user agent and attach it to each outgoing request. You can use ScrapeOps (The DevOps Tool For Web Scraping) and its API to generate fake user agents.
Once the middleware is defined, it needs to be imported (enabled) in the settings.
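A minimal sketch of such a downloader middleware; the class name and the hard-coded user-agent strings are illustrative, and in practice the list could be populated from the ScrapeOps fake user-agent API instead:

# middlewares.py
import random


class FakeUserAgentMiddleware:
    # Illustrative list; could be fetched from the ScrapeOps API instead.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36",
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen user agent to every outgoing request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None

It is then enabled in settings.py so Scrapy loads it for every request:

DOWNLOADER_MIDDLEWARES = {
    "tutorial.middlewares.FakeUserAgentMiddleware": 400,
}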