1.1 What is a crawler
A crawler (also called a spider or web crawler) is a program that sends requests to a website, obtains its resources, and then analyzes them and extracts useful data.
From a technical point of view, a crawler uses a program to simulate the behavior of a browser requesting a site, crawls the HTML code / JSON data / binary data (pictures, videos) the site returns to the local machine, and then extracts the data you need and stores it for later use.
1.2 Basic process of a crawler
Ways for users to obtain network data:
Method 1: the browser submits a request —> downloads the web page code —> renders it into a page
Method 2: simulate the browser sending a request (obtain the web page code) —> extract the useful data —> store it in a database or file
What the crawler does is method 2.
1 Initiate a request
Use an HTTP library to initiate a request to the target site, i.e. send a Request.
The Request contains the request header, request body, etc.
Limitation of the requests module: it cannot execute JS or CSS code.
2 Get the response content
If the server can respond normally, you will get a Response
Response contains: html, json, pictures, videos, etc.
3 Parse the content
Parsing HTML data: regular expressions (the re module), XPath (the main approach), Beautiful Soup, CSS selectors
Parsing JSON data: the json module
Parsing binary data: write it to a file in wb mode
4 Save the data
In a database (MySQL, MongoDB, Redis) or a file.
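Put together, the four steps above boil down to just a few lines. A minimal sketch (the URL and output file are only placeholders):

```python
import re
import requests  # third-party HTTP library: pip install requests

# 1. Initiate a request (example.com is a placeholder URL)
response = requests.get("https://example.com")

# 2. Get the response content
html = response.text

# 3. Parse the content (here: pull the <title> out with a regular expression)
match = re.search(r"<title>(.*?)</title>", html, re.S)
title = match.group(1) if match else ""

# 4. Save the data (a plain text file in this sketch)
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)
```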
1.3 HTTP protocol: request and response
Request: the user sends request information to the server (the socket server) through the browser (the socket client).
Response: the server receives the request, analyzes the request information sent by the user, and then returns the data (the returned data may contain links to other resources, such as pictures, JS, CSS, etc.).
ps: after the browser receives the Response, it parses the content and displays it to the user; a crawler program simulates the browser to send the request, receives the Response, and then extracts the useful data from it.
1.3.1 Request
(1) Request method
Common request methods: GET / POST
(2) Requested URL
URL (Uniform Resource Locator) identifies a unique resource on the Internet; for example, a picture, a file, or a video can each be uniquely determined by its URL.
(3) Request header
User-Agent: if the request header carries no User-Agent, the server may treat you as an illegitimate client;
Cookie: cookies are used to save login information.
Note: Generally, crawlers will add request headers.
Parameters that require attention in the request header:
Referer: where the request comes from (some large websites use the Referer header as an anti-hotlinking check, so crawlers should take care to simulate it)
User-Agent: identifies the visiting browser (add it, or you will be treated as a crawler)
Cookie: remember to carry it in the request header
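As a sketch, these headers can be passed to requests as a plain dict; all the values below are placeholders that you would copy from your browser's developer tools:

```python
import requests

# All header values below are placeholders; copy real ones from the browser's dev tools.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # identify as a browser
    "Referer": "https://example.com/",                          # where the request "comes from"
    "Cookie": "sessionid=xxxx",                                 # login / session information
}

response = requests.get("https://example.com/page", headers=headers)
print(response.status_code)
```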
(4) Request body
For the GET method the request body has no content (the parameters of a GET request are placed after the URL and can be seen directly); for the POST method the request body is form data.
ps: 1. For login forms, file uploads, etc., the information is attached to the request body. 2. To inspect a login POST, enter a wrong username and password and submit; you can then see the POST request. After a successful login the page usually redirects, so the POST cannot be captured.
1.3.2 Response
(1) Response status code
200: success
301: redirect
404: the file does not exist
403: access forbidden
502: server error
(2) Response header
The parameter to watch in the response header is Set-Cookie (e.g. Set-Cookie: BDSVRTM=0; path=/): there may be more than one, and it tells the browser to save the cookie.
(3) Preview shows the response body: the web page source code (HTML), JSON data, pictures, other binary data, etc.
2. Basic modules
2.1 requests
requests is a simple and easy-to-use HTTP library implemented in Python, a big step up from the built-in urllib.
Open source address:
https://github.com/pydmy…
Chinese API:
http://docs.python-requests.o…
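A minimal sketch of the response object that requests returns (the URL is a placeholder):

```python
import requests

response = requests.get("https://example.com")

print(response.status_code)   # HTTP status code, e.g. 200
print(response.headers)       # response headers
print(response.encoding)      # encoding guessed from the headers
print(response.text)          # body decoded to str (HTML, JSON text, ...)
print(response.content[:20])  # raw bytes (useful for pictures/videos)
```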
2.2 re regular expressions
Regular expressions are used in Python using the built-in re module.
Disadvantages: data extraction is brittle and the workload is heavy.
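A small sketch of extracting data from HTML with the re module (the HTML snippet is made up for illustration):

```python
import re

html = '<div class="price">128</div><div class="price">256</div>'

# findall returns every capture-group match: here, the digits inside each div
prices = re.findall(r'<div class="price">(\d+)</div>', html)
print(prices)  # ['128', '256']
```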
2.3 XPath
XPath (XML Path Language) is a language for finding information in XML documents, and can be used to traverse elements and attributes in XML documents.
In Python, the lxml library is mainly used for XPath extraction (inside the Scrapy framework, lxml is not needed because XPath can be used directly).
lxml is an HTML/XML parser, the main function is how to parse and extract HTML/XML data.
Like the re module, lxml is implemented in C, making it a high-performance Python HTML/XML parser. With the XPath syntax covered earlier, we can quickly locate specific elements and node information.
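A short sketch of parsing HTML with lxml and XPath (the HTML snippet is made up for illustration):

```python
from lxml import etree  # pip install lxml

html = """
<ul>
  <li class="item"><a href="/a">First</a></li>
  <li class="item"><a href="/b">Second</a></li>
</ul>
"""

tree = etree.HTML(html)                               # parse the (possibly broken) HTML
titles = tree.xpath('//li[@class="item"]/a/text()')   # text of every matching <a>
links = tree.xpath('//li[@class="item"]/a/@href')     # href attribute values
print(titles)  # ['First', 'Second']
print(links)   # ['/a', '/b']
```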
2.4 BeautifulSoup
Like lxml, Beautiful Soup is also an HTML/XML parser, and its main function is how to parse and extract HTML/XML data.
Using Beautiful Soup requires importing the bs4 library.
Disadvantages: slower than regular expressions and XPath.
Advantages: easy to use.
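A short sketch of the same idea with Beautiful Soup (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '<div id="post"><h1>Title</h1><p class="body">Hello</p></div>'

soup = BeautifulSoup(html, "lxml")         # "html.parser" also works if lxml is absent
print(soup.h1.text)                        # Title
print(soup.find("p", class_="body").text)  # Hello
```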
2.5 JSON
JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for people to read and write and also easy for machines to parse and generate. It is well suited to data interchange scenarios, such as the exchange between a website's front end and back end.
In Python, the json module is mainly used to process JSON data. Online JSON parser:
https://www.sojson.com/simple…
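A minimal sketch of converting between JSON text and Python objects with the json module:

```python
import json

text = '{"name": "spider", "pages": 10}'

data = json.loads(text)                        # JSON text -> Python dict
print(data["name"], data["pages"])

dumped = json.dumps(data, ensure_ascii=False)  # Python dict -> JSON text
print(dumped)
```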
2.6 threading
Use the threading module to create threads: inherit directly from threading.Thread, then override the __init__ method and the run method.
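A minimal sketch of that pattern:

```python
import threading

class CrawlThread(threading.Thread):
    def __init__(self, name):
        super().__init__()          # remember to initialize the parent class
        self.name = name

    def run(self):                  # the thread's work goes in run()
        print(f"{self.name} is crawling...")

threads = [CrawlThread(f"thread-{i}") for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()                        # wait for every thread to finish
```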
3. Method examples
3.1 GET method example
demo_get.py
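The demo files themselves are not reproduced in this post; a plausible sketch of demo_get.py (the search URL and parameters are only examples):

```python
import requests

# Query parameters are appended to the URL: .../s?wd=python
params = {"wd": "python"}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get("https://www.baidu.com/s", params=params, headers=headers)
response.encoding = "utf-8"
print(response.url)           # the final URL including the query string
print(response.status_code)
```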
3.2 POST method example
demo_post.py
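A plausible sketch of demo_post.py (the login URL and form fields are placeholders):

```python
import requests

# Form data goes into the request body, not the URL
data = {"username": "test", "password": "123456"}  # placeholder credentials
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.post("https://example.com/login", data=data, headers=headers)
print(response.status_code)
print(response.text)
```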
3.3 Add proxy
demo_proxies.py
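A plausible sketch of demo_proxies.py (the proxy address is a placeholder; substitute a working proxy):

```python
import requests

# Replace with a working proxy; the address below is only a placeholder
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)
```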
3.4 Fetching Ajax data example
demo_ajax.py
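A plausible sketch of demo_ajax.py: Ajax endpoints usually return JSON, and the URL is found in the browser's Network panel (the one below is a placeholder):

```python
import requests

# Placeholder Ajax endpoint; find the real one under the XHR filter in dev tools
url = "https://example.com/api/list?page=1"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(url, headers=headers)
data = response.json()        # parse the JSON body directly into Python objects
print(data)
```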
3.5 Multithreading example
demo_thread.py
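A plausible sketch of demo_thread.py, fetching several pages in parallel (the URLs are placeholders):

```python
import threading
import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 6)]  # placeholder URLs

def fetch(url):
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code, len(resp.text))

threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```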
4. Crawler framework
4.1 Scrapy framework
Scrapy is an application framework written in pure Python to crawl website data and extract structured data. It has a wide range of uses.
Scrapy uses the Twisted asynchronous networking framework to handle network communication, which speeds up downloads without us having to implement an asynchronous framework ourselves, and it includes various middleware interfaces that can flexibly fulfill all kinds of needs.
4.2 Scrapy architecture diagram
4.3 Scrapy main components
Scrapy Engine (engine): Responsible for communication, signal, data transmission, etc. among Spider, ItemPipeline, Downloader, and Scheduler.
Scheduler (scheduler): It is responsible for accepting the Request sent by the engine, sorting it in a certain way, entering the queue, and returning it to the engine when the engine needs it.
Downloader (downloader): responsible for downloading all the Requests sent by the Scrapy Engine (engine), and returning the obtained Responses to the Scrapy Engine (engine), and the engine is handed over to the Spider for processing.
Spider (crawler): responsible for processing all Responses, analyzing and extracting data from them, obtaining the data needed by the Item fields, and submitting URLs that need to be followed up to the engine so they enter the Scheduler (scheduler) again.
Item Pipeline (pipeline): It is responsible for processing the items obtained in the Spider and performing post-processing (detailed analysis, filtering, storage, etc.).
Downloader Middlewares (download middleware): You can think of it as a component that can customize and extend the download function.
Spider Middlewares (Spider middleware): you can think of it as a functional component for customizing and extending the communication between the engine and the Spider (such as Responses entering the Spider and Requests going out from the Spider).
4.4 Operation process of Scrapy
Engine: Hi! Spider, which site are you dealing with?
Spider: Boss wants me to handle xxxx.com.
Engine: Give me the first URL that needs to be processed.
Spider: Here you are, the first URL is xxxxxxx.com.
Engine: Hi! Scheduler, I have a request to ask you to sort it into the queue for me.
Scheduler: OK, processing. Please wait.
Engine: Hi! Scheduler, give me your processed request.
Scheduler: Here you are, this is the request I have processed.
Engine: Hi! Downloader, please help me download this request according to the boss's download middleware settings
Downloader: OK! Here you go, here's the downloaded stuff. (If it fails: sorry, the download of this request failed. Then the engine tells the scheduler that the download of this request failed, please record it, we will download it later)
Engine: Hi! Spider, this is something that has been downloaded, and it has been processed according to the download middleware of the boss. You can handle it yourself (note! The responses here are handled by the def parse() function by default)
Spider: (after processing the data, for the URLs that need following up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I got.
Engine: Hi! Pipeline, I have an Item here, please help me deal with it! Scheduler! Here is a URL that needs to be followed up, please help me deal with it. Then the cycle starts again from step four, until all the information the boss needs has been obtained.
Pipeline and Scheduler: OK, doing it now!
4.5 Making a Scrapy crawler in 4 steps
1 Create a new crawler project: scrapy startproject mySpider
2 Define the target (write items.py): open items.py in the mySpider directory
3 Create a spider (spiders/xxspider.py): scrapy genspider gushi365 "gushi365.com"
4 Store the content (pipelines.py): design a pipeline to store the crawled content
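After step 3, scrapy genspider generates a spider skeleton. Filled in, it might look roughly like the sketch below (the XPath is only an example and depends on the target page):

```python
import scrapy

class Gushi365Spider(scrapy.Spider):
    name = "gushi365"
    allowed_domains = ["gushi365.com"]
    start_urls = ["https://www.gushi365.com/"]

    def parse(self, response):
        # Extract data with XPath/CSS selectors and yield Items or plain dicts;
        # the selector here is illustrative only.
        for title in response.xpath("//h2/a/text()").getall():
            yield {"title": title}
```

Run it with: scrapy crawl gushi365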
5. Common tools
5.1 Fiddler
Fiddler is a packet-capture tool, mainly used for capturing packets from mobile phones.
5.2 XPath Helper
The XPath Helper plugin is a free Chrome web page analysis tool for crawlers. It helps solve problems such as failing to locate elements when working out an XPath expression.
Installation and use of the Google Chrome plug-in xpath helper:
https://jingyan.baidu.com/art…
6. Distributed crawlers
6.1 scrapy-redis
Scrapy-redis provides a set of Redis-based components (pip install scrapy-redis) to make distributed crawling with Scrapy more convenient.
6.2 Distributed strategy
Master side (core server): runs a Redis database and is not responsible for crawling itself; it handles only URL fingerprint deduplication, Request distribution, and data storage.
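To switch an existing Scrapy project over to scrapy-redis, a few settings are usually added to settings.py; a sketch (the Redis address is a placeholder and should point at the Master's Redis database):

```python
# settings.py (typical scrapy-redis switches; adjust to your project)

# Use scrapy-redis's scheduler and duplicate filter so all workers share one queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue in Redis between runs
SCHEDULER_PERSIST = True

# Where the shared Redis database (the Master side) lives
REDIS_URL = "redis://127.0.0.1:6379"

# Optionally push scraped items back into Redis as well
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}
```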