XPath is a language for finding information in XML documents. Although it was originally designed for searching XML, it also works well on HTML documents.
So in Python crawlers, XPath parsing is a commonly used, efficient, and convenient way to extract information.
To use xpath, you need to install the lxml library
pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
First instantiate an etree object and load the page source code to be parsed into it. There are two ways:
1. Load the source code from a local HTML file into the etree object
etree.parse('filePath', etree.HTMLParser())  # filePath is the path of the file
Example:
from lxml import etree  # import the library
html = etree.parse('./test.html', etree.HTMLParser())  # ./test.html is the path of the local HTML file
html.xpath('xpath expression')
2. Load source code data obtained from the Internet into the etree object
etree.HTML('page_data')  # page_data is the source code data obtained from the page
Example:
from lxml import etree  # import the library
html = etree.HTML('page_data')  # page_data is the source code data obtained from the page
html.xpath('xpath expression')
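In real crawlers, the page source usually comes from an HTTP request. The following is a minimal sketch (assuming the requests library is installed; the URL is only a placeholder) of how the two steps fit together:
import requests
from lxml import etree

resp = requests.get('https://www.example.com')  # placeholder URL, replace with your target page
html = etree.HTML(resp.text)                    # resp.text is the page source data
print(html.xpath('//title/text()'))             # any xpath expression can be used from here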
When parsing data with XPath, the most important step is writing the XPath expression. The common XPath expressions are introduced below.
expression | meaning |
---|---|
nodename | Selects all child nodes of this node |
/ | Selects from the root node; a single / also separates one level from the next |
// | Selects descendant nodes of the current node |
. | Selects the current node |
@ | Selects an attribute |
text() | Gets the text |
* | Wildcard, matches any element node |
nodename[@attrib='value'] | Selects the elements whose given attribute has the given value. For example, div[@class='cell'] means all div elements whose class attribute is cell |
Below is a detailed explanation of the above expressions with examples.
First, here is the HTML code used for testing:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>测试</title>
</head>
<body>
<div class="big">
    <ul>
        <li><a href="https://www.baidu.com/">百度</a></li>
        <li><a href="https://weibo.com/">微博</a></li>
        <li><a href="https://www.tmall.com/">天猫</a></li>
        <p>test1</p>
    </ul>
    <div>
        <a id="aa" href="https://www.iqiyi.com/">爱奇艺</a>
        <a id="bb" href="https://v.qq.com/">腾讯视频</a>
        <p>test2</p>
    </div>
</div>
</body>
</html>
For convenience and intuitiveness, we test by reading this HTML file locally.
Let's first get a feel for the whole process of parsing a page with XPath. Code:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result1 = html.xpath('/html/body/div/ul/li/a')  # / expresses the hierarchy; the first / is the root node
print(result1)
Running result:
[<Element a at 0x28696f20ac0>, <Element a at 0x28696f20b00>, <Element a at 0x28696f20b40>]
As you can see, the result contains 3 nodes, which are exactly the first 3 a nodes from top to bottom. You can also see that / is used above to express the hierarchical relationship. The so-called hierarchy is simply one layer wrapped inside another: in the test HTML, the html node wraps the body node, so to get the body node we can write /html/body. By analogy, we can reach any node we want.
Since we can get the first 3 a nodes, how do we get all of the a nodes?
This is where // comes in. Code:
result2 = html.xpath('/html/body/div//a')
print(result2)
Running result:
[<Element a at 0x222bf9afc80>, <Element a at 0x222bf9afcc0>, <Element a at 0x222bf9afd00>, <Element a at 0x222bf9afd40>, <Element a at 0x222bf9afd80>]
It is not hard to see that the div node with class="big" encloses all of the a nodes; in other words, every a node is a descendant of that div. Therefore /html/body/div//a selects all of the a nodes.
For example, if we want to get the two p nodes in the test HTML, we can use the wildcard * to extract them. Code:
'''
The expression for the first p node is: '/html/body/div/ul/p'
The expression for the second p node is: '/html/body/div/div/p'
The only difference between the two expressions is the node in front of p (its parent): one is ul, the other is div.
So we simply replace the parent node in front of p with the wildcard *, because * can match any node.
'''
# html.xpath('/html/body/div/ul/p')
# html.xpath('/html/body/div/div/p')
result = html.xpath('/html/body/div/*/p')
print(result)
Running result:
[<Element p at 0x1e2c335f880>, <Element p at 0x1e2c335f8c0>]
Of course, if you just want these two p nodes you could also get them with //; the above is only to demonstrate the usage of *.
For example, if we want to get the first a node, we can locate it with an index. Code:
result3 = html.xpath('/html/body/div/ul/li[1]/a')  # li[1] means the first li node; note that the index starts from [1]
print(result3)
Running result:
[<Element a at 0x28705120ac0>]
It must be noted here that the index starts from [1], not [0]!
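As a side note not shown above, the standard XPath position predicates last() and position() also work in lxml; a small sketch against the same test HTML (the expected outputs are based on that file):
result = html.xpath('/html/body/div/ul/li[last()]/a/text()')        # the last li node, expected ['天猫']
print(result)
result = html.xpath('/html/body/div/ul/li[position()<3]/a/text()')  # the first two li nodes, expected ['百度', '微博']
print(result)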
For example, if we want to get the a node with id="aa", we can use attribute positioning: nodename[@attrib='value'].
code:
result4 = html.xpath('//a[@id="aa"]')
print(result4)
Running result:
[<Element a at 0x2718236fc00>]
In XPath, text() extracts the text information from the page.
Code:
result5 = html.xpath('/html/body/div/ul/li[1]/a/text()')  # text() gets the text
print(result5)
Running result:
['百度']
You can see that the result is a list. If we want the string inside it, we can do this:
result6 = html.xpath('/html/body/div/ul/li[1]/a/text()')[0]  # take the first element from the list
print(result6)
Running result:
百度
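Based on the same test HTML, text() can also be combined with the earlier level expressions to pull out several text nodes at once; this small sketch should return all three link texts:
result = html.xpath('/html/body/div/ul/li/a/text()')
print(result)  # expected: ['百度', '微博', '天猫']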
If we want to get an attribute value, for example the href of the a node with id="bb", we can use the @attribute expression.
Code:
result7 = html.xpath('//a[@id="bb"]/@href')
print(result7)
Running result:
['https://v.qq.com/']
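Similarly, @href can be combined with // to collect every link on the page into one list; a quick sketch against the test HTML:
result = html.xpath('//a/@href')
print(result)  # expected: the five href values in the test HTML, from baidu.com to v.qq.com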
When writing crawlers we often meet much more complex page code, and writing the XPath expression by hand gets harder. Here is a lazy trick: copy it directly from the browser.
Steps: press F12 in the browser to open the developer tools --> pick the required content with the arrow in the upper-left corner --> right-click the corresponding code --> Copy --> Copy XPath
The copied result:
//*[@id="js_top_news"]/div[2]/h2/a
In this way, we can easily get the XPath expression we need.
Of course, we cannot rely only on this copy method; we still need to truly understand the grammar rules of XPath expressions, because XPath parsing has its limits in Python crawlers and in some cases XPath expressions cannot be used at all.
If the webpage's data is loaded dynamically through Ajax, we cannot extract that information with XPath expressions.
A simple way to judge: right-click the webpage --> View page source --> Ctrl+F to search for the information you want --> if the search finds nothing, the data cannot be parsed with XPath.
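The same judgment can be made in code: fetch the raw source with requests and search for the target text in it. A minimal sketch, where both the URL and the keyword are placeholders:
import requests

resp = requests.get('https://www.example.com')  # placeholder URL
if 'target keyword' in resp.text:               # placeholder keyword you expect on the page
    print('Found in the page source - xpath parsing should work')
else:
    print('Not in the page source - the data is probably loaded via Ajax, xpath cannot see it')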
Sometimes, whether we copy the XPath expression directly or write it ourselves, the extraction returns an empty list. After checking the code repeatedly we find nothing wrong. What is going on?
Most likely you did not write the XPath expression against the web page source code, but against the code displayed in the developer tools. The developer tools show the real-time page code (for example, after some data has been loaded by JS), while the page source data we fetched may not be that real-time code.
In general we can write XPath expressions directly from the code shown in the developer tools, but it must be cross-checked against the page source, and the page source always prevails!
Say the important thing three times: the page source prevails! The page source prevails! The page source prevails!
Now let's move on to a practical crawler. Our goal is to crawl the meme images and their description text from DouTu.com, and to use the description text as the file name of each image.
# Import the necessary libraries
import requests
from lxml import etree
import time
import re
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Referer": "https://www.doutub.com/"  # Anti-leech: tells the server which page this request came from; see Figure 1 and Figure 2 below
}

num = int(input("How many pages do you want to crawl: "))
if not os.path.exists("images"):
    os.mkdir("images")  # Create the images folder if it does not exist

for n in range(num):  # Loop to extract multiple pages
    # url is built by string formatting, e.g. https://www.doutub.com/img_lists/new/1 (page 1),
    # https://www.doutub.com/img_lists/new/2 (page 2), and so on
    url = f'https://www.doutub.com/img_lists/new/{n+1}'
    resp = requests.get(url, headers=headers)
    html = etree.HTML(resp.text)
    divs = html.xpath("//div[@class='cell']")[0:50]  # divs is a list; the slice drops the useless 51st div, see Figure 3
    for div in divs:
        imgSrc = div.xpath("./a/img/@data-src")[0]
        word = div.xpath("./a/span/text()")[0].strip()
        name = re.sub(r'[\\:*?"<>/|]', '', word)  # strip \:*?"<>/| characters that are illegal in file names, see Figure 4
        img_type = imgSrc.split(".")[-1]  # some images are jpg and some are gif, so take the extension from the URL
        # Download the image
        img_resp = requests.get(imgSrc, headers=headers)
        with open("images/" + name + "." + img_type, mode="wb") as f:
            f.write(img_resp.content)
        print(name + "." + img_type, "downloaded")
        time.sleep(0.3)  # sleep 0.3 seconds so frequent requests do not get the IP banned
    print(f"\nPage {n+1} finished downloading!\n")
print("All downloads finished!!!")
Figure 1: the anti-leech problem
Figure 2: solving the anti-leech problem
Figure 3:
Figure 4:
final effect:
You're done!
Author: evilangel
Link: http://www.pythonblackhole.com/blog/article/80225/42908da2317a2ce008aa/
Source: python black hole net
Please indicate the source when reprinting in any form. Any infringement discovered will be pursued according to law.