
How to Use a Crawler (Python Edition)

posted on 2023-06-06 12:50


A crawler (also known as a web crawler or web spider) is a software system that automatically visits websites, and it is often used to collect information from them. Crawlers can be used to discover new pages automatically when a website is updated, or when a search engine's index of the site needs to be refreshed.

The workflow of a crawler is usually as follows:

  1. Starting from a web page, the crawler parses the HTML code of the page and finds the links in it.

  2. The crawler continues to visit these links and parses the HTML code of the new pages to find more links.

  3. This process is repeated until the crawler has crawled the entire site, or until a termination condition is reached.
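
A minimal sketch of this loop, using the requests and Beautiful Soup libraries introduced below (the start URL and the 100-page limit are placeholders, not taken from any particular site):

    import requests
    from bs4 import BeautifulSoup

    to_visit = ["http://www.example.com"]   # seed page
    visited = set()

    while to_visit and len(visited) < 100:  # stop after 100 pages as the termination condition
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href.startswith('http') and href not in visited:
                to_visit.append(href)

A real crawler would usually also restrict the links to the same site and handle relative URLs.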

Here is a simple tutorial on writing a crawler in Python:

  1. Install Python and crawler libraries.

To write a crawler in Python, you first need to install a Python interpreter. You can download the installer from the official Python website, or install it with the package manager that comes with your system.

Next, you need to install a crawler library. One of the most commonly used libraries is Beautiful Soup, which makes it easy to parse HTML and XML documents. Beautiful Soup can be installed with the following command:

pip install beautifulsoup4
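
The examples below also use the requests library to download pages. It is not part of the standard library, so it is worth installing it at the same time:

pip install requests
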
  2. Import the library.

Before using a crawler library in Python code, you need to import it first. When using Beautiful Soup, the library can be imported with the following code:

from bs4 import BeautifulSoup
  1. Get the HTML code.

The HTML code of the web pages that the crawler needs to crawl is stored on the web server. You can use Python's requests library to send an HTTP request to get the HTML code of a web page.

The sample code is as follows:

    import requests

    URL = "http://www.example.com"
    page = requests.get(URL)
    html_code = page.text
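
It is also worth checking whether the request succeeded before using the response; a small, optional addition to the snippet above:

    if page.status_code == 200:   # 200 means the request succeeded
        html_code = page.text
    else:
        print("Request failed with status code", page.status_code)
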
  4. Parse the HTML code.

Parsing HTML code with Beautiful Soup makes it easy to extract information.

The sample code is as follows:

    soup = BeautifulSoup(html_code, 'html.parser')

    # Get all the links
    links = soup.find_all('a')
    for link in links:
        print(link.get('href'))

    # Get all the paragraphs
    paragraphs = soup.find_all('p')
    for paragraph in paragraphs:
        print(paragraph.text)

  5. Save the information.

The crawled information can be stored in a file or database for later use.

The sample code is as follows:

    # Store the information in a file
    with open('info.txt', 'w') as f:
        for paragraph in paragraphs:
            f.write(paragraph.text + '\n')

    # Store the information in a database
    import sqlite3
    conn = sqlite3.connect('info.db')
    cursor = conn.cursor()
    # Create the table
    cursor.execute('''CREATE TABLE IF NOT EXISTS info (id INTEGER PRIMARY KEY, text TEXT)''')
    # Insert the data
    for i, paragraph in enumerate(paragraphs):
        cursor.execute('''INSERT INTO info (id, text) VALUES (?, ?)''', (i, paragraph.text))
    conn.commit()
    conn.close()
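
When the information is needed later, it can be read back from the same database; a minimal sketch:

    import sqlite3

    conn = sqlite3.connect('info.db')
    cursor = conn.cursor()
    cursor.execute('SELECT id, text FROM info')
    for row in cursor.fetchall():   # each row is an (id, text) tuple
        print(row)
    conn.close()
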
  6. Handle exceptions.

While crawling, various exceptional situations may occur, such as network connection errors, pages that do not exist, and server errors. These exceptions should be handled in the crawler code to keep the crawler stable.

The sample code is as follows:

    import requests

    URL = "http://www.example.com"
    try:
        page = requests.get(URL)
        html_code = page.text
    except requests.exceptions.RequestException as e:
        print(e)
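
A detail worth adding is a timeout, so that a hung connection also becomes a catchable exception instead of blocking forever:

    page = requests.get(URL, timeout=10)   # raises an exception if the server does not respond within 10 seconds
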
  7. Limit the crawl rate.

A crawler's request rate can put pressure on the target website. To avoid taking up too many of the site's resources, you can add a delay in the crawler code to limit the crawl rate.

The sample code is as follows:

    import time

    time.sleep(1)  # pause for one second
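
In a real crawler the delay usually sits inside the fetch loop. A minimal sketch (the URL list is just a placeholder):

    import time
    import requests

    urls = ["http://www.example.com/1", "http://www.example.com/2"]  # placeholder URLs
    for url in urls:
        page = requests.get(url)
        # ... process page.text here ...
        time.sleep(1)   # wait one second between requests to limit the crawl rate
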
  8. Handle anti-crawler mechanisms.

Some websites use anti-crawler mechanisms to block crawlers. These mechanisms include banning the crawler's IP address, using CAPTCHAs, checking the User-Agent header, and so on. The site's terms of use should be respected, and these mechanisms should be taken into account in the crawler code.

The sample code is as follows:

    import requests

    URL = "http://www.example.com"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    page = requests.get(URL, headers=headers)
    html_code = page.text

  9. Crawl dynamic web pages.

The content of many web pages is generated dynamically with JavaScript; such pages are called dynamic web pages. A crawler cannot obtain their rendered HTML directly and needs special techniques to crawl them.

Commonly used approaches are to inspect the page in the browser's developer tools to see how it loads its data, or to call the API endpoints that the page itself uses to obtain the data.
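
As a rough sketch of the second approach: if the developer tools show that the page loads its data from a JSON endpoint (the endpoint below is hypothetical), requests can call it directly:

    import requests

    # Hypothetical JSON endpoint discovered in the browser's developer tools
    API_URL = "http://www.example.com/api/articles?page=1"
    response = requests.get(API_URL)
    data = response.json()        # parse the JSON body into Python objects
    for item in data:             # assuming the endpoint returns a list of records
        print(item)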

  10. Encapsulate the crawler into a function.

When the crawler's code becomes complex, it can be encapsulated in a function. This makes the crawler convenient to call and lets the code be reused in different programs.

The sample code is as follows:

    import requests
    from bs4 import BeautifulSoup

    def crawl(url):
        """Crawl a web page and return the text of its paragraphs."""
        page = requests.get(url)
        html_code = page.text
        soup = BeautifulSoup(html_code, 'html.parser')
        paragraphs = soup.find_all('p')
        return [paragraph.text for paragraph in paragraphs]

The code to call the crawler function is as follows:

info = crawl("http://www.example.com")

A crawler is a powerful tool that can help us obtain large amounts of information. However, crawling must also follow network etiquette and must not be abused: the website's copyright and privacy rights should be respected, and unauthorized information should not be crawled.

When writing a crawler, there are some considerations and optimization methods to consider:

  • Crawling speed and efficiency can be improved with multi-threaded or distributed crawlers.

  • Comments should be added to the crawler code to facilitate future reading and maintenance.

  • During the crawling process, excessive pressure on the website should be avoided. You can set a delay, or use a proxy server.

  • When crawling a website, its robots.txt file should be obeyed. This file tells crawlers which pages they are not allowed to crawl (see the sketch after this list).

  • Crawlers can be implemented in many languages, not just Python. You can choose the appropriate language according to your needs and your own skills.
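
For the robots.txt point above, here is a minimal sketch using Python's built-in urllib.robotparser module (the URLs are the placeholders used throughout this article):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()   # download and parse the robots.txt file
    # can_fetch() reports whether the given user agent may crawl a URL
    print(rp.can_fetch("*", "http://www.example.com/some-page"))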

  11. Crawl AJAX web pages.

AJAX is a technique for updating web pages dynamically with JavaScript. Crawlers cannot directly obtain the content of AJAX pages either, and special methods are needed to crawl them.

As with dynamic web pages, the common approaches are to inspect the page in the browser's developer tools, or to call the API endpoints the page itself uses to obtain the data, as in the sketch above.

  12. Use a proxy server.

If a crawler sends requests too quickly, its IP address may be blocked by the website. To avoid this, a proxy server can be used: proxies let the crawler change its apparent IP address and avoid being blocked.

The code to use the proxy server is as follows:

    import requests

    URL = "http://www.example.com"
    proxies = {
        'http': 'http://192.168.0.1:8080',
        'https': 'https://192.168.0.1:8080'
    }
    page = requests.get(URL, proxies=proxies)

  13. Optimize crawlers with preprocessing libraries.

When the crawler's task is more complicated, data-processing libraries can be used to optimize it. These libraries help the crawler process large amounts of data efficiently and provide many convenient functions.

Commonly used preprocessing libraries include NumPy, pandas, etc.
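
For example, here is a minimal sketch of collecting the scraped paragraphs into a pandas DataFrame and saving them to a CSV file (pandas is installed separately, e.g. with pip install pandas; 'paragraphs' is the list extracted with Beautiful Soup earlier):

    import pandas as pd

    # 'paragraphs' is the list of tags extracted with Beautiful Soup earlier
    df = pd.DataFrame({'text': [p.text for p in paragraphs]})
    df.to_csv('info.csv', index=False)   # save the data for later analysis
    print(df.head())                     # quick look at the first few rows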

  14. Use caching to speed up crawlers.

During the crawling process, the crawler may visit the same web page multiple times. In order to reduce unnecessary network requests, a cache mechanism can be used to speed up crawlers.

The cache mechanism can store the crawled webpages locally so that they can be used directly the next time they are crawled. This can greatly reduce the number of network requests and improve the efficiency of crawlers.

The code to use the cache is as follows:

    import os
    import requests

    URL = "http://www.example.com"

    # Save a web page to a local file
    def save_page(url, filename):
        if not os.path.exists(filename):  # only download if the file does not exist yet
            page = requests.get(url)
            with open(filename, 'w') as f:
                f.write(page.text)

    # Read a locally cached web page
    def read_page(filename):
        with open(filename, 'r') as f:
            return f.read()

    # Use the cache: download on the first visit, read from disk afterwards
    def get_page(url):
        filename = url.split('/')[-1] + '.html'
        if not os.path.exists(filename):
            save_page(url, filename)
        return read_page(filename)

    html_code = get_page(URL)

  15. Use multithreading to speed up crawlers.

When crawling web pages, you can use multi-threading to speed up the crawler. Multi-threading allows crawlers to crawl multiple web pages at the same time, thereby improving efficiency.

The code using multithreading is as follows:

    import threading
    import requests

    URLs = [
        "http://www.example.com/1",
        "http://www.example.com/2",
        "http://www.example.com/3",
    ]

    # Crawl a single web page
    def crawl(url):
        page = requests.get(url)
        html_code = page.text

    # Crawl the pages in parallel, one thread per URL
    threads = []
    for url in URLs:
        t = threading.Thread(target=crawl, args=(url,))
        threads.append(t)
        t.start()

    # Wait for all threads to finish
    for t in threads:
        t.join()

Note: When using multithreading, you need to pay attention to thread safety issues.
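
For example, if several threads append their results to a shared list, a lock keeps the shared data consistent. A minimal sketch (crawl_safe is a hypothetical variant of the crawl function above):

    import threading
    import requests

    results = []              # shared list collected by all threads
    lock = threading.Lock()   # protects access to the shared list

    # Hypothetical thread-safe variant of the crawl() function above
    def crawl_safe(url):
        page = requests.get(url)
        with lock:            # only one thread appends at a time
            results.append(page.text)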

  16. Use distributed crawlers to speed up crawling.

When crawling a large amount of data, you can use a distributed crawler method to speed up crawling. Distributed crawlers can use multiple machines to crawl web pages at the same time, thereby greatly improving the efficiency of crawlers.

A sketch of a distributed crawler based on the ipyparallel library (IPython's parallel computing package) is as follows:

    # Requires the ipyparallel package and a running IPython parallel cluster
    from ipyparallel import Client

    # Connect to the local IPython parallel cluster
    rc = Client()

    # Crawl a single web page (requests is imported inside the function so it also works on remote engines)
    def crawl(url):
        import requests
        page = requests.get(url)
        return page.text

    # Crawl the URL list (defined in the multithreading example) in parallel across all engines
    result = rc[:].map_async(crawl, URLs)

    # Wait for the crawl to finish and collect the pages
    result.wait()
    pages = result.get()

Finally, I hope everyone will like, follow, or reward me.

Your encouragement is the greatest motivation for my writing!




Author: Fiee

Link: http://www.pythonblackhole.com/blog/article/80657/05d7816d4c468a7ee5c0/

Source: python black hole net

Please credit the source when reprinting in any form. Any infringement discovered will be pursued legally.
