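Python Web Scraping Basics: From Getting Started to Hands-On Practice

1. What Is a Web Crawler?

A web crawler is a program that automatically retrieves data from web pages. Common uses include data analysis, content aggregation, and price comparison.

2. Installing the Required Packages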

pip install requests beautifulsoup4

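3. The Basic Scraping Workflow

Send an HTTP request to fetch the page content
Parse the HTML and extract the data you need
Store or process the data

4. Hands-On Example: Scraping the Title of a Sample Page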

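import requests
from bs4 import BeautifulSoup

url = 'https://example.com/'
res = requests.get(url, timeout=10)  # set a timeout so the call cannot hang forever
res.raise_for_status()               # stop early on 4xx/5xx responses
soup = BeautifulSoup(res.text, 'html.parser')
title_tag = soup.find('title')       # None if the page has no <title>
title = title_tag.get_text(strip=True) if title_tag else '(no title)'
print('Page title:', title)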

5. Advanced Techniques

(1) Sending Browser-Like Headers

Some sites check the User-Agent header, so it is a good idea to send browser-like headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
res = requests.get(url, headers=headers)

(2) Scraping Multiple Pages in a Batch (for Loop and Pagination)

for page in range(1, 6):
    url = f'https://example.com/list?page={page}'
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # parse the content of each page here
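
The loop above leaves the per-page parsing as a placeholder. As a minimal sketch of one way to collect results across pages, the hypothetical helper `crawl_pages` below takes the fetching and parsing steps as functions, so the paging logic can be tried without touching the network. The names `crawl_pages`, `fake_fetch`, `fake_parse`, and `FAKE_SITE`, as well as the rule of stopping at the first empty page, are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List


def crawl_pages(
    fetch: Callable[[str], str],
    parse: Callable[[str], List[str]],
    pages: Iterable[int],
    base_url: str = 'https://example.com/list',
    delay: float = 0.0,
) -> List[str]:
    """Fetch each page in order and return all parsed items in one list."""
    results: List[str] = []
    for page in pages:
        url = f'{base_url}?page={page}'
        items = parse(fetch(url))
        if not items:
            # Assumption: an empty page means we ran past the last page.
            break
        results.extend(items)
        time.sleep(delay)  # pause between requests; 0 disables the pause
    return results


# Stand-ins for the real network and parsing steps, so the sketch runs offline.
FAKE_SITE = {
    'https://example.com/list?page=1': 'a,b',
    'https://example.com/list?page=2': 'c',
    'https://example.com/list?page=3': '',
}


def fake_fetch(url: str) -> str:
    return FAKE_SITE.get(url, '')


def fake_parse(html: str) -> List[str]:
    return [item for item in html.split(',') if item]


items = crawl_pages(fake_fetch, fake_parse, range(1, 6))
print(items)
```

In a real run, `fetch` would wrap `requests.get(url, headers=headers, timeout=10).text` and `parse` would wrap the BeautifulSoup extraction for that site.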

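(3) Parsing Tables, Lists, and Attributes

# Get every link; href=True skips <a> tags without an href, which would otherwise raise KeyError
for a in soup.find_all('a', href=True):
    print(a['href'])
# Parse a table row by row
for row in soup.select('table tr'):
    cols = [td.get_text(strip=True) for td in row.find_all('td')]
    if cols:  # header rows that contain only <th> cells yield an empty list
        print(cols)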
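
The `href` values collected this way are often relative paths rather than full URLs. A small sketch of resolving them with the standard library's `urllib.parse.urljoin` follows; the page URL and the sample `href` values are made up for illustration.

```python
from urllib.parse import urljoin

# Assumed URL of the page the links were found on.
page_url = 'https://example.com/list?page=1'

# Example href values as they might appear in <a> tags (hypothetical).
hrefs = ['/item/1', 'detail.html', 'https://other.example/x']

# urljoin resolves each href against the page URL;
# hrefs that are already absolute pass through unchanged.
absolute_links = [urljoin(page_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```

Resolving links before storing them keeps the saved data usable outside the page it came from.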

(4) Handling Chinese Encoding and Exceptions

try:
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    res.encoding = 'utf-8'  # or res.apparent_encoding; set it before reading res.text
except requests.RequestException as e:
    print('Error:', e)
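
When a site's encoding is not known in advance, one option is to decode the raw bytes (`res.content` in requests) yourself and try several candidates in turn. The sketch below is one possible approach rather than a standard recipe: the helper name `decode_with_fallback` and the candidate order (UTF-8 first, then Big5, a legacy Traditional Chinese encoding) are assumptions, and the byte strings stand in for real response bodies.

```python
def decode_with_fallback(raw: bytes, encodings=('utf-8', 'big5')) -> str:
    """Try each encoding in order and return the first clean decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8, replacing undecodable bytes with U+FFFD.
    return raw.decode('utf-8', errors='replace')


# Simulated response bodies (stand-ins for res.content).
utf8_body = '中文'.encode('utf-8')
big5_body = '中文'.encode('big5')

print(decode_with_fallback(utf8_body))
print(decode_with_fallback(big5_body))
```

Order matters here: strict encodings such as UTF-8 should come first, because more permissive ones may decode the wrong bytes without raising an error.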

(5) Scraping Dynamic Pages (Selenium in Practice)

Some pages only render their content after JavaScript runs. Selenium can handle these:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com/')
    print(driver.title)
    # Collect every element matching the CSS selector
    elems = driver.find_elements(By.CSS_SELECTOR, 'div.item')
    for elem in elems:
        print(elem.text)
finally:
    driver.quit()  # always close the browser, even if an error occurs

(6) Dealing with Anti-Scraping Measures and Common Blocks

Add random delays (time.sleep + random)
Use proxy servers (the proxies parameter)
Check the site's robots.txt and API limits
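
The first and third points above can be sketched with the standard library alone. The example below checks URLs against robots.txt rules with `urllib.robotparser` and pauses for a random interval between requests. The robots.txt content, the crawler name `MyCrawler`, and the helper `polite_delay` are all hypothetical; a real crawler would download the rules from the target site.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = 'MyCrawler'  # assumed crawler name

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def polite_delay(min_seconds: float = 1.0, max_seconds: float = 3.0) -> float:
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    seconds = random.uniform(min_seconds, max_seconds)
    time.sleep(seconds)
    return seconds


for url in ['https://example.com/list?page=1', 'https://example.com/private/a']:
    if robots.can_fetch(USER_AGENT, url):
        print('allowed:', url)
        polite_delay(0.0, 0.01)  # tiny range so the demo finishes quickly
    else:
        print('skipped (disallowed by robots.txt):', url)
```

For the proxy point, `requests.get` accepts a `proxies` dictionary that maps a scheme such as `'http'` or `'https'` to the proxy URL.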

(7) Storing Data (CSV, JSON, Databases)

import csv
import json

# Save as CSV
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'link'])
    writer.writerow(['test', 'https://example.com'])
# Save as JSON; ensure_ascii=False writes non-ASCII characters as-is instead of \u escapes
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump({'title': 'test'}, f, ensure_ascii=False)
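
The heading also mentions databases, which the code above does not cover. As a minimal sketch, assuming SQLite is an acceptable choice, the standard library's `sqlite3` module can store the same title and link pairs. The table name `pages`, the helper `save_rows`, and the sample rows are hypothetical, and the in-memory database is used only so the example leaves no file behind.

```python
import sqlite3

# Sample records in the same (title, link) shape as the CSV example.
rows = [
    ('test', 'https://example.com'),
    ('test 2', 'https://example.com/2'),
]


def save_rows(conn: sqlite3.Connection, rows) -> None:
    """Create the table if needed and insert rows, skipping links already stored."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pages (title TEXT, link TEXT UNIQUE)'
    )
    # Parameter placeholders (?) let sqlite3 handle quoting safely.
    conn.executemany(
        'INSERT OR IGNORE INTO pages (title, link) VALUES (?, ?)', rows
    )
    conn.commit()


conn = sqlite3.connect(':memory:')  # use a file path such as 'data.db' to persist
save_rows(conn, rows)
save_rows(conn, rows)  # a second run does not create duplicates

count = conn.execute('SELECT COUNT(*) FROM pages').fetchone()[0]
print(count)
```

The `UNIQUE` constraint combined with `INSERT OR IGNORE` is one simple way to make repeated crawls safe to re-run.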

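(8) Project Structure and Modularization

Split the crawler's main program, data parsing, and data storage into separate modules
Use logging and a config file to manage output and settings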
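
A single-file sketch of this separation is shown below. It keeps settings in a small config object, uses `logging` instead of `print`, and passes the fetch, parse, and save stages in as functions. All names here (`CrawlerConfig`, `build_urls`, `run`, and the fake stages) are assumptions for illustration; in a real project each stage would likely live in its own module.

```python
import logging
from dataclasses import dataclass
from typing import List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('crawler')


@dataclass(frozen=True)
class CrawlerConfig:
    """Settings kept in one place instead of being scattered through the code."""
    base_url: str = 'https://example.com/list'
    max_pages: int = 5
    timeout: float = 10.0


def build_urls(config: CrawlerConfig) -> List[str]:
    return [f'{config.base_url}?page={page}'
            for page in range(1, config.max_pages + 1)]


def run(config: CrawlerConfig, fetch, parse, save) -> int:
    """Wire the three stages together and return the number of saved records."""
    saved = 0
    for url in build_urls(config):
        try:
            html = fetch(url, config.timeout)
        except Exception:
            # Log the traceback and move on to the next page.
            logger.exception('Fetch failed: %s', url)
            continue
        records = parse(html)
        save(records)
        saved += len(records)
        logger.info('Saved %d records from %s', len(records), url)
    return saved


# Fake stages so the sketch runs offline; page 2 simulates a network failure.
store: List[str] = []


def fake_fetch(url: str, timeout: float) -> str:
    if url.endswith('page=2'):
        raise TimeoutError('simulated timeout')
    return url


def fake_parse(html: str) -> List[str]:
    return [html.upper()]


total = run(CrawlerConfig(max_pages=3), fake_fetch, fake_parse, store.extend)
print(total, store)
```

Because the stages are passed in as arguments, each one can be replaced or tested on its own without changing `run`.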

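6. Legal and Ethical Considerations

Only scrape data that is public and that you are permitted to collect
Respect the site's robots.txt and API terms
Do not use scraping for illegal purposes or send so many requests that you disrupt the site's operation

This article is meant to help Python beginners get started with web scraping quickly. From here, you can move on to Selenium, Scrapy, API-based data retrieval, and distributed crawlers. Feel free to share your own scraping experience in the comments!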
