Notes on Building a Simple Web Crawler with Python

2017-07-23

Introduction

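I have recently been helping to port a crawler to Python, and along the way I used BeautifulSoup, a handy package that makes it easy to locate nodes and content in downloaded HTML and pull out the data you want to scrape. It is convenient enough that I decided to write these notes.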

Step 1: Install Python 3

On Windows, download the .exe installer from the official Python website. On Ubuntu, install it from a terminal with apt-get:

sudo apt-get install python3
sudo apt-get install python3-pip

Step 2: Install BeautifulSoup4 and Selenium

With Python installed, we can use pip to install the BeautifulSoup package:

pip install bs4

Selenium is a handy tool that is often used to simulate user actions when writing tests. Here we borrow it to fetch web pages:

pip install selenium

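Next, download a browser driver (ChromeDriver is used here). After unzipping it you will find a single executable. Either add its location to your PATH or put it in the same directory as your script, so that the program can find it later.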
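
To check that the driver can actually be found before launching a browser, something like the following sketch may help. The function name find_driver and its parameters are made up for this illustration; it only uses the standard library, looks next to the script first, and then falls back to PATH.

```python
import os
import shutil


def find_driver(name="chromedriver", local_dir="."):
    """Return a path to the driver executable, or None if it cannot be found."""
    # 1) Look in the given directory first (for example, next to the script).
    for candidate in (name, name + ".exe"):
        local_path = os.path.join(local_dir, candidate)
        if os.path.isfile(local_path):
            return os.path.abspath(local_path)
    # 2) Fall back to searching the directories listed in PATH.
    return shutil.which(name)


if __name__ == "__main__":
    print(find_driver() or "chromedriver not found: check PATH or the script directory")
```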

Step 3: Create crawler.py and fetch page content with Selenium

I picked a simple sample HTML page (Example) for this. Here is the content of crawler.py:

from bs4 import BeautifulSoup
from selenium import webdriver

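def start(link):
    # Use Selenium to load the page the way a real browser would
    # executable_path is the (relative) path to your chromedriver
    # (newer Selenium 4 releases take a Service object instead)
    driver = webdriver.Chrome(executable_path="./chromedriver")
    driver.get(link)
    print(driver.page_source)
    # Close the browser when done
    driver.quit()


url = 'http://example.com/'
start(url)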

Once the file is saved, run it from a terminal:

python crawler.py
# or
python3 crawler.py

You will then see the downloaded HTML printed out.

[Screenshot: the downloaded HTML]

Step 4: Parse the HTML with BeautifulSoup4

First, we add a small helper called innerHTML that converts the contents of a node into a string:

def innerHTML(element):
    return element.decode_contents(formatter="html")
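
To see what this helper returns compared with other ways of reading a node, here is a small offline sketch. The HTML snippet is made up for the illustration and no browser is needed.

```python
from bs4 import BeautifulSoup


def innerHTML(element):
    # Same helper as above: serialize the children of a node, not the node itself.
    return element.decode_contents(formatter="html")


# Hypothetical snippet used only for this illustration.
sample = BeautifulSoup('<p class="intro">Hello <b>world</b></p>', "html.parser")
node = sample.find("p")

inner = innerHTML(node)   # markup inside the <p>
outer = str(node)         # the <p> tag itself plus its contents
text = node.get_text()    # text only, with all tags stripped

print(inner)
print(outer)
print(text)
```

In short, innerHTML keeps nested markup such as the b tag, while get_text drops it.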

Next, feed the page we just fetched into BeautifulSoup and store the result in a variable:

testsoup = BeautifulSoup(driver.page_source, "html.parser")

Suppose we want to find the contents of every <p> tag in the HTML and print them. The following loop prints the contents of all p nodes:

for nodep in testsoup.find_all('p'):
    print(innerHTML(nodep))

We can also store the matches in a variable:

pArrNodes = testsoup.find_all('p')
print(pArrNodes[0])

If nothing matches, find_all returns an empty list (whereas find returns None), so we can use that in an if check:

pArrNodes = testsoup.find_all('p')
if pArrNodes:
    for nodep in pArrNodes:
        print(innerHTML(nodep))
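
As a quick check of what each lookup gives back when nothing matches, the following sketch runs both on a made-up document that contains no table element.

```python
from bs4 import BeautifulSoup

# Hypothetical document with no <table> elements.
soup = BeautifulSoup("<p>only a paragraph</p>", "html.parser")

missing_many = soup.find_all("table")  # an empty, list-like result
missing_one = soup.find("table")       # None

if not missing_many:
    print("find_all found nothing")
if missing_one is None:
    print("find found nothing")
```

This is why a truthiness test suits find_all, while an "is None" test suits find.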

If we want the p tags whose class is 'test':

testsoup.find_all('p', {'class': 'test'})
# The same approach works for filtering on other attributes
testsoup.find_all('p', {'id': 'test'})
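
The attribute filters can be tried without a browser. In this sketch the HTML is invented for the example; it also shows the class_ and id keyword arguments, which are an alternative spelling of the dictionary form.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<p class="test">first</p>
<p class="other">second</p>
<p id="test">third</p>
<p class="test extra">fourth</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("p", {"class": "test"})  # dictionary form
by_class_kw = soup.find_all("p", class_="test")   # keyword form
by_id = soup.find_all("p", id="test")

print([p.get_text() for p in by_class])
print([p.get_text() for p in by_id])
```

Note that a class filter matches any one of a tag's classes, so the paragraph with class "test extra" is included as well.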

You can also pass a lambda as the filter:

# This finds every p tag that has a class attribute
testsoup.find_all(lambda tag: tag.name == 'p' and 'class' in tag.attrs)
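
Here is a small offline sketch of function-based filtering, using invented markup and a named function in place of the lambda. It also shows the difference between find, which returns only the first match, and find_all, which returns every match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = (
    '<p class="a">one</p>'
    "<p>two</p>"
    '<div class="b">three</div>'
    '<p class="c">four</p>'
)
soup = BeautifulSoup(html, "html.parser")


def is_p_with_class(tag):
    # True only for <p> tags that carry a class attribute.
    return tag.name == "p" and "class" in tag.attrs


first_match = soup.find(is_p_with_class)      # first matching tag only
all_matches = soup.find_all(is_p_with_class)  # every matching tag

print(first_match.get_text())
print([tag.get_text() for tag in all_matches])
```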

Finally, let's grab every a tag under every p tag and print it (sample code):

from bs4 import BeautifulSoup
from selenium import webdriver


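def innerHTML(element):
    # Serialize the contents of a node as an HTML string
    return element.decode_contents(formatter="html")


def start(link):
    # Use Selenium to load the page the way a real browser would
    # executable_path is the (relative) path to your chromedriver
    # (newer Selenium 4 releases take a Service object instead)
    driver = webdriver.Chrome(executable_path="./chromedriver")
    driver.get(link)
    testsoup = BeautifulSoup(driver.page_source, "html.parser")
    # Find every a tag under every p tag
    # (find_all returns an empty list when nothing matches, so no None check is needed)
    for nodep in testsoup.find_all('p'):
        for nodea in nodep.find_all('a'):
            print(innerHTML(nodea))
    # Print the raw page source
    # print(driver.page_source)
    # Close the browser when done
    driver.quit()


url = 'http://example.com/'
start(url)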
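
The parsing half of this script can be tried on its own, without Selenium or a driver. The sketch below uses invented markup and a hypothetical helper named links_in_paragraphs that returns the link text and href of each a tag found inside a p tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<div>
  <p>See <a href="https://example.com/a">page A</a> and
     <a href="https://example.com/b">page B</a>.</p>
  <p>No links here.</p>
  <a href="https://example.com/c">outside any paragraph</a>
</div>
"""


def links_in_paragraphs(markup):
    """Return (text, href) pairs for every <a> nested inside a <p>."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for p in soup.find_all("p"):
        for a in p.find_all("a"):
            results.append((a.get_text(), a.get("href")))
    return results


for text, href in links_in_paragraphs(html):
    print(text, href)
```

The link that sits outside any paragraph is skipped, because the inner find_all only searches within each p node. Beautiful Soup's select method with the CSS selector "p a" is another way to express the same query.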

Conclusion

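These notes cover only the basics of BeautifulSoup; the link below has more detailed tutorials. It makes it easy to extract the content you want and to quickly filter down to the nodes you need, which is really convenient. When crawling web pages, though, do keep in mind whether collecting the data is legal.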

References

Beautiful Soup documentation (crummy.com): https://www.crummy.com/software/BeautifulSoup/bs4/doc/