[Python] Steps for Creating a Spider with Scrapy

Creating a spider

cd spiders
scrapy genspider [-t template] spider_name URL (without the leading https:// and the trailing /)


Move into the project's spiders folder and run the commands above.
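The URL argument is effectively the site's address with the scheme and the trailing slash removed. As a minimal sketch of that trimming, the helper below derives the value from a full URL using only the standard library. The name to_genspider_domain is hypothetical and introduced here for illustration; it is not part of Scrapy.

```python
from urllib.parse import urlparse


def to_genspider_domain(url: str) -> str:
    """Return the value to pass to `scrapy genspider` (hypothetical helper)."""
    parsed = urlparse(url)
    # netloc is empty when the scheme is missing, so fall back to the raw string
    host_and_path = parsed.netloc + parsed.path if parsed.netloc else url
    # Drop the trailing slash, as the command description asks
    return host_and_path.rstrip("/")


print(to_genspider_domain("https://xxx.com/"))
```

The printed value can then be pasted into the genspider command in place of URL.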

-t template

The default is the basic template.

Checking the available templates

scrapy genspider -l

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed


The command above lists the templates you can choose from.

The generated spider class

import scrapy


class XXXSpider(scrapy.Spider):
    name = "spider_name"
    allowed_domains = ["xxx"]
    start_urls = ["https://xxx"]

    def parse(self, response):
        pass


scrapy.Spider

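This is the base class that the spider inherits from. Because it supplies the default behavior, a spider of only a few lines can accomplish a great deal.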

allowed_domains

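The domain names the spider is allowed to access. Requests to any domain outside this list are blocked, which keeps the crawl from wandering off to other sites. The spider runs without it, but setting it is the safer choice so that you do not reach an unintended site.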
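To make the filtering idea concrete, here is a simplified stand-in for the check, not Scrapy's actual implementation. It assumes that a listed domain also covers its subdomains, which matches how allowed_domains is documented to behave. The function name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Simplified offsite check: exact domain or any of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["xxx.com"]
for url in ("https://xxx.com/page", "https://blog.xxx.com/", "https://other.example/"):
    print(url, is_allowed(url, allowed))
```

Matching on the "." + domain suffix rather than a bare suffix keeps a lookalike host such as notxxx.com from slipping through.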

start_urls

Scrapy sends its first requests to the URLs set here. It is a list, so you can specify more than one.

The parse method

The response from the website is received here. This is where you write the extraction logic, using XPath, CSS selectors, and so on.

Sample

With Scrapy, the basic coding is done after writing just a few lines.

import scrapy


class QiitaTrend1dSpider(scrapy.Spider):
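    name = "spider_name"
    allowed_domains = ["xxx.com"]
    start_urls = ["https://xxx.com"]

    def parse(self, response):
        # get() returns the first match (or None); getall() returns a list of all matches
        value1 = response.xpath('XPath that selects value1').get()
        value2 = response.xpath('XPath that selects value2').getall()

        yield {
            'key1': value1,
            'key2': value2,
        }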


response

Its data type is the class scrapy.http.response.html.HtmlResponse.

The official documentation describes its usage in detail (the xpath and css methods are also covered on this page).

https://doc-ja-scrapy.readthedocs.io/ja/latest/topics/request-response.html

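yield (return value)

yield pauses the function and hands a value back to the caller. Yielding a dict built from the response makes Scrapy treat it as a scraped item, so the result appears in the console output. Because the function is only paused, processing continues from that point afterward.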
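The pause-and-resume behavior is ordinary Python generator behavior and can be seen without Scrapy. In the sketch below, parse_like is a hypothetical stand-in for a parse method, and the for loop plays the role that Scrapy plays when it consumes the yielded items.

```python
def parse_like():
    """A plain generator that mimics the shape of a parse method."""
    print("before first yield")
    yield {"key1": "a"}
    # Execution resumes here when the caller asks for the next value
    print("between yields")
    yield {"key1": "b"}
    print("after last yield")


for item in parse_like():
    print("received", item)
```

Each yielded dict reaches the caller before the next line of the function runs, which is why a single parse call can emit many items.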

Running the spider

Run the following command from the project's root folder.

scrapy crawl spider_name (the value set for name in the spider class)


Exporting to JSON

Use the -o option.

scrapy crawl spider_name -o xxx.json
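The exported file holds a JSON array with one object per yielded item, so it can be read back with the standard json module. The sketch below is self-contained: it writes a made-up sample file to a temporary directory in place of a real crawl result, then loads it. The name load_items is hypothetical. It also assumes the file comes from a single run, because -o appends to an existing file and a second run would leave invalid JSON.

```python
import json
import tempfile
from pathlib import Path


def load_items(path) -> list:
    """Load the list of scraped items from a JSON feed file."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


# Stand-in for a file produced by the crawl (made-up items)
sample = [{"key1": "a", "key2": ["x", "y"]}]

with tempfile.TemporaryDirectory() as tmp:
    feed = Path(tmp) / "xxx.json"
    feed.write_text(json.dumps(sample), encoding="utf-8")
    items = load_items(feed)

print(items[0]["key1"], items[0]["key2"])
```

Against a real crawl, load_items("xxx.json") would be called on the file named in the -o option.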


Nesting (applying XPath to a Selector): relative XPath

parent = response.xpath('//div[contains(@class,"parent")]')
child = parent.xpath('.//div[contains(@class,"pName")]/p')


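You can call xpath on the Selector of a parent element. Note that, unlike when you apply XPath to response, the expression must start with a dot; without it, // searches the whole document again rather than only the parent.

This is called a relative XPath.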

