{"id":21136,"date":"2023-06-09T06:26:47","date_gmt":"2023-06-08T21:26:47","guid":{"rendered":"http:\/\/www.code-magagine.com\/?p=21136"},"modified":"2023-06-09T08:41:05","modified_gmt":"2023-06-08T23:41:05","slug":"%e3%80%90python%e3%80%91scrapy-item%e3%81%ae%e5%9f%ba%e6%9c%ac","status":"publish","type":"post","link":"http:\/\/www.code-magagine.com\/?p=21136","title":{"rendered":"[Python] Basics of Scrapy Item, Item Loader, and Item pipeline"},"content":{"rendered":"<h2>What is a Scrapy Item?<\/h2>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-large wp-image-21152\" src=\"http:\/\/www.code-magagine.com\/wp-content\/uploads\/2023\/06\/\u30b9\u30af\u30ea\u30fc\u30f3\u30b7\u30e7\u30c3\u30c8-2023-06-09-8.14.16-1024x402.png\" alt=\"\" width=\"1024\" height=\"402\" \/><\/p>\n<ul>\n<li>A container (object) that holds data scraped from a website.<\/li>\n<li>Data is assigned to fields declared in advance.<\/li>\n<li>Keeps the data structure consistent: assigning a value to an undeclared field raises an error (a KeyError).<\/li>\n<li>An Item pipeline can then cleanse, validate, and store the data in a database.<\/li>\n<li>When populating an Item, an Item Loader offers conveniences such as converting scraped strings to numbers.<\/li>\n<\/ul>\n<h3>Item 
Loader benefits<\/h3>\n<p>You can write the conversion logic for scraped data directly in a Spider, but then the code cannot be shared once you build several Spiders. Moving it into an Item Loader also keeps Spider code from bloating, which improves the quality of your Scrapy code.<\/p>\n<h2>Defining an Item<\/h2>\n<p>Edit the following file.<\/p>\n<pre class=\"lang:default decode:true \">projects\/project_name\/project_name\/items.py<\/pre>\n<h3>Definition<\/h3>\n<p>Define the Item like this.<\/p>\n<pre class=\"lang:default decode:true\">import scrapy\r\n\r\nclass BookItem(scrapy.Item):\r\n    # Each Field() declares one attribute the Item may hold.\r\n    title = scrapy.Field()\r\n    price = scrapy.Field()<\/pre>\n<h4>Using the Item<\/h4>\n<pre class=\"lang:default decode:true\">from scrapy.loader import ItemLoader\r\nfrom scrapy.spiders import CrawlSpider\r\n\r\nfrom project_name.items import BookItem\r\n\r\nclass BooksSpider(CrawlSpider):\r\n    def parse_item(self, response):\r\n        # Bind the loader to an empty BookItem and the current response.\r\n        loader = ItemLoader(item=BookItem(), response=response)\r\n        loader.add_xpath('title', 'XPath that selects the title')\r\n        loader.add_xpath('price', 'XPath that selects the price')\r\n        yield loader.load_item()\r\n<\/pre>\n<h4>loader = ItemLoader(item=BookItem(), response=response)<\/h4>\n<p>An ItemLoader is used to populate the Item with data.<\/p>\n<h4>yield 
loader.load_item()<\/h4>\n<p>load_item() stores the data collected from the website in the Item and returns it; yield then emits the populated Item.<\/p>\n<h3>Running<\/h3>\n<pre class=\"lang:default decode:true \">scrapy crawl spider_name -o file_name.json<\/pre>\n<h5>Result<\/h5>\n<p>The output looks like this.<\/p>\n<pre class=\"lang:default decode:true \">[\r\n    {\r\n        \"title\": [\r\n            \" \u30d7\u30ed\u3092\u76ee\u6307\u3059\u4eba\u306e\u305f\u3081\u306e\uff34\uff59\uff50\uff45\uff33\uff43\uff52\uff49\uff50\uff54\u5165\u9580\u2015\u5b89\u5168\u306a\u30b3\u30fc\u30c9\u306e\u66f8\u304d\u65b9\u304b\u3089\u9ad8\u5ea6\u306a\u578b\u306e\u4f7f\u3044\u65b9\u307e\u3067\"\r\n        ],\r\n        \"author\": [\r\n            \"\u9234\u6728 \u50da\u592a\u3010\u8457\u3011\"\r\n        ],\r\n        \"price\": [\r\n            \"\u00a53,278\"\r\n        ],\r\n        \"publisher\": [\r\n            \"\u6280\u8853\u8a55\u8ad6\u793e\"\r\n        ],\r\n        \"size\": [\r\n            \"\u30b5\u30a4\u30ba B5\u5224\uff0f\u30da\u30fc\u30b8\u6570 411p\uff0f\u9ad8\u3055 24cm\"\r\n        ],\r\n        \"isbn\": [\r\n            \"\u5546\u54c1\u30b3\u30fc\u30c9 9784297127473\"\r\n        ]\r\n    },\r\n<\/pre>\n<h2>Item Loader<\/h2>\n<p>An Item Loader holds the pre- and post-processing applied when data is stored in an Item, such as converting strings to numbers. The input processor and output processor logic also goes in items.py.<\/p>\n<h3>Adding to items.py<\/h3>\n<pre class=\"lang:default decode:true\">import scrapy\r\nfrom itemloaders.processors import Join, MapCompose, TakeFirst\r\n\r\n# Clean up a scraped price string such as \"\u00a53,278\".\r\ndef strip_yen(element):\r\n    if 
element:\r\n        # Drop the yen sign and the thousands separator so int() can parse the value.\r\n        return element.replace('\u00a5','').replace(',','')\r\n    return element\r\n\r\n# Convert the cleaned string to an integer; empty values become 0.\r\ndef convert_integer(element):\r\n    if element:\r\n        return int(element)\r\n    return 0\r\n\r\nclass BookItem(scrapy.Item):\r\n    title = scrapy.Field(\r\n        input_processor=MapCompose(str.lstrip),\r\n        output_processor=Join(' ')\r\n    )\r\n    price = scrapy.Field(\r\n        input_processor=MapCompose(strip_yen, convert_integer),\r\n        output_processor=TakeFirst()\r\n    )<\/pre>\n<h4>input processor<\/h4>\n<p>Specify an input processor when you want to process data extracted with XPath or CSS before it is loaded into the Item.<\/p>\n<ul>\n<li>For example, stripping leading spaces<\/li>\n<\/ul>\n<h5>MapCompose(function1, function2)<\/h5>\n<p>Use this to run functions on each input value before it is stored in the Item. Several functions can be given, separated by commas; they are applied to each value in the order listed.<\/p>\n<h4>output 
processor<\/h4>\n<p>The output processor determines the value stored in each field of the Item. The loader collects results as a list, so the output processor describes how to derive the final value from that list.<\/p>\n<ul>\n<li>For example, joining the list elements with a space between them<\/li>\n<\/ul>\n<h5>Join('separator')<\/h5>\n<p>Joins the list elements into a single string.<\/p>\n<h5>TakeFirst<\/h5>\n<p>Returns the first non-empty element of the list.<\/p>\n<h2>Item pipeline<\/h2>\n<p>An Item pipeline lets you cleanse, validate, and store Items in a database, among other things.<\/p>\n<h3>Methods available in a pipeline<\/h3>\n<table>\n<tbody>\n<tr>\n<td>from_crawler(cls,crawler)<\/td>\n<td>Class method. Called when the pipeline is instantiated.<\/td>\n<\/tr>\n<tr>\n<td>open_spider(self,spider)<\/td>\n<td>Called when the spider starts<\/td>\n<\/tr>\n<tr>\n<td>process_item(self,item,spider)<\/td>\n<td>Called for each item that passes through this item 
pipeline<\/td>\n<\/tr>\n<tr>\n<td>close_spider(self,spider)<\/td>\n<td>Called when the spider finishes<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>projects\/project_name\/project_name\/pipelines.py<\/h3>\n<p>Pipeline logic goes in this file.<\/p>\n<p>The example below checks whether a value is set on the item. If it is missing, the pipeline raises a DropItem exception and the item is discarded.<\/p>\n<pre class=\"lang:default decode:true\">from scrapy.exceptions import DropItem\r\n\r\nclass CheckItemPipeline:\r\n    def process_item(self, item, spider):\r\n        # Discard the item when the required field is missing or empty.\r\n        if not item.get('field_name'):\r\n            raise DropItem('Missing field_name')\r\n        return item\r\n<\/pre>\n<h3>Settings file (projects\/project_name\/project_name\/settings.py)<\/h3>\n<pre class=\"lang:default decode:true\"># Configure item pipelines\r\n# See https:\/\/docs.scrapy.org\/en\/latest\/topics\/item-pipeline.html\r\nITEM_PIPELINES = {\r\n   \"project_name.pipelines.CheckItemPipeline\": 300,\r\n}<\/pre>\n<p>Register the pipeline class in the settings file. Pipelines with lower numbers run first; the values are conventionally kept in the 0 to 1000 range.<\/p>\n<p>With this registered, every crawl run (scrapy crawl 
spider_name) performs the check automatically.<\/p>\n","protected":false},"excerpt":{"rendered":"What is a Scrapy Item? A container (object) that holds data scraped from a website. Data is assigned to fields declared in advance. Keeps the data structure consistent. (Assigning a value to an undeclared field [&hellip;]","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[47],"tags":[],"_links":{"self":[{"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=\/wp\/v2\/posts\/21136"}],"collection":[{"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=21136"}],"version-history":[{"count":20,"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=\/wp\/v2\/posts\/21136\/revisions"}],"predecessor-version":[{"id":21157,"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=\/wp\/v2\/posts\/21136\/revisions\/21157"}],"wp:attachment":[{"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=21136"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.code-magagine
.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=21136"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.code-magagine.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=21136"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}