A. Basic setup: scraping the content of a single page
(1) Create a Scrapy project
scrapy startproject getblog
(2) Edit items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()
(3) In the spiders folder, create blog_spider.py
You need some familiarity with XPath selection. It feels much like jQuery selectors, though it is less comfortable to use (w3school tutorial: /xpath/).
# coding=utf-8
from scrapy.spider import Spider
from getblog.items import BlogItem
from scrapy.selector import Selector


class BlogSpider(Spider):
    # Spider name, used on the command line
    name = 'blog'
    # Start URLs
    start_urls = ['/']

    def parse(self, response):
        sel = Selector(response)
        # XPath selector:
        # select every div whose class attribute is "post_item",
        # then take the second child div of each one
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # Text content, text(), of the a tag under the h3 tag
            item['title'] = site.xpath('h3/a/text()').extract()
            # Likewise, the text content of the p tag
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
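To see what the XPath expressions in the spider select, here is a minimal sketch that runs the same path structure against a small piece of sample markup. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, so it only covers the XPath subset that ElementTree supports (there is no text(), so the sketch reads .text instead). The markup, the extract_posts helper, and the class names other than post_item and post_item_summary are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, loosely shaped like the page the spider targets.
SAMPLE = """
<html><body>
  <div class="post_item">
    <div class="digg">3</div>
    <div class="post_item_body">
      <h3><a href="/p/1">First post</a></h3>
      <p class="post_item_summary">Summary one</p>
    </div>
  </div>
  <div class="post_item">
    <div class="digg">5</div>
    <div class="post_item_body">
      <h3><a href="/p/2">Second post</a></h3>
      <p class="post_item_summary">Summary two</p>
    </div>
  </div>
</body></html>
"""


def extract_posts(markup):
    """Return one dict per post, mirroring the spider's title/desc fields."""
    root = ET.fromstring(markup.strip())
    posts = []
    # Same shape as //div[@class="post_item"]/div[2] in the spider
    for site in root.findall(".//div[@class='post_item']/div[2]"):
        title = site.find("h3/a")
        summary = site.find("p[@class='post_item_summary']")
        posts.append({"title": title.text, "desc": summary.text})
    return posts


posts = extract_posts(SAMPLE)
```

Note that div[2] picks the second div child of each post_item block, which is why the first child (the digg counter in this sample) never shows up in the results.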
(4) Run the spider:
scrapy crawl blog  # that's all it takes
(5) Output file
Configure the output in settings.py.
# Output file location
FEED_URI = 'blog.xml'
# Output file format: json, xml, or csv
FEED_FORMAT = 'xml'
The output file is written to the project's root folder.
B. Basics: scrapy.spider.Spider
(1) Using the interactive shell
dizzy@dizzy-pc:~$ scrapy shell "/"
-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
-08-21 04:09:11+0800 [default] INFO: Spider opened
-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
# response.body returns the full content of the response
# response.xpath('//ul/li') lets you test any XPath expression

More important, if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css()
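In other words, the shell makes it easy to check interactively whether an XPath expression selects what you expect. I used to pick elements with Firefox's F12 developer tools, but that did not guarantee the right content was selected every time.
You can also use: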
scrapy shell "/" --nolog  # the --nolog option suppresses the log output
(2) Example
from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['']
    start_urls = [
        '/Computers/Programming/Languages/Python/Books/',
        '/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # One item per list entry on the page
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
(3) Saving to a file
The following command saves the scraped items to a file. The format can be json, xml, or csv.
scrapy crawl <spider_name> -o a.json -t json
(4) Creating a spider from a template
scrapy genspider baidu

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = [""]
    start_urls = (
        '/',
    )

    def parse(self, response):
        pass
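That's it for this part. I remember there being five points earlier, but now I can only recall four.
Always remember to hit the save button as you go; losing your work really spoils the mood (⊙o⊙)!
C. Advanced: scrapy.contrib.spiders.CrawlSpider
Example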
# coding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
import scrapy


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['']
    start_urls = ['/']
    rules = (  # a tuple of rules
        # Follow category pages, but skip subsection pages
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        # Parse item pages with parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('item page : %s' % response.url)
        # Note: a bare scrapy.Item() declares no fields; in practice use an
        # Item subclass that defines id, name and description
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID:(\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
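The allow and deny patterns in the rules are regular expressions matched against each link's URL. As a rough sketch of that filtering idea (not LinkExtractor's actual implementation), the plain re code below keeps a URL when it matches an allow pattern and no deny pattern. The is_followed helper and the sample URLs are hypothetical; the last line mimics what .re(r'ID:(\d+)') pulls out of a text node.

```python
import re

# Patterns copied from the first Rule in the example above
ALLOW = (r'category\.php',)
DENY = (r'subsection\.php',)


def is_followed(url, allow=ALLOW, deny=DENY):
    """Return True when url matches an allow pattern and no deny pattern."""
    allowed = any(re.search(pattern, url) for pattern in allow)
    denied = any(re.search(pattern, url) for pattern in deny)
    return allowed and not denied


# Hypothetical URLs, for illustration only
urls = [
    "/category.php?id=1",
    "/category.php?page=subsection.php",
    "/item.php?id=7",
]
followed = [url for url in urls if is_followed(url)]

# Capturing the digits after "ID:", as the .re() call in parse_item does
ids = re.findall(r'ID:(\d+)', "ID:12345")
```

In this sketch the second URL is rejected because the deny pattern wins over the allow pattern, and the third is ignored by this rule because it matches neither.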
Other spider classes are also available, such as XMLFeedSpider:
class scrapy.contrib.spiders.XMLFeedSpider
class scrapy.contrib.spiders.CSVFeedSpider
class scrapy.contrib.spiders.SitemapSpider
D. Selectors
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
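You can combine .css() and .xpath() flexibly to select the target data quickly.
Selectors deserve careful study. Both xpath() and css() take practice, and regular expressions are worth getting more familiar with as well.
When selecting by class, prefer css() for the selection itself, then use xpath() to pick out the element's attributes.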
E. Item Pipeline
Typical uses for item pipelines are:
• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database
(1) Validating data
from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            # Add VAT when the scraped price does not include it
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            return item
        else:
            # Items without a price are discarded
            raise DropItem('Missing price in %s' % item)
(2) Writing a JSON file
import json


class JsonWriterPipeline(object):
    def __init__(self):
        # Text mode, since json.dumps returns a string
        self.file = open('json.jl', 'w')

    def process_item(self, item, spider):
        # One JSON object per line
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item
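The pipeline above writes one JSON object per line. As a small sketch of that file layout, the code below writes a couple of items to an in-memory buffer and reads them back. The helper names write_json_lines and read_json_lines and the sample items are assumptions for illustration; io.StringIO stands in for the json.jl file.

```python
import io
import json


def write_json_lines(items, fileobj):
    """Write each item as one JSON object per line."""
    for item in items:
        fileobj.write(json.dumps(dict(item)) + "\n")


def read_json_lines(fileobj):
    """Parse a JSON-lines stream back into a list of dicts."""
    return [json.loads(line) for line in fileobj if line.strip()]


# Hypothetical items, for illustration only
sample_items = [{"title": "First post"}, {"title": "Second post"}]

buffer = io.StringIO()  # in-memory stand-in for the output file
write_json_lines(sample_items, buffer)
buffer.seek(0)
restored = read_json_lines(buffer)
```

Because every line is a complete JSON document, the file can be appended to item by item and read back line by line without loading everything at once.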
(3) Checking for duplicates
from scrapy.exceptions import DropItem


class Duplicates(object):
    def __init__(self):
        # IDs of the items processed so far
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item
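To get a feel for how the validation and duplicate pipelines behave together, here is a minimal sketch that chains them by hand outside Scrapy. The DropItem class is a local stand-in for scrapy.exceptions.DropItem so the code runs without Scrapy installed, and run_pipelines and the sample items are hypothetical; in a real project Scrapy itself calls process_item on each enabled pipeline.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] *= self.vat_factor
            return item
        raise DropItem('Missing price in %s' % item)


class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        self.ids_seen.add(item['id'])
        return item


def run_pipelines(items, pipelines):
    """Pass each item through every pipeline; skip items that get dropped."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider=None)
        except DropItem:
            continue
        kept.append(item)
    return kept


# Hypothetical items, for illustration only
items = [
    {'id': 1, 'price': 10.0, 'price_excludes_vat': True},
    {'id': 1, 'price': 10.0, 'price_excludes_vat': True},   # duplicate id
    {'id': 2, 'price': None, 'price_excludes_vat': False},  # missing price
    {'id': 3, 'price': 4.0, 'price_excludes_vat': False},
]
kept = run_pipelines(items, [DuplicatesPipeline(), PricePipeline()])
```

Only the first and last items survive here: the second is dropped as a duplicate and the third for its missing price.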
Writing the data to a database should be just as simple: store the item from inside the process_item function.
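As one possible sketch of the database idea, the pipeline below stores items in SQLite using the standard library's sqlite3 module. The class name SQLitePipeline, the items table, and its id and name columns are assumptions chosen for illustration; open_spider and close_spider are the optional pipeline hooks Scrapy calls when a spider starts and finishes. The calls at the bottom drive the pipeline by hand with an in-memory database.

```python
import sqlite3


class SQLitePipeline(object):
    """Stores each item in a SQLite table (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.conn = None

    def open_spider(self, spider):
        # Open the connection and make sure the table exists
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Parameterized query; an existing row with the same id is replaced
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, name) VALUES (?, ?)",
            (item["id"], item["name"]),
        )
        return item


# Driving the pipeline by hand with hypothetical items
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"id": "1", "name": "first"}, spider=None)
pipeline.process_item({"id": "2", "name": "second"}, spider=None)
rows = pipeline.conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
```

Returning the item from process_item keeps it flowing to any later pipelines, the same way the JSON and duplicate-check examples do.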