1. Goal
Target site: http://blog.jobbole.com/all-posts/. We crawl every article on Jobbole (伯乐在线), extracting the cover-image URL, title, publish date, tags, upvote count, comment count, and so on, and save the scraped data to a database (in both a synchronous and an asynchronous way).
2. Requirements
Python 3.6, MySQL, Scrapy 1.5
3. Approach
From each list page, extract the article URLs and cover-image URLs and hand them to Scrapy for download, then follow the next-page link and parse it the same way; parse each article page and save its fields to the database.
(Figure: overview of the crawl logic)
4. Walkthrough
4.1 Create the Scrapy project
Create the project from the command line:
scrapy startproject jobbole
Create our spider:
cd jobbole
scrapy genspider blogjobbole blog.jobbole.com  # uses the basic template by default
Open blogjobbole.py to see the generated template code:
# -*- coding: utf-8 -*-
import scrapy

class BlogjobboleSpider(scrapy.Spider):
    name = "blogjobbole"
    allowed_domains = ["blog.jobbole.com"]
    # start_urls is the list of URLs to crawl; Scrapy downloads each one
    # and passes the response to parse()
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass
Since our crawl actually starts at http://blog.jobbole.com/all-posts/, adjust start_urls so the spider begins from that page:
start_urls = ['http://blog.jobbole.com/all-posts/']
Note: Scrapy obeys robots.txt by default, and the site we are crawling may well disallow crawlers, so change the setting in settings.py to stop obeying the robots protocol:
ROBOTSTXT_OBEY = False  # change True to False
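While you are in settings.py, two more settings are often set to be polite to the site (optional, not part of the original walkthrough; the User-Agent string below is only an illustration):

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # illustrative browser-style UA
DOWNLOAD_DELAY = 1  # wait one second between requests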
4.2 Parse the list page and extract URLs
from urllib import parse
from scrapy import Request

def parse(self, response):
    """
    1. Extract each article URL from the list page and hand it to Scrapy
       for download and parsing.

    2. Extract the next-page URL and hand it to Scrapy; the downloaded
       page comes back to parse().
    """
    # parse all article URLs on the list page and schedule them for download
    post_nodes = response.css("#archive .floated-thumb .post-thumb a")
    for post_node in post_nodes:
        image_url = post_node.css("img::attr(src)").extract_first("")
        post_url = post_node.css("::attr(href)").extract_first("")
        # meta passes the cover-image URL on to parse_detail's response
        yield Request(url=parse.urljoin(response.url, post_url),
                      meta={"front_image_url": image_url},
                      callback=self.parse_detail)
    # extract the next page and schedule it for download
    next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
    if next_url:
        yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
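To check the selectors interactively (an optional aside, not in the original post), the Scrapy shell is handy:

scrapy shell http://blog.jobbole.com/all-posts/
>>> response.css("#archive .floated-thumb .post-thumb a::attr(href)").extract_first("")
>>> response.css(".next.page-numbers::attr(href)").extract_first("")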
Joining URLs with urljoin() handles hrefs that may be relative or incomplete:

from urllib import parse
url = parse.urljoin(response.url, post_url)
parse.urljoin("http://blog.jobbole.com/all-posts/", "http://blog.jobbole.com/111535/")
# result: http://blog.jobbole.com/111535/

We pass a base_url as the first argument and the new link as the second; the method analyses the scheme, netloc, and path of base_url, fills in whichever of those parts the new link is missing, and returns the result. Links extracted from a page often lack the leading domain, and urljoin() resolves them cleanly.
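For example, a relative href picks up the missing scheme and host from the base URL (a quick standalone check; the article path here is made up):

from urllib import parse

# the missing scheme and netloc are copied from the base URL
print(parse.urljoin("http://blog.jobbole.com/all-posts/", "/113658/"))
# -> http://blog.jobbole.com/113658/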
4.3 Extract the article content
4.3.1 XPath
import re
import datetime

from jobbole.items import JobBoleArticleItem

def parse_detail(self, response):
    article_item = JobBoleArticleItem()  # instantiate the item
    front_image_url = response.meta.get("front_image_url", "")  # cover image passed in from parse()
    title = response.xpath('//div[@class="entry-header"]/h1/text()').extract_first("")
    create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace("·", "").strip()
    praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
    fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
    match_re = re.match(".*?(\d+).*", fav_nums)
    if match_re:
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0
    comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
    match_re = re.match(".*?(\d+).*", comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0
    tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tags = ",".join(tag_list)
    # fill the extracted values into the item
    article_item["title"] = title
    article_item["url"] = response.url
    try:
        create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
    except Exception:
        create_date = datetime.datetime.now().date()
    article_item["create_date"] = create_date
    article_item["front_image_url"] = [front_image_url]
    article_item["praise_nums"] = praise_nums
    article_item["comment_nums"] = comment_nums
    article_item["fav_nums"] = fav_nums
    article_item["tags"] = tags
    yield article_item
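The pattern ".*?(\d+).*" used above pulls the first run of digits out of strings such as " 8 收藏" (a standalone check; the sample strings are made up but mirror the page's format):

import re

for s in [" 8 收藏", "2 评论", "收藏"]:  # made-up samples
    m = re.match(".*?(\d+).*", s)
    print(int(m.group(1)) if m else 0)  # -> 8, 2, 0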
4.3.2 CSS
import re
import datetime

from jobbole.items import JobBoleArticleItem

def parse_detail(self, response):
    article_item = JobBoleArticleItem()  # instantiate the item
    front_image_url = response.meta.get("front_image_url", "")  # cover image
    title = response.css(".entry-header h1::text").extract()[0]
    create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·", "").strip()
    praise_nums = response.css(".vote-post-up h10::text").extract()[0]
    fav_nums = response.css(".bookmark-btn::text").extract()[0]
    match_re = re.match(".*?(\d+).*", fav_nums)
    if match_re:
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0
    comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
    match_re = re.match(".*?(\d+).*", comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0
    tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tags = ",".join(tag_list)
    # fill the extracted values into the item
    article_item["title"] = title
    article_item["url"] = response.url
    try:
        create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
    except Exception:
        create_date = datetime.datetime.now().date()
    article_item["create_date"] = create_date
    article_item["front_image_url"] = [front_image_url]
    article_item["praise_nums"] = praise_nums
    article_item["comment_nums"] = comment_nums
    article_item["fav_nums"] = fav_nums
    article_item["tags"] = tags
    yield article_item
4.3.3 ItemLoader
ItemLoader() instantiates a loader bound to an item class, which it then fills with data; a custom loader that subclasses ItemLoader is used the same way.
Basic usage: item_loader = ItemLoader(item=JobBoleArticleItem(), response=response). The first argument is an instance of the item class to fill (note the parentheses), the second is the response.
Methods on an ItemLoader object:
add_xpath('field_name', 'xpath expression'): fill the field with the data matched by the XPath expression
add_css('field_name', 'css selector'): fill the field with the data matched by the CSS selector
add_value('field_name', value): fill the field with the given value directly
load_item(): takes no arguments; it runs the processors, fills every field of the item class, and returns the populated item, which is then yielded
from scrapy.loader import ItemLoader

from jobbole.items import JobBoleArticleItem

def parse_detail(self, response):
    item_loader = ItemLoader(item=JobBoleArticleItem(), response=response)
    front_image_url = response.meta.get("front_image_url", "")
    item_loader.add_css("title", ".entry-header h1::text")
    item_loader.add_value("url", response.url)
    item_loader.add_css("create_date", "p.entry-meta-hide-on-mobile::text")
    item_loader.add_value("front_image_url", [front_image_url])
    item_loader.add_css("praise_nums", ".vote-post-up h10::text")
    item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
    item_loader.add_css("fav_nums", ".bookmark-btn::text")
    item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
    article_item = item_loader.load_item()
    yield article_item
With a plain ItemLoader, every field comes back as a list. The fields can be post-processed in items.py; once the custom ArticleItemLoader from section 4.4 is defined, instantiate it in place of ItemLoader so that TakeFirst() applies to every field by default.
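As a quick standalone illustration of what the processors defined next in items.py do (the sample values are made up):

from scrapy.loader.processors import MapCompose, TakeFirst

# TakeFirst returns the first non-empty value from the collected list
print(TakeFirst()(["8 收藏", ""]))                  # -> "8 收藏"
# MapCompose applies each function in turn to every value in the list
print(MapCompose(str.strip, str.upper)([" abc "]))  # -> ['ABC']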
4.4 Modifying the Item fields
The initial items.py:
class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    front_image_url = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field()
After adding input/output processors and the insert-statement helper, items.py becomes:
import datetime
import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join, TakeFirst

# MapCompose takes one or more functions and applies them in turn to every
# value of the field

def return_value(value):
    return value

def date_convert(value):
    try:
        create_date = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception:
        create_date = datetime.datetime.now().date()
    return create_date

def get_nums(value):
    match_re = re.match(".*?(\d+).*", value)
    if match_re:
        nums = int(match_re.group(1))
    else:
        nums = 0
    return nums

def remove_comment_tags(value):
    # drop the "N 评论" entry that gets extracted along with the tags
    if "评论" in value:
        return ""
    else:
        return value

def remove_time(value):
    if "·" in value:
        return value.replace("·", "").strip()
    else:
        return value

class ArticleItemLoader(ItemLoader):
    # custom loader: by default take the first value instead of a list
    default_output_processor = TakeFirst()

class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field(
        input_processor=MapCompose(remove_time),
        output_processor=TakeFirst()
    )
    url = scrapy.Field()
    front_image_url = scrapy.Field(
        # keep the value as a list (e.g. for Scrapy's images pipeline)
        output_processor=MapCompose(return_value)
    )
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    tags = scrapy.Field(
        input_processor=MapCompose(remove_comment_tags),
        output_processor=Join(",")
    )

    # build the SQL statement used by the database pipelines; with url as the
    # primary key, ON DUPLICATE KEY UPDATE makes re-crawls update fav_nums
    # instead of failing on a duplicate row
    def get_insert_sql(self):
        insert_sql = """
            insert into jobbole(title, create_date, url, front_image_url,
                                praise_nums, comment_nums, fav_nums, tags)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums)
        """
        create_date = date_convert(self["create_date"])
        params = (
            self["title"], create_date, self["url"],
            self["front_image_url"][0],  # front_image_url is a list; store the first URL
            self["praise_nums"], self["comment_nums"], self["fav_nums"], self["tags"]
        )
        return insert_sql, params
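A quick way to sanity-check the statement outside Scrapy (a sketch; every field value below is made up):

from jobbole.items import JobBoleArticleItem

item = JobBoleArticleItem()
item["title"] = "sample title"
item["create_date"] = "2018/08/10"
item["url"] = "http://blog.jobbole.com/111535/"
item["front_image_url"] = ["http://example.com/cover.jpg"]
item["praise_nums"] = 1
item["comment_nums"] = 0
item["fav_nums"] = 8
item["tags"] = "职场,面试"

insert_sql, params = item.get_insert_sql()
print(params)  # the tuple that will be bound to the %s placeholders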
4.5 Save to the database (synchronously and asynchronously)
4.5.1 Create the table
① Create the table with Python
import pymysql

host = "localhost"
user = "root"
password = "123456"
port = 3306
database = "spider"
table = "jobbole"

# connect to the MySQL server
db = pymysql.connect(host=host, user=user, password=password, port=port)
cursor = db.cursor()

# create the database
sql1 = '''create database if not exists {database} charset utf8'''.format(database=database)
cursor.execute(sql1)

# create the table
sql2 = '''create table if not exists {database}.{table}(
    title varchar(255) not null,
    create_date datetime not null,
    url varchar(300) not null,
    front_image_url varchar(300) not null,
    tags varchar(255) not null,
    praise_nums int(11) not null,
    comment_nums int(11) not null,
    fav_nums int(11) not null,
    PRIMARY KEY (url)
) charset utf8'''.format(database=database, table=table)
cursor.execute(sql2)
db.close()
② Create it directly in MySQL
Create the database (create database <database name>):
create database spider charset utf8;
Create the table:
create table if not exists spider.jobbole(
    title varchar(255) not null,
    create_date datetime not null,
    url varchar(300) not null,
    front_image_url varchar(300) not null,
    tags varchar(255) not null,
    praise_nums int(11) not null,
    comment_nums int(11) not null,
    fav_nums int(11) not null,
    PRIMARY KEY (url)
) charset utf8;
4.5.2 Pipelines that write to the database
Method 1: write to MySQL synchronously
import pymysql
from scrapy.conf import settings  # the database info lives in settings.py

class MysqlPipeline(object):
    def __init__(self):
        # read the basic connection info from settings
        host = settings["MYSQL_HOST"]
        user = settings["MYSQL_USER"]
        psd = settings["MYSQL_PASSWD"]
        db = settings["MYSQL_DBNAME"]
        try:
            # connect to the database
            self.connect = pymysql.connect(host=host, user=user, passwd=psd,
                                           db=db, charset="utf8", use_unicode=True)
            # the cursor executes inserts, deletes, queries and updates
            self.cursor = self.connect.cursor()
        except pymysql.MySQLError as e:
            print(e.args)

    def process_item(self, item, spider):
        # deduplicate: is this URL already in the table?
        self.cursor.execute(
            """select * from jobbole where url = %s""",
            (item["url"],))
        repetition = self.cursor.fetchone()
        if repetition:
            pass
        else:
            try:
                insert_sql, params = item.get_insert_sql()
                self.cursor.execute(insert_sql, params)
                self.connect.commit()
            except pymysql.MySQLError as e:
                print(e.args)
                self.connect.rollback()
        return item  # must return the item so later pipelines receive it
Method 2: write to MySQL asynchronously
Twisted provides a connection pool whose constructor is adbapi.ConnectionPool(dbapiName, *connargs, **connkw); instead of importing the database module directly, we only pass its name ("pymysql") plus the connection arguments.

import pymysql
import pymysql.cursors
from twisted.enterprise import adbapi

class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWD"],
            charset="utf8",
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        # no need to import the db module itself: adbapi.ConnectionPool just
        # needs the module's name, here "pymysql", and the connection kwargs
        dbpool = adbapi.ConnectionPool("pymysql", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to turn the mysql insert into an asynchronous call
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # deal with exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # run the actual insert:
        # build the item-specific sql statement and execute it against mysql
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
4.6 Settings
4.6.1 Database credentials
Add the database info to settings.py:
MYSQL_HOST = "localhost"
MYSQL_DBNAME = "spider"
MYSQL_USER = "root"
MYSQL_PASSWD = "123456"
4.6.2 Enable the pipelines
Enable the corresponding pipelines in ITEM_PIPELINES (lower numbers run earlier). Enable only one of the two MySQL pipelines at a time, otherwise every item is written twice:

ITEM_PIPELINES = {
    'jobbole.pipelines.JobbolePipeline': 300,
    'jobbole.pipelines.MysqlTwistedPipline': 2,
    # 'jobbole.pipelines.MysqlPipeline': 2,
}
4.7 Run the spider
From the command line:
cd jobbole
scrapy crawl blogjobbole
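Alternatively (optional, not part of the original post), a small script in the project root lets you run and debug the spider from an IDE; the file name main.py is my choice:

# main.py — run the spider without typing the command manually (sketch)
import os
import sys

from scrapy.cmdline import execute

# make sure the project root is on the path, then invoke the crawl command
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "blogjobbole"])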
Finally, the full code is available at: https://github.com/Damaomaomao/jobbole
Questions and corrections are welcome!