Scrapy | Crawling All Jobbole Blog Posts (XPath / CSS / ItemLoader Extraction; Synchronous and Asynchronous MySQL Storage)

1. Goal

Target site: http://blog.jobbole.com/all-posts/. Crawl every article on Jobbole, collecting the cover-image URL, title, publish date, tags, upvote count, comment count, and so on, then save the scraped data to a database (in both a synchronous and an asynchronous way).

2. Environment

Python 3.6, MySQL, Scrapy 1.5

3. Approach

From each article-list page, extract the article URLs and cover-image URLs, then follow the next-page link and parse it the same way; from each article page, extract the article details and save them to the database.

(Figure: crawl-logic overview)

4. Hands-on practice

4.1 Create the Scrapy project

Create the project from the command line (cmd):

scrapy startproject jobbole

Create our spider:

cd jobbole

scrapy genspider blogjobbole blog.jobbole.com  # uses the default basic template

Open blogjobbole.py and look at the generated template code:

# -*- coding: utf-8 -*-
import scrapy


class Blogjobbole(scrapy.Spider):
    name = "blogjobbole"
    allowed_domains = ["blog.jobbole.com"]
    # start_urls is the list of URLs to crawl; Scrapy downloads each one
    # and passes the response to parse()
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass

Since our crawl actually starts from http://blog.jobbole.com/all-posts/, we need to adjust start_urls so the spider begins from that page.

start_urls = ['http://blog.jobbole.com/all-posts/']

Note: Scrapy obeys robots.txt by default, and the target site may disallow crawlers, so change the setting in settings.py so the spider does not obey the robots protocol.

ROBOTSTXT_OBEY = False  # change True to False

4.2 Parse the list page and extract URLs

from urllib import parse

from scrapy import Request


def parse(self, response):
    """
    1. Extract the article URLs from the list page and hand them to Scrapy to download and parse.
    2. Extract the next-page URL and hand it to Scrapy; once downloaded it is parsed by parse() again.
    """
    # Extract every article URL on the list page and schedule it for download and parsing
    post_nodes = response.css("#archive .floated-thumb .post-thumb a")
    for post_node in post_nodes:
        image_url = post_node.css("img::attr(src)").extract_first("")
        post_url = post_node.css("::attr(href)").extract_first("")
        # meta passes the cover-image URL along with the request,
        # so parse_detail can read it from response.meta
        yield Request(url=parse.urljoin(response.url, post_url),
                      meta={"front_image_url": image_url},
                      callback=self.parse_detail)

    # Extract the next-page URL and schedule it for download
    next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
    if next_url:
        yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

urljoin is used to build the full URL, because the href attribute may hold an incomplete (relative) address:

from urllib import parse

url = parse.urljoin(response.url, post_url)

parse.urljoin("http://blog.jobbole.com/all-posts/", "http://blog.jobbole.com/111535/")

The result is http://blog.jobbole.com/111535/.

We pass a base_url (base link) as the first argument and the new link as the second; the method analyzes the scheme, netloc and path of base_url, fills in whatever the new link is missing, and returns the result.

Links extracted from a page often lack the leading domain, and urljoin() resolves them cleanly.
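A quick standalone sketch of this behaviour (the relative paths below are made up for illustration):

from urllib import parse

base = "http://blog.jobbole.com/all-posts/"
print(parse.urljoin(base, "/111535/"))      # -> http://blog.jobbole.com/111535/
print(parse.urljoin(base, "page/2/"))       # -> http://blog.jobbole.com/all-posts/page/2/
print(parse.urljoin(base, "http://blog.jobbole.com/111535/"))  # absolute URLs come back unchanged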

4.3 Extract the article content

4.3.1 XPath

import re
import datetime

from jobbole.items import JobBoleArticleItem


def parse_detail(self, response):
    article_item = JobBoleArticleItem()  # instantiate the item
    front_image_url = response.meta.get("front_image_url", "")  # cover image passed from parse()

    title = response.xpath('//div[@class="entry-header"]/h1/text()').extract_first("")
    create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace("·", "").strip()
    praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]

    fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
    match_re = re.match(r".*?(\d+).*", fav_nums)
    if match_re:
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0

    comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
    match_re = re.match(r".*?(\d+).*", comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0

    tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
    # drop the "N 评论" (comment count) entry that gets picked up with the tags
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tags = ",".join(tag_list)

    # Fill the extracted values into the item
    article_item["title"] = title
    article_item["url"] = response.url
    try:
        create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    article_item["create_date"] = create_date
    article_item["front_image_url"] = [front_image_url]
    article_item["praise_nums"] = praise_nums
    article_item["comment_nums"] = comment_nums
    article_item["fav_nums"] = fav_nums
    article_item["tags"] = tags

    yield article_item

4.3.2 CSS

import re
import datetime

from jobbole.items import JobBoleArticleItem


def parse_detail(self, response):
    article_item = JobBoleArticleItem()  # instantiate the item
    front_image_url = response.meta.get("front_image_url", "")  # cover image passed from parse()

    title = response.css(".entry-header h1::text").extract()[0]
    create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·", "").strip()
    praise_nums = response.css(".vote-post-up h10::text").extract()[0]

    fav_nums = response.css(".bookmark-btn::text").extract()[0]
    match_re = re.match(r".*?(\d+).*", fav_nums)
    if match_re:
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0

    comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
    match_re = re.match(r".*?(\d+).*", comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0

    tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
    # drop the "N 评论" (comment count) entry that gets picked up with the tags
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tags = ",".join(tag_list)

    # Fill the extracted values into the item
    article_item["title"] = title
    article_item["url"] = response.url
    try:
        create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    article_item["create_date"] = create_date
    article_item["front_image_url"] = [front_image_url]
    article_item["praise_nums"] = praise_nums
    article_item["comment_nums"] = comment_nums
    article_item["fav_nums"] = fav_nums
    article_item["tags"] = tags

    yield article_item

4.3.3 ItemLoader

ItemLoader() instantiates a loader for the item container class and fills it with data; a custom loader that inherits from ItemLoader is used in exactly the same way.

Basic usage: item_loader = ItemLoader(item=JobBoleArticleItem(), response=response). Arguments: the first is an instance of the item container class to fill (note the parentheses), the second is the response.

Methods on an ItemLoader object:

add_xpath('field_name', 'xpath expression'): fill the field with the data matched by the XPath expression.
add_css('field_name', 'CSS selector'): fill the field with the data matched by the CSS selector.
add_value('field_name', value): fill the field with the given value.
load_item(): takes no arguments and builds the item; once the loaded item is yielded, every field of the item container class has been filled.

from scrapy.loader import ItemLoader

from jobbole.items import JobBoleArticleItem


def parse_detail(self, response):
    # Fill the item through an ItemLoader instead of assigning fields by hand
    item_loader = ItemLoader(item=JobBoleArticleItem(), response=response)
    front_image_url = response.meta.get("front_image_url", "")

    item_loader.add_css("title", ".entry-header h1::text")
    item_loader.add_value("url", response.url)
    item_loader.add_css("create_date", "p.entry-meta-hide-on-mobile::text")
    item_loader.add_value("front_image_url", [front_image_url])
    item_loader.add_css("praise_nums", ".vote-post-up h10::text")
    item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
    item_loader.add_css("fav_nums", ".bookmark-btn::text")
    item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")

    article_item = item_loader.load_item()
    yield article_item

With this approach every loaded value ends up as a list.

These fields can then be post-processed in items.py.
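For example, before any processors are applied, the loaded item looks roughly like this (the values below are purely illustrative, not real scraped output):

# Rough illustration only -- the strings are hypothetical page data
{'title': ['Some article title'],
 'create_date': ['\r\n\r\n            2017/07/17 ·  ', '  ·  '],
 'praise_nums': ['2'],
 'fav_nums': [' 3 收藏'],
 'comment_nums': [' 1 评论'],
 'tags': ['职场', ' 1 评论', '面试']}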

4.4 Adjust the Item fields

---------- items.py (initial version) ----------

import scrapy


class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    front_image_url = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field()

---------- items.py (final version, with processors and get_insert_sql) ----------

import datetime
import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join, TakeFirst

# MapCompose accepts one or more functions; every value extracted for the
# field is passed through them in order.

def return_value(value):
    # Identity function, used to override the default output processor
    return value


def date_convert(value):
    # Parse "2017/07/17" style dates; fall back to today on failure
    try:
        create_date = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    return create_date


def get_nums(value):
    # Pull the first integer out of strings such as " 2 收藏"
    match_re = re.match(r".*?(\d+).*", value)
    if match_re:
        nums = int(match_re.group(1))
    else:
        nums = 0
    return nums


def remove_comment_tags(value):
    # Drop the "N 评论" (comment count) entry extracted along with the tags
    if "评论" in value:
        return ""
    else:
        return value


def remove_time(value):
    # Strip the trailing "·" separator from the date string
    if "·" in value:
        return value.replace("·", "").strip()
    else:
        return value


class ArticleItemLoader(ItemLoader):
    # Custom ItemLoader whose default output processor takes the first value;
    # use it in parse_detail in place of the plain ItemLoader
    default_output_processor = TakeFirst()

class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field(
        input_processor=MapCompose(remove_time),
        output_processor=TakeFirst()
    )
    url = scrapy.Field()
    front_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)  # keep the value as a list
    )
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    tags = scrapy.Field(
        input_processor=MapCompose(remove_comment_tags),
        output_processor=Join(",")
    )

    # SQL used when writing the item to the database
    def get_insert_sql(self):
        insert_sql = '''
            insert into jobbole(title, create_date, url, front_image_url,
                                praise_nums, comment_nums, fav_nums, tags)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums)
        '''
        create_date = date_convert(self['create_date'])
        params = (
            self['title'], create_date, self['url'], self['front_image_url'][0],  # the cover url is kept as a list
            self['praise_nums'], self['comment_nums'], self['fav_nums'], self['tags']
        )
        return insert_sql, params
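As a quick aside, here is a minimal, standalone sketch of how MapCompose chains the kind of processors used above (the lambdas and sample strings are made up for illustration):

from scrapy.loader.processors import MapCompose

# Each extracted value is passed through the functions from left to right
clean = MapCompose(lambda v: v.replace("·", "").strip(), lambda v: v.upper())

print(clean(["  2017/07/17 ·  ", " python · "]))
# -> ['2017/07/17', 'PYTHON']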

4.5 Save to the database (synchronous and asynchronous)

4.5.1 Create the table

Option 1: create the table with Python

import pymysql

host = "localhost"
user = "root"
password = "123456"
port = 3306
database = "spider"
table = "jobbole"   # must match the table name used in get_insert_sql

# Connect to MySQL
db = pymysql.connect(host=host, user=user, password=password, port=port)
cursor = db.cursor()

# Create the database
sql1 = '''create database {database} charset utf8'''.format(database=database)
cursor.execute(sql1)

# Create the table
sql2 = '''create table if not exists {database}.{table}(
    title varchar(255) not null,
    create_date datetime not null,
    url varchar(300) not null,
    front_image_url varchar(300) not null,
    tags varchar(255) not null,
    praise_nums int(11) not null,
    comment_nums int(11) not null,
    fav_nums int(11) not null,
    PRIMARY KEY (url)
)charset utf8'''.format(database=database, table=table)
cursor.execute(sql2)

db.close()

Option 2: create the table directly in MySQL

Create the database with create database <database name>:

create database spider charset utf8;

Create the table:

create table if not exists spider.jobbole(
    title varchar(255) not null,
    create_date datetime not null,
    url varchar(300) not null,
    front_image_url varchar(300) not null,
    tags varchar(255) not null,
    praise_nums int(11) not null,
    comment_nums int(11) not null,
    fav_nums int(11) not null,
    PRIMARY KEY (url)
)charset utf8;

4.5.2 Save through item pipelines

Method 1: write to MySQL synchronously

import pymysql
from scrapy.conf import settings  # the database information is kept in settings.py


class MysqlPipeline(object):
    def __init__(self):
        # Read the basic MySQL settings
        host = settings["MYSQL_HOST"]
        user = settings["MYSQL_USER"]
        psd = settings['MYSQL_PASSWD']
        db = settings['MYSQL_DBNAME']
        try:
            # Connect to the database
            self.connect = pymysql.connect(host=host, user=user, passwd=psd, db=db,
                                           charset="utf8", use_unicode=True)
            # All inserts, updates and queries go through this cursor
            self.curse = self.connect.cursor()
        except pymysql.MySQLError as e:
            print(e.args)

    def process_item(self, item, spider):
        # Deduplicate: skip the insert if this url is already in the table
        self.curse.execute(
            """ select * from jobbole where url = %s """,
            (item['url'],))
        repetition = self.curse.fetchone()
        if repetition:
            pass
        else:
            try:
                insert_sql, params = item.get_insert_sql()
                self.curse.execute(insert_sql, params)
                self.connect.commit()
            except pymysql.MySQLError as e:
                print(e.args)
                self.connect.rollback()
        return item

—————————————————

Method 2: write to MySQL asynchronously

import pymysql.cursors
from twisted.enterprise import adbapi

# Connection pool: adbapi.ConnectionPool's constructor is
#   def __init__(self, dbapiName, *connargs, **connkw)


class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWD"],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        # **dbparms unpacks into ConnectionPool("pymysql", host=settings['MYSQL_HOST'], ...).
        # There is no need to import the db module directly; just tell
        # adbapi.ConnectionPool the name of the module you use, e.g. pymysql.
        dbpool = adbapi.ConnectionPool("pymysql", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use twisted to turn the MySQL insert into an asynchronous operation
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert;
        # different items can build different SQL via their own get_insert_sql()
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

4.6 settings.py configuration

4.6.1 Add the database information

Add the database information to settings.py:

---------- settings.py ----------

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'spider'
MYSQL_USER = 'root'
MYSQL_PASSWD = '123456'

4.6.2 Enable the corresponding pipelines

Enable the pipelines you need in ITEM_PIPELINES:

ITEM_PIPELINES = {
    'jobbole.pipelines.JobbolePipeline': 300,
    # enable whichever MySQL pipeline you want (normally just one of the two):
    'jobbole.pipelines.MysqlTwistedPipline': 2,
    'jobbole.pipelines.MysqlPipeline': 2,
}

4.7 Run the spider

Open cmd:

cd jobbole

scrapy crawl blogjobbole
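If you would rather launch the spider from an IDE than from cmd, a small runner script works too. This is a common convention rather than part of the generated project; the file name main.py is my own choice:

# main.py -- put it in the project root, next to scrapy.cfg
import os
import sys

from scrapy.cmdline import execute

# Make sure the project root is on sys.path so the jobbole package can be found
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

execute(["scrapy", "crawl", "blogjobbole"])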

Finally, the full code is available at: https://github.com/Damaomaomao/jobbole

Questions and suggestions are welcome!

Original: https://blog.csdn.net/weixin_36359107/article/details/114469327
Author: kylaCpp
