Scraping cnblogs with a Python crawler

  • Let's write the code
# Author: Lovyya
# File : blog_spider
import requests
import json
from bs4 import BeautifulSoup
import re

# matches the digits inside the urls, kept for consistency with the teacher's urls
rule = re.compile(r"\d+")

urls = [f'https://www.cnblogs.com/#p{page}' for page in range(1, 31)]
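
# The "#p{page}" part of each url is only a fragment: the server never sees it,
# so a plain GET on these urls would always return the first page. That is
# presumably why the data is fetched through the POST API below instead.
# Quick check of the regex: rule.findall(urls[2]) returns ['3'].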

# URL for the POST request
url = "https://www.cnblogs.com/AggSite/AggSitePostList"
headers = {
    "content-type": "application/json",
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30"
}

def craw(urls):
    # idx is the num in 'xxx.xxxx.xxx/#p{num}'; extracting it here means the
    # producer-consumer code later in the series (sketched after this script)
    # needs no changes
    idx = rule.findall(urls)[0]
    # in the payload, only PageIndex (idx) needs to change between requests
    payload = {
        "CategoryType": "SiteHome",
        "ParentCategoryId": 0,
        "CategoryId": 808,
        "PageIndex": int(idx),
        "TotalPostCount": 4000,
        "ItemListActionName": "AggSitePostList"
    }
    r = requests.post(url, data=json.dumps(payload), headers=headers)
    return r.text

def parse(html):
    # article links are the <a> tags with class "post-item-title"
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    return [(link["href"], link.get_text()) for link in links]

if __name__ == '__main__':
    for res in parse(craw(urls[2])):
        print(res)
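
The comment in craw hints at the producer-consumer code that comes later in the series. Below is a minimal sketch of how craw and parse might be wired into that pattern with queue.Queue and threading, assuming the script above is in scope; the thread counts and the output filename are illustrative, not from the original post.

import queue
import threading

url_queue = queue.Queue()
html_queue = queue.Queue()
write_lock = threading.Lock()

def do_craw():
    # producer: pull a page url, download it, hand the html to the parsers
    while True:
        page_url = url_queue.get()
        html_queue.put(craw(page_url))
        url_queue.task_done()

def do_parse(fout):
    # consumer: pull html, extract (href, title) pairs, write them to a file
    while True:
        html = html_queue.get()
        for href, title in parse(html):
            with write_lock:  # text-mode file objects are not thread-safe
                fout.write(f"{href}\t{title}\n")
        html_queue.task_done()

if __name__ == '__main__':
    for page_url in urls:
        url_queue.put(page_url)
    with open("blog_titles.txt", "w", encoding="utf-8") as fout:
        for _ in range(3):  # downloader threads (illustrative count)
            threading.Thread(target=do_craw, daemon=True).start()
        for _ in range(2):  # parser threads (illustrative count)
            threading.Thread(target=do_parse, args=(fout,), daemon=True).start()
        url_queue.join()   # wait until every page has been downloaded
        html_queue.join()  # wait until every page has been parsed

Because every queue entry is matched by a task_done() call, the two join() calls make the main thread wait for all work to finish before the daemon threads are discarded on exit.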

Original: https://www.cnblogs.com/lovy-ivy/p/16551416.html
Author: Lovyya
Title: Scraping cnblogs with a Python crawler
