Process:
Iterate: seed URL -> traverse elements to extract hyperlinks -> use them as new seed URLs
⚠️ Mind the request frequency and the traversal depth
1. Set the URL and request parameters

import re
import requests
import bs4

headers = {"user-agent": "Baiduspider"}
base_url = "https://www.zhihu.com"
suffix = "/explore"
entry_url = base_url + suffix
2. Fetch the response and parse it

response = requests.get(entry_url, headers=headers).text
soup = bs4.BeautifulSoup(response, "lxml")
page_set = soup.find_all("a")
3. Define the regex pattern, match the hrefs, and collect the results

links = set()
for ele in page_set:
    raw_link = ele.get("href", "")  # .get() avoids a KeyError on <a> tags without href
    proper_link = re.compile(r"^/question/.*").findall(raw_link)
    for final_link in proper_link:
        links.add(base_url + final_link)
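To see what the `^/question/.*` filter keeps and drops, here is a small standalone demo. The sample hrefs are hypothetical, made up to resemble what the page's `<a>` tags might contain; only relative paths starting with `/question/` survive, and they are then absolutized with `base_url`.

```python
import re

base_url = "https://www.zhihu.com"
pattern = re.compile(r"^/question/.*")

# Hypothetical hrefs, for illustration only.
candidates = [
    "/question/12345",                  # kept: relative question link
    "/people/foo",                      # dropped: not a question path
    "https://zhihu.com/question/9",     # dropped: ^ anchors to the string start
    "/question/67/answer/89",           # kept: .* matches the rest of the path
]

links = set()
for href in candidates:
    for match in pattern.findall(href):
        links.add(base_url + match)
```

Note that the anchored `^` rejects absolute URLs, so already-absolute question links would need separate handling if the target page contained any.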
This walkthrough only traverses a single page.
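The iterative loop described at the top (links found on each page become the next seeds, with depth and frequency limits) is not implemented in the steps above. A minimal sketch is below; the function names, `max_depth`, and `delay` parameters are my own choices, not from the original post, and the `parser` argument exists only so the link extractor can also run with the stdlib `"html.parser"`.

```python
import re
import time
import requests
import bs4

BASE_URL = "https://www.zhihu.com"
HEADERS = {"user-agent": "Baiduspider"}
QUESTION_RE = re.compile(r"^/question/.*")

def extract_question_links(html, parser="lxml"):
    """Pull /question/... hrefs out of one page and absolutize them."""
    soup = bs4.BeautifulSoup(html, parser)
    links = set()
    for a in soup.find_all("a"):
        href = a.get("href", "")  # .get() avoids a KeyError on <a> without href
        if QUESTION_RE.match(href):
            links.add(BASE_URL + href)
    return links

def crawl(entry_url, max_depth=2, delay=1.0):
    """Breadth-first crawl: links found at each depth become the next seeds."""
    seen = set()
    frontier = {entry_url}
    for _ in range(max_depth):          # bound the traversal depth
        next_frontier = set()
        for url in frontier - seen:     # skip already-visited pages
            seen.add(url)
            html = requests.get(url, headers=HEADERS).text
            next_frontier |= extract_question_links(html)
            time.sleep(delay)           # throttle the request frequency
        frontier = next_frontier
    return seen
```

The `seen` set prevents re-fetching a page that appears at multiple depths, and the per-request `time.sleep` is the simplest possible throttle; a real crawler would also honor robots.txt and handle request failures.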
Original: https://www.cnblogs.com/wanghuanyeah/p/14462221.html
Author: wanghuanyeah
Title: Spider