Python in Practice: A Quick Start with the BeautifulSoup Library, Scraping Column Titles and URLs

Updated: 2021-10-20 14:48:50  Author: 小旺不正经

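BeautifulSoup is an essential tool for anyone learning web scraping. Its main job is to extract data from the HTML of a web page: it parses markup, while the download itself is left to a library such as requests. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.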
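
The encoding behaviour can be checked without any network access. The sketch below is only an illustration: the GBK-encoded byte string and the `from_encoding` hint are assumptions made for the demo, not part of the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes in GBK, as an older Chinese site might serve them.
raw = "<p>你好, Beautiful Soup</p>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the bytes;
# without it the library tries to detect the encoding on its own.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()  # a Unicode str, whatever the input encoding was
out = soup.encode()       # bytes; the default output encoding is UTF-8

print(text)
print(out.decode("utf-8"))
```

Passing `from_encoding` is optional; it is used here so the result does not depend on automatic detection.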

Installation

pip install beautifulsoup4
# If the command above fails, install from a mirror instead
pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple

Run the commands in the PyCharm terminal.
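
To confirm that the installation worked, a quick check like the following can be run in the same interpreter. This is a small sketch rather than an official procedure; the one-line test document is made up for the purpose.

```python
import bs4
from bs4 import BeautifulSoup

# Parse a tiny document to confirm the library imports and runs.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
version = bs4.__version__

print("beautifulsoup4", version, "parsed:", soup.p.text)
```

An `ImportError` here usually means pip installed the package into a different interpreter than the one PyCharm is using.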

Parsing tags

from bs4 import BeautifulSoup
import requests
url = 'https://blog.csdn.net/weixin_42403632/category_11076268.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}
html = requests.get(url, headers=headers).text
s = BeautifulSoup(html, 'html.parser')
title = s.select('h2')
for i in title:
    print(i.text)

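Line 1: imports the BeautifulSoup class.

Line 2: imports requests.

Lines 3, 4 and 5: download the HTML of the URL, sending a browser User-Agent header.

Line 6: creates the BeautifulSoup object; 'html.parser' selects Python's built-in HTML parser.

Line 7: selects every <h2> tag. The loop on the last two lines prints the text of each match.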
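
The same tag selection can be tried offline, which makes the result easy to predict. The sketch below uses a made-up HTML string in place of the downloaded page; the heading texts are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for the downloaded HTML.
html = """
<html><body>
  <h2>First column</h2>
  <p>Some introduction</p>
  <h2>Second column</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
headings = [h.get_text(strip=True) for h in soup.select("h2")]
print(headings)
```

`select()` always returns a list, so an empty result is an empty list rather than an error.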

Parsing attributes

BeautifulSoup can also select page elements by specific attributes.

Selecting by class

from bs4 import BeautifulSoup
import requests
url = 'https://blog.csdn.net/weixin_42403632/category_11076268.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}
html = requests.get(url, headers=headers).text
s = BeautifulSoup(html, 'html.parser')
title = s.select('.column_article_title')  # a leading dot selects by class
for i in title:
    print(i.text)
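
A class selector can be tested on a small inline document as well. In the sketch below the markup is invented for illustration (only the class name `column_article_title` comes from the article), and `find_all(class_=...)` is shown as an equivalent way to express the same query.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second element carries an extra class.
html = """
<div class="column_article_title">First article</div>
<div class="column_article_title pinned">Second article</div>
<div class="column_article_desc">Not a title</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector form: a leading dot means "has this class".
by_css = [t.get_text(strip=True) for t in soup.select(".column_article_title")]

# Keyword form: class_ is used because "class" is a reserved word in Python.
by_find = [t.get_text(strip=True) for t in soup.find_all(class_="column_article_title")]

print(by_css)
print(by_find)
```

Both forms match an element as long as the class is one of its classes, so the element with the extra `pinned` class is still found.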

Selecting by ID

from bs4 import BeautifulSoup
html = '''<div class="crop-img-before">
         <img src="" alt="" id="cropImg">
      </div>
        <div id='title'>
        Test succeeded
        </div>
      <div class="crop-zoom">
         <a href="javascript:;" rel="external nofollow" class="bt-reduce">-</a><a href="javascript:;" rel="external nofollow" class="bt-add">+</a>
      </div>
      <div class="crop-img-after">
         <div  class="final-img"></div>
      </div>'''
s = BeautifulSoup(html, 'html.parser')
title = s.select('#title')  # '#title' matches the element whose id is "title"
for i in title:
    print(i.text)
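
Because an ID should identify a single element, `select_one()` is often more convenient than looping over a list. The following sketch uses a shortened, made-up version of the HTML above; the `#no-such-id` selector is there only to show what a miss looks like.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document.
html = """
<div id="title">
  Test succeeded
</div>
<div id="other">ignored</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match, or None when nothing matches.
node = soup.select_one("#title")
title = node.get_text(strip=True) if node is not None else None

missing = soup.select_one("#no-such-id")

print(title)
print(missing)
```

Checking for `None` before calling `get_text()` avoids an `AttributeError` when the page layout changes.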

Multi-level selection

from bs4 import BeautifulSoup
html = '''<div class="crop-img-before">
         <img src="" alt="" id="cropImg">
      </div>
        <div id='title'>
        456456465
        <h1>Test succeeded</h1>
        </div>
      <div class="crop-zoom">
         <a href="javascript:;" rel="external nofollow" class="bt-reduce">-</a><a href="javascript:;" rel="external nofollow" class="bt-add">+</a>
      </div>
      <div class="crop-img-after">
         <div  class="final-img"></div>
      </div>'''
s = BeautifulSoup(html, 'html.parser')
# '#title' alone returns the whole block, so both the number and the <h1> text are printed
title = s.select('#title')
for i in title:
    print(i.text)
# '#title h1' narrows the match to <h1> tags nested inside #title
title = s.select('#title h1')
for i in title:
    print(i.text)
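
A space in a selector means "any descendant", while `>` restricts the match to direct children. The sketch below contrasts the two on an invented document; the `<section>` wrapper is an assumption added to make the difference visible.

```python
from bs4 import BeautifulSoup

# Hypothetical document with one direct and one deeper <h1>.
html = """
<div id="title">
  <h1>direct child</h1>
  <section><h1>nested deeper</h1></section>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): every <h1> anywhere inside #title.
descendants = [h.get_text() for h in soup.select("#title h1")]

# Child combinator (>): only <h1> tags that are direct children of #title.
children = [h.get_text() for h in soup.select("#title > h1")]

print(descendants)
print(children)
```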

Extracting the URLs from <a> tags

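# s is the BeautifulSoup object built in the previous example
title = s.select('a')
for i in title:
    print(i['href'])  # subscripting a tag reads one of its attributes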
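
Subscripting with `i['href']` raises `KeyError` when a tag has no `href`, and real pages often contain relative or `javascript:` links. The sketch below shows one possible way to handle these cases; the base URL and the four links are invented for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = "https://example.com/blog/"  # hypothetical URL of the page being parsed

html = """
<a href="/post/1">absolute path</a>
<a href="post/2">relative path</a>
<a name="anchor-only">no href</a>
<a href="javascript:;">script link</a>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
# 'a[href]' only matches <a> tags that actually have an href attribute.
for a in soup.select("a[href]"):
    href = a["href"]
    if href.startswith("javascript:"):
        continue  # not a real address
    links.append(urljoin(base, href))  # resolve relative links against the page URL

# .get() returns None instead of raising when the attribute is absent.
no_href = soup.select_one('a[name="anchor-only"]').get("href")

print(links)
print(no_href)
```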

In practice: getting the titles and URLs of a blog column

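from bs4 import BeautifulSoup
import requests
import re
url = 'https://blog.csdn.net/weixin_42403632/category_11298953.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}
html = requests.get(url, headers=headers).text
s = BeautifulSoup(html, 'html.parser')
# every link inside the article list of the column page
title = s.select('.column_article_list li a')
for i in title:
    # the link text contains the badge '原创' ("original"); the title is on the line after it
    match = re.search(r'原创.*?\n(.*?)\n', i.text)
    if match:  # skip links whose text does not follow this layout
        print(match.group(1).lstrip())
        print(i['href'])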
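
The extraction step can also be written without a regular expression and tested offline. The sketch below is an alternative, not the author's method: the markup is a guess at what a column list might look like (only the `column_article_list` class and the '原创' badge are taken from the code above), and the URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a column list; the real page may differ.
html = """
<ul class="column_article_list">
  <li>
    <a href="https://example.com/article/1">
      <h2><span>原创</span>
        First article
      </h2>
    </a>
  </li>
  <li>
    <a href="https://example.com/article/2">
      <h2>
        Second article
      </h2>
    </a>
  </li>
</ul>
"""

BADGE = "原创"  # "original" badge shown before some titles

soup = BeautifulSoup(html, "html.parser")

pairs = []
for a in soup.select(".column_article_list li a"):
    # stripped_strings yields each text fragment with whitespace removed.
    parts = [text for text in a.stripped_strings if text != BADGE]
    if parts and a.get("href"):
        pairs.append((parts[0], a["href"]))

for title, link in pairs:
    print(title)
    print(link)
```

Collecting the results in a list of `(title, url)` tuples, instead of printing inside the loop, makes it straightforward to save them to a file later.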

