Chapter 2: Advanced HTML Parsing - 云博客


Category: Front end | 2024-05-05 17:57 | Source: 云博客

bsObj.findAll(tagName, tagAttributes)

.get_text() strips all hyperlinks, paragraphs, and other tags from the content, leaving only a string of text with no tags.
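As a small illustration of what .get_text() keeps and discards, here is a minimal sketch. The HTML snippet is hypothetical (it is not from the pages scraped in these notes), and the built-in "html.parser" is assumed instead of lxml so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = '<p>Read <a href="/war">War and <b>Peace</b></a> today.</p>'
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p")
# get_text() joins every text node under the tag and drops the markup.
plain_text = paragraph.get_text()
print(plain_text)
```

Because the tags are gone after this call, it is usually best to call .get_text() last, once the elements of interest have already been located.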

findAll(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)
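The arguments in the two signatures above can be tried on a small document. The sketch below is only an illustration: the HTML is made up, "html.parser" is assumed as the parser, and it uses find_all, the current Beautiful Soup 4 spelling of findAll (the older name still works as an alias).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one green span is a direct child, one is nested deeper.
html = """
<div id="outer">
  <span class="green">one</span>
  <div><span class="green">two</span></div>
  <span class="red">three</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", {"id": "outer"})

# recursive defaults to True: every descendant is searched.
all_green = outer.find_all("span", {"class": "green"})
# recursive=False looks at direct children only.
direct_green = outer.find_all("span", {"class": "green"}, recursive=False)
# limit stops the search after the first N matches, in document order.
first_two = outer.find_all("span", limit=2)
# Keyword arguments filter on attributes; class_ avoids the reserved word "class".
by_keyword = outer.find_all("span", class_="red")
# find returns the first match (or None) rather than a list.
first_span = outer.find("span")

print([s.get_text() for s in all_green])
print([s.get_text() for s in direct_green])
print(first_span.get_text())
```

In this sketch find behaves like find_all with limit=1, except that it returns the element itself instead of a one-item list.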

.findAll({"h1","h2","h3","h4","h5","h6"})

.findAll("span", {"class":{"green", "red"}})
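The two calls above pass several tag names, or several attribute values, in one search. A minimal sketch of the same idea follows, with some assumptions: the HTML is invented, "html.parser" is the parser, find_all is used as the current name for findAll, and lists are used in place of the sets shown above, since lists are the form given in the Beautiful Soup documentation.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<h1>Title</h1>
<p>Intro</p>
<h2>Section</h2>
<span class="green">Anna</span>
<span class="red">Vasili</span>
<span class="blue">Pierre</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of tag names matches any tag whose name is in the list.
headings = soup.find_all(["h1", "h2", "h3"])
# A list of attribute values matches a tag that carries any one of them.
colored = soup.find_all("span", {"class": ["green", "red"]})

heading_names = [tag.name for tag in headings]
colored_text = [tag.get_text() for tag in colored]
print(heading_names)
print(colored_text)
```

Both searches are "or" filters: a tag qualifies if it matches any entry in the list, and results come back in document order.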

nameList = bsObj.findAll(text="the prince")
print(len(nameList))
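The text argument matches on the content of text nodes rather than on tag properties. The sketch below shows that the match is exact and case-sensitive; the HTML is a made-up fragment, "html.parser" is assumed, and it uses string=, which newer Beautiful Soup 4 releases prefer over text= (the older name is still accepted).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment loosely modeled on the War and Peace page.
html = """
<p><span class="green">the prince</span> spoke to
<span class="green">Prince Vasili</span>; later
<span class="green">the prince</span> left, but The prince stayed.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, including letter case.
exact = soup.find_all(string="the prince")
# A compiled regular expression matches any text node it can be found in.
loose = soup.find_all(string=re.compile("prince", re.IGNORECASE))

print(len(exact))
print(len(loose))
```

The results here are text nodes, not tags; the enclosing tag of each one is available through its .parent attribute.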

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, features="lxml")
# Character names on this page are wrapped in <span class="green">.
namelist = bsObj.findAll("span", {"class": "green"})
for name in namelist:
    print(name.get_text())

output

Anna Pavlovna Scherer
Empress Marya Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron Funke
The prince
Anna Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna Pavlovna
Anna Pavlovna

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, features="lxml")
# .children yields only the direct children of the table, not deeper descendants.
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)

output

<tr><th> Item Title </th><th> Description </th><th> Cost </th><th> Image </th></tr>
<tr class="gift" id="gift1"><td> Vegetable Basket </td><td> This vegetable basket is the perfect gift for your health conscious (or overweight) friends! <span class="excitingNote">Now with super-colorful bell peppers!</span></td><td> $15.00</td><td><img src="../img/gifts/img1.jpg"/></td></tr>
<tr class="gift" id="gift2"><td> Russian Nesting Dolls </td><td> Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span></td><td> $10,000.52</td><td><img src="../img/gifts/img2.jpg"/></td></tr>
<tr class="gift" id="gift3"><td> Fish Painting </td><td> If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span></td><td> $10,005.00</td><td><img src="../img/gifts/img3.jpg"/></td></tr>
<tr class="gift" id="gift4"><td> Dead Parrot </td><td> This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span></td><td> $0.50</td><td><img src="../img/gifts/img4.jpg"/></td></tr>
<tr class="gift" id="gift5"><td> Mystery Box </td><td> If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span></td><td> $1.50</td><td><img src="../img/gifts/img6.jpg"/></td></tr>

Handling sibling tags

Replacing bsObj.find("table", {"id": "giftList"}).children with the following skips the header row, because a tag is never its own sibling:

for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
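The sibling loop above can be checked offline against a cut-down copy of the gift table. This is only a sketch: the table below is a shortened, hand-written stand-in for page3.html, "html.parser" is assumed, and the isinstance filter is an addition that drops the whitespace text nodes next_siblings also yields.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the giftList table on page3.html.
html = """<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# .tr selects the first row of the table, which is the header row.
header_row = soup.find("table", {"id": "giftList"}).tr
# next_siblings yields everything after the header, including newline strings,
# so keep only real tags.
data_rows = [s for s in header_row.next_siblings if isinstance(s, Tag)]
row_ids = [row["id"] for row in data_rows]
print(row_ids)
```

The header row never appears in the result because the iteration starts after it. previous_siblings works the same way in the opposite direction.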

Handling parent tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, features="lxml")
# From the image, go up to its <td>, then back one cell to the price.
print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
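The parent and sibling hops above can be traced on a hand-written row. In this sketch the HTML is a simplified, indented stand-in for one row of page3.html, and "html.parser" is assumed. Because the cells here are separated by whitespace, previous_sibling lands on a text node, so the sketch also uses find_previous_sibling("td"), which skips ahead to the nearest earlier td tag.

```python
from bs4 import BeautifulSoup

# Simplified, indented stand-in for one gift row.
html = """
<table id="giftList">
  <tr class="gift" id="gift1">
    <td>Vegetable Basket</td>
    <td>$15.00</td>
    <td><img src="../img/gifts/img1.jpg"/></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
cell = img.parent                       # the <td> that wraps the image
raw_previous = cell.previous_sibling    # here: the whitespace between cells
price_cell = cell.find_previous_sibling("td")  # nearest earlier <td> tag
price = price_cell.get_text().strip()
print(price)
```

On markup with no whitespace between cells, previous_sibling and find_previous_sibling("td") return the same element; the second form is simply less sensitive to formatting.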

Regular expressions

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, features="lxml")
# Raw string so the backslashes reach the regex engine unchanged.
images = bsObj.findAll("img", {"src": re.compile(r"\.\.\/img\/gifts\/img.*\.jpg")})
for image in images:
    print(image["src"])

output

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
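The pattern itself can be tested without fetching the page, using only the standard re module. The candidate paths below are invented for the test, apart from the two that appear in the output above. The sketch writes the pattern as a raw string and leaves the forward slashes unescaped, since Python does not require escaping them.

```python
import re

# Same idea as the pattern passed to findAll: a literal "../img/gifts/img",
# then anything, then a literal ".jpg".
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Mostly hypothetical paths chosen to exercise each part of the pattern.
candidates = [
    "../img/gifts/img1.jpg",   # matches
    "../img/gifts/img6.jpg",   # matches
    "../img/gifts/logo.jpg",   # file name does not start with "img"
    "../img/gifts/img3.png",   # wrong extension
    "img/gifts/img2.jpg",      # missing the leading "../"
]

matches = [src for src in candidates if pattern.search(src)]
print(matches)
```

Escaping the dots matters: an unescaped "." matches any character, so a looser pattern would also accept paths that merely resemble the intended ones.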

2019-10-10 17:22:17
