Chapter 2: Advanced HTML Parsing

bsObj.findAll(tagName, tagAttributes)

.get_text() strips out all hyperlinks, paragraphs, and other tags, leaving only a string of tag-free text.
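As a minimal sketch of what .get_text() does, the snippet below parses a small hand-written HTML string (a hypothetical example, not one of the pages scraped in these notes) with the built-in "html.parser", so it runs without network access or lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate get_text().
html = '<p>Hello, <a href="/x">world</a>! <span class="green">Bye</span></p>'
soup = BeautifulSoup(html, "html.parser")

# get_text() joins every text node and drops all the tags around them.
text = soup.get_text()
print(text)
```

Because the tags are gone afterwards, it is usually safer to call .get_text() only once the tag structure is no longer needed.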

findAll(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)
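The following sketch illustrates how the recursive and limit arguments change the result, and how find relates to findAll. The HTML string is a made-up example, and the code uses find_all, the current spelling of findAll in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two spans are direct children of the div, one is nested in a <p>.
html = """
<div id="outer">
  <span class="green">one</span>
  <p><span class="green">two</span></p>
  <span class="red">three</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", {"id": "outer"})

all_spans = outer.find_all("span")                       # recursive=True by default: all descendants
direct_spans = outer.find_all("span", recursive=False)   # direct children only
first_two = outer.find_all("span", limit=2)              # stop after two matches
first = outer.find("span")                               # like find_all with limit=1, but returns one tag

print(len(all_spans), len(direct_spans), len(first_two), first.get_text())
```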

.findAll({"h1","h2","h3","h4","h5","h6"})

.findAll("span", {"class":{"green", "red"}})
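A small sketch of matching several tag names or several attribute values at once. It passes lists rather than the sets shown above, since a list is the form the Beautiful Soup documentation describes; the HTML string is a hypothetical example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with headings and colored spans.
html = (
    "<h1>Title</h1><h2>Subtitle</h2><p>body</p>"
    '<span class="green">a</span><span class="red">b</span><span class="blue">c</span>'
)
soup = BeautifulSoup(html, "html.parser")

# A list of tag names matches any tag whose name is in the list.
headings = soup.find_all(["h1", "h2", "h3"])

# A list of attribute values matches a tag carrying any one of them.
colored = soup.find_all("span", {"class": ["green", "red"]})

print([h.name for h in headings])
print([s.get_text() for s in colored])
```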

nameList = bsObj.findAll(text="the prince")
print(len(nameList))
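To make the text search concrete, here is a sketch on a hand-written string (a hypothetical example). It uses string=, the newer name for the text argument in Beautiful Soup 4. The match is against the exact text content of a node, so it is case-sensitive.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "the prince" appears twice, "The prince" once.
html = (
    '<p><span class="green">the prince</span> met '
    '<span class="green">The prince</span> and '
    '<span class="green">the prince</span>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Returns the matching text nodes themselves, not the tags that contain them.
matches = soup.find_all(string="the prince")
print(len(matches))
```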

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, features="lxml")
namelist = bsObj.findAll("span", {"class": "green"})
for name in namelist:
    print(name.get_text())

output

Anna Pavlovna Scherer
Empress Marya Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovnas
Her Majesty
Baron Funke
The prince
Anna Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna Pavlovna
Anna Pavlovna

 

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, features="lxml")
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)

output

<tr><th> Item Title </th><th> Description </th><th> Cost </th><th> Image </th></tr>
<tr class="gift" id="gift1"><td> Vegetable Basket </td><td> This vegetable basket is the perfect gift for your health conscious (or overweight) friends! <span class="excitingNote">Now with super-colorful bell peppers!</span></td><td> $15.00</td><td><img src="../img/gifts/img1.jpg"/></td></tr>
<tr class="gift" id="gift2"><td> Russian Nesting Dolls </td><td> Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span></td><td> $10,000.52</td><td><img src="../img/gifts/img2.jpg"/></td></tr>
<tr class="gift" id="gift3"><td> Fish Painting </td><td> If something seems fishy about this painting, its because its a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span></td><td> $10,005.00</td><td><img src="../img/gifts/img3.jpg"/></td></tr>
<tr class="gift" id="gift4"><td> Dead Parrot </td><td> This is an ex-parrot! <span class="excitingNote">Or maybe hes only resting?</span></td><td> $0.50</td><td><img src="../img/gifts/img4.jpg"/></td></tr>
<tr class="gift" id="gift5"><td> Mystery Box </td><td> If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span></td><td> $1.50</td><td><img src="../img/gifts/img6.jpg"/></td></tr>

Working with sibling tags

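Replace bsObj.find("table", {"id": "giftList"}).children with the following to skip the header row. A tag is not its own sibling, so iterating over the siblings that follow the first tr leaves the title row out: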

for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
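The same idea can be tried offline. The sketch below uses a cut-down, hand-written version of the gift table (an assumption for illustration, not the real page) and keeps only Tag objects, since on a real page the siblings can also include whitespace text nodes.

```python
from bs4 import BeautifulSoup, Tag

# Simplified stand-in for the giftList table: one header row, two gift rows.
html = (
    '<table id="giftList">'
    "<tr><th>Item Title</th></tr>"
    '<tr id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td></tr>'
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# table.tr is the first row (the header); next_siblings yields everything after it.
rows = [sib for sib in table.tr.next_siblings if isinstance(sib, Tag)]
ids = [row["id"] for row in rows]
print(ids)
```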

Working with parent tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, features="lxml")
print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
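The chain above reads as: find the image, step up to its parent td, step back to the previous td (the price cell), and take its text. A minimal offline sketch, assuming a single simplified row modeled on the first gift in the output shown earlier:

```python
from bs4 import BeautifulSoup

# One simplified row: title cell, price cell, image cell (no whitespace between cells).
html = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
cell = img.parent                    # the <td> that wraps the image
price_cell = cell.previous_sibling   # the <td> just before it, holding the price
price = price_cell.get_text()
print(price)
```

If the markup had whitespace between the cells, previous_sibling could return a text node instead of the td, so this navigation depends on the exact layout of the page.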

Regular expressions

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
# The parser name must be a string; a bare lxml raises NameError.
bsObj = BeautifulSoup(html, features="lxml")
# A raw string keeps the backslashes intact for the regex engine.
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

output

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
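To see why the regular expression is useful, the sketch below runs the same pattern over a hand-written string (a hypothetical example) that also contains images the pattern should reject, such as a logo outside the gifts folder.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: only the two product images live under ../img/gifts/ with an img*.jpg name.
html = (
    '<img src="../img/logo.jpg"/>'
    '<img src="../img/gifts/img1.jpg"/>'
    '<img src="../img/gifts/img2.jpg"/>'
    '<img src="../img/gifts/banner.png"/>'
)
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern as the attribute value keeps tags whose src matches it.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
images = soup.find_all("img", {"src": pattern})
srcs = [image["src"] for image in images]
print(srcs)
```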

 

2019-10-10

17:22:17
