Scraping Dynamic Website Data with selenium - 数据天堂

# Front end 2024-04-03 15:40 Source: 数据天堂

Scraping pages whose content is loaded dynamically

selenium

selenium is a third-party Python library that drives a real browser to perform automated actions, such as clicking buttons or scrolling the page.

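Environment setup:
- Install: pip install selenium
- Get the browser driver (ChromeDriver). Download: http://chromedriver.storage.googleapis.com/index.html
- The driver version must match the browser version; see: https://blog.csdn.net/ezreal_tao/article/details/80808729
  Setting the Chrome browser to headless (no UI) mode: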
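The article leaves the headless setup without an example, so here is a minimal sketch of how it is commonly done in Selenium 4. The function names `headless_chrome_arguments` and `open_headless_chrome` are made up for illustration, the `./chromedriver` path follows the rest of this article, and the window size is an arbitrary assumption.

```python
def headless_chrome_arguments():
    """Command-line switches that ask Chrome to run without a visible window."""
    return ["--headless", "--disable-gpu", "--window-size=1920,1080"]


def open_headless_chrome(driver_path="./chromedriver"):
    """Start Chrome in headless mode (Selenium 4 style; driver_path is an assumption)."""
    # Imported here so the helper above can be inspected without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

Calling `open_headless_chrome()` should return a browser object that is used exactly like a normal one (`get`, `page_source`, `quit`), only without a window appearing on screen.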
Coding workflow:

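    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    import time

    # Create a browser object; Service takes the path to the driver.
    # (Selenium 3 used webdriver.Chrome(executable_path='./chromedriver').)
    bro = webdriver.Chrome(service=Service('./chromedriver'))
    # get() makes the browser request the given URL
    bro.get('https://www.baidu.com')
    # Have the browser search for a given term
    '''
    Locate the target element with one of the strategies below, then act on it.
    Selenium 3 spelled these find_element_by_id, find_elements_by_name, and so on;
    Selenium 4 uses find_element(By.X, value) / find_elements(By.X, value):
    By.ID          find a node by id
    By.NAME        find by name
    By.XPATH       find by XPath
    By.TAG_NAME    find by tag name
    By.CLASS_NAME  find by class name
    '''
    text = bro.find_element(By.ID, 'kw')
    text.send_keys('人民币')  # send_keys types the given text into the input box
    time.sleep(3)
    button = bro.find_element(By.ID, 'su')
    button.click()  # click performs a click on the element
    time.sleep(5)
    bro.quit()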
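The fixed `time.sleep` pauses above either waste time or turn out too short on a slow connection. One possible alternative is to poll until the element actually shows up. Selenium ships `WebDriverWait` for this purpose; the sketch below hand-rolls the same idea in plain Python so it can be read in isolation. The helper name `wait_until` and its default values are assumptions, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Possible use with the browser object from the example above (not executed here):
#   from selenium.webdriver.common.by import By
#   buttons = wait_until(lambda: bro.find_elements(By.ID, 'su'))
#   buttons[0].click()
```

`find_elements` returns an empty list when nothing matches, so the lambda is falsy until the button exists, and the wait ends as soon as it appears.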

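PhantomJS

PhantomJS is a headless browser; its automation workflow is identical to the Chrome workflow above. Note that PhantomJS is no longer maintained and Selenium 4 dropped its PhantomJS driver, so headless Chrome is the usual replacement today.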

    from selenium import webdriver

Using selenium to scrape dynamic data from the Douban comedy movie ranking

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from lxml import etree
    import time

    # Selenium 4 syntax; Selenium 3 accepted the driver path as the first argument
    bro = webdriver.Chrome(service=Service('./chromedriver'))
    url = 'https://movie.douban.com/typerank?type_name=%E5%96%9C%E5%89%A7&type=24&interval_id=100:90&action='
    bro.get(url)
    # Wait five seconds for the page to finish loading
    time.sleep(5)
    # Scroll to the bottom of the page 20 times so that more entries are loaded
    for i in range(20):
        time.sleep(2)
        bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    # Get the page source; any of the three parsing approaches would work, XPath is used here
    text = bro.page_source
    tree = etree.HTML(text)
    div_list = tree.xpath('//div[@class="movie-info"]')
    count = 0
    with open('douban_comedy_ranking.txt', 'w', encoding='utf-8') as f:
        for div in div_list:
            # Extract the details of each movie and persist them to the file
            try:
                name = div.xpath('./div[@class="movie-name"]/span/a/text()')[0]
                link = div.xpath('./div[@class="movie-name"]/span/a/@href')[0]
                man = div.xpath('./div[@class="movie-crew"]/text()')[0]
                country = div.xpath('./div[@class="movie-misc"]/text()')[0]
                num = div.xpath('./div[@class="movie-rating"]/span[2]/text()')[0]
            except IndexError:
                # Skip entries that lack one of the expected fields
                continue
            f.write('Title: ' + name + '\nLink: ' + link + '\nDirector: ' + man + '\nCountry: ' + country + '\nRating: ' + num + '\n-----------------------------\n\n\n')
            print('Written:', name)
            count += 1
    print('Scraping finished, %s records collected' % count)
    time.sleep(5)
    bro.quit()
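The script above always scrolls exactly 20 times, whether or not the page still has more to load. A possible refinement, sketched here under the assumption that the page height stops growing once everything is loaded, is to stop as soon as a scroll no longer changes `document.body.scrollHeight`. The function name `scroll_until_stable` and its parameters are hypothetical; `driver` can be any object with Selenium's `execute_script` method, such as `bro`.

```python
import time


def scroll_until_stable(driver, max_rounds=20, pause=2.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the page time to fetch and render more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded, so stop early
        last_height = new_height
    return max_rounds
```

In the script above, `scroll_until_stable(bro)` could stand in for the `for i in range(20)` loop. The `max_rounds` cap keeps the original upper bound, so a page that never stops growing still terminates.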