I'm trying to scrape this website using Python and Selenium. However, not all the information I need is on the main page. How can I click each link in the 'Application number' column in turn, go to that page, scrape the information, and then return to the original page?
I've tried:
def getData():
    data = []
    select = Select(driver.find_elements_by_xpath('//*[@id="node-41"]/div/div/div/div/div/div[1]/table/tbody/tr/td/a/@href'))
    list_options = select.options
    for item in range(len(list_options)):
        item.click()
    driver.get(url)
URL: http://www.scilly.gov.uk/planning-development/planning-applications
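Your attempt fails for three reasons: Select only wraps <select> elements, an XPath ending in /@href selects attribute nodes rather than elements, and range(len(...)) yields integers, which have no click() method. To open each href of a web table in its own tab and scrape it with Selenium, you can use the following solution: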
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
hrefs = []
options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
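# options= replaces the deprecated chrome_options= keyword.
# Selenium 4.10+ no longer accepts executable_path here; pass service=Service(executable_path=...) instead.
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\ChromeDriver\chromedriver_win32\chromedriver.exe')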
driver.get('http://www.scilly.gov.uk/planning-development/planning-applications')
windows_before = driver.current_window_handle # Store the parent_window_handle for future use
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "td.views-field.views-field-title>a"))) # Induce WebDriverWait for the visibility of the desired elements
for element in elements:
    hrefs.append(element.get_attribute("href")) # Collect the required href attributes and store them in a list
for href in hrefs:
    driver.execute_script("window.open('" + href + "');") # Open each href in a new tab through execute_script
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2)) # Wait until exactly two windows are open
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != windows_before][0] # Identify the newly opened window
    # driver.switch_to_window(new_window) is deprecated
    driver.switch_to.window(new_window) # Switch to the new window
    # Perform your web scraping here
    print(driver.title) # Print the page title, or scrape whatever you need from the page
    driver.close() # Close the new window
    # driver.switch_to_window(windows_before) is deprecated
    driver.switch_to.window(windows_before) # Switch back to the parent window
driver.quit() # Quit the browser once every href has been visited
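On Selenium 4 or later, the open-and-find-the-handle steps above can probably be shortened with driver.switch_to.new_window('tab'), which opens a tab and switches to it in one call. What follows is only a minimal sketch under that assumption, not a drop-in replacement: scrape_in_new_tab and its extract callback are hypothetical names introduced here, and driver is assumed to be an already-started WebDriver.

```python
def scrape_in_new_tab(driver, href, extract):
    """Open href in a fresh tab, return extract(driver), then restore the parent tab.

    Hypothetical helper: `extract` is any callable that reads what you need
    from the loaded page, for example `lambda d: d.title`.
    Assumes Selenium 4+, where switch_to.new_window() is available.
    """
    parent = driver.current_window_handle   # remember the listing tab
    driver.switch_to.new_window("tab")      # opens a tab and focuses it
    try:
        driver.get(href)
        return extract(driver)
    finally:
        # Runs even if loading or extraction raises, so tabs do not pile up.
        driver.close()
        driver.switch_to.window(parent)
```

With the hrefs list collected earlier, the loop body would then reduce to something like print(scrape_in_new_tab(driver, href, lambda d: d.title)). Passing the URL to driver.get also avoids building a JavaScript string by concatenation, which would break if an href contained a quote character.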
Console Output:
Planning application: P/18/064 | Council of the ISLES OF SCILLY
Planning application: P/18/063 | Council of the ISLES OF SCILLY
Planning application: P/18/062 | Council of the ISLES OF SCILLY
Planning application: P/18/061 | Council of the ISLES OF SCILLY
Planning application: p/18/059 | Council of the ISLES OF SCILLY
Planning application: P/18/058 | Council of the ISLES OF SCILLY
Planning application: P/18/057 | Council of the ISLES OF SCILLY
Planning application: P/18/056 | Council of the ISLES OF SCILLY
Planning application: P/18/055 | Council of the ISLES OF SCILLY
Planning application: P/18/054 | Council of the ISLES OF SCILLY
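If separate tabs are not required, the same result can likely be had in a single tab. Because the hrefs are collected as plain strings before any navigation happens, no element references go stale when the page changes. The sketch below assumes that approach; scrape_in_same_tab and extract are hypothetical names, and driver is again assumed to be an already-started WebDriver that has the listing page loaded.

```python
def scrape_in_same_tab(driver, hrefs, extract):
    """Visit each URL in the current tab and collect extract(driver) per page.

    Hypothetical helper: `hrefs` is a list of URL strings gathered beforehand,
    and `extract` is any callable that reads data from the loaded page.
    """
    listing_url = driver.current_url   # remember where we started
    results = []
    for href in hrefs:
        driver.get(href)               # navigate directly, no click needed
        results.append(extract(driver))
    driver.get(listing_url)            # return to the original listing page
    return results
```

For example, titles = scrape_in_same_tab(driver, hrefs, lambda d: d.title) should yield the same titles as the console output above, at the cost of reloading the listing page once at the end.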