I want to scrape all the data of a page implemented by an infinite scroll. The following Python code works:
for i in range(100):
    # scroll to the bottom, then wait a fixed 5 seconds for new content to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
This means that every time I scroll down to the bottom, I need to wait 5 seconds, which is generally enough for the page to finish loading the newly generated content. But this may not be time-efficient: the page may finish loading the new content in less than 5 seconds. How can I detect whether the page has finished loading the new content each time I scroll down? If I could detect this, I could scroll down again to see more content as soon as I know the page has finished loading, which would be more time-efficient.
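A minimal sketch of that idea, assuming newly loaded content increases document.body.scrollHeight (the timeout and poll interval below are arbitrary choices, and driver is the existing WebDriver instance from the snippet above):

import time

def scroll_and_wait(driver, timeout=10, poll=0.5):
    # remember the current page height, then scroll to the bottom
    last_height = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # poll briefly instead of sleeping a fixed 5 seconds
    deadline = time.time() + timeout
    while time.time() < deadline:
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height > last_height:
            return True   # new content has been loaded, safe to scroll again
        time.sleep(poll)
    return False          # height did not change, probably the end of the feed

# keep scrolling until no more content is loaded
while scroll_and_wait(driver):
    pass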
The webdriver will wait for a page to load by default via the .get() method.
As @user227215 said, since you may be looking for some specific element, you should use WebDriverWait to wait for that element to be located in your page:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")

delay = 3  # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
I have used it for checking alerts; you can use any of the other locator strategies or expected conditions as well.
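A short sketch of what that can look like, reusing browser and delay from the snippet above (the CSS selector here is a made-up placeholder):

# wait until an element matching a CSS selector becomes visible
WebDriverWait(browser, delay).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "div.results"))  # placeholder selector
)

# or wait for a JavaScript alert to appear
WebDriverWait(browser, delay).until(EC.alert_is_present())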
EDIT 1:
I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for AJAX requests. It means that when you use .get('url'), your browser will wait until the page is completely loaded and then go on to the next command in the code. But when you are posting an AJAX request, the webdriver does not wait, and it is your responsibility to wait an appropriate amount of time for the page, or a part of the page, to load; that is why there is a module named expected_conditions.
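Coming back to the infinite-scroll question, each scroll fires such an AJAX request, so the fixed 5-second sleep can be replaced by an explicit wait. The sketch below assumes that each loaded item matches a hypothetical div.item selector and passes a plain lambda to until() as a custom condition:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

ITEM_SELECTOR = "div.item"  # hypothetical selector for one AJAX-loaded item

while True:
    count_before = len(browser.find_elements(By.CSS_SELECTOR, ITEM_SELECTOR))
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # wait until the AJAX response has appended new items instead of sleeping 5 seconds
        WebDriverWait(browser, 10).until(
            lambda d: len(d.find_elements(By.CSS_SELECTOR, ITEM_SELECTOR)) > count_before
        )
    except TimeoutException:
        break  # no new items arrived within the timeout, assume the end of the feed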