How to run Selenium scrapers in parallel

Date: 2024-04-20

This article covers how to run Selenium scrapers in parallel; the question and its recommended answer are reproduced below for reference.

Problem description

I'm trying to scrape a JavaScript site using Scrapy and Selenium. I open the site with Selenium and the Chrome driver, use Scrapy to collect all the links to the individual listings from the current page, and store them in a list (so far, trying to follow the links with SeleniumRequest and calling back into a parse-new-page function has produced a lot of errors). I then loop through the list of URLs, open each one in the Selenium driver, and scrape the information from the page. At the moment this runs at about 16 pages/minute, which is not ideal given the number of listings on this site. Ideally, I would have the Selenium driver open the links in parallel, along the lines of:

                  How can I make Selenium run in parallel with Scrapy?

                  https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b

However, I don't know how to implement parallel processing in this Selenium-Scrapy code.

    import scrapy
    import time
    from scrapy.selector import Selector
    from scrapy_selenium import SeleniumRequest
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    class MarketPagSpider(scrapy.Spider):
        name = 'marketPagination'

        responses = []

        def start_requests(self):
            yield SeleniumRequest(
                url="https://www.cryptoslam.io/nba-top-shot/marketplace",
                wait_time=5,
                wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
                callback=self.parse
            )

        def parse(self, response):
            # initialize driver
            driver = response.meta['driver']
            driver.set_window_size(1920, 1080)

            time.sleep(1)
            WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
            )

            # collect the listing links from the rendered page
            rows = response.xpath("//tbody/tr[@role='row']")
            for row in rows:
                link = row.xpath(".//td[4]/a/@href").get()
                absolute_url = response.urljoin(link)
                self.responses.append(absolute_url)

            # visit each listing URL sequentially in the same driver
            for resp in self.responses:
                driver.get(resp)
                html = driver.page_source
                response_obj = Selector(text=html)

                yield {
                    'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                    'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
                }


I know scrapy-splash can handle multiple processes, but the site I'm trying to scrape won't open in Splash (at least, I don't think it will).

Also, I've removed the pagination lines to keep the code concise.

I'm very new to this and open to any suggestions or solutions for multiprocessing with Selenium.

Recommended answer

The following example program creates a thread pool with just 2 threads for demonstration purposes and then scrapes 4 URLs to get their titles:

                  from multiprocessing.pool import ThreadPool
                  from bs4 import BeautifulSoup
                  from selenium import webdriver
                  import threading
                  import gc
                  
                  class Driver:
                      def __init__(self):
                          options = webdriver.ChromeOptions()
                          options.add_argument("--headless")
                          # suppress logging:
                          options.add_experimental_option('excludeSwitches', ['enable-logging'])
                          self.driver = webdriver.Chrome(options=options)
                          print('The driver was just created.')
                  
                      def __del__(self):
                          self.driver.quit() # clean up driver when we are cleaned up
                          print('The driver has terminated.')
                  
                  
                  threadLocal = threading.local()
                  
                  def create_driver():
                      the_driver = getattr(threadLocal, 'the_driver', None)
                      if the_driver is None:
                          the_driver = Driver()
                          setattr(threadLocal, 'the_driver', the_driver)
                      return the_driver.driver
                  
                  
                  def get_title(url):
                      driver = create_driver()
                      driver.get(url)
                      source = BeautifulSoup(driver.page_source, "lxml")
                      title = source.select_one("title").text
                      print(f"{url}: '{title}'")
                  
                  # just 2 threads in our pool for demo purposes:
                  with ThreadPool(2) as pool:
                      urls = [
                          'https://www.google.com',
                          'https://www.microsoft.com',
                          'https://www.ibm.com',
                          'https://www.yahoo.com'
                      ]
                      pool.map(get_title, urls)
                      # must be done before terminate is explicitly or implicitly called on the pool:
                      del threadLocal
                      gc.collect()
                  # pool.terminate() is called at exit of with block
                  

This prints:

                  The driver was just created.
                  The driver was just created.
                  https://www.google.com: 'Google'
                  https://www.microsoft.com: 'Microsoft - Official Home Page'
                  https://www.ibm.com: 'IBM - United States'
                  https://www.yahoo.com: 'Yahoo'
                  The driver has terminated.
                  The driver has terminated.
                  
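To apply this thread-pool pattern back to the question's spider, the sequential `for resp in self.responses: driver.get(resp)` loop can be replaced with a `ThreadPool.map` over the collected listing URLs, with each worker thread lazily creating and reusing its own thread-local driver. Below is a minimal sketch of that structure; `FakeDriver` and `scrape_listing` are illustrative stand-ins so the pattern can be shown without starting a browser (a real version would reuse the `Driver`/`create_driver` definitions above and the question's XPath lookups):

```python
from multiprocessing.pool import ThreadPool
import threading

# Stand-in for a real Selenium driver; in the actual spider, replace
# this with the Driver class and create_driver() from the answer above.
_creation_lock = threading.Lock()

class FakeDriver:
    created = 0  # counts how many drivers were actually built

    def __init__(self):
        with _creation_lock:
            FakeDriver.created += 1

    def fetch_title(self, url):
        return f"title of {url}"

thread_local = threading.local()

def get_thread_driver():
    # One driver per worker thread, created lazily on first use and
    # reused for every subsequent URL handled by that thread.
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()
        thread_local.driver = driver
    return driver

def scrape_listing(url):
    # Replaces one iteration of the sequential driver.get() loop.
    driver = get_thread_driver()
    return {"url": url, "name": driver.fetch_title(url)}

# hypothetical listing URLs standing in for self.responses
urls = [f"https://example.com/listing/{i}" for i in range(4)]
with ThreadPool(2) as pool:
    # map() preserves input order, so results line up with urls
    results = pool.map(scrape_listing, urls)

print(f"{len(results)} listings scraped with {FakeDriver.created} driver(s)")
```

In the spider, yielding the mapped results would replace the per-URL `yield` in `parse`; since at most as many drivers as pool threads are ever created, the per-page browser start-up cost disappears while the pages themselves load concurrently.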

That concludes this article on running Selenium scrapers in parallel; hopefully the recommended answer above is helpful.

