    Scraping JavaScript-rendered web pages with Python

      Date: 2024-04-21

                This article covers how to use Python to scrape web pages whose content is generated by JavaScript. It may be a useful reference for anyone facing the same problem.

                Question

                I'm trying to develop a simple web scraper. I want to extract the text without the HTML code. This works for static pages, but on pages where JavaScript loads content I don't get good results.

                For example, if some JavaScript code adds some text, I can't see it, because when I call

                response = urllib2.urlopen(request)
                

                I get the original text without the added text (because JavaScript is executed on the client).

                So, I'm looking for some ideas to solve this problem.

                Answer

                EDIT Sept 2021: PhantomJS isn't maintained any more, either.

                EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.

                dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found that using Selenium's Python library with PhantomJS as the web driver is fast enough and makes it easy to get the work done.

                Once you have installed PhantomJS, make sure the phantomjs binary is available on the current path:

                phantomjs --version
                # result:
                2.1.1
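
The same check can be made from Python before starting the driver. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_binary` is hypothetical and not part of the original answer.

```python
import shutil


def find_binary(name):
    """Return the full path of an executable found on PATH, or None if absent."""
    return shutil.which(name)


path = find_binary("phantomjs")
# Report the location, or say so when the binary cannot be found
print(path if path else "phantomjs not found on PATH")
```

If the lookup returns `None`, Selenium will not be able to start the binary by name either, so the PATH has to be fixed first.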
                

                #Example

                To give an example, I created a sample page with the following HTML code:

                <!DOCTYPE html>
                <html>
                <head>
                  <meta charset="utf-8">
                  <title>Javascript scraping test</title>
                </head>
                <body>
                  <p id='intro-text'>No javascript support</p>
                  <script>
                     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
                  </script> 
                </body>
                </html>
                

                Without JavaScript the page says "No javascript support"; with JavaScript it says "Yay! Supports javascript".
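
To see why a parser that does not execute scripts only ever sees the first message, the sample page can be parsed offline with the standard library. This is an illustrative sketch, not part of the original answer; `SAMPLE_HTML` and `IntroTextParser` are hypothetical names.

```python
from html.parser import HTMLParser

# The sample page from above, held as a string so no network access is needed
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>
"""


class IntroTextParser(HTMLParser):
    """Collect the text inside the <p> element whose id is 'intro-text'."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "intro-text":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False

    def handle_data(self, data):
        # The <script> body also arrives here as data, but it is never executed
        if self._inside:
            self.text += data


parser = IntroTextParser()
parser.feed(SAMPLE_HTML)
print(parser.text)
```

The parser treats the script as inert text, so the paragraph keeps its original content. Only something that actually runs the script, such as a browser, produces the second message.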

                #Scraping without JS support:

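                import requests
                from bs4 import BeautifulSoup

                # my_url is the address of the sample page above
                response = requests.get(my_url)
                # Name the parser explicitly so bs4 does not have to guess one
                soup = BeautifulSoup(response.text, "html.parser")
                soup.find(id="intro-text")
                # Result:
                # <p id="intro-text">No javascript support</p>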
                

                #Scraping with JS support:

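                # Needs Selenium 3.x: webdriver.PhantomJS and find_element_by_id
                # are not available in current Selenium 4 releases
                from selenium import webdriver

                driver = webdriver.PhantomJS()
                driver.get(my_url)
                p_element = driver.find_element_by_id(id_='intro-text')
                print(p_element.text)
                driver.quit()  # shut down the phantomjs process
                # Result:
                # Yay! Supports javascript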
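
Because PhantomJS is no longer maintained, the same idea can be sketched with headless Chrome driven by Selenium 4. This is a swapped-in alternative, not the method from the original answer. It assumes Selenium 4 and a local Chrome installation, and the helper names `make_headless_chrome` and `rendered_text` are hypothetical.

```python
def make_headless_chrome():
    """Start Chrome without a visible window (assumes Selenium 4 and Chrome)."""
    # Imported here so rendered_text can be loaded without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def rendered_text(driver, url, element_id):
    """Load url in the driver and return the element's text as rendered."""
    driver.get(url)
    # "id" is the value of Selenium's By.ID locator constant
    return driver.find_element("id", element_id).text
```

Calling `rendered_text(make_headless_chrome(), my_url, "intro-text")` should return the updated text for the sample page, provided Chrome is installed. Call `quit()` on the driver afterwards. Pages that fill in content asynchronously may need an explicit wait, for example Selenium's `WebDriverWait`, before the text is read.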
                


                You can also use the Python library dryscrape to scrape JavaScript-driven websites.

                #Scraping with JS support:

                import dryscrape
                from bs4 import BeautifulSoup

                session = dryscrape.Session()
                session.visit(my_url)
                response = session.body()  # page HTML after JavaScript has run
                # Name the parser explicitly so bs4 does not have to guess one
                soup = BeautifulSoup(response, "html.parser")
                soup.find(id="intro-text")
                # Result:
                # <p id="intro-text">Yay! Supports javascript</p>
                
