<bdo id='OMoNb'></bdo><ul id='OMoNb'></ul>
    <i id='OMoNb'><tr id='OMoNb'><dt id='OMoNb'><q id='OMoNb'><span id='OMoNb'><b id='OMoNb'><form id='OMoNb'><ins id='OMoNb'></ins><ul id='OMoNb'></ul><sub id='OMoNb'></sub></form><legend id='OMoNb'></legend><bdo id='OMoNb'><pre id='OMoNb'><center id='OMoNb'></center></pre></bdo></b><th id='OMoNb'></th></span></q></dt></tr></i><div id='OMoNb'><tfoot id='OMoNb'></tfoot><dl id='OMoNb'><fieldset id='OMoNb'></fieldset></dl></div>
  1. <tfoot id='OMoNb'></tfoot>

    1. <small id='OMoNb'></small><noframes id='OMoNb'>

      <legend id='OMoNb'><style id='OMoNb'><dir id='OMoNb'><q id='OMoNb'></q></dir></style></legend>

    2. 如何使用 Simple-HTML-DOM 提取完整的子链接?

      时间:2023-07-15
    3. <tfoot id='OMKgw'></tfoot>
      1. <legend id='OMKgw'><style id='OMKgw'><dir id='OMKgw'><q id='OMKgw'></q></dir></style></legend>
          <bdo id='OMKgw'></bdo><ul id='OMKgw'></ul>
            <i id='OMKgw'><tr id='OMKgw'><dt id='OMKgw'><q id='OMKgw'><span id='OMKgw'><b id='OMKgw'><form id='OMKgw'><ins id='OMKgw'></ins><ul id='OMKgw'></ul><sub id='OMKgw'></sub></form><legend id='OMKgw'></legend><bdo id='OMKgw'><pre id='OMKgw'><center id='OMKgw'></center></pre></bdo></b><th id='OMKgw'></th></span></q></dt></tr></i><div id='OMKgw'><tfoot id='OMKgw'></tfoot><dl id='OMKgw'><fieldset id='OMKgw'></fieldset></dl></div>

                <small id='OMKgw'></small><noframes id='OMKgw'>

                  <tbody id='OMKgw'></tbody>

              • 本文介绍了如何使用 Simple-HTML-DOM 提取完整的子链接?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着跟版网的小编来一起学习吧!

                问题描述

                以下是我用来从页面中提取子链接的基本代码:

                The following is the basic code I use to extract sublinks from a page:

                <?php
                    include_once('simple_html_dom.php');
                    function extract_links($target_url)
                    {   
                        $html = new simple_html_dom();
                        $html->load_file($target_url);  
                        $i=0;
                        $crawl =array();
                        foreach($html->find('a') as $link)
                        {
                            $crawl[$i] = $link->href;
                            $i++;
                        }
                        var_dump($crawl);
                    }
                    extract_links('http://stackoverflow.com');
                ?>
                

                输出如下:

                array
                  0 => string 'http://stackexchange.com' (length=24)
                  1 => string '/users/login' (length=12)
                  2 => string 'http://careers.stackoverflow.com' (length=32)
                  3 => string 'http://chat.stackoverflow.com' (length=29)
                  4 => string 'http://meta.stackoverflow.com' (length=29)
                  5 => string '/about' (length=6)
                  6 => string '/faq' (length=4)
                  7 => string '/' (length=1)
                  8 => string '/questions' (length=10)
                  9 => string '/tags' (length=5)
                  10 => string '/users' (length=6)
                  11 => string '/badges' (length=7)
                  12 => string '/unanswered' (length=11)
                  13 => string '/questions/ask' (length=14)
                  14 => string '?tab=interesting' (length=16)
                  15 => string '?tab=featured' (length=13)
                  16 => string '?tab=hot' (length=8)
                  17 => string '?tab=week' (length=9)
                  18 => string '?tab=month' (length=10)
                  19 => string '/questions/14611052/basic-standalone-jpa-example-with-postgres-using-eclipse' (length=76)
                  20 => string '/questions/tagged/eclipse' (length=25)
                  21 => string '/questions/tagged/postgresql' (length=28)
                  22 => string '/questions/tagged/jpa' (length=21)
                  23 => string '/questions/14611052/basic-standalone-jpa-example-with-postgres-using-eclipse' (length=76)
                  24 => string '/users/865448/tostao' (length=20)
                  25 => string '/questions/14611172/unable-to-fully-print-a-page-containing-iframes-in-chrome' (length=77)
                  26 => string '/questions/tagged/javascript' (length=28)
                  27 => string '/questions/tagged/jquery' (length=24)
                  28 => string '/questions/tagged/html' (length=22)
                  29 => string '/questions/tagged/html5' (length=23)
                  30 => string '/questions/tagged/google-chrome' (length=31)
                  31 => string '/questions/14611172/unable-to-fully-print-a-page-containing-iframes-in-chrome' (length=77)
                  32 => string '/users/962868/tejas' (length=19)
                  33 => string '/questions/14609779/how-can-i-configure-bash-to-handle-crlf-shell-scripts' (length=73)
                  34 => string '/questions/tagged/linux' (length=23)
                  35 => string '/questions/tagged/windows' (length=25)
                  36 => string '/questions/tagged/bash' (length=22)
                  37 => string '/questions/tagged/line-endings' (length=30)
                  38 => string '/questions/14609779/how-can-i-configure-bash-to-handle-crlf-shell-scripts/?lastactivity' (length=87)
                  39 => string '/users/1899640/that-other-guy' (length=29)
                  40 => string '/questions/14611169/using-one-socket-for-peer-to-peer-communication' (length=67)
                  41 => string '/questions/tagged/sockets' (length=25)
                  42 => string '/questions/tagged/p2p' (length=21)
                  43 => string '/questions/14611169/using-one-socket-for-peer-to-peer-communication' (length=67)
                  44 => string '/users/911651/xsnrg' (length=19)
                  45 => string '/questions/14611166/possible-mistake-in-ios-dev-guide' (length=53)
                  46 => string '/questions/tagged/iphone' (length=24)
                  47 => string '/questions/tagged/ios' (length=21)
                  48 => string '/questions/tagged/objective-c' (length=29)
                  49 => string '/questions/14611166/possible-mistake-in-ios-dev-guide' (length=53)
                  50 => string '/users/107715/matt-n' (length=20)
                  51 => string '/questions/14611163/how-to-use-dispatcher-in-wpf-to-make-a-timer' (length=64)
                  52 => string '/questions/tagged/wpf' (length=21)
                  53 => string '/questions/tagged/timer' (length=23)
                  54 => string '/questions/tagged/dispatcher' (length=28)
                  55 => string '/questions/14611163/how-to-use-dispatcher-in-wpf-to-make-a-timer' (length=64)
                  56 => string '/users/1741800/nashat' (length=21)
                  57 => string '/questions/14610879/how-can-i-handle-an-access-violation-in-visual-studio-c' (length=75)
                  58 => string '/questions/tagged/visual-c%2b%2b' (length=32)
                  59 => string '/questions/tagged/exception-handling' (length=36)
                  60 => string '/questions/tagged/access-violation' (length=34)
                  61 => string '/questions/tagged/structured-exception' (length=38)
                  62 => string '/questions/14610879/how-can-i-handle-an-access-violation-in-visual-studio-c/?lastactivity' (length=89)
                  63 => string '/users/901812/big-endian' (length=24)
                  64 => string '/questions/14611162/mvc-condintional-authorization' (length=50)
                  65 => string '/questions/tagged/c%23' (length=22)
                  66 => string '/questions/tagged/asp.net-mvc' (length=29)
                  67 => string '/questions/tagged/asp.net-mvc-4' (length=31)
                  68 => string '/questions/tagged/authorization' (length=31)
                  69 => string '/questions/14611162/mvc-condintional-authorization' (length=50)
                  70 => string '/users/644969/cadrell0' (length=22)
                  71 => string '/questions/14611160/get-customer-role-nopcommerce' (length=49)
                  72 => string '/questions/tagged/c%23' (length=22)
                  73 => string '/questions/tagged/razor' (length=23)
                  74 => string '/questions/tagged/nopcommerce' (length=29)
                  75 => string '/questions/14611160/get-customer-role-nopcommerce' (length=49)
                  76 => string '/users/1378841/mlg74' (length=20)
                  77 => string '/questions/14611158/iframe-resizing-nested-in-gridview' (length=54)
                  78 => string '/questions/tagged/resize' (length=24)
                  79 => string '/questions/14611158/iframe-resizing-nested-in-gridview' (length=54)
                  80 => string '/users/2026451/satish-patil' (length=27)
                  81 => string '/questions/14611157/php-how-to-check-the-value-got-this-word-from-a-var' (length=71)
                  82 => string '/questions/tagged/php' (length=21)
                  83 => string '/questions/tagged/preg-match' (length=28)
                  84 => string '/questions/tagged/strpos' (length=24)
                  85 => string '/questions/14611157/php-how-to-check-the-value-got-this-word-from-a-var' (length=71)
                  86 => string '/users/963414/samual99' (length=22)
                  87 => string '/questions/14611155/how-to-get-the-coordinates-of-boundries-of-drawable-on-the-mapview' (length=86)
                  88 => string '/questions/tagged/android' (length=25)
                  89 => string '/questions/tagged/google-maps' (length=29)
                  90 => string '/questions/14611155/how-to-get-the-coordinates-of-boundries-of-drawable-on-the-mapview' (length=86)
                  91 => string '/users/1520564/blubar' (length=21)
                  92 => string '/questions/14611153/why-css-is-empty-when-ssl-is-on-and-appcache-is-enabled-ipad-safari' (length=87)
                  93 => string '/questions/tagged/css' (length=21)
                  94 => string '/questions/tagged/ipad' (length=22)
                  95 => string '/questions/tagged/ssl' (length=21)
                  96 => string '/questions/tagged/mobile-safari' (length=31)
                  97 => string '/questions/tagged/html5-appcache' (length=32)
                  98 => string '/questions/14611153/why-css-is-empty-when-ssl-is-on-and-appcache-is-enabled-ipad-safari' (length=87)
                  99 => string '/users/2026375/twoface' (length=22)
                  100 => string '/questions/14611149/laravel-how-to-temporarily-store-eloquent-models-in-db-without-a-proper-schem' (length=97)
                  101 => string '/questions/tagged/php' (length=21)
                  102 => string '/questions/tagged/laravel' (length=25)
                  103 => string '/questions/14611149/laravel-how-to-temporarily-store-eloquent-models-in-db-without-a-proper-schem' (length=97)
                  104 => string '/users/291557/duality' (length=21)
                  105 => string '/questions/13928812/xmlserializer-generateserializer-and-collections' (length=68)
                  106 => string '/questions/tagged/c%23' (length=22)
                  107 => string '/questions/tagged/xml-serialization' (length=35)
                  108 => string '/questions/13928812/xmlserializer-generateserializer-and-collections/?lastactivity' (length=82)
                  109 => string '/users/1200614/phil' (length=19)
                  110 => string '/questions/14611145/keep-buttons-in-view-when-keyboard-opens-android' (length=68)
                  111 => string '/questions/tagged/android' (length=25)
                  112 => string '/questions/tagged/keyboard' (length=26)
                  113 => string '/questions/tagged/resize' (length=24)
                  114 => string '/questions/tagged/window' (length=24)
                  115 => string '/questions/tagged/views' (length=23)
                  116 => string '/questions/14611145/keep-buttons-in-view-when-keyboard-opens-android' (length=68)
                  117 => string '/users/1137413/725623452362' (length=27)
                  118 => string '/questions/14611144/ssdp-discovery-from-a-browser' (length=49)
                  119 => string '/questions/tagged/silverlight' (length=29)
                  120 => string '/questions/tagged/flash' (length=23)
                  121 => string '/questions/14611144/ssdp-discovery-from-a-browser' (length=49)
                  122 => string '/users/191882/legege' (length=20)
                  123 => string '/questions/14611143/how-to-syncrhonize-on-site-in-memory-no-sql-datasources-with-central-database-in' (length=100)
                  124 => string '/questions/tagged/architecture' (length=30)
                  125 => string '/questions/tagged/nosql' (length=23)
                  126 => string '/questions/tagged/java-ee-6' (length=27)
                  127 => string '/questions/tagged/in-memory-database' (length=36)
                  more elements...
                

                现在考虑数组中的/about"子链接.我希望它显示为https://stackoverflow.com/about".为什么只返回子链接的子部分,而在某些情况下返回完整的子链接?还有一些链接以?"符号开头.如何清理这些链接?

                Now consider '/about' sublink in the array. I want it to be displayed as 'https://stackoverflow.com/about'. Why only subpart of sublink is returned while in some cases complete sublink is returned ? Also some links are starting with '?' sign. How to sanitize these links ?

                考虑http://en.wikipedia.org/wiki/Web_crawler".现在,如果我对其执行 extract_links,我会得到一个这样的子链接http://en.wikipedia.org/wiki/Web_crawler/wiki/Web_search_engine",这是无效的并且大多数链接都是这种格式.正确的链接是http://en.wikipedia.org/wiki/Web_search_engine".我在另一个程序中使用这个函数,该函数将传递一个链接数组,所以我不能保持 if 条件静态.以下是我现在使用的代码片段:

                Consider "http://en.wikipedia.org/wiki/Web_crawler". Now if I perform extract_links on it, I get a sublink like this "http://en.wikipedia.org/wiki/Web_crawler/wiki/Web_search_engine" which is invalid and most of the links are of this format. The correct link is "http://en.wikipedia.org/wiki/Web_search_engine". And I am using this function in another program which will pass an array of links so I cannot keep the if conditions static. The following is the code fragment I am using now:

                foreach($html->find('a') as $link)
                {   
                    $href = $link->href;
                    $fchr = substr($href, 0, 1);
                    if ($fchr === '/')
                    {
                        $href = $target_url.$href;
                    }
                    else if ($fchr === '?')
                    {
                        $href = $target_url.'/'. $href;
                    }
                }
                

                推荐答案

                @pguardiario's comment

                正如他的评论中所建议的,phpUri 是将相对 URL 转换为绝对 URL 的完美解决方案.你可以在这里找到,

                @pguardiario's comment

                As suggested in his comment phpUri is the perfect solution for converting relative URLs to absolute. You can find it here,

                这篇关于如何使用 Simple-HTML-DOM 提取完整的子链接?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!

                上一篇:如何防止被第二个正则表达式重新替换? 下一篇:使用polylang时如何确定wordpress页面的当前语言?

                相关文章

                  <tfoot id='62Arb'></tfoot>
                  <legend id='62Arb'><style id='62Arb'><dir id='62Arb'><q id='62Arb'></q></dir></style></legend>
                    • <bdo id='62Arb'></bdo><ul id='62Arb'></ul>

                    <small id='62Arb'></small><noframes id='62Arb'>

                    <i id='62Arb'><tr id='62Arb'><dt id='62Arb'><q id='62Arb'><span id='62Arb'><b id='62Arb'><form id='62Arb'><ins id='62Arb'></ins><ul id='62Arb'></ul><sub id='62Arb'></sub></form><legend id='62Arb'></legend><bdo id='62Arb'><pre id='62Arb'><center id='62Arb'></center></pre></bdo></b><th id='62Arb'></th></span></q></dt></tr></i><div id='62Arb'><tfoot id='62Arb'></tfoot><dl id='62Arb'><fieldset id='62Arb'></fieldset></dl></div>