Python BeautifulSoup4: trouble retrieving script nodes

2019/04/10 10:10

While scraping station names from 12306, I found that BeautifulSoup could not find the node whose src contains station_version.

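The reason is that the script tags come after </html>. With the 'lxml' parser that part of the document is ignored, whereas html5lib, which recovers from malformed markup the way a browser does, keeps it.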

  ...
<!-- 购物车 -->
<div style="display: none;" class="buy-cart"><div class="cart-hd"><span class="num">0</span>
</div>
<div class="cart-bd" style="display: none;"><div class="cart-bd-top"><h3><span id="hbTrainDate">候补购票需求列表</span>
<a id="hbClear" href="javascript:void(0)" shape="rect">[清空]</a>
</h3>
<a href="javascript:void(0)" class="close" shape="rect">×</a>
</div>
<div class="cart-bd-con"><ul class="cart-tlist"></ul>
</div>
<div class="cart-bd-ft"><p class="cart-ft-tips">1、候补订单需求中可包含2个相邻乘车日期,每个乘车日期可包含2个不同“车次+席别”的组合需求。</p>
<p class="cart-ft-tips">2、排位是指您的订单在待兑现订单中的位置。当前排位仅供参考,实际排位以支付成功后为准。</p>
<a id="hbSubmit" href="javascript:void(0)" class="btn72 fr" shape="rect">添加乘客</a>
</div>
</div>
</div>
</body>
</html>  # the soup produced by 'lxml' ends here
<script type="text/javascript" src="/otn/resources/js/framework/station_name.js?station_version=1.9115" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/js/framework/favorite_name.js" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/merged/queryLeftTicket_end_js.js?scriptVersion=1.9158" xml:space="preserve"></script>
  ...

 

>>> import re
>>> import requests
>>> from bs4 import BeautifulSoup as bs
>>> url = "https://kyfw.12306.cn/otn/leftTicket/init?linktypeid=dc&fs=%E4%B8%87%E5%B7%9E,WYW&ts=%E8%A5%BF%E5%AE%89,XAY&date=2019-11-05&flag=N,N,Y"
>>> response = requests.get(url, timeout=10)
>>> response.encoding = 'utf-8'
>>> # Parse the same text with both parsers (names chosen so they do not shadow the lxml and html5lib modules)
>>> soup_lxml = bs(response.text, 'lxml')
>>> soup_html5lib = bs(response.text, 'html5lib')
>>> response.close()
>>> soup_lxml.find_all(src=re.compile(".*station_version.*"))
[]
>>> soup_html5lib.find_all(src=re.compile(".*station_version.*"))
[<script src="/otn/resources/js/framework/station_name.js?station_version=1.9115" type="text/javascript" xml:space="preserve"></script>]
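
As a cross-check that does not depend on either tree builder, the raw response text can be scanned with the standard library's html.parser, which reports tags as a flat stream of events and so still sees a script tag placed after </html>. The following is only a minimal sketch: ScriptSrcCollector, find_script_sources and the shortened sample string are illustrative names and data, not part of the original post.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> start tag in the input."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Start tags arrive as events in document order; no tree is built,
        # so a <script> that follows </html> is reported like any other tag.
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_script_sources(html_text, keyword):
    """Return the src values of <script> tags whose src contains keyword."""
    collector = ScriptSrcCollector()
    collector.feed(html_text)
    collector.close()
    return [src for src in collector.sources if keyword in src]


# Hypothetical, shortened stand-in for the 12306 page: scripts follow </html>.
sample = """<html><body><p>demo</p></body>
</html>
<script type="text/javascript" src="/otn/resources/js/framework/station_name.js?station_version=1.9115" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/js/framework/favorite_name.js" xml:space="preserve"></script>
"""

print(find_script_sources(sample, "station_version"))
```

In the session above, response.text would take the place of sample. This only confirms that the tag is present in the downloaded text; to query it with BeautifulSoup, the html5lib parser is still the one that keeps it in the soup.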

 

Original source: https://www.cnblogs.com/wawawawa-briefnote/p/11801636.html
