A Custom Connection Pool in Nutch
强子哥哥, posted 3 years ago

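If you think that understanding the Nutch source code makes you a professional crawler engineer,

you are being naive. It is just Nutch + Hadoop, after all: the more servers you have, the more you crawl and the faster you go.

In practice, though, even for those with money to burn and plenty of servers,

I would say that one of the bottlenecks is how to avoid being blocked by the target server.

That happens to be the problem I am facing right now.

Background:

Data to crawl: more than 80 million URLs stored in files on HDFS

Machines in the Hadoop cluster: somewhere between 10 and 20

I added my own code to Nutch's Http component to implement a simple connection pool

Each virtual machine keeps its connection count in the range [0, 100].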

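=================Everything below builds on the background above=========================

---A painful lesson from the first crawl

Nutch crawl parameters: -threads 500 -depth 20 -topN 10000000

I first took 2 million URLs as a trial run and found that only a few tens of thousands were fetched successfully

The numbers:



Fetched 2,014,368 URLs in total: 25,942 succeeded, 1,988,426 failed. Success rate: 0.01287848099255, i.e. about 1.3%, well under 2%.

Elapsed time: 6 mins, 34 sec

Then came the worry that the job could not be finished, followed by searching Baidu for advice from people who had been through this before.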

=======================================================

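Here are the results of testing the various strategies:

Strategy 1: lower the fetch rate and be a polite crawler

Before the change, my settings were: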

<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.host is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>

But with these values the two settings provide no throttling at all, because the code contains this:

 

private void setEndTime(long endTime, boolean asap) {
  if (!asap)
    nextFetchTime.set(endTime + (maxThreads > 1 ? minCrawlDelay : crawlDelay));
  else
    nextFetchTime.set(endTime);
}

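Because my maxThreads is greater than 1, the value actually used is minCrawlDelay.

With the configuration above that value is 0, so nextFetchTime never gets any delay added.

The fetch rate therefore becomes too high, and the server keeps returning 503 Service Unavailable.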
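
The selection rule in setEndTime can be tried in isolation. The following is a minimal sketch, not Nutch code: the class name DelayRule and the method signature are made up for illustration, and the delays are assumed to be in milliseconds.

```java
// Illustrative re-statement of the delay rule quoted above.
// DelayRule and its parameter names are hypothetical, not Nutch classes.
class DelayRule {

    /** Returns the earliest time (ms) at which the same queue may be fetched again. */
    static long nextFetchTime(long endTime, boolean asap, int maxThreads,
                              long crawlDelayMs, long minCrawlDelayMs) {
        if (asap) {
            return endTime; // no politeness delay at all
        }
        // With more than one thread per queue, only the minimum delay applies.
        long delay = maxThreads > 1 ? minCrawlDelayMs : crawlDelayMs;
        return endTime + delay;
    }

    public static void main(String[] args) {
        long end = 10_000L;
        // One thread per queue: the 1000 ms crawl delay is honored.
        System.out.println(nextFetchTime(end, false, 1, 1000L, 0L)); // prints 11000
        // More than one thread: the min delay (0 here) wins, so no wait is added.
        System.out.println(nextFetchTime(end, false, 2, 1000L, 0L)); // prints 10000
    }
}
```

With a minimum delay of 0 and more than one thread, the second call shows why the 1.0 second server delay never takes effect.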

All right, time to adjust:

<property>
  <name>fetcher.server.min.delay</name>
  <value>1.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.host is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>

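Set it to 1 and see.

Result:

42821/2014368 = 0.021257784079175

The success rate rose to about 2.1%, which shows that (1) this direction is a usable point of attack, and (2) it is still not an effective solution by itself.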
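
For reference, the two success rates quoted so far can be recomputed from the raw counts. This is only an arithmetic sketch; the class name CrawlStats is hypothetical.

```java
// Recomputes the success rates of the two trial runs from the counts in the text.
// CrawlStats is an illustrative name, not part of Nutch.
class CrawlStats {

    /** Fraction of fetches that succeeded, in the range [0, 1]. */
    static double successRate(long succeeded, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return (double) succeeded / total;
    }

    public static void main(String[] args) {
        long total = 2_014_368L;
        double before = successRate(25_942L, total); // min delay 0.0
        double after = successRate(42_821L, total);  // min delay 1.0
        System.out.println("before: " + before);
        System.out.println("after:  " + after);
        System.out.println("ratio:  " + (after / before));
    }
}
```

The second run fetched roughly 1.65 times as many pages as the first, yet about 98% of requests still failed.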

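--------Thinking it over carefully: because a connection pool exists, the fetchItem stage is the wrong place to tune.

What I mean is that when a thread wants to take a fetchItem, it should get one as readily as possible, because at that point no interaction with the server has happened yet.

The bottleneck is the interaction with the server. So I will set both of these parameters back to 0

and do the real work in my connection pool.

Conclusion: Strategy 1 failed; move on to Strategy 2

------------- Strategy 2: let each fetchThread obtain a fetchItem as readily as possible, and optimize the connection pool.

Set the two parameters above back to 0.0, so that a thread gets an Item as quickly as possible.

The idea behind the connection pool optimization:

Straight to the code: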

package org.apache.nutch.protocol.http;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;

// SLF4J logging imports
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HttpPool {

	public static final Logger LOG = LoggerFactory.getLogger(HttpPool.class);

	private static HashMap<String, SocketPool> httpPool = new HashMap<String, SocketPool>();
	private static int MAX_SOCKET_VALVE = 200;
	private static int MIN_SOCKET_VALVE = 0;

	static class SocketFeedThread extends Thread {

		private SocketPool pool;
		private String host;
		private int port;
		private int time;

		public SocketFeedThread(SocketPool sp, String h, int p, int t) {
			pool = sp;
			host = h;
			port = p;
			time = t;
		}

		@Override
		public void run() {
			while (true) {
				int count = 0;
				synchronized (pool) {
					// calculate how many connections to create
					count = MAX_SOCKET_VALVE - pool.out
							- pool.getSocketList().size();
				}
				while (count > 0) {
					SocketObject socket = null;
					try {
						socket = new SocketObject(pool);
						socket.setSoTimeout(time);
						socket.setReuseAddress(true);
						socket.setTcpNoDelay(true);
						socket.setKeepAlive(true);
						InetSocketAddress sockAddr = new InetSocketAddress(
								host, port);
						socket.connect(sockAddr, time);
						synchronized (pool) {
							pool.getSocketList().add(socket);
							pool.notify();
						}
						count--;
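					} catch (Exception e) {
						// connect failed: close the half-open socket and give up
						// for this round instead of spinning; the outer loop
						// retries after the sleep below
						try {
							if (socket != null)
								socket.close();
						} catch (IOException ignored) {
						}
						break;
					}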
				}
				// sleep some time
				try {
					sleep(100);
				} catch (Exception e) {

				}
			}
		}
	}

	static class SocketObject extends Socket {
		private SocketPool pool;

		public SocketObject(SocketPool p) {
			super();
			pool = p;
		}

		public void setPool(SocketPool p) {
			pool = p;
		}

		public SocketPool getPool() {
			return pool;
		}
	}

	static class SocketPool {
		private ArrayList<SocketObject> sockets = new ArrayList<SocketObject>();
		private int out = 0;

		public SocketPool() {
		}

		public void decreaseOut() {
			out--;
		}

		public ArrayList<SocketObject> getSocketList() {
			return sockets;
		}

	}

	public static boolean isValid(Socket socket) {
		if (null == socket)
			return false;

		if (socket.isClosed() || false == socket.isConnected()
				|| socket.isInputShutdown() || socket.isOutputShutdown()) {
			return false;
		}

		return true;
	}

	// static method
	private static SocketPool getSocketPool(String host, int port, int time) {
		SocketPool pool = httpPool.get(host + ":" + port);
		if (null == pool) {
			synchronized (httpPool) {
				// re-check with the same "host:port" key that is used for put()
				pool = httpPool.get(host + ":" + port);
				if (null == pool) {
					pool = new SocketPool();
					httpPool.put(host + ":" + port, pool);
					new SocketFeedThread(pool, host, port, time).start();
				}
			}
		}
		return pool;
	}

	public static Socket getSocket(String host, int port, int time) {
		SocketObject socket = null;
		SocketPool pool = getSocketPool(host, port, time);
		synchronized (pool) {
			while (null == socket) {
				ArrayList<SocketObject> sockets = pool.getSocketList();
				if (sockets.size() > 0) {
					socket = sockets.remove(0);
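					if (!isValid(socket)) {
						// stale connection: drop it so the loop tries the next
						// one instead of returning an unusable socket
						socket = null;
						continue;
					}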
					pool.out++;
				} else {
					try {
						pool.wait();
					} catch (Exception e) {
						
					}
					socket = null;
				}
			}
		}
		return socket;
	}

}

 

Because the connection-creation code has been moved out of HttpResponse, HttpResponse now obtains its socket directly with:

socket = HttpPool.getSocket(http.useProxy() ? http.getProxyHost()
     : host, http.useProxy() ? http.getProxyPort() : port, http
     .getTimeout());

And the release code at the end is:

SocketPool pool = ((SocketObject) socket).getPool();
synchronized (pool) {
	// either way, the socket is no longer checked out
	pool.decreaseOut();
	if (exception || connectionClose) {
		// dispose of it
		if (null != socket)
			socket.close();
		socket = null;
	} else {
		// return it to the pool and wake one waiting fetcher thread
		pool.getSocketList().add((SocketObject) socket);
		pool.notify();
	}
}
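
The borrow-and-release discipline above can also be written with java.util.concurrent primitives. The sketch below is a different technique from the wait/notify pool in this post: a Semaphore caps how many objects are checked out at once, and try/finally guarantees the release. BoundedPool and all of its method names are hypothetical, and it pools generic objects rather than sockets so that it runs without a network.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical generic pool; not part of Nutch or of the HttpPool class above.
class BoundedPool<T> {
    private final Semaphore permits;                  // caps objects checked out at once
    private final Deque<T> idle = new ArrayDeque<>(); // returned objects ready for reuse
    private final Supplier<T> factory;                // creates a new object when none is idle

    BoundedPool(int max, Supplier<T> factory) {
        this.permits = new Semaphore(max);
        this.factory = factory;
    }

    /** Blocks until a permit is free, then reuses an idle object or creates one. */
    T borrow() {
        permits.acquireUninterruptibly(); // sketch only: ignores thread interrupts
        synchronized (idle) {
            T reused = idle.pollFirst();
            if (reused != null) {
                return reused;
            }
        }
        try {
            return factory.get();
        } catch (RuntimeException e) {
            permits.release(); // creation failed, so give the permit back
            throw e;
        }
    }

    /** Gives the permit back; pass reusable=false after an error to drop the object. */
    void release(T item, boolean reusable) {
        if (reusable) {
            synchronized (idle) {
                idle.addFirst(item);
            }
        }
        permits.release();
    }

    int idleCount() {
        synchronized (idle) {
            return idle.size();
        }
    }

    int available() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        BoundedPool<StringBuilder> pool = new BoundedPool<>(2, StringBuilder::new);
        StringBuilder sb = pool.borrow();
        boolean ok = false;
        try {
            sb.append("request"); // stands in for the HTTP exchange
            ok = true;
        } finally {
            pool.release(sb, ok); // always release, reusable only on success
        }
        System.out.println(pool.idleCount()); // prints 1
    }
}
```

The reusable flag plays the role of the exception || connectionClose test in the release code: a connection that failed or was closed by the server is dropped instead of being handed to the next thread.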

I then captured packets to test the connection pool. The capture, in text form:

GET /ws/NSearch?type=music&key=%E9%82%93%E7%B4%AB%E6%A3%8B+++%E7%88%B1%E4%BD%A0+++%E9%93%83%E5%A3%B0+ainy+ HTTP/1.1

Host: sou.kuwo.cn

Accept-Encoding: x-gzip, gzip, deflate

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3

Connection: keep-alive



HTTP/1.1 403 forbidden

Server: nginx

Date: Wed, 17 Dec 2014 02:20:13 GMT

Content-Type: text/html; charset=utf-8

Content-Length: 304

Connection: keep-alive

Retry-After: 0

X-Cache: MISS from 12localwebserver




 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html>
   <head>
     <title>403 forbidden</title>
   </head>
   <body>
     <h1>Error 403 forbidden</h1>
     <p>forbidden</p>
   </body>
 </html>
 GET /ws/NSearch?type=music&key=Dj+++%E5%93%AD%E6%B3%A3%E7%9A%84%E5%B0%8F%E5%A6%B9%E5%A6%B9+ HTTP/1.1

Host: sou.kuwo.cn

Accept-Encoding: x-gzip, gzip, deflate

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3

Connection: keep-alive



HTTP/1.1 403 forbidden

Server: nginx

Date: Wed, 17 Dec 2014 02:20:16 GMT

Content-Type: text/html; charset=utf-8

Content-Length: 304

Connection: keep-alive

Retry-After: 0

X-Cache: MISS from 12localwebserver




 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html>
   <head>
     <title>403 forbidden</title>
   </head>
   <body>
     <h1>Error 403 forbidden</h1>
     <p>forbidden</p>
   </body>
 </html>
 GET /ws/NSearch?type=music&key=Warning+De+Di+By+Pawan+Pilaniya+ft+ HTTP/1.1

Host: sou.kuwo.cn

Accept-Encoding: x-gzip, gzip, deflate

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3

Connection: keep-alive



HTTP/1.1 403 forbidden

Server: nginx

Date: Wed, 17 Dec 2014 02:20:18 GMT

Content-Type: text/html; charset=utf-8

Content-Length: 304

Connection: keep-alive

Retry-After: 0

X-Cache: MISS from 12localwebserver




 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html>
   <head>
     <title>403 forbidden</title>
   </head>
   <body>
     <h1>Error 403 forbidden</h1>
     <p>forbidden</p>
   </body>
 </html>
 [444 bytes missing in capture file]GET /ws/NSearch?type=music&key=Dj+++%E9%9D%9E%E4%B8%BB%E6%B5%81%E6%93%A6%E8%82%A9%E8%80%8C%E8%BF%87%E7%BB%8F%E5%85%B8+ HTTP/1.1

Host: sou.kuwo.cn

Accept-Encoding: x-gzip, gzip, deflate

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3

Connection: keep-alive



[1040 bytes missing in capture file]HTTP/1.1 403 forbidden

Server: nginx

Date: Wed, 17 Dec 2014 02:20:24 GMT

Content-Type: text/html; charset=utf-8

Content-Length: 304

Connection: keep-alive

Retry-After: 0

X-Cache: MISS from 12localwebserver




 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html>
   <head>
     <title>403 forbidden</title>
   </head>
   <body>
     <h1>Error 403 forbidden</h1>
     <p>forbidden</p>
   </body>
 </html>
 [864 bytes missing in capture file]GET /ws/NSearch?type=music&key=%E5%B0%8F%E6%A1%A5%E6%B5%81%E6%B0%B4+Bandari+ HTTP/1.1

Host: sou.kuwo.cn

Accept-Encoding: x-gzip, gzip, deflate

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3

Connection: keep-alive



[520 bytes missing in capture file]HTTP/1.1 403 forbidden

Server: nginx

Date: Wed, 17 Dec 2014 02:20:28 GMT

Content-Type: text/html; charset=utf-8

Content-Length: 304

Connection: keep-alive

Retry-After: 0

X-Cache: MISS from 12localwebserver




 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html>
   <head>
     <title>403 forbidden</title>
   </head>
   <body>
     <h1>Error 403 forbidden</h1>
     <p>forbidden</p>
   </body>
 </html>
 GET /ws/NSearch?type=music&key=04+Justin+Bieber++Nukleuz+Recordz+++ HTTP/1.1

Host: sou.kuwo.cn

Accept-Encoding: x-gzip, gzip, deflate

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3

Connection: keep-alive



HTTP/1.1 403 forbidden

Server: nginx

Date: Wed, 17 Dec 2014 02:20:31 GMT

Content-Type: text/html; charset=utf-8

Content-Length: 304

Connection: keep-alive

Retry-After: 0

X-Cache: MISS from 12localwebserver




 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html>
   <head>
     <title>403 forbidden</title>
   </head>
   <body>
     <h1>Error 403 forbidden</h1>
     <p>forbidden</p>
   </body>
 </html>
 GET /ws/NSearch?type=music&key=01++Pyaw+par+say+ HTTP/1.1

Host: sou.kuwo.cn

Accept-Encoding: x-gzip, gzip, deflate

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3

Connection: keep-alive



HTTP/1.1 403 forbidden

Server: nginx

Date: Wed, 17 Dec 2014 02:20:33 GMT

Content-Type: text/html; charset=utf-8

Content-Length: 304

Connection: keep-alive

Retry-After: 0

X-Cache: MISS from 12localwebserver




 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html>
   <head>
     <title>403 forbidden</title>
   </head>
   <body>
     <h1>Error 403 forbidden</h1>
     <p>forbidden</p>
   </body>
 </html>
 GET /ws/NSearch?type=music&key=Ek+Villain++Galliyan++Female+Version+ HTTP/1.1

Host: sou.kuwo.cn

Accept-Encoding: x-gzip, gzip, deflate

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1

Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3

Connection: keep-alive

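Conclusion: the connection pool works.

Which raises a new problem: a flood of 403 forbidden responses. I then tried the wget command, and the result was still 403.

The IP has been blocked. What now?

----------------------

After testing, the following measures effectively improve the crawl success rate:

1 Reduce the number of connections in the pool.

2 After a thread obtains a connection, have it wait for a while before sending the request.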
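
One possible way to implement the wait in point 2 is to space out requests to the same host by a fixed minimum interval. The sketch below is an assumption about how the wait could be chosen, not the code used in this crawl; RequestPacer, reserve and the 2000 ms interval are made-up names and values.

```java
// Hypothetical per-host pacer: hands out send slots at least minIntervalMs apart.
class RequestPacer {
    private final long minIntervalMs;
    private long nextSlotMs = 0L; // earliest time the next request may be sent

    RequestPacer(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /**
     * Reserves the next send slot and returns how many milliseconds the
     * caller should sleep before sending. Synchronized so that concurrent
     * threads never receive the same slot.
     */
    synchronized long reserve(long nowMs) {
        long slot = Math.max(nowMs, nextSlotMs);
        nextSlotMs = slot + minIntervalMs;
        return slot - nowMs;
    }

    public static void main(String[] args) throws InterruptedException {
        RequestPacer pacer = new RequestPacer(2000L); // assumed interval
        for (int i = 0; i < 3; i++) {
            long waitMs = pacer.reserve(System.currentTimeMillis());
            Thread.sleep(waitMs); // wait after getting the connection, before sending
            System.out.println("send request " + i);
        }
    }
}
```

A fetcher thread would call reserve right after taking a socket from the pool and sleep for the returned time. Together with a smaller pool limit, this bounds both the number of open connections and the request rate seen by the target server.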
