spark做聚合计算

原创
2017/06/27 16:07
阅读数 81
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("My Spark Application").setMaster("local")
sc = SparkContext(conf=conf)
text = sc.textFile('/root/common_command/url_data.csv') 
url_info = text.map(lambda line:line.split(","))
id_info = url_info.map(lambda fields:((fields[0], fields[1]),(fields[3])))
url_cnt = id_info.countByKey().items()
url_num = id_info.distinct().countByKey().items()
x = sc.parallelize(url_cnt) 
y = sc.parallelize(url_num)
result = sorted(x.fullOuterJoin(y).collect())
print(result)
print("executed successfully!")
展开阅读全文
打赏
0
0 收藏
分享
加载中
KYO4321博主
用Spark Python进行数据处理和特征提取
http://blog.csdn.net/u013719780/article/details/51768720
2017/06/29 09:55
回复
举报
KYO4321博主
##也可以使用python进行处理
import pandas as pd
df = pd.read_csv('url_data.txt', sep=',')
df.groupby(['ID','name'])['url'].count()

"""
ID name
id1 user1 4
id2 user2 4
Name: url, dtype: int64
"""


df1= df.iloc[:, [0,1,3]]
df1 = df1.drop_duplicates()
df1.groupby(['ID','name']).count()
"""
url
ID name
id1 user1 1
id2 user2 2

"""
2017/06/27 16:14
回复
举报
KYO4321博主
##源数据
##如下几个字段ID,name,cnt,url
##统计用户的总访问次数和去除访问同一个URL之后的总访问次数
"""
id1,user1,2,http://www.hupu.com
id1,user1,2,http://www.hupu.com
id1,user1,3,http://www.hupu.com
id1,user1,100,http://www.hupu.com
id2,user2,2,http://www.hupu.com
id2,user2,1,http://www.hupu.com
id2,user2,50,http://www.hupu.com
id2,user2,2,http://touzhu.hupu.com

"""
2017/06/27 16:14
回复
举报
KYO4321博主
将正文中的代码封装到new_demo.py文件中,然后运行如下代码
sh run_spark.sh new_demo.py

即可得到相应结果:
[root@SZC-L0033614 common_command]# sh run_spark.sh new_demo.py
17/06/27 16:05:08 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/06/27 16:05:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/27 16:05:10 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[((u'id1', u'user1'), (4, 1)), ((u'id2', u'user2'), (4, 2))]
executed successfully!
2017/06/27 16:12
回复
举报
KYO4321博主
[root@SZC-L0033614 common_command]# more call_pyspark.sh
cd /usr/lib/spark-2.1.0-bin-hadoop2.7/bin
./pyspark
[root@SZC-L0033614 common_command]# more run_spark.sh
spark_file=$1
/usr/lib/spark-2.1.0-bin-hadoop2.7/bin/spark-submit ${spark_file}
2017/06/27 16:11
回复
举报
更多评论
打赏
5 评论
0 收藏
0
分享
返回顶部
顶部