How to scrape Reddit

Original
2020/06/17 17:34

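Scraping Reddit

Main steps:

  1. Create a Reddit app account and choose "script" as the app type.
  2. Log in to Reddit with the app account to obtain authorization.
  3. Once authorized, collect data from Reddit.

If you are only testing, use a dedicated test account.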

Useful links

  1. Official API documentation
  2. Official API usage rules
  3. Reddit authorization guide
  4. Reddit app list page
  5. PRAW documentation
  6. Example of scraping Reddit with Python

Creating a Reddit app account

After logging in to Reddit, open the Reddit app list page and create a Reddit app. The app created in this example is of the script type.

(Screenshot: Untitled.png)
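
Once the app exists, its client ID and secret have to reach the script somehow. One option is to read them from environment variables rather than writing them into the source file. The following is a minimal sketch under that assumption; the variable names (REDDIT_CLIENT_ID and so on) and the helper load_reddit_credentials are hypothetical, not something Reddit or PRAW defines.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def load_reddit_credentials(env=None):
    """Collect the script-app credentials from environment variables.

    Raises RuntimeError listing every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in REQUIRED_VARS.items()}
```

The returned dictionary uses the same keyword names that praw.Reddit takes in the example in this post, so it could be passed along as praw.Reddit(user_agent=..., **load_reddit_credentials()).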

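Reddit API usage rules

To keep the developer account from being banned, follow the API rules. The full rules are at: https://github.com/reddit-archive/reddit/wiki/API

Key points:

  1. At most 60 requests per minute.
  2. Follow the User-Agent format (spoofing the User-Agent is not allowed): <platform>:<app ID>:<version string> (by /u/<reddit username>). For example: android:com.example.myredditapp:v1.2.3 (by /u/kemitche)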
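
The two rules above can be sketched in a few lines of Python. This is an illustration only: build_user_agent and MinIntervalLimiter are hypothetical helper names, and spacing requests evenly is just one simple way to stay under 60 requests per minute. PRAW is documented to manage rate limiting on its own, so a limiter like this would mainly matter when calling the API through some other HTTP client.

```python
import time


def build_user_agent(platform, app_id, version, username):
    """Format a User-Agent as <platform>:<app ID>:<version string> (by /u/<reddit username>)."""
    return f"{platform}:{app_id}:{version} (by /u/{username})"


class MinIntervalLimiter:
    """Space calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock   # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous call, None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling limiter.wait() immediately before each request keeps consecutive requests at least one second apart with the default setting.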

Example: fetching TikTok-related content from Reddit

#!/usr/bin/env python3
# coding: UTF-8

import praw
import pandas as pd

# Script-type app: fill in the values shown on the Reddit app list page.
reddit = praw.Reddit(
    client_id='your-clientID',
    client_secret='your secret',
    user_agent='your_platform:dev_tmp:v0.1 (by /u/dev_tmp)',
    # redirect_uri='http://localhost:8080',
    username='dev_tmp',
    password='your-password'
)
# print(reddit.auth.url(["identity"], "...", "permanent"))
# Prints the authenticated user name if the login succeeded.
print(reddit.user.me())

# subreddit_all is of type <class 'praw.models.reddit.subreddit.Subreddit'>.
# For usage see: https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html
# (Named subreddit_all rather than all to avoid shadowing the built-in all().)
subreddit_all = reddit.subreddit("all")
print(type(subreddit_all))

# Each submission is of type <class 'praw.models.reddit.submission.Submission'>.
# For the attribute list see: https://praw.readthedocs.io/en/latest/code_overview/models/submission.html
messages = {
    "id": [],
    "url": [],
    "title": [],
    "score": [],
    "comms_num": [],
    "body": [],
    "created": []
}
for submission in subreddit_all.search("tiktok", limit=5):
    messages["id"].append(submission.id)
    messages["url"].append(submission.url)
    messages["title"].append(submission.title)
    messages["score"].append(submission.score)
    messages["comms_num"].append(submission.num_comments)
    messages["body"].append(submission.selftext)
    # created_utc is the creation time as a Unix timestamp (UTC).
    messages["created"].append(submission.created_utc)

# The search result is of type <class 'praw.models.listing.generator.ListingGenerator'>.
# Other PRAW types are documented at: https://praw.readthedocs.io/en/latest/code_overview/other.html
# res = subreddit_all.search("tiktok")
# print(type(res))

# One row per submission, written without the index column.
data = pd.DataFrame(messages)
data.to_csv('data.csv', index=False)

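Resulting CSV (screenshot Untitled 1.png not reproduced here): data.csv contains one row per submission, with the columns id, url, title, score, comms_num, body, and created.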
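
The created column holds Unix timestamps (seconds since the epoch), which are hard to read in a spreadsheet. A small sketch of converting them to ISO 8601 strings with the standard library follows; the helper name to_iso_utc is hypothetical and not part of the original script.

```python
from datetime import datetime, timezone


def to_iso_utc(epoch_seconds):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(float(epoch_seconds), tz=timezone.utc).isoformat()


print(to_iso_utc(0))  # 1970-01-01T00:00:00+00:00
```

Applied to the script above, the values could be converted before building the DataFrame, for example with [to_iso_utc(ts) for ts in messages["created"]].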
