
『Data Science』R Learning Notes: Getting Data


Obtaining Data: Motivation

  • This course covers the basic ideas behind getting data ready for analysis
    • Finding and extracting raw data
    • Tidy data principles and how to make data tidy
    • Practical implementation through a range of R packages
  • What this course depends on
  • What would be useful
    • Exploratory analysis
    • Reporting Data and Reproducible Research

PS: free big data sources

GOAL: Raw data -> Processing script -> tidy data -> data analysis -> data communication

Raw and Processed Data

Data are values of qualitative or quantitative variables, belonging to a set of items.

  • Qualitative: Country of origin, sex, treatment
  • Quantitative: Height, weight, blood pressure

The components of tidy data

  1. The raw data.
  2. A tidy data set.
  3. A code book describing each variable and its values in the tidy data set.
  4. An explicit and exact recipe you used to go from 1 -> 2,3.

The tidy data

  1. Each variable you measure should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each "kind" of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.
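
A minimal sketch of principles 1 and 2, with made-up numbers: the untidy frame hides the treatment variable in its column names, while the tidy frame gives each variable its own column and each observation its own row.

## Untidy: "treatment" is encoded in the column names
untidy <- data.frame(patient = c("A", "B"),
                     drug    = c(5.2, 6.1),
                     placebo = c(4.8, 5.0))

## Tidy: one column per variable, one row per observation
tidy <- data.frame(patient   = c("A", "B", "A", "B"),
                   treatment = c("drug", "drug", "placebo", "placebo"),
                   response  = c(5.2, 6.1, 4.8, 5.0))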

Others:

  • Include a row at the top of each file with variable names.
  • Make variable names human readable: AgeAtDiagnosis instead of AgeDx.
  • In general data should be saved in one file per table.

Downloading Data

  1. Get/set your working directory
  • getwd()
  • setwd()
  2. Checking for and creating directories
  • file.exists("directoryName")
  • dir.create("directoryName")
  3. Getting data from the internet
  • download.file()
if (!file.exists("db")) {
  dir.create("db")
}

fileUrl <- "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/callsforservice.csv", method = "curl")
list.files('./db')

PS: When downloading the data file with the code above, I got the error messages below. They were caused by curl not being installed on my system; changing method = "curl" to method = "auto" fixed it.

Warning messages:
1: running command 'curl  "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"  -o "./db/callsforservice.csv"' had status 127
2: In download.file(fileUrl, destfile = "./db/callsforservice.csv",  :
  download had nonzero exit status

Loading flat files

  • read.table()
classData <- read.table('./db/callsforservice.csv', sep = ',', header = TRUE)
head(classData)

All Reading Functions

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)

Reading XML Data

  • Extensible Markup Language
  • Frequently used to store structured data
  • Particularly widely used in internet applications
  • Extracting XML is the basis for most web scraping
  • Components
    • Markup - labels that give the text structure
    • Content - the actual text of the document
library(XML)

html <- "http://stackoverflow.com/search?q=XML+content+does+not+seem+to+be+XML%3A"
doc <- htmlTreeParse(html, useInternalNodes = TRUE)
content <- xpathSApply(doc, "//div[@class = 'result-link']", xmlValue)
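
Parsing a standalone XML document works the same way; a minimal sketch on an inline XML string (made up here for illustration):

library(XML)

xmlText <- "<menu>
              <food><name>Toast</name><price>$3</price></food>
              <food><name>Bagel</name><price>$2</price></food>
            </menu>"
doc  <- xmlTreeParse(xmlText, asText = TRUE, useInternalNodes = TRUE)
root <- xmlRoot(doc)

xmlName(root)                          ## "menu"
xmlSApply(root, xmlValue)              ## text content of each <food> node
xpathSApply(root, "//name", xmlValue)  ## "Toast" "Bagel"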

Reading JSON Data

  • jsonlite
install.packages('jsonlite')
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)
names(jsonData$owner)
jsonData$owner$login

## pretty-print JSON data
myjson <- toJSON(jsonData$owner, pretty = T)
cat(myjson)

The data.table package

> library(data.table)
> DF = data.frame(x=rnorm(9), y=rep(c("a", "b", "c"),each=3),z=rnorm(9))
> head(DF, 3)
         x y          z
1 1.239493 a -0.3917245
2 1.090748 a  0.3640152
3 2.462106 a  1.3424369

> DT = data.table(x=rnorm(9), y=rep(c("a", "b", "c"),each=3), z=rnorm(9))
> head(DT)
           x y           z
1  0.1235667 a  0.94765708
2 -1.1491418 a  1.23264715
3 -2.3339784 a -0.70625463
4  0.4896532 b  0.07144038
5  0.7731791 b  0.45262096
6  0.1601838 b -0.30345490
DT[2,]                       ## subset rows
DT[DT$y == "a",]             ## subset rows with a condition
DT[, c(2,3)]                 ## note: j is an expression, so this does not subset columns like a data.frame
DT[, list(mean(x), sum(z))]  ## apply functions inside j
DT[, table(y)]
DT[, w:=z^2]                 ## add a column by reference

DT[, m:= {tmp <- (x+z); log2(tmp+5)}]  ## multi-step expression; x+z, since y is character
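
data.table also provides fread(), a much faster drop-in replacement for read.table(); a quick sketch reading the file downloaded earlier (path assumed from the download step above):

library(data.table)
classDT <- fread('./db/callsforservice.csv')
head(classDT, 3)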

Reading from MySQL

install.packages("RMySQL")
library(RMySQL)

ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);

hg19 <- dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)

dbListFields(hg19, "affyU133Plus2")

dbGetQuery(hg19, "select count(*) from affyU13Plus2")

affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)

## process a big table in chunks
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)

affyMisSmall <- fetch(query, n = 10); dbClearResult(query);
dim(affyMisSmall)

dbDisconnect(hg19)    ## close db connection

Reading from HDF5

  • Used for storing large data sets.
  • Supports storing a range of data types
  • Hierarchical data format
  • groups containing zero or more data sets and metadata
    • Have a group header with group name and list of attributes
    • Have a group symbol table with a list of objects in the group
  • datasets: multidimensional arrays of data elements with metadata
    • Have a header with name, datatype, dataspace, and storage layout
    • Have a data array with the data
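
R reads HDF5 through the rhdf5 package, which lives on Bioconductor rather than CRAN; a minimal sketch (installation shown with biocLite, the Bioconductor installer at the time of writing):

source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
library(rhdf5)

created <- h5createFile("example.h5")   ## create an empty HDF5 file
h5createGroup("example.h5", "foo")      ## add a group

A <- matrix(1:10, nrow = 5, ncol = 2)
h5write(A, "example.h5", "foo/A")       ## write a dataset into the group

h5ls("example.h5")                      ## list groups and datasets
readA <- h5read("example.h5", "foo/A")  ## read the dataset back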

Reading from the Web

Get web document

  1. Use the built-in functions url() and readLines()
> con = url("http://www.baidu.com")
> htmlCode = readLines(con)
> close(con)
> htmlCode
  2. Use the XML package
> library(XML)

> url <- "http://www.baidu.com"
> html <- htmlTreeParse(url, useInternalNodes = T)

> xpathSApply(html, "//div", xmlValue)
  3. Use the httr and XML packages
install.packages("httr")
library(httr)
url <- "http://www.baidu.com"
html <- GET(url)
htmlContent = content(html, as="text")  ## named to avoid masking httr::content()

library(XML)
parsedHtml = htmlParse(htmlContent, asText = TRUE)
xpathSApply(parsedHtml, "//div", xmlValue)

Accessing websites with passwords

  1. Before logging in
> pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
> pg1
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:33
  Status: 401
  Content-Type: <unknown>
<EMPTY BODY>
  2. Logging in
> pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg2
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:34
  Status: 200
  Content-Type: application/json
  Size: 47 B
{
  "authenticated": true,
  "user": "user"
}
> names(pg2)
 [1] "url"         "status_code" "headers"     "all_headers" "cookies"     "content"     "date"        "times"      
 [9] "request"     "handle"
  3. Use a handle to preserve cookies, sessions, and other settings across requests.
> pg = handle("http://httpbin.org")
> login = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg1 = GET(handle = pg, path = "/")
> pg2 = GET(handle = pg, path = "about")

Reading data from APIs

library(httr)  ## oauth_app() and friends come from httr
myapp = oauth_app("twitter", key = "yourConsumerKeyHere", secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp, token = "youerTokenHere", token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)

json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(json1))
json2[1, 1:4]
