『Data Science』R Language Study Notes: Examining the Data

灰大羊
Published 2016/07/19 20:57

Getting the data from the web

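# Create a local data directory if it does not exist yet
if (!file.exists("./db")) {
    dir.create("./db")
}

# Download the Baltimore restaurants CSV and read it into a data frame
fileUrl <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/restaurants.csv", method = "auto")
restData <- read.csv("./db/restaurants.csv")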
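
A minimal sketch of the same download step, wrapped in a helper that skips the download when the file already exists and records when the copy was fetched. The helper name fetch_once is hypothetical. The demo uses a file:// URL built from a temporary file so that it runs offline (this assumes Unix-style paths); in practice the real URL would be passed instead.

```r
# Hypothetical helper: download `url` to `destfile` only if it is not there yet.
fetch_once <- function(url, destfile) {
    dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
    if (!file.exists(destfile)) {
        download.file(url, destfile = destfile, method = "auto", quiet = TRUE)
    }
    invisible(destfile)
}

# Offline demo: a small CSV with made-up rows, served through a file:// URL.
src <- tempfile(fileext = ".csv")
write.csv(data.frame(name = c("A", "B"), zipCode = c(21201, 21202)),
          src, row.names = FALSE)

dest <- file.path(tempdir(), "db_demo", "restaurants.csv")
fetch_once(paste0("file://", src), dest)

demoData <- read.csv(dest)
dateDownloaded <- date()   # keep a record of when the data was fetched
```

Recording the download date is useful because open-data files such as this one can change or move over time.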

Looking at a bit of the data

head(restData, n=3)
tail(restData, n=3)

Make a summary

summary(restData)

More in-depth information

str(restData)
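
The inspection calls above can be tried without the download. This is a small sketch on a made-up data frame (the object demo and its values are invented for illustration, not taken from the restaurants file):

```r
# Made-up stand-in for restData.
demo <- data.frame(
    name            = c("A", "B", "C", "D", "E"),
    zipCode         = c(21201L, 21202L, 21205L, 21206L, 21212L),
    councilDistrict = c(11L, 1L, 2L, 2L, 4L)
)

first3 <- head(demo, n = 3)   # first three rows
last3  <- tail(demo, n = 3)   # last three rows, original row names kept

dim(demo)                       # number of rows and columns
names(demo)                     # column names
summary(demo$councilDistrict)   # min, quartiles, mean, max of one column
str(demo)                       # type and first values of every column
```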

Quantiles of quantitative variables

The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

> quantile(restData$councilDistrict, na.rm = T)
  0%  25%  50%  75% 100%
   1    2    9   11   14
> quantile(restData$councilDistrict, probs = c(0.5, 0.75, 0.9))
50% 75% 90%
  9  11  12

Arguments of quantile:

  • x - numeric vector whose sample quantiles are wanted, or an object of a class for which a method has been defined (see the Details section of ?quantile). NA and NaN values are not allowed in numeric vectors unless na.rm is TRUE.
  • probs - numeric vector of probabilities with values in [0,1]. (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.)
  • na.rm - logical; if true, any NA and NaN values are removed from x before the quantiles are computed.
  • names - logical; if true, the result has a names attribute. Set to FALSE for speedup with many probs.
  • type - an integer between 1 and 9 selecting one of the nine quantile algorithms described in ?quantile.
  • ... - further arguments passed to or from other methods.
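
A short sketch of how these arguments behave, using a made-up vector with one missing value (the vector x and the result names are illustrative only):

```r
x <- c(1, 2, 3, 4, NA)   # made-up values with one missing entry

# Without na.rm = TRUE the call stops with an error, so catch it to show that.
res <- tryCatch(quantile(x), error = function(e) "error")

# Default algorithm (type = 7) after dropping the NA.
q_default <- quantile(x, na.rm = TRUE)

# type = 1 is one of the discontinuous algorithms: it returns a sample value.
q_type1 <- quantile(x, probs = 0.25, na.rm = TRUE, type = 1)

# names = FALSE returns a bare numeric vector without the "10%" style labels.
q_plain <- quantile(x, probs = c(0.1, 0.9), na.rm = TRUE, names = FALSE)
```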

Make tables

> table(restData$zipCode, useNA = "ifany")

-21226  21201  21202  21205  21206  21207  21208  21209  21210  21211  21212  21213  21214  21215  21216  21217  21218  21220
     1    136    201     27     30      4      1      8     23     41     28     31     17     54     10     32     69      1

> table(restData$councilDistrict, restData$zipCode)

     -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212 21213 21214 21215 21216 21217 21218 21220 21222 21223
  1       0     0    37     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     7     0
  2       0     0     0     3    27     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  3       0     0     0     0     0     0     0     0     0     0     0     2    17     0     0     0     3     0     0     0
  4       0     0     0     0     0     0     0     0     0     0    27     0     0     0     0     0     0     0     0     0
  5       0     0     0     0     0     3     0     6     0     0     0     0     0    31     0     0     0     0     0     0
  6       0     0     0     0     0     0     0     1    19     0     0     0     0    15     1     0     0     0     0     0
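
A small sketch of what useNA changes, on made-up vectors (zip and district are invented here). By default table drops missing values silently, which is why useNA = "ifany" is worth adding when checking data quality:

```r
zip      <- c(21201, 21202, 21201, NA, 21202, 21201)   # made-up codes
district <- c(1, 2, 1, 2, 2, NA)

t_default <- table(zip)                    # the NA is dropped from the counts
t_ifany   <- table(zip, useNA = "ifany")   # the NA gets its own cell

# Two-way table: rows are districts, columns are zip codes.
t_two <- table(district, zip, useNA = "ifany")
```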

Check for missing values

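sum(is.na(restData$councilDistrict))   # number of missing values
any(is.na(restData$councilDistrict))   # TRUE if at least one value is missing
all(restData$zipCode > 0)              # FALSE here: the data contains the invalid zip code -21226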

Row and column sums

colSums(is.na(restData))             # number of missing values in each column
all(colSums(is.na(restData)) == 0)   # TRUE only if no column has a missing value
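
A sketch of the same checks on a made-up data frame (df and its columns are invented for illustration). It also shows that all() returns NA rather than TRUE or FALSE when the only failing entries are missing values:

```r
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

n_missing_a <- sum(is.na(df$a))      # missing values in one column
per_col     <- colSums(is.na(df))    # missing values per column
per_row     <- rowSums(is.na(df))    # missing values per row

check_raw   <- all(df$a > 0)                 # NA, because one comparison is NA
check_clean <- all(df$a > 0, na.rm = TRUE)   # ignores the missing entry

complete <- df[complete.cases(df), ]   # keep only rows without any NA
```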

Values with specific characteristics

> table(restData$zipCode %in% c("21212"))

FALSE  TRUE
 1299    28

> table(restData$zipCode %in% c("21212", "21213"))

FALSE  TRUE
 1268    59

> restData[restData$zipCode %in% c("21212", "21213"), ]
                                     name zipCode                neighborhood councilDistrict policeDistrict
29                      BAY ATLANTIC CLUB   21212                    Downtown              11        CENTRAL
39                            BERMUDA BAR   21213               Broadway East              12        EASTERN
92                              ATWATER'S   21212   Chinquapin Park-Belvedere               4       NORTHERN
111            BALTIMORE ESTONIAN SOCIETY   21213          South Clifton Park              12        EASTERN
187                              CAFE ZEN   21212                    Rosebank               4       NORTHERN
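
A sketch of why %in% is convenient for this kind of filter, on a made-up vector (zip is invented here). Unlike ==, %in% never returns NA, so it is safe to use directly as a row index:

```r
zip <- c(21212, 21213, 21201, NA, 21212)   # made-up codes with one NA

keep <- zip %in% c("21212", "21213")   # numbers are matched as text; NA gives FALSE
eq   <- zip == 21212                   # the NA entry stays NA

n_keep   <- sum(keep)      # count of matching entries
selected <- zip[keep]      # only the matching codes
with_na  <- zip[eq]        # indexing with == carries an NA into the result
```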

Cross tabs

data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
DF
summary(DF)

xt <- xtabs(Freq ~ Gender + Admit, data = DF)   ## the left-hand side (Freq) must be a numeric column of counts; it is summed within each cell
xt
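
A sketch of two common follow-ups on the cross tab, assuming the built-in UCBAdmissions data set as above: addmargins appends row and column totals, and prop.table with margin = 1 turns each row into proportions.

```r
DF <- as.data.frame(UCBAdmissions)
xt <- xtabs(Freq ~ Gender + Admit, data = DF)

with_totals <- addmargins(xt)               # adds a "Sum" row and column
rates       <- prop.table(xt, margin = 1)   # proportions within each gender

male_admitted <- xt["Male", "Admitted"]     # cells can be read by name
```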

Flat tables

> warpbreaks$replicate <- rep(1:9, len = 54)
> xt = xtabs(breaks ~., data = warpbreaks)        ## equivalent to xtabs(breaks ~ wool + tension + replicate, data = warpbreaks)
> xt
, , replicate = 1

    tension
wool  L  M  H
   A 26 18 36
   B 27 42 20

, , replicate = 2

    tension
wool  L  M  H
   A 30 21 21
   B 14 26 21

, , replicate = 3

    tension
wool  L  M  H
   A 54 29 24
   B 29 19 24


> ftable(xt)
             replicate  1  2  3  4  5  6  7  8  9
wool tension                                     
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28
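
A sketch of how the layout of the flat table can be chosen, working on a copy of warpbreaks (the names wb, ft_default and ft_swapped are illustrative). By default ftable puts the last variable in the columns; row.vars selects which variables go in the rows instead.

```r
wb <- warpbreaks                      # work on a copy of the built-in data
wb$replicate <- rep(1:9, len = 54)

xt <- xtabs(breaks ~ ., data = wb)    # 2 x 3 x 9 table of summed breaks

ft_default <- ftable(xt)                           # rows: wool x tension, columns: replicate
ft_swapped <- ftable(xt, row.vars = "replicate")   # rows: replicate, columns: wool x tension
```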

Size of a data set

> fakeData = rnorm(1e5)
> object.size(fakeData)
800040 bytes
> print(object.size(fakeData), units = "Mb")
0.8 Mb
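
A sketch comparing the memory used by double and integer vectors of the same length (x_dbl and x_int are illustrative names). The exact byte counts include a small header whose size depends on the R version and platform, so none are stated here.

```r
x_dbl <- numeric(1e5)   # doubles: 8 bytes per element
x_int <- integer(1e5)   # integers: 4 bytes per element

size_dbl <- object.size(x_dbl)
size_int <- object.size(x_int)

format(size_dbl, units = "Kb")     # size as text in a fixed unit
format(size_dbl, units = "auto")   # let R pick a readable unit
```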

© Copyright belongs to the author
