
『Data Science』R Language Study Notes: Basic Syntax

灰大羊
Published 2016/07/12 22:06

Data Types

Data Object & Vector

x <- c(0.5, 0.6)        ## numeric
x <- c(TRUE, FALSE)     ## logical
x <- c(T, F)            ## logical
x <- c("a","b","c")     ## character
x <- 9:29               ## integer
x <- c(1+0i, 2+4i)      ## complex

x <- vector("numeric", length = 10) ## create a numeric vector of length 10 (filled with 0).

x <- 0.6    ## assign a numeric value to "x".
class(x)    ## print the class of "x" ("numeric").

x <- 1:10   ## create an integer vector.
as.character(x)   ## explicitly coerce it to character; "x" itself is unchanged.

List

x <- list(1, "a", TRUE, 1 + 4i)   ## a list can hold elements of different classes.

Matrices

Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol).

m <- matrix(nrow = 2, ncol = 3)
n <- matrix(1:6, nrow = 2, ncol = 3)

dim(m)          ## get the dimensions (nrow, ncol) of the matrix.

attributes(m)   ## list the attributes of the matrix (here only "dim").

m <- 1:10           ## create a new integer vector, from 1 to 10.
dim(m) <- c(2,5)    ## turn the vector "m" into a matrix by assigning the dimensions (nrow = 2, ncol = 5) to it.
m                   ## print the value of "m".

x <- 1:3
y <- 10:12
cbind(x, y)     ## create a matrix by binding "x" and "y" as columns; the result has 3 rows and 2 columns.
rbind(x, y)     ## create a matrix by binding "x" and "y" as rows; the result has 2 rows and 3 columns.

Factors

Factors are used to represent categorical data. One can think of a factor as an integer vector where each integer has a label.

x <- factor(c("yes", "yes", "yes", "yes", "no", "no"))  ## create a factor from a character vector.
x                                                       ## print the factor.
table(x)                                                ## count how many times each level occurs.
unclass(x)                                              ## show the underlying integer codes and the "levels" attribute.

x <- factor(c("yes", "yes", "no"), levels = c("yes", "no"))  ## create a factor and set the order of its levels explicitly.
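
The effect of the levels argument is easiest to see side by side. The following is a minimal sketch; the variable names answers, default_order and explicit_order are illustrative and do not come from the original notes. Without levels, R sorts the levels alphabetically, which changes the integer codes stored underneath.

```r
answers <- c("yes", "yes", "no")

default_order  <- factor(answers)                           # levels sorted alphabetically
explicit_order <- factor(answers, levels = c("yes", "no"))  # levels in the order given

levels(default_order)        # "no"  "yes"
levels(explicit_order)       # "yes" "no"

as.integer(default_order)    # 2 2 1
as.integer(explicit_order)   # 1 1 2
```

The labels printed for both factors are the same; only the level order and the underlying codes differ.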

Missing Values

Missing values are denoted by NA, or by NaN for undefined mathematical operations.

is.na()     ## test whether each element is NA.
is.nan()    ## test whether each element is NaN.

x <- c(1, 2, NaN, NA, 4)    ## create a vector to test the functions is.na() and is.nan().
is.na(x)                    ## NA values have a class also, so there are integer NA, character NA, etc.
is.nan(x)                   ## a NaN value is also NA, but the converse is not true.

Complete example:

> x <- c(1, 2, NA, 10, 3)
> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE FALSE  TRUE FALSE FALSE

Data Frames

Data frames are used to store tabular data.

  • They are represented as a special type of list where every element of the list has to have the same length.
  • Each element of the list can be thought of as a column and the length of each element of the list is the number of rows.
  • Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class.
  • Data frames also have a special attribute called row.names.
  • Data frames are usually created by calling read.table() or read.csv().
  • Can be converted to a matrix by calling data.matrix().
> x <- data.frame(foo = 1:4, bar = c(T,T,F,F))  ## create a data frame with two columns and four rows.
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
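
As a small sketch of the properties listed above (using only base R functions; the variable name m is illustrative), the same data frame can be inspected and converted to a matrix like this:

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

nrow(x)            # 4: the common length of the columns
ncol(x)            # 2: one column per list element
row.names(x)       # "1" "2" "3" "4" (the default row names)
sapply(x, class)   # foo is "integer", bar is "logical"

# data.matrix() converts every column to numbers, so the logical
# column becomes 1 (TRUE) and 0 (FALSE).
m <- data.matrix(x)
m
```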

Names

R objects can also have names, which is very useful for writing readable code and self-describing objects.

> x <- 4:6                              ## Create an integer vector 'x' which has three elements.
> names(x) <- c("foo", "bar", "norf")   ## Assign names to vector 'x'.
> x                                     ## Print the value of 'x'.
 foo  bar norf 
   4    5    6

Data Reading

Reading Data

  • read.table, read.csv, for reading tabular data, which return a data.frame object.
  • readLines, for reading lines of a text file.
  • source, for reading in R code files (inverse of dump).
  • dget, for reading in R code files (inverse of dput).
  • load, for reading in saved workspaces.
  • unserialize, for reading single R objects in binary form.

read.table

Description: Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

Main Arguments:

  • file, the name of the file to read, or a connection.
  • header, a logical value indicating whether the file has a header line.
  • sep, a string indicating how the columns are separated, e.g. ",".
  • colClasses, a character vector indicating the class of each column in the dataset.
  • nrows, the number of rows to read.
  • comment.char, a character string indicating the comment character.
  • skip, the number of lines to skip from the beginning.
  • stringsAsFactors, should character variables be coded as factors?

Usage:

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)
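
To see how the main arguments work together, here is a minimal sketch that reads a small table from an in-memory string through the text argument of read.table, so that no file is needed. The sample data and the names raw and df are made up for illustration.

```r
# Made-up sample: one comment line, a header line, and three records.
raw <- "# sample data\nid,name,score\n1,ann,3.5\n2,bob,4.0\n3,cy,NA"

df <- read.table(text = raw,
                 header = TRUE,          # the first non-comment line holds the column names
                 sep = ",",              # columns are separated by commas
                 comment.char = "#",     # lines starting with # are ignored
                 colClasses = c("integer", "character", "numeric"))

df
sapply(df, class)   # id: integer, name: character, score: numeric
```

Because colClasses is given, R does not have to guess the column types, and the string NA in the score column is read as a missing value.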

Writing Data

Description: write.table prints its required argument x (after converting it to a data frame if it is not one nor a matrix) to a file or connection.

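Main Functions:

  • write.table, for writing tabular data to a file or connection.
  • writeLines, for writing lines to a text file.
  • dump, for writing a textual representation of several R objects (inverse of source).
  • dput, for writing a textual representation of a single R object (inverse of dget).
  • save, for saving R objects to a file (inverse of load).
  • serialize, for converting single R objects to binary form (inverse of unserialize).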

Usage:

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")

write.csv(...)
write.csv2(...)
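
A short round trip shows how the writing and reading functions pair up. This is only a sketch: the data frame and the names df, path, back and same are illustrative, and the file is written to a temporary location.

```r
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")              # temporary file, removed at the end
write.csv(df, file = path, row.names = FALSE)   # do not write the row names as a column

back <- read.csv(path, stringsAsFactors = FALSE)
same <- isTRUE(all.equal(df, back))             # TRUE if the data survived the round trip
same

unlink(path)                                    # clean up the temporary file
```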

Reading Large Tables

  • Read the help page for read.table, which contains many hints.
  • Make a rough calculation of the memory required to store your dataset. If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.
  • Set comment.char = "" if there are no commented lines in your file.
  • Use the colClasses argument. Specifying this option instead of using the default can make read.table run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are "numeric", for example, then you can just set colClasses = "numeric". A quick and dirty way to figure out the classes of each column is the following:
> initial <- read.table("db.txt", nrows = 100, sep = "\t")
> classes <- sapply(initial, class)
> tabAll <- read.table("db.txt", sep = "\t", colClasses = classes)
  • Set nrows. This doesn't make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool wc to calculate the number of lines in a file.
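
The rough memory calculation mentioned above can be sketched as follows, assuming a hypothetical table of 1,500,000 rows and 120 columns in which every column is numeric (8 bytes per value). The figures are illustrative, and the result covers only the raw data, so more memory is needed in practice.

```r
n_rows <- 1500000          # assumed number of rows
n_cols <- 120              # assumed number of numeric columns

bytes <- n_rows * n_cols * 8   # each numeric value takes 8 bytes
gib   <- bytes / 2^30          # convert bytes to GiB

round(gib, 2)                  # 1.34
```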

Reading Data Formats

dput and dget

> y <- data.frame(a = 1, b = "a") ## create a `data.frame` object to pass to `dput`
> dput(y)                         ## print a textual (deparsed) representation of the object

structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a", 
"b"), row.names = c(NA, -1L), class = "data.frame")

> dput(y, file = 'y.R')           ## write the deparsed object to a file named 'y.R'
> new.y <- dget('y.R')            ## read the object back from the file 'y.R'
> new.y                           ## print the object restored from 'y.R'

  a b
1 1 a

dump

Multiple objects can be deparsed using the dump function and read back in using source.

> x <- "foo"                          ## create the first data object
> y <- data.frame(a = 1, b = "a")     ## create the second data object
> dump(c("x", "y"), file = "data.R")  ## store both data objects in a file called 'data.R'
> rm(x, y)                            ## remove both data objects from memory
> source("data.R")                    ## read the dumped file 'data.R' back in
> y                                   ## print the data object 'y' restored from 'data.R'
  a b
1 1 a
> x                                   ## print the data object 'x' restored from 'data.R'
[1] "foo"

Connections: Interfaces to the Outside World

Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.

  • file, opens a connection to a file
  • gzfile, opens a connection to a file compressed with gzip
  • bzfile, opens a connection to a file compressed with bzip2
  • url, opens a connection to a webpage.
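> con <- file('db.txt', 'r')     ## open a read-only connection to the file
> readLines(con)                 ## read all lines through the connection
> close(con)                     ## close the connection when finished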
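
The same pattern works for compressed files. The sketch below writes three lines to a gzip-compressed temporary file and reads the first two back through a gzfile connection; the file contents and the names path and first_two are illustrative.

```r
path <- tempfile(fileext = ".gz")

con <- gzfile(path, "w")                       # open a connection for writing
writeLines(c("alpha", "beta", "gamma"), con)
close(con)

con <- gzfile(path, "r")                       # open a connection for reading
first_two <- readLines(con, n = 2)             # read only the first two lines
close(con)

first_two                                      # "alpha" "beta"
unlink(path)                                   # remove the temporary file
```

Reading through a connection with n set is a convenient way to look at the start of a large file without loading all of it.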

Subsetting

  • [ always returns an object of the same class as the original; it can be used to select more than one element (there is one exception).
  • [[ is used to extract elements of a list or a data frame; it can only be used to extract a single element, and the class of the returned object will not necessarily be a list or data frame.
  • $ is used to extract elements of a list or data frame by name; semantics are similar to that of [[.

Basic

> x <- c("a", "b", "c", "d", "e")
> x[1]
[1] "a"
> x[2]
[1] "b"
> x[1:3]
[1] "a" "b" "c"
> x[x > "a"]
[1] "b" "c" "d" "e"
> u <- x > "a"
> u
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> x[u]
[1] "b" "c" "d" "e"

Lists

> x <- list(foo = 1:4, bar = 0.6)

> x[1]
$foo
[1] 1 2 3 4
> x[[1]]
[1] 1 2 3 4
> x[[2]]
[1] 0.6

> x$bar
[1] 0.6
> x$foo
[1] 1 2 3 4

> x[["bar"]]
[1] 0.6
> x["bar"]
$bar
[1] 0.6
> x <- list(foo = 1:4, bar = 0.6, baz = "hello")

> x[c(1, 3)]
$foo
[1] 1 2 3 4
$baz
[1] "hello"

> name <- "foo"
> x[[name]]
[1] 1 2 3 4
> x$name          ## `$` looks for an element literally named "name" and does not evaluate the variable `name`, so the result is NULL.
NULL
> x$foo
[1] 1 2 3 4

Matrices

Matrices can be subsetted in the usual way with (i,j) type indices.

> x <- matrix(1:6, 2, 3)

> x[1,2]
[1] 3

> x[1,]
[1] 1 3 5

> x[,2]
[1] 3 4

> x[1, 2, drop = FALSE]
     [,1]
[1,]    3

> x[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]    1    3    5

Partial Matching

Partial matching of names is allowed with [[ and $.

> x <- list(aardvark = 1:5)

> x$a
[1] 1 2 3 4 5

> x[["a"]]
NULL

> x[["a", exact = FALSE]]
[1] 1 2 3 4 5

Removing NA Values

> x <- c(1, 2, NA, 4, NA, 5)

> bad <- is.na(x)

> x[!bad]
[1] 1 2 4 5

Use built-in function complete.cases() to get a logical vector indicating which cases are complete, i.e., have no missing values.

> x <- c(1, 2, NA, 4, NA, 5)
> y <- c("a", "b", NA, "d", NA, "f")

> good <- complete.cases(x, y)

> good
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE

> x[good]
[1] 1 2 4 5
> y[good]
[1] "a" "b" "d" "f"

From data frame

> airquality[1:6,]                    ## show the first six rows of the built-in data frame `airquality`
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5   ## this row contains NA values
6    28      NA 14.9   66     5   6   ## this row contains an NA value

> good <- complete.cases(airquality)  ## rows 5 and 6 contain NA values, so they are marked FALSE and filtered out below.

> airquality[good, ][1:6, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4   
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8

Vectorized Operations

  • Many operations in R are vectorized, making code more efficient, concise, and easier to read.
> x <- 1:4; y <- 6:9

> x + y
[1]  7  9 11 13

> x > 2
[1] FALSE FALSE  TRUE  TRUE

> y >= 2
[1] TRUE TRUE TRUE TRUE

> y == 8
[1] FALSE FALSE  TRUE FALSE

> x * y
[1]  6 14 24 36

> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444
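
The same idea applies to matrices. As a small sketch (the names elementwise and product are illustrative), the ordinary operators work element by element, while true matrix multiplication uses the %*% operator:

```r
x <- matrix(1:4, 2, 2)          # columns are (1, 2) and (3, 4)
y <- matrix(rep(10, 4), 2, 2)   # every entry is 10

elementwise <- x * y            # element-wise multiplication
product     <- x %*% y          # true matrix multiplication

elementwise
#      [,1] [,2]
# [1,]   10   30
# [2,]   20   40

product
#      [,1] [,2]
# [1,]   40   40
# [2,]   60   60
```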

Logic Control

if-else

> x <- 5              ## the condition must be a single TRUE or FALSE
> if (x > 3) {
+     y <- 10
+ } else {
+     y <- 0
+ }

For

> x <- c("a", "b", "c", "d")
> for (i in 1:4) {
+     print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"

> for(i in seq_along(x)) {
+     print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"

> for(letter in x){
+     print(letter)
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"

> for(i in 1:4) print(x[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"

While

> count <- 0
> while(count < 10) {
+     print(count)
+     count <- count + 1
+ }
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9

> z <- 5
> while(z >=3 && z <= 10) {
+     print(z)
+     coin <- rbinom(1, 1, 0.5)
+     
+     if(coin == 1) {
+         z <- z + 1
+     } else {
+         z <- z - 1
+     }
+ }
[1] 5
[1] 4
[1] 3
[1] 4
[1] 5
[1] 4
[1] 5
[1] 4
[1] 3

Repeat

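> x0 <- 1
> tol <- 1e-8
> repeat {
+     x1 <- computeEstimate()     ## placeholder for a user-defined function; it is not built into R
+     if(abs(x1 - x0) < tol) {
+         break                   ## leave the loop once the estimate has converged
+     } else {
+         x0 <- x1
+     }
+ }

> for(i in 1:100) {
+     if(i <= 20) {
+         next        ## skip the first 20 iterations
+     }
+ }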

Function

> add2 <- function(x, y) {
+   x + y
+ }

> add2(2,3)
[1] 5
> above <- function(x, n = 10) {
+   use <- x > n
+   x[use]
+ }

> x <- 1:20
> above(x, 10)
 [1] 11 12 13 14 15 16 17 18 19 20
> columnmean <- function(y, removeNA = TRUE) {
+   nc <- ncol(y)
+   means <- numeric(nc)
+   for(i in 1:nc) {
+     means[i] <- mean(y[,i], na.rm = removeNA)
+   }
+   means                       ## return result
+ }

> columnmean(airquality)        ## compute the mean of each column of `airquality`.
[1]  42.129310 185.931507   9.957516  77.882353   6.993464  15.803922

The ... Argument

... is often used when extending another function and you don't want to copy the entire argument list of the original function.

myplot <- function(x, y, type = "l", ...) {
  plot(x, y, type = type, ...)
}

The ... argument is also necessary when the number of arguments passed to the function cannot be known in advance.

> args(paste)     ## view the arguments of the function `paste`.
function (..., sep = " ", collapse = NULL) 
NULL

> args(cat)
function (..., file = "", sep = " ", fill = FALSE, labels = NULL, 
    append = FALSE) 
NULL

> paste("a", "b", sep = ":")
[1] "a:b"
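> paste("a", "b", se = ":")   ## arguments after ... must be named in full; `se` is not matched to `sep`, so ":" is pasted as a third string.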
[1] "a b :"
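
Both uses of ... can be sketched with two small functions. The names count_args and paste_colon are hypothetical helpers written for this note, not functions from base R.

```r
# Hypothetical helper: accepts any number of arguments through ...
count_args <- function(...) {
  args <- list(...)        # capture everything passed in as a list
  length(args)
}

count_args(1, "a", TRUE)   # 3

# Hypothetical wrapper around paste(): forwards ... and adds one option.
paste_colon <- function(..., upper = FALSE) {
  out <- paste(..., sep = ":")
  if (upper) toupper(out) else out
}

paste_colon("a", "b")                 # "a:b"
paste_colon("a", "b", upper = TRUE)   # "A:B"
paste_colon("a", "b", up = TRUE)      # "a:b:TRUE", because up is absorbed by ...
```

The last call behaves like the paste example with se: an argument that comes after ... is only recognised when its name is written out in full.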

Scoping Rules

A Diversion on Binding Values to Symbol

When R tries to bind a value to a symbol, it searches through a series of environments to find the appropriate value. When you are working on the command line and need to retrieve the value of an R object, the order is roughly

  1. Search the global environment for a symbol name matching the one requested.
  2. Search the namespaces of each of the packages on the search list.

Free Variable

> z <- 1

> lm <- function(x, y) {
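+   x + y + z   ## z is a free variable: it is neither an argument nor a local variable, so R looks it up in the environment where the function was defined.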
+ }
 
> lm(1, 1)
[1] 3
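
R uses lexical scoping, so a free variable is looked up in the environment where the function was defined rather than where it is called. A minimal sketch of this, using an illustrative function make.power that is not part of the example above:

```r
make.power <- function(n) {
  pow <- function(x) {
    x^n            # n is a free variable inside pow
  }
  pow              # return the inner function
}

cube   <- make.power(3)   # n is 3 in the environment of cube
square <- make.power(2)   # n is 2 in the environment of square

cube(3)      # 27
square(3)    # 9

get("n", environment(cube))   # 3: the value remembered by cube
```

Each call to make.power creates a new environment, which is why cube and square remember different values of n.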

Coding Standard

  1. Always use text files / text editor.
  2. Indent your code.
  3. Limit the width of your code.
  4. Limit the length of your function.

Dates and Times

  • Dates are represented by the Date class
  • Times are represented by the POSIXct or the POSIXlt class
  • Dates are stored internally as the number of days since 1970-01-01
  • Times are stored internally as the number of seconds since 1970-01-01
> Sys.time()
[1] "2016-07-13 22:22:37 CST"
> timeNow <- Sys.time()
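> datestring <- c(timeNow)                        ## note: this is still a POSIXct object, not a character string
> x <- strptime(datestring, "%B %d, %Y %H:%M")    ## try to parse it with the format "month day, year hour:minute"
> x                                               ## NA, because the text form of timeNow (e.g. "2016-07-13 22:22:37") does not match that format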
[1] NA
> class(x)
[1] "POSIXlt" "POSIXt" 
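
For comparison, here is a sketch in which the format string matches the input, so strptime returns real times. The date strings are made up, a numeric format is used so that the result does not depend on the locale's month names, and tz = "UTC" is an assumption made to keep the arithmetic free of daylight-saving effects.

```r
datestring <- c("2012-01-10 10:40", "2011-12-09 09:10")
x <- strptime(datestring, "%Y-%m-%d %H:%M", tz = "UTC")

class(x)                   # "POSIXlt" "POSIXt"

# Dates are stored as the number of days since 1970-01-01.
d <- as.Date("1970-01-02")
unclass(d)                 # 1

# Times support arithmetic: the gap between the two time stamps in days.
elapsed <- as.numeric(difftime(x[1], x[2], units = "days"))
elapsed                    # 32.0625 (32 days and 1.5 hours)
```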

Loop Functions

  • lapply: Loop over a list and evaluate a function on each element.
  • sapply: Same as lapply but try to simplify the result.
  • apply: Apply a function over the margins of an array.
  • tapply: Apply a function over subsets of a vector.
  • mapply: Multivariate version of lapply.
  • An auxiliary function split is also useful, particularly in conjunction with lapply.

lapply

lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

> lapply
function (X, FUN, ...) 
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X)) 
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<bytecode: 0x000000000b606e90>
<environment: namespace:base>

For example:

> x <- list(a = 1:5, b = rnorm(10))
> lapply(x, mean)
$a
[1] 3

$b
[1] -0.1931699
  • rnorm generates random deviates from the normal distribution with mean equal to mean and standard deviation equal to sd; dnorm, pnorm, and qnorm give the density, distribution function, and quantile function.
  • runif, dunif, punif, qunif: These functions provide information about the uniform distribution on the interval from min to max. dunif gives the density, punif gives the distribution function, qunif gives the quantile function, and runif generates random deviates.
> x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))

> lapply(x, function(elt) elt[,1])
$a
[1] 1 2

$b
[1] 1 2 3

sapply

sapply will try to simplify the result of lapply if possible.

  • If the result is a list where every element is length 1, then a vector is returned.
  • If the result is a list where every element is a vector of the same length (>1), a matrix is returned.
  • If it can't figure things out, a list is returned.
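
The three cases above can be sketched with a small deterministic list. The data and the names means, ranges and mixed are illustrative.

```r
x <- list(a = 1:5, b = c(2, 4, 6))

lapply(x, mean)               # always a list: $a is 3, $b is 4

means  <- sapply(x, mean)     # every result has length 1, so a named vector
means                         # a = 3, b = 4

ranges <- sapply(x, range)    # every result has length 2, so a 2 x 2 matrix
ranges                        # column a is (1, 5), column b is (2, 6)

mixed  <- sapply(x, function(v) v[v > 2])   # lengths differ (3 and 2), so a list
mixed
```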

apply

apply is used to evaluate a function (often an anonymous one) over the margins of an array.

  • It is most often used to apply a function to the rows or columns of a matrix.
  • It can be used with general arrays, e.g. taking the average of an array of matrices.
  • It is not really faster than writing a loop, but it works in one line!
> str(apply)
function (X, MARGIN, FUN, ...)  
  • X is an array
  • MARGIN is an integer vector indicating which margins should be "retained"
  • FUN is a function to be applied.
  • ... is for other arguments to be passed to FUN

> x <- matrix(1:4, 2, 2)
> x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> apply(x, 1, mean)
[1] 2 3
> apply(x, 2, mean)
[1] 1.5 3.5
  • MARGIN = 1: Compute the mean of every row, and return a vector as the result.
  • MARGIN = 2: Compute the mean of every column, and return a vector as the result.

Other shortcuts.

  • rowSums = apply(x, 1, sum)
  • rowMeans = apply(x, 1, mean)
  • colSums = apply(x, 2, sum)
  • colMeans = apply(x, 2, mean)

apply also works on multi-dimensional arrays. In the code below, a vector is passed as the MARGIN value so that the first two dimensions are retained and the mean is taken over the third.

> a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
> apply(a, c(1, 2), mean)
           [,1]        [,2]
[1,]  0.6869065 -0.66529430
[2,] -0.1136978 -0.04124547

mapply

mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments.

> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
  • FUN is a function to apply.
  • ... contains arguments to apply over.
  • MoreArgs is a list of other arguments to FUN.
  • SIMPLIFY indicates whether the result should be simplified.
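
As a brief sketch of these arguments (the names reps and scaled and the anonymous function are illustrative), mapply walks over several vectors in parallel, and MoreArgs supplies values that stay the same for every call:

```r
# Equivalent to list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)),
# but written as a single call.
reps <- mapply(rep, 1:4, 4:1)
reps

# MoreArgs passes the same scale to every call of the function.
scaled <- mapply(function(x, y, scale) (x + y) * scale,
                 1:3, 4:6,
                 MoreArgs = list(scale = 10))
scaled    # 50 70 90
```

The first result stays a list because its elements have different lengths; the second is simplified to a vector.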

tapply

tapply is used to apply a function over subsets of a vector.
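
A minimal sketch with made-up data (the names values, groups and group_means are illustrative): the second argument defines the subsets, and the function is applied to each one.

```r
values <- c(1, 2, 3, 4, 5, 6)
groups <- factor(c("a", "a", "b", "b", "c", "c"))

group_means <- tapply(values, groups, mean)
group_means
#   a   b   c
# 1.5 3.5 5.5
```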

split

split divides the data in the vector x into the groups defined by f. The replacement forms replace values corresponding to such a division. unsplit reverses the effect of split.

> s <- split(airquality, airquality$Month)
> sapply(s, function(x) colMeans(x[,c("Ozone", "Wind")]))
             5        6        7        8     9
Ozone       NA       NA       NA       NA    NA
Wind  11.62258 10.26667 8.941935 8.793548 10.18
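
The Ozone row is NA because that column contains missing values in every month. As a sketch of one way to handle this, na.rm = TRUE can be passed to colMeans so that missing values are dropped before averaging (the name res is illustrative):

```r
s <- split(airquality, airquality$Month)   # one data frame per month

res <- sapply(s, function(x) {
  colMeans(x[, c("Ozone", "Wind")], na.rm = TRUE)   # ignore missing values
})

res          # a 2 x 5 matrix: rows Ozone and Wind, one column per month
```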
