R Instructions for Chapter 4
Transcription
R Instructions for Chapter 4
MAT 4375 Multivariate statistical methods Summary of R commands (Chapter 4) Univariate Normal Distribution 1. To generate a random sample of size n from a univariate normal distribution N (µ, σ 2 ) with mean µ and standard deviation σ (say n = 100, µ = 0 and σ = 1): rnorm(100,mean=0,sd=1) or simply rnorm(100,0,1) 2. To compute the value of the N (µ, σ 2 ) density function: (x − µ)2 1 exp − f (x) = √ 2σ 2 2πσ 2 at point x: (say µ = 0, σ = 1 and x = 1.5) dnorm(1.5,mean=0,sd=1) or simply dnorm(1.5,0,1) 3. To plot the graph of the density function of N (µ, σ 2 ) between the points a and b (say we use again µ = 0, σ = 1 and the points a = −3 and b = 3): curve(dnorm(x,mean=0,sd=1),-3,3) or simply curve(dnorm(x,0,1),-3,3) 1 If we omit writing the values a and b, by default the graph will be between a = 0 and b = 1. 4. To compute a normal probability, i.e to find the value P (X ≤ x) for given x, where X ∼ N (µ, σ 2 ): (say µ = 0, σ = 1 and x = 1.96) pnorm(1.96,mean=0,sd=1) or simply pnorm(1.96,0,1) 5. To compute a normal quantile, i.e. to find the value x such that P (X ≤ x) = p for a given value p ∈ (0, 1), where X ∼ N (µ, σ 2 ): (say µ = 0, σ = 1 and p = 0.975): qnorm(0.975,mean=0,sd=1) or simply qnorm(0.975,0,1) 6. To do a normal QQ plot of univariate data (saved in a variable x), there are two methods: Method 1: qqnorm(x) abline(mean(x),sd(x)) # Do the QQ plot # Plot the line of best fit Method 2: (assume that n is the sample size) p=ppoints(n) # Compute the probabilities p_j=(j-3/8)/(n+1/4) z=qnorm(p,0,1) # Compute the normal quantiles corresponding to p_j y=sort(x) # Arrange the data in increasing order plot(z,y) # Plot the pairs (z_j,y_j) for j=1,...,n abline(mean(x),sd(x)) #Plot the line of best fit 2 Multivariate Normal distribution 1. To generate a random sample of size n from a multivariate normal distribution Np (µ, Σ) with mean vector µ and covariance matrix Σ. Say p = 2, n = 100, µ = (0, 0)0 and 1 0.5 Σ= 0.5 1 We need to install the package MASS: install.packages("MASS") library(MASS) mvrnorm(100,mu=c(0,0),Sigma=matrix(c(1,0.5,0.5,1),2)) In the last line, we could simply type: mvrnorm(100,c(0,0),matrix(c(1,0.5,0.5,1),2)) 2. To plot the Np (µ, Σ) density function: f (x) = 1 (2π)p/2 |Σ|1/2 1 exp{− (x − µ)0 Σ−1 (x − µ)} 2 In the example below, we take p = 2, µ = (0, 0)0 and Σ as above. #Specifies the parameters of the bivariate normal mu1 = 0 # expected value of x mu2 = 0.5 # expected value of y sigma1 = 0.5 # variance of x sigma2 = 2 # variance of y rho = 0.5 # corr(x, y) #Gives the range for x-axis and y-axis xm =-3 xp = 3 ym = -3 yp = 3 3 #Specifies a sequence of values (x,y) for which we will do the plot x = seq(xm, xp, length= as.integer((xp + abs(xm)) * 10)) y = seq(ym, yp, length= as.integer((yp + abs(ym)) * 10)) #Defines the density function bivariate = function(x,y){ term1 = 1 / (2 * pi * sigma1 * sigma2 * sqrt(1 - rho^2)) term2 = (x - mu1)^2 / sigma1^2 term3 = -(2 * rho * (x - mu1)*(y - mu2))/(sigma1 * sigma2) term4 = (y - mu2)^2 / sigma2^2 z = term2 + term3 + term4 term5 = term1 * exp((-z / (2 *(1 - rho^2)))) return (term5) } # Computes the density values z =outer(x,y,bivariate) # Plot persp(x, y, z) 3. To compute a quantile corresponding to the χ2k distribution, i.e. to find a value x such that P (X ≤ x) = p for a given value p ∈ (0, 1), where X ∼ χ2k : (here k = 7 and p = 0.95) qchisq(0.95,df=7) or simply qchisq(0.95,7) 4. Consider a variable x (which contains observations x1 , . . . , xn ) and a a fixed value. To compute how many observations xj are smaller than a, we first create a boolean variable b which takes values TRUE (i.e. 1) if xj ≤ a and FALSE (i.e. 0) if xj > a, and then sum all the values of b: (here we take a = 1.5) b=x<1.5 sum(b) # Create the boolean variable # Count how many observations are smaller than 1.5 4 5. Suppose we have data consisting of n p-dimensional vectors x1 , . . . , xn with sample mean x and sample covariance matrix S. We will assume here that data is saved in a data frame called “table” whose columns are called V1 , . . . , Vp . For simplicity we work only with the first two variables in the table, so that p = 2. a) To compute the squared Mahalanobis distances d2j = (xj − x)S−1 (xj − x), x1=table$V1 # x2=table$V2 # x=cbind(x1,x2) # m=colMeans(x) # S=cov(x) # d=mahalanobis(x,m,S) print(d) j = 1, . . . , n Extract variable x1 from table Extract variable x2 from table Create a new variable x from x1 and x2 Compute the column means of x Compute the covariance matrix of x # Compute the squared Mahalanobis distances # Print the values of d b) To compute the number and the proportion of observations which are smaller than χ22 (0.50) we use the following commands: chisq.quantile=qchisq(0.50,df=2) #Calculate the chi-square quantile print(chisq.quantile) number.obs=sum(d<chisq.quantile) #Calculate the number of observations print(number.obs) proportion=number.obs/n # Calculate the proportion of observations print(proportion) b) To see if the data in columns V1 and V2 has a bivariate normal distribution, we use the following commands: p=ppoints(n) # Compute the probabilities p_j=(j-3/8)/(n+1/4) q=qchisq(p,df=2) # Compute the chi-square quantiles corresponding to p_j y=sort(d) # Arrange d in increasing order plot(q,d) # Plot the pairs (q_j,d_j) for j=1,...,n abline(0,1) # Plot the line of best fit 5