IST 387 Fall 2020 - Details

Levels:

1) Chapter 1

Questions:

20 questions

True or False: You need to "library" a package each time you start a new R session.

True

How would you check the distribution of a categorical variable?

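Use table() to count the observations in each category (R is case-sensitive, so the function name is lowercase): table(mtcars$gear_factor)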
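A minimal sketch of that check, assuming the built-in mtcars dataset. The gear_factor column does not exist in mtcars by default, so it is created first, on a copy named cars (a name chosen here for illustration):

```r
# Work on a copy so the built-in mtcars dataset is left unchanged.
cars <- mtcars
cars$gear_factor <- as.factor(cars$gear)

# Counts of observations in each category
gear_counts <- table(cars$gear_factor)
print(gear_counts)

# The same distribution expressed as proportions that sum to 1
gear_props <- prop.table(gear_counts)
print(round(gear_props, 2))
```

prop.table() is handy when the share of each category matters more than the raw count.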

How do you declare a variable in R?

Use the assignment operator <- to create a variable and give it a value:
my_value <- 5
my_str <- "Hello world"
my_vector <- c(5,65,23,1)
names <- c("Ann", "Bob", "Clyde", "Lu")
my_df <- data.frame(names, my_vector)
my_df$names <- as.character(my_df$names)

What is a factor variable and how can you create one in R?

A factor variable is a variable that can take on a limited number of discrete values, i.e. a categorical variable. You can create one by converting an existing column with as.factor(): mtcars$gear_factor <- as.factor(mtcars$gear)
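As a small illustration, assuming the built-in mtcars dataset, the sketch below inspects the levels of a factor and also uses factor() to attach readable labels. The variable names gear_factor and am_factor are chosen here for illustration:

```r
# Convert the numeric gear column into a factor and inspect its categories.
gear_factor <- as.factor(mtcars$gear)
gear_levels <- levels(gear_factor)    # the distinct categories, as strings
gear_nlevels <- nlevels(gear_factor)  # how many categories there are

# factor() also lets you attach labels; in mtcars, am is coded
# 0 = automatic and 1 = manual.
am_factor <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))
print(table(am_factor))
```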

What is the difference between a histogram and bar chart?

Histograms are used for continuous variables; bar charts are used for discrete/categorical variables. With the ggplot2 package loaded:
ggplot(data = mtcars, aes(x = mpg)) + geom_histogram()
ggplot(data = mtcars, aes(x = gear)) + geom_bar()

How can the mean of a df column be calculated?

mean(mtcars$mpg)
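One thing worth knowing when averaging a column is how missing values behave. The sketch below, which assumes the built-in mtcars and airquality datasets, shows mean() with and without na.rm = TRUE, plus colMeans() for several columns at once:

```r
# Mean of a complete column
mpg_mean <- mean(mtcars$mpg)

# Ozone contains missing values, so the plain mean is NA ...
ozone_mean_raw <- mean(airquality$Ozone)

# ... unless the missing values are dropped first with na.rm = TRUE
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)

# colMeans() computes the mean of several numeric columns in one call
col_means <- colMeans(mtcars[, c("mpg", "hp", "wt")])
print(col_means)
```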

How do you create a boxplot in R?

boxplot(mpg ~ am, data=mtcars)

What are scatterplots useful for and how can you create one in R?

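Scatterplots are useful for showing the relationship between two numeric variables, e.g. whether one tends to increase or decrease as the other increases.
Method 1, using base graphics: plot(airquality$Ozone, airquality$Wind)
Method 2, using ggplot2: ggplot(airquality, aes(x=Ozone, y=Wind)) + geom_point()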

What packages can be used for data mining in R?

ggplot2: visualization
tm: text mining
lm(): linear regression (a function in the built-in stats package, not a separate package)
arules: association rules mining
caret: machine learning

Name an R package which can be used for data imputation.

imputeTS (for time series data); imputeR

True or False: You need to install a package every time you start a new R session.

False

How do you install a package in R?

install.packages("name_of_package")
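The two True/False questions above fit together: installing happens once, loading happens in every session. A minimal sketch of that pattern follows. The helper name ensure_package is hypothetical, and it is called with "stats" only because that package ships with R, so the example runs without downloading anything:

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # one-time step; needs an internet connection
  }
  # Loading is needed again in every new R session.
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" is already installed with R, so nothing is downloaded here.
ensure_package("stats")
```

In your own scripts you would pass the name of the package you need, e.g. "ggplot2".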

What is R?

R is an open-source language for statistical computing and data science. It can be used interactively from the command line or through saved "R scripts," either on its own (base R) or inside RStudio, an integrated development environment (IDE) for R. RStudio is also available in the cloud as RStudio Cloud.

What is the basic syntax in R?

<- is the "assignment operator," used to declare new variables and assign values to them (technically, = can be used for assignment too)
# at the beginning of a line of code marks that line as a comment (aka "commenting it out")
name_of_function() - you can identify functions in R by the parentheses that follow them. The function arguments, i.e. what you want to apply the function to, go inside the parentheses. For example, mean(name_of_df_column) applies the mean() function to all the numbers in a dataframe column and returns a single value, the average of those numbers.
new_df <- df[df$likelihood_to_recommend == 8, ] - this is a typical way of "subsetting" a dataframe called df. Inside the square brackets, the comma separates the rows we want (before the comma) from the columns we want (after the comma). Here, new_df contains all of df's columns, because nothing follows the comma, but only the rows for which the likelihood_to_recommend column has a value of exactly 8. You can modify this condition: e.g. change == to > and only rows with likelihood_to_recommend values greater than 8 will be included in the new dataframe.
$ - this operator is used for "getting inside" a dataframe. E.g. df$likelihood_to_recommend accesses the likelihood_to_recommend column in the df dataframe, and df$text accesses another column in that dataframe, the column called "text."
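The pieces of syntax above can be tried together on a small made-up dataframe. The values below are invented for illustration; only the column names likelihood_to_recommend and text come from the answer above:

```r
# A small, made-up dataframe to practice on.
df <- data.frame(
  likelihood_to_recommend = c(8, 10, 6, 8, 9),
  text = c("ok", "great", "meh", "fine", "good"),
  stringsAsFactors = FALSE
)

# $ accesses a column; mean() returns a single value for that column.
avg <- mean(df$likelihood_to_recommend)

# Rows where the value is exactly 8, all columns (nothing after the comma).
exactly_8 <- df[df$likelihood_to_recommend == 8, ]

# Rows where the value is greater than 8, all columns.
above_8 <- df[df$likelihood_to_recommend > 8, ]

# Same rows, but only the "text" column (named after the comma).
above_8_text <- df[df$likelihood_to_recommend > 8, "text"]
print(above_8_text)
```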

What are some of the advantages and disadvantages of R?

+ Open-source; runs on all major platforms; a large and active R user community, which means ample online resources; developed by statisticians specifically for data analysis; one of the top programming languages for data science.
- Its performance depends on your machine's memory resources (in particular, your RAM), and because of that it may be slower than Python for data-intensive operations. Some of us also experienced difficulties loading certain packages: package compatibility issues and conflicts between different packages (e.g. tidyverse and ggplot2) are a drawback.

What are some common data types in R?

Logical (TRUE or FALSE)
Numeric (e.g. 5, 0.643, 1e+9)
Character (e.g. "a", "abc", "Hello", "This is my code")
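A quick way to check which type a value has is class(). The sketch below is illustrative only. It also shows the integer type (written with an L suffix), which is not in the list above, and two common conversions:

```r
# class() reports the type of a value.
logical_class <- class(TRUE)
numeric_class <- class(0.643)
character_class <- class("Hello")

# Whole numbers written with an L suffix are stored as integers,
# which still count as numeric for is.numeric().
integer_class <- class(5L)
integer_is_numeric <- is.numeric(5L)

# Values can be converted between types with the as.*() functions.
number_from_text <- as.numeric("42")
text_from_number <- as.character(5)

print(c(logical_class, numeric_class, character_class, integer_class))
```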

What are some common data objects in R?

Single data values (e.g. 6, 23455, "What is this?", y)
Vectors
Data frames
Matrices
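Each of these objects can be created in one line. The sketch below reuses the names and values from the variable-declaration answer earlier and adds a small matrix; my_matrix is a name introduced here for illustration:

```r
# A single value (in R this is really a vector of length 1)
single_value <- 6

# A vector: several values of the same type
my_vector <- c(5, 65, 23, 1)

# A matrix: values of one type arranged in rows and columns
my_matrix <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame: columns of equal length that may have different types
my_df <- data.frame(
  names = c("Ann", "Bob", "Clyde", "Lu"),
  my_vector = my_vector,
  stringsAsFactors = FALSE
)

# str() gives a compact summary of any object's structure.
str(my_df)
```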

Why is R useful for data science?

R was created specifically for statistical analysis, which makes it well suited to data science: it offers strong functionality for data cleaning, model building and evaluation, and data visualization. There are also R packages geared specifically towards data science, such as caret.

How do you get the name of the current working directory in R?

To see what your current working directory is, type: getwd()
And to change it: setwd("path\\to\\new\\working\\directory")
The working directory is the folder on your computer that R checks for a file whenever you want to import data into R. For example, you can set your Downloads folder as your working directory, and then you'll only need to supply the name of the file you want to import instead of the full path to that file (read_csv() comes from the readr package):
df <- read_csv("myFile.csv")
instead of:
df <- read_csv("C:\\User\\Downloads\\myFile.csv")
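A self-contained sketch of this workflow is below. It makes two assumptions that differ from the answer above: it uses R's temporary folder instead of Downloads so it can run on any machine, and it uses base R's read.csv() instead of read_csv() so no extra package is needed:

```r
# Remember the current working directory so it can be restored later.
old_wd <- getwd()

# Switch to R's temporary folder (a stand-in for e.g. your Downloads folder).
setwd(tempdir())

# Create a small example file in the new working directory.
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")),
          "myFile.csv", row.names = FALSE)

# Only the file name is needed, because R looks in the working directory.
df <- read.csv("myFile.csv")
print(df)

# Restore the original working directory.
setwd(old_wd)
```

Saving and restoring the old directory keeps the rest of your session unaffected.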

How do you access the element in the 2nd column and 4th row of a dataframe named D?

D[4,2]
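Inside the square brackets the row comes first and the column second. A small sketch with a made-up dataframe D (the column names id and score are invented for illustration) shows a few equivalent ways to reach the same element:

```r
# A made-up dataframe with 5 rows and 2 columns.
D <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))

# Row 4, column 2, by position
by_position <- D[4, 2]

# The same element, using the column name instead of its position
by_name <- D[4, "score"]

# The same element, taking the column first with $ and then the 4th value
by_dollar <- D$score[4]

print(by_position)
```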
