



















Stat 1601 Exam 1 Introduction To Data Science With R Questions With Answers Rated A 2025.
Typology: Exams
data
descriptions of the world around us, collected through observation and stored on computers
all data have CONTEXT; values can be names, numbers, etc.
before analysis, deal with missing values and look for/deal with outliers
computers
enable us to infer properties of the world from data
big data
data sets that are so large that traditional methods of storage and analysis are inadequate; these data are recorded and stored electronically in vast data repositories called data warehouses
statistics
collecting, classifying, summarizing, organizing, analyzing and interpreting numerical info; addresses the same fundamental challenge as data science: how to draw robust conclusions about the world using incomplete info
data science
extends the field of statistics by taking full advantage of computing, data visualization, machine learning, optimization and access to information; the discipline of drawing conclusions from data using computation
(marriage between data and statistics)
helps explain why things turn out a certain way and can help businesses/people make better decisions
a bigger data set generally gives more precise results, as long as the sample is not biased
provides the means to make precise, reliable and quantitative arguments about any set of observations
exploration, prediction and inference
the three core aspects of effective data analysis
time series data
variables that are measured at regular intervals over time
typical measuring points=months, quarters or years
cross sectional data
when several variables are all measured at the same time point
ex=sales revenue, number of customers and expenses for last month at each Starbucks (all measured at one point in time)
ex=how much the avg student spent on textbooks
qualitative (categorical variable)
measurements that cannot be measured on a natural numerical scale; they can only be classified into one of a group of categories. if this type of variable has only 2 categories it's called binary
good test=it doesn't make sense to find the average
ex=zip codes, jersey numbers (numbers used only as labels)
explanatory (predictor) variable
the variable whose effect you want to study
response variable
the variable that measures the outcome of interest, that you suspect might be affected by the other
biased sample
a sampling procedure is this if it tends to systematically overrepresent certain segments of the population and systematically underrepresent others
selection bias
results when a subset of the experimental units in the population is excluded so that these units have no chance of being selected for the sample
a form of sampling error
ex=in the FDR study (1936 Literary Digest poll), only people listed in telephone books and vehicle registrations were asked, which resulted in this type of bias
non-response bias / participation bias
likelihood that only people who felt very strongly about the topic responded
results when researchers conducting a survey or study are unable to obtain data on all experimental units selected for the sample
a non-sampling error
ex=Yelp reviews
simple random sampling
every sample of the same size has an equal chance of being selected. computers are often used to generate random telephone numbers
the best method; the only one that really lets you generalize to the entire population
ex=computer generates random numbers to pick students, or UVA computing IDs are picked from a hat
hard in practice: often not feasible because you don't always have access to data on the entire population
contrast (cluster sampling): population is divided into 6 groups, you randomly select groups 2 and 5, so everyone in groups 2 and 5 is your sample
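A minimal sketch of both selection schemes in R, assuming a made-up population of 24 IDs split into 6 groups (the IDs, group labels and seed are illustrative only, not course data):

```r
# Hypothetical population: 24 made-up IDs in 6 groups of 4
set.seed(1)                       # only so the example is repeatable
ids   <- paste0("id", 1:24)
group <- rep(1:6, each = 4)

# Simple random sample: every set of 5 IDs is equally likely
srs <- sample(ids, size = 5)      # without replacement by default

# Group-based selection as in the 6-group example:
# pick 2 whole groups at random, then keep everyone in them
chosen_groups  <- sample(1:6, size = 2)
cluster_sample <- ids[group %in% chosen_groups]

srs
cluster_sample
```

sample() draws without replacement unless replace = TRUE is given, so no ID can be picked twice.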
measurement error
a type of non sampling error
refers to inaccuracies in the values of the data recorded
in surveys, the error may be due to ambiguous questions
ex=a survey doesn't say what units to use; can also happen during data entry, e.g. accidentally adding an extra zero
missing data
observations that were planned but could not be made
ex=people don't answer parts of a survey
observational study
units are observed in their natural setting and variables of interest are recorded. researchers observe individuals and measure variables of interest but do not attempt to influence responses
the explanatory variable is not imposed by the researchers; the goal=describe the situation and perhaps discover an association between variables, but you cannot draw a cause and effect conclusion
designed experiment
study in which the experimenter actively imposes the explanatory variable on the subjects/observational units
each group (level) of the imposed explanatory variable is called a treatment
the researcher can legitimately draw a cause and effect conclusion between the explanatory and response variables
ex=new drug: get volunteers to sign up, randomly assign each to placebo or drug (explanatory variable); response variable=effectiveness at helping with migraines
random assignment
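randomly assigning participants to conditions; this is what allows you to make cause and effect conclusions, because it balances confounding variables across the groups
a placebo group is often included so that the placebo effect is the same in every group and any remaining difference can be attributed to the treatment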
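One way random assignment could be done in R, sketched with 10 hypothetical volunteers (names, group sizes and seed are assumptions for illustration):

```r
# Hypothetical volunteers for the drug vs placebo example
set.seed(42)                                   # repeatable shuffle
subjects <- paste0("volunteer", 1:10)

# Shuffle 5 "drug" and 5 "placebo" labels so chance alone decides the group
treatment  <- sample(rep(c("drug", "placebo"), each = 5))
assignment <- data.frame(subject = subjects, treatment = treatment)

table(assignment$treatment)                    # group sizes after shuffling
```

Shuffling a fixed set of labels keeps the two groups the same size, which a coin flip per subject would not guarantee.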
confounding variable
one whose potential effects on a response variable cannot be distinguished from those of the explanatory variable
is related to both the explanatory and the response variable; because confounding is always possible, one cannot legitimately draw cause and effect conclusions from observational studies
2 way contingency table
basic functions of R
*need parentheses with functions!
an object
anything you can store in the memory of R; can be a single value resulting from a calculation or a collection of information such as a table of linear regression coefficients
a variable is a memory location for an object, which can be thought of as a label for it; it is created when a value is assigned to it
when entering the values of a categorical variable, must use quotation marks around each category!
rules of variable names
cannot have spaces
cannot be started with a number
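reserved words (if, else, TRUE, function, etc.) can't be used as variable names; also avoid names of functions already defined in R (mean, c), since reusing them is allowed but confusing
CAN start with a period, but the period can't then be followed by a number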
can consist of letters, numbers, periods and underscores.
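The naming rules above can be tried out directly; a small sketch with made-up variable names (make.names() is a base R helper that turns arbitrary strings into valid names):

```r
# Valid names: letters, numbers, periods and underscores
age_1  <- 21
my.var <- "A"
.total <- 3          # leading period followed by a letter is allowed

# Invalid names, left commented out because each is a syntax error:
# 1age   <- 21       # starts with a number
# my age <- 21       # contains a space
# .2x    <- 5        # period followed by a number

# make.names() shows how R would repair invalid names
fixed <- make.names(c("my age", "1age", "ok_name"))
fixed
```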
vector
how to store a series of values in R; values can be numerical or categorical, and how you enter them depends on the type (categorical values go in quotes)
c() / concatenate (combine) function
function to create a vector
ex=c(17,35,18,15,17); could then find the mean of this vector, etc
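A short sketch of building vectors with c(), using the numbers from the example plus a made-up categorical vector:

```r
ages   <- c(17, 35, 18, 15, 17)        # numerical vector
colors <- c("red", "blue", "red")      # categorical: each category in quotes (made-up values)

mean(ages)      # average of the five values
length(ages)    # how many values the vector holds
class(colors)   # character vectors store categories as text
```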
data frames
analogous to rectangular spreadsheets: they are representations of datasets in R where rows correspond to observations and columns to variables that describe the observations
seq(), seq(1,100)=increments of one
if a vector is a regular series, instead of typing the entire list you can use this function. can increment by a number other than one by adding it as a third argument, ,.5 for example
ex=seq(1,100,.5)
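A quick sketch of seq() with the default step and with a step of .5 (the variable names are just for illustration):

```r
by_one  <- seq(1, 100)             # 1, 2, 3, ..., 100
by_half <- seq(1, 100, by = 0.5)   # 1, 1.5, 2, ..., 100

length(by_one)
length(by_half)
head(by_half)                      # first few values of the .5 series
```

Naming the third argument (by = 0.5) is optional but makes the step easier to spot.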
data.frame()
function to create a data frame from vectors
read.csv()
function to read a csv file from your working directory and import it into R
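A self-contained sketch of data.frame() and read.csv(), assuming a tiny made-up data set; it writes to a temporary file so nothing in your working directory is touched:

```r
# Made-up stand-in for a class data set
day1 <- data.frame(Gender = c("F", "M", "F"),
                   Age    = c(19, 20, 18))

# Save it as a csv in a temporary location, then import it again
path <- tempfile(fileext = ".csv")
write.csv(day1, path, row.names = FALSE)

imported <- read.csv(path)
str(imported)        # rows = observations, columns = variables
```

With a real file in the working directory, the call would simply be read.csv() with the file name in quotes.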
rmarkdown
used to write a report that combines text and R code and compile it to PDF (or HTML/Word)
getwd()
will show you exactly where your working directory is
setwd()
to set your working directory
frequency table
to SUMMARIZE a categorical variable by recording totals and category names
the names of the categories label each row in the table
visualizing through graphs helps us reveal things that can't be seen in a table of numbers, shows important features and patterns in the data and provides an excellent means for reporting findings to others
table(Day1Data$Gender)
to make a frequency table; ex=if using the Day 1 data and you want the frequency of each gender
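A minimal sketch of a frequency table, using a made-up stand-in for Day1Data (the five values below are invented for illustration):

```r
# Hypothetical stand-in for the Day 1 data
Day1Data <- data.frame(Gender = c("F", "M", "F", "F", "M"))

freq <- table(Day1Data$Gender)   # counts per category
freq
prop.table(freq)                 # same table as proportions
```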
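file.choose()
opens a window to pick any data file; use with read.csv(file.choose()) to import a csv
View()
to see the data of the file you imported (note the capital V)
fix()
opens the imported data in a separate editor window, where it can also be changed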
geom
the geometric object in question. this refers to the type of object we can observe in a plot. for ex points, lines and bars
aes
aesthetic attributes of geometric objects. for example, x and y position, color, shape and size; these are mapped to variables in the data set
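A small sketch of how geom and aes fit together, assuming the ggplot2 package is installed; the data frame and its column names are made up:

```r
library(ggplot2)   # assumes ggplot2 is installed

# Made-up data for illustration
students <- data.frame(hours = c(1, 2, 3, 4, 5),
                       score = c(55, 60, 70, 72, 85),
                       year  = c("1st", "2nd", "1st", "2nd", "1st"))

# aes() maps variables to x position, y position and color;
# geom_point() says the geometric object is points (a scatterplot)
p <- ggplot(data = students,
            mapping = aes(x = hours, y = score, color = year)) +
  geom_point()

# print(p) draws the plot
```

Swapping geom_point() for geom_line() would keep the same aesthetic mappings but draw lines instead.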
all stacks (in a stacked bar chart) look about the same=no association
associated (dependent/related) categorical variables
if the conditional distributions of one variable are not identical for every category of the other variable
if the stacks are different, it's this
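Conditional distributions can be computed from a two-way table; a sketch with made-up survey answers (the Gender and Pet columns are hypothetical):

```r
# Made-up data: 8 respondents
survey <- data.frame(
  Gender = c("F", "F", "F", "F", "M", "M", "M", "M"),
  Pet    = c("dog", "dog", "dog", "cat", "dog", "cat", "cat", "cat")
)

two_way <- table(survey$Gender, survey$Pet)   # 2-way contingency table of counts
cond    <- prop.table(two_way, margin = 1)    # each row sums to 1: Pet within each Gender

two_way
cond
```

Here the rows of cond differ (F is mostly dog, M is mostly cat), which is what different stacks in a stacked bar chart would show.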
histogram, boxplots
used to visualize 1 numerical variable; cross sectional data (data collected at same time)
boxplot
uses the 5-number summary: min, QL (lower quartile), median, QU (upper quartile) and max to display a numerical variable
bar chart, pie chart
to VISUALIZE a categorical variable
linegraph
shows the relationship between 2 numerical variables when the variable on the x axis=explanatory variable is of sequential nature; in other words there is an inherent ordering to the variable
scatterplot
tool to examine the relationship between TWO quantitative variables; put the explanatory/predictor variable on the x axis and the response on the y axis
side by side boxplot or histogram
to examine numeric-categorical relationships
take the square root of the number of observations in your data set
to get the idea of how many bins you should have in a histogram
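A sketch of the square-root rule applied to 100 simulated values (the data are randomly generated for illustration, not real ages):

```r
set.seed(7)
ages <- round(rnorm(100, mean = 20, sd = 2))   # 100 simulated observations

n_bins <- ceiling(sqrt(length(ages)))          # square-root rule
h <- hist(ages, breaks = n_bins)               # draws the histogram

n_bins
length(h$counts)   # bins actually used
```

hist() treats a single number given to breaks as a suggestion, so the number of bins drawn may differ slightly from n_bins.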
form of association, direction, strength and unusual features
to describe a scatterplot discuss
mean
most common measure of central tendency, acts as the balance point, done to summarize numerical variables
HIGHLY AFFECTED BY OUTLIERS, so don't use this if there are extreme outliers
is a natural summary for unimodal, symmetric distributions
distribution skewed right
outliers on right/long right tail; don't use the mean, use the median
median is less than the mean
symmetrical distribution
where mean and median are the same (and the mode too, if there is a single peak)
mode
the most frequently occurring score(s) in a distribution
summarizing numerical variables
mean(Day1Data$Age); median(Day1Data$Age);
sd(Day1Data$Age); range(Day1Data$Age); min(Day1Data$Age);
max(Day1Data$Age); order(Day1Data$Age); sum(Day1Data$Age);
IQR(Day1Data$Age)
na.rm=T
argument added inside a function, e.g. mean(x, na.rm=T), to drop missing values (NA) before calculating
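A sketch of the summary functions on a made-up Age vector with one missing value, showing what na.rm changes and how order() differs from sort():

```r
Age <- c(19, 20, NA, 18, 22)     # made-up values, one missing

mean(Age)                        # NA: the missing value spoils the result
mean(Age, na.rm = TRUE)          # mean of the four known values
median(Age, na.rm = TRUE)
sd(Age, na.rm = TRUE)
range(Age, na.rm = TRUE)         # min and max together

order(Age)                       # positions that would sort the vector (NA last)
sort(Age)                        # the sorted values themselves (NA dropped)
```

Writing na.rm = TRUE rather than na.rm = T is a little safer, since T can be overwritten like any other variable.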
formula for standard deviation
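the square root of the variance: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ (square root of the sum of squared deviations from the mean, divided by n - 1)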
if all numbers are the same the standard deviation would be zero
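The formula can be checked against R's built-in sd(); a sketch using the five numbers from the vector example:

```r
x <- c(17, 35, 18, 15, 17)
n <- length(x)

# Sum of squared deviations from the mean, divided by n - 1, then square root
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

all.equal(manual_sd, sd(x))   # the hand calculation matches sd()
sd(c(4, 4, 4, 4))             # identical values: no spread at all
```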
summary()
gives the numbers shown in a box plot (min, 1st quartile, median, 3rd quartile, max), plus the mean, all at once
to describe a distribution
talk about its shape, center and spread
shape
talk about:
how many modes (especially if histogram)
symmetry or skew (left or right)
outliers
center
talk about median
spread