Stat 1601 Exam 1 Introduction To Data Science With R Questions With Answers Rated A 2025., Exams of Data Analysis & Statistical Methods

Typology: Exams

2024/2025

Available from 07/11/2025

drillmaster
data

descriptions of the world around us, collected through observation and stored on computers

all data has CONTEXT; can be names, numbers, etc

must clean out missing values first, then look for/deal with outliers

computers

enable us to infer properties of the world from data

big data

data sets that are so large that traditional methods of storage and analysis are inadequate; these data are recorded and stored electronically in vast data repositories called data warehouses

statistics

collecting, classifying, summarizing, organizing, analyzing and interpreting numerical info; addresses the same fundamental challenges as data science: how to draw robust conclusions about the world using incomplete info

data science

extends the field of statistics by taking full advantage of computing, data visualization, machine learning, optimization and access to information; the discipline of drawing conclusions from data using computation

(marriage between data and statistics)

helps explain why things are likely to end up a certain way, can help businesses/people make the best decisions

bigger data set=generally more precise results, as long as the sample is not biased

provides the means to make precise, reliable and quantitative arguments about any set of observations

exploration, prediction and inference

the three core aspects of effective data analysis

time series data

variables that are measured at regular intervals over time

typical measuring points=months, quarters or years

cross sectional data

when several variables are all measured at the same time point

ex=data on sales revenue, number of customers and expenses for last month at each starbucks at one point in time

ex= how much avg student spent on textbooks

qualitative (categorical) variable

measurements that cannot be measured on a natural numerical scale; they can only be classified into one of a group of categories. if this type of variable has only 2 categories it's called binary

good key=if it doesn't make sense to find the average

zip codes would be one, also jersey numbers

explanatory (predictor) variable

the variable whose effect you want to study

response variable

the variable that measures the outcome of interest, that you suspect might be affected by the other

biased sample

a sampling procedure is this if it tends to systematically overrepresent certain segments of the population and systematically underrepresent others

selection bias

results when a subset of the experimental units in the population is excluded so that these units have no chance of being selected for the sample

a form of sampling error

in the FDR study (1936 election poll), only people listed in telephone books and vehicle registrations were asked, which resulted in this type of bias

anticipation bias or non response bias/participation bias

likelihood that only people who felt very strongly about the topic responded

results when researchers conducting a survey or study are unable to obtain data on all experimental units selected for the sample

*a non sampling error

ex=yelp

simple random sampling

every sample of the same size has an equal chance of being selected. computers are often used to generate random numbers (e.g. random telephone numbers)

the only time you can really generalize to the entire population; the best method

ex=computer generates numbers to pick random students, or pick UVA computing IDs from a hat

hard because it is often not feasible: you don't always have access to data on the entire population

ex of cluster sampling (not simple random sampling): population is divided into 6 groups, you select groups 2 and 5, so everyone in groups 2 and 5 is your sample
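
The "computer picks random students" idea can be sketched in R with sample(). This is a minimal sketch only: the population of 50 student IDs and the seed are made up for illustration.

```r
# Minimal sketch of simple random sampling (hypothetical population)
set.seed(1601)                         # only so the example is reproducible
student_ids <- paste0("id", 1:50)      # made-up population of 50 students

# sample() without replacement: every set of 5 students is equally likely
srs <- sample(student_ids, size = 5)
print(srs)
```

Because replace = FALSE is the default, no student can be picked twice.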

measurement error

a type of non sampling error

refers to inaccuracies in the values of the data recorded

in surveys, the error may be due to ambiguous questions

ex=when a survey doesn't say what units it's in; can also happen when data is being input and an extra zero is added by accident, etc

missing data

observations that were planned but could not be made

ex=people dont answer parts of survey

observational study

units are observed in their natural setting and variables of interest are recorded. researchers observe individuals and measure variables of interest but do not attempt to influence responses

the explanatory variable is not imposed by the researchers; the goal=describe the situation and perhaps discover association between variables, but cannot draw a cause and effect conclusion

designed experiment

study in which the experimenter actively imposes the explanatory variable group on the subjects/observational units

the explanatory variable group is called a treatment

the researcher can legitimately draw a cause and effect conclusion between the explanatory and response variables

ex=new drug, get volunteers to sign up, randomly assign to either be on placebo or drug (Exp variable), response variable=effectiveness of helping with migraines

random assignment

what allows you to make cause and effect conclusions and limit confounding variables, randomly assigning participants to conditions

we often include a placebo group so the placebo effect is the same in every group and we can make cause and effect conclusions
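
Random assignment can be sketched in R by shuffling the treatment labels. The 10 volunteers, the group sizes and the seed are assumptions made up for this illustration.

```r
# Minimal sketch of random assignment (hypothetical volunteers)
set.seed(1601)                                  # reproducibility only
volunteers <- paste0("volunteer", 1:10)         # made-up participants

# Build 5 "drug" and 5 "placebo" labels, then shuffle their order
groups <- sample(rep(c("drug", "placebo"), each = 5))

assignment <- data.frame(volunteer = volunteers, treatment = groups)
print(assignment)
print(table(assignment$treatment))              # 5 in each condition
```

Shuffling the labels keeps the two groups the same size, while chance alone decides who gets which treatment.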

confounding variable

one whose potential effects on a response variable cannot be distinguished from those of the explanatory variable

is related to both explanatory and response variable; and because of potential of this one cannot legitimately draw cause and effect conclusions from observational studies

2 way contingency table

basic functions of R

  • log() (log base e or ln). Example log(3)
  • log10() (log base 10). Example log10(3)
  • sqrt() (square root). Example sqrt(9)
  • factorial() (factorial). Example factorial(4)
  • sum() (add). Example sum(1, 2, 3)
  • Need help with a function? Just type ?sum or help(sum)

*need parentheses with functions!
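
The functions listed above can be tried directly at the R console; a short sketch using the same example inputs:

```r
# Basic built-in functions, each called with parentheses
log(3)          # natural log (base e) of 3
log10(3)        # log base 10 of 3
sqrt(9)         # square root: 3
factorial(4)    # 4 * 3 * 2 * 1 = 24
sum(1, 2, 3)    # 1 + 2 + 3 = 6
```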

an object

anything you can store in the memory of R; can be a single value resulting from a calculation or a collection of information such as a table of linear regression coefficients

a variable is a memory location for this, which can be thought of as a label for it, created when a value is assigned to it

if naming a categorical variable, must use quotation marks for each category!

rules of variable names

cannot have spaces

cannot be started with a number

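cannot be a reserved word such as if, for or TRUE; reusing the name of a function already defined in R (like mean) is allowed but should be avoided because it causes confusion

CAN start with a period, but the period cannot then be followed by a number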

can consist of letters, numbers, periods and underscores.
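
One way to check these naming rules is make.names(), which rewrites a string into a valid name; a name is valid if it comes back unchanged. The candidate names below are made up for illustration.

```r
# Minimal sketch: test made-up candidate names against R's naming rules
candidate_names <- c("age", "my_var.2", ".hidden", "2fast", "my var", ".2x", "if")

# make.names() leaves a valid name unchanged and repairs an invalid one
is_valid <- make.names(candidate_names) == candidate_names
print(data.frame(name = candidate_names, valid = is_valid))
```

The last four fail for different reasons: a leading number, a space, a period followed by a number, and a reserved word.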

vector

how to declare a list of values in R, different depending on the variable type; vectors are series of values and can be numerical or categorical

c() / combine (concatenate) function

function to create a vector

c(17,35,18,15,17) then could find mean of this list, etc
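
A short sketch of building vectors with c(), using the numbers from the card; the variable names and the categorical example are made up for illustration.

```r
# Numerical vector built with c()
ages <- c(17, 35, 18, 15, 17)

# Categorical vector: each category goes in quotation marks (made-up values)
colors <- c("red", "blue", "red")

mean(ages)      # average of the five values
length(ages)    # number of values in the vector
```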

data frames

analogous to rectangular spreadsheets: they are presentations of datasets in R where rows correspond to observations and columns to variables that describe the observations
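
A small data frame can be built by hand with data.frame(). The student names and values below are made up for illustration.

```r
# Minimal sketch: rows = observations (students), columns = variables
students <- data.frame(
  Name   = c("Ann", "Ben", "Cal"),   # made-up observations
  Age    = c(18, 19, 20),
  Gender = c("F", "M", "M")
)

print(students)
nrow(students)   # number of observations
ncol(students)   # number of variables
students$Age     # pull out one column with $
```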

seq(), seq(1,100)=increments of one

if a vector involves a series, instead of typing the entire list you can use this function. can increment by numbers different than one by adding a third argument, ,.5 for example

ex=seq(1,100,.5)
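
A quick sketch of seq() with and without a custom increment; the variable names are made up.

```r
# Sequence from 1 to 100 in steps of one
by_one <- seq(1, 100)

# Same range in steps of 0.5 (third argument is the increment)
by_half <- seq(1, 100, by = 0.5)

head(by_half)      # first few values: 1.0 1.5 2.0 ...
length(by_one)     # 100 values
length(by_half)    # 199 values
```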

data.frame()

read.csv()

function to get data/file from your working directory, import it to R

rmarkdown

to compile a pdf format of report

getwd()

will show you exactly where your working directory is

setwd()

to set your working directory
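
How the working directory and read.csv() fit together can be sketched as follows. The file name and its contents are made up, and a temporary folder is used so nothing on your computer is changed.

```r
# Minimal sketch: working directory + read.csv() (hypothetical file)
old_wd <- getwd()            # remember where the working directory is now
setwd(tempdir())             # switch to a temporary folder

# Create a small made-up csv file so there is something to import
write.csv(data.frame(Age = c(18, 19, 20)), "day1_example.csv", row.names = FALSE)

# A bare file name is looked up inside the working directory
day1_example <- read.csv("day1_example.csv")
print(day1_example)

setwd(old_wd)                # go back to the original working directory
```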

frequency table

to SUMMARIZE a categorical variable by recording totals and category names

the names of the categories label each row in the table

visualizing through graphs helps us reveal things that cant be seen in a table of numbers, show important features and patterns in the data and provide an excellent means for reporting findings to others

table(Day1Data$Gender)

to make a frequency table, ex if getting data from Day 1 data and want frequency of gender
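
A sketch of the frequency table call. The real Day1Data file is not shown in these notes, so the five Gender values below are made up for illustration.

```r
# Hypothetical stand-in for the Day1Data data set
Day1Data <- data.frame(Gender = c("F", "M", "F", "F", "M"))

# Counts for each category
gender_counts <- table(Day1Data$Gender)
print(gender_counts)

# Same table as proportions instead of counts
print(prop.table(gender_counts))
```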

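file.choose()

opens a file browser so you can pick a data file to import, ex=read.csv(file.choose()) for a csv file

View()

to see the data of the file you imported (capital V; R is case sensitive)

fix()

to open the data of the file you imported in a separate editor window (where it can also be edited)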

geom

the geometric object in question. this refers to the type of object we can observe in a plot. for ex points, lines and bars

aes

aesthetic attributes of geometric objects. for example, x and y position, color, shape and size; these are mapped to variables in the data set
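
The geom and aes terms come from the ggplot2 package. A minimal sketch, assuming ggplot2 is installed; the data set and column names are made up for illustration.

```r
library(ggplot2)   # assumes the ggplot2 package is installed

# Made-up data: study hours, exam score and gender for five students
students <- data.frame(
  Hours  = c(1, 2, 3, 4, 5),
  Score  = c(55, 60, 70, 72, 85),
  Gender = c("F", "M", "F", "M", "F")
)

# aes() maps variables to x position, y position and color;
# geom_point() says the geometric object to draw is points
p <- ggplot(students, aes(x = Hours, y = Score, color = Gender)) +
  geom_point()
print(p)
```

Swapping geom_point() for another geom (for example geom_line()) changes the type of object drawn without changing the aesthetic mapping.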

all stacks look about the same

association/dependent/relationship categorical variables

if the conditional distributions of one variable are not identical for every category of the other variable

if stacks are diff its this

histogram, boxplots

used to visualize 1 numerical variable; cross sectional data (data collected at same time)

boxplot

uses 5 number summary: min, QL (lower quartile), median, QU (upper quartile) and max to display a numerical variable
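
The five number summary behind a boxplot can be sketched with fivenum(); the values 1 to 9 are made up for illustration.

```r
# Made-up numerical variable
values <- 1:9

# min, lower quartile, median, upper quartile, max
five <- fivenum(values)
print(five)

# Boxplot of the same values
boxplot(values, horizontal = TRUE)
```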

bar chart, pie chart

to VISUALIZE a categorical variable

linegraph

shows the relationship between 2 numerical variables when the variable on the x axis=explanatory variable is of sequential nature; in other words there is an inherent ordering to the variable

USED FOR TIME SERIES DATA

scatterplot

tool to examine the relationship between TWO quantitative variables; put explanatory/predictor variable on x axis and response on y

side by side boxplot or histogram

to examine numeric-categorical relationships

do the squareroot of the number of observations in your set

to get the idea of how many bins you should have in a histogram
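
The square root rule can be sketched as below. The 100 simulated values are made up, and note that hist() treats breaks as a suggestion, so the number of bins drawn may not match exactly.

```r
# Made-up data: 100 simulated observations
set.seed(1601)                       # reproducibility only
x <- rnorm(100)

# Square root of the number of observations, rounded up
n_bins <- ceiling(sqrt(length(x)))
print(n_bins)

# Ask hist() for roughly that many bins
hist(x, breaks = n_bins)
```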

form of association, direction, strength and unusual features

to describe a scatterplot discuss

mean

most common measure of central tendency, acts as the balance point, done to summarize numerical variables

HIGHLY AFFECTED BY OUTLIERS so dont use this if extreme outlier

is a natural summary for unimodal, symmetric distributions

distribution skewed right

outliers on right/right tail, dont use mean use median

median is less than mean
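
A tiny made-up example shows how one outlier in the right tail pulls the mean above the median:

```r
# Made-up values with one large outlier on the right
skewed <- c(1, 2, 3, 4, 100)

mean(skewed)     # pulled up by the outlier
median(skewed)   # middle value, not affected by the outlier
```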

symmetrical distribution

where mean, median and mode are all the same

mode

the most frequently occurring score(s) in a distribution

summarizing numerical variables

mean(Day1Data$Age); median(Day1Data$Age);

sd(Day1Data$Age); range(Day1Data$Age); min(Day1Data$Age);

max(Day1Data$Age); order(Day1Data$Age); sum(Day1Data$Age);

IQR(Day1Data$Age)
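
A sketch of these calls on a small stand-in for Day1Data (the real file is not shown in these notes, so the five ages are made up). Note that order() returns the positions that would sort the values rather than a single summary number.

```r
# Hypothetical stand-in for the Day1Data data set
Day1Data <- data.frame(Age = c(17, 35, 18, 15, 17))

mean(Day1Data$Age)     # average
median(Day1Data$Age)   # middle value
sd(Day1Data$Age)       # standard deviation
range(Day1Data$Age)    # min and max together
min(Day1Data$Age)
max(Day1Data$Age)
order(Day1Data$Age)    # positions of the values from smallest to largest
sum(Day1Data$Age)
IQR(Day1Data$Age)      # upper quartile minus lower quartile
```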

na.rm=T

to ignore (remove) missing values (NA) when computing a summary, ex=mean(Day1Data$Age, na.rm=T)
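
A short sketch of what na.rm does, using a made-up vector with one missing value:

```r
# Made-up values with one missing observation
ages_with_na <- c(18, NA, 20)

mean(ages_with_na)                  # NA: the missing value spoils the result
mean(ages_with_na, na.rm = TRUE)    # 19: the NA is dropped before averaging
```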

formula for standard deviation

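the square root of the variance: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, i.e. the square root of the sum of squared deviations from the mean divided by n - 1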

if all numbers are the same the standard deviation would be zero
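
The formula can be checked by hand against R's built-in sd(); the values are made up for illustration.

```r
# Made-up values
x <- c(17, 35, 18, 15, 17)

# Standard deviation step by step
deviations <- x - mean(x)                              # distance from the mean
manual_sd  <- sqrt(sum(deviations^2) / (length(x) - 1))

manual_sd        # computed from the formula
sd(x)            # built-in function gives the same value

sd(c(5, 5, 5))   # all numbers the same, so the standard deviation is 0
```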

summary()

gives the numbers shown in a box plot (min, 1st quartile, median, 3rd quartile, max) plus the mean, all at once
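
A quick sketch of summary() on a made-up numerical variable:

```r
# Made-up values
values <- 1:9

# Min, 1st quartile, median, mean, 3rd quartile and max in one call
s <- summary(values)
print(s)
```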

to describe a distribution

talk about its shape, center and spread

shape

talk about:

how many modes (especially if histogram)

symmetry or skew (skewed left or right)

outliers

center

talk about median

spread