




















































Genes and Complex Human Phenotypes Meeting at UC Berkeley, May 26-27, 2016 TABLE OF CONTENTS
1. Introduction
SNPs are significant and identifying their location, looking at chromosomal regions where genes may be linked, and seeing what these genes do. It is also important to consider whether these genes are regulatory, that is, whether they regulate other genes, or whether they are structural genes that act as catalysts or shunts. According to one study by Hunter Fraser, most SNPs that affect genes are in regulatory regions. That is an interesting phenomenon because you can use these locations to formulate testable hypotheses about (1) gene function and (2) genetic correlations between phenotypes.
Introduction
My research concerns the causes of socioeconomic gradients in health and the aging process. I study genetics because the features of human development that influence disease and disability processes are heritable; they’re influenced by genetic differences between individuals. A second reason I study genetics is that DNA is special; it comes before almost everything else humans experience. The idea behind this project is that by understanding how DNA regulates human behavior and achievement, we can learn something about causal processes that can
provide targets for interventions that will apply broadly across the population, regardless of genotype.

I want to start with some background on how I think about genetics. Until recently, research to uncover molecular genetic causes of human phenotypes was hypothesis driven. The method was to reverse-engineer the biology of a phenotype to identify a protein or molecule of interest, use that protein/molecule to identify a gene, and then screen that gene for variation that could then be tested for association with the phenotype. We called that the “candidate gene” approach. These days, with the aid of new technology, we can take a comprehensive survey of the genome and do a hypothesis-free search to identify variants linked with phenotypes. The variants we study most often in this hypothesis-free work are single-letter changes in the DNA sequence that occur in at least 1% of the population, called single nucleotide polymorphisms (SNPs). This hypothesis-free search for SNPs, called a “genome-wide association study” (GWAS), is the basis for the work I will present today.

GWAS is a brute force approach to discovery. Millions of SNPs are tested for association with a phenotype of interest. The resulting correction for multiple testing is severe, placing high demand on statistical power in GWAS design. Theory and empirical evidence suggest the best way to increase power in GWAS is to accumulate very large samples, even at the expense of having relatively coarse phenotypes. In the social sciences, this means studying phenotypes like educational attainment, which can be measured easily in hundreds of thousands of individuals, instead of psychological characteristics or economic behaviors that might be more precise quantifications of the phenotype of interest. The need for extremely large samples also precludes analysis of detailed information about those samples: where and when they grew up, how they lived, and so on.
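To make the severity of the multiple-testing correction concrete, here is a minimal sketch using a Bonferroni correction. The talk does not say which correction was applied, so the choice of Bonferroni, the figure of one million tests, the 0.05 family-wise level, and the SNP names are all assumptions made for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that holds the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_hits(p_values: dict, alpha: float = 0.05) -> list:
    """Return the names of tests whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [name for name, p in p_values.items() if p < threshold]


# Assumed numbers: one million tests at a family-wise alpha of 0.05.
print(f"{bonferroni_threshold(0.05, 1_000_000):.1e}")  # 5.0e-08

# Hypothetical SNP names and p-values; with three tests the threshold is 0.05 / 3.
print(significant_hits({"rs_a": 0.001, "rs_b": 0.02, "rs_c": 0.5}))  # ['rs_a']
```

With a per-SNP threshold this small, only very large samples give enough power to detect the small effects typical of individual variants, which is the trade-off the talk describes.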
The reliance on very large samples with coarse phenotypes leaves many unanswered questions about mechanisms connecting genotype and phenotype.

Learning from GWAS by “working from the top down”
The traditional approach to following up GWAS discoveries works from the bottom up, tracing a molecular path from DNA sequence, through RNA transcription, and on up to phenotype. The goal of bottom-up work is usually to find a molecule that can serve as a drug target. In my work, I start from the top and work down, tracing a path backward from the GWAS phenotype to its developmental antecedents. The goal of this top-down research is to find a behavior or developmental process that can serve as a target for a non-drug intervention like a policy or a program.

How do we do this top-down work? Individual genetic variants have only very small effects, but many thousands of variants contribute to the sorts of human variation we are interested in studying. So to conduct effective follow-up, methods are needed to summarize genetic influences from across the genome. The method we use is called polygenic scoring. Polygenic scoring begins by using results from GWAS to assign weighted values to large numbers of variants in a person’s genome. Then, those weighted values are summed to form that person’s polygenic score. The result is an approximately normally distributed measure that quantifies the level of genetic load toward a particular trait or outcome. All this background is discussed in greater detail in Belsky and Israel 2014, Biodemography & Social Biology.

In previous work, we and others have established that a polygenic score based on GWAS of educational attainment (1) predicts educational attainment in many different samples spanning different historical periods and geographic/socio-political contexts; and (2) is more than an artifact of socially-advantaged ancestry. This second point is important. Children inherit genes and social position from their parents. It could have been the case that the genetics of educational
Further analysis examined developmental and behavioral mechanisms underpinning genetic influence on social mobility. Beginning early in life, children with higher polygenic scores started to distinguish themselves from peers. They spoke at earlier ages, learned to read sooner, and went on to do better in school. After school, these children more often sought experiences overseas and, in fact, were more likely to emigrate (defined as residing outside of NZ for at least the past year at the time of most recent data collection, when they were aged 38 years). Children with higher polygenic scores tended to select life partners with more education and higher income. And by midlife they were better at taking care of their own finances.

Finally, we examined psychological characteristics that might mediate the pattern of life course success we observed for children with higher polygenic scores. As expected, children with higher polygenic scores had higher IQ scores. But they also distinguished themselves in other domains. They had better self-control and were more interpersonally skillful (cooperative, friendly, communicative, etc.). Collectively, these cognitive and non-cognitive characteristics accounted for around half of genetic associations with life course attainments. Strikingly, children’s polygenic scores were not related to their physical health, as measured from blood pressure, lung function, height, weight, balance, and reviews of medical dossiers.

Conclusions
GWAS discoveries for education are not about education only. Instead, genetic discoveries for educational attainment correspond to a pattern of characteristics that manifest beginning early in life as accelerated language acquisition and mental development, mature into
academic achievement in school, and extend through patterns of career formation and mate selection, yielding a life course pattern of upward socioeconomic mobility and socioeconomic success by midlife. In addition to intelligence, genetic influences on life course socioeconomic success are mediated by better self-control and interpersonal skill, but not better childhood physical health.

Next steps include tests of replication of findings in different contexts (other cohorts, other countries, other policy regimes) and investigation of environmental factors that may amplify or mitigate genetic influences. Effect sizes were small. Use of genetic testing to engineer “precision” education is not yet possible. However, public conversation is needed about how genetics that correlate with social outcomes may be used in biomedicine and education.

Ultimately, the promise of work such as this lies in the identification of developmental and behavioral mechanisms that mediate genetic influence on socioeconomic success. These mechanisms can then serve as targets for interventions that can be delivered to anyone, regardless of genotype. For example, results from our study are consistent with the notion that intervention to accelerate language development and acquisition of reading skill may provide one path to promote upward social mobility.

Belsky DW, Moffitt TE, Corcoran DL, Domingue B, Harrington HL, Hogan S, Houts R, Ramrakha S, Sugden K, Williams B, Poulton R, Caspi A. The genetics of success: How SNPs associated with educational attainment relate to life course development. Psychological Science, 27(7), 957-972. doi: 10.1177/0956797616643070. PMC4946990.
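The polygenic scoring procedure described in this talk (GWAS-derived weights applied to a person’s variants, then summed) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the SNP identifiers, weights, and genotypes are hypothetical, and the within-sample standardization step is an assumption added so that scores are on a comparable scale.

```python
from statistics import mean, pstdev


def polygenic_score(genotype, weights):
    """Weighted sum of effect-allele counts for one person.

    genotype: dict mapping SNP id -> effect-allele count (0, 1, or 2)
    weights:  dict mapping SNP id -> GWAS effect-size estimate
    SNPs missing from the genotype are skipped (an assumption of this sketch).
    """
    return sum(w * genotype[snp] for snp, w in weights.items() if snp in genotype)


def standardize(scores):
    """Convert raw scores to z-scores (mean 0, SD 1) within the sample."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("scores have no variation")
    return [(s - mu) / sd for s in scores]


# Hypothetical GWAS weights and genotypes, for illustration only.
gwas_weights = {"rs_a": 0.020, "rs_b": -0.010, "rs_c": 0.015}
genotypes = {
    "person_1": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    "person_2": {"rs_a": 0, "rs_b": 2, "rs_c": 0},
    "person_3": {"rs_a": 1, "rs_b": 1, "rs_c": 2},
}

raw = {pid: polygenic_score(g, gwas_weights) for pid, g in genotypes.items()}
z = dict(zip(raw, standardize(list(raw.values()))))
for pid in raw:
    print(pid, round(raw[pid], 3), round(z[pid], 2))
```

In real applications the weights come from a GWAS run in an independent sample, and the sum runs over many thousands of variants rather than three.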
Introduction
A GWAS of human reproductive behavior was conducted to examine genetic variants related to reproduction. Reproductive behavior was operationalized using the following phenotypes: age at first birth (AFB) and number of children ever born (NEB). In other words, we tried to answer the following question: is fertility in our genes? Our research has implications for demography, sociology, medical science, and evolutionary biology.

Background
Previous research, including many twin studies, has focused on reproductive outcomes, yet few genetic loci for reproductive behavior have been identified, and the biological mechanisms mediating said behavior are poorly understood. Little is also known about the

Footnote 1: The material presented here was subsequently published in Nature Genetics. The link to the article is: http://dx.doi.org/10.1038/ng.3698. A Frequently Asked Questions page about the paper, for press and the public, is available at http://www.sociogenome.com/data/NG2016FAQ/. A film from Oxford explaining the research is at https://www.youtube.com/watch?v=PWSfWSb5KwE
We then tried to understand the biological mechanisms behind these genetic loci (e.g., are these genes involved in gene methylation, do they have regulatory consequences, are they in linkage disequilibrium with other genes, do they influence gene expression via pathways, etc.). We discovered that these loci influence gene expression via several pathways. Using multiple techniques, 50 SNPs were determined to be functional. Two sets stand out: Chr 1 (18 SNPs) and Chr 3 (25 SNPs) seem to play roles in active regulation and transcription of proteins related to sperm differentiation and the ovum. Then, a literature-based search was performed on the 12 top hits from the GWAS to see if any studies have been conducted connecting these genes in a causal manner to reproduction and fertility. For men, there were hits related to spermatid differentiation (CREB3L4), spermatid maturation and acrosome reaction (HYAL3), and spermatogenesis (RBM5; CYHR1; GPT; RECQL4; PPP1R16A). Hits were also associated with:
and testis and plays a role in spermatid differentiation. SNPs in LD in Chr 3 likely are involved with functional genes associated with methylation status and expression of CRTC2, which is implicated in polycystic ovarian syndrome. Overall, while the predictive power of the GWAS is quite low (the polygenic score explains a fraction of what is found in twin and family studies), the findings are still significant. Non-additive genetic effects, epistatic effects, rare variants, and inflated estimates from twin studies due to shared environment could all contribute to the low predictive power observed.

Questions/Discussion
Shripad Tuljapurkar suggests that a future direction for Mills’ research could be looking at breast cancer risk. If you look at breast cancer, many genes implicated in breast cancer are not BRCA-related. Looking at AFB, age at menarche, and NEB can provide interesting insight on the etiology of breast cancer. Melinda Mills notes that the group actually joined forces with an endometriosis center to see if there is any overlap between this GWAS and breast cancer.

Ken Wachter asks about control variables, given that only a very limited number of control variables are assessed across the spectrum of GWAS. Melinda Mills acknowledges that controls are a critical consideration, and that her other study adjusts for birth cohort and country to account for some socially mediated factors that may influence AFB and NEB.

Sid Kumar asks whether the population has been split to look for differences in allocation according to birth order and sex.
Introduction
Today we will be talking about the evolution of genetic ideas in Add Health and the research it produced, challenges in producing genetic data, what geneticists and social scientists care about, and finally current limitations in data collection.

Background
Here, one observes that as parental education increases, genetics play an increasing role in vocabulary compared to the environment. At the low end of SES, the environment is more important, suggesting that there is a role for intervention.

During Wave II in 1996, Add Health became the first socio-demographic study to collect DNA in the US. DNA collection kits were made to check for zygosity on 89 twin pairs. Then during Wave III in 2001, molecular data were collected on twins and full siblings (N~2600) using DNA kits that used the mouthwash method to retrieve buccal cell DNA. Such data allowed for the identification of candidate genes in dopamine and serotonin pathways, and generated some of the first genetic analyses with molecular data.

We faced many challenges in collecting and making use of all of these data. First, it was important to keep the theoretical motivation in mind when collecting genetic data (i.e., understanding the biological mechanisms behind genes), which can easily get lost during study design. We also faced challenges in data collection methods. Sometimes, an interview could not be conducted at home, so innovations needed to be made, including interviewing at other locations. Furthermore, the logistics of conducting a national field study are difficult, and hours are spent on them. For example, we collected urine to test for STIs, but each sample had to be returned within a 48-hour window, making samples from Alaska and Hawaii virtually unusable. In addition, the idea that lab assay results could be linked to 20 years of data on an individual faced resistance.

Cost was another issue. During Wave IV in 2008, DNA was collected from participants’ saliva using Oragene kits. Logistically, we had to take every one of these samples and record them. When the sample is big, the cost is big. DNA collection in Wave IV and genotyping cost around $2.5 million. Finally, the quality of research was variable.
Though there are thoughtful, collaborative, and interdisciplinary approaches that have produced provocative findings, other studies saw social scientists conducting research with no genetic training and vice versa. So why do we care about this data?
the most compelling directions for future research are social genomics and epigenetics. These two fields emphasize social exposures, and it is important to get at the mechanisms that mediate the relationship between genetics and environment. In terms of Add Health, the future of the dataset includes transcriptome data, which was just funded (N=5,000), and methylation data (N=5,000), for which a grant is currently in progress.

Questions/Discussion
Sid Kumar asks how related the biological adolescent pairs actually are and what filter was used to select schools. Kathleen Mullan Harris clarifies that these pairs are siblings of some sort, including MZ and DZ twins. Eighty high schools were selected from across the nation, along with matching feeder middle schools. We went in and interviewed everyone in each school. We sampled 20,000 participants from the total population of 100,000 students.

Dalton Conley remarks that Add Health is one of the best nationally representative U.S. samples. A dataset like HRS, for example, has some limitations because one needs to survive to a certain age to be included. The Add Health dataset could contribute to our understanding of the deep history of reproduction. Many critics ask: why are we interested in genetics as social scientists? One of the best ways to defend this is to look at genetics through social history. But I could not find much literature in the US on this topic. Is this because Add Health is one of the only resources to do this? Are you aware of sociologists doing this type of research?

Kathleen Mullan Harris responds that the only other young dataset is CARDIA, which is not representative as it is only in 3 cities. I don’t know of others that are trying to do this type of research, and do not hear people requesting this data for that purpose.

Melinda Mills comments that there is a lot of medical data, but not much social science data.
She notes that the data access committee on which she sits in the UK restricted access for studies that examine homosexuality, intelligence, and criminality; they did not want the data to be used for those purposes. Harris responds that they give the link, and users have access to all the data. It costs us more to customize in any way. Maybe we would take out the criminal data, but that is difficult. Kumar notes that the ugly side of such data is the presentation of gene associations as causal rather than merely associated, and asks if this is common. Harris responds that this is common in bad science, and Belsky notes that this is bad science journalism.
There has been a demonization of biomedical science, but I first wanted to give some perspective on population genetics from the biomedical science community. Second, I wanted to discuss the data explosion that is occurring now. In general, we are in better shape than we think we are, and we are heading in an interesting direction that will allow for some exciting work in the future. The impact of linked datasets for the social scientist will be tremendous. Furthermore, social scientists are beginning to acquire new tools that will better allow for causal inference.

Yesterday, there was a big data conference at Stanford, and there was a meeting with Claudia Williams, the White House Liaison for the precision medicine initiative. The initiative will enroll a million volunteers, who will have extensive biologic sampling done, with electronic medical records linked. I asked about the extent of environmental and social science data in the initiative, and how that data is going to be collected. Claudia looked at me and said, “we need to
from you during an ER visit. All this data can be linked geospatially or to administrative data. Of course, there is a huge privacy issue, which may make causal inference difficult, but the fact is, there is so much data out there: social security, IRS, Census, etc. With all this data, we can nail down childhood circumstances, which is pretty amazing. This is frankly the direction data are going. Yes, there are limitations, but if we push forward in our efforts to link data, then we are in for better times. We should be the first ones in to mine this data religiously to learn about the life course.

Ben Seligman elaborates on the idea of linking data, suggesting that we should reach out to the biomedical community to link up data. Because of the Affordable Care Act, pretty much all hospitals have to move to electronic medical records, and there is also a subtle push towards the formation of ACOs. This creates competition between hospital systems to get market share and to follow patients through their life course. Hospitals are starting to track patients across the life course from beginning to end. Doctors are so used to giving drugs, but now that we follow patients outside of the hospital, environmental interventions are becoming a more viable and important option. If we’re working at academic hospitals with medical data, we can link these sorts of data.

Sid Kumar remarks that linking data comes with a host of privacy issues. Daniel Belsky asserts that there is a distinction between using data for commercial purposes and using it for research purposes. IRB protocols make research data highly regulated. Sid Kumar counters by noting that fraud happens when it comes to data access.

Mark Cullen then poses the following question: what if we had access to Safeway purchasing data of consumers? While there are a lot of anxieties about this data being collected, having this data also fills many gaps in constructing social history.
Kathleen Harris remarks that there are IRB issues with linking data. IRS data would give us a respondent’s address for every year of life, which would give us a tremendous amount of information regarding social history.

Melinda Mills points out a problem in linking data. There is a huge legal aspect with regard to which data can be linked and which cannot, as participants may have agreed to a particular way in which their data would be used and distributed. Mark Cullen notes that for health volunteers in the precision medicine initiative, there is breadth of consent, so this would not be a problem and would allow us to study the life course better.

Shripad Tuljapurkar asks how people feel about the idea of forming partnerships with commercial entities. Mark Cullen responds that we should not be so proud as to view commercial entities as unworthy partners. Commercial entities attract a lot of talent and have troves of information that they hold for their own profit-making reasons, but that may be useful to us too. Putting aside some of the individual ethical issues, commercial platforms offer extremely valuable data.

Amal Harrati asks to what extent selection plays a part in the new data paradigm. Specifically, health volunteers may have a predilection to health-seeking behavior. Mark Cullen agrees with this sentiment, noting that selection is huge. Dalton Conley notes that although there is criticism about selected samples that are not representative, with post-stratification adjustment, convenience samples can be used more reliably. As long as we know some basic demographics, we can adjust the data accordingly. Ken Wachter elaborates on controlling for selection biases in big data. Post-stratification adjustments have been successful in the field of political science with very strange convenience samples, though the sample size there was huge. It’s going to be difficult to perform post-stratification adjustment unless data linkage becomes more open.

Ben Seligman raises two issues with big data.
First, there is the issue of consent. It used to be the case that consent was opt-out. The assumption was that if blood was drawn at an
academic hospital, data would be used for research. Now, consent is opt-in. Should we revisit the issue of opt-out, particularly when data is extended for other purposes than the original purpose? Second, there is the issue of representation. We don’t care so much that data has to be equally representative; we care that there is just enough data to be representative of an individual population. In other words, we don’t necessarily want representativeness, we want representation.

Amal Harrati emphasizes that selection and representativeness are two different issues. The samples matter a lot. When we think about effect sizes, who’s coming to the sample to begin with, and the differentiation within these samples that leads to a proposed effect, we need to be cognizant of how a sample is selected.

Daniel Belsky puts this in more concrete terms. In longevity research, this is an extreme problem. What determines variation in lifespan in older and younger individuals may be different. What we uncover may be tailored to populations that we don’t fully understand. Those in treatment who also volunteer their data may be a unique population in itself.

Shripad Tuljapurkar brings up two other points with regard to selection and representativeness. First, in the case of mortality selection, looking at variances between sub-groups could be informative. Looking at variance in age of death between high school dropouts and people with college education reveals that age of death is about the same in the two populations. Adding this component to Ben Domingue’s mortality selection analysis would be interesting to see. Second, in terms of data there can sometimes be representativeness within a population, but not representation of populations outside of it. Ben Domingue states that in that case we cannot necessarily translate how genetic markers affect educational attainment in New Zealand populations to U.S. populations.
Ronald Lee asks if it is known how to use a person’s Safeway data. Mark Cullen notes that consumption data from Safeway employees could be really interesting to examine, as employees get a 20% discount, so it seems plausible that they buy all their food from Safeway. This would allow us to study medical vs. wellness interventions. This study could not be done generally in the population because people have too many choices on where to buy food. Ronald Lee adds that one would want to start by constructing a dozen indicators based on this massive data.

Ken Wachter concludes the discussion by recalling the selection issue. The value of randomization in classical experimental design is not in controlling for things we know about; it is in controlling for things we don’t know about. While I do agree that representation, and not representativeness, is key, it does not address the unobserved variables that are not accounted for in models.
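The post-stratification adjustment raised in the discussion above can be sketched as follows. This is a minimal illustration under stated assumptions: the strata labels, population shares, and outcome values are hypothetical, and real applications typically cross several demographic variables. As Ken Wachter points out, the adjustment only corrects for the characteristics used to define the strata, not for unobserved ones.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Weight for each stratum = population share / sample share."""
    counts = Counter(sample_strata)
    n = len(sample_strata)
    return {
        stratum: population_shares[stratum] / (count / n)
        for stratum, count in counts.items()
    }


def weighted_mean(values, strata, weights):
    """Mean of values, with each observation weighted by its stratum weight."""
    total_weight = sum(weights[s] for s in strata)
    return sum(v * weights[s] for v, s in zip(values, strata)) / total_weight


# Hypothetical convenience sample that over-represents college graduates.
strata = ["college"] * 8 + ["no_college"] * 2
outcome = [1.0] * 8 + [0.0] * 2

# Assumed known population shares (e.g., from a census).
population_shares = {"college": 0.4, "no_college": 0.6}

weights = poststratification_weights(strata, population_shares)
unadjusted = sum(outcome) / len(outcome)
adjusted = weighted_mean(outcome, strata, weights)
print(round(unadjusted, 3), round(adjusted, 3))  # 0.8 0.4
```

The adjusted mean matches what one would see if the sample had the population’s education mix, which is why the approach depends on knowing at least some basic demographics for every respondent.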
Introduction
Let’s assume the following hypothetical model: there’s a risk factor and an outcome (e.g. occurrence of a disease). There is a positive association between the risk factor and outcome,
To give some context, there has been some literature suggesting that for a heritable disease with perfect information, one would see an AUC of 0.93. Incorporating birth year interaction and random forest variables did not improve the model measurably. We are able to predict as well as we can based on a relatively simple set of factors.

Melinda Mills asks if HRS has data from death certificates related to cause of death, as this could be a good method to examine mortality more specifically. Ben Domingue responds that the data is likely there, but that is something that has not yet been explored.

We can do a pretty good job at minimizing survival differences between genotyped and non-genotyped individuals based on our models. Below are Cox survival curves based on genotype status, age at first interview, birth year and interaction of genotype status for white males and females. The black lines represent those born in 1930, while the red lines represent those born in 1945. The dashed lines represent non-genotyped respondents, and the solid lines represent genotyped respondents. The question is, can we reduce differences in survival between genotyped and non-genotyped respondents using our model for mortality? Visually this would result in same-color curves being closer together in the Cox survival curves.

In particular, I selected those with a high probability of having lived until 2006 based on my model. I adjusted the naïve association estimates for mortality selection such that they become average treatment effects, and did this for a number of outcomes: BMI, height, education, smoking. For all of these, I used polygenic scores. I then looked at static marginal genetic effects (time-invariant) and dynamic effects by birth cohort (time-dependent), and then examined these effects before and after inverse propensity weighting.
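The inverse propensity weighting step mentioned above can be sketched as follows. This is a hedged illustration, not the speaker’s model: it assumes each genotyped respondent already has an estimated probability of surviving to be genotyped (from some mortality model not shown here), and the data values, the weight floor, and the function names are hypothetical.

```python
def inverse_probability_weights(p_survive, floor=0.05):
    """Weight each respondent by 1 / P(survived to be genotyped).

    The floor (an assumption of this sketch) caps extreme weights for
    respondents with very low estimated survival probability.
    """
    return [1.0 / max(p, floor) for p in p_survive]


def weighted_slope(x, y, w):
    """Weighted least-squares slope of y on x (single predictor plus intercept)."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    return cov / var


# Hypothetical polygenic scores, outcomes, and estimated survival probabilities.
pgs = [-1.0, 0.0, 1.0, 2.0]
outcome = [0.5, 1.0, 2.5, 3.0]
p_survive = [0.4, 0.6, 0.8, 0.9]

weights = inverse_probability_weights(p_survive)
naive = weighted_slope(pgs, outcome, [1.0] * len(pgs))  # unweighted estimate
adjusted = weighted_slope(pgs, outcome, weights)        # reweighted estimate
print(round(naive, 3), round(adjusted, 3))
```

Respondents who were unlikely to survive but did are weighted up, so the reweighted sample stands in for the cohort as it would look without mortality selection; comparing the two slopes shows how much the naïve association shifts.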
Before I discuss these results, I wanted to discuss changes in mean polygenic scores as a function of birth cohort for four different types of outcomes: BMI, education, height, and smoking. You can see that those born in 1920 have a much higher polygenic score for education than those born in 1950. The other three measures show similar results. Clearly, there has not been a long enough time for these changes to be due to evolution or selection, so these trends are presumably due to mortality selection.

Looking at the raw marginal associations (the main effects of the polygenic scores on the outcomes), we see that polygenic scores are significantly correlated with BMI, height, education, and smoking. I wanted to see how these associations changed in magnitude with mortality selection using different types of estimates. For smoking, the marginal effect is 4-5% smaller after adjusting for mortality selection. More interesting is what happens when considering samples with enhanced mortality selection. We looked at those who had died in the most recent wave of HRS, and we treated those individuals as if they did not have genetic data. We re-estimated associations, and across all variables, we see changes in the association estimates that are counter to the changes we see after weighting. The fact that these changes are in the opposite direction relative to the naïve estimates suggests that we really are able to control for the effect of mortality selection.

Conclusion
The first question we wanted to answer was how we can effectively model mortality selection. Our models do a pretty good job of reducing the discrepancy in survival differences between genotyped and non-genotyped individuals. Second, we asked how this information might change inference about certain types of associations. We found that dynamics of polygenic score associations may be biased if not corrected for mortality selection.

Questions/Discussion
Sid Kumar asks how age of onset of smoking is determined. Ben Domingue responds that in the HRS, they ask people the age of onset, and it generally works. Kathleen Mullan Harris agrees, noting that this is how it is done in Add Health. Daniel Belsky comments that people are pretty good at reporting how many cigarettes they’ve smoked, and that retrospective recall of age of onset is also pretty good. Interestingly, the ever-smoker phenotype seems to be more predictive in GWAS, but it is unclear what to make of that prediction.

Shripad Tuljapurkar asks if estimated coefficients can be corrected for in an actual study. Ben Domingue responds that this is exactly what the group is trying to address, but the situation is messy.

Dalton Conley asks how non-compliance was approached. Ben Domingue responds that people certainly refused to give genetic samples in the HRS. This is a really complicated group, and I’m not trying to model this group. We only have genetic samples and good polygenic scores for white respondents, which is why we only showed associations for these groups.
Daniel Belsky adds that less educated, non-white participants in the HRS are less likely to consent to this kind of research. He notes that there may be ways to control for this.

Ken Wachter asks about the differences in the Cox curves and whether they are truly picking up mortality selection. Ben Domingue notes that if you look at the Cox curves, we have survival differences between genotyped and non-genotyped respondents. If conditioned on only those who lived until 2006, you see much smaller differences. The curves would fall within each other’s confidence intervals, so what we’re picking up is definitely mortality selection.
Introduction
The question I’m analyzing today is one of the oldest questions in genetics: what percentage of the variation in a trait can be explained by the additive genetic contribution? This is commonly referred to as the narrow-sense heritability. In the past, we didn’t have genetic data, so heritability was computed indirectly. In the 2000s, we started to collect a lot of genetic data, and we began to ask whether we could compute the heritability directly. When genetic data was first used, we a