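Speaker: Professor Yanyuan Ma (馬彥源)
Title: Doubly Flexible Estimation under Label Shift
Time: Friday, July 14, 2023, 9:30-10:30 a.m.
Venue: Academic Lecture Hall 1506, Jingyuan Building (靜遠樓)
Organizers: Institute of Mathematics; School of Mathematics and Statistics; Institute of Science and Technology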
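About the speaker:
Yanyuan Ma is a Professor in the Department of Statistics at the Pennsylvania State University, with a bachelor's degree from Peking University and a Ph.D. from the Massachusetts Institute of Technology. Ma's main research interests include dimension reduction, measurement error models, latent variable models, mixture samples, nonparametric and semiparametric methods, and survival analysis. Ma has published more than 150 papers, over 40 of them in top international statistics and econometrics journals such as JRSSB, AoS, JASA, Biometrika, and JoE, and has served as an associate editor of the top international statistics journals JRSSB, JASA, and Biometrics.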
Abstract:
In studies ranging from clinical medicine to policy research, complete data are usually available from a population P, but the quantity of interest is often sought for a related but different population Q for which only partial data are available. In this paper, we consider the setting in which both the outcome Y and the covariate X are available from P, whereas only X is available from Q, under the so-called label shift assumption, i.e., the conditional distribution of X given Y remains the same across the two populations. To estimate the parameter of interest in population Q by leveraging the information from population P, the following three ingredients are essential: (a) the common conditional distribution of X given Y, (b) the regression model of Y given X in population P, and (c) the density ratio of the outcome Y between the two populations. We propose an estimation procedure that needs only a standard nonparametric regression technique to approximate the conditional expectations with respect to (a), and requires no estimate or model for (b) or (c); i.e., it is doubly flexible with respect to possible misspecification of both (b) and (c). This is conceptually different from the well-known doubly robust estimation, in that double robustness allows at most one model to be misspecified, whereas our proposal allows both (b) and (c) to be misspecified. This is of particular interest in our setting because estimating (c) is difficult, if not impossible, owing to the absence of Y-data in population Q. Furthermore, even though the estimation of (b) is sometimes off-the-shelf, it can face the curse of dimensionality or computational challenges. We develop the large-sample theory for the proposed estimator and examine its finite-sample performance through simulation studies as well as an application to the MIMIC-III database.
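For readers less familiar with the setting, the label shift assumption and ingredient (c) can be written out as follows. This is only an illustrative sketch: the symbols p_Y, q_Y, g, rho, and h are notation assumed here for exposition and are not taken from the talk, and the identities shown are standard consequences of label shift rather than the speaker's proposed estimator.

```latex
% Assumed notation (not from the talk): p_Y and q_Y are the densities of Y
% in populations P and Q; g(x | y) is the common conditional density of X given Y.
\begin{align*}
  p_P(x, y) &= p_Y(y)\, g(x \mid y), &
  p_Q(x, y) &= q_Y(y)\, g(x \mid y)
  && \text{(label shift)} \\
  \rho(y) &= \frac{q_Y(y)}{p_Y(y)}
  && && \text{(density ratio, ingredient (c))}
\end{align*}
% Provided p_Y(y) > 0 wherever q_Y(y) > 0, for any integrable function h:
\begin{align*}
  E_Q\{h(X, Y)\} &= E_P\{\rho(Y)\, h(X, Y)\}, \\
  p_Q(y \mid x) &= \frac{\rho(y)\, p_P(y \mid x)}{E_P\{\rho(Y) \mid X = x\}}.
\end{align*}
```

The first identity shows why the density ratio would matter for a plain reweighting approach, and the second shows how it interacts with the regression of Y on X in population P, i.e., ingredient (b). Because Y is never observed in Q, rho cannot be estimated directly from Y-data, which is the difficulty the abstract refers to.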