Swift: Compiled Inference for Probabilistic Programming Languages
Transcription
Yi Wu (UC Berkeley, jxwuyi@gmail.com), Lei Li (Toutiao.com, lileicc@gmail.com), Stuart Russell (UC Berkeley, russell@cs.berkeley.edu), Rastislav Bodik (University of Washington, bodik@cs.washington.edu)

Abstract

A probabilistic program defines a probability measure over its semantic structures. One common goal of probabilistic programming languages (PPLs) is to compute posterior probabilities for arbitrary models and queries, given observed evidence, using a generic inference engine. Most PPL inference engines, even the compiled ones, incur significant runtime interpretation overhead, especially for contingent and open-universe models. This paper describes Swift, a compiler for the BLOG PPL. Swift-generated code incorporates optimizations that eliminate interpretation overhead, maintain dynamic dependencies efficiently, and handle memory management for possible worlds of varying sizes. Experiments comparing Swift with other PPL engines on a variety of inference problems demonstrate speedups ranging from 12x to 326x.

1 Introduction

Probabilistic programming languages (PPLs) aim to combine sufficient expressive power for writing real-world probability models with efficient, general-purpose inference algorithms that can answer arbitrary queries with respect to those models. One underlying motive is to relieve the user of the obligation to carry out machine learning research and implement new algorithms for each problem that comes along. Another is to support a wide range of cognitive functions in AI systems and to model those functions in humans. General-purpose inference for PPLs is very challenging; they may include unbounded numbers of discrete and continuous variables, a rich library of distributions, and the ability to describe uncertainty over functions, relations, and the existence and identity of objects (so-called open-universe models). Existing PPL inference algorithms include likelihood weighting (LW) [Milch et al., 2005b], parental Metropolis-Hastings (PMH) [Milch and Russell, 2006; Goodman et al., 2008], generalized Gibbs sampling (Gibbs) [Arora et al., 2010], generalized sequential Monte Carlo [Wood et al., 2014], Hamiltonian Monte Carlo (HMC) [Stan Development Team, 2014], variational methods [Minka et al., 2014; Kucukelbir et al., 2015], and a form of approximate Bayesian computation [Mansinghka et al., 2013]. While better algorithms are certainly possible, our focus in this paper is on achieving orders-of-magnitude improvement in the execution efficiency of a given algorithmic process.

A PPL system takes a probabilistic program (PP) specifying a probabilistic model as its input and performs inference to compute the posterior distribution of a query given some observed evidence. The inference process does not (in general) execute the PP, but instead executes the steps of an inference algorithm (e.g., Gibbs) guided by the dependency structures implicit in the PP. In many PPL systems the PP exists as an internal data structure consulted by the inference algorithm at each step [Pfeffer, 2001; Lunn et al., 2000; Plummer, 2003; Milch et al., 2005a; Pfeffer, 2009]. This process is in essence an interpreter for the PP, similar to early Prolog systems that interpreted the logic program. Particularly when running sampling algorithms that involve millions of repetitive steps, the overhead can be enormous. A natural solution is to produce model-specific compiled inference code, but, as we show in Sec. 2, existing compilers for general open-universe models [Wingate et al., 2011; Yang et al., 2014; Hur et al., 2014; Chaganty et al., 2013; Nori et al., 2014] miss out on optimization opportunities and often produce inefficient inference code.
The paper analyzes the optimization opportunities for PPL compilers and describes the Swift compiler, which takes as input a BLOG program [Milch et al., 2005a] and one of three inference algorithms (LW, PMH, Gibbs) and generates target code for answering queries. Swift includes three main contributions: (1) elimination of interpretative overhead by joint analysis of the model structure and inference algorithm; (2) a dynamic slice maintenance method (FIDS) for incremental computation of the current dependency structure as sampling proceeds; and (3) efficient runtime memory management for maintaining the current-possible-world data structure as the number of objects changes. Comparisons between Swift and other PPLs on a variety of models demonstrate speedups ranging from 12x to 326x, leading in some cases to performance comparable to that of hand-built model-specific code. To the extent possible, we also analyze the contributions of each optimization technique to the overall speedup. Although Swift is developed for the BLOG language, the overall design and the choices of optimizations can be applied to other PPLs and may bring useful insights to similar AI systems for real-world applications.

Figure 1: PPL compilers and optimization opportunities. (Diagram: the probabilistic program (model) P_M is combined with the inference algorithm to produce inference code P_I, which is then compiled to target code P_T in C++. Classical probabilistic program optimizations are insensitive to the inference algorithm; classical algorithm optimizations are insensitive to the model. Our FIDS optimization combines model and algorithm, and our data structures support FIDS in the target code.)

1 type Ball; type Draw; type Color; // 3 types
2 distinct Color Blue, Green; // two colors
3 distinct Draw D[2]; // two draws: D[0], D[1]
4 #Ball ~ UniformInt(1,20); // unknown # of balls
5 random Color color(Ball b) // color of a ball
6   ~ Categorical({Blue -> 0.9, Green -> 0.1});
7 random Ball drawn(Draw d)
8   ~ UniformChoice({b for Ball b});
9 obs color(drawn(D[0])) = Green; // drawn ball is green
10 query color(drawn(D[1])); // query

Figure 2: The urn-ball model

2 Existing PPL Compilation Approaches

In a general-purpose programming language (e.g., C++), the compiler compiles exactly what the user writes (the program). By contrast, a PPL compiler essentially compiles the inference algorithm, which is written by the PPL developer, as applied to a PP. This means that different implementations of the same inference algorithm for the same PP result in completely different target code. As shown in Fig. 1, a PPL compiler first produces an intermediate representation combining the inference algorithm (I) and the input model (P_M) as the inference code (P_I), and then compiles P_I to the target code (P_T). Swift focuses on optimizing the inference code P_I and the target code P_T given a fixed input model P_M.

Although there have been successful PPL compilers, these compilers are all designed for a restricted class of models with fixed dependency structures: see work by Tristan [2014] and the Stan Development Team [2014] for closed-world Bayes nets, Minka [2014] for factor graphs, and Kazemi and Poole [2016] for Markov logic networks.
Church [Goodman et al., 2008] and its variants [Wingate et al., 2011; Yang et al., 2014; Ritchie et al., 2016] provide a lightweight compilation framework for performing the Metropolis-Hastings algorithm (MH) over general open-universe probability models (OUPMs). However, these approaches (1) are based on an inefficient implementation of MH, which results in overhead in P_I, and (2) rely on inefficient data structures in the target code P_T.

For the first point, consider an MH iteration where we are proposing a new possible world w' from w according to the proposal g(·) by resampling random variable X from v to v'. The computation of the acceptance ratio α follows

α = min{1, (g(v' → v) Pr[w']) / (g(v → v') Pr[w])}.    (1)

Since only X is resampled, it is sufficient to compute α using merely the Markov blanket of X, which leads to a much simplified formula for α from Eq. (1) by cancelling terms in Pr[w] and Pr[w']. However, when applying MH to a contingent model, the Markov blanket of a random variable cannot be determined at compile time since the model dependencies vary during inference. This introduces tremendous runtime overhead for interpretive systems (e.g., BLOG, Figaro), which track all the dependencies at runtime even though typically only a tiny portion of the dependency structure may change per iteration. Church simply avoids keeping track of dependencies and uses Eq. (1) directly, incurring a potentially huge overhead for redundant probability computations.
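To spell out the cancellation (this restatement is ours, added for clarity): writing Pr[w] = ∏_Y Pr[Y | Par_w(Y)] and assuming a fixed dependency structure, so that Ch(X) is the same in w and w', every factor for a variable outside {X} ∪ Ch(X) is identical in the two worlds, and Eq. (1) reduces to

α = min{1, (g(v' → v) Pr[X = v' | Par(X)] ∏_{C ∈ Ch(X)} Pr[C | Par_{w'}(C)]) / (g(v → v') Pr[X = v | Par(X)] ∏_{C ∈ Ch(X)} Pr[C | Par_w(C)])},

which touches only the Markov blanket of X. In a contingent model, Ch(X) itself may differ between w and w', which is exactly why the blanket must be maintained at runtime.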
For the second point, in order to track the existence of the variables in an open-universe model, similar to BLOG, Church also maintains a complicated dynamic string-based hashing scheme in the target code, which again causes interpretation overhead.

Lastly, the compilation techniques proposed by Hur et al. [2014], Chaganty et al. [2013] and Nori et al. [2014] primarily focus on optimizing P_M by analyzing the static properties of the input PP, which are complementary to Swift.

3 Background

This paper focuses on the BLOG language, although our approach also applies to other PPLs with equivalent semantics [McAllester et al., 2008; Wu et al., 2014]. Other PPLs can also be converted to BLOG via static single assignment form (SSA form) transformation [Cytron et al., 1989; Hur et al., 2014].

3.1 The BLOG Language

The BLOG language [Milch et al., 2005a] defines probability measures over first-order (relational) possible worlds; in this sense it is a probabilistic analogue of first-order logic. A BLOG program declares types for objects and defines distributions over their numbers, as well as defining distributions for the values of random functions applied to objects. Observation statements supply evidence, and a query statement specifies the posterior probability of interest. A random variable in BLOG corresponds to the application of a random function to specific objects in a possible world. BLOG naturally supports open-universe probability models (OUPMs) and context-specific dependencies.

Fig. 2 demonstrates the open-universe urn-ball model. In this version the query asks for the color of the next random pick from an urn given the colors of balls drawn previously. Line 4 is a number statement stating that the number variable #Ball, corresponding to the total number of balls, is uniformly distributed between 1 and 20. Lines 5-6 declare a random function color(·), applied to balls, picking Blue or Green with a biased probability. Lines 7-8 state that each draw chooses a ball at random from the urn, with replacement.

Fig. 3 describes another OUPM, the infinite Gaussian mixture model (∞-GMM). The model includes an unknown number of clusters, as stated in line 3, which is to be inferred from the data.

1 type Cluster; type Data;
2 distinct Data D[20]; // 20 data points
3 #Cluster ~ Poisson(4); // the number of clusters
4 random Real mu(Cluster c) ~ Gaussian(0,10); // cluster mean
5 random Cluster z(Data d)
6   ~ UniformChoice({c for Cluster c});
7 random Real x(Data d) ~ Gaussian(mu(z(d)), 1.0); // data
8 obs x(D[0]) = 0.1; // omit other data points
9 query size({c for Cluster c});

Figure 3: The infinite Gaussian mixture model
3.2 Generic Inference Algorithms

All PPLs need to infer the posterior distribution of a query given observed evidence. The majority of PPLs use generic Monte Carlo sampling methods, such as likelihood weighting [Milch et al., 2005a] (LW) and the Metropolis-Hastings algorithm (MH). For LW, the main idea is to sample every random variable from its prior per iteration and collect weighted samples for the queries, where the weight is the likelihood of the evidence. For MH, the high-level procedure of the algorithm is summarized in Alg. 1. M denotes an input BLOG program (or model), E its evidence, Q a query, N the number of samples; W denotes a possible world, W(x) is the value of x in possible world W, Pr[x|W] denotes the conditional probability of x in W, and Pr[W] denotes the likelihood of possible world W. In particular, when the proposal distribution g in Alg. 1 is the prior of every variable, the algorithm becomes the parental MH algorithm (PMH).

Our discussion in the paper focuses on LW and PMH, but our proposed solution applies to other Monte Carlo methods as well.

Algorithm 1: Metropolis-Hastings algorithm (MH)
Input: M, E, Q, N; Output: samples H
1 initialize a possible world W_0 with E satisfied;
2 for i ← 1 to N do
3   randomly pick a variable x from W_{i-1};
4   W_i ← W_{i-1} and v ← W_{i-1}(x);
5   propose a value v' for x via proposal g;
6   W_i(x) ← v' and ensure W_i is self-supporting;
7   α ← min{1, (g(v' → v) Pr[W_i]) / (g(v → v') Pr[W_{i-1}])};
8   if rand(0, 1) ≥ α then W_i ← W_{i-1};
9   H ← H + (W_i(Q));
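To make the LW procedure concrete, the following is a small, self-contained C++ sampler for the urn-ball model of Fig. 2, written by hand as an illustration of the algorithm described above (it is not Swift's generated code). The observed variable color(drawn(D[0])) is fixed to Green and the sample is weighted by its likelihood; colors of balls that are never drawn are never sampled, which foreshadows the dynamic-slice idea used by Swift.

#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int> num_balls(1, 20);      // #Ball ~ UniformInt(1,20)
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    const int GREEN = 1, BLUE = 0;
    const int N = 1000000;
    double num = 0.0, den = 0.0;
    for (int i = 0; i < N; ++i) {
        int n = num_balls(rng);
        std::uniform_int_distribution<int> pick(0, n - 1);
        std::vector<int> color(n, -1);                         // -1: color(b) not yet sampled
        auto get_color = [&](int b) {                          // color(b) ~ Categorical(Blue 0.9, Green 0.1)
            if (color[b] < 0) color[b] = (unif(rng) < 0.1) ? GREEN : BLUE;
            return color[b];
        };
        int d0 = pick(rng), d1 = pick(rng);                    // drawn(D[0]), drawn(D[1])
        color[d0] = GREEN;                                     // evidence: color(drawn(D[0])) = Green
        double w = 0.1;                                        // weight = likelihood of the evidence
        num += w * (get_color(d1) == GREEN);                   // query: color(drawn(D[1])) = Green?
        den += w;
    }
    std::printf("P(color(drawn(D[1])) = Green | evidence) ~= %.4f\n", num / den);
    return 0;
}

In this particular model the evidence likelihood happens to be a constant 0.1, so the weights cancel in the ratio; in general the weight varies with the sampled parents of the observed variables.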
4 The Swift Compiler

Swift has the following design goals for handling open-universe probabilistic models.

• Provide a general framework to automatically and efficiently (1) track the exact Markov blanket for each random variable and (2) maintain the minimum set of random variables necessary to evaluate the query and the evidence (the dynamic slice [Agrawal and Horgan, 1990], or equivalently the minimum self-supporting partial world [Milch et al., 2005a]).

• Provide efficient memory management and fast data structures for Monte Carlo sampling algorithms to avoid interpretive overhead at runtime.

For the first goal, we propose a novel framework, FIDS (Framework of Incrementally updating Dynamic Slices), for generating the inference code (P_I in Fig. 1). For the second goal, we carefully choose data structures in the target C++ code (P_T).

For convenience, r.v. is short for random variable. We demonstrate examples of compiled code in the following discussion: the inference code (P_I) generated by FIDS in Sec. 4.1 is all in pseudo-code, while the target code (P_T) by Swift shown in Sec. 4.2 is in C++. Due to limited space, we only show the detailed transformation rules for the adaptive contingency updating (ACU) technique and omit the others.

4.1 Optimizations in FIDS

Our discussion in this section focuses on LW and PMH, although FIDS can also be applied to other algorithms. FIDS includes three optimizations: dynamic backchaining (DB), adaptive contingency updating (ACU), and reference counting (RC). DB is applied to both LW and PMH, while ACU and RC are specific to PMH.

Dynamic backchaining

Dynamic backchaining (DB) aims to incrementally construct a dynamic slice in each LW iteration and only sample those variables necessary at runtime. DB is adapted from the idea of compiling lazy evaluation (also similar to the Prolog compiler): in each iteration, DB backtracks from the evidence and query and samples an r.v. only when its value is required during inference.

For every r.v. declaration random T X ~ C_X; in the input PP, FIDS generates a getter function get_X(), which is used to (1) sample a value for X from its declaration C_X and (2) memoize the sampled value for later references in the current iteration. Whenever an r.v. is referred to during inference, its getter function will be invoked.

DB is the fundamental technique for Swift. The key insight is to replace the dependency look-up in some interpretive data structure with direct machine address/instruction seeking in the target executable file.
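The contrast can be sketched as follows (this is our schematic illustration, not code from any particular engine). An interpretive engine stores the model as a generic graph and resolves every dependency through runtime lookups:

#include <string>
#include <unordered_map>
#include <vector>

struct Node {
    std::vector<std::string> parents;   // dependencies stored by name
    double value = 0.0;
};

double interp_get(std::unordered_map<std::string, Node>& world, const std::string& name) {
    Node& n = world.at(name);                    // hash/string lookup at every access
    for (const std::string& p : n.parents)
        interp_get(world, p);                    // walk the dependency graph node by node
    return n.value;                              // (actual sampling elided in this sketch)
}

In Swift-generated code the dependency structure is instead baked into an ordinary call chain (see the getter for x(d) below); following a dependency is just a direct, statically resolved function call.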
For example, the inference code for the getter function of x(d) in the ∞-GMM model (Fig. 3) is shown below.

double get_x(Data d) {
  // some code memoizing the sampled value
  memoization;
  // if not sampled, sample a new value
  val = sample_gaussian(get_mu(get_z(d)), 1.0);
  return val;
}

Since get_z(·) is called before get_mu(·), only those mu(c) corresponding to non-empty clusters will be sampled. Notably, with the inference code above, no explicit dependency look-up is ever required at runtime to discover those non-empty clusters.
When sampling a number variable, we need to allocate memory for the associated variables. For example, in the urn-ball model (Fig. 2), we need to allocate memory for color(b) after sampling #Ball. The corresponding generated code is shown below.
int get_num_Ball() {
  // some code for memoization
  memoization;
  // sample a value
  val = sample_uniformInt(1, 20);
  // some code allocating memory for color
  allocate_memory_color(val);
  return val;
}

allocate_memory_color(val) denotes some pseudo-code segment for allocating val chunks of memory for the values of color(b).
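One plausible realization of allocate_memory_color(val) (again our sketch, not Swift's actual output) simply grows the per-ball storage for color(b), keeping any values and iteration stamps that already exist:

#include <vector>

std::vector<int> val_color;   // memoized values of color(b)
std::vector<int> ts_color;    // iteration stamps for color(b); -1 means never sampled

void allocate_memory_color(int val) {
  if ((int)val_color.size() < val) {
    val_color.resize(val, 0);
    ts_color.resize(val, -1);
  }
}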
Reference Counting

Reference counting (RC) generalizes the idea of DB to incrementally maintain the dynamic slice in PMH. RC is an efficient compilation strategy for the interpretive BLOG to dynamically maintain references to variables and exclude those without any references from the current possible world with minimal runtime overhead. RC is also similar to dynamic garbage collection in the programming language community.

For an r.v. being tracked, say X, RC maintains a reference counter cnt(X) defined by cnt(X) = |Ch_w(X)|, where Ch_w(X) denotes the children of X in the possible world w. When cnt(X) becomes zero, X is removed from w; when cnt(X) becomes positive, X is instantiated and added back to w. This procedure is performed recursively.

Tracking references to every r.v. might cause unnecessary overhead. For example, for classical Bayes nets, RC never excludes any variables. Hence, as a trade-off, Swift only counts references in open-universe models, particularly to those variables associated with number variables (e.g., color(b) in the urn-ball model).

Take the urn-ball model as an example. When resampling drawn(d) and accepting the proposed value v, the generated code for accepting the proposal will be

void inc_cnt(X) {
  if (cnt(X) == 0) W.add(X);
  cnt(X) = cnt(X) + 1;
}
void dec_cnt(X) {
  cnt(X) = cnt(X) - 1;
  if (cnt(X) == 0) W.remove(X);
}
void accept_value_drawn(Draw d, Ball v) {
  // code for updating dependencies omitted
  dec_cnt(color(val_drawn(d)));
  val_drawn(d) = v;
  inc_cnt(color(v));
}

The functions inc_cnt(X) and dec_cnt(X) update the references to X. The function accept_value_drawn() is specialized to r.v. drawn(d): Swift analyzes the input program and generates specialized code for different variables.
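The recursion mentioned above can be read as follows (this expanded version is our interpretation; the listing above omits it): removing a variable from the possible world releases the references it holds to the variables it depends on, which may in turn remove them, and re-instantiating a variable re-acquires those references symmetrically.

void dec_cnt(X) {
  cnt(X) = cnt(X) - 1;
  if (cnt(X) == 0) {
    W.remove(X);
    for (P in Par(X)) dec_cnt(P);   // cascade: X no longer references its parents
  }
}

inc_cnt() would recurse analogously when a variable is instantiated and its parents become referenced again.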
Adaptive contingency updating

The goal of ACU is to incrementally maintain the Markov blanket for every r.v. with minimal effort.

C_X denotes the declaration of r.v. X in the input PP; Par_w(X) denotes the parents of X in the possible world w; w[X ← v] denotes the new possible world derived from w by only changing the value of X to v. Note that deriving w[X ← v] may require instantiating new variables not existing in w due to dependency changes.

Computing Par_w(X) for X is straightforward: execute its generation process, C_X, within w, which is often inexpensive. Hence, the principal challenge for maintaining the Markov blanket is to efficiently maintain a children set Ch(X) for each r.v. X, with the aim of keeping Ch(X) identical to the true set Ch_w(X) for the current possible world w.

With a proposed possible world (PW) w' = w[X ← v], it is easy to add all missing dependencies from a particular r.v. U: for every V ∈ Par_{w'}(U), since we keep Ch(V) up-to-date, we can verify whether U ∈ Ch(V) and add U to Ch(V) if U ∉ Ch(V). Therefore, the key step is tracking the set of variables ∆(w[X ← v]) whose set of parents in w[X ← v] differs from that in w:

∆(w[X ← v]) = {Y : Par_w(Y) ≠ Par_{w[X←v]}(Y)}.

Precisely computing ∆(w[X ← v]) is very expensive at runtime, but using an upper bound for ∆(w[X ← v]) does not influence the correctness of the inference code. Therefore, we proceed to find an upper bound that holds for any value v. We call this over-approximation the contingent set Cont_w(X).

Note that for every r.v. U with dependency changes in w[X ← v], X must be reached in the condition of a control statement on the execution path of C_U within w. This implies ∆(w[X ← v]) ⊆ Ch_w(X). One straightforward idea is to set Cont_w(X) = Ch_w(X). However, this leads to too much overhead: for example, ∆(·) should always be empty for models with fixed dependencies.
One straightforward idea set W (Xstatement ←←v), reached of, namely aiscontrol on the(X). execution pathinofthe CVcondition within W Cont Ch However, this leads toW too much w (X) =on w execution trol statement the path of C within , namely V ∆(W (X ← v)) ⊆ Ch(X|W One straightforward ideaforis overhead. ∆(·)).).should be always empty ∆(W (X ←For v))example, ⊆=Ch(X|W One straightforward idea is set Cont(X|W ) Ch(X|W ). However, this leads to too models with fixed dependencies. set Cont(X|W ) = Ch(X|W ). However, this leads to too much overhead. For example, ∆( · ) should be always empty Another approach is to first∆( find the be set of switching much overhead. For example, · ) out should always empty for models with fixed dependency. variables S for every r.v. X, which is defined by for Another models X with fixed dependency. approach is to first find out the set of switching Another approach is toarfirst find6=out the set of switching SX =S{Y : ∃w, vP Pisar (X)}, variables every r.v. X, which defined w (X) w[Y ←v]by X for variables SX for every r.v. X, which is defined by This immediate advantage: SX can (Y be ← statically SXbrings = {Y an : ∃W, v P ar(X|W ) 6= P ar(X|W v))}, SX = {Yat :compile ∃W, v Ptime. ar(X|W ) 6= P ar(X|W (Y S← computed However, computing is NPX v))}, 1 This brings Complete . an immediate advantage: SX can be statically This brings immediate SX canbbeSstatically computed atancompile time.advantage: However, computing is NPX taking Our solution is to derive the approximation S by computed 1at compile time. However, computingXSX is NPComplete . the union 1of free variables of if/case conditions, function arComplete . Our solution to derive conditions the approximation SbX then by taking guments, and setisexpression in CX , and set solution is variables to derive of theif/case approximation SbXfunction by taking theOur union of free conditions, arbY }. the union of conditions, function arCont (X) = {Y : Yofconditions ∈if/case Chw (X) S wfree guments, and set variables expression in∧CXX ,∈and then set guments, and set expression conditions in CX , and then set Contw (X) is a function of the sampling variable Xb and the Cont(X|W ) = {Y : Y ∈ Ch(X|W ) ∧ we X ∈ bSY }. a current PW w. Likewise, r.v. )X, Cont(X|W ) = {Y : Yfor∈ every Ch(X|W ∧ X ∈maintain SY }. runtime contingent set Cont(X) identical to theXtrue Cont(X|W ) is a function of the sampling variable andset the Cont(X|W )W is .a Likewise, function ofwe the sampling X and Cont the current PW w. Cont(X) be the inw (X) current PWunder maintain a variable runtime can contingent current PW W .for Likewise, weXmaintain a runtime contingent crementally updated as well. set Cont(X) every r.v. corresponding to true continset Cont(X) for every r.v. X corresponding to true continBack to the original focus of ACU, suppose we are adding gent set Cont(X|W ) under the current PW W . Cont(X) gent setincrementally Cont(X|W ) updated. the PW current W← . Cont(X) new dependencies inunder the new w0 =PW w[X v]. There can the be can be incrementally updated. are Back three to steps accomplish goal: (1) enumerating the theto original focusthe of ACU, suppose we are all adding Back focus ACU, are 0 (U variables Uthe ∈ original Cont(X), (2)of for all P ar ), adding add are U wx). new thetodependencies in the new PWVsuppose W∈(X ←we There new the dependencies new (X ←Sbx). 
There 0 (U to Ch(V), and (3) forinalltheVthe ∈goal: PPW arwW )∩ U are to three steps to accomplish (1) enumerating the U , addall three steps These to accomplish (1) enumerating allway the Cont(V). steps canthe be goal: also repeated in a similar 1 Since the CX may contain arbitrary boolean formuto remove thedeclaration vanished dependencies. 1 Since thereduce declaration CX may contain computing arbitrary boolean las, one the can the model 3-SAT problem SX . formuTake ∞-GMM (Fig. 3)toto ascomputing an example las, one can reduce the 3-SAT problem SX,. when resampling z(d), we need to change the dependency of r.v. x(d) 1 Since the declaration CX may contain arbitrary boolean formulas, one can reduce the 3-SAT problem to computing SX . { F Fcc (C, Id, Id, {}) {}) } } cc (C, { void Id::accept_value(T Id::accept_value(T v) v) void Fcc(M (M= ={ random Tin Id([T Id,] ,]∗∗∗∗))))∼ ∼C) C)= = {random for(uT in Cont(Id)) u.del_from_Ch(); FF (M == random T Id([T Id ,] ∼∼ C) == for(u Cont(Id)) u.del_from_Ch(); (M random T Id([T Id ,] C) F Id([T Id c c void Id::add_to_Ch() Id = v; void Id::add_to_Ch() Id = v; void Id::add_to_Ch() Id::add_to_Ch() void {F Fcccc(C, (C, Id, {}) } } for(u inId, Cont(Id)) u.add_to_Ch(); } } { FF (C, Id, {}) }} for(u in Cont(Id)) u.add_to_Ch(); { Id, {}) { {}) cc (C, cc ∗ void Id::accept_value(T v) ∗ void Id::accept_value(T v) F (M = random T Id([T Id ,] ) ∼ C) = c void Id::accept_value(T v) Fc (M =void random T Id([T Id ,] ) ∼ C) = Id::accept_value(T v) { for(u in Cont(Id)) Cont(Id)) u.del_from_Ch(); Fcc (Cvoid ={ Exp, X, seen) seen) = Fcc (Exp, X, X, seen) {{ for(u in Cont(Id)) u.del_from_Ch(); cc (C cc (Exp, void Id::add_to_Ch() F = Exp, X, = F seen) for(u in Cont(Id)) u.del_from_Ch(); Id::add_to_Ch() for(u in u.del_from_Ch(); IdF =(C, v;Id, Fcc (C = = Dist(Exp), Dist(Exp), X, seen) seen) =F Fcc (Exp, X, X, seen) seen) Id = v; cc (C cc (Exp, {Id F= (C, Id, {}) } }= F X, cc Id = v; { {}) v; cc for(u in Cont(Id)) Cont(Id)) u.add_to_Ch(); } Fcc (C void = if iffor(u (Exp) then C11 else else C Cu.add_to_Ch(); X, seen) = = for(u in Cont(Id)) u.add_to_Ch(); }} cc (C 2 void Id::accept_value(T v) seen) F = (Exp) then C ,, X, for(u in Cont(Id)) u.add_to_Ch(); 2 Id::accept_value(T v) in } F (Exp, X, seen) cc { for(u in Cont(Id)) u.del_from_Ch(); F (Exp, X, seen) {ccfor(u in Cont(Id)) u.del_from_Ch(); Fcccc(C (C== = Exp, X,v; seen)= =F Fcccc(Exp, (Exp,X, X,seen) seen) if (Exp) FF (C Exp, X, seen) == FF (Exp, seen) Id = v; if (Exp) Exp, X, seen) X, seen) Id = F Exp, X, seen) cc (C= cc (Exp,X, cc cc F (C = Dist(Exp), X, seen) = F (Exp, X,seen) seen) { F (C , X, seen ∪ F V (Exp)) }X, F = Dist(Exp), X, seen) = F (Exp, X, seen) cc(C cc cc 1 cc cc for(u in Cont(Id)) u.add_to_Ch(); } { F (C , X, seen ∪ F V (Exp)) } F (C = Dist(Exp), X, seen) = F (Exp, seen) cc 1 X, for(u in Cont(Id)) } Fcccc(C = Dist(Exp), seen) = Fccccu.add_to_Ch(); (Exp, X, Fcccc(C (C== = if(Exp) (Exp)then thenC C11else elseC C22,,,,X, X,seen) seen)= = else FF (C if (Exp) then C else X, == else if (Exp) then C C seen) F if cc (C= cc 11 elseC 22 X,seen) F (Exp, X, seen) { FX, (C X, seen ∪ FV V (Exp)) (Exp)) } FF X, seen) cc(Exp, cc (C 2 ,,seen) cc Fcc (C = =F Exp, X, seen) = Fcc (Exp, X, seen) seen)} { F X, seen F cc (C cc∪ (Exp, X, seen) cc 2 F Exp, seen) = F (Exp, X, X, cc(Exp, cc if (Exp) X, if (Exp) Fcc (C = =if Dist(Exp), X, seen) seen) = =F Fcc (Exp, X, X, seen) seen) cc (C cc (Exp, if (Exp) F Dist(Exp), (Exp) { (Exp) Fcccc(C (C1then X,seen seen ∪F FV VC (Exp)) } = {X, F (C seen ∪∪ FF VV (Exp)) }} 1, ,X, Fcc (Exp, X, 
Take the ∞-GMM model (Fig. 3) as an example. When resampling z(d), we need to change the dependency of r.v. x(d), since Cont_w(z(d)) = {x(d)} for any possible world w. The generated code for this process is shown below.

void x(d)::add_to_Ch() {
  Cont(z(d)) += x(d);
  Ch(z(d)) += x(d);
  Ch(mu(z(d))) += x(d);
}
void z(d)::accept_value(Cluster v) {
  // accept the proposed value v for r.v. z(d)
  // code for updating references omitted
  for (u in Cont(z(d))) u.del_from_Ch();
  val_z(d) = v;
  for (u in Cont(z(d))) u.add_to_Ch();
}

We omit the code for del_from_Ch() for conciseness, which is essentially the same as add_to_Ch().
  Ch(z(d)) += x(d);
  Ch(mu(z(d))) += x(d);
}
void z(d)::accept_value(Cluster v) {
  // accept the proposed value v for r.v. z(d)
  // code for updating references omitted
  for (u in Cont(z(d))) u.del_from_Ch();
  val_z(d) = v;
  for (u in Cont(z(d))) u.add_to_Ch();
}

We omit the code for del_from_Ch() for conciseness, which is essentially the same as add_to_Ch().
The formal transformation rules for ACU are demonstrated in Fig. 4. F_c takes in a random variable declaration statement in the BLOG program and outputs the inference code containing the methods accept_value() and add_to_Ch(). F_cc takes in a BLOG expression and generates the inference code inside the method add_to_Ch(). It keeps track of already seen variables in seen, which makes sure that a variable will be added to the contingent set or the children set at most once. Code for removing variables can be generated similarly.
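The rules in Fig. 4 can be read as an ordinary traversal over the free variables of an expression. The following is a purely illustrative, hypothetical sketch of such a traversal in C++; the Expr type, the function name, and the string-based code emission are our own assumptions and are not Swift's actual implementation. It only shows how the seen set guarantees at most one Cont/Ch update per variable.

#include <set>
#include <string>
#include <vector>

// Illustrative stand-in for a BLOG expression: we only need its free variables.
struct Expr { std::vector<std::string> free_vars; };

// Emit the ACU update lines for expression e inside add_to_Ch() of r.v. X.
// `switching` plays the role of S^b_X; `seen` prevents duplicate updates.
void emit_updates(const Expr& e, const std::string& X,
                  const std::set<std::string>& switching,
                  std::set<std::string>& seen,
                  std::vector<std::string>& out) {
  for (const std::string& u : e.free_vars) {
    if (seen.count(u)) continue;
    if (switching.count(u)) out.push_back("Cont(" + u + ") += " + X + ";");
    out.push_back("Ch(" + u + ") += " + X + ";");
    seen.insert(u);
  }
}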
Since we maintain the exact children for each r.v. with ACU, the computation of the acceptance ratio α (line 7 in Alg. 1) for PMH can be simplified to

  min( 1, (|W_{i-1}| Pr[x = v | W_i] ∏_{u ∈ Ch(x|W_i)} Pr[u | W_i]) / (|W_i| Pr[x = v' | W_{i-1}] ∏_{v ∈ Ch(x|W_{i-1})} Pr[v | W_{i-1}]) )    (2)

Here |W| denotes the total number of random variables existing in W, which can be maintained via RC. Finally, the computation time of ACU is strictly shorter than that of the acceptance ratio α (Eq. 2).
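As a concrete illustration of Eq. (2), the ratio can be evaluated in log space while touching only the resampled variable and its maintained children. The sketch below is our own hypothetical code, not Swift's generated output; the RV struct and function name are assumptions, and the children sets and world sizes are taken to be supplied by ACU and RC respectively.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct RV {                  // stand-in for a compiled random variable
  double cur_loglik  = 0.0;  // log Pr[ . | W_{i-1} ]
  double prop_loglik = 0.0;  // log Pr[ . | W_i ] under the proposed value
};

// x: the resampled variable; ch_old = Ch(x|W_{i-1}), ch_new = Ch(x|W_i);
// n_old = |W_{i-1}|, n_new = |W_i|.
double pmh_accept_prob(const RV& x,
                       const std::vector<const RV*>& ch_old,
                       const std::vector<const RV*>& ch_new,
                       std::size_t n_old, std::size_t n_new) {
  double log_ratio = std::log((double)n_old) - std::log((double)n_new)
                   + x.prop_loglik - x.cur_loglik;
  for (const RV* u : ch_new) log_ratio += u->prop_loglik;  // numerator product
  for (const RV* v : ch_old) log_ratio -= v->cur_loglik;   // denominator product
  return std::min(1.0, std::exp(log_ratio));
}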
4.2 Implementation of Swift

Lightweight memoization
Here we introduce the implementation details of the memoization code in the getter function mentioned in section 4.1. Objects will be converted to integers in the target code. Swift analyzes the input PP and allocates static memory for memoization in the target code. For open-universe models, where the number of random variables is unknown, a dynamic table data structure (e.g., vector<int> in C++) is used to reduce the amount of dynamic memory allocations: we only increase the length of the array when the number becomes larger than the capacity.
One potential weakness of directly applying a dynamic table is that, for models with multiple nested number statements (e.g., the aircraft tracking model with multiple aircraft and blips for each one), the memory consumption can be large. In Swift, the user can force the target code to clear all the unused memory every fixed number of iterations via a compilation option, which is turned off by default.
The following code demonstrates the target code for the number variable #Ball and its associated color(b) in the urn-ball model (Fig. 2), where iter is the current iteration number.

vector<int> val_color, mark_color;
int get_color(int b) {
  if (mark_color[b] == iter)
    return val_color[b];   // memoization
  mark_color[b] = iter;    // mark the flag
  val_color[b] = ...       // sampling code omitted
  return val_color[b];
}
int val_num_B, mark_num_B;
int get_num_Ball() {
  if (mark_num_B == iter) return val_num_B;
  val_num_B = ...          // sampling code omitted
  if (val_num_B > val_color.size()) { // allocate
    val_color.resize(val_num_B);      // memory
    mark_color.resize(val_num_B);
  }
  return val_num_B;
}

Note that when generating a new PW, by simply increasing the counter iter, all the memoization flags (e.g., mark_color for color(·)) will be automatically cleared.
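A minimal, self-contained sketch of this counter-based invalidation (our own illustration, not Swift's output): bumping a single integer makes every memoized entry stale at once, with no per-variable work.

#include <vector>

int iter = 0;                        // current-iteration counter
std::vector<double> val(100);        // memoized values
std::vector<int>    mark(100, -1);   // per-variable flags

bool is_memoized(int i) { return mark[i] == iter; }

void new_possible_world() { ++iter; }  // every flag now reads as "not sampled yet"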
Lastly, in PMH, we need to randomly select an r.v. to sample per iteration. In the target C++ code, this process is accomplished via polymorphism by declaring an abstract class with a resample() method and a derived class for every r.v. in the PP. An example code snippet for the ∞-GMM model is shown below.

class MH_OBJ { public:             // abstract class
  virtual void resample() = 0;     // resample step
  virtual void add_to_Ch() = 0;    // for ACU
  virtual double get_likeli() = 0;
  virtual void accept_value() = 0; // for ACU
};                                 // some methods omitted
// maintain all the r.v.s in the current PW
std::vector<MH_OBJ*> active_vars;
// derived classes for r.v.s in the PP
class Var_mu : public MH_OBJ { public: // for mu(c)
  double val;                  // value for the var
  double cached_val;           // cache for proposal
  int mark, cache_mark;        // flags
  std::set<MH_OBJ*> Ch, Cont;  // sets for ACU
  double getval() { ... }      // sample & memoize
  double getcache() { ... }    // generate proposal
  ...
};                             // some methods omitted
std::vector<Var_mu*> mu;       // dynamic table
class Var_z : public MH_OBJ { ... }; // for z(d)
Var_z* z[20];                  // fixed-size array
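Given this interface, the per-iteration selection described above reduces to indexing into active_vars and relying on virtual dispatch. The loop below is our own hypothetical sketch of what such a driver could look like, not Swift's actual generated code; it assumes the MH_OBJ interface and active_vars from the snippet above and a non-empty set of instantiated variables.

#include <cstddef>
#include <random>
#include <vector>

void run_pmh(std::vector<MH_OBJ*>& active_vars, int n_iters, std::mt19937& rng) {
  for (int it = 0; it < n_iters; ++it) {
    std::uniform_int_distribution<std::size_t> pick(0, active_vars.size() - 1);
    // the model-specific resample() proposes a value, computes the acceptance
    // ratio, and accepts or rejects, updating the dependency sets as needed
    active_vars[pick(rng)]->resample();
  }
}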
Efficient proposal manipulation
Although one PMH iteration only samples a single variable, generating a new PW may still involve (de-)instantiating an arbitrary number of random variables due to RC or sampling a number variable. The general approach commonly adopted by many PPL systems is to construct the proposed possible world in dynamically allocated memory and then copy it to the current world [Milch et al., 2005a; Wingate et al., 2011], which suffers from significant overhead.

In order to generate and accept new PWs in PMH with negligible memory management overhead, we extend the lightweight memoization to manipulate proposals: Swift statically allocates an extra memoized cache for each random variable, which is dedicated to storing the proposed value for that variable. During resampling, all the intermediate results are stored in the cache. When accepting the proposal, the proposed value is directly loaded from the cache; upon rejection, no action is needed due to our memoization mechanism.

Here is an example target code fragment for mu(c) in the ∞-GMM model. proposed_vars is an array storing all the variables to be updated if the proposal gets accepted.

std::vector<MH_OBJ*> proposed_vars;
class Var_mu : public MH_OBJ { public: // for mu(c)
  double val;            // value
  double cached_val;     // cache for proposal
  int mark, cache_mark;  // flags
  double getcache() {    // propose new value
    if (mark == 1) return val;
    if (cache_mark == iter) return cached_val;
    cache_mark = iter;   // memoization
    proposed_vars.push_back(this);
    cached_val = ...     // sample new value
    return cached_val;
  }
  void accept_value() {
    val = cached_val;    // accept proposed value
    if (mark == 0) {     // not instantiated yet
      mark = 1;          // instantiate variable
      active_vars.push_back(this);
    }
  }
  void resample() {      // resample
    proposed_vars.clear();
    mark = 0;            // generate proposal
    double new_val = getcache();
    mark = 1;
    double alpha = ...   // compute acceptance ratio
    if (sample_unif(0,1) <= alpha) {
      for (auto v : proposed_vars)
        v->accept_value();   // accept
      // code for ACU and RC omitted
    }
  }
};                       // some methods omitted

Supporting new algorithms
We demonstrate that FIDS can be applied to new algorithms by implementing a translator for the Gibbs sampling [Arora et al., 2010] (Gibbs) in Swift.

Gibbs is a variant of MH with the proposal distribution g(·) set to be the posterior distribution. The acceptance ratio α is always 1, while the disadvantage is that it is only possible to explicitly construct the posterior distribution with conjugate priors. In Gibbs, the proposal distribution is still constructed from the Markov blanket, which again requires maintaining the children set. Hence, FIDS can be fully utilized.
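For instance, when a Gaussian mean has a Gaussian prior and its children follow a Gaussian likelihood with known variance, the posterior over that mean given its current children is again Gaussian. The function below is a hypothetical sketch of such a posterior sampler; the name and signature are our own and not Swift's runtime library, and it is only meant to illustrate the kind of conjugate step Gibbs performs from the Markov blanket.

#include <cmath>
#include <random>
#include <vector>

// Resample a Gaussian mean given its children (Normal likelihood with known
// variance obs_var) and a Normal(prior_mean, prior_var) prior.
double gibbs_resample_mean(const std::vector<double>& children,
                           double prior_mean, double prior_var,
                           double obs_var, std::mt19937& rng) {
  double sum = 0.0;
  for (double x : children) sum += x;
  double n = (double)children.size();
  double post_var  = 1.0 / (1.0 / prior_var + n / obs_var);
  double post_mean = post_var * (prior_mean / prior_var + sum / obs_var);
  std::normal_distribution<double> post(post_mean, std::sqrt(post_var));
  return post(rng);
}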
in in the thedistributions, PP at at compile compile time (the checker has nothing nothing to do do proposal wetime need(the to (1) implement a conjugacy r.v. PP checker has to with FIDS); FIDS); (2) implement implement a posterior posterior sampler for every every conchecker to check the form aof posteriorsampler distribution for each with (2) for conjugate prior in the runtime library of Swift. r.v. in the PP at compile time (the checker has nothing to do jugate prior in the runtime library of Swift. with FIDS); (2) implement a posterior sampler for every conExperiments jugate prior in the runtime library of Swift. 55 Experiments In the experiments, Swift Swift generates generates target target code code in in C++ C++ with with In the experiments, 5C++ C++Experiments standard <random> library for random number generstandard <random> library for random number generation and the armadillo armadillo package Sanderson, 2010 forwith maation the ma[[Sanderson, ]] for In theand experiments, Swiftpackage generates target code2010 in C++ trix computation. The baseline systems include BLOG (vertrix computation. The baseline systems include BLOG (verC++ standard <random> library for random number genersion 0.9.1), 0.9.1), Figaro (version 3.3.0), Church (webchurch), (webchurch), Insion (version 3.3.0), Church In[Sanderson, ation and theFigaro armadillo package 2010] for mafer.NET (version 2.6), BUGS (winBUGS 1.4.3) and Stan fer.NET (version The 2.6),baseline BUGS systems (winBUGS 1.4.3) and (verStan trix computation. include BLOG (cmdStan 2.6.0). (cmdStan sion 0.9.1),2.6.0). Figaro (version 3.3.0), Church (webchurch), Infer.NET (version 2.6), BUGS (winBUGS 1.4.3) and Stan 5.1 Benchmark Benchmark models 5.1 models (cmdStan 2.6.0). We collect collect aa set set of of benchmark benchmark models models22 which which exhibit exhibit varvarWe ious capabilities of a PPL (Tab. 1), including the burglary ious capabilities of a PPL (Tab. 1), including the burglary 5.1 Benchmark models model (Bur), (Bur), the the hurricane hurricane model model (Hur), (Hur), the urn-ball model model 2 the urn-ball model We collect denotes a set of benchmark models (U-B(x,y) the urn-ball model model withwhich at most mostexhibit ballsvarand (U-B(x,y) denotes the urn-ball with at xx balls and ious capabilities of a PPL (Tab. 1), including the draws), the the TrueSkill TrueSkill model model [[Herbrich Herbrich et et al., al., 2007 2007burglary (T-K), ]] (T-K), yy draws), model (Bur), the hurricane model model (Hur), (GMM, the urn-ball 1-dimensional Gaussian mixture 100 model points 1-dimensional Gaussian mixture model (GMM, 100 points (U-B(x,y) denotes the urn-ball model with at most x balls and with 44 clusters) clusters) and and the the infinite infinite GMM GMM model model (∞-GMM). (∞-GMM). We with We [Herbrich yalso draws), theaTrueSkill et al.,digits 2007][[LeCun (T-K), include real-worldmodel dataset: handwritten also include a real-world dataset: handwritten digits LeCun 1-dimensional Gaussian mixturemodel modelTipping (GMM,and 100Bishop, points et al., al., 1998 1998]] using using the PPCA PPCA et the model [[Tipping and Bishop, with 4 clusters) and the infinite GMM model (∞-GMM). We 1999]].. All All the the models models can can be be downloaded downloaded from from the the GitHub GitHub 1999 [LeCun also includeof aBLOG. real-world dataset: handwritten digits repository repository of] BLOG. et al., 1998 using the PPCA model [Tipping and Bishop, ] 1999 . 
5.2 Speedup by FIDS within Swift
We first evaluate the speedup promoted by each of the three optimizations in FIDS individually.

Dynamic Backchaining for LW
We compare the running time of the following versions of code: (1) the code generated by Swift ("Swift"); (2) the modified compiled code with DB manually turned off ("No DB"), which sequentially samples all the variables in the PP; (3) the hand-optimized code without unnecessary memoization or recursive function calls ("Hand-Opt").

² Most models are simplified to ensure that all the benchmark systems can produce an output in reasonably short running time. All of these models can be enlarged to handle more data without adding extra lines of BLOG code.

Table 1: Features of the benchmark models. R: continuous scalar or vector variables. CT: context-specific dependencies (contingency). CC: cyclic dependencies in the PP (while in any particular PW the dependency structure remains acyclic). OU: open-universe. CG: conjugate priors.

Alg.                              Bur        Hur        U-B(20,2)    U-B(20,8)
Running Time (s): Swift vs. No DB
  No DB                           0.768      1.288      2.952        5.040
  Swift                           0.782      1.115      1.755        4.601
  Speedup                         0.982      1.155      1.682        1.10
Running Time (s): Swift vs. Hand-Optimized
  Hand-Opt                        0.768      1.099      1.723        4.492
  Swift                           0.782      1.115      1.755        4.601
  Overhead                        1.8%       1.4%       1.8%         2.4%
Calls to Random Number Generator
  No DB                           3 × 10^7   3 × 10^7   1.35 × 10^8  1.95 × 10^8
  Swift                           3 × 10^7   2.0 × 10^7 4.82 × 10^7  1.42 × 10^8
  Calls Saved                     0%         33.3%      64.3%        27.1%

Table 2: Performance on LW with 10^7 samples for Swift, the version without DB and the hand-optimized version.

We measure the running time for all 3 versions and the number of calls to the random number generator for "Swift" and "No DB". The results are summarized in Tab. 2.
The overhead due to memoization compared against the hand-optimized code is less than 2.4%. We can further notice that the speedup is proportional to the number of calls saved.

Reference Counting for PMH
RC only applies to open-universe models in Swift. Hence we focus on the urn-ball model with various model parameters. The urn-ball model represents a common model structure appearing in many applications with open-universe uncertainty. This experiment reveals the potential speedup by RC for real-world problems with similar structures. RC achieves greater speedup when the number of balls and the number of observations become larger.

We compare the code produced by Swift with RC ("Swift") and RC manually turned off ("No RC"). For "No RC", we traverse the whole dependency structure and reconstruct the dynamic slice every iteration. Fig. 5 shows the running time of both versions. RC is indeed effective, leading to up to 2.3x speedup.

Figure 5: Running time (s) of PMH in Swift with and without RC on urn-ball models for 20 million samples.

Adaptive Contingency Updating for PMH
We compare the code generated by Swift with ACU ("Swift") and the manually modified version without ACU ("Static Dep") and measure the number of calls to the likelihood function in Tab. 3. The version without ACU ("Static Dep") demonstrates the best efficiency that can be achieved via compile-time analysis without maintaining dependencies at runtime. This version statically computes an upper bound of the children set by Ch(X) = {Y : X appears in C_Y} and again uses Eq. (2) to compute the acceptance ratio. Thus, for models with fixed dependencies, "Static Dep" should be faster than "Swift".

Alg.                              Bur        Hur        U-B(20,10)   U-B(40,20)
Running Time (s): Swift vs. Static Dep
  Static Dep                      0.986      1.642      4.433        7.722
  Swift                           0.988      1.492      3.891        4.514
  Speedup                         0.998      1.100      1.139        1.711
Calls to Likelihood Functions
  Static Dep                      1.8 × 10^7 2.7 × 10^7 1.5 × 10^8   3.0 × 10^8
  Swift                           1.8 × 10^7 1.7 × 10^7 4.1 × 10^7   5.5 × 10^7
  Calls Saved                     0%         37.6%      72.8%        81.7%

Table 3: Performance of PMH in Swift and the version utilizing static dependencies on benchmark models.

In Tab. 3, "Swift" is up to 1.7x faster than "Static Dep", thanks to up to 81% reduction of likelihood function evaluations. Note that for the model with fixed dependencies (Bur), the overhead by ACU is almost negligible: 0.2%, due to traversing the hashmap to access the child variables.

5.3 Swift against other PPL systems
We compare Swift with other systems using LW, PMH and Gibbs respectively on the benchmark models. The running time is presented in Tab. 4. The speedup is measured between Swift and the fastest one among the rest. An empty entry means that the corresponding PPL fails to perform inference on that model. Though these PPLs have different host languages (C++, Java, Scala, etc.), the performance difference resulting from host languages is within a factor of 2³.

5.4 Experiment on real-world dataset
We use Swift with Gibbs to handle the probabilistic principal component analysis (PPCA) [Tipping and Bishop, 1999], with real-world data: 5958 images of digit "2" from the handwritten digits dataset (MNIST) [LeCun et al., 1998]. We compute 10 principal components for the digits. The training and testing sets include 5958 and 1032 images respectively, each with 28x28 pixels and pixel values rescaled to [0, 1].
Since most benchmark PPL systems are too slow to handle this amount of data, we compare Swift against two other scalable PPLs, Infer.NET and Stan, on this model. Both PPLs have compiled inference and are widely used for real-world applications. Stan uses HMC as its inference algorithm. For Infer.NET, we select the variational message passing algorithm (VMP), which is Infer.NET's primary focus. Note that HMC and VMP are usually favored for fast convergence.

³ http://benchmarksgame.alioth.debian.org/

PPL            Bur    Hur    U-B(20,10)  T-K    GMM    ∞-GMM
LW with 10^6 samples
  Church       9.6    22.9   179         57.4   1627   1038
  Figaro       15.8   24.7   176         48.6   997    235
  BLOG         8.4    11.9   189         49.5   998    261
  Swift        0.04   0.08   0.54        0.24   6.8    1.1
  Speedup      196    145    326         202    147    214
PMH with 10^6 samples
  Church       12.7   25.2   246         173    3703   1057
  Figaro       10.6   59.6   -           17.4   151    62.2
  BLOG         6.7    18.5   30.4        38.7   72.5   68.3
  Swift        0.11   0.15   0.4         0.32   0.76   0.60
  Speedup      61     121    75          54     95     103
Gibbs with 10^6 samples
  BUGS         87.7   -      -           -      84.4   -
  Infer.NET    1.5    -      -           -      77.8   -
  Swift        0.12   0.19   0.34        -      0.42   0.86
  Speedup      12     ∞      ∞           -      185    ∞

Table 4: Running time (s) of Swift and other PPLs on the benchmark models using LW, PMH and Gibbs.

Stan requires a tuning process before it can produce samples. We ran Stan with 0, 5 and 9 tuning steps respectively⁴. We measure the perplexity of the generated samples over test images w.r.t. the running time in Fig. 6, where we also visualize the produced principal components. For Infer.NET, we consider the mean of the approximation distribution.

Swift quickly generates visually meaningful outputs with around 10^5 Gibbs iterations in 5 seconds. Infer.NET takes 13.4 seconds to finish the first iteration⁵ and converges to a result with the same perplexity after 25 iterations and 150 seconds. The overall convergence of Gibbs w.r.t. the running time significantly benefits from the speedup by Swift.

For Stan, its no-U-turn sampler [Homan and Gelman, 2014] suffers from significant parameter tuning issues. We also tried to manually tune the parameters, which does not help much. Nevertheless, Stan does work for the simplified PPCA model with 1-dimensional data. Although further investigation is still needed, we conjecture that Stan is very sensitive to its parameters in high-dimensional cases. Lastly, this experiment also suggests that parameter-robust inference engines, such as Swift with Gibbs and Infer.NET with VMP, would be preferred in practice when possible.

⁴ We also ran Stan with 50 and 100 tuning steps, which took more than 3 days to finish 130 iterations (including tuning). However, the results are almost the same as that with 9 tuning steps.
⁵ Gibbs by Swift samples a single variable while Infer.NET processes all the variables per iteration.

Figure 6: Log-perplexity w.r.t. running time (s) on the PPCA model with visualized principal components. Swift converges faster.

6 Conclusion and Future Work
We have developed Swift, a PPL compiler that generates model-specific target code for performing inference. Swift uses a dynamic slicing framework for open-universe models to incrementally maintain the dependency structure at runtime. In addition, we carefully design data structures in the target code to avoid dynamic memory allocations when possible. Our experiments show that Swift achieves orders of magnitude speedup against other PPL engines.

The next step for Swift is to support more algorithms, such as SMC [Wood et al., 2014], as well as more samplers, such as the block sampler for (nearly) deterministic dependencies [Li et al., 2013].
Although FIDS is a general framework that can be naturally extended to these cases, there are still implementation details to be studied. Another direction is partial evaluation, which allows the compiler to reason about the input program to simplify it. Shah et al. [2016] proposes a partial evaluation framework for the inference code (PI in Fig. 1). It is very interesting to extend this work to the whole Swift pipeline. Acknowledgement We would like to thank our anonymous reviewers, as well as Rohin Shah for valuable discussions. This work is supported by the DARPA PPAML program, contract FA875014-C-0011. References [Agrawal and Horgan, 1990] Hiralal Agrawal and Joseph R Horgan. Dynamic program slicing. In ACM SIGPLAN Notices, volume 25, pages 246–256. ACM, 1990. [Arora et al., 2010] Nimar S. Arora, Rodrigo de Salvo Braz, Erik B. Sudderth, and Stuart J. Russell. Gibbs sampling in open-universe stochastic languages. In UAI, pages 30–39. AUAI Press, 2010. [Chaganty et al., 2013] Arun Chaganty, Aditya Nori, and Sriram Rajamani. Efficiently sampling probabilistic programs via program analysis. In AISTATS, pages 153–160, 2013. [Cytron et al., 1989] Ron Cytron, Jeanne Ferrante, Barry K Rosen, Mark N Wegman, and F Kenneth Zadeck. An efficient method of computing static single assignment form. In Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 25–35. ACM, 1989. [Goodman et al., 2008] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: A language for generative models. In UAI, pages 220–229, 2008. [Herbrich et al., 2007] Ralf Herbrich, Tom Minka, and Thore Graepel. Trueskill(tm): A bayesian skill rating system. In Advances in Neural Information Processing Systems 20, pages 569–576. MIT Press, January 2007. [Homan and Gelman, 2014] Matthew D Homan and Andrew Gelman. The no-u-turn sampler: Adaptively setting path lengths in hamiltonian monte carlo. The Journal of Machine Learning Research, 15(1):1593–1623, 2014. [Hur et al., 2014] Chung-Kil Hur, Aditya V. Nori, Sriram K. Rajamani, and Selva Samuel. Slicing probabilistic programs. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 133–144, New York, NY, USA, 2014. ACM. [Kazemi and Poole, 2016] Seyed Mehran Kazemi and David Poole. Knowledge compilation for lifted probabilistic inference: Compiling to a low-level language. 2016. [Kucukelbir et al., 2015] Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic variational inference in Stan. In NIPS, pages 568–576, 2015. [LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [Li et al., 2013] Lei Li, Bharath Ramsundar, and Stuart Russell. Dynamic scaled sampling for deterministic constraints. In AISTATS, pages 397–405, 2013. [Lunn et al., 2000] David J. Lunn, Andrew Thomas, Nicky Best, and David Spiegelhalter. WinBUGS - a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4):325–337, October 2000. [Mansinghka et al., 2013] Vikash K. Mansinghka, Tejas D. Kulkarni, Yura N. Perov, and Joshua B. Tenenbaum. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In NIPS, pages 1520–1528, 2013. [McAllester et al., 2008] David McAllester, Brian Milch, and Noah D Goodman. 
Random-world semantics and syntactic independence for expressive languages. Technical report, 2008. [Milch and Russell, 2006] Brian Milch and Stuart J. Russell. General-purpose MCMC inference over relational structures. In UAI. AUAI Press, 2006. [Milch et al., 2005a] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. BLOG: Probabilistic models with unknown objects. In IJCAI, pages 1352–1359, 2005. [Milch et al., 2005b] Brian Milch, Bhaskara Marthi, David Sontag, Stuart Russell, Daniel L. Ong, and Andrey Kolobov. Approximate inference for infinite contingent Bayesian networks. In Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005. [Minka et al., 2014] T. Minka, J.M. Winn, J.P. Guiver, S. Webster, Y. Zaykov, B. Yangel, A. Spengler, and J. Bronskill. Infer.NET 2.6, 2014. [Nori et al., 2014] Aditya V Nori, Chung-Kil Hur, Sriram K Rajamani, and Selva Samuel. R2: An efficient MCMC sampler for probabilistic programs. In AAAI Conference on Artificial Intelligence, 2014. [Pfeffer, 2001] Avi Pfeffer. IBAL: A probabilistic rational programming language. In In Proc. 17th IJCAI, pages 733–740. Morgan Kaufmann Publishers, 2001. [Pfeffer, 2009] Avi Pfeffer. Figaro: An object-oriented probabilistic programming language. Charles River Analytics Technical Report, 2009. [Plummer, 2003] Martyn Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124, page 125. Vienna, 2003. [Ritchie et al., 2016] Daniel Ritchie, Andreas Stuhlmüller, and Noah D. Goodman. C3: Lightweight incrementalized MCMC for probabilistic programs using continuations and callsite caching. In AISTATS 2016, 2016. [Sanderson, 2010] Conrad Sanderson. Armadillo: An open source c++ linear algebra library for fast prototyping and computationally intensive experiments. 2010. [Shah et al., 2016] Rohin Shah, Emina Torlak, and Rastislav Bodik. SIMPL: A DSL for automatic specialization of inference algorithms. arXiv preprint arXiv:1604.04729, 2016. [Stan Development Team, 2014] Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.5.0, 2014. [Tipping and Bishop, 1999] Michael E. Tipping and Chris M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999. [Tristan et al., 2014] Jean-Baptiste Tristan, Daniel Huang, Joseph Tassarotti, Adam C Pocock, Stephen Green, and Guy L Steele. Augur: Data-parallel probabilistic modeling. In NIPS, pages 2600–2608. 2014. [Wingate et al., 2011] David Wingate, Andreas Stuhlmueller, and Noah D Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In International Conference on Artificial Intelligence and Statistics, pages 770–778, 2011. [Wood et al., 2014] Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. A new approach to probabilistic programming inference. In Proceedings of the 17th International conference on Artificial Intelligence and Statistics, pages 2–46, 2014. [Wu et al., 2014] Yi Wu, Lei Li, and Stuart J. Russell. BFiT: From possible-world semantics to random-evaluation semantics in open universe. In Neural Information Processing Systems, Probabilistic Programming workshop, 2014. [Yang et al., 2014] Lingfeng Yang, Pat Hanrahan, and Noah D. Goodman. Generating efficient MCMC kernels from probabilistic programs. 
In AISTATS, pages 1068–1076, 2014.