Using Cactus Graphs to Build a Pangenome
Susan Tu

Using Cactus Graphs for Multispecies Genomes

Motivation
- Build a reference genome for multiple, related species
- Visualize variation between species
- Display data for several species relative to this reference genome

Prior work
- Minimize the partial order's weighted symmetric difference
- Minimize Kemeny tau distance (number of out-of-order pairs)
- But this doesn't sufficiently penalize less likely order switches (i.e., of sequences that are far apart)
- What's better about the new approach: Nguyen et al.'s model "explicitly models double stranded nature of DNA"

Problem Formulation
- Let S be the input DNA sequences
- Define the equivalence relation ~ on S x S
  - Review: reflexive, symmetric, transitive
- We enforce strand consistency and strand exclusivity
  - Consistency: x ~ y => -y ~ -x
  - Exclusivity: x ~ y => neither x ~ -y nor -x ~ y
- We call each equivalence class in S/~ a side; the forward and reverse complement sides together constitute a block

Sequence Graphs
- G = (V, E) is a bi-directed sequence graph
- Bi-directed means that each edge is given an orientation at each endpoint
- Each edge is a pair of sides
- A thread path is a sequence of sides such that each consecutive pair of sides is connected by an edge going in that direction
- Transitive sequence graph: add edges between sides that are connected by a thread path

Constructing the Cactus Graph
- Merge nodes that are connected only by adjacency or backdoor adjacency edges (the only other kind of edge is a block edge)
- Merge each 3-edge-connected component into one node
- Merge all leaf nodes and branching nodes of bridge trees into a single node (call the node that contains the backdoor group component the origin node)

Constructing a Pangenome Reference
- A pangenome reference F is a set of non-empty threads such that each block is visited once
- Find the F with the best score, where the score is the sum of the weights of the edges that are consistent with F
- NP-hard problem

Heuristic for Pangenome Reference Problem
- The cactus graph represents the sequence graph in hierarchical form
- Create a pangenome reference independently for each net of the cactus graph (this can be done in parallel)
- Solve each subproblem using a greedy algorithm and simulated annealing
- Greedy: add an element of V, picking the insertion point and member of V that maximize consistency with the elements already in F

Multi-level Cactus Graphs
- A set of cactus graphs connected in the shape of a tree
- Represent progressively more detailed levels of alignment

Maximum weight Cactus subgraph with large chains problem
- This is for constructing the initial cactus graph (we will make it multilevel later)
- Find a cactus graph such that all chains are of length >= alpha and the weight is maximal
- Length of a chain: number of block edges it contains
- Weight of a cactus graph: sum of the weights of its block edges

Minimizing entropy of multilevel cactus graph

Results

References
- Benedict Paten, Mark Diekhans, Dent Earl, John St. John, Jian Ma, Bernard Suh, and David Haussler. Journal of Computational Biology. March 2011, 18(3): 469-481. doi:10.1089/cmb.2010.0252.
- Benedict Paten, Dent Earl, Ngan Nguyen, Mark Diekhans, Daniel Zerbino, and David Haussler. Cactus: Algorithms for genome multiple sequence alignment. Genome Research. 2011, 21(9): 1512-1528. doi:10.1101/gr.123356.111.
- Ngan Nguyen, Glenn Hickey, Daniel R. Zerbino, Brian Raney, Dent Earl, Joel Armstrong, W. James Kent, David Haussler, and Benedict Paten. Journal of Computational Biology. May 2015, 22(5): 387-401. doi:10.1089/cmb.2014.0146.
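
Sketch: greedy step of the pangenome reference heuristic

The greedy step described under "Heuristic for Pangenome Reference Problem" can be illustrated with a much simplified sketch. It is not the implementation from Nguyen et al. It assumes a single thread, ignores block orientation, and treats a weighted pair (a, b) as "consistent with F" whenever a comes before b in F. The names `score`, `greedy_reference`, `blocks`, and `weights` are hypothetical and chosen only for this example.

```python
def score(order, weights):
    """Sum the weights of pairs (a, b) for which a precedes b in `order`.

    Assumption: this stands in for "sum of the weights of edges that are
    consistent with F"; strand orientation is ignored here.
    """
    pos = {block: i for i, block in enumerate(order)}
    return sum(
        w
        for (a, b), w in weights.items()
        if a in pos and b in pos and pos[a] < pos[b]
    )


def greedy_reference(blocks, weights):
    """Build one ordering by repeatedly making the best single insertion.

    At every step, try each remaining block at each insertion point and
    keep the (block, position) pair that gives the highest score.
    Ties are broken in favor of the first candidate tried.
    """
    order = []
    remaining = list(blocks)
    while remaining:
        best = None  # (score, block, insertion index)
        for block in remaining:
            for i in range(len(order) + 1):
                candidate = order[:i] + [block] + order[i:]
                s = score(candidate, weights)
                if best is None or s > best[0]:
                    best = (s, block, i)
        _, block, i = best
        order.insert(i, block)
        remaining.remove(block)
    return order


# Toy input: three blocks and weighted ordered pairs (hypothetical data).
blocks = ["A", "B", "C"]
weights = {("A", "B"): 3, ("B", "C"): 2, ("C", "A"): 1}
reference = greedy_reference(blocks, weights)
print(reference, score(reference, weights))
```

In the slides, the greedy result is then refined with simulated annealing, and each net of the cactus graph is solved independently. Neither of those steps is shown in this sketch.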