Phuong Dao

My primary research interest is designing efficient algorithms and machine learning methods to process and interpret biological networks and large-scale high-throughput genomics data. Over the years, I have worked mainly in the following three areas:

Computational Methods for Analysis of scRNA Sequencing Data

At Memorial Sloan Kettering Cancer Center, I have been responsible for implementing an in-house pipeline for processing and analyzing 10X scRNA sequencing data, from fastq files to gene-count matrices, without relying on 10X Cell Ranger. The pipeline also clusters phenotypically similar cells and discovers differentially expressed genes specific to each cluster.
Moreover, I have designed and implemented novel algorithms and machine learning approaches for identifying copy number, discovering allelic expression, and identifying and removing artifacts in scRNA sequencing data.

Computational Methods for Analysis of SELEX and HT-SELEX Data

Aptamers, short synthetic RNA/DNA molecules that bind specific targets with high affinity and specificity, are used in a growing range of biomedical applications. Aptamers are identified in vitro via the Systematic Evolution of Ligands by Exponential Enrichment (SELEX) protocol. Starting from a pool of random ssDNA/RNA sequences, SELEX selects binders iteratively, amplifying target-affine species through a series of selection cycles. HT-SELEX, which combines SELEX with high-throughput sequencing, is capable of generating nearly one billion aptamer sequences. Given the massive amount of data generated by HT-SELEX, available computational methods for visualizing high-throughput sequencing data and identifying binding motifs neither possess the required scalability nor take advantage of important properties of the experimental procedure.
We introduced AptaGUI, an open-source, platform-independent graphical user interface (GUI) for visualizing HT-SELEX data. AptaGUI contains many computational tools for HT-SELEX analysis, including data pre-processing and tracking the changes of individual aptamers and entire aptamer families (groups of aptamers sharing highly similar nucleotide sequences) throughout selection cycles. We recently developed AptaTRACE, a novel approach for identifying sequence-structure binding motifs in the massive amounts of sequence data produced by HT-SELEX experiments. Our approach leverages the experimental design of the SELEX protocol and identifies sequence-structure motifs that show a signature of selection towards a preferred structure. In the initial pool, the secondary structure context of each k-mer, i.e. its tendency to be paired or to reside in a hairpin, bulge loop, inner loop, multiple loop, or dangling end, follows a background distribution. For sequence motifs involved in binding, this distribution shifts in later selection cycles towards the structural context favored by the binding interaction with the target site. Using a relative-entropy-based scoring function, AptaTRACE identifies the motifs that converge to a specific structural context throughout the selection cycles of HT-SELEX experiments.
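The idea of scoring a k-mer by how far its structural context distribution drifts from the background can be illustrated with a small sketch. This is not AptaTRACE's actual scoring function, whose exact form is not given here. It assumes, purely for illustration, that each selection cycle provides a six-entry context profile for a k-mer, and that the score is the sum of the relative entropies of the later cycles with respect to the initial pool. The names `CONTEXTS`, `kl_divergence`, and `context_shift_score` are hypothetical.

```python
import math

# Structural contexts named in the text; the ordering is an arbitrary assumption.
CONTEXTS = ("hairpin", "bulge", "inner", "multi", "dangling", "paired")


def kl_divergence(p, q, eps=1e-9):
    """Relative entropy D(p || q) in bits; eps is a pseudocount that avoids log(0)."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    total_p, total_q = sum(p), sum(q)
    return sum(
        (a / total_p) * math.log2((a / total_p) / (b / total_q))
        for a, b in zip(p, q)
    )


def context_shift_score(profiles):
    """Sum D(cycle || initial pool) over all cycles after the first (assumed score)."""
    background = profiles[0]
    return sum(kl_divergence(profile, background) for profile in profiles[1:])


# Toy profiles for one k-mer: one row per cycle, one column per entry of CONTEXTS.
uniform = [1 / 6] * 6
stable = [uniform, uniform, uniform]
shifting = [
    uniform,
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.9, 0.02, 0.02, 0.02, 0.02, 0.02],
]
```

Under these assumptions, a k-mer whose profile stays at the background scores zero, while one that converges to the hairpin context over the cycles scores higher, which is the kind of selection signature described above.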

Systems Biology Approaches to Study Complex Diseases

Recent high-throughput genomic technologies provide a comprehensive view of the molecular changes in cancer tissues. These technologies allow for the simultaneous genome-wide assay of genomic variation, gene expression, DNA methylation, and microRNA expression in tumor samples and cancer cell lines. Each omic technology can only explain the mechanisms of complex diseases at the level of the specific molecules it captures. However, complex diseases result from interactions among various types of molecules, such as DNA, transcripts, and metabolites, and the environment. Hence, integrating different types of omic data while identifying the complex interactions among these molecules can provide more comprehensive information on complex diseases. We introduced subnetwork biclusters, which are combinations of gene and sample clusters. In each bicluster, the participating genes form a dense, connected subgraph in a protein-protein interaction network, and all of the genes are differentially expressed in a fraction of the case samples. We used subnetwork biclusters as biomarkers to classify colon and breast cancer tissue samples and significantly reduced the classification error relative to methods based on gene expression data only. Since the currently available methods for discovering subnetwork biomarkers are either heuristics or exhaustive enumerations like the one we introduced, we later designed OPTDIS, a polynomial-time randomized algorithm that extracts the best discriminative subnetwork biomarkers. The discriminative score is calculated as the difference between the total distance between samples from different classes and the total distance between samples from the same class.
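The discriminative score described above can be sketched for a fixed candidate subnetwork as follows. This covers only the scoring step, not the OPTDIS search itself. The distance between two samples is assumed here to be the L1 distance over the expression values of the subnetwork's genes, which the text does not specify. The function name `discriminative_score` and the toy data are hypothetical.

```python
from itertools import combinations


def discriminative_score(expr, labels, genes):
    """Between-class total distance minus within-class total distance.

    expr:   dict mapping sample -> dict mapping gene -> expression value
    labels: dict mapping sample -> class label
    genes:  genes of the candidate subnetwork
    The per-pair distance is an assumed L1 distance restricted to `genes`.
    """
    between = 0.0
    within = 0.0
    for s, t in combinations(sorted(expr), 2):
        distance = sum(abs(expr[s][g] - expr[t][g]) for g in genes)
        if labels[s] != labels[t]:
            between += distance
        else:
            within += distance
    return between - within


# Toy data: gene A separates cases from controls, gene B does not.
expr = {
    "case1": {"A": 1.0, "B": 0.0},
    "case2": {"A": 1.0, "B": 1.0},
    "ctrl1": {"A": 0.0, "B": 0.0},
    "ctrl2": {"A": 0.0, "B": 1.0},
}
labels = {"case1": "case", "case2": "case", "ctrl1": "control", "ctrl2": "control"}

score_a = discriminative_score(expr, labels, ["A"])  # 4.0
score_b = discriminative_score(expr, labels, ["B"])  # 0.0
```

In this toy example the gene that separates the classes receives the higher score, so a search over connected subnetworks would favor subnetworks containing it.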