Computational Genomics MSCBIO2070/02-710/02-510 (Spring 2016)

Course Project

Updated: 2-24-2016

Your class project is an opportunity for you to explore an interesting genomic analysis problem of your choice in the context of a real-world data set. Projects can be done by you as an individual, or in teams of two students. Your project will be worth 25% of your final class grade, and will have three final deliverables:

  • Project proposal (1 page maximum, reference list should be included in additional pages), due March 28, worth 5% of the project grade.
  • Poster presentation, on April 27, worth 20% of the project grade.
  • Final report (8 pages maximum), due May 2, worth 75% of the project grade.

Project Proposal:

You must turn in a brief project proposal (1 page maximum, reference list should be included in additional pages) by March 28.

You are encouraged to come up with a topic directly related to your own current research project, but the proposed work must be new and should not be copied from your previous published or unpublished work.

You may pick a “less adventurous” project from the following list of potential project ideas. You can apply new methods to datasets that have been used successfully in prior publications and compare your results with those reported in the literature. Of course, you can also choose to work on a new problem beyond our list. If you need additional help picking a project and getting started, feel free to discuss it with the course instructors.

Project proposal format: 

  • Project title
  • Team members (including Andrew IDs)
  • Project idea.  This should be approximately two paragraphs.
  • Software you will need to write.
  • Papers to read.  Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.

Project Report:

You must turn in a project report (8 pages maximum) by May 2.

You may use a journal or conference template to organize your report, but make sure it contains at least the following high-level structure. Following this structure is very important, as it is how a typical scientific paper is organized. Your report will be evaluated on both the quality of the writing (e.g., does it have a good structure, is it clearly written, was it proofread) and the content (effort in computational modeling and results).

  • Project title
  • Team members (including Andrew IDs)
  • Introduction:  State the motivation, the problem you are addressing, a review of the literature, and your approach to solving the problem. Plan on approximately one paragraph for each of these items.
  • Methods: Explain your computational approach. Describe models and then describe the learning methods in two different subsections.
  • Experimental Results: If it is more convenient for you, you can have the Experimental Results section "before" the Methods section. Plan on having subsections on 1) dataset, 2) details on how you applied your method to the dataset, 3) results (this last item on results will be the most substantial portion of Experimental Results section)
  • Conclusions: Summarize your approach and findings.
  • References

Regarding printing the posters:

SCS Computing Facilities has instituted a new procedure for printing posters, intended to make poster printing faster and easier for the SCS community. There is no longer a need to call Operations in order to print a poster. You can now submit posters via email: follow the printing procedures documented on the SCS Help pages, and Operations will print the poster and notify you when it is ready for pickup. Please contact SCS Operations at x8-2608 with any questions or concerns. Also, the poster boards we use are 32" x 40". Non-SCS students will need to contact their departments about resources for printing posters.


Project suggestions:

Ideally, you will want to pick a problem in a domain of your interest, e.g., DNA sequence analysis, genetic polymorphisms, regulatory networks, etc., and formulate your problem in a statistical and/or machine learning framework. For example, you can adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. Last year's projects are listed below:

  • HoriGenT: A novel software to detect Horizontal Gene Transfer
  • Investigating Structure in Microarray Gene Expression Data: Non-negative Matrix Factorization and other methods
  • Comparative genomic analysis of single stranded RNA viruses
  • A Workflow for Identifying Transcription Factor Directly from DNase Protected data
  • Analysis of clustering methods for lung tissue miRNA
  • Uncovering relationships between network topology and co-evolutionary signatures in Protein-Protein Interaction Networks
  • Modeling Precision Treatment of Breast Cancer
  • Comparison of Sepsis Time Series Gene Expression Data Classification
  • Sequence Features of Translation Pause Sites and Slow Translation Regions
  • Identifying Inherent Altruistic Biases in Human Genomic Studies
  • Prediction of Optimum Sampling Points for Time Series Lung Development data
  • Changes in Gene Expression due to Aging and their Relationship with Cancer
  • Positive Selection in the Genomes of Humans and Chimpanzees
  • Identifying Significantly Linked Proteins in HiC using ChIPSeq
  • Improving performance of Random Forest in Clinical Feature Learning
  • The Identification of Complementarity Determining Regions of Antibody Sequences

    You can also find some project ideas below.

  • Project A: Haplotype blocking and genetic demographic inference

    Genetic polymorphisms such as SNPs and microsatellites carry important information about human evolution and disease propensity. One interesting problem in this area is to infer the haplotypes of a long sequence of ambiguous genotypes from the haplotypes of small overlapping regions. In this project we want to build a haplotype assembler that uses a partition-ligation scheme and/or a tiling scheme to stitch together short haplotypes inferred by an off-the-shelf haplotype inference algorithm; then, after determining the haplotypes over a long stretch of markers, find the best block structure using dynamic programming and information-theoretic scoring. The resulting blocks will provide essential markers for mapping disease genes and for inferring the evolutionary history of given populations.


    Niu et al. Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms, Am J Hum Genet. 2006 Jan;78(1):174
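
    The block-finding step described above can be sketched as a small dynamic program. This is only an illustration under stated assumptions: haplotypes are already phased and given as equal-length strings, and the score is a simple two-part description length (a stand-in chosen here for "information-theoretic scoring", not a prescribed method). The function names block_cost and best_blocks are hypothetical.

```python
import math


def block_cost(haplotypes, start, end):
    """Two-part description length of markers [start, end) treated as one block.

    Assumed score (illustrative only): the cost of listing every distinct
    haplotype in the block, plus the bits needed to say which distinct
    haplotype each chromosome carries.
    """
    distinct = {h[start:end] for h in haplotypes}
    dictionary_cost = len(distinct) * (end - start)
    assignment_cost = len(haplotypes) * math.log2(len(distinct))
    return dictionary_cost + assignment_cost


def best_blocks(haplotypes):
    """Partition markers into contiguous blocks with minimum total cost.

    best[j] holds the minimum cost of partitioning markers [0, j);
    back[j] holds the start of the last block in that optimal partition.
    """
    m = len(haplotypes[0])
    best = [0.0] + [math.inf] * m
    back = [0] * (m + 1)
    for j in range(1, m + 1):
        for i in range(j):
            cost = best[i] + block_cost(haplotypes, i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Trace back the block boundaries from the last marker.
    blocks, j = [], m
    while j > 0:
        blocks.append((back[j], j))
        j = back[j]
    return blocks[::-1], best[m]
```

    For the four toy haplotypes 000000, 000111, 111000 and 111111, best_blocks returns the two blocks (0, 3) and (3, 6), because markers within each half are perfectly correlated while the two halves combine freely. The double loop is quadratic in the number of markers, so a maximum block length would be a natural restriction on real data.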

  • Project B: Discovering network motifs and recurring subgraphs from sequences of biological networks

    Network motifs refer to recurring subgraphs and connectivity patterns in one or more networks. They usually represent certain pathway components and bio-regulatory mechanisms, and their occurrence profiles are often unique to different networks and imply intrinsic functionalities of the biological networks. Early research in this area focused on searching for small motifs in a single network. In this project we want to develop algorithms for finding large and possibly overlapping subgraphs that recur across multiple graphs. We will explore algorithms for constructing multiple networks, and graph-theoretic approaches to mine such networks for motifs.


    Hu H, Yan X, Huang Y, Han J, Zhou XJ (2005) Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics (ISMB 2005), Vol. 21 Suppl. 1 2005, pages 213-221.

    Zhou XJ, Kao MJ, Huang H, Wong A, Nunez-Iglesias J, Primig M, Aparicio OM, Finch CE, Morgan TE, Wong WH (2005) Functional annotation and network reconstruction through cross-platform integration of microarray data.  Nature Biotechnology 2005 Feb;23(2):238-43.          

    E. Wong, B. Baur, S. Quader and C. Huang (2011) Biological network motif detection: principles and practice. Briefings in Bioinformatics, doi: 10.1093/bib/bbr033.
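
    One simple starting point for finding subgraphs that recur over multiple graphs is to keep only the edges seen in many networks and then look at what remains connected. The sketch below assumes undirected networks given as lists of node pairs; the names recurrent_edges and connected_components are hypothetical, and this is a naive baseline rather than the method of any of the papers above.

```python
from collections import Counter


def recurrent_edges(networks, min_support):
    """Return the undirected edges present in at least min_support networks."""
    counts = Counter()
    for edges in networks:
        # Normalize edge direction and count each edge once per network.
        counts.update({tuple(sorted(edge)) for edge in edges})
    return {edge for edge, count in counts.items() if count >= min_support}


def connected_components(edges):
    """Connected components of the graph formed by the given edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components
```

    Components of the recurrent-edge graph are only candidates: an edge set can be frequent edge by edge without the whole subgraph ever appearing in one network, so each candidate should be checked against the individual networks.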

  • Project C: Protein function prediction from interaction network using graph theoretic and statistical latent-space modeling approaches

    The local and global connectivity of an element in a network is often indicative of its function. Predictions based on connectivity go beyond traditional approaches that rely only on the physical and sequence properties of a biological element: they combine such properties with the element's interaction context in biological processes, as reflected in the network, and they can often be context-specific. In this project, you will explore algorithms to infer the biological functions of proteins from protein-protein interaction networks and other protein attributes.


    E. Airoldi, D. Blei, E.P. Xing and S. Fienberg, A Latent Mixed Membership Model for Relational Data. Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD-2005).
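
    A useful baseline to compare any network-based function predictor against is a neighbor majority vote ("guilt by association"). The sketch below assumes interactions are given as protein pairs and that each annotated protein has a single function label; the names build_neighbors and predict_function, and the labels in the usage note, are hypothetical. It is not the latent-space model of the paper above.

```python
from collections import Counter


def build_neighbors(interactions):
    """Map each protein to the set of proteins it interacts with."""
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def predict_function(protein, interactions, annotations):
    """Predict a label as the most common label among annotated neighbors.

    Returns None when the protein has no annotated interaction partner.
    """
    partners = build_neighbors(interactions).get(protein, set())
    votes = Counter(annotations[p] for p in partners if p in annotations)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

    For example, if P1 interacts with two proteins labeled "kinase" and one labeled "transport", the prediction for P1 is "kinase". Ties are broken arbitrarily here; a real evaluation would hold out known annotations and measure how often they are recovered.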

  • Project D: Network structure inference using time series data

    • Dynamic Bayesian networks from time series datasets

      Time series expression data measure the expression levels of genes following a specific treatment. For example, following pathogen infection, such data can provide insight into the set of genes that respond to the infection and into the immune response system. Using time series data, we would like to learn a graphical model that represents the set of interactions employed as part of the response. In this project you will explore ways to use time series datasets for determining the structure and parameters of the regulatory network underlying the observed responses.

    • Inferring the Drosophila development network based on microarray expression profile time series

      We use probabilistic graphical models (e.g., Bayesian networks), information theoretic approaches (e.g., mutual information minimization) and graph theoretic methods (e.g., path finding) to infer such networks from microarray expression profile time series.


    Schulz, M. H., Devanny, W. E., Gitter, A., Zhong, S., Ernst, J., & Bar-Joseph, Z. (2012). DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data. BMC systems biology, 6(1), 104.
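
    As one illustration of an information-theoretic edge score for time series, the sketch below computes the mutual information between a candidate regulator at time t and a target at time t + lag. It assumes each profile is binarized at its median, which is a deliberately crude choice, and the names discretize, mutual_information and lagged_mi are hypothetical. It is not the DREM method cited above.

```python
import math
from collections import Counter


def discretize(series):
    """Binarize a profile: 1 if the value is above the median, else 0."""
    ordered = sorted(series)
    n = len(ordered)
    if n % 2:
        median = ordered[n // 2]
    else:
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return [1 if x > median else 0 for x in series]


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def lagged_mi(regulator, target, lag=1):
    """Mutual information between regulator[t] and target[t + lag]."""
    r = discretize(regulator)
    t = discretize(target)
    return mutual_information(r[: len(r) - lag], t[lag:])
```

    A high score for (regulator, target) at a positive lag is only suggestive of a directed edge. With the short series typical of expression experiments the estimate is noisy, so scores are usually compared against a permutation background before an edge is accepted.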

  • Project E: Classification using time series expression data

    It has been shown that the type of cancer, and in some cases the right treatment option, can be determined by looking at the expression profile of a patient. Many well-known classification methods have been suggested for this task, including SVMs, Naïve Bayes, and statistical tests. More recently, measurements that follow patients over time are becoming available. This project will explore ways to develop classifiers that are appropriate for time series data.


    Orsenigo, C., & Vercellis, C. (2010). Time series gene expression data classification via L1-norm temporal SVM. In Pattern Recognition in Bioinformatics (pp. 264-274). Springer Berlin Heidelberg.
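
    One reason time series need their own classifiers is that patients may respond at different speeds, so the same response pattern can be shifted or stretched in time. A common baseline that tolerates this is a nearest-neighbor classifier under dynamic time warping (DTW). The sketch below assumes a single gene (or a single summary value) per time point; the names dtw_distance and classify_1nn and the class labels in the usage note are hypothetical. It is a baseline, not the temporal SVM of the paper above.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    d[i][j] is the minimum cumulative cost of aligning the first i points
    of a with the first j points of b.
    """
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def classify_1nn(query, training):
    """Label of the training series closest to the query under DTW.

    training is a list of (series, label) pairs.
    """
    nearest = min(training, key=lambda item: dtw_distance(query, item[0]))
    return nearest[1]
```

    For instance, with training profiles [0, 1, 2, 3, 2, 1] labeled "responder" and [0, 0, 0, 0, 0, 0] labeled "non-responder", the delayed query [0, 0, 1, 2, 3, 2, 1] is assigned to "responder", because warping absorbs the one-step delay.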

  • Project F: Protein interaction network

    Recent experiments have identified many new protein-protein interactions. While the quality of this data is not great, it does serve as a useful source for integration with other available datasets. In this project you will explore the relationship between the interacting proteins and other types of high-throughput data (such as expression or binding). Specifically, it is interesting to see whether aspects that cannot be inferred from the current interaction data (such as pathways) can be determined by using these complementary data sources.


    Navlakha, S., Gitter, A., & Bar-Joseph, Z. (2012). A network-based approach for predicting missing pathway interactions. PLoS Comput Biol, 8(8), e1002640.

  • Project G: Cancer pathway subtype analysis

    Personalized medicine is already becoming a reality in cancer treatment. Signatures for cancer subtypes have been found in gene expression, epigenetic, and genome sequence data. In this project, you will explore the use of computational tools to identify cancer subtypes from various types of genomic data and to classify tumor data into subtypes.


    The Cancer Genome Atlas Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490: 61-70.

  • Project H: Gene network analysis

    In this project, you will construct gene regulatory networks from genomic data. In particular, Gaussian graphical models have been extremely popular as a computational tool for constructing a gene network from gene expression data. You will explore different variants of Gaussian graphical models to construct a gene network, to identify gene modules, and to interpret the learned network and modules.


    Grechkin et al. (2015) Pathway graphical lasso. AAAI 2015.

    T. Wang et al. (2016) FastGGM: An Efficient Algorithm for the Inference of Gaussian Graphical Model in Biological Networks. PLoS Computational Biology, 12(2): e1004755.

  • Project I: eQTL mapping

    Expression quantitative trait locus (eQTL) mapping detects the genetic loci that control gene expression by examining genetic variant and gene expression data from a large number of individuals. eQTL mapping can be used to study the genetic basis of diseases or the genetic control in different tissue types. This project explores the use of computational methods to determine eQTLs and the biological mechanisms influenced by the eQTLs.


    J. Becker et al. (2012) A systematic eQTL study of cis–trans epistasis in 210 HapMap individuals. Eur J Hum Genet, 20(1): 97–101.

    B. Stranger et al. (2012) Patterns of Cis Regulatory Variation in Diverse Human Populations. PLoS Genetics, 8(4): e1002639.
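
    The simplest eQTL test regresses a gene's expression on the genotype at one SNP. The sketch below assumes an additive coding (0, 1 or 2 copies of the minor allele per individual) and ordinary least squares; the function name eqtl_association is hypothetical, and covariates, normalization and multiple-testing correction are all left out.

```python
import math


def eqtl_association(genotypes, expression):
    """Least-squares fit of expression on allele count for one SNP-gene pair.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    Raises ValueError for a monomorphic SNP or a constant expression
    profile, since neither can be tested.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_ee = sum((e - mean_e) ** 2 for e in expression)
    if s_gg == 0 or s_ee == 0:
        raise ValueError("no variation in genotype or expression")
    s_ge = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    slope = s_ge / s_gg
    intercept = mean_e - slope * mean_g
    r = s_ge / math.sqrt(s_gg * s_ee)
    return slope, intercept, r
```

    The slope estimates the change in expression per additional copy of the minor allele. Because every SNP-gene pair is a separate test, a genome-wide scan needs a multiple-testing correction before any association is called significant.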

  • Project J: Population structure inference

    Genome sequences contain information on ancestry, population evolution, and migration. In this project, you will analyze genome sequence data from a population to study the population structure and identify the ancestry information for each individual.


    B. Engelhardt, M. Stephens (2010) Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis. PLoS Genetics, 6(9): e1001117.

    A. Raj, M. Stephens, and J. Pritchard (2014) fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets. Genetics 197(2): 573-589.

  • Project K: Calculating the relative abundance of different transcript isoforms

    Simulate RNA-seq data from a variety of complex splicing scenarios and investigate the performance of different quantification methods. You can either compare existing implementations or implement some strategies from scratch.


    Read-simulation software can be found at


    Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT & Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads, Nature Biotechnology 2015
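
    If you implement a quantification strategy from scratch, a natural first version is expectation-maximization over reads that are compatible with more than one isoform. The sketch below assumes each read is summarized only by the list of isoform indices it is compatible with, and it ignores transcript length, fragment-size and bias models, all of which real quantifiers account for. The name em_isoform_abundance is hypothetical, and this is not StringTie's algorithm.

```python
def em_isoform_abundance(read_compat, n_isoforms, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    read_compat: one non-empty list of compatible isoform indices per read.
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms in
        # proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[k] for k in compat)
            for k in compat:
                counts[k] += theta[k] / total
        # M-step: abundances are the normalized expected read counts.
        theta = [c / len(read_compat) for c in counts]
    return theta
```

    With 6 reads unique to isoform 0, 2 reads unique to isoform 1 and 4 reads compatible with both, the estimates converge to 0.75 and 0.25: the shared reads end up split in the same 3:1 ratio as the unique ones. To convert read fractions into relative transcript abundance you would additionally divide by each isoform's effective length and renormalize.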

  • Project L: Identifying genomic regions under accelerated evolution

    Use Rphast models and genomic multiple alignment to identify regions that show accelerated evolution in different species or species groups.


    Rphast is an R package found at:


    Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. 2010. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20: 110–121.

    Pollard KS, Salama SR, Lambert N, Lambot M-A, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, et al. 2006b. An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443

    Pertea M, Pertea GM, Salzberg SL. Detection of lineage-specific evolutionary changes among primate species. BMC Bioinformatics. 2011 Jul 4;12:274.