
Greg, Nitin, Scott and Ross met on 23 February.

1. BioC

Greg will ping Herve and Seth to find out where our submissions stand in the BioC development repository review process.

2. Scalability and Performance issues

Ross and Weiliang have been experimenting with very large data sets (500k markers, 2000 subjects) and running into memory problems. Greg pointed out that 32-bit R will max out at 2 GB of addressable objects, of which at least 0.5 GB is likely to be used up already. For modest-sized data sets (50k markers, 1000 subjects), performance is fine for some basic tasks such as HWE, but slow for counting missing genotypes by marker, although the same count by subject is easy.
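A rough back-of-the-envelope check of the memory point above, written in Python purely for illustration. It assumes one callCodes cell per marker-subject pair, 4 bytes per cell for an R integer and 1 byte for an R raw value, and a single copy of the matrix with no object overhead. The helper names (`matrix_bytes`, `fits`) are made up for this sketch.

```python
# Illustrative arithmetic only: one cell per (marker, subject) pair,
# one copy of the matrix, no per-object overhead (all assumptions).
GB = 1024 ** 3


def matrix_bytes(n_markers, n_subjects, bytes_per_cell):
    """Size in bytes of a dense marker-by-subject matrix."""
    return n_markers * n_subjects * bytes_per_cell


def fits(n_bytes, address_space_gb=2.0, already_used_gb=0.5):
    """True if the matrix fits in what is left of the address space."""
    return n_bytes <= (address_space_gb - already_used_gb) * GB


large_int = matrix_bytes(500_000, 2_000, 4)   # integer storage, 4 bytes per cell
large_raw = matrix_bytes(500_000, 2_000, 1)   # raw storage, 1 byte per cell
modest_int = matrix_bytes(50_000, 1_000, 4)   # the modest-sized case

for label, size in [("500k x 2000, integer", large_int),
                    ("500k x 2000, raw", large_raw),
                    ("50k x 1000, integer", modest_int)]:
    print(f"{label}: {size / GB:.2f} GB, fits in 1.5 GB: {fits(size)}")
```

Under these assumptions the large data set cannot be held as an integer matrix in the remaining 1.5 GB, while a one-byte-per-genotype matrix just fits, before any working copies are made.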

There was substantial discussion about the role of a raw matrix for callCodes. Greg suggested a base class with an abstract callCodes slot and a pair of derived classes: one with the existing integer callCodes matrix, for situations where there may be multiple alleles such as STR/microsatellite markers for linkage, and the other with the callCodes matrix in raw mode, so one byte per genotype. The constructor might either be given a hint as to which one to use, or might work out how many allele-pair combinations need to be represented and decide for itself.

Scott mentioned Plink (http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml) and the possibility of writing a wrapper. Ross indicated that Plink is targeted at large-scale whole-genome biallelic SNP projects and comes from smart folks at MGH and the Broad Institute. Its outputs are generally fairly verbose (a 500k marker HWE report is more than 1 GB of text, for example) and need massaging for presentation, which R is likely to be very good at. Greg started looking at the source and reported that a bool is used for each allele (major/minor), and that missing values are stored in a vector of offsets.
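The constructor logic discussed for callCodes (take a hint, or count the allele-pair combinations and decide) could look roughly like the sketch below. It is written in Python for illustration only and is not the package's design: the function names are hypothetical, the standard-library `array` type codes `"B"` (one unsigned byte) and `"i"` (signed int) stand in for R's raw and integer modes, and it assumes unordered allele pairs plus one code reserved for a missing call.

```python
from array import array

RAW_CAPACITY = 256  # distinct values that fit in one byte


def n_genotype_codes(n_alleles):
    """Unordered allele pairs, plus one code reserved for a missing call."""
    return n_alleles * (n_alleles + 1) // 2 + 1


def choose_typecode(n_alleles, hint=None):
    """Pick a storage type: 'B' (one byte, raw-like) or 'i' (integer-like).

    `hint` may be "raw" or "integer"; with no hint, the choice is made
    from the number of codes that need to be represented.
    """
    fits_in_byte = n_genotype_codes(n_alleles) <= RAW_CAPACITY
    if hint == "integer":
        return "i"
    if hint == "raw":
        if not fits_in_byte:
            raise ValueError("too many allele-pair combinations for one byte")
        return "B"
    if hint is not None:
        raise ValueError("hint must be 'raw', 'integer' or None")
    return "B" if fits_in_byte else "i"


def make_call_codes(calls, n_alleles, hint=None):
    """Store a flat sequence of genotype codes in the chosen representation."""
    return array(choose_typecode(n_alleles, hint), calls)


# Biallelic SNP: 3 genotypes plus missing, so one byte per genotype is enough.
snp_codes = make_call_codes([0, 1, 2, 3], n_alleles=2)

# A microsatellite with 30 alleles needs 466 codes and falls back to integers.
str_typecode = choose_typecode(30)
```

With this counting rule, markers with up to 22 alleles still fit in one byte, so only highly polymorphic markers would need the integer representation.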

Todo: Take another look at the BioC eSet class to see what has changed there. Evaluate Plink's internal representations with a view to writing a wrapper, so we can Borg their functionality and add value with nice presentation and graphics utilities. Greg to draft and circulate a proposal.

3. Helping Hand: Scott has a potentially willing R helper. Suggestions for specific projects were discussed. Ross described the need for a good "out of the box" experience, with a QC report for a new dataset being one of the first tasks any new user is likely to try. Scott will discuss this with his mathematician colleague.

4. Next Telecon: March 9, 1:30 pm Eastern


 
