Minutes for 2007-11-09
In Attendance: Greg, Nicholas, Nitin, Ross, Scott
Note taker for today: Ross
Nitin's summary of features
Discussion about GeneticsPed problem
GeneticsPed didn't make it into the BioC 2.1 release last month because of an example error that probably appeared after the changeover to R 2.5 as the development environment.
Nitin's list of undocumented features
No need to document unexported functions.
Links on rgenetics.org front page for developers
Need a third link to the BioC development SVN for things ready to go into the next BioC release. SF svn remains in use for unreleased packages in development.
More efficient representation:
BOOST library looks promising.
Efficiency:
Nicholas points out that profiling shows GeneticsBase accesses columns by name rather than by index, which causes a bad slowdown. Fixes are being worked on. Repeated column and row access needs hashing or some other sophisticated matching: read the column names once, compute their indexes at the start, and use those. The problem will remain even with more efficient underlying data structures, so precompute the accessors and put them into a hash table.
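The "read names once, then use indexes" idea can be sketched as follows. This is a language-neutral illustration in Python, not GeneticsBase code: the marker names, the data rows, and the helpers `build_index` and `column_values` are all hypothetical.

```python
from typing import Dict, List, Sequence


def build_index(names: Sequence[str]) -> Dict[str, int]:
    """Map each column name to its position in a single pass."""
    index: Dict[str, int] = {}
    for pos, name in enumerate(names):
        if name in index:
            # Assumption: names are unique, otherwise the lookup is ambiguous.
            raise ValueError(f"duplicate column name: {name!r}")
        index[name] = pos
    return index


# Hypothetical marker columns and genotype rows, for illustration only.
columns = ["rs001", "rs002", "rs003"]
rows = [
    ["A/A", "A/G", "G/G"],
    ["A/G", "G/G", "A/A"],
]

# Precompute the name-to-position table once, up front.
col_index = build_index(columns)


def column_values(name: str) -> List[str]:
    """Fetch a column through the precomputed hash table."""
    pos = col_index[name]  # hash lookup, no scan over the names
    return [row[pos] for row in rows]
```

The one-off cost of building the table is linear in the number of names. After that, each access avoids rescanning the name vector, which is where the profiled slowdown appears to come from.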
Scott reminded us that banging on is what we need to get things up to speed!
Considerations for efficient GeneSet
Discussion about storing a hashed cache of GeneSet row and column names so that lookups can be quicker. Names are not currently cached; this needs to be done as part of the new representation. The matrix dim names should be private so that they stay synchronized with the accessors. There is a worry that developers will cheat to get directly at the values, but we can't stop that from happening.
Greg pointed out that the allele summary functions were designed for small candidate-gene data sets, so they are slow because they look up names repeatedly. This could be changed easily but won't fix the general problem. Discussion about automagically building a hash when none exists and using it for lookups; any accessor that changes a hashed index would need to update the hash.
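A minimal sketch of the "hash if no hash, and keep it in sync" idea discussed above. It is written in Python for illustration and is not the GeneSet implementation. The class `NamedAxis` and its methods are hypothetical, and it assumes names along an axis are unique.

```python
class NamedAxis:
    """Names along one axis, with a lazily built name-to-position hash."""

    def __init__(self, names):
        self._names = list(names)  # kept private so the hash cannot go stale
        self._hash = None          # not built until the first lookup

    def _ensure_hash(self):
        # Build the hash automatically if none exists yet.
        if self._hash is None:
            self._hash = {name: pos for pos, name in enumerate(self._names)}
        return self._hash

    def position(self, name):
        """Return the position of a name via the hash."""
        return self._ensure_hash()[name]

    def rename(self, old, new):
        """Change a name and update the hash in the same step."""
        pos = self.position(old)
        if new != old and new in self._hash:
            raise ValueError(f"name already in use: {new!r}")
        self._names[pos] = new
        del self._hash[old]
        self._hash[new] = pos

    @property
    def names(self):
        # Read-only snapshot; callers cannot edit the names behind the hash.
        return tuple(self._names)
```

Routing every change through `rename` is what keeps the names and the hash synchronized. As noted in the discussion, nothing prevents a developer from reaching into the private fields directly.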
For subsetting, Greg proposed a mapping object that acts as an index set over an existing object, to avoid R's innate copying behaviour when creating subsets. Apparently this is not always needed for S4 classes. Scott proposed lists of vectors to subset rows and columns as a "cheap" way to create subsets.
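One way to picture the mapping-object proposal is a view that stores only row and column index vectors over shared data. The Python sketch below is an illustration under that assumption. `MatrixView` and its methods are hypothetical, and R's copy semantics differ from Python's, so this shows the shape of the idea rather than a drop-in design.

```python
class MatrixView:
    """A subset expressed as row and column index lists over shared data."""

    def __init__(self, data, rows=None, cols=None):
        self._data = data  # reference to the parent data, never copied here
        ncol = len(data[0]) if data else 0
        self._rows = list(range(len(data))) if rows is None else list(rows)
        self._cols = list(range(ncol)) if cols is None else list(cols)

    def subset(self, rows=None, cols=None):
        """Return a new view; indexes are relative to this view."""
        new_rows = self._rows if rows is None else [self._rows[i] for i in rows]
        new_cols = self._cols if cols is None else [self._cols[j] for j in cols]
        return MatrixView(self._data, new_rows, new_cols)

    def get(self, i, j):
        """Read one value through the index mapping."""
        return self._data[self._rows[i]][self._cols[j]]

    def materialize(self):
        """Copy the selected values out only when a real matrix is needed."""
        return [[self._data[r][c] for c in self._cols] for r in self._rows]
```

Creating a subset costs only the size of the index vectors. The underlying values are copied only when `materialize` is called.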