Present
Jason, Nitin, Ross, Scott
Action Items
Nitin will rearrange CVS to conform to R package layout standards
Ross will create small and large ped files of public test data
ALL will think carefully about representing missing values in the genetics.ng internal representation and related vexing complications
R Package conformity rearrangements
Nitin discussed rearranging the CVS files to conform to the standard R package structure so that the package can be more easily imported into an existing R installation. There was some discussion about names. Since r-genetics contains an illegal character (the hyphen), it can't be used as a package name; all agreed that genetics.ng was good enough to go ahead with.
Sample data
Ross discussed sources and formats of available sample data. http://innateimmunity.net and http://pga.gs.washington.edu/finished_genes.html have between them about 200 genes resequenced in 23 EA and 24 AA Coriell cell lines. The data are in "prettybase" format: each record is locus id a1 a2, where locus is a numeric offset, id is a subject id string, and a1 and a2 are letter alleles.
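A minimal sketch of reading records in the "prettybase" layout described above, written in Python purely for illustration (it is not part of genetics.ng). It assumes the four fields are whitespace delimited and that blank lines can be skipped; the names PrettybaseRecord and parse_prettybase are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class PrettybaseRecord(NamedTuple):
    locus: int        # numeric offset of the site
    subject_id: str   # subject id string
    allele1: str      # first letter allele
    allele2: str      # second letter allele


def parse_prettybase(lines: Iterable[str]) -> Iterator[PrettybaseRecord]:
    """Yield one record per non-blank line of the form: locus id a1 a2."""
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: whitespace-delimited fields
        if not fields:
            continue  # assumption: blank lines carry no data
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 fields, got {len(fields)}"
            )
        locus, subject_id, allele1, allele2 = fields
        yield PrettybaseRecord(int(locus), subject_id, allele1, allele2)
```

Raising on a malformed line, rather than skipping it, is a design choice for test data: a silent skip would hide a broken conversion.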
http://hapmap.org has about a million SNPs in 30 CEPH trios; chr22 has about 800 SNPs and chr1 has nearly 20k SNPs. The data are in a format which Ross will translate into a pedigree file. Pedigree files are (fairly) widely used for family-based human data. Originally promulgated by Jurg Ott's group (http://linkage.rockefeller.edu/ott/linkutil.htm), they have the marker names as the first row (all fields are whitespace delimited), followed by one line per subject with 6 fixed fields (family_id individual_id father_id mother_id gender affection_status) and then the 2 alleles for each marker. Ross will add non-standard comment lines, with # as the first character, noting where the data came from, so the import routine should ignore those lines.
Ross agreed to create some small and large data files for testing in ped format, so that only one import routine (of the soon-to-be many!) needs to be written for now.
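The pedigree layout described above could be read along the following lines. This is a sketch in Python for illustration only, not the planned R import routine: PedSubject and read_ped are hypothetical names, every field is kept as a string because the notes do not fix the codings, and skipping blank lines is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

FIXED_FIELDS = 6  # family_id individual_id father_id mother_id gender affection_status


@dataclass
class PedSubject:
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    gender: str
    affection_status: str
    genotypes: Dict[str, Tuple[str, str]]  # marker name -> (allele 1, allele 2)


def read_ped(lines: Iterable[str]) -> Tuple[List[str], List[PedSubject]]:
    """Parse marker names (first data row) and one subject per following row."""
    markers: Optional[List[str]] = None
    subjects: List[PedSubject] = []
    for raw in lines:
        if raw.startswith("#"):
            continue  # non-standard provenance comment line
        fields = raw.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if markers is None:
            markers = fields  # first non-comment row holds the marker names
            continue
        expected = FIXED_FIELDS + 2 * len(markers)
        if len(fields) != expected:
            raise ValueError(f"expected {expected} fields, got {len(fields)}")
        alleles = fields[FIXED_FIELDS:]
        genotypes = {
            marker: (alleles[2 * i], alleles[2 * i + 1])
            for i, marker in enumerate(markers)
        }
        subjects.append(PedSubject(*fields[:FIXED_FIELDS], genotypes=genotypes))
    return markers or [], subjects
```

Checking the field count against the header row catches a truncated subject line early, which matters more for the large test files than for the small ones.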
Missing values
Long discussion about missing values in the current genetics.ng representation. In an ideal world, the analysis routines might only have to know about the internal representation, although for result presentation, annotation requires that the analysis code be able to deconstruct the internal representation back to whatever allele codes the user wanted to use. Scott pointed out that for many of the older types of marker, one allele might be unknown (e.g. a dominant allele where the other allele cannot be determined), so a genotype with one missing allele is still analysable. Ross (selfishly) argued that it might be best to try to cover the 90% of cases requiring 10% of the work: in human SNP association data, a missing allele really means that the marker is missing for that subject, since most high-throughput SNP genotyping methods give 2 good calls or else the call is pretty much uninterpretable. If you can't reliably distinguish two homozygous "clouds" and a heterozygous cloud from TaqMan/Sequenom/Illumina genotyping, you tend to drop the whole marker from all analysis. All agreed to think about this more for the next meeting.
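The two positions in this discussion can be contrasted in a few lines. The sketch below is in Python for illustration only and is not a decision the group took: None stands in for a missing allele, and collapse_genotype and its keep_partial flag are hypothetical names.

```python
from typing import Optional, Tuple

Allele = Optional[str]  # None stands in for a missing allele


def collapse_genotype(
    a1: Allele, a2: Allele, keep_partial: bool = False
) -> Optional[Tuple[Allele, Allele]]:
    """Return the genotype, or None when it is treated as entirely missing.

    keep_partial=False: SNP-style policy, any missing allele makes the
                        whole genotype missing.
    keep_partial=True:  a genotype with one called allele is kept, as
                        with a dominant marker.
    """
    if a1 is None and a2 is None:
        return None  # nothing was called
    if a1 is None or a2 is None:
        return (a1, a2) if keep_partial else None
    return (a1, a2)
```

For example, collapse_genotype("A", None) gives None under the SNP-style policy, while collapse_genotype("A", None, keep_partial=True) keeps the half-called genotype.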
Missing Values -- Sat, 12 Feb 2005 16:06:53 -0500
One comment about missing data. One important reason for handling 1 called and 1 missing allele is the X and Y chromosomes for males. To be pedantic, males have 1 copy of X and 1 copy of Y, so biallelic formatting should render these as marker/NA since they aren't exactly homozygotes... Sometimes one will model them as homozygotes, sometimes not.
In my current code, I allow for NA (completely missing), as well as whatever/NA and NA/whatever. In all of the code I've worked on, the latter two get mapped to the first, but there are cases where the mapping ought to be different.
This wasn't hard to handle using my internal format, and should be straightforward using our new format. Total NAs are simply coded as NA in the data matrix. NA alleles are simply marked as NA in the translation table.
-Greg
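One way to picture the scheme in the comment above: genotype codes index into a per-marker translation table, a wholly missing genotype is a missing code, and a half-called genotype is an ordinary code whose table entry has one missing allele. The sketch is in Python for illustration only; the table contents, the integer codes and the decode helper are made up and do not reflect the real internal format.

```python
from typing import Dict, List, Optional, Tuple

Genotype = Tuple[Optional[str], Optional[str]]  # None plays the role of NA

# Hypothetical translation table for one marker: call code -> allele pair.
trans_table: Dict[int, Genotype] = {
    1: ("A", "A"),
    2: ("A", "G"),
    3: ("G", "G"),
    4: ("A", None),  # one called and one missing allele, e.g. a male X marker
}


def decode(
    call_codes: List[Optional[int]], table: Dict[int, Genotype]
) -> List[Optional[Genotype]]:
    """Map call codes back to allele pairs; a None code stays wholly missing."""
    return [None if code is None else table[code] for code in call_codes]
```

With this layout the analysis code only needs to test for a missing code, and only the presentation code needs to look inside the table.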
Re: Missing values -- Mon, 14 Feb 2005 21:33:32 -0500
Missing values -- chasalos, Mon, 14 Feb 2005 22:23:10 -0500
I expanded on some of my thoughts about missing values in a (new) IDEAS file in CVS. Here's an excerpt:
A. Can a completely unknown genotype be encoded by a missing value (NA) in the callCodes slot? Scott votes NO. I don't see much benefit. Costs are lack of consistency, lack of explicitness, and added programmer headaches.
I propose we add a "missingCodes" slot to the data structure for this purpose. This is a list, similar to the transTables slot. The names of the list MUST match the names of the transTables list exactly. Each component is a character vector containing allele strings to be interpreted as missing for a particular locus type.
I realize this adds complexity. But I think the flexibility gained by not forcing "NA" and only "NA" to signify a missing allele will turn out to be worth it in the long run.
Please see the IDEAS file for more detail than you would like.
-Scott
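The "missingCodes" proposal in the excerpt above could look roughly like this. It is a Python sketch for illustration only, not the contents of the IDEAS file: dictionaries stand in for R named lists, the locus type names and allele strings are invented, and name matching is checked as set equality, so ordering is not compared.

```python
from typing import Dict, List


def check_missing_codes(
    trans_tables: Dict[str, object], missing_codes: Dict[str, List[str]]
) -> None:
    """Raise ValueError unless both structures use the same locus type names."""
    if set(trans_tables) != set(missing_codes):
        only_trans = sorted(set(trans_tables) - set(missing_codes))
        only_missing = sorted(set(missing_codes) - set(trans_tables))
        raise ValueError(
            f"names differ: only in transTables {only_trans}, "
            f"only in missingCodes {only_missing}"
        )


def is_missing_allele(
    allele: str, locus_type: str, missing_codes: Dict[str, List[str]]
) -> bool:
    """True when the allele string is declared missing for this locus type."""
    return allele in missing_codes[locus_type]


# Invented example: each locus type declares its own missing-allele strings.
example_trans_tables = {"snp": {}, "microsatellite": {}}
example_missing_codes = {"snp": ["0", "N"], "microsatellite": ["?"]}
```

Here "0" is missing for the hypothetical "snp" type but would be an ordinary allele string for "microsatellite", which is the flexibility the proposal is after.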