I live in St. Louis, Missouri, where I am focused on industrial genomics (the organization of large-scale datasets for biotechnology), distributed systems, web3, RSS, paleontology, art, and pattern languages.
I co-wrote a blog post with Google Kubernetes engineers about some of my work on industrial genomics: “Bayer Crop Science seeds the future with 15000-node GKE clusters”.
I also gave a talk at the Plant and Animal Genome XXVIII Conference titled “Industrializing Genotype Data on Public Cloud Infrastructure”, in which I discuss my work organizing genomic and genetic data for Bayer Crop Science, with a massive-scale genotype imputation project as the use case.
My team of data engineers implemented an imputation engine that takes all varieties of genotype data, along with reference-style imputation from Beagle and a rich graph of pedigree data, and combines them with a novel pedigree-based imputation algorithm to render the best possible view of all germplasm given all known genotype observations, in an always-on manner.
I have used combinations of the following to implement industrial genomics solutions:
- teams of software engineers / data engineers
- Google Cloud Platform
- Go and Python
- Spanner, HBase, Neo4j
- genomics standards: BAM, VCF, GFF, BED, etc.
To make better use of personal genomics data, I created a utility called 2vcf for converting one’s 23andMe or Ancestry.com raw data into VCF files. I worked with the African Kinship Reunion to convert personal genomics genotype calls into VCF format with 2vcf, for the purpose of imputation and IBD calling.
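To give a rough sense of what this kind of conversion involves, here is a minimal Python sketch. It is not the 2vcf code and does not reflect how 2vcf is implemented. It assumes the tab-separated raw-data layout (rsid, chromosome, position, two-letter genotype), and it uses a made-up `REF_ALLELES` lookup in place of a real reference genome; the function name `raw_line_to_vcf` is likewise hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical reference-allele lookup keyed by (chromosome, position).
# A real converter would take these from a reference genome; the values
# below are invented purely for illustration.
REF_ALLELES: Dict[Tuple[str, int], str] = {("1", 1000): "A", ("1", 2000): "G"}

# Minimal VCF header declaring a single genotype (GT) field per sample.
VCF_HEADER = "\n".join([
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
])


def raw_line_to_vcf(line: str, ref_alleles: Dict[Tuple[str, int], str]) -> Optional[str]:
    """Turn one raw-data line into a VCF record, or None if it is skipped."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment header
    fields = line.split("\t")
    if len(fields) != 4 or not fields[2].isdigit():
        return None  # not in the assumed four-column layout
    rsid, chrom, pos, genotype = fields
    ref = ref_alleles.get((chrom, int(pos)))
    # Skip unknown sites, no-calls ("--"), indel codes, and haploid calls.
    if ref is None or len(genotype) != 2 or any(b not in "ACGT" for b in genotype):
        return None
    # Collect non-reference bases, in order of first appearance, as ALT alleles.
    alts = []
    for base in genotype:
        if base != ref and base not in alts:
            alts.append(base)
    alleles = [ref] + alts
    # Encode the genotype as sorted allele indices (0 = REF, 1.. = ALT).
    gt = "/".join(str(i) for i in sorted(alleles.index(b) for b in genotype))
    alt = ",".join(alts) if alts else "."
    return "\t".join([chrom, pos, rsid, ref, alt, ".", ".", ".", "GT", gt])


if __name__ == "__main__":
    print(VCF_HEADER)
    for raw in ["# rsid\tchromosome\tposition\tgenotype", "rs1\t1\t1000\tAG", "rs2\t1\t2000\tGG"]:
        record = raw_line_to_vcf(raw, REF_ALLELES)
        if record is not None:
            print(record)
```

The sketch simply drops anything it cannot place against the reference (no-calls, indel codes, single-letter calls on sex chromosomes); a real tool has to decide how to handle each of those cases.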