Playing with Archeological DNA Samples: Part 1 — Getting the data
If you’ve read any of my previous posts, you know I’m an avid user of MyTrueAncestry — a DNA data analysis service that accepts uploads of raw data from popular testing companies like 23andMe or MyHeritage. What’s different about them is that instead of (just) running a PCA on your DNA data against “known” populations, they pull in actual historical DNA samples retrieved from various archeological sites. The more you pay, the more samples you have access to, and the more complete your ancestral portrait.
I noticed right away that the matches you get on MTA are labeled by their “official” research IDs, such as VK159 for an old Rus (Viking) from Pskov, Russia. As a former scientist, I got curious about the underlying data and went looking for the public databases that house it. First I found a map that lists nearly ALL archeological DNA samples processed for various research studies so far. It’s an amazing tool:
When you click on a dot on that map, you get a brief description + links to the research paper that it appeared in, along with mtDNA and Y-DNA haplogroups (if any) that were identified for this sample. When you click through to the Nature paper in the example above, you get this
And if you click through the Data Availability link, you end up here
Which links to this sample in the European Nucleotide Archive. In this case, this is part of a large study of Viking genomics, which includes hundreds of individuals!
The only problem is… what the %^&* is a fastq or bam file??
Well, turns out they are only two of the dozens of formats used by genetics researchers… And there’s a whole slew of conversion tools out there. The trick is to make them work :) What I wanted was a 23andMe compatible file in the following format:
After beating my head against the wall for a few days, I finally found the right combination of tools to make this work.
First, you need samtools and bcftools, which luckily are available via Homebrew on a Mac. You also need extract23 from github, which provides the right sequence of commands to unpack a BAM binary into a raw VCF file, call variants on it, and decode the result into the chormosome–RSID–position–genotype format that we want.
The real trick was getting the right genome reference and creating the proper segment filter file as inputs. Without the reference, you can’t do anything with the raw Illumina chip reads that BAM gives you. Without the segment filter you end up wasting hours of CPU time on scanning the entire human genome…
I’ll spare you the details of my journey. In the end, I tried ALL the human genome references, before I got human_g1k_v37.fasta to work. As for the filter, after much interweb surfing, I found that you can grab the 23andMe file via
wget -O 23andMe_all_hg19_raw.tab https://api.23andme.com/1/genome_snp_map/
But the file had to be massaged into a simpler format using dplyr in R:
Armed with these, I can issue the following commands and end up with a zipped TSV
samtools index VK159.final.bam
<path_to>/extract23.sh -b VK159.final.bam -r <path_to>/human_g1k_v37.fasta -t <path_to>/ref.tab.gz
Unzip the resulting 23andMe_V3_hg19.txt.zip file it and upload the txt anywhere you want, just like you would with the raw data files you get directly from 23andMe!
I like checking the results by uploading them to MyTrueAncestry and seeing if they get identified correctly. Some samples they don’t have, but they will usually have related ones. Often times, though, especially for the Viking specimens, they will already have the same one and the match will be perfect
See, I won :D
If you are curious and have an account on GEDmatch, you can run against the following ones I’ve uploaded:
GU2123168 = Early Sarmatian
JU2094003 = Early Sarmatian
WC7594213 = Aldy Bel “proto Scythian”
GU2799296 = Aldy Bel “proto Scythian”
GJ7792846 = Pazyryk “proto Scythian”
WN5669614 = Pazyryk “proto Scythian”
SC1667894 = Zevakino Chilikta
XC5458090 = Scythian Ukraine
BM3927753 = Old “proto” Saami
AU8420095 = Pskov Viking
AX3450330 = Old Rus
GH6744814 = Chernigov Rus
CM6421473 = Chernigov Rus
KH7245150 = Smolensk Rus
EP4306886 = Medieval Livonia
RJ7354034 =Medieval Livonia
HC7142803 = Izyaslav Ingvarevich
In the next installment I’ll go into some of the analysis tools I’ve tried putting these through and the results I’ve got so far. More to come!