Playing with Archeological DNA Samples: Part 1 — Getting the data

Tim Piatenko
4 min readJun 12, 2021

If you’ve read any of my previous posts, you know I’m an avid user of MyTrueAncestry — a DNA data analysis service that accepts uploads of raw data from popular testing companies like 23andMe or MyHeritage. What’s different about them is that instead of (just) running a PCA on your DNA data against “known” populations, they pull in actual historical DNA samples retrieved from various archeological sites. The more you pay, the more samples you have access to, and the more complete your ancestral portrait.

I noticed right away that the matches you get on MTA are labeled by their “official” research IDs, such as VK159 for an old Rus (Viking) from Pskov, Russia. As a former scientist, I got curious about the underlying data and went looking for the public databases that house it. First I found a map that lists nearly ALL archeological DNA samples processed for various research studies so far. It’s an amazing tool:

When you click on a dot on that map, you get a brief description + links to the research paper that it appeared in, along with mtDNA and Y-DNA haplogroups (if any) that were identified for this sample. When you click through to the Nature paper in the example above, you get this

And if you click through the Data Availability link, you end up here

Which links to this sample in the European Nucleotide Archive. In this case, this is part of a large study of Viking genomics, which includes hundreds of individuals!

The only problem is… what the %^&* is a fastq or bam file??

Well, turns out they are only two of the dozens of formats used by genetics researchers… And there’s a whole slew of conversion tools out there. The trick is to make them work :) What I wanted was a 23andMe compatible file in the following format:

After beating my head against the wall for a few days, I finally found the right combination of tools to make this work.

First, you need samtools and bcftools, which luckily are available via Homebrew on a Mac. You also need extract23 from github, which provides the right sequence of commands to unpack a BAM binary into a raw VCF file, call variants on it, and decode the result into the chormosome–RSID–position–genotype format that we want.

The real trick was getting the right genome reference and creating the proper segment filter file as inputs. Without the reference, you can’t do anything with the raw Illumina chip reads that BAM gives you. Without the segment filter you end up wasting hours of CPU time on scanning the entire human genome…

I’ll spare you the details of my journey. In the end, I tried ALL the human genome references, before I got human_g1k_v37.fasta to work. As for the filter, after much interweb surfing, I found that you can grab the 23andMe file via

wget -O 23andMe_all_hg19_raw.tab https://api.23andme.com/1/genome_snp_map/

But the file had to be massaged into a simpler format using dplyr in R:

Armed with these, I can issue the following commands and end up with a zipped TSV

samtools index VK159.final.bam

<path_to>/extract23.sh -b VK159.final.bam -r <path_to>/human_g1k_v37.fasta -t <path_to>/ref.tab.gz

Unzip the resulting 23andMe_V3_hg19.txt.zip file it and upload the txt anywhere you want, just like you would with the raw data files you get directly from 23andMe!

I like checking the results by uploading them to MyTrueAncestry and seeing if they get identified correctly. Some samples they don’t have, but they will usually have related ones. Often times, though, especially for the Viking specimens, they will already have the same one and the match will be perfect

See, I won :D

If you are curious and have an account on GEDmatch, you can run against the following ones I’ve uploaded:

GU2123168 = Early Sarmatian

JU2094003 = Early Sarmatian

WC7594213 = Aldy Bel “proto Scythian”

GU2799296 = Aldy Bel “proto Scythian”

GJ7792846 = Pazyryk “proto Scythian”

WN5669614 = Pazyryk “proto Scythian”

SC1667894 = Zevakino Chilikta

XC5458090 = Scythian Ukraine

BM3927753 = Old “proto” Saami

AU8420095 = Pskov Viking

AX3450330 = Old Rus

GH6744814 = Chernigov Rus

CM6421473 = Chernigov Rus

KH7245150 = Smolensk Rus

EP4306886 = Medieval Livonia

RJ7354034 =Medieval Livonia

HC7142803 = Izyaslav Ingvarevich

In the next installment I’ll go into some of the analysis tools I’ve tried putting these through and the results I’ve got so far. More to come!

--

--

Tim Piatenko

I’m a Caltech particle physics PhD turned Data Scientist. Russia → Japan → US. Also on Mastodon @timoha@mastodon.world / @timoha@newsie.social 🐘