I have made my own ancient DNA kits!

Tim Piatenko
6 min read · May 15, 2021

(TLDR: fun stuff is at the end)

Well, I think I have officially lost my mind 🤪 I spent a good chunk of my past week figuring out where to find published research DNA samples and how to convert them from the crazy data formats geneticists use to a simple CSV like you get from 23andMe or MyHeritage.

And I finally succeeded!!!

While it’s still all fresh in my mind, I’m going to document this feat here. Maybe I’ll do more. Maybe someone else will try to replicate what I’ve done. Who knows. For the benefit of humanity 😄

I first got the idea from My True Ancestry, since they use archeological DNA samples in the public domain and add their analysis and visualization tools on top. I realized I could see the sample IDs for those that were above my membership level and started wondering why I wouldn’t just go and find those myself…

So finally, I landed on the David Reich Lab website. It has everything you need! Well, mostly… or everything, if you are a genetics researcher. I’m not. I know data, but their data is special. Still, that has never stopped me once I embark on a quest I’m passionate about. I read about the GEDmatch Archaic samples that Felix Immanuel made available a while back and realized that this was possible. I would not stop till I found a way to do it, too.

So I went to the Datasets tab and started grabbing them. I also read the page about the file formats, which made my head spin for days… but it eventually gave me what I needed. I also grabbed the EIG (EIGENSOFT) package from GitHub so I could use its convertf utility to turn the crazy binary files I downloaded into tabular text I could import into RStudio. I did have to compile it myself and use Homebrew on macOS to get the missing gsl and openblas libraries.

So in the end I had a bunch of datasets from various publications in the geno format:

  • genotype file (most of them in the binary packed ancestrymap format…)
  • snp file (which SNPs were tested on which chromosome and what the reference and variant alleles are)
  • ind file (data on the individuals whose DNA is in there)
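To make the two text files concrete, here is a minimal Python sketch of how one line from each could be parsed. It assumes the usual EIGENSOFT layout as I understand it (snp: SNP ID, chromosome, genetic position, physical position, reference allele, variant allele; ind: sample ID, sex, population label). The example rows and the names `Snp`, `Individual`, `parse_snp_line`, and `parse_ind_line` are made up for illustration.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    genetic_pos: float
    physical_pos: int
    ref: str  # reference allele (listed first in the snp file)
    var: str  # variant allele (listed second)


class Individual(NamedTuple):
    sample_id: str
    sex: str    # assumed to be M, F, or U
    label: str  # population or group label


def parse_snp_line(line: str) -> Snp:
    """Split one whitespace-delimited snp line into named fields."""
    snp_id, chrom, gpos, ppos, ref, var = line.split()[:6]
    return Snp(snp_id, chrom, float(gpos), int(ppos), ref, var)


def parse_ind_line(line: str) -> Individual:
    """Split one whitespace-delimited ind line into named fields."""
    sample_id, sex, label = line.split()[:3]
    return Individual(sample_id, sex, label)


# Made-up rows, not taken from any real dataset.
snp = parse_snp_line("rs0000001  1  0.020130  752566  G  A")
ind = parse_ind_line("SAMPLE01  M  Example_Population")
print(snp)
print(ind)
```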

I decided to go with the Scythian-Sarmatian dataset:

Then I ran the convertf utility to get to a text version using the following:

Where the par file looks like this
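For anyone replicating this, a par file for this kind of conversion can be sketched as follows. The keys are the ones convertf reads as far as I know (`genotypename`, `snpname`, `indivname`, `outputformat`, and the three output names); the file prefixes and the helper name `make_convertf_par` are placeholders, not the actual names from this dataset.

```python
def make_convertf_par(prefix: str, out_prefix: str) -> str:
    """Build the text of a convertf parameter file.

    prefix and out_prefix are hypothetical file stems; swap in the
    names of the files you actually downloaded.
    """
    lines = [
        f"genotypename:    {prefix}.geno",
        f"snpname:         {prefix}.snp",
        f"indivname:       {prefix}.ind",
        "outputformat:    ANCESTRYMAP",  # text output, one line per genotype
        f"genotypeoutname: {out_prefix}.ancestrymapgeno",
        f"snpoutname:      {out_prefix}.snp",
        f"indivoutname:    {out_prefix}.ind",
    ]
    return "\n".join(lines) + "\n"


par_text = make_convertf_par("input_dataset", "output_dataset")
print(par_text)

# The utility is then pointed at the saved par file with -p.
command = ["convertf", "-p", "par.example"]
print(" ".join(command))
```

Save `par_text` to a file (here called `par.example`, an arbitrary name) and run the printed command from the folder that holds the input files.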

And I ended up with a new file that looks like this:

In addition to the snp and ind ones from before:

Now we are talking! The only part left was to figure out how to get the actual inherited SNPs (reference vs. variant) from all of this. And the file format descriptions explained it to me:

So I took this to mean that if I have “G A” in my SNP file (G is the reference allele, A is the variant), then if in my ancestrymapgeno I have:

  • 0 = only variants = “AA”
  • 1 = one reference = “GA”
  • 2 = two reference = “GG”
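The mapping above can be sketched as a tiny Python function. It follows my reading of the format (the number counts copies of the reference allele), and the function name `decode_genotype` and the `"--"` no-call marker are assumptions for illustration, so treat it as a sketch rather than the official decoding.

```python
def decode_genotype(ref: str, var: str, ref_count: int) -> str:
    """Turn a reference-allele count into a two-letter genotype.

    ref and var come from the snp file; ref_count comes from the
    genotype file. Anything other than 0, 1, or 2 is treated as a
    no-call and returned as "--" (an assumed placeholder).
    """
    calls = {
        0: var + var,  # no reference copies: two variant alleles
        1: ref + var,  # one reference copy: heterozygous
        2: ref + ref,  # two reference copies
    }
    return calls.get(ref_count, "--")


for count in (0, 1, 2):
    print(count, decode_genotype("G", "A", count))
```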

Now, I may be wrong here, so if you know better, please correct me! Otherwise, the rest was just loading the files into R and merging them + using the above logic to get to the final file, with the exception of naming conventions:

So I ended up with this in R:
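For readers who prefer Python, the same merge can be sketched like this (this is not the R code from the post). It assumes the text genotype file has three columns (SNP ID, sample ID, reference-allele count), that SNPs missing for a sample simply have no usable row, and that the target is a 23andMe-style tab-separated layout of rsid, chromosome, position, genotype. The chromosome renaming (23 → X, 24 → Y, 90 → MT) is my understanding of the EIGENSOFT codes. All data and the names `build_kit` and `write_kit` are made up.

```python
import csv
import io

# Assumed EIGENSOFT chromosome codes mapped to consumer-kit names.
CHROM_NAMES = {"23": "X", "24": "Y", "90": "MT"}


def build_kit(snp_lines, geno_lines, sample_id):
    """Merge snp and genotype rows into kit rows for one sample."""
    # SNP ID -> reference-allele count for the chosen sample.
    counts = {}
    for line in geno_lines:
        snp_id, sample, count = line.split()[:3]
        if sample == sample_id:
            counts[snp_id] = int(count)

    rows = []
    for line in snp_lines:
        snp_id, chrom, _gpos, ppos, ref, var = line.split()[:6]
        n = counts.get(snp_id)
        if n not in (0, 1, 2):
            continue  # no call for this sample at this SNP
        genotype = {0: var + var, 1: ref + var, 2: ref + ref}[n]
        rows.append((snp_id, CHROM_NAMES.get(chrom, chrom), ppos, genotype))
    return rows


def write_kit(rows, handle):
    """Write rows as tab-separated text with a 23andMe-style header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["# rsid", "chromosome", "position", "genotype"])
    writer.writerows(rows)


# Made-up inputs, not taken from any real dataset.
snp_lines = [
    "rs0000001 1 0.020130 752566 G A",
    "rs0000002 23 0.500000 1000000 C T",
    "rs0000003 2 0.100000 5000 A G",
]
geno_lines = [
    "rs0000001 SAMPLE01 1",
    "rs0000001 SAMPLE02 2",
    "rs0000002 SAMPLE01 0",
    "rs0000003 SAMPLE02 2",
]

rows = build_kit(snp_lines, geno_lines, "SAMPLE01")
buffer = io.StringIO()
write_kit(rows, buffer)
print(buffer.getvalue())
```

Building a dictionary of counts first keeps the merge to a single pass over each file, which matters when the snp file has hundreds of thousands of rows. The sketch does not treat male X or Y calls specially.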

And I had my two samples!!!

I uploaded them to GEDmatch

And they worked!!!

According to the paper, this is who they are:

And then I thought, “How can I test the validity of my results?” And it hit me — why not upload it to MyTrueAncestry and see what comes out? So I did… And lo and behold, it worked!!! It comes out as northern Chinese / Mongolian / Scythian!!!

Don’t know about you, but I think this is absolutely amazing! 😎

But wait, there’s more! Because these are ancient samples, they still retain a lot of the original DNA humans took with them from Africa via the Middle East. Check these out!

Also, since MTA contains archeological samples of the descendants of these folks, we can see how some of their great-great-…-grandchildren ended up in Europe:

BTW, the Sardinian matches support the new theory of Ashkenazi Jewish origins, which claims they were Anatolian-Hellenic-Romans who converted during the Roman period and were eventually displaced by the Slavs, mainly the Kievan Rus, who decimated the Khazars (who had also converted to Judaism) and mixed with the local ancestors of the Scythian-Sarmatians. History is mind-blowing!

Update:

Since I’ve started uploading my kits to GEDmatch and MyTrueAncestry, I’ve found existing matches to confirm my results! :D

On MyTrueAncestry, I0575 matched itself perfectly:

This is actually the PR9 sample shown here:

And then I saw this on GEDmatch, once one-to-many was available:

That’s A10 shown here:

Tim Piatenko

I’m a Caltech particle physics PhD turned Data Scientist. Russia → Japan → US. Also on Mastodon @timoha@mastodon.world / @timoha@newsie.social 🐘