I have made my own ancient DNA kits!
(TLDR: fun stuff is at the end)
Well, I think I have officially lost my mind 🤪 I spent a good chunk of my past week figuring out where to find published research DNA samples and how to convert them from the crazy data formats geneticists use to a simple CSV like you get from 23andMe or MyHeritage.
And I finally succeeded!!!
While it's still all fresh in my mind, I'm going to document this feat here. Maybe I'll do more. Maybe someone else will try to replicate what I've done. Who knows. For the benefit of humanity!
I first got the idea from My True Ancestry, since they use archeological DNA samples in the public domain and add their analysis and visualization tools on top. I realized I could see the sample IDs for those that were above my membership level and started wondering why I wouldn't just go and find those myself…
So finally, I landed on the David Reich Lab website. It has everything you need! Well, mostly… or everything, if you are a genetics researcher. I'm not. I know data, but their data is special. Still, that's never stopped me once I embark on a quest I'm passionate about. I read about the GEDmatch Archaic samples that Felix Immanuel made available a while back and realized that this is possible. I would not stop till I found a way to do it, too.
So I went to the Datasets tab and started grabbing them. I also read the page about the file formats, which made my head spin for days… but it eventually gave me what I needed. I also grabbed the EIG tools from GitHub so I could use the convertf utility to turn the crazy binary files I downloaded into tabular text I could import into RStudio. I did have to compile it myself and use Homebrew on macOS to install the missing gsl and openblas libraries.
So in the end I had a bunch of datasets from various publications in the geno format:
- genotype file (most of them in the packed ancestrymap binary format…)
- snp file (which SNPs were tested on which chromosome and what the reference and variant alleles are)
- ind file (data on the individuals whose DNA is in there)
I decided to go with the Scythian-Sarmatian dataset:
Then I ran the convertf utility to get to a text version using the following:
Where the par file looks like this
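For reference, convertf reads its settings from a plain-text parameter file of `key: value` lines and is invoked as `convertf -p <parfile>`. The Python sketch below only builds the text of such a file and the matching command. The file names and prefixes are hypothetical placeholders, not the ones from this dataset, and `outputformat: ANCESTRYMAP` is an assumption based on the text output described here.

```python
def make_convertf_par(in_prefix: str, out_prefix: str) -> str:
    """Build the text of a convertf parameter file.

    The prefixes are hypothetical; substitute the names of the
    .geno/.snp/.ind files you actually downloaded.
    """
    settings = {
        "genotypename": f"{in_prefix}.geno",    # packed binary genotype input
        "snpname": f"{in_prefix}.snp",          # SNP description input
        "indivname": f"{in_prefix}.ind",        # individuals input
        "outputformat": "ANCESTRYMAP",          # assumed text output format
        "genotypeoutname": f"{out_prefix}.ancestrymapgeno",
        "snpoutname": f"{out_prefix}.snp",
        "indivoutname": f"{out_prefix}.ind",
    }
    return "".join(f"{key}: {value}\n" for key, value in settings.items())


par_text = make_convertf_par("dataset", "dataset_text")
par_path = "par.convert"                   # hypothetical file name
command = ["convertf", "-p", par_path]     # could be passed to subprocess.run

print(par_text)
print(" ".join(command))
```

Saving `par_text` to `par_path` and running the printed command in the folder that holds the downloaded files should produce the three text outputs, provided convertf is on your PATH.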
And I ended up with a new file that looks like this:
In addition to the snp and ind ones from before:
Now we are talking! The only part left was to figure out how to get the actual inherited SNPs (reference vs. variant) from all of this. And the file format descriptions explained it to me:
So I took this to mean that if I have "G A" in my SNP file, then the values in my ancestrymapgeno file mean:
- 0 = only variants = "AA"
- 1 = one reference = "GA"
- 2 = two reference = "GG"
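The rule in the list above can be written as a tiny function. This is a Python sketch of that interpretation only (the actual work here was done in R). The function name is made up for illustration, and treating any other value as a `--` no-call is an assumption.

```python
def decode_genotype(ref: str, var: str, ref_count: int) -> str:
    """Turn a count of reference alleles into a two-letter genotype.

    ref and var are the two alleles listed for the SNP in the snp file;
    ref_count is the number stored for one individual at that SNP.
    """
    if ref_count == 2:
        return ref + ref   # two copies of the reference allele
    if ref_count == 1:
        return ref + var   # one reference, one variant
    if ref_count == 0:
        return var + var   # only variant alleles
    return "--"            # assumption: any other value is a no-call


# Using the "G A" example from the list above:
print(decode_genotype("G", "A", 0))  # AA
print(decode_genotype("G", "A", 1))  # GA
print(decode_genotype("G", "A", 2))  # GG
```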
Now, I may be wrong here, so if you know better, please correct me! Otherwise, the rest was just loading the files into R and merging them + using the above logic to get to the final file, with the exception of naming conventions:
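The load-merge-decode step described above can be sketched in Python for anyone not using R. This is not the original R script. It makes several assumptions:

- the text genotype file has one `SNP sample count` triple per line
- the snp file has six whitespace-separated columns (ID, chromosome, genetic position, physical position, reference allele, variant allele)
- chromosome codes 23, 24 and 90 stand for X, Y and MT
- the output header is `RSID,CHROMOSOME,POSITION,RESULT`

The SNP IDs, sample IDs and values below are toy data invented for the example.

```python
import csv
import io

# Assumed naming convention for non-autosomal chromosome codes.
CHROM_NAMES = {"23": "X", "24": "Y", "90": "MT"}


def read_snps(lines):
    """Map SNP id -> (chromosome, position, ref, var) from snp-file lines."""
    snps = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        snp_id, chrom, _genetic_pos, pos, ref, var = fields[:6]
        snps[snp_id] = (CHROM_NAMES.get(chrom, chrom), pos, ref, var)
    return snps


def build_kit(geno_lines, snps, sample_id):
    """Collect (rsid, chromosome, position, genotype) rows for one sample."""
    rows = []
    for line in geno_lines:
        fields = line.split()
        if len(fields) < 3 or fields[1] != sample_id or fields[0] not in snps:
            continue
        chrom, pos, ref, var = snps[fields[0]]
        calls = {"2": ref + ref, "1": ref + var, "0": var + var}
        rows.append((fields[0], chrom, pos, calls.get(fields[2], "--")))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with an assumed header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["RSID", "CHROMOSOME", "POSITION", "RESULT"])
    writer.writerows(rows)
    return out.getvalue()


# Toy input, invented for illustration only.
snp_lines = ["rs1 1 0.0 1000 G A", "rs2 23 0.0 2000 C T"]
geno_lines = ["rs1 SAMPLE1 1", "rs2 SAMPLE1 0", "rs1 SAMPLE2 2"]

kit_rows = build_kit(geno_lines, read_snps(snp_lines), "SAMPLE1")
kit_csv = to_csv(kit_rows)
print(kit_csv)
```

With real files you would pass open file handles instead of the toy lists and write `kit_csv` to one file per sample.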
So in the end, I ended up with this in R:
And I had my two samples!!!
I uploaded them to GEDmatch
And they worked!!!
According to the paper, this is who they are:
And then I thought, "How can I test the validity of my results?" And it hit me: why not upload it to MyTrueAncestry and see what comes out? So I did… And lo and behold, it worked!!! It comes out as northern Chinese / Mongolian / Scythian!!!
Don't know about you, but I think this is absolutely amazing!
But wait, there's more! Because these are ancient samples, they still retain a lot of the original DNA humans took with them from Africa via the Middle East. Check these out!
And also, since MTA contains archeological samples of the descendants of these folks as well, we can see how some of their great-great…-great children ended up in Europe as well:
BTW, the Sardinian matches support the new theory of Ashkenazi Jewish origins, which claims they were Anatolian-Hellenic-Romans who converted during the Roman period and were eventually displaced by the Slavs, mainly the Kievan Rus, who decimated the Khazars (who had also converted to Judaism) and mixed with the local ancestors of Scythian-Sarmatians. History is mind-blowing!
Update:
Since I've started uploading my kits to GEDmatch and MyTrueAncestry, I've found existing matches to confirm my results! :D
On MyTrueAncestry, I0575 matched itself perfectly:
This is actually the PR9 sample shown here:
And then I saw this on GEDmatch, once one-to-many was available:
That's A10 shown here: