How DNA testing actually works

Since I’ve become involved in multiple Facebook groups dealing with DNA, ancestry, and genealogy, I’ve realized that most people have only a vague idea of what sits underneath all of this, and many seem to hold false preconceived notions. Let’s walk through the basic steps of how DNA testing and the subsequent data analyses actually work. I’m not claiming deep expertise here, just a reasonable basic understanding and working knowledge.

Let’s start at the ground level. This is DNA:

It lives in the nucleus of almost every cell in your body and carries the instructions a living organism needs to grow, regenerate, and reproduce. It’s the code that determines how things will function.

And the code is made up of just 4 symbols called nucleotides: A, C, G, and T. They sit on the two strands and pair up with each other (A with T, C with G), creating the famous double-helix structure. Stretches of base pairs that encode something functional are called genes, and the very long DNA molecules that carry them are chromosomes. There are 22 pairs where you get one copy from each parent, and then pair 23, XX or XY, which determines your biological sex.

The process of sequencing itself can be described, in very simplified terms, as ripping apart the DNA helix, making copies, using dyes to color the different bases, imaging, and feeding the data to a computer.

And you end up with a colorful picture like the one on the screen in the corner of that graphic.

Of course the raw images are not that awesome :)

But anyway, now we have the code mapped and can transform it into a text-based dataset like this:

RSID here stands for “Reference SNP cluster ID”, which lives on a chromosome “N” at position “XXYYZZ”. But what is an SNP? It’s a single nucleotide polymorphism — one base-pair difference in the DNA sequence.
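To make the dataset concrete, here is a minimal Python sketch of reading such a file. It assumes a 23andMe-style layout (tab-separated, comment lines starting with `#`, columns for RSID, chromosome, position, and genotype); other vendors lay their exports out differently. The sample rows and the `read_raw` helper are made up for illustration, except for the rs9341278 identifier, which is discussed further down.

```python
import csv
import io

# Made-up sample in a 23andMe-style layout (assumption: tab-separated,
# '#' comment lines, columns rsid / chromosome / position / genotype).
SAMPLE = """# rsid\tchromosome\tposition\tgenotype
rs0000001\t1\t1000\tAG
rs0000002\t1\t2500\t--
rs9341278\tY\t15469724\tA
"""

def read_raw(handle):
    """Return {rsid: (chromosome, position, genotype)} from a raw-data file."""
    # Skip blank lines and '#' comment lines before handing rows to csv.
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    snps = {}
    for rsid, chrom, pos, genotype in csv.reader(rows, delimiter="\t"):
        snps[rsid] = (chrom, int(pos), genotype)
    return snps

kit = read_raw(io.StringIO(SAMPLE))
print(len(kit), kit["rs9341278"])
```

A genotype of `--` is a no-call: the test looked at that position but could not read it. For a real export you would pass an open file handle instead of the `io.StringIO` wrapper.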

It’s literally a single switch from T to G etc. And there are millions of them in our genome! This variation is the raw material of genetic diversity in a population. Some variants, like the Y-chromosome ones, persist and can be traced back in time for millennia.

Over the years, we have mapped these nucleotide flips for reference. Here’s an example of a marker in the Y-chromosome that identifies the male haplogroup N:

It’s the rs9341278 cluster at position 15469724 on the Y chromosome, where the ancestral G is replaced by A. We call it M231 or Page91. If your DNA test picked up rs9341278 and got an A there, you fit into the N tree. And then you can go down the tree: if you have C instead of T for rs34442126, then you are N1a1, etc.
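Walking down the tree can be sketched in a few lines of Python. The two markers and their alleles are the ones quoted above; the `PATH` list, the `deepest_branch` function, and the rule of stopping at the first marker that does not show the derived allele are my own simplifications, not how any particular service does it.

```python
# Markers as quoted in the text: (rsid, derived allele, branch name).
# The list layout and function name are hypothetical.
PATH = [
    ("rs9341278", "A", "N"),      # M231 / Page91
    ("rs34442126", "C", "N1a1"),
]

def deepest_branch(calls):
    """calls: {rsid: allele} for Y-chromosome markers.
    Walk the path and return the last branch whose derived allele was seen."""
    branch = None
    for rsid, derived, name in PATH:
        if calls.get(rsid) != derived:
            break  # ancestral allele, no-call, or marker not tested
        branch = name
    return branch

print(deepest_branch({"rs9341278": "A", "rs34442126": "C"}))  # N1a1
print(deepest_branch({"rs9341278": "A", "rs34442126": "T"}))  # N
print(deepest_branch({"rs9341278": "G"}))                     # None
```

Note that an untested marker stops the walk here, so a test that skips a marker would be placed higher up the tree than it deserves. A real classifier would need to handle gaps more carefully.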

For the autosomal DNA that comes from the first 22 pairs of chromosomes, a single marker tells you next to nothing! Unlike the Y, which lacks a matching partner, this DNA recombines in every generation, so each parent passes on a shuffled mix of their own two copies. That’s why you are not a clone of your parents. So what we are looking for is chains of consecutive matching SNPs, whose length is measured in centimorgans, or cM, a unit of genetic distance based on how often recombination happens. When you match one DNA sample against another, you get segment overlaps.
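Here is a rough Python sketch of what "looking for chains of SNPs" can mean. It assumes each kit is a dictionary of position to unphased genotype for a single chromosome, and it counts a position as compatible when the two genotypes share at least one base. The function names and the `min_snps` parameter are hypothetical. Converting a run to cM needs a genetic map, which is left out, and no-calls are simply treated as mismatches.

```python
def shares_allele(g1, g2):
    """True if two unphased genotypes such as 'AG' and 'GG' have a base in common.
    No-calls ('--') never count as a match (a simplification)."""
    return bool((set(g1) & set(g2)) - {"-"})

def matching_runs(kit_a, kit_b, min_snps=3):
    """kit_*: {position: genotype} for one chromosome.
    Return (start, end, n_snps) for each run of consecutive shared positions
    where the genotypes are compatible."""
    runs, current = [], []
    for pos in sorted(kit_a.keys() & kit_b.keys()):
        if shares_allele(kit_a[pos], kit_b[pos]):
            current.append(pos)
            continue
        # A mismatch closes the current run.
        if len(current) >= min_snps:
            runs.append((current[0], current[-1], len(current)))
        current = []
    if len(current) >= min_snps:
        runs.append((current[0], current[-1], len(current)))
    return runs

# Made-up toy data: the mismatch at position 400 splits the chain in two.
a = {100: "AG", 200: "CC", 300: "TT", 400: "AA", 500: "GG", 600: "CT", 700: "AA"}
b = {100: "AA", 200: "CT", 300: "TT", 400: "GG", 500: "GG", 600: "CC", 700: "AG"}
print(matching_runs(a, b))  # [(100, 300, 3), (500, 700, 3)]
```

Real matching tools work with hundreds of thousands of positions and usually tolerate the occasional mismatch so that a single read error does not break a long segment. This sketch does neither.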

Above is my match against a historical sample from almost a millennium ago. I have an overlap in 6 chains of SNPs, totaling 139 cM, with the longest single chain containing 208 matching SNPs (base-pair flips) at 61 cM.

Here’s another example

There’s a single chain of only 104 SNPs at 2 cM, a bit silly really… Most matching services would rule this out as below their cM threshold, because a stretch that short can easily match by chance. It also matters which SNPs were tested. Most DNA tests, unless you map the full genome (usually $$$…), only sample a selection of locations / RSIDs. In fact, one test will sample one set of RSIDs and another test a different one. The sets usually overlap, but not entirely. For the most part this does not matter too much, because neighboring SNPs are often linked in patterns, so testing one or another gives much the same information. It’s like getting your last name and then your father’s last name, assuming they are the same: you have two records, but they carry the same information in the end.

So this gets me to an interesting point. If we have millions of SNPs, and each DNA test samples about 600K of them, why not take multiple tests and combine the results? I did that with my 23andMe and MyHeritage data. I downloaded the files and put them together using R and dplyr. I started with a bit over 600K in each, and ended with a bit under 1.1M in total. There was an overlap of about 140K entries. About 1000 RSIDs were present in both, but had “no-call” in one or the other. And 14K were tested by both, but came back with different results.

Why? Because tests have errors. Sometimes they could not read an RSID at all. Sometimes they got it wrong. One or the other, or both tests could be wrong. And this is why using a combined DNA kit may not yield the same results as each one separately. You may fill in some gaps between tested SNPs that make the resulting overall chains not match anymore. Or you may pick up a conflict that signals an error.
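The merge described above was done in R with dplyr. The Python sketch below is not that code, only an illustration of the bookkeeping involved: which RSIDs appear in one kit only, which agree, which fill in a no-call, and which conflict. The `merge_kits` function, the category names, and the choice to turn conflicts into no-calls are all assumptions made for the example.

```python
NO_CALL = "--"

def merge_kits(kit_a, kit_b):
    """kit_*: {rsid: genotype}. Return (merged, stats).
    Conflicting calls are stored as no-calls (an assumption for this sketch)."""
    merged = {}
    stats = {"only_a": 0, "only_b": 0, "agree": 0, "filled_no_call": 0, "conflict": 0}
    for rsid in kit_a.keys() | kit_b.keys():
        a, b = kit_a.get(rsid), kit_b.get(rsid)
        if b is None:
            merged[rsid], key = a, "only_a"
        elif a is None:
            merged[rsid], key = b, "only_b"
        elif sorted(a) == sorted(b):
            # 'AG' and 'GA' are the same unphased genotype.
            merged[rsid], key = a, "agree"
        elif NO_CALL in (a, b):
            merged[rsid], key = (b if a == NO_CALL else a), "filled_no_call"
        else:
            merged[rsid], key = NO_CALL, "conflict"
        stats[key] += 1
    return merged, stats

# Made-up toy kits covering every category except "only_a".
kit_a = {"rs1": "AG", "rs2": "--", "rs3": "CC", "rs4": "TT"}
kit_b = {"rs1": "GA", "rs2": "CT", "rs3": "CC", "rs4": "GT", "rs5": "AA"}
merged, stats = merge_kits(kit_a, kit_b)
print(len(merged), stats)
```

One caveat: some apparent conflicts may not be read errors at all. If two vendors report the same SNP on opposite DNA strands, the letters will differ even though the underlying result is the same, so a careful merge would want to check for that before counting a conflict.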

Back to the mapped code and what you can do with it.

The most straightforward way is 1:1 matching of one kit against another. It comes out looking like this

You have a chromosome, starting and ending positions, SNPs, and centimorgans. An algorithm may assign this a score of some sort, based on whatever models have been developed using previously analyzed data.
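As a small illustration, a match result like that can be held in a simple record and summarized. The `Segment` class, the `summarize` function, and the sample numbers are made up for this sketch, and the 7 cM default threshold is only a placeholder, since each service picks its own cutoff.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    chromosome: str
    start: int      # starting position
    end: int        # ending position
    snps: int       # number of SNPs in the chain
    cm: float       # genetic length in centimorgans

def summarize(segments, min_cm=7.0):
    """Drop segments below the threshold and total up the rest."""
    kept = [s for s in segments if s.cm >= min_cm]
    return {
        "segments": len(kept),
        "total_cm": sum(s.cm for s in kept),
        "longest_cm": max((s.cm for s in kept), default=0.0),
    }

# Made-up match: two solid segments and one tiny one that gets filtered out.
match = [
    Segment("3", 1_000_000, 9_000_000, 2100, 12.5),
    Segment("7", 20_000_000, 25_000_000, 1400, 8.0),
    Segment("11", 5_000_000, 5_400_000, 104, 2.0),
]
print(summarize(match))
```

Any scoring model a service applies would sit on top of a summary like this one. The sketch stops at the raw totals.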

Now, once you’ve collected thousands of these samples and analyzed them, you can do macro statistical analysis on the whole population, using techniques like Principal Component Analysis.

You map the detected SNPs for each sample in a multidimensional space and look for directionality in the “blob”. Then you try to associate clusters with known populations.
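To give a feel for the idea, here is a toy sketch in plain Python. Each sample is a row of allele counts (0, 1, or 2 copies of some chosen allele per SNP), and the code finds the first principal component using power iteration on the covariance matrix. That is a dependency-free stand-in for the library routines a real analysis would use. The data, the function name, and the two "groups" are invented for the example.

```python
def first_component(rows, iterations=200):
    """rows: list of samples, each a list of allele counts (0, 1, 2).
    Return (direction, scores) for the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Covariance matrix of the SNP columns (m x m).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        # Repeated multiplication pulls v toward the dominant eigenvector.
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project each centered sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores

# Made-up data: the first three samples resemble each other, as do the last three.
samples = [
    [0, 0, 2, 1], [0, 1, 2, 1], [0, 0, 2, 2],
    [2, 2, 0, 1], [2, 1, 0, 1], [2, 2, 0, 0],
]
direction, scores = first_component(samples)
print([round(s, 2) for s in scores])
```

The first three scores land on one side of zero and the last three on the other, which is the "directionality in the blob" in miniature. With real data you would have thousands of samples, hundreds of thousands of SNPs, and several components to look at.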

And then you try to map individuals onto that and guess their admixture

As you can see from above, the more recent the populations, the less clear things become. As ancient people moved out of Africa 50K years ago, they first separated for millennia, before coming back together and remixing. So when you are trying to say you are

what you are really saying is that you are neolithic European + Siberian + Steppe Scythian + … with an unknown weighting.

Personally, what works for me is running a number of these admixture / oracle models on GEDmatch and then referencing that against individual historical matches on MyTrueAncestry to put together a story. We don’t have enough ancient matches for precise populations, and we don’t have enough separation among the modern ones to be exact.

Another thing you can do is map your modern matches and see if that map jibes with what MyTrueAncestry (MTA) gives you and what you see in your admixture models.

I’m a Caltech particle physics PhD turned Data Scientist, currently working as an independent consultant.