By Iliya Garakh, CTO in Data — Dec 16, 2022

How soon will we be able to store files in our DNA?

Storing files in DNA, with a single sample holding terabytes of information, is a realistic prospect according to scientists.

To date, humanity has produced around 10 trillion gigabytes of data, and every day people generate emails, photographs, films, and other files that add up to another 2.5 million gigabytes. Much of this information is kept in exabyte data centers (an exabyte is 1 billion gigabytes), which can cover the area of several football fields and cost around one billion dollars to build and maintain. Researchers are pursuing an alternative: DNA, which can store vast quantities of information in a very compact form.

According to Mark Bathe, a professor of biological engineering at the Massachusetts Institute of Technology, a coffee cup full of DNA could hypothetically hold all of the data in the world.

The DNA molecule is an ideal storage device for digital data

"We need innovative methods to store the massive volumes of data that are growing throughout the world," says Mark Bathe. "DNA is a thousand times denser than any flash drive, and it also has the fascinating virtue of not using energy. Anything may be written into DNA and stored indefinitely," he continues.

Text, images, and any other type of information are encoded as a series of zeros and ones when saved to digital storage devices. The same information can be encoded in DNA using the four nucleotides that make up the genetic code, designated by the letters A, T, G, and C. For instance, 0 and 1 can be represented by the letters G and C, respectively.
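The mapping described above can be sketched in a few lines of Python. This is only an illustration: the one-bit table follows the G/C example from the text, while the two-bit table (A=00, C=01, G=10, T=11) is an assumed mapping chosen for this sketch. Practical schemes also add error correction, which is left out here.

```python
# Hypothetical mappings for illustration only.
ONE_BIT = {"0": "G", "1": "C"}  # the example given in the text
TWO_BIT = {"00": "A", "01": "C", "10": "G", "11": "T"}  # assumed mapping


def to_bits(data: bytes) -> str:
    """Return the bytes as a string of '0' and '1' characters."""
    return "".join(f"{byte:08b}" for byte in data)


def encode(data: bytes, table: dict[str, str]) -> str:
    """Encode bytes as nucleotides using a fixed-width bit-to-base table."""
    width = len(next(iter(table)))  # bits consumed per nucleotide
    bits = to_bits(data)
    return "".join(table[bits[i:i + width]] for i in range(0, len(bits), width))


def decode(strand: str, table: dict[str, str]) -> bytes:
    """Invert encode(): map each base back to bits, then pack into bytes."""
    reverse = {base: bits for bits, base in table.items()}
    bits = "".join(reverse[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"Hi"
one_bit_strand = encode(message, ONE_BIT)  # 16 nucleotides for 2 bytes
two_bit_strand = encode(message, TWO_BIT)  # 8 nucleotides for 2 bytes
print(one_bit_strand, two_bit_strand)
```

Using all four letters halves the length of the strand, which is why denser mappings are attractive when synthesis is the main cost.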

DNA possesses various characteristics that make it a good information carrier:

• DNA is very stable

• DNA is relatively simple to synthesize and sequence

• DNA is extremely dense: each nucleotide, which can encode up to two bits, occupies about 1 cubic nanometer. An exabyte of data stored as DNA could fit in the palm of your hand.
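The density claim can be checked with rough arithmetic. The sketch below takes the two figures quoted in the list (two bits and about 1 cubic nanometer per nucleotide) at face value and assumes a decimal exabyte of 10^18 bytes. It ignores packaging, redundancy, and error correction, so it is a lower bound on volume rather than an engineering estimate.

```python
BITS_PER_NUCLEOTIDE = 2      # upper bound quoted in the text
NM3_PER_NUCLEOTIDE = 1.0     # approximate volume quoted in the text
EXABYTE_BYTES = 10**18       # assumption: decimal exabyte

# Number of nucleotides needed to hold one exabyte.
nucleotides = EXABYTE_BYTES * 8 // BITS_PER_NUCLEOTIDE

# 1 mm = 10^6 nm, so 1 mm^3 = 10^18 nm^3.
NM3_PER_MM3 = (10**6) ** 3
volume_mm3 = nucleotides * NM3_PER_NUCLEOTIDE / NM3_PER_MM3

print(f"{nucleotides:.1e} nucleotides, about {volume_mm3:.0f} cubic millimeters")
```

Under these assumptions the raw DNA for an exabyte occupies only a few cubic millimeters, which is consistent with the claim that it would fit in the palm of a hand.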

However, there is a drawback: producing such enormous amounts of DNA is extremely expensive. Writing one petabyte of data (1 million gigabytes) currently costs about $1 trillion. According to Bathe, the cost of synthesis needs to fall by around six orders of magnitude before archives based on a biological polymer become economical, which he expects to happen within 10 to 20 years.
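To put the cost figures above in per-gigabyte terms, here is a short calculation. It uses only the numbers quoted in the paragraph; the variable names are made up for this sketch.

```python
COST_PER_PETABYTE_USD = 10**12    # $1 trillion, as quoted in the text
GIGABYTES_PER_PETABYTE = 10**6    # 1 petabyte = 1 million gigabytes
REQUIRED_DROP_ORDERS = 6          # six orders of magnitude, as quoted

# Cost today, expressed per gigabyte.
cost_per_gb_now = COST_PER_PETABYTE_USD // GIGABYTES_PER_PETABYTE

# Cost per gigabyte after the required reduction.
cost_per_gb_target = cost_per_gb_now / 10**REQUIRED_DROP_ORDERS

print(f"today: ${cost_per_gb_now:,} per GB")
print(f"after a 10^{REQUIRED_DROP_ORDERS} reduction: ${cost_per_gb_target:.2f} per GB")
```

In other words, writing data to DNA costs roughly a million dollars per gigabyte today, and the quoted target corresponds to about one dollar per gigabyte.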

Another difficulty is retrieving the file you need.

"What happens if technology advances to the point where it is economically feasible to write an exabyte or zettabyte of data into DNA? You'll have a pile of DNA containing millions of photographs, texts, videos, programs, and other data, and you'll need to locate a certain file: how will you accomplish it?" Bathe asks.

It's like looking for a needle in a haystack.

How are files retrieved?

At present, the most common method for retrieving DNA files is PCR (polymerase chain reaction). Each file includes a sequence designed to bind to a particular PCR primer (a primer is a short piece of nucleic acid). To extract a particular file, the matching primer is added to the sample to locate and amplify the desired sequence. One drawback of this method is crosstalk between the primer and off-target DNA sequences, which can cause unwanted files to be pulled out. In addition, PCR retrieval requires enzymes and consumes a considerable amount of the DNA in the sample. You sort of have to burn the haystack to find the needle.

Professor Bathe and his colleagues addressed the problem by encapsulating each file in a silica particle measuring 6 micrometers and attaching a short DNA sequence that indicates what the file contains. Using this method, the researchers retrieved individual images stored as DNA sequences from a set of 20 files with 100 percent accuracy. Given the number of labels that could be used, the approach could scale up to 10^20 files. That is a 1 followed by 20 zeros, or 100 quintillion.

Hack DNA to find the right file

The MIT team devised a new retrieval approach based on isolating each file in its own silica particle. Each such "capsule" is labeled with single-stranded DNA "barcodes" that correspond to the file's contents, such as "cat" or "airplane". To demonstrate the method cost-effectively, the researchers encoded 20 different images into DNA segments around 3,000 nucleotides long, each equivalent to about 100 bytes. (They also showed that files as large as a gigabyte could fit within the capsules.)

When the researchers want to extract a specific image, they take a sample of the DNA and add primers that match the labels they are looking for: "cat", "red", and "wild" for a tiger shot, or "cat", "orange", and "domestic" for a photo of a domestic cat. The primers are tagged with fluorescent or magnetic particles, which makes it simple to pull out and identify matching files while leaving the rest of the DNA intact to be returned to storage. The strategy is comparable to searching for terms on Google.
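The label-based lookup described above behaves like a Boolean AND query over sets of tags. The sketch below is a software analogy only, not the laboratory procedure: the file names, the extra "airplane" entry, and the `retrieve` function are hypothetical, and each capsule is modeled simply as a set of labels.

```python
# Hypothetical pool of capsules: file name -> barcode labels attached to it.
capsules = {
    "tiger.jpg": {"cat", "red", "wild"},
    "housecat.jpg": {"cat", "orange", "domestic"},
    "jet.jpg": {"airplane", "gray"},
}


def retrieve(pool, query):
    """Select capsules carrying every label in the query (Boolean AND).

    Returns the selected file names and the untouched remainder of the pool,
    mirroring the idea that unmatched DNA goes back into storage intact.
    """
    wanted = set(query)
    selected = {name for name, labels in pool.items() if wanted <= labels}
    remainder = {name: labels for name, labels in pool.items()
                 if name not in selected}
    return selected, remainder


tiger_hits, rest = retrieve(capsules, {"cat", "red", "wild"})
all_cats, _ = retrieve(capsules, {"cat"})
print(tiger_hits, sorted(rest), sorted(all_cats))
```

A narrower query (more labels) selects fewer capsules, while a single label such as "cat" returns every file tagged with it, much as adding search terms narrows a web search.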

"So far, the search speed is one kilobyte per second. The size of the data per capsule determines the search speed of our file system. It is also worth mentioning that the speed is constrained by the prohibitively high cost of writing even 100 gigabytes of data into DNA, as well as by the number of sorters that can be used concurrently.

"If DNA synthesis gets cheap enough, we can optimize the quantity of data stored," said researcher James Banal.

The researchers built their barcodes from a library of 100,000 single-stranded DNA sequences, each around 25 nucleotides long, developed by Stephen Elledge, a professor of genetics and medicine at Harvard Medical School. With two of these labels on each file, 100,000 × 100,000 = 10^10 (10 billion) files can be labeled uniquely; with four labels per file, the figure rises to 10^20.
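The labeling capacity follows from simple counting. The sketch below assumes that each label position on a file is chosen independently from the 100,000-sequence library, so the number of distinct label combinations is the library size raised to the number of labels. The function name is made up for illustration.

```python
LIBRARY_SIZE = 100_000  # barcode sequences in the library, as stated in the text


def labeling_capacity(labels_per_file: int) -> int:
    """Number of distinct label combinations, assuming independent positions."""
    return LIBRARY_SIZE ** labels_per_file


for k in (1, 2, 4):
    print(f"{k} label(s) per file: {labeling_capacity(k):.0e} distinct files")
```

Each additional label multiplies the capacity by 100,000, which is how a modest library of barcodes can address an enormous number of files.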

Final words

While DNA may not be extensively employed as a data carrier for some time, there is currently a large need for low-cost, high-volume storage solutions.

The DNA encapsulation approach could be effective for archiving data that is accessed only rarely. Professor Bathe's laboratory is already working on launching a company called Cache DNA, which will offer a method for the long-term storage of information in DNA.