Glossary of Terms

Important terms used at Biota

16S

16S rRNA is an rRNA sequence that makes up part of ribosomes. Ribosomes are large molecular complexes that perform protein synthesis. They are made up of two subunits, one large and one small. 16S rRNA makes up part of the small subunit, where its specific function is to bind to a site on the messenger RNA (mRNA) sequence to be translated. Because the function of 16S rRNA is essential for protein synthesis, all bacteria contain the gene that codes for it.

16S Marker Gene Survey

16S ribosomal RNA is part of the small ribosomal subunit required for translation. The 16S gene refers to the DNA that codes for the RNA, not the actual subunit. This gene is necessary for bacteria to survive, so all bacteria have 16S genes. The universal nature of the 16S gene allows it to be used as a biomarker: small variations in the DNA sequence of the gene are sufficient to differentiate distinct organisms, much like fingerprints in humans. The gene is also similar enough across bacteria that it is easily targeted for sequencing. To survey the bacterial community, Biota extracts all DNA from the microbes in a sample. The DNA that codes for the 16S gene is targeted, copied in a process called PCR amplification, and sequenced. The sequences are then sorted based on differences in the amplified DNA and compiled into a list of unique DNA markers representing different bacteria. This process results in hundreds to thousands of DNA markers, which enables comparison of samples based on the markers and the development of complex models based on bacterial community composition.

Biomass

The DNA that Biota sequences originates from microbes. The first step in our analytical process is to extract the DNA from the microbes. Biota’s produced fluid and well cutting samples are low biomass, meaning the concentration of microbes in our starting volumes is much lower than in typical human and environmental samples (such as soil). As a result, most of Biota’s samples do not yield enough DNA after the DNA extraction process to move directly to sequencing. The DNA must be amplified, or copied millions of times, through a process called Polymerase Chain Reaction (PCR). Internal controls ensure that we select samples with sufficient and amplifiable DNA biomass for further analysis.

DNA

Deoxyribonucleic acid (DNA) is a double-stranded molecule made up of two polynucleotide strands. The strands are coiled around each other, forming a double helix structure. Each strand of DNA is made up of individual nucleotides, and each nucleotide is composed of a five-carbon sugar, a phosphate group, and a nitrogenous base. Each nucleotide may contain a different nitrogenous base; there are four bases (cytosine [C], guanine [G], adenine [A], and thymine [T]). DNA is often shown as combinations of A, T, G, and C, for example “ATGCTATCA”. This sequence of letters (or bases) is used as the blueprint for protein synthesis (different sequences of DNA result in different proteins) and ultimately determines the physical form of all life and some viruses.

DNA Extraction

DNA extraction is a multi-step process that begins with microbial cell rupture by chemical and/or physical methods. Microbial DNA is then purified away from other cell components. The purified DNA can then be used for downstream processes like PCR and sequencing. Biota uses sensitive DNA extraction methods to isolate DNA from the very small amounts of microbial cells found in well cutting and produced fluid samples. Rigorous lab processes are used to mitigate microbial contamination and controls are used to monitor for exogenous microbial DNA.

DNA Marker Assignments

Raw DNA sequences obtained from DNA sequencing machines must be grouped, or ‘binned’, into stable DNA markers that can be compared across samples and sequencing runs. At Biota, DNA markers represent sequences of DNA that have 100% sequence matches between samples and can be observed multiple times within a sample.
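The grouping step can be illustrated with a minimal Python sketch. It assumes that a marker is simply an exact (100% identical) read sequence and that "observed multiple times" means at least two reads in a sample; the function name `assign_markers`, the `min_count` threshold, and the toy reads are hypothetical and are not Biota's actual pipeline.

```python
from collections import Counter


def assign_markers(reads_by_sample, min_count=2):
    """Bin identical reads into markers, one count table per sample.

    A sequence is kept as a marker in a sample only if it was observed
    at least `min_count` times there (assumed threshold).
    """
    table = {}
    for sample, reads in reads_by_sample.items():
        counts = Counter(reads)  # exact-match grouping of reads
        table[sample] = {seq: n for seq, n in counts.items() if n >= min_count}
    return table


# Toy reads, far shorter than real 150 base pair reads.
reads_by_sample = {
    "sample_A": ["ATGC", "ATGC", "ATGA", "TTGC", "TTGC", "TTGC"],
    "sample_B": ["ATGC", "ATGC", "TTGC"],
}
marker_table = assign_markers(reads_by_sample)

# Because markers are exact sequences, they can be compared across samples.
shared_markers = set(marker_table["sample_A"]) & set(marker_table["sample_B"])
```

Using the sequence itself as the marker identifier is what keeps markers stable across samples and sequencing runs in this sketch: the same sequence always maps to the same marker.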

DNA Sequencing

Each DNA molecule is composed of many nucleotides linked together. Each nucleotide carries one of four bases, represented by the symbols A, T, G, and C, which can be thought of as the ‘letters’ of a word. A DNA sequencer determines the order of these letters, and each sequence of DNA that it produces from a single DNA molecule is called a ‘read’. A typical sequencing run can produce ~15 million reads that characterize a 150 base pair region of the full 16S rRNA gene (~1,500 base pairs long). Each read is assigned to a specific DNA marker to enable comparisons between samples.

DNA Markers

A genetic marker is a gene or DNA sequence with a known location on a chromosome that can be used to identify individuals or species. It can be described as a variation that can be observed.

Demultiplexing

DNA sequencing technology is designed for high throughput, which means millions of short fragments of DNA are sequenced at once. It would be cost prohibitive and unnecessary to sequence a single sample at a time, so samples are pooled and sequenced simultaneously. To allow multiple samples to be run on a single machine, each sample is marked with a small tag made of manufactured DNA before loading into the sequencer. These tags are sequenced along with the sample’s DNA. After sequencing, the tags are used to reassign each sequence to the sample it came from. This process of unmixing samples is called demultiplexing.
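As a rough illustration, the sketch below demultiplexes a handful of toy reads in Python. It makes a simplifying assumption that the tag sits at the start of each read and must match a known tag exactly; on real instruments the tag is often reported separately from the read. The function name, tag sequences, and sample names are all hypothetical.

```python
def demultiplex(reads, tag_to_sample, tag_length):
    """Reassign pooled reads to samples using the tag at the start of each read."""
    by_sample = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        tag, sequence = read[:tag_length], read[tag_length:]
        sample = tag_to_sample.get(tag)
        if sample is None:
            unassigned.append(read)  # tag not recognized, read cannot be placed
        else:
            by_sample[sample].append(sequence)  # store the read without its tag
    return by_sample, unassigned


# Hypothetical 4-base tags; real tags are longer.
tag_to_sample = {"AAGG": "well_1", "CCTT": "well_2"}
pooled_reads = ["AAGGATGCTA", "CCTTGGGCTA", "AAGGTTTCTA", "GGGGATGCTA"]

by_sample, unassigned = demultiplex(pooled_reads, tag_to_sample, tag_length=4)
```

Reads whose tag matches no sample are set aside rather than guessed at, since assigning them to the wrong sample would distort that sample's community profile.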

Diversity

The diversity of microbes in a community can be measured in terms of the community’s richness and evenness. The richness of a community can be measured by counting the number of unique DNA markers that are present in the sample (e.g. 100 DNA markers). The evenness of a community can be measured by assessing how many times each DNA marker was observed when sequencing the sample. Evenness can be measured using an index that ranges from zero to one, with values closer to zero indicating an uneven distribution of DNA markers within the community and values closer to one indicating that the markers were observed in similar numbers.
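The two measures can be sketched in Python as follows. The text does not name a specific evenness index, so this example assumes Pielou's evenness (Shannon diversity divided by the natural log of richness), which is one common index on a zero-to-one scale; the marker names and counts are made up.

```python
import math


def richness(counts):
    """Number of unique DNA markers observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(S), where H is Shannon diversity."""
    observed = [n for n in counts.values() if n > 0]
    s = len(observed)
    if s < 2:
        return 0.0  # J is undefined for a single marker; 0.0 is a convention chosen here
    total = sum(observed)
    shannon = -sum((n / total) * math.log(n / total) for n in observed)
    return shannon / math.log(s)


even_sample = {"m1": 25, "m2": 25, "m3": 25, "m4": 25}
uneven_sample = {"m1": 97, "m2": 1, "m3": 1, "m4": 1}

# Both samples have the same richness but very different evenness.
same_richness = richness(even_sample) == richness(uneven_sample)
j_even = pielou_evenness(even_sample)      # close to 1
j_uneven = pielou_evenness(uneven_sample)  # close to 0
```

The example shows why both measures are reported: two samples can contain the same number of markers while one is dominated by a single marker.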

Negative Control

A negative control is a sample that is carried through the same laboratory steps as field samples but is not expected to contain DNA from the subsurface. Any DNA detected in a negative control therefore indicates exogenous microbial DNA introduced during sample handling, DNA extraction, or PCR. Biota uses negative controls to monitor for this contamination and to estimate the probability that the DNA in a given sample is a technically introduced artifact (see Subsurface Signal).

PCR Amplification

PCR amplification is the selective amplification of DNA or RNA targets using the polymerase chain reaction. The goal of PCR is to generate the millions of copies of DNA necessary for sequencing. The process of PCR involves the following steps. First, the DNA double helix is separated into single strands through the application of heat. Second, a DNA primer (a short section of DNA that matches a sequence on the target DNA) locates and attaches to a specific gene region of interest within a genome. Third, a polymerase enzyme adds new nucleotides onto each of the single strands, starting from the primer. These steps constitute one PCR cycle and result in the accurate doubling of DNA strands. Multiple PCR cycles are used to amplify low levels of DNA to the high levels required for next-generation sequencing machines.
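Because each cycle doubles the DNA, the number of copies grows as the starting amount times two raised to the number of cycles. The Python sketch below works through that arithmetic. The `efficiency` parameter is an assumption added for illustration (1.0 means perfect doubling, as described above), and the starting and target copy numbers are made up.

```python
import math


def copies_after(start_copies, cycles, efficiency=1.0):
    """Copies after a number of cycles; efficiency=1.0 means exact doubling."""
    return start_copies * (1 + efficiency) ** cycles


def cycles_needed(start_copies, target_copies, efficiency=1.0):
    """Smallest whole number of cycles that reaches the target copy number."""
    if start_copies <= 0:
        raise ValueError("start_copies must be positive")
    if target_copies <= start_copies:
        return 0
    growth_per_cycle = 1 + efficiency
    return math.ceil(math.log(target_copies / start_copies) / math.log(growth_per_cycle))


# One template copy becomes 2**10 = 1024 copies after 10 ideal cycles.
ten_cycles = copies_after(1, 10)

# Going from 100 copies to 100 million requires a million-fold increase,
# which takes 20 ideal cycles because 2**20 is just over one million.
n_cycles = cycles_needed(100, 10**8)
```

The exponential growth is why a low-biomass sample with very little starting DNA can still reach sequencing quantities in a few dozen cycles.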

Rarefaction

Rarefaction is the process by which a set number of sequences is randomly selected from each sample. The number of sequences in a sample varies due to differences in sample processing, collection, or sequencing, and may not reflect the true number of sequences in the sample. To account for these differences, the same number of sequences is randomly selected, without replacement, from every sample. If a sample does not contain the minimum number of sequences for rarefaction, it is discarded.
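A minimal Python sketch of this procedure is shown below. It assumes each sample is stored as a dictionary of marker counts; the function name `rarefy`, the depth of 40, and the counts are hypothetical, and the fixed seed is only there to make the example repeatable.

```python
import random
from collections import Counter


def rarefy(sample_counts, depth, seed=0):
    """Randomly draw `depth` sequences, without replacement, from each sample.

    Samples with fewer than `depth` sequences are discarded.
    """
    rng = random.Random(seed)
    rarefied = {}
    for sample, counts in sample_counts.items():
        if sum(counts.values()) < depth:
            continue  # not enough sequences, sample is dropped
        # Expand counts into one entry per sequence, then subsample.
        pool = [marker for marker, n in counts.items() for _ in range(n)]
        rarefied[sample] = dict(Counter(rng.sample(pool, depth)))
    return rarefied


sample_counts = {
    "sample_A": {"m1": 50, "m2": 30, "m3": 20},
    "sample_B": {"m1": 5, "m2": 5},
    "sample_C": {"m1": 10, "m4": 70},
}
rarefied = rarefy(sample_counts, depth=40)
```

After rarefaction every remaining sample has exactly the same number of sequences, so differences between samples are no longer driven by sequencing depth.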

Relative Abundance

Relative abundance refers to data that has been converted to ratios. The data that Biota generates represents the counts of sequences, or markers, generated by a sequencing machine. These counts can vary between samples as an artifact of sequencing and do not necessarily represent the true number of sequences that exist in the environmental sample. Therefore, some analytic techniques require the standardization of the counts across a dataset. Relative abundance is calculated by dividing the count of an individual marker by the sum of the counts of all markers in a given sample. The relative abundances of all markers in a single sample therefore always sum to one.
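The calculation can be written in a few lines of Python. The function name and the counts below are illustrative only.

```python
def relative_abundance(counts):
    """Divide each marker's count by the total count in the sample."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no sequences")
    return {marker: n / total for marker, n in counts.items()}


counts = {"marker_1": 60, "marker_2": 30, "marker_3": 10}
abundances = relative_abundance(counts)

# marker_1 accounts for 60 of 100 sequences, a relative abundance of 0.6.
total_abundance = sum(abundances.values())
```

A sample sequenced twice as deeply would have counts of 120, 60, and 20 but the same relative abundances, which is what makes the values comparable across samples.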

Sample Pooling

Sample pooling is the process by which samples of amplified DNA are mixed and loaded into a single DNA sequencer. These DNA sequencers are designed to sequence millions of short fragments at once. To ensure the data can be differentiated after sequencing, the DNA from each sample carries a small identification tag made of manufactured DNA. These DNA tags are referred to as barcodes since they act as unique identifiers for a sample. Note: this is a separate process from the pooling of physical samples (for example, combining well cuttings in order to increase the available amount of DNA biomass). Pooling of physical samples results in a new sample that cannot later be returned to its constituent parts. Pooled DNA samples, by contrast, can later be separated into the samples that went into them.

Sequences Per Sample

Amplified DNA is sequenced on a next-generation DNA sequencing machine, the Illumina MiSeq, which allows us to determine the sequence of nucleotides on a DNA strand. Samples with too few DNA sequences may not have been sequenced correctly or adequately and may bias downstream analysis. For these reasons, we calculate the sequences per sample and remove samples that do not pass our internal cut-off.
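The filtering step might look like the following Python sketch. The internal cut-off is not stated here, so the `min_sequences` value of 1,000 is purely a placeholder, as are the function name and the counts.

```python
def filter_by_sequence_count(sample_counts, min_sequences):
    """Compute sequences per sample and keep samples that meet the cut-off."""
    per_sample = {s: sum(c.values()) for s, c in sample_counts.items()}
    kept = {s: c for s, c in sample_counts.items() if per_sample[s] >= min_sequences}
    return kept, per_sample


sample_counts = {
    "sample_A": {"m1": 9000, "m2": 3000},
    "sample_B": {"m1": 40, "m2": 15},
    "sample_C": {"m3": 1000},
}
# 1000 is a placeholder threshold, not Biota's actual cut-off.
kept, per_sample = filter_by_sequence_count(sample_counts, min_sequences=1000)
removed = sorted(set(sample_counts) - set(kept))
```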

Subsurface Signal

A critical aspect of Biota’s technical foundation is the assertion that the biomass and DNA that we recover from samples derive from the subsurface and not from technically introduced artifacts. Our stringent negative controls allow us to estimate the probability of technically introduced artifacts for each sample and to remove samples without adequate subsurface DNA.

Taxonomy Assignment

Taxonomy refers to a classification system used to identify organisms and determine how closely related they are to one another. The taxonomic levels are kingdom, phylum, class, order, family, genus, and species, which represent increasing levels of similarity. Genus-level classification results in groups that are closely related, whereas family-level classification will contain more diverse organisms. The groups form a hierarchy; for example, a single family will contain multiple genera, and a single genus will contain multiple species. Reference databases contain DNA marker sequences associated with taxonomic names determined by studies. Taxonomy assignment is the process of using algorithms to compare an operator’s DNA markers to the reference database to assign the current taxonomic classification.
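The hierarchy can be illustrated with a short Python sketch that sums marker counts at a chosen taxonomic level. It assumes each marker has already been assigned a lineage; the lineage names (such as Family_A and Genus_A1) are placeholders rather than real taxa, and the function name is hypothetical. It does not show the assignment algorithm itself.

```python
from collections import defaultdict

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def collapse_to_rank(marker_counts, marker_lineage, rank):
    """Sum marker counts over all markers that share a lineage down to `rank`."""
    depth = RANKS.index(rank) + 1
    collapsed = defaultdict(int)
    for marker, n in marker_counts.items():
        lineage = marker_lineage.get(marker, ())
        # Markers without a classification at this level are grouped together.
        key = "; ".join(lineage[:depth]) if len(lineage) >= depth else "Unassigned"
        collapsed[key] += n
    return dict(collapsed)


shared = ("Kingdom_X", "Phylum_X", "Class_X", "Order_X")  # placeholder names
marker_lineage = {
    "m1": shared + ("Family_A", "Genus_A1"),
    "m2": shared + ("Family_A", "Genus_A2"),
    "m3": shared + ("Family_B", "Genus_B1"),
}
marker_counts = {"m1": 5, "m2": 3, "m3": 2, "m4": 1}  # m4 has no assignment

by_family = collapse_to_rank(marker_counts, marker_lineage, "family")
by_genus = collapse_to_rank(marker_counts, marker_lineage, "genus")
```

Here the two genera in Family_A are counted separately at the genus level but merge into a single, more diverse group at the family level.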
