DNA encoders

This package can encode DNA sequences into the following representations: Pc3mer, Pc3mer stats, PseKNC, and k-mers.

Input sequences must be in FASTA format, with each sequence on a line of 58 nucleotides, in a file with the extension .fna or .fasta. Example:

>Sigma++
GGTTTATTGCCTTGCAGCTGGCGAGAGACGGTATTGCTCATGCACAAGCCTTGTTCAG
>Sigma24
TGCCCTGACTTCACCCCGCTGTGTCTGCTTTTCCCGACTATTCTTAATGAGCTTCGAT
>Sigma--
AATGTGGATAGATATGAATTATTTTTCTCCTTAAGGATCATCCGTTATTTGGGTCGTT
>Sigma70
CAGTTATTTACCTTACTTTACGCGCGCGTAACTCTGGCAACATCACTACAGGATAGCG
>Sigma++
AAAAAGTTATACGCGGTGGAAACATTGCCCGGATAGTCTATAGTCACTAAGCATTAAA
...
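
To check an input file against this layout before encoding, a small reader can be sketched as follows. The function name read_fasta_records and its expected_length parameter are hypothetical and not part of this package; the sketch also assumes one sequence line per header, as in the example above.

```python
def read_fasta_records(lines, expected_length=58):
    """Parse FASTA lines into (header, sequence) pairs.

    Hypothetical helper, not part of this package. Assumes each header
    line is followed by exactly one sequence line.
    """
    records = []
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            continue
        if header is None:
            raise ValueError("sequence line found before any header")
        if len(line) != expected_length:
            raise ValueError(
                f"sequence for {header!r} has {len(line)} nucleotides, "
                f"expected {expected_length}"
            )
        records.append((header, line.upper()))
        header = None  # the next sequence needs its own header
    return records


example = [
    ">Sigma++",
    "GGTTTATTGCCTTGCAGCTGGCGAGAGACGGTATTGCTCATGCACAAGCCTTGTTCAG",
    ">Sigma24",
    "TGCCCTGACTTCACCCCGCTGTGTCTGCTTTTCCCGACTATTCTTAATGAGCTTCGAT",
]
records = read_fasta_records(example)
```

To validate a file on disk, pass an open file handle instead of a list, since iterating a handle also yields lines.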

After encoding the sequences, the algorithm stores them at folder_for_output/output/encoder_name/.

Encoders

Pc3mer

Using a table [1] of twelve physicochemical property values for each 3-mer, we standardize the values and compute pc3mer by decomposing the input sequence into 3-mers and replacing each 3-mer, in the order it appears in the sequence, with its value for a given physicochemical property.

The physicochemical properties are Bendability-DNAse, Bendability-consensus, Trinucleotide GC Content, Nucleosome positioning, Consensus_roll, Consensus_Rigid, Dnase I, Dnase I-Rigid, MW-Daltons, MW-kg, Nucleosome, and Nucleosome-Rigid [1].
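
The idea can be sketched for a single property. The example below does not read the table from [1]; as a stand-in it derives trinucleotide GC content by counting G and C in each 3-mer. The function names, the overlapping sliding window (step of one nucleotide), and the use of the population standard deviation for standardization are all assumptions, not the package's confirmed behavior.

```python
from itertools import product
from statistics import mean, pstdev


def standardized_gc_table():
    """Build a z-scored GC-content value for each of the 64 3-mers.

    Stand-in for one column of the property table; the values are
    computed here, not taken from reference [1].
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=3)]
    raw = {kmer: sum(base in "GC" for base in kmer) for kmer in kmers}
    mu = mean(raw.values())
    sigma = pstdev(raw.values())
    return {kmer: (value - mu) / sigma for kmer, value in raw.items()}


def encode_pc3mer(sequence, table):
    """Replace each overlapping 3-mer, in order, with its property value."""
    sequence = sequence.upper()
    return [table[sequence[i:i + 3]] for i in range(len(sequence) - 2)]


table = standardized_gc_table()
encoded = encode_pc3mer(
    "GGTTTATTGCCTTGCAGCTGGCGAGAGACGGTATTGCTCATGCACAAGCCTTGTTCAG", table
)
```

Under the sliding-window assumption, a 58-nucleotide sequence yields 56 values per property.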

Example 1:

Example 2:

Usage:

from pc3mer import Pc3mer
import os

input_fasta = "input_file.fasta"  # path + file name
folder_for_output = os.path.join(os.getcwd(), "output")  # path to store the output

encoder = Pc3mer(folder_for_output=folder_for_output)
encoder.encode_fasta_file(input_fasta, store_encode_by_indiv_prop=True)

Output: it creates the folder “Pc3mer” in the output folder and twelve .md files, each containing the encoded sequences for one of the properties. Examples: output/Pc3mer/Bendability-consensus.md and output/Pc3mer/Dnase I.md.

Pc3mer stats

Encodes a sequence into pc3mer and then computes a set of statistics over the encoded sequence. The statistics are:

Usage:

from pc3mer import Pc3mer
import os

input_fasta = "input_file.fasta"  # path + file name
folder_for_output = os.path.join(os.getcwd(), "output")  # path to store the output

encoder = Pc3mer(folder_for_output=folder_for_output)
encoder.convert_fasta_file_to_pc3mer_stats(input_fasta)
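
As a rough illustration of summarizing an encoded sequence, the sketch below computes a few generic summary statistics over a list of pc3mer values. The choice of statistics (mean, standard deviation, minimum, maximum) and the function name summarize_encoding are assumptions made for illustration; they may not match what convert_fasta_file_to_pc3mer_stats computes.

```python
from statistics import mean, pstdev


def summarize_encoding(values):
    """Return generic summary statistics for one encoded sequence.

    Hypothetical helper; the statistics chosen here are an assumption.
    """
    return {
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }


stats = summarize_encoding([1.0, 2.0, 3.0])
```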

PseKNC

This implementation of PseKNC I [1], [2] decomposes the sequence into 3-mers and maps each 3-mer to its physicochemical property values, which are used to calculate scores. The scores, called Theta_{i}, are concatenated to the 3-mer counts; Theta_{i} is computed over all pairs of 3-mers that are i nucleotides apart, for i in [1, 2]. The final array is divided by the sum of the 3-mer counts and the Theta scores.
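
The steps above can be sketched as follows. This is a minimal illustration, not the package's code: the function names are hypothetical, the single property used (standardized GC content, computed on the fly) stands in for the property table of [1], and the weight parameter follows the general PseKNC formulation rather than anything stated in this README. Each Theta_{i} is taken here as the mean, over all pairs of 3-mers i positions apart, of the squared property differences averaged over the properties.

```python
from itertools import product
from statistics import mean, pstdev

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]


def standardized_gc_table():
    """Stand-in property: z-scored G/C count per 3-mer (not from [1])."""
    raw = {kmer: sum(base in "GC" for base in kmer) for kmer in KMERS}
    mu, sigma = mean(raw.values()), pstdev(raw.values())
    return {kmer: (value - mu) / sigma for kmer, value in raw.items()}


def pseknc_sketch(sequence, properties, max_distance=2, weight=1.0):
    """Return 64 normalized 3-mer counts followed by the Theta scores."""
    sequence = sequence.upper()
    words = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
    counts = [words.count(kmer) for kmer in KMERS]

    thetas = []
    for distance in range(1, max_distance + 1):
        pairs = list(zip(words, words[distance:]))
        # Squared property difference, averaged over properties, per pair
        correlations = [
            mean((prop[a] - prop[b]) ** 2 for prop in properties)
            for a, b in pairs
        ]
        thetas.append(mean(correlations))

    total = sum(counts) + weight * sum(thetas)
    return [c / total for c in counts] + [weight * t / total for t in thetas]


vector = pseknc_sketch(
    "GGTTTATTGCCTTGCAGCTGGCGAGAGACGGTATTGCTCATGCACAAGCCTTGTTCAG",
    [standardized_gc_table()],
)
```

With 3-mers and distances 1 and 2, the resulting vector has 64 + 2 = 66 entries that sum to 1.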

Usage:

from my_pseknc import Pseknc
import os

input_fasta = "input_file.fasta" # path + file name
folder_for_output = os.path.join(os.getcwd(), "output") # path to store the output
output_file = os.path.join(folder_for_output, "pseknc.csv")

encoder = Pseknc()
encoder.encode_fasta_into_pseknc(input_fasta, output_file)

k-mers

It counts the frequency of every k-mer (every possible “word” of length k) in the DNA sequence, for each k in a given interval.
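
A minimal sketch of this counting, assuming overlapping windows and an ACGT alphabet; the function name kmer_counts is hypothetical and independent of the Kmers class used below.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k_values):
    """Count every possible k-mer (including absent ones) for each k."""
    sequence = sequence.upper()
    features = {}
    for k in k_values:
        # Overlapping windows of length k, moved one nucleotide at a time
        observed = Counter(
            sequence[i:i + k] for i in range(len(sequence) - k + 1)
        )
        # Enumerate all 4**k words so every sequence gets the same columns
        for word in map("".join, product("ACGT", repeat=k)):
            features[word] = observed[word]
    return features


features = kmer_counts("ACGTAC", [1, 2])
```

For "ACGTAC" with k in [1, 2], this gives 4 + 16 = 20 features, with features["AC"] equal to 2 and features["AA"] equal to 0.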

Example 1:

Example 2:

Usage:

from kmers import Kmers
import os

# defining list of ks
k_start = 1
k_end = 5
k_values = list(range(k_start, k_end + 1))

input_fasta = "input_file.fasta"  # path + file name
folder_for_output = os.path.join(os.getcwd(), "output")  # path to store the output
output_file = os.path.join(folder_for_output, f"{k_start}_to_{k_end}_mers.csv")  # no leading "/": it would make os.path.join discard folder_for_output

encoder = Kmers()
encoder.encode_fasta_file(fastafile=input_fasta, list_of_ks=k_values, outputfile=output_file)

References

[1] W. Chen, T. Y. Lei, D. C. Jin, H. Lin, and K. C. Chou, “PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition,” Anal. Biochem., vol. 456, no. 1, pp. 53–60, 2014, doi: 10.1016/j.ab.2014.04.001.

[2] W. Chen, X. Zhang, J. Brooker, H. Lin, L. Zhang, and K.-C. Chou, “PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions,” Bioinformatics, vol. 31, no. 1, pp. 119–120, Jan. 2015, doi: 10.1093/bioinformatics/btu602.