This package can encode DNA sequences into pc3mer, pc3mer statistics, PseKNC, and k-mer frequency representations.
The DNA sequences must be in FASTA format, with sequence lines of 58 nucleotides, in a file with the extension .fna or .fasta. Example:
>Sigma++
GGTTTATTGCCTTGCAGCTGGCGAGAGACGGTATTGCTCATGCACAAGCCTTGTTCAG
>Sigma24
TGCCCTGACTTCACCCCGCTGTGTCTGCTTTTCCCGACTATTCTTAATGAGCTTCGAT
>Sigma--
AATGTGGATAGATATGAATTATTTTTCTCCTTAAGGATCATCCGTTATTTGGGTCGTT
>Sigma70
CAGTTATTTACCTTACTTTACGCGCGCGTAACTCTGGCAACATCACTACAGGATAGCG
>Sigma++
AAAAAGTTATACGCGGTGGAAACATTGCCCGGATAGTCTATAGTCACTAAGCATTAAA
...
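A minimal sketch of how such a file could be read and checked before encoding. It is not part of the package: the function name read_fasta and the constant EXPECTED_LENGTH are hypothetical, and it assumes each header line is followed by exactly one sequence line.

```python
from typing import Iterable, List, Tuple

EXPECTED_LENGTH = 58  # sequence length stated in the format description


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (label, sequence) pairs, assuming one sequence line per header."""
    records = []
    label = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]  # header text is used as the sample class
            continue
        if label is None:
            raise ValueError("sequence line found without a preceding '>' header")
        if len(line) != EXPECTED_LENGTH:
            raise ValueError(
                f"{label}: expected {EXPECTED_LENGTH} nucleotides, got {len(line)}"
            )
        records.append((label, line.upper()))
        label = None  # the next sequence needs its own header
    return records
```

A file can be passed directly, for example `with open("input_file.fasta") as handle: records = read_fasta(handle)`.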
After encoding the sequences, the package stores them in folder_for_output/output/encoder_name/.
Using a table [1] of twelve physicochemical property values for each 3-mer, we standardize the values and compute pc3mer by decomposing the input sequence into 3-mers and replacing each 3-mer, in the order it appears in the sequence, with its value for a given physicochemical property.
The physicochemical properties are Bendability-DNAse, Bendability-consensus, Trinucleotide GC Content, Nucleosome positioning, Consensus_roll, Consensus_Rigid, Dnase I, Dnase I-Rigid, MW-Daltons, MW-kg, Nucleosome, and Nucleosome-Rigid [1].
Example 1:
[0.07230364, 0.26511335, ..., sample_class]
[0.3577835, -0.0969693, ..., sample_class]
[1.73205081, 0.57735027, ..., sample_class]
[0.07230364, 0.26511335, ..., 0.3577835, -0.0969693, ..., 1.73205081, 0.57735027, ..., sample_class]
Example 2:
>Sigma++
GCTGAAAATACGTTGAACGCTTACCGTCGCGATCTGTCAATGATGGTGGAGTGGTTGC
0.919537039821516,0.919537039821516,1.3475397297384397,-0.605222559882525,-2.745236029467144,-2.745236029467144, -2.584735019498298,0.5717848498890153,-0.0702191899863703,0.06353164998766837,0.06353164998766837,..., Sigma++
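The encoding described above can be sketched as follows. This is an illustration, not the package's code. The names TOY_PROPERTY, standardize, and encode_pc3mer are hypothetical, and the property values are made up (the real table in [1] covers all 64 3-mers). It also assumes overlapping 3-mers with a step of one nucleotide and z-score standardization with the population standard deviation.

```python
import statistics
from typing import Dict, List

# Hypothetical raw values of ONE property for a few 3-mers (not taken from [1]).
TOY_PROPERTY: Dict[str, float] = {"GCT": 1.0, "CTG": 3.0, "TGA": 5.0, "GAA": 7.0}


def standardize(table: Dict[str, float]) -> Dict[str, float]:
    """Z-score the property values across all 3-mers in the table."""
    values = list(table.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return {kmer: (value - mean) / std for kmer, value in table.items()}


def encode_pc3mer(sequence: str, table: Dict[str, float]) -> List[float]:
    """Replace each overlapping 3-mer, in order, by its standardized value."""
    return [table[sequence[i:i + 3]] for i in range(len(sequence) - 2)]


table = standardize(TOY_PROPERTY)
encoded = encode_pc3mer("GCTGAA", table)  # 3-mers: GCT, CTG, TGA, GAA
```

Under these assumptions a 58-nucleotide sequence yields 56 values per property. As in Example 1, the per-property vectors can then be concatenated and the sample class appended.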
Usage:
from pc3mer import Pc3mer
import os
input_fasta = "input_file.fasta" # path + file name
folder_for_output = os.path.join(os.getcwd(), "output") # path to store the output
encoder = Pc3mer(folder_for_output=folder_for_output)
encoder.encode_fasta_file(input_fasta, store_encode_by_indiv_prop=True)
Output:
It creates the folder “Pc3mer” in _
Encodes a sequence into pc3mer and then computes a set of statistics over the encoded sequence. The statistics are:
Usage:
from pc3mer import Pc3mer
import os
input_fasta = "input_file.fasta" # path + file name
folder_for_output = os.path.join(os.getcwd(), "output") # path to store the output
encoder = Pc3mer(folder_for_output=folder_for_output)
encoder.convert_fasta_file_to_pc3mer_stats(input_fasta)
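A rough sketch of the idea behind the pc3mer statistics step: one encoded sequence is reduced to a few summary numbers. The statistics the package actually computes are not listed here, so the mean, standard deviation, minimum, and maximum below are placeholders only, and the function name summarize is hypothetical.

```python
import statistics
from typing import Dict, List


def summarize(encoded: List[float]) -> Dict[str, float]:
    """Summarize one pc3mer-encoded sequence (placeholder statistics only)."""
    return {
        "mean": statistics.fmean(encoded),
        "std": statistics.pstdev(encoded),  # assumption: population standard deviation
        "min": min(encoded),
        "max": max(encoded),
    }


stats = summarize([1.0, 2.0, 3.0])
```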
This implementation of PseKNC I [1], [2] decomposes the sequence into 3-mers and maps each 3-mer to its physicochemical property values, which are used to calculate scores. The scores, called Theta_{i}, are appended to the 3-mer counts; each Theta_{i} is computed over all pairs of 3-mers that are i nucleotides apart, for i in [1, 2]. The final array is divided by the sum of the 3-mer counts and the Theta scores.
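The computation described above can be sketched roughly as follows. It is a simplified illustration, not the package's implementation. The function names theta and pseknc, the weight parameter (taken from the general PseKNC formulation and set to 1.0 here), and the one-property table built from GC counts are all assumptions.

```python
from itertools import product
from typing import Dict, List

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 3-mers, AAA..TTT


def split_3mers(sequence: str) -> List[str]:
    """Overlapping 3-mers in the order they appear."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def theta(sequence: str, props: Dict[str, List[float]], lag: int) -> float:
    """Mean squared property difference between 3-mers `lag` positions apart."""
    words = split_3mers(sequence)
    if len(words) <= lag:
        raise ValueError("sequence too short for this lag")
    pairs = list(zip(words, words[lag:]))
    total = 0.0
    for first, second in pairs:
        diffs = [(a - b) ** 2 for a, b in zip(props[first], props[second])]
        total += sum(diffs) / len(diffs)  # average over the properties
    return total / len(pairs)


def pseknc(sequence: str, props: Dict[str, List[float]],
           max_lag: int = 2, weight: float = 1.0) -> List[float]:
    """3-mer counts followed by weighted Theta scores, divided by their sum."""
    words = split_3mers(sequence)
    counts = [float(words.count(kmer)) for kmer in KMERS]
    thetas = [weight * theta(sequence, props, lag) for lag in range(1, max_lag + 1)]
    denominator = sum(counts) + sum(thetas)
    return [value / denominator for value in counts + thetas]


# Hypothetical single-property table: number of G/C bases in each 3-mer.
toy_props = {kmer: [float(kmer.count("G") + kmer.count("C"))] for kmer in KMERS}
vector = pseknc("ACGTACGTAC", toy_props)  # 64 count entries + 2 Theta entries
```

Because every entry is divided by the same denominator, the resulting vector sums to 1.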
Usage:
from my_pseknc import Pseknc
import os
input_fasta = "input_file.fasta" # path + file name
folder_for_output = os.path.join(os.getcwd(), "output") # path to store the output
output_file = os.path.join(folder_for_output, "pseknc.csv")
encoder = Pseknc()
encoder.encode_fasta_into_pseknc(input_fasta, output_file)
It counts the occurrences in the DNA sequence of each k-mer (every possible “word” of length k), for each k in a given interval.
Example 1:
{AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT}
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0]
Example 2:
{A, C, G, T, AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT}
[1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0]
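A minimal sketch of the counting described above, not the package's code. The function name kmer_counts is hypothetical. The input GGGA is an assumption: it is used because it reproduces the two vectors shown in the examples.

```python
from itertools import product
from typing import List


def kmer_counts(sequence: str, ks: List[int]) -> List[int]:
    """Count every possible k-mer, in lexicographic order, for each k in ks."""
    vector: List[int] = []
    for k in ks:
        # Enumerate all 4**k words so absent k-mers still get a zero entry.
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        for i in range(len(sequence) - k + 1):
            word = sequence[i:i + k]
            if word in counts:  # ignore words with non-ACGT characters
                counts[word] += 1
        vector.extend(counts.values())
    return vector


print(kmer_counts("GGGA", [2]))
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0]
print(kmer_counts("GGGA", [1, 2]))
# [1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0]
```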
Usage:
from kmers import Kmers
import os
# defining list of ks
k_start = 1
k_end = 5
k_values = list(range(k_start, k_end + 1))
input_fasta = "input_file.fasta" # path + file name
folder_for_output = os.path.join(os.getcwd(), "output") # path to store the output
output_file = os.path.join(folder_for_output, f"{k_start}_to_{k_end}_mers.csv")
encoder = Kmers()
encoder.encode_fasta_file(fastafile=input_fasta, list_of_ks=k_values, outputfile=output_file)
[1] W. Chen, T. Y. Lei, D. C. Jin, H. Lin, and K. C. Chou, “PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition,” Anal. Biochem., vol. 456, no. 1, pp. 53–60, 2014, doi: 10.1016/j.ab.2014.04.001.
[2] W. Chen, X. Zhang, J. Brooker, H. Lin, L. Zhang, and K.-C. Chou, “PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions,” Bioinformatics, vol. 31, no. 1, pp. 119–120, Jan. 2015, doi: 10.1093/bioinformatics/btu602.