I am trying to extract information from a multi-fasta file (e.g. C/G/A/T count, CG%) using biopython. I keep running into trouble when I try to iterate over the file for each fasta sequence - I can only get the first one to print out.
I suspect it may be something to do with my file format as it is not an actual fasta file, but I don't know how I can change this.
input_file = open("inputfile.fa", 'r') output_file = open('nucleotide_counts.txt','w') output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n') #count nucleotides in this record..gene_name = cur_record.name from Bio import SeqIO for cur_record in SeqIO.parse(input_file, "fasta"): gene_name = cur_record.name A_count = cur_record.seq.count('A') C_count = cur_record.seq.count('C') G_count = cur_record.seq.count('G') T_count = cur_record.seq.count('T') length = len(cur_record.seq) cg_percentage = (float(C_count + G_count) / length)*100 output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \ (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage) output_file.write(output_line) output_file.close() input_file.close()
This is what my multifasta looks like (with start and end specified)
>1:start-end CGCCCCAGTGATGTAGCCGAA >1:start-end CGGCCACCCCGAAGCGTGGGG
And my output file only contains one row:
Gene A C G T Length CG% 1:start-end 85 115 180 59 439 67.198178