problem calculating nucleotide percentages

Hi all,
I am writing a program to do several things, first off to calculate the percentage of 4 nucleotides of RNA (A, U, C, G), given a DNA input(A, C, G, T), and output it to a new file, showing the RNA sequence and nucleotide percentages (ie. 50% U). I have the following code, but I cannot get it to correctly calculate the frequency—I don't know how to get the total number of nucleotides, since I can't get it work with length(filename)....how Can I fix this? See code below.
Thanks
%DNA Transcription
%Goal: read in a file to convert from DNA to RNA:
% C -> G
% G -> C
% T -> A
% A -> U
dnaFile = 0;
while dnaFile < 1
filename = input('Open file: ', 's');
[dnaFile, message] = fopen(filename, 'r');
if dnaFile == -1
disp(message)
end
end
%if input file has extension .dna, switch to .rna
if length(filename) > 4 && strcmp(filename(end-3 : end), '.dna')
outputFile = [filename(1:end-4) '.rna']
else
outputFile = [filename '.rna']
end
[rnaFile, message] = fopen(outputFile, 'w');
[base, num] = fread(dnaFile, 1, 'char');
countG = 0;
countC = 0;
countA = 0;
countU = 0;
while num > 0
if base == 'C'
out = 'G';
countG = countG + 1;
elseif base == 'G'
out = 'C';
countC = countC + 1;
elseif base == 'T'
out = 'A';
countA = countA + 1;
elseif base == 'A'
out = 'U';
countU = countU + 1;
else
out = '_';
end
freqG = countG ./ length(dnaFile);
freqC = countC ./ length(dnaFile);
freqA = countA ./ length(dnaFile);
freqU = countU ./ length(dnaFile);
%fprintf('percentG: %.5f \n percentC: %.5f \n percentA: %.5f \n percentU: %.5f \n', freqG, freqC, freqA, freqU);
fwrite(rnaFile, out);
[base, num] = fread(dnaFile, 1, 'char');
end
fclose(dnaFile);
fclose(rnaFile);

1 comentario

"...I can't get it work with length(filename)..."
...
freqG = countG ./ length(dnaFile);
...
dnaFile is a file handle which is an integer stored in default datatype of double (see doc for fopen). Hence, length()--> 1
The length is either sum of all the characters you found if the percentage is to be based on the total string length including the extraneous or the sum of the four sequence counts if not. As shown, however, you can get these more efficiently than by explicit looping.

Iniciar sesión para comentar.

 Respuesta aceptada

dpb
dpb el 22 de Abr. de 2017
Editada: dpb el 23 de Abr. de 2017
[rnaFile, message] = fopen(outputFile, 'w');
While not your problem, should check for the output file opening successfully as well as input...
[base, num] = fread(dnaFile, 1, 'char');
The above reads one character, but returns it as a double, not a character.
Use
[base, num]=fread(dnaFile,'*char');
to read whole file into a character array.
while num > 0
if base == 'C'
...
The above starts an infinite loop as num=1 and there's nothing inside the loop that ever changes num so it stays there forever...oh, although it will eventually error out on EOF on the read...scanned too quickly first.
You can loop, but Matlab is vectorized; may as well make use of it.
% rewrite the rules for convenience C -> G; G -> C; T -> A; A -> U
out=repmat('_',size(base)); % this is else case...we'll overwrite everything besides
out(base=='C')='G'; % use logical addressing to locate each and write output
out(base=='G')='C';
out(base=='T')='A';
out(base=='A')='U';
freqG=sum(out=='G');
freqC=sum(out=='C');
freqA=sum(out=='A');
freqU=sum(out=='U');
totalNum=sum([freqG freqC freqA freqU]);
freqG=sum(out=='G')/totalNum;
freqC=sum(out=='C')/totalNum;
freqA=sum(out=='A')/totalNum;
freqU=sum(out=='U')/totalNum;
ADDENDUM:
Couldn't see an issue otomh so did a trial with just made up sequence...
>> dna=['A', 'C', 'G', 'T']; % the four letters start with
>> dna=repmat(dna,1,10); % make longer sequence from them
>> dna=dna(randperm(length(dna))); % and then scramble 'em up
>> dna(randperm(length(dna),3))='_'; % a few other characters for spice
>> dna % what we start with is then ...
dna =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC
The algorithm above starts here:
>> rna=repmat('_',size(base)); % replace the non-wanted letters
>> rna(dna=='C')='G';
>> rna(dna=='G')='C';
>> rna(dna=='T')='A';
>> rna(dna=='A')='U';
>> [dna;rna] % see what we got...
ans =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG
Looks like what problem statement asked for...compute frequency
Did this in "more Matlab-y" way; order is alphabetic to satisfy histc
>> RNA='ACGU'; % the letters to use as bin centers must be increasing order
>> freq=histc(rna,RNA)
freq =
9 9 10 9
>> freq=freq/sum(freq)
freq =
0.2432 0.2432 0.2703 0.2432
>>
>> [RNA.' num2str(freq.','%.4f')] % display results tabulated
ans =
A 0.2432
C 0.2432
G 0.2703
U 0.2432
>>
If order is important revert to previous or a "cute" way would be to use the categorical datatype--
>> rnac=categorical(cellstr(rna),{'G';'C';'A';'U';'_'});
>> summary(rnac)
G 10
C 9
A 9
U 9
_ 3
>>
ADDENDUM 2:
And, the yet more Matlab-y way vectorizes the translation via lookup...
>> DNA=['C','G','T','A','_']; % base characters in sequence 1
>> RNA=['G','C','A','U','_']; % corresponding characters in sequence 2
>> RNA(arrayfun(@(c) find(c==DNA),dna)) % translate one to other...
ans =
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG

19 comentarios

John Jamison
John Jamison el 23 de Abr. de 2017
I tried this and it does not work
dpb
dpb el 23 de Abr. de 2017
Editada: dpb el 23 de Abr. de 2017
Well, that's of little help; starting with what is the file format for certain and what is the reason you think it doesn't? Error, incorrect answer, what????
ADDENDUM Overnight one issue came to me...the divisor isn't total number characters in the file (I don't think the underscores should count, should they??? If they should above seems ok). See the addendum to Answer for refinement, example if the data are format as above presumes based on your code...
John Jamison
John Jamison el 23 de Abr. de 2017
My issue is still with the divisor for the frequency calculation—the length(num) part...That gives wrong ratios for my basepair frequencies: they should add to 100%, but they add to like 25%...idk what to put there...I think it should be length of something to get the whole list of inputs from the text that the user enters...
dpb
dpb el 23 de Abr. de 2017
Editada: dpb el 23 de Abr. de 2017
Why not use one of the implementations I showed? The result for the sample case were
>> freq
freq =
0.2432 0.2432 0.2703 0.2432
>> sum(freq)
ans =
1
>>
so the frequencies do total to unity, precisely. (Again presumes the "_" placeholders are of no consequence in the statistics).
If that is not correct answer, then the problem isn't to compute the frequency of a given letter to the total number of letters, but some other quantity--that wouldn't surprise me, but need to know what the problem statement is.
I'm a NucE by training; I "know nuthink!" of sequence counting so what are "basepair frequencies"...tell us the rules for what you want to compute as well as what the inputs are and we can compute it.
John Jamison
John Jamison el 23 de Abr. de 2017
Editada: John Jamison el 23 de Abr. de 2017
Your method seems too complicated for me and my prof would likely be suspicious. My question is what to change in my code above to get the divisor for my frequency to correctly calculcate the values, which is an issue with my current 'length' command.
The basepair frequencies are the frequencies of RNA's 4 base pairs, A, U, C, and G found in the user's DNA text file
dpb
dpb el 23 de Abr. de 2017
"...my prof would likely be suspicious"
Ah! We didn't know this was homework.
"...to get the divisor for my frequency..."
Well, still not sure what divisor is supposed to be for certain; you've not answered the questions raised, at least in a manner that tells me what they're actually supposed to be.
Is, or is not the answer I showed for the sample case correct?
If so, can show how to get it with something approximating you code; if not, given the example sequence I gave (presuming that looks anything at all like an input sequence which I presume it should at least contain the right values if not in a meaningful/realistic order), what would be the correct answer for that sequence and how did you determine it?
John Jamison
John Jamison el 23 de Abr. de 2017
I am having trouble getting my code to calculate the total number of base pairs in the given sequence: ie. given this: AGTCAGTCAG, getting the computer to know there are 10 base pairs total there, and the percent of A would be 3/10, or 0.3000% A....that is what I want my program to do.
As of now, my program can recognize there is 3 occurrences of A, but it cannot tell there is 10 total base pairs in the sequence..do you get my dilemma now?
the issue I have with your code is that it randomly generates DNA base pairs on its own, whereas what I want is analyzing the user's DNA base pairs input and returning the RNA complements and percent of each base component.
dpb
dpb el 23 de Abr. de 2017
Editada: dpb el 23 de Abr. de 2017
OK, although that's the wrong sequence letter set, isn't it?
I am/was totally aware of what your perceived problem is; you just wouldn't answer the question required to know what the proper answer really is.
If you're adamant in using your code in the loop, then the divisor would be have to be computed as incrementing a counter in each of the subsections that finds a litter match. That sum in the end will be the total.
Or, before computing the frequency, the total is simply the sum of the four counts you've collected--
nTotal=sum([countG,countC,countA,countU]);
freqX=countX/nTotal;
I'd still strongly urge you to read the full file at once and then use one of the ways outlined above to process it..."it's the Matlab way!" and would be a useful tutorial in your Matlab education once you grasp the techniques.
Creating multiple variables with sequential or similar naming schemes in Matlab is a sure sign the data structure isn't being done correctly.
Anyway, the rule appears to be "it is a global percentage using the four characters and ignoring any others in the total count". I wasn't positive that the nonmember characters didn't have some role to play in dividing substrings over which total were needed to be accumulated--if that's not the case then it's pretty simple; you just have to count only the ones that are sieved out; not the entire length of the possible input string which was the first assumption.
Sorry if seemed to harp on you, but we only can know what you tell us unless happen to work in the field and so precision in the problem description is vital to get a correct answer and didn't seem to make sense to give the right answer to the wrong question.
John Jamison
John Jamison el 23 de Abr. de 2017
Ok , thank you.
How can I get my code to incorporate the percentages, which I display with fprintf, into the file I make (the .rna one)? See the above code I have originally for background..
dpb
dpb el 23 de Abr. de 2017
Editada: dpb el 23 de Abr. de 2017
Just add the output filehandle variable in the call...oh, excepting you wrote it as stream file, not formatted. Just do the same thing as did for the other value excepting again you see the problem with naming variables as did; you've got to deal with them all individually, thereby losing all the advantages of Matlab's vectorized operations...
fwrite(fid,[freqG,freqC,freqA,freqU]);
John Jamison
John Jamison el 23 de Abr. de 2017
I get this issue...
dpb
dpb el 23 de Abr. de 2017
Editada: dpb el 24 de Abr. de 2017
OK, I guess I'm to blame for not pointing out the "rest of the story" -- you've got to tell it the data type to write.
Oh, and did you fix the FOPEN problem I pointed out initially in that while you read a character, it is stored internally on reading in Matlab as a double as you wrote the code?
ERRATUM Oops, typo... "did you fix the FOPEN problem". It's in FREAD where the problem actually is... 'char' will read a character/byte but Matlab will store that in a double. I'm guessing that's the root cause of your problem below in getting extra bytes; you're writing more than one byte each FWRITE, just writing a double as 'char'. As noted then, the '*char' will also then store that character as type CHAR in memory when read which should fix the problem unless there are already bum characters in the input file form such a case before.
Anyways, try
fwrite(fid,[freqG,freqC,freqA,freqU],'char');
and see if joy ensues. Of course, this is still an unformatted file so if you're expecting to use it in text editor, etc., etc., you'll be better off using formatted i/o -- that's fprintf and friends, not fread/fwrite
John Jamison
John Jamison el 23 de Abr. de 2017
Shoot. Now it looks like this..
dpb
dpb el 23 de Abr. de 2017
Editada: dpb el 23 de Abr. de 2017
Attach an input file...can't duplicate what you've done 'cuz don't have matching data.
Show output from
whos out
whos whateverTheVariableFreqIs/Are
to see what internal storage actually is. As you can see, the letters are every so many characters in the file which means you either have some additional data in the output vector or the data type isn't character. It's variable because there are nonprinting bytes in the file that a text file tool doesn't know how to handle. Now, what is indeterminate is whether those are in the original file (how was it created?) or an artifact of what you've done.
But it all kinda' begs the question of why are you using unformatted files instead of formatted it you think you have need to view them, anyway???
Well, I just did some checking on ML behavior to refresh memory--
fwrite(fid,var)
by default writes unsigned char and if
var='A'; % store character into default double
then
fwrite(fid,var)
leaves only the one byte 'A' in the file as desired. Ergo, the other stuff is coming from either
  1. There are already funky characters (bytes) in the input file you're just passing thru, or
  2. There's something in your processing code inserting non-ASCII printing bytes.
To see which, need the input file (actual attached copy, not just type in what you think it contains typed out) and the actual code that was run to create the output file. Wouldn't hurt to attach it as well so can inspect it closely.
John Jamison
John Jamison el 24 de Abr. de 2017
Editada: John Jamison el 24 de Abr. de 2017
My input file is the simple string below, saved as a .txt:
ACTTAAAGGTTTCCCTTAAAATTCCGGTTAAA
dpb
dpb el 24 de Abr. de 2017
REREAD THE ABOVE COMMENTS!
As said, that's not enough from which to diagnose your ills; if that is the actual input file, then your processing has to be introducing the nonprinting characters but we don't have the EXACT code you ran to create the output file so can't debug it; nor do we have your actual files, neither input nor output.
I'm happy to try to help, but it's got to be a two-way street here...you've got to contribute your part.
John Jamison
John Jamison el 24 de Abr. de 2017
I figured it out.
Thanks for all your help. My next question is regarding searching through the file for certain sequences..see my new post for that.
Thanks
dpb
dpb el 24 de Abr. de 2017
Editada: dpb el 24 de Abr. de 2017
So, what was the problem?
"Enquiring minds" and all that... :)

Iniciar sesión para comentar.

Más respuestas (0)

Categorías

Productos

Etiquetas

Preguntada:

el 22 de Abr. de 2017

Editada:

dpb
el 24 de Abr. de 2017

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by