If I have an article, how can I count the number of characters (spaces included) and the alphabetic characters it uses? Thank you

Question

Olvyan Abraham on 7 Oct 2017

Direct link to this question

https://es.mathworks.com/matlabcentral/answers/360107-if-i-have-an-article-how-can-i-count-the-number-of-characters-space-included-and-alphabets-it-use

Commented: Walter Roberson on 7 Oct 2017

Accepted Answer: Walter Roberson

Please help me, I'm really new at this. I don't know anything.

4 comments (the 2 oldest are not shown)

Walter Roberson on 7 Oct 2017

Okay, so for the .doc file, are you supposed to include any storage taken up by headers and formatting codes? .doc files can contain a lot of things that control how the document should be printed, and they can include spreadsheets and images inside them. If you want to deal only with the "rendered" result, then you have the problem that the rendered result can include things that are not characters (such as lines), or that just happen to look the same as a character sequence. Should the short line that looks like --> but is drawn all as one piece be counted as three characters because that is the closest character representation? Even if it took (say) 12 characters to represent inside the file?

Olvyan Abraham on 7 Oct 2017

Edited: Olvyan Abraham on 7 Oct 2017

For the headers and formatting codes, no; for the short line, yes. For example, suppose the article is written like this:

_Hello, i'm abraham~

I want my program to count all the characters above and say that there are 20 characters used to form the sentence above (every kind of space, special characters, the comma (,) and the apostrophe (') are counted as characters too).

Answer 1

Walter Roberson on 7 Oct 2017

Direct link to this answer

https://es.mathworks.com/matlabcentral/answers/360107-if-i-have-an-article-how-can-i-count-the-number-of-characters-space-included-and-alphabets-it-use#answer_284665

Counting alphabetic characters is a much trickier task than it appears. There are (human) languages for which characters such as ü are not considered alphabetic, there are other human languages for which they are considered alphabetic, and there are yet other human languages which use some characters like that but consider them to be pairs of alphabetic characters, such as ue. Furthermore, there are human languages where pairs of characters might be written separately for storage purposes but are technically considered to be single characters. For example, if you see ao in a Swedish text, then that is considered to be a representation of the single alphabetic character å.

You also have the problem that there are multiple valid representations for some characters in Unicode. For example, ü is U+00FC, but it can also be written as u followed by COMBINING DIAERESIS (U+0308). There are some characters that involve combining more than two code points; and then there are the "stroke" representations for some writing systems, in which the stroke order can vary.
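To make the ü example concrete, here is a minimal sketch in Python (chosen only for illustration; it is not MATLAB code, and the helper name `to_canonical` is made up for this example). It uses the standard-library `unicodedata.normalize` function to show that the two representations differ as stored but compare equal once both are brought to the same normalization form.

```python
import unicodedata

precomposed = "\u00fc"   # ü as the single code point U+00FC
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS (U+0308)

# The two strings render the same but are stored differently.
print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False


def to_canonical(text):
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


# After normalization the two representations are identical.
print(to_canonical(precomposed) == to_canonical(decomposed))  # True
```

Counting characters before normalizing would report one character for the first string and two for the second, even though a reader sees the same letter.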

Because of these crazy differences, you cannot count alphabetic characters properly until you know which language you are dealing with and all the applicable single-to-multi and multi-to-single rules for that language. You then have to translate the input into a canonical representation and examine the characters against a table of what that language considers to be alphabetic.
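The "canonical representation plus per-language table" idea could be sketched roughly as follows, again in Python for illustration. The `ALPHABETS` tables and the `count_alphabetic` function are hypothetical and heavily simplified: they handle only lowercase single-code-point letters and ignore the digraph rules mentioned above.

```python
import unicodedata

# Hypothetical, simplified alphabet tables (lowercase letters only).
ALPHABETS = {
    "english": set("abcdefghijklmnopqrstuvwxyz"),
    # Swedish treats å, ä and ö as letters in their own right.
    "swedish": set("abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"),
}


def count_alphabetic(text, language):
    """Count characters that the given language's table treats as letters."""
    # Step 1: bring the input to one canonical form (NFC) and fold case.
    canonical = unicodedata.normalize("NFC", text).lower()
    # Step 2: test each character for membership in the language table.
    alphabet = ALPHABETS[language]
    return sum(1 for ch in canonical if ch in alphabet)


# "Smörgåsbord" stored with combining marks (decomposed form).
word = "Smo\u0308rga\u030asbord"
print(len(word))                          # 13 code points as stored
print(count_alphabetic(word, "swedish"))  # 11
print(count_alphabetic(word, "english"))  # 9
```

The same stored text yields three different numbers depending on whether you count raw code points, Swedish letters, or English letters, which is why the language has to be fixed before counting.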

Automatically determining which language is being used based upon which characters appear is difficult...

Ah, and the above assumes that the characters have been represented in Unicode (perhaps in one of the UTF encodings). That is not necessarily the case: the characters might be in one of the various Code Pages, and it might not be obvious which Code Page is in use. I once spent a bunch of time on some code that tried to figure out which Code Page was in use by assuming that the bytes represented text (and standard controls like newline) and looking carefully at which bytes occurred: if a byte is unassigned in a particular code page, then the presence of that byte rules out that code page as a possibility. But it turned out I never had a use for that code.
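The elimination idea can be sketched in a few lines. This is not the code described above (which is not shown in the thread), only a rough Python illustration that relies on Python's built-in codecs raising `UnicodeDecodeError` for bytes they do not assign. The candidate list and the function name `plausible_code_pages` are arbitrary choices for the example.

```python
# Candidate encodings to test; this short list is an arbitrary choice.
CANDIDATES = ["ascii", "utf-8", "cp1252", "cp437"]


def plausible_code_pages(raw, candidates=CANDIDATES):
    """Return the candidate encodings under which every byte of raw decodes."""
    survivors = []
    for name in candidates:
        try:
            raw.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue  # an unassigned or invalid byte rules this candidate out
        survivors.append(name)
    return survivors


# Byte 0x81 is "ü" in cp437, is unassigned in Python's cp1252 codec,
# is outside ASCII, and is not valid on its own in UTF-8.
sample = b"f\x81r"
print(plausible_code_pages(sample))    # ['cp437']
print(plausible_code_pages(b"hello"))  # all four candidates survive
```

Elimination only narrows the field. Pure ASCII text is consistent with every candidate here, and an encoding whose codec assigns all 256 byte values (Python's `latin-1`, for instance) can never be ruled out this way.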

... and then you get to deal with the problem that human languages often allow other languages to be quoted. I mentioned earlier that ao in Swedish text is a representation of the å character. That is true, but it might happen that the text quotes the English words "intraocular" or "aorta" or "chaos", or the Italian "ciao", or the particle physics term "kaon", and within the quoted words, the pair must be counted as two letters, not one.

1 comment

Walter Roberson on 7 Oct 2017

If you can restrict yourself to one language and the language does not have any composite characters or digraph substitutions, then you have several phases:

1. read the doc file into a uint8 or character array
2. remove everything from the array that is considered headers or control information, preserving space that is not part of control information
3. do the counting

Hint: ismember()
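For the counting phase only, the membership-test idea behind the hint could look like the sketch below. It is written in Python rather than MATLAB purely for illustration, it assumes the first two phases have already produced plain text, and it assumes a single language whose letters are exactly the ASCII letters. The function names are made up for this example.

```python
import string


def count_characters(text):
    """Total number of characters, counting spaces and punctuation."""
    return len(text)


def count_letters(text, letters=string.ascii_letters):
    """Number of characters that are members of the allowed letter set."""
    allowed = set(letters)
    return sum(1 for ch in text if ch in allowed)


def distinct_letters(text, letters=string.ascii_letters):
    """Sorted list of the different letters used, ignoring case."""
    allowed = set(letters)
    return sorted({ch.lower() for ch in text if ch in allowed})


sample = "_Hello, i'm abraham~"
print(count_characters(sample))  # 20
print(count_letters(sample))     # 14
print(distinct_letters(sample))  # ['a', 'b', 'e', 'h', 'i', 'l', 'm', 'o', 'r']
```

The total of 20 matches the count the asker expects for the sample sentence. In MATLAB, the membership test in `count_letters` corresponds to calling ismember() on the character array with the set of allowed letters and summing the resulting logical mask.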
