If I have an article, how can I count the number of characters (spaces included) and the alphabetic characters it uses? Thank you
Olvyan Abraham
on 7 Oct 2017
Commented: Walter Roberson
on 7 Oct 2017
Please help me, I'm really new at this. I don't know anything.
4 comments
Walter Roberson
on 7 Oct 2017
Okay, so for the .doc file, are you supposed to include any storage taken up by headers and formatting codes? .doc files can contain a lot of things that are used to control how the document should be printed, and they can include spreadsheets and images. If you want to deal only with the "rendered" result, then you have the problem that the rendered result can include things that are not characters (such as lines), or that just happen to look the same as a character sequence. Should a short arrow that looks like --> but is drawn as one piece be counted as three characters because that is the closest character representation, even if it took (say) 12 characters to represent inside the file?
Olvyan Abraham
on 7 Oct 2017
Edited: Olvyan Abraham
on 7 Oct 2017
Accepted Answer
Walter Roberson
on 7 Oct 2017
Counting alphabetics is a much trickier task than it appears. There are (human) languages in which characters such as ü are not considered alphabetic, others in which they are, and yet others that use such characters but treat them as pairs of alphabetic characters, such as ue. Furthermore, there are human languages in which a pair of characters might be written separately for storage purposes but is technically considered a single character. For example, if you see ao in a Swedish text, that is considered to be a representation of the single alphabetic character å.
You also have the problem that there are multiple valid representations for some characters in Unicode. For example, ü is U+00FC, but it can also be written as u followed by COMBINING DIAERESIS (U+0308). Some characters involve combining more than two code points; and then there are the "stroke" representations for some writing systems, in which the stroke order can vary.
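To make the ü example concrete, here is a minimal MATLAB sketch that builds both representations from their code points and compares them. It assumes MATLAB's char type (UTF-16 code units); the variable names are only for illustration.

```matlab
% Two Unicode representations of the same visible character "ü".
precomposed = char(252);        % U+00FC, LATIN SMALL LETTER U WITH DIAERESIS
decomposed  = char([117 776]);  % 'u' (U+0075) followed by U+0308, COMBINING DIAERESIS

nPre = numel(precomposed);      % number of char elements in the first form
nDec = numel(decomposed);       % number of char elements in the second form
sameCodes = isequal(precomposed, decomposed);  % compares code values, not appearance

fprintf('precomposed: %d element(s), decomposed: %d element(s), equal: %d\n', ...
    nPre, nDec, sameCodes);
```

A naive numel() count therefore depends on which representation the file happens to use, even though a reader sees the same character.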
Because of these differences, you cannot count alphabetic characters properly until you know which language you are dealing with and all of its applicable single-to-multi and multi-to-single rules. You then have to translate the input into a canonical representation and check each character against a table of what that language considers to be alphabetic.
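A rough sketch of that two-step idea in MATLAB, under loud assumptions: the single multi-to-single rule (ue is rewritten to ü) and the alphabet table below are made up for illustration and are not a complete description of any real language.

```matlab
% Input written with a digraph in place of an accented letter.
raw = 'fuer';

% Step 1: translate to a canonical form. ASSUMED rule: 'ue' stands for
% the single character U+00FC. A real rule set would be much larger.
canonical = strrep(raw, 'ue', char(252));

% Step 2: check each character against an ASSUMED alphabet table
% (unaccented Latin letters plus a few accented ones, given by code point).
alphabet = ['a':'z', 'A':'Z', char([228 246 252 196 214 220 223])];

nLettersRaw       = sum(ismember(raw, alphabet));        % counted before canonicalising
nLettersCanonical = sum(ismember(canonical, alphabet));  % counted after canonicalising

fprintf('raw: %d letters, canonical: %d letters\n', nLettersRaw, nLettersCanonical);
```

The two counts differ (4 versus 3), which is exactly why the rules have to be settled before counting. A blind strrep like this would also rewrite any word in which u and e merely happen to be adjacent, so it is only a starting point.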
Automatically determining which language is being used based upon which characters appear is difficult...
Ah, and the above assumes that the characters have been represented in Unicode (perhaps in one of the UTF encodings). That is not necessarily the case: the characters might be in one of the various Code Pages, and it might not be obvious which Code Page is in use. I once spent a bunch of time on some code that tried to figure out which Code Page was in use by assuming that the characters represented text (and standard controls like newline), and looking carefully at which bytes occurred: if a byte is unassigned in a particular code page, then the presence of that byte rules out that code page as a possibility. But it turned out I never had a use for that code.
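The elimination idea can be sketched in a few lines of MATLAB. This is not the code described above, only an illustration with two candidate encodings: US-ASCII (bytes 128 to 255 unassigned) and Windows-1252 (bytes 129, 141, 143, 144 and 157 unassigned). The sample bytes and variable names are made up.

```matlab
% Bytes to examine: 'c', 'a', 'f', then byte 233.
bytes = uint8([99 97 102 233]);

% Candidate encodings and the byte values that are unassigned in each.
candidates(1).name = 'US-ASCII';
candidates(1).unassigned = uint8(128:255);
candidates(2).name = 'Windows-1252';
candidates(2).unassigned = uint8([129 141 143 144 157]);

possible = {};
for k = 1:numel(candidates)
    % A single unassigned byte is enough to rule a candidate out.
    if ~any(ismember(bytes, candidates(k).unassigned))
        possible{end+1} = candidates(k).name; %#ok<AGROW>
    end
end

disp(possible);
```

Here byte 233 rules out US-ASCII and leaves Windows-1252. Surviving the filter only means a candidate has not been excluded; it does not prove that code page is the one in use.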
... and then you get to deal with the problem that human languages often allow other languages to be quoted. I mentioned earlier that ao in Swedish text is a representation of the å character. That is true, but it might happen that the text quotes the English words "intraocular" or "aorta" or "chaos", or the Italian "ciao", or the particle physics term "kaon", and within the quoted words, the pair must be counted as two letters, not one.
1 comment
Walter Roberson
on 7 Oct 2017
If you can restrict yourself to one language and the language does not have any composite characters or digraph substitutions, then you have several phases:
- read the doc file into a uint8 or character array
- remove everything from the array that is considered headers or control information, preserving any spaces that are not part of control information
- do the counting
Hint: ismember()
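A minimal sketch of the counting phase with ismember(), assuming the text has already been reduced to plain characters and that "alphabetic" means unaccented English letters. The sample text and the file name in the comment are hypothetical.

```matlab
% Phase 1: read. For a plain-text file one could use fileread('article.txt')
% (hypothetical file name); a literal is used here so the sketch runs as is.
txt = 'Hello, World! 123';

% Phase 2 (removing headers and control information) depends on the file
% format and is skipped: the input is assumed to be plain text already.

% Phase 3: count.
nChars = numel(txt);                  % every character, spaces included
letters = ['a':'z', 'A':'Z'];         % ASSUMED alphabet: unaccented English
mask = ismember(txt, letters);        % true where txt holds a letter
nLetters = sum(mask);                 % how many characters are letters
distinctLetters = unique(lower(txt(mask)));  % which letters occur, ignoring case

fprintf('%d characters, %d letters, distinct letters: %s\n', ...
    nChars, nLetters, distinctLetters);
```

For this sample the result is 17 characters, 10 letters, and the distinct letters dehlorw. Changing the letters table is all it takes to adapt the count to a different single-character alphabet.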