Kivela Developers blog, star date 47410.2

4 Jul 08

Rtf stripper

Have you ever needed to convert an RTF file to plain text?

One option (and the easiest one) is to use the WinForms RichTextBox control: set its Rtf property and read back its Text property.
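For reference, a minimal sketch of that control-based approach. It assumes the control meant here is the WinForms RichTextBox from System.Windows.Forms; the class and method names (RtfToText, ViaRichTextBox) are made up for the illustration.

```csharp
using System.Windows.Forms;

public static class RtfToText
{
    // Let the control parse the RTF, then read back the plain text.
    // Requires a reference to System.Windows.Forms.
    public static string ViaRichTextBox(string rtf)
    {
        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;      // the control parses the markup here
            return box.Text;    // plain text with the formatting removed
        }
    }
}
```

Something like RtfToText.ViaRichTextBox(File.ReadAllText(path)) would then give you the text of a file, at the price of the UI assembly reference.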

If, however, you're not in a UI project, it smells to add a reference to the controls assembly just for some simple parsing, or, to be more precise, some simple stripping.

Since one of my colleagues needed exactly this functionality, we ended up converting an old C function into the following.


public static String Strip(String rtf)
{
    // Output is suppressed while any of these three flags is set;
    // each one is cleared again by the next space.
    bool slash = false;         // a backslash (start of a control word) was seen
    bool figure_opened = false; // an opening brace was seen
    bool figure_closed = false; // a closing brace was seen
    // The first character after a control word or brace is a delimiter, not text;
    // all other spaces belong to the plain text and must be kept.
    bool first_space = false;

    if (rtf == null || rtf.Length < 4) return "";
    int length = rtf.Length;

    // The text body is assumed to start at the first \pard control word.
    int start = rtf.IndexOf(@"\pard");
    if (start < 0) return "";

    // StringBuilder avoids re-allocating the result string on every character.
    var result = new System.Text.StringBuilder();

    for (int j = start; j < length; j++)
    {
        char ch = rtf[j];
        // Look ahead without running past the end of the input.
        char next = j + 1 < length ? rtf[j + 1] : '\0';

        if (ch == '\\') { first_space = true; slash = true; }
        if (ch == '{') { first_space = true; figure_opened = true; }
        if (ch == '}') { first_space = true; figure_closed = true; }

        // A space ends the current control word, except directly after \datafield.
        if (ch == ' ' && !StartsAt(rtf, j - 10, @"\datafield"))
        {
            slash = false;
            figure_opened = false;
            figure_closed = false;
        }

        // Escaped literals \{ \} and \\ are part of the text.
        if (ch == '\\' && (next == '{' || next == '}' || next == '\\'))
        {
            slash = false;
            figure_opened = false;
            figure_closed = false;
            first_space = false;
            result.Append(next);
            j++;
            continue;
        }

        // "\par " marks the end of a paragraph (\pard does not match because of the space).
        if (StartsAt(rtf, j, "\\par "))
        {
            slash = false;
            figure_opened = false;
            figure_closed = false;
            first_space = false;
            result.Append('\n');
            j += 4;
            continue;
        }

        // Skip a HYPERLINK field together with its quoted target.
        if (StartsAt(rtf, j, "HYPERLINK"))
        {
            int open = rtf.IndexOf('"', j);
            int close = open < 0 ? -1 : rtf.IndexOf('"', open + 1);
            if (close < 0) break; // malformed field: no quoted target, stop here
            j = close + 1;        // the loop increment also skips the character after the quote
            continue;
        }

        if (!slash && !figure_opened && !figure_closed && ch != '\n'
            && !StartsAt(rtf, j + 1, "HYPERLINK"))
        {
            if (!first_space)
            {
                result.Append(ch);
            }
            else
            {
                first_space = false;
            }
        }
    }

    return result.ToString();
}

// True if token occurs in s exactly at index; never reads outside the string.
private static bool StartsAt(String s, int index, String token)
{
    return index >= 0
        && index + token.Length <= s.Length
        && String.CompareOrdinal(s, index, token, 0, token.Length) == 0;
}

Hope it can save somebody half an hour :)

Cheers, Stefan

Filed under: Tech
Comments (2) Trackbacks (0)
  1. I haven’t really tested it, but this regular expression should do the same:

    ^\{(.+)|^\\(.+)|(\}*)

    Anyways… your unit tests should be able to validate this. (And you’ve written them in half an hour???) ;-)

  2. Hi Ben,

    Thanks for your reply.

    I’ll give it a go when I’m back in Leuven.

    And nope, I did not write this code in half an hour, of course :) No wizard-like behaviour today ;)

    It took me about that long to find this code snippet in plain C. We just ‘ported’ it afterwards and gave it to the developer, who ran his tests against it and went on with his dev’ing. I will give the regex a try, of course, and let you know if it works out.

    Regards, Stefan.

