RTF stripper
Have you ever needed to convert an RTF file to plain text?
One option (and the easiest one) is to use the RichTextBox control: set its Rtf property and read its Text property.
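A minimal sketch of that control-based approach, assuming a project that already references System.Windows.Forms. The class and method names are made up for illustration.

```csharp
using System.Windows.Forms;

public static class RichTextBoxStripper
{
    // Hypothetical helper: lets the RichTextBox control parse the RTF
    // and hands back the control's plain-text view of it.
    public static string ToPlainText(string rtf)
    {
        using (RichTextBox box = new RichTextBox())
        {
            box.Rtf = rtf;   // the control parses the RTF markup
            return box.Text; // formatting is gone, only the text remains
        }
    }
}
```

The control never has to be shown on a form for this to work, but the assembly reference is still required.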
If, however, you're not in a UI project, it smells to add a reference to the controls assembly just for simple parsing or, to be more precise, simple stripping.
Since one of my colleagues needed this functionality, we ended up converting an old C function to the following.
public static String Strip(String rtf)
{
    // Nothing to strip from a null or trivially short input.
    if (rtf == null || rtf.Length < 4) return "";

    // Text is only collected from the first \pard onwards, which skips the RTF header.
    int start = rtf.IndexOf(@"\pard", StringComparison.Ordinal);
    if (start < 1) return "";

    System.Text.StringBuilder result = new System.Text.StringBuilder();
    bool slash = false;         // a backslash was seen and no space has followed yet
    bool figure_opened = false; // an opening curly brace was seen and no space has followed yet
    bool figure_closed = false; // a closing curly brace was seen and no space has followed yet
    bool first_space = false;   // the next space only ends a control word or brace and is dropped;
                                // all other spaces are plain text and are kept
    int length = rtf.Length;

    for (int j = start; j < length; j++)
    {
        char ch = rtf[j];
        bool hasNext = j + 1 < length; // guards the look-ahead at the end of the input

        if (ch == '\\') // start of a control word or an escaped character
        {
            first_space = true;
            slash = true;
        }
        if (ch == '{')
        {
            first_space = true;
            figure_opened = true;
        }
        if (ch == '}')
        {
            first_space = true;
            figure_closed = true;
        }
        // A space ends the current control word or brace, except directly after \datafield.
        if (ch == ' ' && !(j >= 10 && String.CompareOrdinal(rtf, j - 10, @"\datafield", 0, 10) == 0))
        {
            slash = false;
            figure_opened = false;
            figure_closed = false;
        }
        // Escaped '{', '}' and '\' are literal characters in the text.
        if (ch == '\\' && hasNext && (rtf[j + 1] == '{' || rtf[j + 1] == '}' || rtf[j + 1] == '\\'))
        {
            slash = false;
            figure_opened = false;
            figure_closed = false;
            first_space = false;
            result.Append(rtf[j + 1]);
            j++;
            continue;
        }
        // "\par " ends a paragraph: emit a newline and skip the control word and its space.
        if (String.CompareOrdinal(rtf, j, "\\par ", 0, 5) == 0)
        {
            slash = false;
            figure_opened = false;
            figure_closed = false;
            first_space = false;
            result.Append('\n');
            j += 4;
            continue;
        }
        // Skip a HYPERLINK field up to and including its quoted target.
        if (String.CompareOrdinal(rtf, j, "HYPERLINK", 0, 9) == 0)
        {
            int open = rtf.IndexOf('"', j);
            int close = open < 0 ? -1 : rtf.IndexOf('"', open + 1);
            if (close < 0) break; // malformed field: stop instead of reading past the end
            j = close + 1;        // the loop increment also skips the character after the closing quote
            continue;
        }
        // Copy plain text; the character right before HYPERLINK is skipped as well.
        if (!slash && !figure_opened && !figure_closed && ch != '\n' /*&& ch!=13*/
            && String.CompareOrdinal(rtf, j + 1, "HYPERLINK", 0, 9) != 0)
        {
            if (!first_space)
            {
                result.Append(ch);
            }
            else
            {
                first_space = false;
            }
        }
    }
    return result.ToString();
}
Hope it can save somebody half an hour.
Cheers, Stefan
July 7th, 2008 - 16:54
I haven’t really tested it, but this regular expression should do the same:
^\{(.+)|^\\(.+)|(\}*)
Anyway… your unit tests should be able to validate this. (And you’ve written them in half an hour?)
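For anyone who wants to try the pattern above, here is a minimal sketch that applies it with Regex.Replace. The class and method names are made up for illustration, and the single-line sample input in the note below is an assumption, not something from the post.

```csharp
using System.Text.RegularExpressions;

public static class RegexTrial
{
    // The pattern exactly as posted in the comment above.
    private const string Pattern = @"^\{(.+)|^\\(.+)|(\}*)";

    // Hypothetical helper: deletes every match of the pattern from the input.
    public static string Apply(string rtf)
    {
        return Regex.Replace(rtf, Pattern, "");
    }
}
```

On a single-line document such as `{\rtf1\ansi\pard Hello}`, Apply returns an empty string, because the first alternative matches from the opening brace to the end of the line. The pattern would therefore need some work before it could stand in for Strip.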
July 8th, 2008 - 17:05
Hi Ben,
Thanks for your reply.
I’ll give it a go when I’m back in Leuven.
And nope, I did not write this code in half an hour, of course. No wizard-like behaviour today.
It took me about that long to find this code snippet in plain C; we just ‘ported’ it afterwards and gave it to the developer, who ran his tests with it and went on with his development. I will give the regex a try, of course, and let you know if it worked out.
Regards, Stefan.