Hello,
I've noticed an issue in Document Server behavior:
- Convert the attached document from DOCX to HTML with default settings.
- Take a look at HTML markup generated for any of the headers in the resulting document (e.g. "Abstract" or "Introduction"). You'll see something like this:
<ol start="2" style="margin-top:0;margin-bottom:0;"><li class="csC09771DB"><span class="csC05134DD">Introduction</span></li></ol>
This is wrong. They are headers, so they must be rendered as h1 or h2 (depending on the outline level).
I did some debugging, and it appears that the styles for the paragraphs in question are inherited, which is probably why your logic does not detect their outline level. To handle such scenarios properly, you need something like this:
private int getStyleOutlineLevel(ParagraphStyle style)
{
    if (style.OutlineLevel.HasValue
        && style.OutlineLevel.Value > 0)
    {
        return style.OutlineLevel.Value;
    }
    else if (style.Parent != null)
    {
        return getStyleOutlineLevel(style.Parent);
    }
    else
    {
        return 0;
    }
}
This should be used instead of simply checking style.OutlineLevel.Value.
Please fix this issue as it is important for me to get some real h1-h2 tags instead of simply replicating the needed heading style with spans and li.
Nickolay, Software Architect
ClickHelp - Online Documentation Tool
http://clickhelp.co
BTW, there are some additional (minor) issues with the HTML output for the document: the "Styles for the Article Body" header is converted to <ol><ol><li>, which is not correct HTML according to http://stackoverflow.com/questions/5899337/proper-way-to-make-html-nested-list
The correct markup should be <ol><li><ol><li>.
Also, the header is supposed to be "2.1. Styles for the Article Body", not "1. Styles for the Article Body", but I'm not sure whether Document Server is supposed to handle this numbering properly.
Nickolay, Software Architect
Hello Nickolay,
It seems that the incorrect behavior occurs at the document importing level. For some reason, the Abstract title becomes numbered when the document is loaded to RichEditDocumentServer. As a result, the numeration is incorrect in the whole document and it affects export. I have passed this issue to our R&D team for further research.
Please stay tuned to our updates. We will notify you as soon as we make any progress.
Well, yes, for me the ideal behavior would be to simply convert all such auto-numbered headers to h1, h2, h3 etc. tags without any numbered lists - this would solve all the three issues mentioned in this case and this is why I wanted the original problem to be fixed. I was going to move all h1-h6 tags out of ol / ul tags and remove those tags.
In fact, given that HTML properly supports neither auto-numbered headers nor "broken" lists with lots of extra content between list items, it might indeed make sense to remove the numbering during import.
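The HTML-side cleanup mentioned above (moving h1-h6 out of ol / ul and dropping the list tags) could look roughly like the sketch below. It is only an illustration under stated assumptions: HeadingUnwrapper is a hypothetical helper, the markup is assumed to be well-formed XHTML that System.Xml.Linq can parse (real-world HTML would need a proper HTML parser), and a list is unwrapped only when a single heading carries all of its text.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of any DevExpress API.
public static class HeadingUnwrapper
{
    private static readonly string[] HeadingNames = { "h1", "h2", "h3", "h4", "h5", "h6" };
    private static readonly string[] ListNames = { "ol", "ul" };

    // Replaces every outermost <ol>/<ul> below root whose only text content
    // is a single heading with that heading element.
    public static void Unwrap(XElement root)
    {
        var outermostLists = root.Descendants()
            .Where(e => ListNames.Contains(e.Name.LocalName)
                        && !e.Ancestors().Any(a => ListNames.Contains(a.Name.LocalName)))
            .ToList(); // materialize before mutating the tree

        foreach (var list in outermostLists)
        {
            var headings = list.Descendants()
                .Where(e => HeadingNames.Contains(e.Name.LocalName))
                .ToList();

            // Leave genuine lists (extra items or extra text) untouched.
            if (headings.Count == 1 && list.Value.Trim() == headings[0].Value.Trim())
            {
                // The heading still has a parent, so LINQ to XML inserts a copy.
                list.ReplaceWith(headings[0]);
            }
        }
    }
}
```

For example, parsing <div><ol><li><ol><li><h1>Header text</h1></li></ol></li></ol><p>Body</p></div> with XElement.Parse and passing it to Unwrap leaves the h1 and the p as direct children of the div.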
Nickolay, Software Architect
Nickolay, you can try the following approach to remove numbering for paragraphs before the export to Html:
foreach (var paragraph in richEditDocumentServer.Document.Paragraphs)
{
    if (!paragraph.IsInList)
        continue;
    int listLevel = paragraph.ListLevel;
    richEditDocumentServer.Document.Paragraphs.RemoveNumberingFromParagraph(paragraph);
    paragraph.OutlineLevel = listLevel + 1;
}
Please let me know if this approach meets your requirements.
Hmmm, that's a very interesting idea. I tried to keep all my optimization logic on the HTML side, so I missed the obvious idea of pre-processing the Word document. Please tell me, can this solution be extended to evaluate the correct index for the current list item, so that I can not only remove its numbering and produce <h1>Header Text</h1>, but also adjust the paragraph's text during pre-processing so it becomes something like <h1>2.3. Header Text</h1> in the HTML?
Actually, let me describe my use case in detail - my experience with ASPxHtmlEditor tells me the more you know of my use cases the better for me in the future.
In my use case, customers provide me with Word documents. Those docs can be anything, from a few pages to huge complicated monsters occupying 50-100 MB of disk space, which make intensive use of auto-numbered headers, cropped images, shapes, formulae and every Word feature you can imagine (for example, have you ever seen a Word TOC with icons? Well, I have :))
Then, I take the document and split it into separate parts (topics). Each part starts with a header. Now, here's the problem: currently, for docs with auto-numbered headers, the header is converted to HTML as text inside a numbered list. Unfortunately, I get a lot of auto-numbered docs, so I have to deal with it somehow. I already have some very nice automatic HTML markup optimization (post-processing) logic which does a lot of cleanup for me (for example, it removes all styles from lists as described in this case: https://www.devexpress.com/Support/Center/Question/Details/T251981). My initial idea was to adjust this logic a bit further and make it convert markup like <ol><li><ol><li><h1>Header text</h1>… to simple <h1>Header text</h1>. However, it turned out that you do not always generate heading tags for such headings, and therefore I created this case. I need those h1-h6 tags for a number of reasons, including proper styles and SEO considerations. From your response I inferred that you are going to do this job for me and generate h1-h6 without any lists for such cases (not sure whether my assumption was correct, maybe I just heard what I wanted to hear).
But there is one more thing to consider here: on the one hand, you are trying to make the converted markup look as close to the original as possible. People expect to see the same content and styling they see in their Word files, including the heading numbers. On the other hand, HTML is obviously not a good fit when you have a single auto-numbered header followed by tons of text, images and other stuff prior to the next auto-numbered header. So, why don't you convert the heading auto-numbers to text, like you currently do for the Document.GetText method (try getting the Abstract paragraph's text and you'll see what I mean)? That is, why don't you generate markup like <h1>1. Header text</h1> for such auto-numbered headers? This could be a Word import option, and it would solve another issue for me, as currently I don't get proper heading numbers at all, because I first split the document into parts and then convert them to HTML (as a result, each auto-generated header is always rendered as "1. Header text" - so, I have 50 topics with the same index of 1 due to that auto-numbering :))
That's just an improvement idea, though - I'll be OK with custom pre-processing or post-processing logic if that logic eventually gives me <h1>2.3. Header Text</h1> or at least <h1>Header Text</h1> for such auto-numbered headers.
Nickolay, Software Architect
Hi Nickolay,
I guess your main idea is to remove the document auto-numbering and replace numbers as a simple text in the document headers before exporting. You can use the approach described in my previous comment to remove numbering, but replacing the text is more complicated. We do not provide a built-in mechanism to get the exact header number as a string value in the corresponding format (that is, as this number is shown in the document).
So, you should generate this string with the number manually by counting the number of headers with the corresponding outline level. When the string is generated, you can insert this string in the paragraph start position using the RichEditDocumentServer.Document.InsertText method.
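The manual counting described above could be sketched roughly as follows. This is only an illustration, not a built-in feature: HeadingNumberer is a hypothetical helper, it assumes plain decimal numbering with a trailing dot (such as "2.3."), and it does not try to reproduce custom Word number formats (letters, Roman numerals, custom start values).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of any DevExpress API.
// Keeps one counter per outline level and formats labels such as "2.3.".
public sealed class HeadingNumberer
{
    private readonly List<int> _counters = new List<int>();

    // outlineLevel is 1-based (1 = top-level heading).
    // Call once per heading, in document order.
    public string Next(int outlineLevel)
    {
        if (outlineLevel < 1)
            throw new ArgumentOutOfRangeException(nameof(outlineLevel));

        // First heading at a deeper level: add counters starting at 0.
        // Assumption: a skipped parent level is shown as 0 (e.g. "0.1.").
        while (_counters.Count < outlineLevel)
            _counters.Add(0);

        // A heading at this level restarts numbering of all deeper levels.
        if (_counters.Count > outlineLevel)
            _counters.RemoveRange(outlineLevel, _counters.Count - outlineLevel);

        _counters[outlineLevel - 1]++;
        return string.Join(".", _counters) + ".";
    }
}
```

Feeding it the outline levels 1, 1, 2, 2, 3, 1 in document order yields "1.", "2.", "2.1.", "2.2.", "2.2.1." and "3.". Each label (plus a space) would then be inserted at the start of the corresponding paragraph, for instance with the InsertText method mentioned above, after the numbering has been removed.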
>>>
I guess your main idea is to remove the document auto-numbering and replace numbers as a simple text in the document headers before exporting
<<<
Exactly.
>>>
So, you should generate this string with the number manually by counting the number of headers with the corresponding outline level. When the string is generated, you can insert this string in the paragraph start position using the RichEditDocumentServer.Document.InsertText method.
<<<
So, no built-in way to do that. OK, thanks for clarifying that. I think I'll just keep removing the numbers for now - just wanted to check whether there's an easy way to keep them. They're not very important for me, and if I need them I'll just count the headers as you suggested.
I see that you've put this case into the "Fixed" status. Could you please clarify exactly what behavior will be fixed by the hotfix? I mean, how will the behavior change after applying the hotfix?
Nickolay, Software Architect
As I specified in my first comment, the Abstract header becomes numbered on document loading for some reason. So, all the document numbering was incorrect.
Now this document numbering is fixed (Abstract is not numbered anymore). So, you will be able to import this document as expected and the exported document should have the correct numbers.
For anyone interested in removing the automatic numbers, the sample above is not correct as it removes all lists from the document regardless of whether they are headers or not, which is obviously not acceptable. You need to use the following code instead:
private void cleanupNumberedHeadings()
{
    foreach (Paragraph p in _doc.Paragraphs)
    {
        if (!p.IsInList
            || (getStyleOutlineLevel(p.Style) < 1
                && p.OutlineLevel < 1))
        { // Not a numbered heading
            continue;
        }
        int listLevel = p.ListLevel;
        _doc.Paragraphs.RemoveNumberingFromParagraph(p);
        p.OutlineLevel = listLevel + 1;
    }
}
Where getStyleOutlineLevel is a routine that resolves a style's outline level recursively, taking style inheritance into account:
private int getStyleOutlineLevel(ParagraphStyle style)
{
    if (style.OutlineLevel.HasValue
        && style.OutlineLevel.Value > 0)
    {
        return style.OutlineLevel.Value;
    }
    else if (style.Parent != null)
    {
        return getStyleOutlineLevel(style.Parent);
    }
    else
    {
        return 0;
    }
}
Nickolay, Software Architect