On Jun 11, 2013, at 11:25 AM, Thomas Mortagne <thomas.mortagne(a)xwiki.com> wrote:
On Tue, Jun 11, 2013 at 11:21 AM, Vincent Massol <vincent(a)massol.net> wrote:
On Jun 11, 2013, at 11:09 AM, Denis Gervalle <dgl(a)softec.lu> wrote:
Thanks Sergiu for catching up here while I was off.
This is a situation I have been pointing out for a long time now: some
documents simply have no language, because all the text they display
is produced by the localization module. Not only is search
affected, but also the display of the language selection in a
multi-wiki, and currently this is not as nice as it should be.
IMO, since a wiki can produce, from a single source document, a page
displayed in every available translation of the wiki (like the UI
does), we need to have the notion of a "no language" or "any language"
document. In my own projects, I have been able to manage that properly
with the empty "default language" and empty "language" case. Since having
both is stupid, we will surely merge those columns in the future, but we
need to keep the idea of a "no language" or "any language" document,
however you see it, and to manage it properly, not only for our own XAR
but for user-produced documents as well.
For those reasons, I am -1 for 1) and 3), and +1 for 2)
BTW I forgot to say it in my previous reply, but obviously I remove my -1 for 2) since
it's much more complex than I thought.
I follow Thomas, the Welcome page is not a good example, since this one
really has a translation in all languages.
I do not understand the intent to index code statements in an index; it is
the rendered document that should be indexed. And an "any language"
document should be indexed separately for each language enabled in a
multi-language wiki.
ok, that's something very important to decide.
I think the 2 use cases are valid; XWiki can be used both by end users who only care
about pure content and by developers who develop wiki pages with
scripts inside. The former want to see only content in their results and the latter
want to see scripts in their results. For example, as a dev I want to see where I call a given
Velocity macro, or where I use a given rendering macro.
So I really believe that we need to index both types of content: rendered and raw.
Right now only raw content is indexed, mostly because it's not easy to
index rendered content (most of it is dynamic and some of it is simply
dangerous or really not intended to be executed in the context of some
daemon thread; this is not a new issue). But putting aside the current
limitation, I don't think indexing raw content is useless, and saying
that scripting is English because the syntax looks like English is
simply nonsense…
Yes I agree it's hard to index rendered content perfectly...
I was just replying to Denis who was saying the opposite, i.e. that we should only index
rendered content. I see value in indexing the raw content for the reasons I pointed out
above.
I agree that ideally non-technical users shouldn't see script portions in their
results… One thing we could do is render without executing transformations and
only index that content. It's not perfect but it could be a first try at filtering out
technical content for simple users.
Thanks
-Vincent
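
To make the "index both raw and script-free content" idea above concrete, here is a minimal sketch. It is only an approximation under stated assumptions: `index_fields` and the `raw`/`text` field names are hypothetical, nothing here is an XWiki or Solr API, and stripping `{{velocity}}` blocks with a regular expression merely stands in for "rendering without executing transformations" (real rendering would also resolve the wiki markup).

```python
import re

# Hypothetical helper, not an XWiki or Solr API: drops {{velocity}} macro
# blocks from the wiki source so that only the literal text is left.
VELOCITY_BLOCK = re.compile(r"\{\{velocity\}\}.*?\{\{/velocity\}\}", re.DOTALL)


def index_fields(source: str) -> dict:
    """Return the two views of a page that a search index could store."""
    text_only = VELOCITY_BLOCK.sub("", source)
    # Collapse the whitespace left behind by the removed blocks.
    text_only = re.sub(r"\s+", " ", text_only).strip()
    return {"raw": source, "text": text_only}


page = (
    "You're currently in the **Main** space.\n"
    "{{velocity}}\n"
    "#if($hasEdit)Try the [[Sandbox space>>Sandbox.WebHome]].#end\n"
    "{{/velocity}}\n"
)
fields = index_fields(page)
```

With two fields like these, a default search could query only `text`, while an advanced search option could also query `raw` so that developers still find `hasEdit`.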
> Then we need to decide how to present that in the UI, but it could be an option in the
> advanced search to include raw results too, for example (and remove duplicates…).
>
>> So, to respond to Vincent, using properly the "any language" case by
>> clearing the default language where needed, is the right way to go, but we
>> also need to manage that special case properly elsewhere (indexing,
>> language selection on the page,…)
>
> So what's the rule for putting <defaultLanguage>en</defaultLanguage>? Whenever there's a page having at least one word of English in the rendered result?
>
> Thanks
> -Vincent
>
>> On Tue, Jun 11, 2013 at 10:54 AM, Thomas Mortagne <thomas.mortagne(a)xwiki.com> wrote:
>>
>>> That's not the home page, that's the Welcome page, and it's not a very
>>> good example since this page does have translations already, so we
>>> already decided what we wanted for these pages in practice: each
>>> translation of the page copies the scripts.
>>>
>>> On Tue, Jun 11, 2013 at 10:19 AM, Vincent Massol <vincent(a)massol.net>
>>> wrote:
>>>> Hi Sergiu,
>>>>
>>>> On Jun 10, 2013, at 9:49 PM, Sergiu Dumitriu <sergiu(a)xwiki.org> wrote:
>>>>
>>>>> On 06/10/2013 03:12 PM, Vincent Massol wrote:
>>>>>>
>>>>>> On Jun 10, 2013, at 8:25 PM, Sergiu Dumitriu <sergiu(a)xwiki.org> wrote:
>>>>>>
>>>>>>> On 06/10/2013 11:00 AM, Thomas Mortagne wrote:
>>>>>>>> Hi devs,
>>>>>>>>
>>>>>>>> Right now the XAR plugin format goal systematically empties the
>>>>>>>> <defaultLanguage> property.
>>>>>>>>
>>>>>>>> This is wrong IMO since it means we have no idea what the default
>>>>>>>> document language is. It was not too visible before, but it's really not
>>>>>>>> very nice for things like the localization module and especially Solr,
>>>>>>>> which stores the content differently depending on the language (stop
>>>>>>>> words, etc.).
>>>>>>>>
>>>>>>>> I see several possibilities:
>>>>>>>>
>>>>>>>> 1) We don't touch the XAR maven plugin and we state that when default
>>>>>>>> language is not set, it's en (in the importer for example or in
>>>>>>>> XWikiDocument#getDefaultLanguage)
>>>>>>>> 2) We stop filtering default language in the XAR plugin and we set it
>>>>>>>> to en for all documents in which it makes sense
>>>>>>>> 3) We force default language to "en" in the XAR plugin
>>>>>>>>
>>>>>>>> WDYT ?
>>>>>>>>
>>>>>>>> I don't like 1) too much since some technical documents could really be
>>>>>>>> seen as having no default language, e.g. documents without any literal
>>>>>>>> content. But it's more a -0 than a -1; I understand others would want
>>>>>>>> this for simplicity.
>>>>>>>>
>>>>>>>> About 3): as I said, having an empty default language is a valid use case
>>>>>>>> IMO, so -0 for this one too. Still a bit better than 1) since the use
>>>>>>>> case is still possible.
>>>>>>>>
>>>>>>>> +1 for 2)
>>>>>>>
>>>>>>> Neither option is good in general. The main problem is that most
>>>>>>> documents are written in the "Velocity" language, not in the "English"
>>>>>>> language, meaning that they only contain code (which won't be seen by the
>>>>>>> user), and translations, which depend on a lot of factors. It's not good
>>>>>>> to say that the default language of a dynamically translated document is
>>>>>>> en, since a wiki configured with a different language will only display
>>>>>>> them in that language, never in en.
>>>>>>>
>>>>>>> There are only a few documents that contain real text (normally only the
>>>>>>> sandbox should have real text, everything else should be localized), and
>>>>>>> for those it's OK to specify the actual language.
>>>>>>
>>>>>> I'm not sure I agree with this vision. It really depends on the use case. So far we haven't found a perfect solution.
>>>>>>
>>>>>> Some pages will have more code than content, others will have more content than code. For the former, keys are best and for the latter translations are best.
>>>>>>
>>>>>> In any case I don't understand the problem. What is the issue with saying that all our pages are in English by default? If a wiki is configured to be in another language and there's no translation for that language, the default language (i.e. "English") will be used.
>>>>>
>>>>> There is no actual text in the document. How can you say that the
>>>>> language of
>>>>>
>>>>> https://github.com/xwiki/xwiki-platform/blob/master/xwiki-platform-core/xwi…
>>>>> is English, since there's no English sentence in there? Depending on the
>>>>> configuration, the same document will appear in German, Chinese, even
>>>>> Klingon, without changing anything in the document, so it is definitely
>>>>> not an English document.
>>>>>
>>>>> Scenario: Set up a new XWiki instance, and change the default language
>>>>> of the wiki to German. When you browse the wiki, everything is in
>>>>> German. Yet all the documents say that they're in English.
>>>>>
>>>>> Problem 1: The wiki is indexed as English text, so searching for text
>>>>> that the user actually sees in the wiki won't return any results.
>>>>>
>>>>> Problem 2: Editing such a document will automatically create a
>>>>> translation, since the original document is in English, and the user
>>>>> wants to edit a German document. Since the two languages are not
>>>>> compatible, a translation will be created automatically. Now the code
>>>>> has been forked, and automatic updates using the Distribution Wizard
>>>>> will update the hidden English document, since that is the default one,
>>>>> while the forked translation will stay behind.
>>>>
>>>> Thanks a lot for describing these 2 use cases that I definitely wouldn't have thought about! That's very useful.
>>>>
>>>> So it seems that suddenly it's becoming more complex ;)
>>>>
>>>> Basically it means that if we have documents that mix content and scripting we're going to have issues:
>>>> * Either they're marked as having no default language and the English content will be indexed in the default language of the wiki
>>>> * Or they're marked as "en" and the user will not have the scripts in the search results and the DW/EM will update only the default version if the user has created a translation
>>>>
>>>> I'm sure we have lots of cases like this, the easiest one being the main home page:
>>>>
>>>> -------------------
>>>> It's an easy-to-edit website that will help you work better together. This Wiki is made of //pages// sorted by //spaces//. You're currently in the **Main** space, looking at its home page (**WebHome**).
>>>>
>>>> Learn how to use XWiki with the {{velocity}}[[Getting Started Guide>>http://enterprise.xwiki.org/xwiki/bin/view/GettingStarted/WebHome?version=$… .
>>>>
>>>> {{velocity}}
>>>> #if($hasEdit)You can then use the [[Sandbox space>>Sandbox.WebHome]] to try out your wiki's features.#end
>>>> {{/velocity}}
>>>> -------------------
>>>>
>>>> It has both content and script… If it's marked as "en" then if the user searches for "hasEdit" he won't get it if his wiki is in a language other than "en".
>>>>
>>>> Unless, if there's no translation in the language of the user, we return the default language results for that page. Would that make sense?
>>>>
>>>> But there's still the issue of editing the page, which will create a translation. Then imagine that we replace the velocity script in a future version: the user will only get his default page updated and not his translation… However that's a general problem I guess...
>>>>
>>>> If it has no default language (as is the case now BTW) then it seems less of an issue. It just means:
>>>> * If the user searches for "work" he'll get results even though he's not in an "en" wiki. But then he's searching for an English word too ;)
>>>> * any other downside?
>>>>
>>>> In view of all this, it seems that not setting any default language is a lesser evil, doesn't it?
>>>>
>>>> Thanks
>>>> -Vincent
>>>>
>>>>> That is why I'm saying that this kind of document doesn't have a
>>>>> language, and never should. They adapt themselves to the
>>>>> user's language, so they're written in no language, yet they can match
>>>>> all languages.
>>>>>
>>>>> Of course, not all documents are like this, as I originally stated
>>>>> myself, and there are valid cases where documents should have "en" as
>>>>> the default translation.
>>>>>
>>>>>> What am I missing?
>>>>>>
>>>>>> (I'm not commenting on anything below yet because I feel it's important to agree on what's before first)
>>>>>>
>>>>>> Thanks
>>>>>> -Vincent
>>>>>>
>>>>>>> Other options:
>>>>>>>
>>>>>>> 4) Detect somehow localized documents and index:
>>>>>>> - the raw content using a non-language-specific analyzer
>>>>>>> - the content translated into all the languages registered in the
>>>>>>> administration, each with the proper language-specific analyzer, if they
>>>>>>> are supported by Solr; this includes the default wiki language.
>>>>>>> 4a) localized document = the default language is empty
>>>>>>> 4b) localized document = the default language is literally "localized"
>>>>>>> 4c) add another document flag for marking localized documents
>>>>>>>
>>>>>>> 5) When the defaultLanguage is empty, render in the configured wiki
>>>>>>> default language
>>>>>>>
>>>>>>>
>>>>>>> I like 4) since it makes localized documents really searchable in all
>>>>>>> the languages "supported" by that wiki instance.
>>>>>>>
>>>>>>> 4a) is a behavior change, so it might cause some trouble
>>>>>>> 4b) is the safest and requires the least amount of changes
>>>>>>> The number of document fields is increasing, so I'm not that fond of 4c)
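
Option 4) with variant 4a) can be sketched as a small decision function. This is only an illustration of the proposal as worded in the thread, not XWiki or Solr code: `Doc`, `index_plan`, the `"generic"` analyzer name and the fallback for documents whose own language has no analyzer are all assumptions made for the example.

```python
from dataclasses import dataclass


# Hypothetical model, not the XWiki data model or the Solr API.
@dataclass
class Doc:
    name: str
    default_language: str  # "" means "no language" (variant 4a)


def index_plan(doc, wiki_languages, supported_analyzers):
    """List the (language, analyzer) pairs a document would be indexed under."""
    if doc.default_language != "":
        # Document with real text: index it once, in its own language.
        # Falling back to a generic analyzer is an assumption of this sketch.
        lang = doc.default_language
        analyzer = lang if lang in supported_analyzers else "generic"
        return [(lang, analyzer)]
    # Localized document: raw content with a non-language-specific analyzer...
    plan = [("raw", "generic")]
    # ...plus one translated copy per wiki language that has an analyzer.
    for lang in wiki_languages:
        if lang in supported_analyzers:
            plan.append((lang, lang))
    return plan


wiki_languages = ["en", "de", "tlh"]
supported = {"en", "de", "fr"}
localized_plan = index_plan(Doc("Main.WebHome", ""), wiki_languages, supported)
sandbox_plan = index_plan(Doc("Sandbox.WebHome", "en"), wiki_languages, supported)
```

Under these assumptions the localized page is indexed once as raw content and once per supported wiki language, while the page with real English text is indexed only in English. Variant 4b) would change only the test on `default_language` (comparing against the literal value "localized" instead of the empty string).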