Hi, I'm a colleague of Olaf.
Last week we exposed our XWiki to the Googlebot to test our
configuration. Yesterday we ran into the same problems with the crawler, and
we are as lost as before.
I will describe the problem more precisely and quote some of our logs.
The results of my analysis:
- Some critical actions (e.g. edit) redirect the Googlebot to the login page
with a 302. The login page returns 401, and the Googlebot's path stops there.
Fine!
Log example:
example.com - - 66.249.73.10 [27/Apr/2012:15:44:12 +0200] "GET /wiki/example.com/edit/XWiki/GadgetClass HTTP/1.1" 302 20
www.example.com - - 66.249.73.10 [27/Apr/2012:15:44:12 +0200] "GET /wiki/example.com/login/XWiki/XWikiLogin;jsessionid=1F380560FA9E3582D6DDB9B1D286B151?srid=yWSymcYq&xredirect=%2Fwiki%2Fexample.com%2Fedit%2FXWiki%2FGadgetClass%3Fsrid%3DyWSymcYq HTTP/1.1" 401 3004
============================================================================
- Some critical actions result in an OK (200). These include, for example,
deletespace, but also some edits.
Log example:
example.com - - 66.249.73.10 [27/Apr/2012:15:46:30 +0200] "GET /wiki/example.com/get/Hilfe/WebPreferences HTTP/1.1" 200 985
example.com - - 66.249.73.10 [27/Apr/2012:15:46:33 +0200] "GET /wiki/example.com/edit/Blog/WebPreferences HTTP/1.1" 200 5271
example.com - - 66.249.73.10 [27/Apr/2012:15:46:37 +0200] "GET /wiki/example.com/save/Blog/WebPreferences HTTP/1.1" 302 20
example.com - - 66.249.73.10 [27/Apr/2012:15:46:37 +0200] "GET /wiki/example.com/view/Blog/WebPreferences?resubmit=%2Fwiki%2Fexample.com%2Fsave%2FBlog%2FWebPreferences%3Fsrid%3Dn3Ake7tL&xback=%2Fwiki%2Fexample.com%2Fview%2FBlog%2FWebPreferences&xpage=resubmit HTTP/1.1" 200 3689
Here I see one part of the Googlebot's path. It triggers actions that guests
are not allowed to perform. For comparison: when I request
/edit/Blog/WebPreferences myself, I get a 302 redirect and a 401 login page:
jloos@live:~$ curl -IL http://example.com/wiki/example.com/edit/Blog/WebPreferences
HTTP/1.1 302 Moved Temporarily
Date: Sat, 28 Apr 2012 00:41:33 GMT
Server: Apache/2.2.16
Set-Cookie: JSESSIONID=6819FB0D96E0695388E1AA2A1A92AF49; Path=/
Location: http://example.com/wiki/example.com/login/XWiki/XWikiLogin;jsessionid=6819F…
Content-Language: de
Vary: Accept-Encoding
Content-Type: text/html
HTTP/1.1 401 Unauthorized
Date: Sat, 28 Apr 2012 00:41:33 GMT
Server: Apache/2.2.16
Pragma: no-cache
Cache-Control: no-cache
Expires: Wed, 31 Dec 1969 23:59:59 GMT
Content-Language: de
Content-Length: 13590
Vary: Accept-Encoding
Content-Type: text/html;charset=utf-8
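To find every request of this kind, rather than spotting them by eye, the
access log can be scanned for critical actions that were answered with 200.
The following is only a minimal sketch: the log layout is inferred from the
samples quoted in this mail, and the names LOG_LINE, CRITICAL_ACTIONS and
critical_hits are hypothetical, not part of XWiki.

```python
import re

# Log layout inferred from the quoted samples (an assumption):
# vhost, two dashes, client IP, [timestamp], "request line", status, size.
LOG_LINE = re.compile(
    r'^(?P<vhost>\S+) \S+ \S+ (?P<ip>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

# Hypothetical list of actions a guest should never be served with 200.
CRITICAL_ACTIONS = {"edit", "save", "delete", "deletespace"}


def critical_hits(lines):
    """Yield (time, ip, action, url) for critical actions answered with 200."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or match["status"] != "200":
            continue
        # URL layout in the samples: /wiki/<wiki name>/<action>/<space>/<page>
        parts = match["url"].split("?", 1)[0].split("/")
        if len(parts) > 3 and parts[1] == "wiki" and parts[3] in CRITICAL_ACTIONS:
            yield match["time"], match["ip"], parts[3], match["url"]
```

Each log entry has to be on a single line for this to work, so entries that
were wrapped by a mail client need to be joined first.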
============================================================================
- Some actions with confirmation forms are delivered to the Googlebot too.
Log example:
example.com - - 66.249.71.33 [27/Apr/2012:16:03:05 +0200] "GET /wiki/example.com/deletespace/Start/WebHome HTTP/1.1" 200 3702
[...]
example.com - - 66.249.71.33 [27/Apr/2012:16:43:03 +0200] "GET /wiki/example.com/deletespace/Start/WebHome?confirm=1&form_token=saMxN4MidDarWDBvxciU2w HTTP/1.1" 200 3001
So the Googlebot gets a form containing the CSRF token. Then it chose "yes"
in the delete confirmation, and our disaster was complete.
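Requests like the second one above can be recognised by their query string:
they carry both confirm=1 and a form_token. A minimal sketch of such a check
follows; the function name is_confirmed_action is hypothetical, and the
parameter names are simply the ones visible in the quoted log line.

```python
from urllib.parse import parse_qs, urlsplit


def is_confirmed_action(url):
    """Return True if the URL carries confirm=1 together with a form_token."""
    query = parse_qs(urlsplit(url).query)
    return query.get("confirm") == ["1"] and "form_token" in query
```

Applied to the request URLs from the log, this separates the plain
confirmation form from the request that actually confirmed the deletion.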
============================================================================
I can trace the Googlebot's actions very well with our logging, but I can't
reproduce them as a guest in any way. I tried with and without cookies in
several browsers and with curl from the command line.
A wild guess: there seems to be some connection with other user logins. The
most recent disastrous Googlebot actions occurred when an admin logged in
while a crawl was in progress. My guess: under some crazy circumstances, a
user's session gets switched or copied to the crawler. But I think that's
really far-fetched.
The IP does seem to belong to the Googlebot:
jloos@test:~$ host 66.249.71.33
33.71.249.66.in-addr.arpa domain name pointer crawl-66-249-71-33.googlebot.com.
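A reverse lookup alone can be faked by whoever controls the PTR record, so a
stricter check also resolves the returned host name forward and compares the
result with the original IP. The sketch below assumes that genuine crawler
host names end in googlebot.com or google.com; the function name
is_verified_googlebot and its injectable resolver parameters are hypothetical
and exist only so the logic can be tried without network access.

```python
import socket

# Assumed host name suffixes for genuine Google crawlers.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a client IP."""
    try:
        hostname = reverse(ip)[0]           # PTR lookup, like `host <ip>`
    except OSError:
        return False
    if not hostname.rstrip(".").endswith(GOOGLEBOT_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]    # A lookup of the returned name
    except OSError:
        return False
    return ip in addresses
```

Calling is_verified_googlebot("66.249.71.33") with the default resolvers
performs real DNS queries, so the result depends on the resolver in use.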
We are using XEM (XWiki Enterprise Manager). The master wiki is behind
htaccess protection, and only the wiki in question is freely accessible.
I hope this analysis isn't too detailed. And to quote Olaf:
any hints greatly appreciated!
Greetings
Jan