
Lucid Imagination Forum » LucidWorks Enterprise

Crawler and Relative URLs

(2 posts) (2 voices)
  • Started 4 months ago by Lance
  • Latest reply from Andrzej Bialecki


  1. Lance
    Professional Services Engineer

    (For Rick Mendes at Intuit)

    I am indexing data with the crawler for one of my projects. One of the data sets comes from a desktop application that generates a sitemap for its help content. The sitemap contains URLs that the application calls relative, but I call them pseudo-relative because they don't include the leading slash. Here is a snippet of the sitemap file:

        <url><loc>text/en/quicktax/worksheets/qt_ws_medical.html</loc></url>

    When I crawl the site hosting this sitemap, the crawler does not index any of the documents, even though they are on the site. Is there any way to configure the crawler so that it does not ignore these URLs?

    Posted 4 months ago #
  2. Andrzej Bialecki
    Moderator

    LucidWorks uses Aperture to provide simple, small-scale web crawling. Unfortunately, Aperture doesn't support sitemaps, so the sitemap itself was ignored. Some of the documents listed in it might still have been found and indexed if they were linked from other pages.

    For more functionality or larger-scale crawls you should use an external crawler, e.g. Nutch. The crawler-commons project has some code for handling robot rules and sitemaps, but it's not integrated with Nutch yet. If enough people request this functionality, and perhaps provide patches, I'm sure it will be added to Nutch. You can also try ManifoldCF, which has some support for sitemaps and can index documents into LucidWorks. Or you can run any other external crawler and feed its results in through the "External Data Source" described in the documentation.

    Posted 4 months ago #
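
If you go the external route, the relative <loc> entries have to be turned into absolute URLs before anything can fetch them. Here is a minimal sketch of that step in Python, using only the standard library. The function name resolve_sitemap_urls and the host help.example.com are made up for illustration; nothing here is part of LucidWorks, Aperture, or Nutch. It assumes the entries should be resolved against the URL the sitemap was served from, the same way a browser resolves a relative link.

```python
from typing import List
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def resolve_sitemap_urls(sitemap_xml: str, base_url: str) -> List[str]:
    """Return every <loc> value in the sitemap as an absolute URL.

    Hypothetical helper: relative entries are resolved against base_url,
    absolute entries are returned unchanged.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for elem in root.iter():
        # Drop any XML namespace prefix so namespaced and bare <loc> tags both match.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "loc" and elem.text and elem.text.strip():
            urls.append(urljoin(base_url, elem.text.strip()))
    return urls


# The <url> entry is the snippet from the question; the <urlset> wrapper is assumed.
SAMPLE = """<urlset>
  <url><loc>text/en/quicktax/worksheets/qt_ws_medical.html</loc></url>
</urlset>"""

if __name__ == "__main__":
    # help.example.com is a placeholder host, not the real site.
    for url in resolve_sitemap_urls(SAMPLE, "http://help.example.com/"):
        print(url)
```

Note that the base URL matters: with a base of http://help.example.com/docs/sitemap.xml the same entry resolves under /docs/ rather than the site root, so check which one matches where the help files actually live. The resulting list can then be handed to whichever external crawler or fetch script you use.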


Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2012 Lucid Imagination. All rights reserved.