Lucid Imagination Forum » LucidWorks Enterprise

Indexing S3 Bucket

(4 posts) (2 voices)
  • Started 4 months ago by justin.brister
  • Latest reply from justin.brister

  1. justin.brister
    Member

    I am trying to index an S3 bucket using the s3n connector. I have a valid bucket, include pattern, and access keys configured, but I am getting an exception from the crawler. I have tried changing the crawl and match patterns, and I end up getting either a file-not-found or an authentication error.

    The files in S3 are .jpg and .json files and have simply been uploaded via a Java web app.

    The LWE install is the current version running on a Mac and is straight out of the box.

    Is there anything I need to configure other than the data source on LWE? Is there something that I am missing in terms of using the s3n connector? The documentation for the data source connector mentions Hadoop. Is there something that I need to do with the Hadoop libraries to make this work?

    Thanks in advance,

    j

    Posted 4 months ago #
  2. Lance
    Professional Services Engineer

    Hi-

    When do you get these error messages? When you try to save the Data Source, or when you run a crawl?

    In your LWE home directory, look at data/logs. The 'core' logs are from the underlying LWE Core app, which tries to validate the configuration. There might be error messages in there. For example, when I use the wrong bucket name, I get this error message:

    2012-01-12 01:01:31,104 INFO crawl.CrawlStatus - end job id: 19 took 00:00:01.137 counters: num_new=1, num_unchanged=0, num_updated=0, num_deleted=0, num_failed=0, num_total=1 state: EXCEPTION exception: org.apache.hadoop.fs.s3.S3Exception: S3 GET failed for '/' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error>NoSuchBucket<Message>The specified bucket does not exist</Message><BucketName>s3crawler2</BucketName>

    These server logs are also available through the UI. Under 'Status' there is a link in the middle of the page: "The complete log files may be downloaded from here."

    Posted 4 months ago #
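
The log check suggested in the reply above can be scripted. The following is a minimal sketch, assuming the core log contains `crawl.CrawlStatus` summary lines in the same shape as the one quoted in that reply. The function name `failed_jobs` and the regular expression are illustrative and are not part of LWE.

```python
import re

# Matches the end-of-job summary in the shape quoted in the post above.
# The pattern is an assumption derived from that single sample line.
JOB_RE = re.compile(
    r"end job id: (?P<job>\d+).*?state: (?P<state>\w+)"
    r"(?: exception: (?P<exc>[\w.$]+))?"
)

def failed_jobs(lines):
    """Return (job id, exception class) for every job that ended in EXCEPTION."""
    failures = []
    for line in lines:
        if "crawl.CrawlStatus" not in line:
            continue
        match = JOB_RE.search(line)
        if match and match.group("state") == "EXCEPTION":
            failures.append((match.group("job"), match.group("exc")))
    return failures

# Shortened copy of the log line from the post above.
SAMPLE = (
    "2012-01-12 01:01:31,104 INFO crawl.CrawlStatus - end job id: 19 "
    "took 00:00:01.137 counters: num_new=1, num_unchanged=0, num_updated=0, "
    "num_deleted=0, num_failed=0, num_total=1 state: EXCEPTION "
    "exception: org.apache.hadoop.fs.s3.S3Exception: S3 GET failed for '/'"
)

print(failed_jobs([SAMPLE]))
```

To scan a real log, pass an open file handle from the data/logs directory instead of the one-element list. Any iterable of lines works.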
  3. justin.brister
    Member

    Lance,

    The configuration saves fine. At crawl time I get the following error:

    ERROR doc_failed [S3]

    HEAD request failed

    Response Code: 403

    Forbidden

    This error is reported against the crawl URL, so the crawler is not getting as far as looking at the include patterns.

    I am using a path of the form:

    s3n://<host>/<bucket>/<path>/ (e.g. s3n://s3.amazon.com/mybucket//images/) for the crawl URL, and

    s3n://<host>/<bucket>/<path>/filename.json (e.g. s3n://s3.amazon.com/mybucket//images/image.jpg.json) for the include pattern.

    I have put my Access Key and Secret in the username and password fields, and I have configured permissions on the bucket so that Authenticated has list and view permissions.

    NOTE: The root folder in the bucket is simply / hence the // after the bucketname in the path!

    Posted 4 months ago #
  4. justin.brister
    Member

    Lance,

    I figured it out :)

    The / folder that I have in my bucket is being stripped out of the requests from LWE / Solr and so the requested path is never correct. I created a new bucket that does not have this folder in its path and everything works fine.

    One for the bug list I think :)

    Thanks,

    J

    Posted 4 months ago #
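
The failure described in the last post can be illustrated with a short sketch. It assumes the URL layout given earlier in the thread (s3n://<host>/<bucket>/<path>) and uses `posixpath.normpath` as a stand-in for whatever path clean-up LWE or the underlying Hadoop libraries perform. The thread does not confirm the actual mechanism, and the helper name `s3_key` is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

def s3_key(url, normalize):
    """Split an s3n://<host>/<bucket>/<key> URL into (bucket, key).

    With normalize=True the path is cleaned up first, which collapses the
    empty segment produced by a folder literally named "/".
    """
    path = urlsplit(url).path
    if normalize:
        # Stand-in for the crawler's path handling (an assumption).
        path = posixpath.normpath(path)
    bucket, _, key = path.lstrip("/").partition("/")
    return bucket, key

URL = "s3n://s3.amazon.com/mybucket//images/image.jpg.json"

print(s3_key(URL, normalize=False))  # ('mybucket', '/images/image.jpg.json')
print(s3_key(URL, normalize=True))   # ('mybucket', 'images/image.jpg.json')
```

The two keys differ by the leading slash, so a request built from the normalized path points at an object that does not exist in the bucket. That is consistent with the workaround in the post: a bucket without the "/" folder is unaffected.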

Apache Solr, Solr, Apache Lucene, Lucene and their logos are trademarks of the Apache Software Foundation.

© 2012 Lucid Imagination. All rights reserved.