Apache Solr Content Extraction Library (Solr Cell) Release Notes This file describes changes to the Solr Cell (contrib/extraction) module. See SOLR-284 for details. Introduction ------------ Apache Solr Extraction provides a means for extracting and indexing content contained in "rich" documents, such as Microsoft Word, Adobe PDF, etc. (Each name is a trademark of their respective owners) This contrib module uses Apache Tika to extract content and metadata from the files, which can then be indexed. For more information, see http://wiki.apache.org/solr/ExtractingRequestHandler Getting Started --------------- You will need Solr up and running. Then, simply add the extraction JAR file, plus the Tika dependencies (in the ./lib folder) to your Solr Home lib directory. See http://wiki.apache.org/solr/ExtractingRequestHandler for more details on hooking it in and configuring. Tika Dependency --------------- Current Version: Tika 1.1 (released 2012-03-23) $Id$ ================== Release 4.0.0-ALPHA ============== * SOLR-3254: Upgrade Solr to Tika 1.1 (janhoy) ================== Release 3.6.0 ================== * SOLR-2346: Add a chance to set content encoding explicitly via content type of stream. This is convenient when Tika's auto detector cannot detect encoding, especially the text file is too short to detect encoding. (koji) * SOLR-2901: Upgrade Solr to Tika 1.0 (janhoy) * SOLR-3295: netcdf jar is excluded from the binary release (and disabled in ivy.xml) because it requires java 6. If you want to parse this content and are willing to use java 6, just add the jar. (rmuir) ================== Release 3.5.0 ================== * SOLR-2372: Upgrade Solr to Tika 0.10 (janhoy) ================== Release 3.4.0 ================== * SOLR-2540: CommitWithin as an Update Request parameter You can now specify &commitWithin=N (ms) on the update request (janhoy) * SOLR-2743: Remove commons logging. (koji) ================== Release 3.3.0 ================== (No Changes) ================== Release 3.2.0 ================== * SOLR-2480: Add ignoreTikaException flag so that users can ignore TikaException but index meta data. (Shinichiro Abe, koji) ================== Release 3.1.0 ================== * SOLR-1902: Upgraded to Tika 0.8 and changed deprecated parse call * SOLR-1756: The date.format setting causes ClassCastException when enabled and the config code that parses this setting does not properly use the same iterator instance. (Christoph Brill, Mark Miller) * SOLR-18913: Add ICU4j to libs and add tests for Arabic extraction (Robert Muir via gsingers) * SOLR-1902: Upgraded to Tika 0.8-SNAPSHOT to incorporate passing in Solr's custom ClassLoader (gsingers) ================== Release 1.4.0 ================== 1. SOLR-284: Added in support for extraction. (Eric Pugh, Chris Harris, gsingers) 2. SOLR-284: Removed "silent success" key generation (gsingers) 3. SOLR-1075: Upgrade to Tika 0.3. See http://www.apache.org/dist/lucene/tika/CHANGES-0.3.txt (gsingers) 4. SOLR-1128: Added metadata output to "extract only" option. (gsingers) 5. SOLR-1310: Upgrade to Tika 0.4. Note there are some differences in detecting Languages now. See http://www.lucidimagination.com/search/document/d6f1899a85b2a45c/vote_apache_tika_0_4_release_candidate_2#d6f1899a85b2a45c for discussion on language detection. See http://www.apache.org/dist/lucene/tika/CHANGES-0.4.txt. (gsingers) 6. SOLR-1274: Added text serialization output for extractOnly (Peter Wolanin, gsingers)