Knowledge Services Update - Special Libraries Association
Letters From the Front
Lori Emadi
Head, Taxonomy & Metadata
June 14, 2015
Slide 1

• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 2

Background
• RAND acquired hundreds of thousands of classified and unclassified documents from U.S. Army units returning from Iraq and Afghanistan
• Files were copied as is from hundreds of hard drives – documents are varied in content, naming, format, and structure (or lack of)
• Available tools were not suitable for working with such a large and diverse corpus of documents
• Lacked a good way to search, make use of these data
• “Letters From The Front” project was approved as an FY 2013 R&D effort
• My project team developed a document indexing, search, and visualization capability called Hermes from an open source tool set that is scalable and extensible
Slide 3

• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 4

What We Learned About This Very Large Data Collection

Units and Agencies (of 195, 29 submitted data)
23 Army units, commands, and support activities; 4 Joint, DOD, Government and Other activities; 2 Military Academies
- Assistant Secretary of the Army, Acquisition, Logistics, and Technology
- Center for Army Lessons Learned
- Center of Military History
- Stryker Center for Lessons Learned
- United States Army Combined Arms Support Command and Fort Lee
- US Army Corps of Engineers
- US Army G-8
- US Army Communications Life Cycle Management Command
- 101st Air Assault
- 10th Mountain Division (Light) and Fort Drum
- 10th Special Forces Group
- 16th Engineer Brigade
- 1st Cavalry Division
- United States Army Europe and Seventh Army
- 1st Infantry Division
- 3d Infantry Division (Mechanized) and Fort Stewart
- 3rd Armored Cavalry Regiment
- 42nd Infantry Division
- I CORPS
- III ARMY
- III CORPS and FT Hood
- United States Army Special Forces Command (Airborne)
- 75th Exploitation Task Force
- Multi-National Corps-Iraq
- Multi-National Force-Iraq
- Multi-National Security Transition Command-Iraq/Commander, NATO Training Mission-Iraq
- Office of Security Cooperation-Afghanistan
- United States Army Military Police School
- United States Military Academy

Diverse content and file types
SITREPS - FRAGOS - SIGACTS - INTSUMS - SPOT reports - AAR - WARNO - BDA - BDR - CONPLAN - Order of Battle - OPLAN - OPORD - Deployment orders - Daily Personnel Status - Military vehicle status - Alert Roster - Service Support Order - Mission Analysis Briefs - Decision Briefs - Mission Concept Briefs - Backbriefs – Balcony Briefs - Debriefs - …
Emails & Email collections (~63,000 files), PowerPoint (~171,000), PDF (~64,000), Excel (~84,000), Word and other text (~400,000), images and video (~300,000), …
Slide 5

Some interestingly named data folders …
1ID_G3\G3 Operations\FRAGOSs G3 OPS\RFI Section\SSG xxxxxxxx\DA BUCKSTER FOLDER\MILITARY RELATED CRAP\WORK RELATED CRAP\
SIRs & IRs\1ID\BEFORE WE GOT ORGANIZED\
MNSTCI\NIPR\MNCI FOLDERS\NIPS\C2\C2_SECURITY\ I.Think.This.Is.The.Template.That.You.Need.To.Use.For.Submitting.Anything.On.Me.But.I.Could.Be.Wrong.About.That.So.I.Will.Ask.About.It.Tomorrow\
1stCAV\SJA\EXSUMS\Dan’s Super Duper FRAGO Folder\
1stCAV\G3\EOD LNO\Im Thuper Thanks Fer Athkin\
1ID_G3\G3 Operations\CHOPS\G3 OPS – Battle Captains\STUFF THAT BOB JUST SAVED\
1stCAV\SJA\EXSUMS\DEAR GOD, I HOPE WE DON’T NEED THESE MONTHS\APRIL 2005\...
Slide 6

• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 7

Search Scenarios
• “What survey data exists on local (Iraqi, Afghan) population attitudes, beliefs, and information consumption patterns?”
• “[for x operation] We need to find out which brigades were deployed, when they were deployed, and who their brigade commanders were.”
• “There are a number of interesting distinctions that may be tractable, e.g., variations in communications of different entities and whether/how they coordinate…”
• “[for intelligence gathering] It takes multiple data points to create a profile. The data should be stored separately and combined to make the profile. Then you can run it across everything and look for relationships with any of the data points.”
Slide 8

Reconstructing Lost Events
• Not a scenario: Missing records on the 81st BCT (Washington State ARNG) and 82nd AB Division and their operations in Afghanistan and Iraq. (Seattle Times article, July 13, 2013)
Slide 9

• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 10

There Are Many Available Tools For Dealing With Masses of Textual Data…
Slide 11

The Solr Suite Provides Needed Speed, Scalability and Extensibility, and Is Open Source…
• “Solr™ is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.”
• (“Apache Solr,” at http://lucene.apache.org/Solr/)
…And Solr Is An Emerging Standard for Searching Large Text Databases…
Slide 12

Application Framework
File Preparation and Parsing
Search
Human Interface
(Folder navigation, file type ID, OCR, content/metadata extraction, language detection)
Modular and flexible by design, this architecture can be customized for other RAND efforts
Slide 13

• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 14

…Example search results in a similar out-of-box setting…
Columbia University Library Catalog (CLIO)
Slide 15

Metadata fields
--customizable
--implemented during processing
Slide 16

Slide 17

Query Parser Syntax

Fields
field name followed by a colon ":" then term.
title:"The Right Way" AND text:go

Wildcard Searches
single character wildcard "?" ; multiple character wildcard "*"

Regular Expression Searches
/[mb]oat/

Fuzzy Searches
use the tilde, "~" :
roam~
roam~1

Proximity Searches
use the tilde, "~" :
"jakarta apache"~10

Range Searches
Use range queries with date and non-date fields:
mod_date:[20020101 TO 20030101]
title:{Aida TO Carmen}
Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.

Boosting a Term
use the caret, "^", with a boost factor at the end of the term:
jakarta^4 apache
"jakarta apache"^4 "Apache Lucene"

Boolean Operators
AND, "+", OR, NOT and "-"
"jakarta apache" OR Jakarta
"jakarta apache" AND "Apache Lucene"
+jakarta lucene
"jakarta apache" NOT "Apache Lucene"
NOT "jakarta apache"
"jakarta apache" -"Apache Lucene"

Grouping
use parentheses to group clauses to form sub queries.
(jakarta OR apache) AND website

Field Grouping
use parentheses to group multiple clauses:
title:(+return +"pink panther")

Escaping Special Characters
To escape use the \ before the character.
Ex: to search for (1+1):2 use the query: \(1\+1\)\:2
Slide 18

• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 19

Next Steps
• Security around collection, User authentication
• Natural Language Processing (NLP)
• Extracting content attachments in emails (while keeping the attachments in place)
• Additional visualization options
• Enhanced logging and tracking
• Ability for users to rank content, add or edit metadata content
• Enhance user interface
Slide 20

Thank You! Any Questions?
emadi@rand.org
Slide 21
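
Note on the Query Parser Syntax slide (Slide 18): the escaping and range rules listed there can be applied in code before a query string is sent to a search engine. The sketch below is illustrative only and is not taken from Hermes. The helper names (escape_term, field_term, range_query) are hypothetical, and the set of special characters is an assumption based on the operators shown on the slide plus "&", "|" and "/".

```python
# Illustrative helpers only: these names are hypothetical and are not part
# of Hermes or of any Solr client library.

# Characters assumed to be query syntax (operators from the slide, plus & | /).
SPECIAL_CHARS = set('+-!(){}[]^"~*?:\\/&|')


def escape_term(text: str) -> str:
    """Put a backslash before every query-syntax character in text."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def field_term(field: str, term: str) -> str:
    """Build a field:term clause with the term escaped.

    Whitespace is not escaped, so this suits single terms, not phrases.
    """
    return f"{field}:{escape_term(term)}"


def range_query(field: str, low: str, high: str, inclusive: bool = True) -> str:
    """Build a range clause: square brackets inclusive, curly brackets exclusive."""
    left, right = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{left}{low} TO {high}{right}"


if __name__ == "__main__":
    print(escape_term("(1+1):2"))                                   # \(1\+1\)\:2
    print(range_query("mod_date", "20020101", "20030101"))          # mod_date:[20020101 TO 20030101]
    print(range_query("title", "Aida", "Carmen", inclusive=False))  # title:{Aida TO Carmen}
```

Escaping user-supplied text this way keeps characters such as ":" or "(" from being read as field separators or grouping operators, which matters for a corpus with folder and file names as irregular as those on Slide 6.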