Knowledge Services Update - Special Libraries Association

Letters From the Front
Lori Emadi
Head, Taxonomy & Metadata
June 14, 2015
Slide 1
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 2
Background
• RAND acquired hundreds of thousands of classified and unclassified documents from U.S. Army units returning from Iraq and Afghanistan
• Files were copied as is from hundreds of hard drives – documents vary in content, naming, format, and structure (or lack thereof)
• Available tools were not suitable for working with such a large and diverse corpus of documents
• There was no good way to search or make use of these data
• The “Letters From The Front” project was approved as an FY 2013 R&D effort
• My project team developed a document indexing, search, and visualization capability called Hermes from an open source tool set that is scalable and extensible
Slide 3
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 4
What We Learned About This Very Large Data Collection
Units and Agencies (of 195, 29 submitted data)
• 23 Army units, commands, and support activities
• 4 Joint, DOD, Government and Other activities
• 2 Military Academies
Assistant Secretary of the Army, Acquisition, Logistics, and Technology - Center for Army Lessons Learned - Center of Military History - Stryker Center for Lessons Learned - United States Army Combined Arms Support Command and Fort Lee - US Army Corps of Engineers - US Army G-8 - US Army Communications Life Cycle Management Command - 101st Air Assault - 10th Mountain Division (Light) and Fort Drum - 10th Special Forces Group - 16th Engineer Brigade - 1st Cavalry Division - United States Army Europe and Seventh Army - 1st Infantry Division - 3d Infantry Division (Mechanized) and Fort Stewart - 3rd Armored Cavalry Regiment - 42nd Infantry Division - I CORPS - III ARMY - III CORPS and FT Hood - United States Army Special Forces Command (Airborne) - 75th Exploitation Task Force - Multi-National Corps-Iraq - Multi-National Force-Iraq - Multi-National Security Transition Command-Iraq/Commander, NATO Training Mission-Iraq - Office of Security Cooperation-Afghanistan - United States Army Military Police School - United States Military Academy
Diverse content and file types
Content: SITREPS - FRAGOS - SIGACTS - INTSUMS - SPOT reports - AAR - WARNO - BDA - BDR - CONPLAN - Order of Battle - OPLAN - OPORD - Deployment orders - Daily Personnel Status - Military vehicle status - Alert Roster - Service Support Order - Mission Analysis Briefs - Decision Briefs - Mission Concept Briefs - Backbriefs - Balcony Briefs - Debriefs - …
File types: Emails & Email collections (~63,000 files), PowerPoint (~171,000), PDF (~64,000), Excel (~84,000), Word and other text (~400,000), images and video (~300,000), …
Slide 5
Some interestingly named data folders …
1ID_G3\G3 Operations\FRAGOSs G3 OPS\RFI Section\SSG xxxxxxxx\DA BUCKSTER FOLDER\MILITARY RELATED CRAP\WORK RELATED CRAP\
SIRs & IRs\1ID\BEFORE WE GOT ORGANIZED\
MNSTCI\NIPR\MNCI FOLDERS\NIPS\C2\C2_SECURITY\I.Think.This.Is.The.Template.That.You.Need.To.Use.For.Submitting.Anything.On.Me.But.I.Could.Be.Wrong.About.That.So.I.Will.Ask.About.It.Tomorrow\
1stCAV\SJA\EXSUMS\Dan’s Super Duper FRAGO Folder\
1stCAV\G3\EOD LNO\Im Thuper Thanks Fer Athkin\
1ID_G3\G3 Operations\CHOPS\G3 OPS – Battle Captains\STUFF THAT BOB JUST SAVED\
1stCAV\SJA\EXSUMS\DEAR GOD, I HOPE WE DON’T NEED THESE MONTHS\APRIL 2005\...
Slide 6
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 7
Search Scenarios
• “What survey data exists on local (Iraqi, Afghan) population attitudes, beliefs, and information consumption patterns?”
• “[for x operation] We need to find out which brigades were deployed, when they were deployed, and who their brigade commanders were.”
• “There are a number of interesting distinctions that may be tractable, e.g., variations in communications of different entities and whether/how they coordinate…”
• “[for intelligence gathering] It takes multiple data points to create a profile. The data should be stored separately and combined to make the profile. Then you can run it across everything and look for relationships with any of the data points.”
Slide 8
Reconstructing Lost Events
• Not a scenario: Missing records on the 81st BCT (Washington State ARNG) and 82nd AB Division and their operations in Afghanistan and Iraq. (Seattle Times article, July 13, 2013)
Slide 9
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 10
There Are Many Available Tools For Dealing With Masses of Textual Data…
Slide 11
The Solr Suite Provides Needed Speed, Scalability and Extensibility, and Is Open Source…
• “Solr™ is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.”
• (“Apache Solr,” at http://lucene.apache.org/Solr/)
…And Solr Is An Emerging Standard for Searching Large Text Databases…
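To make the features quoted above concrete, here is a minimal sketch of how a full-text query with faceting and hit highlighting could be expressed as a request to Solr's /select handler. The host, port, core name ("hermes"), and field names ("text", "file_type", "unit") are assumptions for illustration only; the slides do not describe how Hermes is configured. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical values: host, port, core name, and field names are
# placeholders for illustration, not details of the Hermes deployment.
SOLR_BASE = "http://localhost:8983/solr/hermes"


def build_select_url(query, facet_fields=(), highlight_field=None, rows=10):
    """Build a URL for Solr's /select request handler."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", name) for name in facet_fields)
    if highlight_field:
        params.append(("hl", "true"))
        params.append(("hl.fl", highlight_field))
    return SOLR_BASE + "/select?" + urlencode(params)


url = build_select_url(
    'text:"brigade commander"',
    facet_fields=["file_type", "unit"],
    highlight_field="text",
)
print(url)
```

Facet counts per file type or unit are what would let a user narrow a large result set by clicking a category rather than rewriting the query.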
Slide 12
Application Framework
File Preparation and Parsing (Folder navigation, file type ID, OCR, content/metadata extraction, language detection)
Search
Human Interface
Modular and flexible by design, this architecture can be customized for other RAND efforts
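As a rough illustration of the file preparation stage, the sketch below covers only two of the listed tasks, folder navigation and file type identification, plus a few basic metadata values. It is not the Hermes implementation: the record field names are hypothetical, the type guess relies on file extensions alone, and OCR, content extraction, and language detection are left out.

```python
import mimetypes
import os
import tempfile


def scan_folder(root):
    """Walk a folder tree and yield one metadata record per file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # guess_type() looks only at the file extension; a real pipeline
            # would likely also inspect the file contents.
            mime, _encoding = mimetypes.guess_type(name)
            yield {
                "path": os.path.relpath(path, root),
                "folder": os.path.relpath(dirpath, root),
                "file_type": mime or "unknown",
                "size_bytes": os.path.getsize(path),
            }


# Small demonstration on a throwaway folder tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "G3", "FRAGOs"))
    with open(os.path.join(root, "G3", "FRAGOs", "frago_001.txt"), "w") as f:
        f.write("sample text")
    with open(os.path.join(root, "G3", "notes"), "w") as f:
        f.write("x")
    records = list(scan_folder(root))

for record in records:
    print(record)
```

Keeping the original folder path in each record matters for a collection like this one, where folder names often carry the only available context about a file.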
Slide 13
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 14
…Example search results in a similar out-of-the-box setting…
Columbia University Library Catalog (CLIO)
Slide 15
Metadata fields
-- customizable
-- implemented during processing
Slide 16
Slide 17
Query Parser Syntax
Fields: field name followed by a colon ":" then the term. Example: title:"The Right Way" AND text:go
Wildcard Searches: single character wildcard "?"; multiple character wildcard "*"
Regular Expression Searches: /[mb]oat/
Fuzzy Searches: use the tilde, "~". Examples: roam~ ; roam~1
Proximity Searches: use the tilde, "~". Example: "jakarta apache"~10
Range Searches: use range queries with date and non-date fields. Examples: mod_date:[20020101 TO 20030101] ; title:{Aida TO Carmen}. Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
Boosting a Term: use the caret, "^", with a boost factor at the end of the term. Examples: jakarta^4 apache ; "jakarta apache"^4 "Apache Lucene"
Boolean Operators: AND, "+", OR, NOT and "-". Examples:
  "jakarta apache" OR jakarta
  "jakarta apache" AND "Apache Lucene"
  +jakarta lucene
  "jakarta apache" NOT "Apache Lucene"
  NOT "jakarta apache"
  "jakarta apache" -"Apache Lucene"
Grouping: use parentheses to group clauses to form sub queries. Example: (jakarta OR apache) AND website
Field Grouping: use parentheses to group multiple clauses. Example: title:(+return +"pink panther")
Escaping Special Characters: use the \ before the character. Example: to search for (1+1):2 use the query \(1\+1\)\:2
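The escaping rule is easy to get wrong when queries are assembled from user input, so a small helper can apply it automatically. The sketch below is an assumption about how a search front end might do this, not code from Hermes; the function names are hypothetical. It escapes the characters that the Lucene query parser treats as syntax and builds a quoted field query.

```python
import re

# Characters the Lucene classic query parser treats as syntax.
# "&&" and "||" are two-character operators; escaping each "&" and "|"
# individually covers them.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(text):
    """Backslash-escape query-syntax characters in a literal term."""
    return _SPECIAL.sub(r"\\\1", text)


def field_query(field, phrase):
    """Build field:"phrase"; inside quotes only \\ and " are escaped here."""
    quoted = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{quoted}"'


print(escape_term("(1+1):2"))
print(field_query("title", "The Right Way"))
```

The first call reproduces the escaping example from the slide; the second reproduces the field query example.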
Slide 18
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 19
Next Steps
• Security around the collection; user authentication
• Natural Language Processing (NLP)
• Extracting content from attachments in emails (while keeping the attachments in place)
• Additional visualization options
• Enhanced logging and tracking
• Ability for users to rank content and to add or edit metadata content
• Enhanced user interface
Slide 20
Thank You! Any Questions?
emadi@rand.org
Slide 21