Indexing. 2 Accessing Data During Query Evaluation Scan the entire collection Typical in early batch retrieval systems Still used today, in hardware form.

Indexing

2 Accessing Data During Query Evaluation. Scan the entire collection. Typical in early batch retrieval systems. Still used today, in hardware form (e.g., Fast Data Finder). Computational and I/O costs are O(characters in collection). Practical only for small collections.

3 Accessing Data During Query Evaluation. Use indexes for direct access. Evaluation time is O(query term occurrences in collection). Practical for large collections. Many opportunities for optimization.
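
The contrast between scanning and indexed access can be sketched in a few lines. This is a minimal illustration, not a production design: the toy `docs` collection and the function names `scan_search` and `index_search` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy collection; a document id is its position in the list.
docs = [
    "indexes give direct access",
    "scan the entire collection",
    "large collection needs indexes",
]

def scan_search(term):
    """Full scan: every document is read on every query."""
    return [doc_id for doc_id, text in enumerate(docs) if term in text.split()]

# Build the index once, before any query is issued.
index = defaultdict(list)
for doc_id, text in enumerate(docs):
    for word in sorted(set(text.split())):
        index[word].append(doc_id)

def index_search(term):
    """Index lookup: only the entries for the query term are touched."""
    return index.get(term, [])
```

Both functions return the same document ids; the difference is that the scan cost grows with the size of the collection, while the lookup cost grows with the number of occurrences of the query term.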

4 What should the Index contain? Database systems index primary and secondary keys. The index provides fast access to a subset of database records, and that subset is scanned to find the solution set. IR problem: we cannot predict the keys that people will use in queries, because every word in a document is a potential search term. Solution: index by all keys (words).

5 Some vocabulary about Indexing. File organizations or indexes are used to increase the performance of the system. Text indexing is the process of deciding what will be used to represent a given document. Index terms are used to build indexes for the documents. The retrieval model describes how the indexed terms are incorporated into a model. There is a relationship between the retrieval model and the indexing model.

6 Accessing the Index. The index is accessed through features, keys, or terms. Keys/terms can be atomic or complex. Most common atomic keys/terms: words in text and punctuation; manually assigned terms (controlled and uncontrolled vocabulary); document structure (sentence and paragraph boundaries); inter- or intra-document links (e.g., citations).

7 Accessing the Index. Composed features. Sequences: phrases, names, dates, monetary amounts. Sets: synonym classes.

8 Manual vs. Automatic Indexing. Manual or human indexing: indexers decide which keywords to assign to a document based on a controlled vocabulary (e.g., MEDLINE, Yahoo); significant cost. Automatic indexing: an indexing program decides which words, phrases, or other features to use from the text of the document; indexing speeds range widely.

9 Manual vs. Automatic Indexing. [Diagram: a spectrum from manual to automatic indexing. Manual end: controlled vocabulary; current indexing practice is text categorization ("intelligent" IR). Automatic end: free text; current indexing practice is text search engines ("statistical" IR).]

10 Manual vs. Automatic Indexing. Experimental evidence indicates that retrieval using automatic indexing can be at least as effective as retrieval using manual indexing with controlled vocabularies. Experiments have also shown that using both manual and automatic indexing improves performance.

11 Some vocabulary words. Index language: the language used to describe documents and queries. Exhaustivity: the number of different topics indexed; completeness. Specificity: the level of accuracy of indexing. Pre-coordinate indexing: combinations of index terms are used as indexing labels (e.g., an author lists key phrases of a paper). Post-coordinate indexing: combinations are generated at search time; most common and the focus of this course.

12 Indexing Choices. What is a word? Embedded punctuation (e.g., MD-11, hard-core). Case folding (e.g., New vs. new, Apple vs. apple). Stopwords (e.g., the, an, a, on). Morphology (e.g., computer, compute, computing). Index granularity has a large impact on speed and effectiveness. Index terms? Index surface forms? Both?

13 Basic automatic Indexing. Parse documents to recognize structure (e.g., title, date, other fields). Scan for word tokens (numbers, special characters, hyphenation, capitalization, etc.). Remove stopwords. Stem words. Weight words, so that more important words have higher weight. Optional: phrase indexing, thesaurus classes.
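
The token, stopword, stem, and weight steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer, and raw term frequency stands in for whatever weighting scheme a real system would use.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "an", "a", "on", "of", "and", "in", "to"}

def tokenize(text):
    """Case-fold and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Return stem -> term frequency, used here as a simple weight."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(crude_stem(t) for t in tokens)
```

For example, `index_terms("Computing on the computers")` drops "on" and "the" and maps both remaining words to the stem `comput` with weight 2. Note that this tokenizer splits "MD-11" into two tokens, which is exactly the kind of indexing choice the earlier slide raises.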

14 Words vs. Terms vs. Concepts. Simple indexing is based on words or word stems. More complex indexing could include phrases or thesaurus classes. Concept-based retrieval is often used to imply something beyond word indexing. Words, phrases, synonyms, and linguistics can all be evidence used to infer the presence of a concept. E.g., the concept “information retrieval” can be inferred from the presence of the words “information” and “retrieval”, the phrase “information retrieval”, and maybe the phrase “text retrieval”.

15 Phrases. Both statistical and syntactic methods have been used to identify good phrases. Proven techniques include finding all word pairs that occur more than n times in the corpus, or using a part-of-speech tagger to identify simple noun phrases. 1,100,000 phrases were extracted from all TREC data. Phrases can have an impact on both effectiveness and efficiency. Phrase indexing will speed up phrase queries. Finding documents containing the phrase “White House” is better than finding documents containing both words.
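
The statistical technique mentioned above (word pairs that occur more than n times) can be sketched as follows. The function name `frequent_pairs` and the whitespace tokenization are assumptions for the example; the part-of-speech approach is not shown.

```python
from collections import Counter

def frequent_pairs(docs, n):
    """Return adjacent word pairs that occur more than n times in docs."""
    counts = Counter()
    for text in docs:
        words = text.lower().split()
        # zip pairs each word with its successor inside one document.
        counts.update(zip(words, words[1:]))
    return {pair: c for pair, c in counts.items() if c > n}
```

With the three toy documents "the white house said", "white house staff", and "a white car near the house", calling `frequent_pairs(docs, 1)` keeps only the pair ("white", "house"), which occurs twice.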

16 Information Extraction. Special recognizers for specific concepts: people, organizations, places, dates, amounts, products. Meta terms such as #COMPANY, #PERSON can be added to indexing.

17 Indexing Example

18 Implementations. Common implementations of indexes: bitmaps, signature files, inverted files, hashing, n-grams.

19 N-grams. More information is available from a character-level N-gram builder for Thai. The file used to try building N-grams was a Thai news file containing several thousand news items, with 28,694,548 characters in total (77 MB). These characters include punctuation, digits, and any other characters that occur in the news. After the program finished running, the 10 most frequent characters were: า (1,901,143), น (1,553,522), space (1,493,261), ร (1,445,651), ่ (1,214,212), ก (1,182,815), อ (1,089,453), เ (1,006,035), ง (984,559), ม (927,818). Casual observations: the vowel sara aa (า) occurs most often, with a frequency of 1,901,143; the space is the third most frequent character, although it was initially expected to be rare in Thai; the most frequent bigram is าร (sara aa followed by ro ruea), with a frequency of 311,818. http://catadmin.cattelecom.com/km/blog/kittichonm/category/search-engine/n-gram/
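
Character-level n-gram counting of this kind can be sketched in a few lines. This is not the program referenced above, only a minimal illustration; the function name `char_ngrams` is an assumption.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

Python strings are sequences of Unicode code points, so Thai tone marks and vowels such as ่ and า are counted as characters in their own right, and spaces are counted too. For example, `char_ngrams("banana", 2)` counts "an" and "na" twice each and "ba" once, and `char_ngrams(text, 1).most_common(10)` would give a top-10 table for a string `text`.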

20 Indexes: Inverted Lists. Inverted lists are today the most common indexing technique. Source file: the collection, organized by document. Inverted file: the collection, organized by term, with one record per term listing the locations where the term occurs.

21 Inverted Lists. During query evaluation, traverse the lists for each query term. OR: the union of the component lists. AND: an intersection of the component lists. Proximity: an intersection of the component lists. SUM: the union of the component lists, where each entry has a score.
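
The OR, AND, and SUM operators can be sketched over document-level inverted lists as follows. The function names and the representation (sorted lists of document ids, and lists of (doc_id, score) pairs for SUM) are assumptions for the example; proximity needs word positions and is not shown here.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to a sorted list of the document ids containing it."""
    inverted = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.lower().split())):
            inverted[term].append(doc_id)  # doc ids arrive in increasing order
    return inverted

def intersect(a, b):
    """AND: merge two sorted lists, keeping ids present in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: ids present in either list."""
    return sorted(set(a) | set(b))

def sum_scores(a, b):
    """SUM: union of two (doc_id, score) lists, adding scores per document."""
    totals = defaultdict(float)
    for doc_id, score in a + b:
        totals[doc_id] += score
    return sorted(totals.items())
```

For the toy documents "pease porridge hot", "pease porridge cold", and "some like it hot", intersecting the lists for "pease" and "hot" yields only document 0, while the union of "cold" and "hot" yields all three.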

22 Inverted Files. Example text: each line is a document.

23 Inverted Files

24 Word-Level Inverted File
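
A word-level inverted file can be sketched as follows, assuming it stores, for each term, the word positions at which the term occurs in each document. The names `build_word_level` and `within`, and the nested-dictionary layout, are assumptions for the example rather than the layout used on the slide.

```python
from collections import defaultdict

def build_word_level(docs):
    """Map term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def within(index, t1, t2, k):
    """Proximity: doc ids where t1 and t2 occur at most k words apart."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    # Only documents containing both terms can match (an intersection).
    for doc_id in sorted(p1.keys() & p2.keys()):
        if any(abs(a - b) <= k for a in p1[doc_id] for b in p2[doc_id]):
            hits.append(doc_id)
    return hits
```

Storing positions makes the index larger than a document-level one, but it is what allows a proximity or phrase operator to be evaluated from the lists alone.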

25 Index Construction Methods. Memory-based inversion. Sort-based inversion. All of the above, combined with compression. FAST-INV. Based on text partitioning.
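
The idea behind sort-based inversion can be sketched as: emit one (term, document) pair per occurrence, sort the pairs by term, then group them into lists. The version below is an assumption-laden simplification that sorts everything in memory; a sort-based method for a large collection would write sorted runs to disk and merge them.

```python
from itertools import groupby

def sort_based_inversion(docs):
    """Invert docs by sorting (term, doc_id) pairs and grouping by term."""
    postings = [
        (term, doc_id)
        for doc_id, text in enumerate(docs)
        for term in set(text.lower().split())
    ]
    postings.sort()  # orders by term, then by doc_id
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(postings, key=lambda p: p[0])
    }
```

For the two documents "b a" and "a c", the result maps "a" to [0, 1], "b" to [0], and "c" to [1].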

26 Index Construction: Overview Total text size 5 GB, with 5 million documents, 40 MB main memory

27 Expanding the Index. The simplest way to handle document insertion for the inverted file index: accumulate updates in a stop-press file. For each query issued, the stop-press file is checked. When the stop-press file grows too large, re-index the entire collection. Major disadvantage: to keep performance up to scratch, the stop-press file must be kept small, so re-indexing needs to be done often, while each re-indexing takes longer as the data set grows.
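
The stop-press scheme can be sketched as a small class. Everything here is an illustrative assumption: the class name `StopPressIndex`, the in-memory list standing in for the stop-press file, and the `limit` threshold that triggers a full re-index.

```python
class StopPressIndex:
    """Main inverted index plus a small list of not-yet-indexed documents."""

    def __init__(self, docs, limit=2):
        self.docs = list(docs)
        self.limit = limit      # assumed threshold for the stop-press size
        self.stop_press = []    # (doc_id, text) pairs awaiting re-indexing
        self.rebuilds = 0
        self._reindex()

    def _reindex(self):
        """Rebuild the main index over the entire collection."""
        self.main = {}
        for doc_id, text in enumerate(self.docs):
            for term in set(text.lower().split()):
                self.main.setdefault(term, []).append(doc_id)
        self.stop_press = []
        self.rebuilds += 1

    def insert(self, text):
        """Add a document to the stop-press; re-index when it grows too large."""
        self.docs.append(text)
        self.stop_press.append((len(self.docs) - 1, text))
        if len(self.stop_press) > self.limit:
            self._reindex()

    def search(self, term):
        """Check the main index, then scan the stop-press for the term."""
        hits = list(self.main.get(term, []))
        hits += [d for d, text in self.stop_press if term in text.lower().split()]
        return hits
```

The trade-off described above is visible in the sketch: every `search` scans the stop-press list linearly, and every `_reindex` walks the whole collection, so a small `limit` keeps queries fast at the price of frequent rebuilds.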

28 Indexes: Signature Files. Bag of words only. For each term, allocate a fixed-size s-bit vector (signature). Define a hash function so that each term has an s-bit signature, which may not be unique. OR the term signatures to form the document signature. Long documents are a problem; usually they are segmented into smaller pieces.
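
A signature file can be sketched as follows. The signature width `S`, the number of bits set per term `K`, and the use of SHA-256 as the hash function are all assumptions chosen for the example, not values from the slides.

```python
import hashlib

S = 64  # signature width in bits (assumed)
K = 3   # bits set per term (assumed)

def term_signature(term):
    """Hash a term to an S-bit integer with at most K bits set."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    sig = 0
    for i in range(K):
        # Use two digest bytes per probe to pick a bit position.
        pos = int.from_bytes(digest[2 * i:2 * i + 2], "big") % S
        sig |= 1 << pos
    return sig

def document_signature(text):
    """OR together the signatures of all distinct terms in the text."""
    sig = 0
    for term in set(text.lower().split()):
        sig |= term_signature(term)
    return sig

def might_contain(doc_sig, term):
    """True if every bit of the term signature is set in doc_sig."""
    t = term_signature(term)
    return (doc_sig & t) == t
```

Because term signatures may collide, `might_contain` can report a term that is not actually in the document, so a match has to be verified against the text; it never misses a term that is present. The more terms are ORed into one signature, the more bits are set and the more false matches occur, which is why long documents are usually segmented.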

29 Encoding and Compression. Encoding transforms data from one representation to another. Compression is an encoding that takes less space. Lossless: the decoder can reproduce the message exactly. Lossy: the decoder can reproduce the message approximately. Degree of compression: (Original - Encoded)/Encoded. Example: (125 MB - 25 MB)/25 MB = 400%.
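
The degree-of-compression formula and the lossless property can be checked with a short sketch. The function name `degree_of_compression` is an assumption, and `zlib` is used only as a convenient lossless compressor from the Python standard library, not as the method the slides have in mind.

```python
import zlib

def degree_of_compression(original_size, encoded_size):
    """(Original - Encoded) / Encoded, expressed as a percentage."""
    return (original_size - encoded_size) / encoded_size * 100

# The example from the slide: 125 MB compressed to 25 MB.
slide_example = degree_of_compression(125, 25)  # 400.0

# Lossless round trip: decompressing returns the original bytes exactly.
data = b"to be or not to be " * 100
packed = zlib.compress(data)
restored = zlib.decompress(packed)
```

Here `restored == data` holds because the encoding is lossless, and `degree_of_compression(len(data), len(packed))` reports how much was saved on this highly repetitive input.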

30 Compression. Advantages of compression: save space in memory (e.g., compressed cache); save space when storing (e.g., disk, CD-ROM); save time when accessing (e.g., I/O); save time when communicating (e.g., over a network).

31 Compression. Disadvantages of compression: costs time and computation to compress and uncompress; complicates or prevents random access; may involve loss of information (e.g., JPEG, MP3); makes data corruption much more costly, since small errors may make all of the data inaccessible.
