Chapter 1: Introduction to Information Retrieval for Web Mining
Information Retrieval in the world of ICT: familiar terms such as searching, Google, keyword, …; how search worked in the past versus today; and the progression from Data to Information, Knowledge, and Wisdom.
Information Retrieval
Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Unstructured format (free-form, semi-structured) vs. structured format (well-defined syntax). Finding: browsing, searching. Creating: expert systems, artificial intelligence.
Unstructured (text) vs. structured (database) data in 1996
Unstructured (text) vs. structured (database) data in 2009
Sec. 1.1 Unstructured data in 1680. Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer? It is slow (for large corpora); NOT Calpurnia is non-trivial; other operations (e.g., find the word Romans near countrymen) are not feasible; and ranked retrieval (returning the best documents) is out of reach — more on that in later lectures. Grep is line-oriented; IR is document-oriented.
Information Retrieval Goals of IR: Text Retrieval, Speech Retrieval, Image Retrieval, Music Retrieval, Cross-Language Retrieval, Question-Answering Systems, etc.
Information Retrieval IR Models. Classical Boolean Model: based on set theory, a mathematical foundation; retrieval is a strict match / no-match decision. Vector Space Model: based on matrix algebra; each entry in the matrix is a weight determined by how often a term appears in a document (Term Frequency) and, inversely, how often it appears across all documents (Inverse Document Frequency); the query is likewise written as a vector. Probabilistic Relevance Model: based on probability theory; documents are ranked by comparing the estimated probabilities of a match between query and document.
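The vector space model described above can be sketched in a few lines. This is a minimal illustration over a hypothetical three-document toy corpus (not from the slides): each document becomes a tf-idf-weighted vector, and the query is scored against documents by cosine similarity.

```python
import math

# Hypothetical toy corpus for illustration.
docs = ["brutus killed caesar", "caesar was ambitious", "the noble brutus"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df[t] = number of documents containing term t
df = {}
for toks in tokenized:
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def idf(t):
    # Inverse document frequency; unseen terms get weight 0.
    return math.log(N / df[t]) if t in df else 0.0

def vector(toks):
    # Raw term frequency, then tf * idf weighting.
    tf = {}
    for t in toks:
        tf[t] = tf.get(t, 0) + 1
    return {t: f * idf(t) for t, f in tf.items()}

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = vector("brutus caesar".split())
scores = sorted(((cosine(query, vector(d)), i)
                 for i, d in enumerate(tokenized)), reverse=True)
print(scores[0][1])  # index of the best-matching document
```

The document containing both query terms ranks first, since it shares the most weighted terms with the query vector.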
Information Retrieval Steps in IR processing. Indexing: building a representation of each document. Query Formulation: building a representation of the query (and, or, not, * patterns). Matching: comparing the query representation against the document representations; this depends on the model used, e.g., the Boolean Model or the Vector Model. Selection: picking the relevant items from the result list (title, abstract). Query Reformulation: Query Expansion adds terms to the query, and Relevance Feedback feeds the user's relevance judgments back into the system.
The classic search model. TASK: get rid of mice in a politically correct way (misconception?). Info need: info about removing mice without killing them (mistranslation?). Verbal form: how do I trap mice alive? (misformulation?). Query: mouse trap. The query goes to the SEARCH ENGINE, which searches the corpus, returns results, and may lead to query refinement.
Sec. 1.1 How good are the retrieved docs? Performance evaluation. Precision: the fraction of retrieved docs that are relevant to the user's information need. Recall: the fraction of relevant docs in the collection that are retrieved.
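The two measures above reduce to simple set arithmetic. A minimal sketch, using hypothetical doc-ID sets for one query:

```python
# Hypothetical retrieved and relevant sets for a single query.
retrieved = {1, 2, 5, 8, 9}
relevant = {2, 5, 8, 13, 21}

true_positives = len(retrieved & relevant)
precision = true_positives / len(retrieved)  # fraction of retrieved docs that are relevant
recall = true_positives / len(relevant)      # fraction of relevant docs that were retrieved

print(precision, recall)  # 0.6 0.6
```

Note the trade-off: retrieving everything gives perfect recall but poor precision, while retrieving only one sure hit gives high precision but poor recall.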
Sec. 1.1 Bigger collections Consider N = 1 million documents, each with about 1000 words. Avg 6 bytes/word including spaces/punctuation 6GB of data in the documents.
Sec. 1.2 Inverted index construction. Documents to be indexed ("Friends, Romans, countrymen.") pass through a tokenizer, yielding a token stream (Friends, Romans, Countrymen), then through linguistic modules, yielding modified tokens (friend, roman, countryman — more on these later), and finally through the indexer, which produces the inverted index: each term (friend, roman, countryman) mapped to its postings list of document IDs.
Sec. 1.2 Indexer steps: token sequence. Produce a sequence of (modified token, document ID) pairs. Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sec. 1.2 Indexer steps: sort. Sort by terms, and then by docID. This is the core indexing step.
Sec. 1.2 Indexer steps: dictionary & postings. Multiple entries for the same term in a single document are merged. The result is split into a dictionary and postings, and document frequency information is added. Why frequency? Will discuss later.
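The indexer steps just described can be sketched end to end: tokenize, form (term, docID) pairs, sort, then merge into postings lists. This is a simplified illustration using the two Caesar documents from the token-sequence slide (the tokenizer here is a crude split, not a real linguistic module):

```python
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = []
for doc_id, text in docs.items():
    for token in text.lower().replace(";", " ").replace(".", " ").split():
        pairs.append((token, doc_id))

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicate (term, docID) pairs into postings lists.
index = {}
for term, doc_id in pairs:
    postings = index.setdefault(term, [])
    if not postings or postings[-1] != doc_id:
        postings.append(doc_id)

print(index["brutus"])  # [1, 2] -- both documents contain "brutus"
```

The document frequency mentioned on the slide is simply `len(index[term])`, which is why keeping it in the dictionary is cheap.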
Sec. 1.2 Where do we pay in storage? Lists of docIDs (pointers), plus the terms and their counts. Later in the course: how do we index efficiently? How much storage do we need?
Sec. 1.3 The index we just built. Today's focus: how do we process a query? Later: what kinds of queries can we process?
Sec. 1.3 Query processing: AND. Consider processing the query Brutus AND Caesar. Locate Brutus in the dictionary and retrieve its postings; locate Caesar in the dictionary and retrieve its postings; then "merge" the two postings lists. Brutus: 2 4 8 16 32 64 128. Caesar: 1 2 3 5 8 13 21 34.
Sec. 1.3 The merge. Walk through the two postings lists simultaneously, in time linear in the total number of postings entries. Brutus: 2 4 8 16 32 64 128. Caesar: 1 2 3 5 8 13 21 34. Result: 2 8. If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
Intersecting two postings lists (a “merge” algorithm)
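The merge algorithm from the previous slide can be written as a two-pointer walk over the sorted lists. A sketch, using the Brutus and Caesar postings from the slides:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

The linear bound holds only because both lists are sorted by docID; on unsorted postings each lookup would cost a scan.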
Sec. 1.3 Query optimization. What is the best order for query processing? Consider a query that is an AND of n terms. For each of the n terms, get its postings, then AND them together. Brutus: 2 4 8 16 32 64 128. Caesar: 1 2 3 5 8 16 21 34. Calpurnia: 13 16. Query: Brutus AND Calpurnia AND Caesar.
Sec. 1.3 Query optimization example. Process in order of increasing document frequency: start with the smallest set, then keep cutting it further. This is why we kept document frequency in the dictionary. Brutus: 2 4 8 16 32 64 128. Caesar: 1 2 3 5 8 16 21 34. Calpurnia: 13 16. Execute the query as (Calpurnia AND Brutus) AND Caesar.
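The frequency-ordered strategy above can be sketched directly: sort terms by postings-list length and intersect the rarest first, so intermediate results stay small. This uses the postings from the slide:

```python
def intersect(p1, p2):
    # Two-pointer merge of sorted postings lists, O(x + y).
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}

# Process terms in order of increasing document frequency.
terms = sorted(postings, key=lambda t: len(postings[t]))
result = postings[terms[0]]
for t in terms[1:]:
    result = intersect(result, postings[t])

print(result)  # [16]
```

After the first intersection (Calpurnia AND Brutus) the working set is already down to one docID, so the remaining AND touches almost nothing — the payoff of keeping document frequency in the dictionary.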
IR vs. databases: structured vs. unstructured data. Structured data tends to refer to information in "tables":

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact-match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
Unstructured data typically refers to free text. It allows keyword queries, including operators, and more sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse. This is the classic model for searching text documents.
Semi-structured data In fact almost no data is “unstructured” E.g., this slide has distinctly identified zones such as the Title and Bullets Facilitates “semi-structured” search such as Title contains data AND Bullets contain search … to say nothing of linguistic structure
More sophisticated semi-structured search Title is about Object Oriented Programming AND Author something like stro*rup where * is the wild-card operator Issues: how do you process “about”? how do you rank results? The focus of XML search
Clustering, classification and ranking. Clustering: given a set of docs, group them into clusters based on their contents. Classification: given a set of topics, plus a new doc D, decide which topic(s) D belongs to. Ranking: can we learn how to best order a set of documents, e.g., a set of search results?
More sophisticated information retrieval Cross-language information retrieval Question answering Summarization Text mining …
Resources for today’s lecture Introduction to Information Retrieval, chapter 1 Any questions?