Extraction of Key Chinese Language Statistics from Major Chinese Communities for IT Applications
擷取關鍵性漢語詞語與統計資料以用於資訊科技發展
 
Principal Investigator
Professor T'SOU, Benjamin Ka Yin 鄒嘉彥 教授
Professor (Chair) of Linguistics and Asian Languages; Director, Language Information Sciences Research Centre
Stage of Technology Transfer: fundamental R&D product level
Research Area: Information Technology

Abstract
Efficient IT applications involving data extraction and information mining depend on two important supporting components:
(1) Authoritative statistics on the usage distribution of words, characters and related items, for specific communities and usage domains
(2) An efficient and effective method for analyzing copious Chinese textual data in order to determine and update sub-databases according to different classificatory criteria, such as localities or specific domains.
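
The two components can be pictured with a minimal sketch. It is an illustration only, not the Centre's implementation: the class name UsageStats, its methods and the sample tokens are hypothetical, and the input is assumed to be already segmented into words, since Chinese text carries no word boundaries.

```python
from collections import Counter, defaultdict


class UsageStats:
    """Word and character counts, one sub-database per (community, domain)."""

    def __init__(self):
        # Each key is a (community, domain) pair; each value is a frequency table.
        self.words = defaultdict(Counter)
        self.chars = defaultdict(Counter)

    def update(self, community, domain, tokens):
        """Add one batch of already segmented text to the matching sub-database."""
        key = (community, domain)
        self.words[key].update(tokens)
        for token in tokens:
            # Updating a Counter with a string counts its individual characters.
            self.chars[key].update(token)

    def top_words(self, community, domain, n=3):
        """Return the n most frequent words of one sub-database."""
        return self.words[(community, domain)].most_common(n)

    def community_total(self, community):
        """Merge every domain of one community into a single word counter."""
        total = Counter()
        for (comm, _domain), counts in self.words.items():
            if comm == community:
                total.update(counts)
        return total


# Hypothetical sample batches, for illustration only.
stats = UsageStats()
stats.update("Hong Kong", "finance", ["股市", "上升", "股市"])
stats.update("Hong Kong", "sports", ["足球", "比賽"])
stats.update("Beijing", "finance", ["股市", "上涨"])
```

Keying the tables by (community, domain) lets a new batch update only the sub-database it belongs to, while totals for a whole community can still be derived on demand.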

On the basis of 10 years of experience in sophisticated and synchronous processing of extensive textual material from Chinese newspapers and electronic media of Hong Kong, Taipei, Beijing, Shanghai, Macau and Singapore, researchers at the Corpus Linguistics Laboratory of CityU's Language Information Sciences Research Centre have obtained the necessary statistics and developed extensive databases as well as language engineering techniques.

Fresh material has been processed synchronously on a weekly basis over the last 10 years to capture the salient trends of linguistic, cultural and social development in the diverse Chinese speech communities. The system thus also offers an innovative “Window” approach for a wide variety of applications in IT and comparative studies. As of August 2004, it had a unique and growing dictionary of over 700,000 entries, drawn from the systematic culling of rigorously defined material comprising 150,000,000 Chinese characters. The dictionary shows alternate usage in the different Chinese communities (e.g. President Bush is represented by 3 different Chinese names in the PRC, Hong Kong and Taiwan, Bin Ladin by 13 different Chinese names, and SARS by 23 different Chinese renditions). This is partly based on a derivative and cumulative Celebrity Roster, also available, which regularly lists the personalities with the highest media exposure in Beijing, Hong Kong, Shanghai and Taipei.
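
As a rough illustration of how such alternate renditions might be stored and used, the sketch below keeps one entry per entity and expands a search term to all of its known renditions. The table layout and the names VARIANTS, build_reverse_index and expand_query are hypothetical. The three renditions of "Bush" are the commonly used ones in each community and are included only as sample data, not as an extract from the dictionary.

```python
# Sample entry only; the real dictionary format is not described in the source.
VARIANTS = {
    "Bush": {"PRC": "布什", "Hong Kong": "布殊", "Taiwan": "布希"},
}


def build_reverse_index(variants):
    """Map each rendition back to its (entity, community) pair."""
    index = {}
    for entity, by_community in variants.items():
        for community, name in by_community.items():
            index[name] = (entity, community)
    return index


def expand_query(term, variants, reverse_index):
    """Return the set of all known renditions of the entity `term` refers to.

    An unknown term is returned unchanged as a one-element set.
    """
    if term in variants:
        entity = term
    elif term in reverse_index:
        entity = reverse_index[term][0]
    else:
        return {term}
    return set(variants[entity].values()) | {term}


REVERSE = build_reverse_index(VARIANTS)
print(expand_query("布殊", VARIANTS, REVERSE))
```

A search engine could run the expanded set as alternatives, so that a query typed with the Hong Kong rendition also retrieves documents written with the PRC or Taiwan rendition.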

Features

1. A collection of lexical databases ranked by frequency of occurrence in Beijing, Hong Kong, Shanghai, Singapore and Taipei, totalling more than 700,000 items.
2. Lists of personal names, place names and organization names, ranked by frequency of occurrence, for individual communities or for all communities combined.
3. New words for each region and for all regions combined, updated regularly by month or year.
4. Compound words.
5. Electronic news monitoring service.
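
Frequency ranking and period-by-period new-word lists, as in features 1 and 3 above, can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's method: the function names are hypothetical, the text is assumed to be segmented already, each period is assumed to arrive as one batch in chronological order, and a word counts as "new" simply when it is absent from a known lexicon and from all earlier batches.

```python
from collections import Counter


def rank_by_frequency(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def new_words_by_period(batches, known=()):
    """List the words first seen in each period.

    batches: list of (period_label, tokens) pairs in chronological order.
    known:   words already in the lexicon before the first batch.
    """
    seen = set(known)
    result = {}
    for period, tokens in batches:
        fresh = set(tokens) - seen
        result[period] = sorted(fresh)
        seen |= fresh
    return result


# Hypothetical monthly batches, for illustration only.
batches = [
    ("2003-03", ["病毒", "肺炎"]),
    ("2003-04", ["肺炎", "非典"]),
]
print(new_words_by_period(batches, known={"病毒"}))
```

Running the same function once per region and once on the merged batches would give the per-region and all-region lists.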

Applications
The dynamically updated dictionary from LIVAC and its usage frequency statistics provide a rich resource, with a range of information tags, for search engine development and enhancement in the IT field, and for applications in areas such as education, linguistics and other social sciences. The comparison and longitudinal tracking of alternate Chinese versions of non-Chinese names, and of alternate expressions for the same item, are especially valuable to search engine developers and information content providers, as well as for monitoring social and cultural events. Academic and industrial users have found up-to-date quantitative data on the Chinese language particularly useful for applications in information technology. The data have also been applied in areas as diverse as language education and machine translation systems in language engineering.
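
Longitudinal tracking of alternate expressions can be illustrated with a small sketch that computes, for each period, the share of occurrences taken by each rendition of one item. The function name variant_shares and the sample observations are hypothetical; 「非典」 and 「沙士」 are two commonly used Chinese names for SARS and serve only as sample data.

```python
from collections import Counter, defaultdict


def variant_shares(observations, variants):
    """Compute the per-period share of each rendition of one item.

    observations: iterable of (period, token) pairs.
    variants:     set of renditions that refer to the same item.
    Returns {period: {rendition: share}}; periods with no hits are omitted.
    """
    counts = defaultdict(Counter)
    for period, token in observations:
        if token in variants:
            counts[period][token] += 1
    shares = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        shares[period] = {name: n / total for name, n in counter.items()}
    return shares


# Hypothetical observations, for illustration only.
observations = [
    ("2003-03", "非典"), ("2003-03", "沙士"), ("2003-03", "非典"),
    ("2003-04", "沙士"), ("2003-04", "天氣"),
]
print(variant_shares(observations, {"非典", "沙士"}))
```

Comparing the shares across periods and communities shows which rendition is gaining or losing ground.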

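Introduction
Data extraction and information mining in IT applications rely on the following two important supporting components:
1. Authoritative statistics on the usage of words, characters and related items in different Chinese communities or specific usage domains
2. An efficient and effective method for analyzing very large volumes of Chinese text according to classification criteria such as locality or specific domain, in order to build and update the main database and its subordinate sub-databases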

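After ten years of precise, synchronous processing, the Centre's Corpus Laboratory has accumulated a large body of useful statistics, databases and techniques from extensive textual material drawn from influential Chinese-language newspapers and electronic media in Hong Kong, Taipei, Beijing, Shanghai, Macau and Singapore.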

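Over the past ten years, fresh corpus material has been processed synchronously at least once a week to capture the salient developments and trends in language, culture and society across the different Chinese speech communities. Through its novel "Window approach", the system also offers comprehensive and effective applications for information technology and comparative methods. As of August 2004, the system had built a unique lexicon whose entries come from rigorously selected and defined material; it holds more than 700,000 word types, drawn from a corpus of 150,000,000 characters, and is still growing. The lexicon shows usage differences among the Chinese communities (for example, US President Bush has a different equivalent rendition in each of mainland China, Hong Kong and Taiwan; Bin Ladin has 13 different Chinese renditions, including 「拉登」; and "非典" / SARS has 23 different names). A Celebrity Roster is also derived from the lexicon; it lists the personal names with the highest press exposure in Beijing, Hong Kong, Shanghai and Taipei and is available for reference.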

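Features
1. Lexicons ranked by word frequency, drawn from Beijing, Hong Kong, Shanghai, Singapore, Taipei and other regions, currently with more than 700,000 entries
2. Usage frequency tables of personal names, place names and organization names, for each region and for all regions combined
3. New-word statistics for each region and for all regions combined, updated regularly by month, year and region
4. Compound words
5. Electronic news tracking service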

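Applications
The continuously and promptly updated LIVAC lexicon provides a valuable and rich resource of information on usage and vocabulary. It supports the development and enhancement of search engines and benefits applications in other fields such as education, linguistics and other social sciences. Across multiple synchronic and diachronic levels, the lexicon records large numbers of different regional renditions of the same foreign word, new words coined in individual regions, and different expressions for the same concept. This greatly helps search engine developers and information providers to track and monitor social trends and to analyze the causes of events. The regularly updated quantitative data on Chinese have been widely applied by academia and industry in many areas, including language acquisition and teaching, and other areas of language engineering such as machine translation.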

 
• LIVAC Website