- Requirements Specification
This document presents the requirements specification for the KWIC (Key Word in Context) index system. It covers the functional and non-functional requirements of the system, as well as the system models that illustrate the scenarios and the use case model of the KWIC index system. The system models themselves are included in the Requirements Analysis document.
Search engines are widely used nowadays for research purposes, and responsiveness and accuracy have become major concerns for web-based search engines. Many factors contribute to poor response time and a poor keyword targeting rate. First, the network is often the main issue that slows a search engine down; among the many network issues, excluding internet connectivity, availability is the primary factor behind slow response rates. Second, insufficient data records: a search engine is based on the knowledge of the system, so the more the system knows, the more it can supply to users. Third, as mentioned above, poorly targeted keywords cause situations where the system has the knowledge base the user wants to query, but the keywords do not match it; this is usually caused by user typos or unnecessary words.
Thus, to address the issues above, our system is designed to provide high availability, an automated learning database, and accurate keyword matching. These features are provided by three subsystems; in this document, we only discuss the KWIC subsystem.
As mentioned above, this system consists of three subsystems, namely:
- a web crawler system that runs continuously as a background service to collect webpages and their content info,
- a backend API based on a distributed database, specifically Cassandra,
- the KWIC index system, which imports the data collected by the web crawler system, generates indices, and outputs them to the backend.
In this interim project SRS document, we only discuss the KWIC index system; more details will be described in the final project SRS. From this brief introduction, it is clear that the scope of the KWIC index system is relatively narrow, since it accepts auto-generated data from the web crawler system and outputs indices by making a backend API call. For testing purposes, however, the KWIC index system also accepts manual imports from a data file and outputs to a terminal.
The ultimate objective of this project is to provide a search engine with high availability, a knowledgeable database, and accurate keyword matching. Thus, from the overall perspective, the system will be evaluated from three different aspects:
For availability:
- Does the system still work when one of the remote servers is down?
- Does the system remain responsive when one of the remote servers is being visited as a hotspot?
For a knowledgeable database:
- Does the database grow daily?
- Does the database supply abundant search results?
For accurate keyword matching:
- Do the results returned by the system satisfy the user's need?
- Do the results returned by the system include any irrelevant web pages?
- Does the system filter verbose keywords?
For the KWIC index system specifically, however, the criteria are:
- Does the system support keyword filtering?
- Does the system return results when a verbose string is input?
- Does the system build enough indices from the input data?
- AM—Access Method
- AM—Active Matrix
- AMOLED—Active-Matrix Organic Light-Emitting Diode
- B2E—Business-to-Employee
- BAL—Basic Assembly Language
- BAM—Block Availability Map
- Bash—Bourne-again shell
- CAT—Computer-Aided Translation
- CAQ—Computer-Aided Quality Assurance
- CASE—Computer-Aided Software Engineering
- cc—C Compiler
- DAO—Data Access Objects
- DAO—Disk-At-Once
- DAP—Directory Access Protocol
- EDA—Electronic Design Automation
- EDGE—Enhanced Data rates for GSM Evolution
- EDI—Electronic Data Interchange
- SaaS—Software as a Service
- SAM—Security Account Manager
- SAN—Storage Area Network
- SAS—Serial Attached SCSI
- UTRAN—Universal Terrestrial Radio Access Network
- UUCP—Unix to Unix Copy
- VNC—Virtual Network Computing
- VOD—Video On Demand
- VoIP—Voice over Internet Protocol
- VPN—Virtual Private Network
- WPAD—Web Proxy Autodiscovery Protocol
- WPAN—Wireless Personal Area Network
- WPF—Windows Presentation Foundation
- WS-D—Web Services-Discovery
- XMPP—eXtensible Messaging and Presence Protocol
- XMS—Extended Memory Specification
- XNS—Xerox Network Systems
- XP—Cross-Platform
- ZIF—Zero Insertion Force
- ZIFS—Zero Insertion Force Socket
- ZIP—ZIP file archive
The system is designed in an object-oriented style for better maintainability and thus uses the Abstract Data Type architecture. The interfaces between the three subsystems are predefined, so the subsystems can be developed in parallel. More details about the other two subsystems (the web crawler system and the backend API) are discussed separately. This document elaborates the functional and non-functional requirements of the KWIC index system, as well as its system models. The KWIC index system is mainly composed of the following components (a minimal sketch of these components follows the list):
- Line storage, where the input keywords and their corresponding URLs are stored
- Circular shift, where the keywords are duplicated and shifted in order to improve the keyword targeting rate
- Alphabetic shift, where the keywords and their duplicates are ordered alphabetically. This sorted ordering also enables the auto-completion function in the front end.
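To make the data flow among these components concrete, here is a minimal Python sketch; the class and method names (`LineStorage`, `CircularShifter`, `Alphabetizer`) are illustrative assumptions rather than the final design:

```python
# Minimal sketch of the KWIC components above; names are illustrative only.

class LineStorage:
    """Stores input keyword lines together with their corresponding URLs."""
    def __init__(self):
        self.lines = []  # each entry: (list_of_keywords, url)

    def add(self, words, url):
        self.lines.append((words, url))


class CircularShifter:
    """Duplicates each line as every one of its circular shifts."""
    def shifts(self, storage):
        for words, url in storage.lines:
            for i in range(len(words)):
                yield (words[i:] + words[:i], url)


class Alphabetizer:
    """Orders the circular shifts alphabetically (case-insensitive)."""
    def sort(self, shifted):
        return sorted(shifted, key=lambda entry: [w.lower() for w in entry[0]])


if __name__ == "__main__":
    storage = LineStorage()
    storage.add(["key", "word", "in", "context"], "http://example.com")
    index = Alphabetizer().sort(CircularShifter().shifts(storage))
    for words, url in index:
        print(" ".join(words), "->", url)
```

Separating the three responsibilities into independent abstract data types is what lets each component be modified (e.g., swapping the sort strategy) without touching the others.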
Prior to September 1993, the World Wide Web was indexed entirely by hand. There was a list of webservers edited by Tim Berners-Lee and hosted on the CERN webserver. One snapshot of the list from 1992 remains,[4] but as more and more web servers went online, the central list could no longer keep up. On the NCSA site, new servers were announced under the title "What's New!"
Web search engines have been evolving ever since. Nowadays, Google holds 90.14% of the market, Bing 3.24%, Baidu 2.18%, and Yahoo! 2.08%. Search engines have become very mature and are equipped with state-of-the-art technologies that make them even more powerful; for example, machine learning enables a search engine to predict what a user is looking for based on his/her historical search data.
References: Web search engine: https://en.wikipedia.org/wiki/Web_search_engine
Our system is by no means meant to be a search engine competitive with these giant companies. The purpose of this project is to familiarize ourselves with the architectural considerations behind the scenes.
Though the aim is not to compete with Google, we still want to implement a complete search engine. Thus the system is designed for high availability, a knowledgeable database, and accurate keyword matching. This document introduces the design of the KWIC subsystem in particular, which is responsible for importing collected web page data and indexing it into the backend.
This section includes the detailed functional requirements of the KWIC index system.
This requirement is mainly for testing purposes. The tester should be able to manually input a sentence in a textbox, and the system should output a list of indices for that sentence.
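For illustration, a manual-input run might look like the following, reusing the component sketch above (the function name and behavior are assumptions, not the final test harness):

```python
def manual_input(sentence: str) -> list[str]:
    """Hypothetical manual-input path: index one typed sentence."""
    storage = LineStorage()
    storage.add(sentence.split(), "manual-input")  # no real URL for test input
    shifted = CircularShifter().shifts(storage)
    return [" ".join(words) for words, _ in Alphabetizer().sort(shifted)]

# Example: manual_input("clouds are white") should output
# ["are white clouds", "clouds are white", "white clouds are"]
```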
The system should be able to read data from files, auto-generate indices from what has been read, and upload the data to the backend.
Since the web crawler system collects web page data continuously, the KWIC index system should import what has been collected periodically, allowing the crawler system to clean out-dated data.
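A sketch of such a periodic import loop is given below; the one-record-per-line tab-separated file format and the hourly interval are assumptions for illustration, and `storage` is a `LineStorage` from the sketch above:

```python
import time

def import_file(path, storage):
    """Read one crawler output file into line storage.
    Assumed format: one 'keyword keyword ...<TAB>url' record per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            keywords, url = line.rstrip("\n").split("\t", 1)
            storage.add(keywords.split(), url)

def run_periodic_import(path, storage, interval_seconds=3600):
    """Re-import the crawler's file on a fixed period so the crawler
    can clean out-dated data between two imports."""
    while True:
        import_file(path, storage)
        time.sleep(interval_seconds)
```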
The system should be able to send generated indices to the backend via an API call.
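A minimal sketch of that call using only the Python standard library follows; the endpoint URL and the JSON payload shape are placeholders, since the real backend API is specified in the other subsystem's document:

```python
import json
import urllib.request

BACKEND_URL = "http://localhost:8080/indices"  # placeholder, not the real endpoint

def upload_indices(index_entries):
    """POST generated (shifted_keywords, url) pairs to the backend as JSON."""
    payload = json.dumps(
        [{"keywords": " ".join(words), "url": url} for words, url in index_entries]
    ).encode("utf-8")
    request = urllib.request.Request(
        BACKEND_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # expect 200/201 on success
```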
This section includes the detailed non-functional requirements of the KWIC index system.
User-friendliness is a big concern in today's software industry. Thus, the KWIC system should have a neat and intuitive user interface that guides the user in how to use the system. In particular, once the system is formally integrated with the web crawler system and the backend API, no extra action should be required from the user after it has been started.
The system should be tested thoroughly to avoid a disastrous crash that would impact the whole search engine. A logging mechanism should be implemented in order to trace and debug a crash when it happens.
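One way such a logging mechanism might look, sketched with Python's standard `logging` module (the log file name and message format are assumptions):

```python
import logging

logging.basicConfig(
    filename="kwic.log",  # assumed log location
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("kwic")

def safe_index(sentence):
    """Wrap indexing so one bad input is logged instead of crashing the system."""
    try:
        words = sentence.split()
        if not words:
            raise ValueError("empty input line")
        log.info("indexed %d keywords", len(words))
        return words
    except Exception:
        log.exception("failed to index line: %r", sentence)
        return None
```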
Since the system runs continuously, it should execute on a background thread. The operating system should not be impacted by the KWIC system's I/O operations and network traffic.
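One possible realization, sketched with Python's `threading` module: a daemon thread polls on a fixed interval so that I/O and network traffic stay throttled (the 60-second interval is an assumption):

```python
import threading

def background_worker(stop_event, interval_seconds=60):
    """Import and index one batch per interval without blocking the host."""
    while not stop_event.is_set():
        # ... import one batch from the crawler and index it here ...
        stop_event.wait(interval_seconds)  # throttles I/O and network traffic

stop = threading.Event()
worker = threading.Thread(target=background_worker, args=(stop,), daemon=True)
worker.start()
# on shutdown: stop.set(); worker.join()
```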
The KWIC system should be supported on current equipment such as computers, monitors, routers, etc.
The KWIC system will be implemented in parallel with the crawler system and the backend API. Each subsystem is designed in an object-oriented style.
All the data interfaces are defined at the beginning of this project. All three subsystems should conform to this agreement to avoid integration issues.
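For example, the agreed record formats could be pinned down as simple data classes shared by all three subsystems; the field names below are illustrative, not the signed-off interface:

```python
from dataclasses import dataclass

@dataclass
class CrawlRecord:
    """One record produced by the web crawler system."""
    url: str
    keywords: list[str]

@dataclass
class IndexEntry:
    """One circular-shift index entry that KWIC sends to the backend."""
    shifted_keywords: str
    url: str
```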
As stated under usability, the user interface should be neat and intuitive, guiding the user in how to use the system, and no extra action should be required from the user once the integrated system has been started.
The deliverable of this project should be packaged with a README file including necessary info such as environment setup, usage examples, and a user manual.
The backend and database servers will be hosted in ECSS 3.213 at the University of Texas at Dallas. Their usage should also comply with university rules. This project is used only for the Advanced Software Architecture course and should not be used for any commercial purpose without permission.
The KWIC index system has only two possible scenarios due to its nature: manual data input and automatic file import. Both scenarios output indices. They are described in detail in the following use case models.
This use case is designed for internal testing; thus, the only permitted actor is the tester. The tester should be able to input a sentence and receive a list of indices as the result.
| Use Case name | Manual data input |
| --- | --- |
| Participating Actors | Tester |
| Entry condition | Tester inputs a sentence |
| Flow of events | |
| Exit condition | Ordered sentences are output |
| Exceptions | Any error that happens during the process |
| Special Requirements | None |
This use case is designed for production. After integration with the web crawler system and the backend API, this system should be able to operate automatically: it periodically reads the data collected by the crawler system and outputs the indices to the database using the backend API.
| Use Case name | Auto file import |
| --- | --- |
| Participating Actors | Admin, Backend |
| Entry condition | Admin starts the system |
| Flow of events | |
| Exit condition | Ordered sentences are output |
| Exceptions | |
| Special Requirements | None |