Data Science

The BS in data science studies the collection, manipulation, storage, retrieval, and computational analysis of data in its various forms, including numeric, textual, image, and video data from small to large volumes. The program combines computer science, information science, mathematics, statistics, and probability theory into an integrated curriculum that prepares students for careers or graduate studies in big data analysis, data science, and data analytics. The coursework covers exploratory data analysis, data manipulation in a variety of programming languages, large-scale data storage, predictive analytics, machine learning, data mining, and information visualization and presentation. Data science has emerged as a discipline due to the confluence of two major developments:

  1. The ability to collect, store, prune, process, and transmit large amounts of data in the cloud.
  2. The convergence of programming, statistics, artificial intelligence, and visualization as complementary tools for the analysis and understanding of data.

DS 1990. Elective. 1-4 Hours.

Offers elective credit for courses taken at other academic institutions.

DS 2990. Elective. 1-4 Hours.

Offers elective credit for courses taken at other academic institutions.

DS 3990. Elective. 1-4 Hours.

Offers elective credit for courses taken at other academic institutions.

DS 4100. Data Collection, Integration, and Analysis. 4 Hours.

Studies how to collect data from multiple sources and integrate them into consistent data sets. Covers how to use semi-automated and automated classification to integrate disparate data sets; how to parse data from files, XML, JSON, APIs, and structured data stores to construct analyzable data sets that are stored in databases; and how to assess and ensure quality of data. Introduces key concepts of algorithms and data structures, including divide-and-conquer, sorting and selection, and graph traversal, as well as descriptive analysis of data through descriptive statistics and plotting. Analyzes complexity and run-time behavior of programs. Presents approaches for data anonymization and protecting data privacy. Studies data shaping and manipulation techniques for data analysis using the R and Python programming languages. Prereq. CS 2510.

DS 4200. Information Presentation and Visualization. 4 Hours.

Introduces foundational principles, methods, and techniques of visualization to enable creation of effective information representations suitable for exploration and discovery. Covers the design and evaluation process of visualization creation, visual representations of data, relevant principles of human vision and perception, and basic interactivity principles. Studies data types and a wide range of visual data encodings and representations. Draws examples from physics, biology, health science, social science, geography, business, and economics. Emphasizes good programming practices for both static and interactive visualizations. Creates visualizations in Excel and Tableau as well as R, Python, and open web-based authoring libraries. Requires programming in Python, JavaScript, HTML, and CSS. Requires extensive writing including documentation, explanations, and discussions of the findings from the data analyses and the visualizations. Prereq. CS 2510.

DS 4300. Large-Scale Information Storage and Retrieval. 4 Hours.

Introduces data and information storage approaches for structured and unstructured data. Covers how to build large-scale information storage structures using distributed storage facilities. Explores data quality assurance, storage reliability, and challenges of working with very large data volumes. Studies how to model multidimensional data. Implements distributed databases. Considers multitier storage design, storage area networks, and distributed data stores. Applies algorithms, including graph traversal, hashing, and sorting, to complex data storage systems. Considers complexity theory and hardness of large-scale data storage and retrieval. Requires use of nonrelational, document, key-column, key-value, and graph databases and programming in R, Python, and C++. Prereq. CS 3200 and DS 4100.

DS 4400. Machine Learning and Data Mining 1. 4 Hours.

Introduces supervised and unsupervised predictive modeling, data mining, and machine-learning concepts. Uses tools and libraries to analyze data sets, build predictive models, and evaluate the fit of the models. Covers common learning algorithms, including dimensionality reduction, classification, principal-component analysis, k-NN, k-means clustering, gradient descent, regression, logistic regression, regularization, multiclass data and algorithms, boosting, and decision trees. Studies computational aspects of probability, statistics, and linear algebra that support algorithms, including sampling theory and computational learning. Requires programming in R and Python. Applies concepts to common problem domains, including recommendation systems, fraud detection, and advertising. Prereq. (a) DS 4300 and (b) ECON 2350, ENVR 2500, MATH 3081, or PSYC 2320.

DS 4420. Machine Learning and Data Mining 2. 4 Hours.

Continues with supervised and unsupervised predictive modeling, data mining, and machine-learning concepts. Covers mathematical and computational aspects of learning algorithms, including kernels, time-series data, collaborative filtering, support vector machines, neural networks, Bayesian learning and Monte Carlo methods, multiple regression, and optimization. Uses mathematical proofs and empirical analysis to assess validity and performance of algorithms. Studies additional computational aspects of probability, statistics, and linear algebra that support algorithms. Requires programming in R and Python. Applies concepts to common problem domains, including spam filtering. Prereq. DS 4400.

DS 4900. Data Science Senior Project. 4 Hours.

Designed to help students develop a sophisticated understanding of data collection, integration, storage, statistical analysis, visualization, and machine-supported analysis and modeling. Requires students to analyze a substantial data set using statistical and visual methods and to build machine-learning models to discover patterns in the data. Results must be communicated in writing. Requires substantial programming in R, Python, Java, or C++. Prereq. DS 4200 and DS 4420 (the latter may be taken concurrently).

DS 4990. Elective. 1-4 Hours.

Offers elective credit for courses taken at other academic institutions.

DS 4991. Research. 4 Hours.

Offers an opportunity to conduct research under faculty supervision.

DS 4992. Directed Study. 1-4 Hours.

Offers independent work under the direction of members of the department on a chosen topic.

DS 4993. Independent Study. 1-4 Hours.

Offers independent work under the direction of members of the department on a chosen topic.

DS 4994. Internship. 4 Hours.

Offers students an opportunity for internship work.

DS 4996. Experiential Education Directed Study. 1-4 Hours.

Draws upon the student’s approved experiential activity and integrates it with study in the academic major. Restricted to those students who are using it to fulfill their experiential education requirement.

DS 4997. Data Science Thesis. 4 Hours.

Offers students an opportunity to prepare an undergraduate thesis under faculty supervision.

DS 4998. Data Science Thesis Continuation. 4 Hours.

Focuses on the student's continued preparation of an undergraduate thesis under faculty supervision.

DS 5110. Introduction to Data Management and Processing. 4 Hours.

Introduces relational database management systems as well as modern parallel data processing systems such as MapReduce. Topics in relational databases include relational algebra; SQL; stored procedures; user-defined functions; cursors; embedded SQL programs; client-server interfaces; entity-relationship diagrams; normalization; B-trees; concurrency; transactions; database security; constraints; object-relational DBMSs; and specialized engines such as spatial, text, XML conversion, and time series. Includes exercises using a commercial relational or object-relational database management system. Topics in parallel data processing include MapReduce programming, introduction to cloud computing, distributed databases, and distributed file systems.

DS 5220. Supervised Machine Learning and Learning Theory. 4 Hours.

Supervised machine learning is the study and design of algorithms that enable computers/machines to learn from experience or data, given examples of data with a known outcome of interest. This course is an introduction to supervised machine learning. Provides a broad view of models and algorithms for supervised decision making. Discusses the methodological foundations behind the models and the algorithms, as well as issues of practical implementation and use, and techniques for assessing the performance. Includes a term project involving programming and/or work with real-life data sets. Prereq. CS 5800 or EECE 7205 (either may be taken concurrently); students should be proficient in programming languages such as Python, R, or Matlab.

DS 5230. Unsupervised Machine Learning and Data Mining. 4 Hours.

Unsupervised machine learning and data mining is the process of discovering and summarizing patterns from large amounts of data, without examples of data with a known outcome of interest. This course is an introduction to unsupervised machine learning and data mining. Seeks to provide a broad view of models and algorithms for unsupervised data exploration. Discusses the methodological foundations behind the models and the algorithms, as well as issues of practical implementation and use, and techniques for assessing the performance. Includes a term project involving programming and/or work with real-life data sets. Prereq. CS 5800 or EECE 7205 (either may be taken concurrently); students should be proficient in programming languages such as Python, R, or Matlab.