Main Project total points: 500

Presentation transcript:

Main Project total points: 500
200/500 = 40% finished by March 27: Introduction, Background, Partial Results/Discussion, Acknowledgement, Author contribution, Funding/conflicts, References
250/500 = 50% finished by April 5
400/500 = 80% finished by April 17
500/500 = 100% finished by April 26

For the 200-point draft due March 27, I recommend:
Introduction (25 points): Include a description of your data set. How many points, and in what dimension? Describe each coordinate of a point in your data set (what the variables mean). How will you compute distances between data points (or put this in a later section)? What is your goal and how do you plan to achieve it?
Background: Describe the TDA algorithm, including benefits and limitations. Consider using example(s) to illustrate your points. (100 points) Describe the background needed to understand your data set. (50+ points)
Partial Results/Discussion (50+ points): Include many images from Python TDA mapper and analyze these images. You can put some images in an appendix if you don’t have time to analyze all of them. Consider comparing to other techniques (e.g. hierarchical clustering).
Acknowledgement, Author contribution, Funding/conflicts, References (yes, these points add up to more than 200)
Please also include your commented R code.
Include all parameters in figure caption. (20 points)

http://www.bigdata.uni-frankfurt.de/wp-content/uploads/2015/10/Evaluating-Ayasdi’s-Topological-Data-Analysis-For-Big-Data_HKim2015.pdf

“Color ranges over red to blue and it has different meanings, depending on the type of attributes. For the continuous values, color represents an average of value. A red node contains data samples that have higher average values. In contrast, a blue node contains lower average values. In contrast, for the categorical values, color represents a value concentration.”
Analyze your data
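
A minimal sketch of the two coloring rules quoted above, in plain Python. The function names, the toy `ages` and `sexes` lists, and the node memberships are made up for illustration. Reading "value concentration" as the fraction of a node's samples that carry a given category is an assumption, not something the quoted text spells out.

```python
from collections import Counter
from statistics import mean

def node_color_continuous(values, node_members):
    """Average of a continuous attribute over the samples in one node."""
    return mean(values[i] for i in node_members)

def node_color_categorical(labels, node_members, category):
    """Fraction of a node's samples with the given category (assumed meaning of 'concentration')."""
    counts = Counter(labels[i] for i in node_members)
    return counts[category] / len(node_members)

# Hypothetical toy data: one entry per data point.
ages = [22.0, 38.0, 26.0, 35.0, 54.0, 2.0]
sexes = ["male", "female", "female", "female", "male", "male"]

# Hypothetical nodes: each node is a list of data-point indices.
node_a = [0, 4, 5]
node_b = [1, 2, 3]

print(node_color_continuous(ages, node_a))
print(node_color_categorical(sexes, node_b, "female"))
```

A node with a higher average would be drawn toward red and one with a lower average toward blue; the mapping from number to color is left out here.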

3.2.2.2 Insight by Ranked Variables
Going back to the Titanic example, the result of the KS-statistic show, that the variable “Sex” is the most strongly related to passengers death. We could generally assume that men conceded the places in lifeboats to women. Furthermore, it is feasible to deduct the subtle reasons of the death of each group. The passengers in group A died because of two reasons: they were man and the cabin class type was low. The passengers in the group B died because they were man. Finally, the passengers in the group C died because they were staying at third class even though most of them were women.
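
To make "ranked by KS-statistic" concrete, here is a rough sketch of a two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions) used to rank variables for a selected group against the rest of the data. The numbers are invented toy values, not the Titanic data, and comparing a group against the rest is an assumption about how the ranking is set up.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

def rank_variables(group, rest):
    """Sort variable names by KS statistic, strongest separation first."""
    scores = {name: ks_statistic(group[name], rest[name]) for name in group}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy values for a selected group and for everyone else.
group = {"sex_is_male": [1, 1, 1, 0], "cabin_class": [3, 3, 2, 1]}
rest = {"sex_is_male": [0, 0, 0, 1], "cabin_class": [1, 2, 3, 1]}

for name, score in rank_variables(group, rest):
    print(name, score)
```

In this toy example the variable whose distribution differs most between the group and the rest comes first in the ranking.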

Fig. 1 Ebola Tweet Network Plotted with NodeXL. Nodes represent vertices whose tweets contain the keyword “ebola”, mentions or replies-to other vertices. The “who-mentions-who” or “who-replies to-whom” relationship between msf_uk and nytimes is illustrated above. Here, msf_uk follows the nytimes.

Data points = tweets

Relationship numerically coded as
1: A tweet that contains the keyword “Ebola” but does not contain @foo, where foo is the Twitter name of someone who has sent a tweet containing the keyword “Ebola”.
2: A mention: a tweet that contains @foo anyplace except at the beginning.
3: A replies-to relationship: a tweet that contains @foo at the beginning.
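
The coding rule above can be sketched as a small function. This is only an illustration: the function name, the regular expression, and the sample tweets are made up, and it assumes the input tweets were already filtered for the keyword. When a tweet both starts with @foo and mentions someone later, the sketch returns 3; that tie-break is an assumption.

```python
import re

MENTION = re.compile(r"@(\w+)")

def relationship_code(tweet, ebola_users):
    """Return 1 (no known mention), 2 (mention), or 3 (replies-to)."""
    known = {name.lower() for name in ebola_users}
    text = tweet.lstrip()
    # Positions of @names that belong to users who tweeted the keyword.
    positions = [
        m.start() for m in MENTION.finditer(text)
        if m.group(1).lower() in known
    ]
    if not positions:
        return 1
    if positions[0] == 0:
        return 3
    return 2

# Hypothetical users and tweets.
users = ["nytimes", "msf_uk"]
print(relationship_code("Ebola update from the field", users))
print(relationship_code("Reading the @nytimes piece on Ebola", users))
print(relationship_code("@nytimes thanks for the Ebola coverage", users))
```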

“However, this is only a visual comparison, and thus the metric that suits the data better needs to be determined.”
Hamming, Manhattan, Jaccard, Variance Normalized Euclidean.

Jaccard Distance = 1 – J(A, B)
http://d-roger.com/2016/09/07/finding-similar-items-查找相似项/
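
A short sketch of the formula above, where J(A, B) is the size of the intersection divided by the size of the union. Treating each tweet as a set of words is an assumption made for the example, and the word sets are invented.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: treated as identical by convention here.
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical word sets for two tweets.
tweet_a = {"ebola", "outbreak", "news"}
tweet_b = {"ebola", "news", "vaccine", "trial"}

# 2 shared words out of 5 distinct words, so the distance is 1 - 2/5.
print(jaccard_distance(tweet_a, tweet_b))
```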

https://en.wikipedia.org/wiki/Jaccard_index

Data points = tweets

https://people.rit.edu/rmb5229/320/project3/hamming.html
http://rosalind.info/glossary/hamming-distance/
Hamming Distance = 7
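
The Hamming distance counts the positions at which two equal-length sequences differ. A minimal sketch follows; the two DNA-style strings are illustrative and happen to differ in 7 positions, but whether they match the figure on the slide is an assumption.

```python
def hamming_distance(s, t):
    """Number of positions where two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(s, t))

# Illustrative strings of length 17.
s = "GAGCCTACTAACGGGAT"
t = "CATCGTAATGACGGCCT"
print(hamming_distance(s, t))  # 7
```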

Data points = tweets

http://www.joachimdespland.com/mammoth.html
http://bloggity.nurdz.com/gamedev-math/manhattan/
http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Manhattan_Distance_Metric.htm
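
The Manhattan (city-block) distance adds up the absolute coordinate differences, like walking along a street grid. A minimal sketch with made-up points:

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(x - y) for x, y in zip(p, q))

# |1 - 4| + |2 - 0| + |3 - 3| = 3 + 2 + 0
print(manhattan_distance((1, 2, 3), (4, 0, 3)))  # 5
```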

Data points = tweets

“However, this is only a visual comparison, and thus the metric that suits the data better needs to be determined.”
Hamming, Manhattan, Jaccard, Variance Normalized Euclidean.
k-nearest neighbors, resolution = 15, gain = 2.
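
A rough sketch of the last metric in the list, assuming "Variance Normalized Euclidean" means Euclidean distance with each squared coordinate difference divided by that coordinate's variance, so that variables on large scales do not dominate. The use of population variance and the toy rows are assumptions; a column with zero variance would need special handling.

```python
from math import sqrt
from statistics import pvariance

def column_variances(rows):
    """Population variance of each coordinate across the data set."""
    return [pvariance(column) for column in zip(*rows)]

def variance_normalized_euclidean(p, q, variances):
    """Euclidean distance with each squared difference scaled by 1/variance."""
    return sqrt(sum((x - y) ** 2 / v for x, y, v in zip(p, q, variances)))

# Hypothetical data: the second coordinate is on a 10x larger scale.
rows = [(0.0, 0.0), (2.0, 20.0), (4.0, 40.0)]
variances = column_variances(rows)

# After scaling, both coordinates contribute equally to the distance.
print(variance_normalized_euclidean(rows[0], rows[2], variances))
```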

2D filter:

https://www.nytimes.com/2017/03/22/science/open-access-journals.html

Web of Science Journal Citation Reports

http://www.austms.org.au/Rankings/0101_AustMS_final_ranked.html

http://www.ncbi.nlm.nih.gov/pubmed/2406472?dopt=Abstract&holding=npg

False Positives will occur https://xkcd.com/882/
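
A quick arithmetic sketch of why false positives will occur when many hypotheses are tested. It assumes the tests are independent and each uses the same significance level; the count of 20 tests is chosen for illustration.

```python
def prob_any_false_positive(num_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests when no effect is real."""
    return 1.0 - (1.0 - alpha) ** num_tests

# With 20 independent tests at alpha = 0.05, the chance is roughly 64%.
print(round(prob_any_false_positive(20), 3))
```

The expected number of false positives is num_tests * alpha, which is 1 for 20 tests at the 0.05 level.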

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2686470/
ABSTRACT
In the scientific research community, plagiarism and covert multiple publications of the same data are considered unacceptable because they undermine the public confidence in the scientific integrity. Yet, little has been done to help authors and editors to identify highly similar citations, which sometimes may represent cases of unethical duplication. For this reason, we have made available Déjà vu, a publicly available database of highly similar Medline citations identified by the text similarity search engine eTBLAST. Following manual verification, highly similar citation pairs are classified into various categories ranging from duplicates with different authors to sanctioned duplicates. Déjà vu records also contain user-provided commentary and supporting information to substantiate each document's categorization. Déjà vu and eTBLAST are available to authors, editors, reviewers, ethicists and sociologists to study, intercept, annotate and deter questionable publication practices. These tools are part of a sustained effort to enhance the quality of Medline as ‘the’ biomedical corpus. The Déjà vu database is freely accessible at http://spore.swmed.edu/dejavu. The tool eTBLAST is also freely available at http://etblast.org.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1933238/
ABSTRACT
Authors, editors and reviewers alike use the biomedical literature to identify appropriate journals in which to publish, potential reviewers for papers or grants, and collaborators (or competitors) with similar interests. Traditionally, this process has either relied upon personal expertise and knowledge or upon a somewhat unsystematic and laborious process of manually searching through the literature for trends. To help with these tasks, we report three utilities that parse and summarize the results of an abstract similarity search to find appropriate journals for publication, authors with expertise in a given field, and documents similar to a submitted query. The utilities are based upon a program, eTBLAST, designed to identify similar documents within literature databases such as (but not limited to) MEDLINE. These services are freely accessible through the Internet at http://invention.swmed.edu/etblast/etblast.shtml, where users can upload a file or paste text such as an abstract into the browser interface.

https://link.springer.com/article/10.3758%2Fs13428-015-0664-2
http://www.nature.com/news/stat-checking-software-stirs-up-psychology-1.21049

http://blog.pubpeer.com/?p=190

https://pubpeer.com

Most journals now require at least a conflict-of-interest statement. Many also require an author contribution list.

http://jcs.biologists.org/content/121/11/1771
“I'd like to suggest that our Ph.D. programs often do students a disservice in two ways. First, I don't think students are made to understand how hard it is to do research. And how very, very hard it is to do important research. It's a lot harder than taking even very demanding courses. What makes it difficult is that research is immersion in the unknown. We just don't know what we're doing. We can't be sure whether we're asking the right question or doing the right experiment until we get the answer or the result.”