SIoT 2015 Regression and Functional Optimization models in Scientometric analysis: An Overview Dr Snehanshu Saha Professor, Dept of CS & Engg. PES Institute.


1 SIoT 2015 Regression and Functional Optimization models in Scientometric analysis: An Overview Dr Snehanshu Saha Professor, Dept of CS & Engg. PES Institute of Technology, Bangalore South Campus Bangalore

2 Contents
Introduction
Impact Factor
SCImago Journal and Country Rank
Challenges and Drawbacks
Problem Definition
Literature Survey
Recent Work Done
JIS
JIMI
Modeling Internationality, A Novel Approach
Methodology to be used to model and quantify Internationality
High Level Diagram of the Approach
Data Sources to capture features
Feature Selection
The Big Story
Summarizing the basic steps of the project
Deliverables
Collaborators
References

3 Introduction Measuring ‘internationality’ of scientific journals is an open problem. There exists no metric to rank journals based on the quality of work submitted by researchers. The Web of Science (WoS) and Scopus are the two main citation databases frequently used to rank journals in a discipline by their productivity and by the total citations received, as an indication of each journal's impact and influence. The most prominent indices these databases use to rank journals are the journal impact factor (JIF), used by WoS, and the SCImago Journal Rank (SJR), used by Scopus. Both the JIF and SJR are commonly used to rank, evaluate, categorize and compare journals.

4 Impact Factor The impact factor (IF) of an academic journal is a measure that reflects the average number of citations to recent articles published in the journal. It is frequently used as a proxy for the relative importance of a journal within its field: journals with higher impact factors are considered more important than those with lower ones. The impact factor is highly dependent on the academic discipline, and possibly on the speed with which papers get cited in a field.
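As a worked illustration of "average number of citations to recent articles", the sketch below computes an impact factor over a fixed window. The two-year window, the function name and the example numbers are assumptions for illustration; the slide does not specify them.

```python
def impact_factor(citations_to_recent, citable_items_recent):
    """Average citations per recent article.

    citations_to_recent: citations received this year by articles the
        journal published in the window (assumed here: previous 2 years).
    citable_items_recent: number of citable articles published in that window.
    """
    if citable_items_recent <= 0:
        raise ValueError("journal has no citable items in the window")
    return citations_to_recent / citable_items_recent


# Hypothetical journal: 150 articles in the window, cited 450 times this year.
print(impact_factor(450, 150))  # 3.0
```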

5 SCImago journal and country rank

6 contd… Where
SJRi - SCImago Journal Rank of journal i.
Cji - Citations from journal j to journal i.
Cj - Number of references of journal j.
d - Constant, normally 0.85.
e - Constant, normally 0.10.
N - Number of journals.
Artj - Number of articles of journal j.
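The symbols above suggest a PageRank-style iteration. The sketch below is a simplified illustration only: it assumes a three-term update (a constant term, an article-share term weighted by e, and a citation-transfer term weighted by d). It is not the full SCImago formula, and the function name and toy citation matrix are made up.

```python
def sjr_like(C, art, d=0.85, e=0.10, iters=100):
    """Simplified prestige iteration (an assumption, not the exact SJR formula).

    C[j][i]: citations from journal j to journal i.
    art[j]:  number of articles of journal j.
    """
    n = len(art)
    total_art = sum(art)
    refs = [sum(row) for row in C]          # Cj: references made by journal j
    score = [1.0 / n] * n                   # uniform initial guess
    for _ in range(iters):
        new = []
        for i in range(n):
            # Prestige passed to journal i by every journal j that cites it.
            transfer = sum(C[j][i] * score[j] / refs[j]
                           for j in range(n) if refs[j] > 0)
            new.append((1 - d - e) / n + e * art[i] / total_art + d * transfer)
        score = new
    return score


# Toy data: journal 0 is cited heavily by journals 1 and 2.
C = [[0, 1, 1],
     [8, 0, 2],
     [9, 1, 0]]
scores = sjr_like(C, art=[50, 40, 30])
print(scores)
```

In this toy setting every journal has outgoing references, so the scores keep summing to 1 from one iteration to the next.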

7 Challenges with the existing system
The impact factor is not always a reliable instrument: it is used for measuring and comparing the influence of entire journals, but not for the assessment of single papers, researchers or research programmes.
Calculation of the impact factor is based on various other indicators, and obtaining the indicator values is challenging.
Evaluating each significant factor and assigning it a weight is time-consuming and tedious.

8 Drawbacks with SJR
SJR uses the Google PageRank algorithm to evaluate the ranking of journals.
SJR uses 13 input parameters; the data are relatively difficult to obtain, and the evaluation cost is high.
The ranking process is iterative, so if the initial ranks/guesses are not a good approximation, the computation takes more time or has to restart.
SCImago and SCOPUS require storage of a significant volume of data.

9 Problem Definition The dissertation aims at
(Phase One) Capturing/classifying essential metrics to rank journals based on the quality of work submitted by researchers. This includes capturing the data from SCOPUS, Web of Science, Harzing or any other benchmark database. The input parameters used by our model will be different from those obtained from the web. For example, self-citation, one of the key input parameters, is to be captured by our own crawler.

10 Problem Definition contd…
(Phase Two) Define, quantify and reconcile "Internationality". The features captured in the previous phase are used as input parameters to the Cobb-Douglas and Translog functions. The function yields an internationality score and hence quantifies "internationality".

11 Literature Survey
Neelam Jangid, Snehanshu Saha, Siddhant Gupta, Mukunda Rao J., Ranking of Journals in Science and Technology Domain: A Novel And Computationally Lightweight Approach, IERI Procedia 10 (2014) 57–62 doi: /j.ieri
G. Buchandiran, An Exploratory Study of Indian Science and Technology Publication Output, Department of Library and Information Science, Loyola Institute of Technology Chennai
Chiang Kao, The Authorship And Internationality Of Industrial Engineering Journals, Scientometrics, 80 (3) (2009)
Chia-Lin Chang, Michael McAleer, Les Oxley, Coercive journal self citations, impact factor, Journal Influence and Article Influence, Mathematics and Computers in Simulation 93 (2013) 190–197

12 Contd..
Perakakis, P., Taylor, M., Buela-Casal, G., & Checa, P. (2006). A neuro-fuzzy system to calculate a journal internationality index. In: Proceedings CEDI 2005 symposium.
Buela-Casal, G., Perakakis, P., Taylor, M. & Checa, P. (2006). Measuring internationality: Reflections and perspectives on academic journals. Scientometrics, vol. 67, n. 1,
Abrizah; A.N. Zainab; K. Kiran; R.G. Raj, LIS journals scientific impact and subject categorization: a comparison between Web of Science and Scopus, Scientometrics (2013) 94:721–740 DOI /s
Ludo Waltman, Nees Jan van Eck, Thed N. van Leeuwen, Martijn S. Visser, Some modifications to the SNIP journal impact indicator, Journal of Informetrics 7 (2013) 272–285

13 Recent work done
Neelam Jangid, Snehanshu Saha, Archana Mathur, Anand Narasimhamurthy, “DSRS: Estimation and Forecasting of Journal Influence in the Science and Technology Domain via a Lightweight Quantitative Approach”
Nandita Dwivedi, Avantika Dwivedi, Snehanshu Saha, Archana Mathur, Gouri Ginde (2015) “JIMI, Journal Internationality Modelling Index - An Analytical Investigation”, Fourth National Conference of Institute of Scientometrics
Chitra Balasubramaniam, Snehanshu Saha, Harsha RS, Nandita Dwivedi, Avantika Dwivedi, Archana Mathur (2015), Modeling Internationality - A Novel Approach of Classification of Scholarly Journals.
Gouri Ginde, Snehanshu Saha, Chitra Balasubramaniam, Harsha R.S, Archana Mathur, BS Dayasagar, Anand M N (2015) “Mining massive databases for computation of scholastic indices - Model and Quantify ‘internationality’ of peer-reviewed journals: A Technical Report”, Fourth National Conference of Institute of Scientometrics

14 JIS: Journal Influence Score
The initial 13 parameters are captured from SJR. These are :-
H index:- No. of articles (h) that have received at least h citations over the whole period
Total Documents:- Documents published in the particular journal
Total Docs (3years):- Documents published in the 3 previous years in the particular journal
Total References:- Bibliographical references in the journal
Total Cites (3years):- Citations received in a year by journal documents published in the previous 3 years
Self Cites (3years):- The journal's self-citations in a year to its own documents published in the previous 3 years
Citable Docs. (3years):- Citable documents published in the previous 3 years
Cites / Doc. (4years):- Average citations per document in a 4 year period
Cites / Doc. (3years):- Average citations per document in a 3 year period
Cites / Doc. (2years):- Average citations per document in a 2 year period
References / Doc.:- Average references per document
Cited Docs.:- Number of documents published in the previous 3 years that have been cited at least once
Uncited Docs.:- Number of documents published in the previous 3 years that have never been cited
% International Collaboration:- Ratio of publications whose affiliations include more than one country
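The H index definition above can be made concrete with a short sketch. The function name and the sample citation counts are illustrative assumptions, not values from the study.

```python
def h_index(citations):
    """Largest h such that h articles have at least h citations each."""
    h = 0
    # Walk the articles from most cited to least cited.
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical journal with five articles.
print(h_index([10, 8, 5, 4, 3]))  # 4
```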

15 Optimized Factors for statistical analysis
The initial 13 parameters are reduced to 9, as a smaller dataset is considered for evaluation of the score.
H index:- No. of articles (h) that have received at least h citations over the whole period
Total Documents:- Documents published in the particular journal
Total Docs (3years):- Documents published in the 3 previous years in the particular journal
Total References:- Bibliographical references in the journal
Total Cites (3years):- Citations received in a year by journal documents published in the previous 3 years
Citable Docs. (3years):- Citable documents published in the previous 3 years
References / Doc.:- Average references per document
Cites / Doc. (2years):- Average citations per document in a 2 year period

16 Procedure/steps to calculate JIS (step 1)
Cross-correlation and a multiple linear regression equation are applied to the final 9 parameters. On this data, regression analysis is carried out, and the P-value and correlation coefficient are tested to decide which parameters to remove. The down-selected parameters are correlated pairwise, and parameters with minimal correlation are retained. A regression is run on the final parameters to obtain the intercept and coefficients. Finally, the derived model equation gives an influence score for a journal, called JIS.
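The pairwise-correlation step can be sketched as follows. This is a minimal illustration, assuming a Pearson correlation and a cut-off of 0.9 for flagging redundant pairs; the threshold, the function names and the feature values are made up and are not taken from the study.

```python
from math import sqrt
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(features, threshold=0.9):
    """Return feature pairs whose |correlation| exceeds the threshold."""
    return [(a, b) for a, b in combinations(features, 2)
            if abs(pearson(features[a], features[b])) > threshold]


# Made-up values for four journals; not data from the study.
features = {
    "h_index": [10, 20, 30, 40],
    "total_cites_3y": [105, 198, 310, 395],   # nearly proportional to h_index
    "refs_per_doc": [30, 12, 25, 18],
}
print(redundant_pairs(features))  # [('h_index', 'total_cites_3y')]
```

Of each flagged pair, only one parameter would be carried forward, which matches the idea of retaining parameters with minimal mutual correlation.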

17 Regression analysis – Phase 1
Tool Used:- Excel Data Analysis Module Results:-

18 Regression analysis – Phase 1
Inference:-
Factor                  P-Value  Correlation Coefficient  Optimization Decision
H index                 0.005    0.69
Total Docs. (2012)      0.012    0.16
Total Docs. (3years)    0.334    0.37
Total Refs.             0.013    0.40
Total Cites (3years)    0.001    0.55
Citable Docs. (3years)  0.516
Cites / Doc. (2years)   0.000    0.85
Ref. / Doc.             0.751    0.17

19 Regression analysis – Phase 2
Regression and Residual Analysis Performed for Remaining Factors Regression Results:-

20 Regression analysis – Phase 2
Residual Results:-
Data points with extreme +ve residuals
Data points with extreme -ve residuals
These data points make up 10% of the overall data and are removed before further statistical and regression analysis.

21 Regression analysis – Phase 3
Tool Used:- Excel Data Analysis Module
Results:- Total Docs (3 Years) has a P-value > 0.05 and a negative coefficient, which is not logical. Hence, it is removed from further analysis.

22 Regression analysis – Final Phase
Tool Used:- Excel Data Analysis Module
Results:-
R Square is > 0.75
F value is < 0.05
All P-values are < 0.05

23 DSRS – step two The down-selected set of variables computed in the previous step for multiple journals is used to compute the overall variance from the covariance matrix. The percentage of variability accounted for by each individual input variable is then computed. To validate, a regression is run on the small set of five input variables selected as described.
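One reading of this step is sketched below: take each variable's variance (the diagonal of the sample covariance matrix), divide by their sum (the trace, used here as the overall variance), and report the share as a percentage. The function name and data are hypothetical, and the slide does not say whether the variables are rescaled first.

```python
def variance_shares(columns):
    """Percentage of total variance contributed by each variable.

    Uses the diagonal of the sample covariance matrix: each variable's
    variance divided by the trace (taken here as the overall variance).
    """
    variances = {}
    for name, values in columns.items():
        n = len(values)
        mean = sum(values) / n
        variances[name] = sum((v - mean) ** 2 for v in values) / (n - 1)
    total = sum(variances.values())
    return {name: 100.0 * var / total for name, var in variances.items()}


# Made-up columns: variance 1 and variance 4, so shares of 20% and 80%.
shares = variance_shares({"a": [1, 2, 3], "b": [2, 4, 6]})
print(shares)
```

Because raw variances depend on units, variables measured on very different scales would usually be standardized before such a comparison.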

24 Regression
Regression Statistics
Multiple R 0.8774
R Square 0.76983
Adjusted R Square
Standard Error
Observations 225
ANOVA (columns: Df, SS, MS, F, Significance F)
Regression 5 E-68
Residual 219
Total 224
Coefficients (columns: Coefficients, t Stat, P-value, Lower 95%, Upper 95%)
Intercept 5.87E-06
Quarter 8.69E-06
H index
Total Docs. (2012)
Total Refs. -8.3E-06 7.75E-06 E-05 6.98E-06
Cites/Doc. (2years) 1.19E-19

25 Front Screen Quarter values (1-4)

26 H-Index Total Documents for 2012

27 Total References Cites/Docs for 2 years

28 FINAL Output Screen With JIS Score

29 Regression Equation
Journal Influence Score = ( *Quarter) + ( * H index) + ( * Total Documents For Current Year) - (8.3E-06 * Total References) + ( *Cites/Doc. in previous 2 years)
Further, the journals are differentiated into clusters of “National” and “International” journals using the K-means clustering algorithm.

30 K-Means Clustering Algorithm
Step 1: Calculate the influence score of all the journals in the sample set.
Step 2: Select two distinct cluster means arbitrarily.
Step 3: Initialize the variables (iteration no = 0, max_iterations = 100, changed = 1).
Step 4: Loop while both conditions hold: while (changed == 1 and iteration no < max_iterations)

31 Contd..
Step 4.1 Increment iteration no, set changed = 0.
Step 4.2 For all samples in the dataset, assign each sample to the class with the nearest cluster mean.
Step 4.3 Initialize variables to 0 (ele0 = 0, ele1 = 0, sum0 = 0, sum1 = 0).

32 Contd.. Step 4.4 Re-compute cluster means
For all samples in the dataset:
    If (class == 0): add influence score to sum0, increment ele0
    else: add influence score to sum1, increment ele1
new0 = sum0 / ele0; new1 = sum1 / ele1;
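Steps 2 to 4.4 can be sketched in Python as below. The transcript does not show how `changed` is set back to 1, so the termination rule used here (continue only while a cluster mean moves) is an assumption, as are the function name, the empty-cluster guard and the sample scores.

```python
def two_means(scores, mean0, mean1, max_iterations=100):
    """Split 1-D influence scores into two clusters (Steps 2 to 4.4)."""
    labels = [0] * len(scores)
    iteration, changed = 0, True
    while changed and iteration < max_iterations:
        iteration += 1
        changed = False
        # Step 4.2: assign each score to the nearest cluster mean.
        labels = [0 if abs(s - mean0) <= abs(s - mean1) else 1 for s in scores]
        # Steps 4.3 and 4.4: recompute each cluster mean.
        sum0 = sum(s for s, c in zip(scores, labels) if c == 0)
        sum1 = sum(s for s, c in zip(scores, labels) if c == 1)
        ele0 = labels.count(0)
        ele1 = len(labels) - ele0
        new0 = sum0 / ele0 if ele0 else mean0   # keep old mean if cluster is empty
        new1 = sum1 / ele1 if ele1 else mean1
        # Assumed termination rule: continue only if a mean moved.
        if new0 != mean0 or new1 != mean1:
            changed = True
        mean0, mean1 = new0, new1
    return labels, (mean0, mean1)


# Made-up influence scores; initial means chosen arbitrarily (Step 2).
labels, means = two_means([0.1, 0.2, 0.15, 0.9, 1.0, 0.95], 0.1, 1.0)
print(labels)  # [0, 0, 0, 1, 1, 1]
```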

33 Contd..

34 JIMI - Journal Internationality Modeling Index
The paper presents an analytical model to compute internationality and investigates its efficacy theoretically. It proposes a technique to quantify internationality through a mathematical model that determines the internationality of a journal from two major factors: Source Normalized Impact per Paper (SNIP) and International Collaboration. The Cobb-Douglas production function with x1 and x2 as inputs and y, internationality, as output is $y = A x_1^{\alpha} x_2^{\beta}$, where A is a constant and α and β are the elasticities of the inputs; x1 is taken from SCImagojr and x2 is taken from Journal Metrics (Scopus). It is proved that the model has a global maximum: a particular value of the inputs (SNIP and International Collaboration) ensures a maximum value of internationality, subject to a constraint or set of constraints.

35 Proof of concept Implications of Theorem 1:
Cobb-Douglas is concave under certain conditions on the elasticities. For such values of the elasticities, the Hessian matrix of the Cobb-Douglas function is negative semi-definite; the function is therefore concave and attains a global maximum.
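The slide does not list the conditions on the elasticities. As a hedged illustration, the sketch below assumes the standard two-input form y = A * x1^alpha * x2^beta and relies on the textbook result that, for non-negative elasticities, this function is concave when alpha + beta <= 1. It evaluates the analytic 2x2 Hessian at a point and tests negative semi-definiteness; the function names and sample values are hypothetical.

```python
def cobb_douglas(x1, x2, alpha, beta, A=1.0):
    """y = A * x1**alpha * x2**beta for positive inputs."""
    return A * x1 ** alpha * x2 ** beta


def hessian_is_nsd(x1, x2, alpha, beta, A=1.0, tol=1e-12):
    """Check negative semi-definiteness of the 2x2 Hessian at (x1, x2)."""
    y = cobb_douglas(x1, x2, alpha, beta, A)
    f11 = alpha * (alpha - 1) * y / x1 ** 2     # second derivative in x1
    f22 = beta * (beta - 1) * y / x2 ** 2       # second derivative in x2
    f12 = alpha * beta * y / (x1 * x2)          # mixed partial derivative
    det = f11 * f22 - f12 ** 2
    # A symmetric 2x2 matrix is NSD iff both diagonal entries are <= 0
    # and the determinant is >= 0.
    return f11 <= tol and f22 <= tol and det >= -tol


print(hessian_is_nsd(2.0, 3.0, alpha=0.3, beta=0.5))  # True  (alpha + beta < 1)
print(hessian_is_nsd(2.0, 3.0, alpha=0.7, beta=0.6))  # False (alpha + beta > 1)
```

The determinant simplifies to a positive factor times alpha * beta * (1 - alpha - beta), which is why the sign flips once the elasticities sum to more than one.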

36 Theorem 2

37 Contd.. Significance of concavity: The extrema of the function used to model "internationality" are useful in finding a global maximum value of the "internationality" indicator. The modeling paradigm is based on the premise that there exists a maximum internationality score, and that scores in its neighborhood can be classified as levels of internationality. In this context, it is explored whether the maximum given by the concave function, i.e. Cobb-Douglas, is the global maximum.

38 [Figure: 2012 data, four panels: X1 increase, X2 decrease; X1 decrease, X2 increase; X1 and X2 increase; X1 and X2 decrease]

39 A different approach to Model Internationality
In this paper, the Cobb-Douglas and Translog methods are used to calculate the internationality of a journal. In the model, Multiple Linear Regression (MLR) for the Cobb-Douglas model and curvilinear regression for the Translog model generate a result that shows what percentage of the variability is explained by the given dataset. The parameters chosen to compute the internationality score of a journal are:
International Collaboration
SNIP (Source Normalized Impact per Paper)
Number of Cited Documents/Total Number of Documents

40 Contd.. The Cobb-Douglas production function gives the maximum and minimum value of the output for a set of inputs. It allows us to obtain the maximum internationality score for a set of values of international collaboration (x1) and SNIP (x2). Translog is a second-order log-linear form that includes all cross terms.
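The Translog formula itself is not present in this transcript. For reference, the standard textbook two-input Translog form, written with the same x1 and x2, is sketched below; the coefficient names are assumptions and may differ from the authors' notation.

```latex
\ln y = \beta_0 + \beta_1 \ln x_1 + \beta_2 \ln x_2
      + \tfrac{1}{2}\beta_{11} (\ln x_1)^2
      + \tfrac{1}{2}\beta_{22} (\ln x_2)^2
      + \beta_{12} \ln x_1 \ln x_2
```

Setting the second-order coefficients to zero recovers the log-linear Cobb-Douglas form, ln y = ln A + α ln x1 + β ln x2.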

41 Results
Results obtained from the Cobb-Douglas and Translog model analysis are discussed.
[Figure: Boundary distributions of internationality for the years 2011 and 2012]

42 Contd… Histogram of internationality 2011 and 2012

43 Results First, the boundary distributions of y are plotted for the years 2011 and 2012. The plot shows a high-density spread between 0.6 and 1, which means that the internationality of most of the journals lies between these values. Next, the y values derived from the functional form are used to create the histogram plots. The results indicate that y follows a Gaussian distribution. Hence Gaussian classification is performed, classifying journals within the first standard deviation as “International journals” and the rest as “National journals”.
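A literal reading of this classification rule is sketched below: a journal whose score lies within one standard deviation of the mean is labelled "International", all others "National". The use of the population standard deviation, the function name and the scores are assumptions for illustration.

```python
from statistics import mean, pstdev


def classify_by_sigma(scores):
    """Label scores within one standard deviation of the mean 'International'."""
    mu, sigma = mean(scores), pstdev(scores)
    return ["International" if abs(s - mu) <= sigma else "National"
            for s in scores]


# Made-up internationality scores for six journals.
result = classify_by_sigma([0.2, 0.7, 0.75, 0.8, 0.85, 0.98])
print(result)
```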

44 Methodology to be used
The high-level view of the proposed method to model and quantify “Internationality” is divided into two phases.
(Phase One)
Crawling the Web to acquire indicators of non-indexed journals.
Big Data tools such as Map/Reduce programs or PIG scripts are executed to process CSV files and extract the required indicators.
The Cobb-Douglas and Translog models are applied to the indicators to obtain a score for classifying journals into “international” and “national” categories.

45 Methodology to be used Phase Two :
Obtaining indicators of the indexed journals that are older than 5 years from SCOPUS, Journal Metrics from Elsevier, CWTS Journal Indicators, etc.
The next step, dimensionality reduction, is achieved by applying Multiple Linear Regression and Principal Component Analysis in order to eliminate redundant indicators.
The resultant features/indicators are fed into a regression model to obtain an influence score for a journal.
The K-means clustering algorithm is applied to obtain the clusters of internationality.

46 High level diagram

47 Data source 1) The Journal Metrics from Elsevier - provides the statistics for the features mentioned below:
Source Normalized Impact per Paper (SNIP)
Impact per Publication (IPP)
SCImago Journal Rank (SJR)

48 Data source 2) The SCImago Journal & Country Rank - a portal that includes the journal and country scientific indicators developed from the information contained in the Scopus database.
SJR (SCImago Journal Rank) indicator
H Index
Total Docs./Total Documents
Total Docs. (3years)
Total References
Total Cites (3years)
Citable Documents
Cites per Doc (2 years)
Cites per Doc (3 years)
Cites per Doc (4 years)
Ref. / Doc
Self Cites
Non-citable documents
Cited Documents
Uncited Documents
% International Collaboration

49 Data source 3) CWTS Journal Indicators
P - The number of publications of a source in the past three years.
IPP - The impact per publication
SNIP
% self cit

50 Our own crawlers Data can be gathered using a web crawler or a software tool to capture all recent (non-indexed) journals (between 3 and 5 years in publication) from Google Scholar. Web crawling by using the Python script: A couple of Python scripts are

51 Feature mapping All the features from the various data sources are processed further to retain only the desired journal features that contribute to the evaluation of the internationality index. These features are:
Total Cited Documents
International Collaboration Ratio
SNIP (Source Normalized Impact per Paper)
Turnaround Time
Acceptance Ratio (Rejection Ratio)
Impersonal Citation Ratio
Self Citation/Total Citation
The above features are to be fed as input parameters to evaluate the JIS score using MLR and DSRS. Further, using the Cobb-Douglas and Translog production functions, an Internationality Score is to be calculated.

52 The big story Prestige/Internationality of a journal is the convex combination of JIS and the Internationality Score, represented as:
Internationality of a journal, YI = α JIS + (1 - α) JIMI; 0 ≤ α ≤ 1
where
YI refers to the internationality score as the response variable (to be sorted in decreasing order),
JIS is the score obtained from the JIS metric,
JIMI is the score evaluated from the work done using two parameters (JIMI), and
α is a weight deduced from the cross correlation.
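The combination and the decreasing sort can be sketched as follows. The weight alpha = 0.5 and the journal scores are made up; in the deck, alpha is deduced from the cross correlation, and the sketch assumes JIS and JIMI are already on comparable scales.

```python
def prestige(jis, jimi, alpha):
    """Convex combination Y_I = alpha * JIS + (1 - alpha) * JIMI."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    return alpha * jis + (1 - alpha) * jimi


# Made-up (JIS, JIMI) scores for three hypothetical journals.
journals = {"J1": (0.8, 0.6), "J2": (0.4, 0.9), "J3": (0.5, 0.5)}
ranked = sorted(journals, key=lambda j: prestige(*journals[j], alpha=0.5),
                reverse=True)
print(ranked)  # ['J1', 'J2', 'J3']
```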

53 Summarizing the basic steps of the project
Web Crawling of journals
Dimensionality Analysis
Feature Extraction and Data Collection
  Total Cited documents
  International Collaboration Ratio
  SNIP (Source Normalized Impact per Paper)
  Turnaround time
  Acceptance Ratio (Rejection Ratio)
  Impersonal Citation Ratio
  Self Citation/Total Citations
Analytics: Cobb-Douglas and Log Production Model
Classification and Clustering
End result - Clusters

54 Deliverables
JIS
Correlation between JIS and Internationality
Diffusion of Internationality
A Web-based application tool, similar to SCOPUS, to characterize and quantify journals which are not SCOPUS/ISI Web of Science indexed

55 Collaborators
Indian Statistical Institute, Bangalore
Institute of Scientometrics, Tumkur
BITS Hyderabad

56 References
[1] Website : http://www.journalmetrics.com/values.php
[2] Harzing, A.W. (2007) Publish or Perish, available from
[3] Website:
[4] Neelam Jangid, Snehanshu Saha, Anand Narasimhamurthy, Archana Mathur, Computing the Prestige of a journal: A Revised Multiple Linear Regression Approach (2015); WCI - ACM Digital library (accepted), Aug 10-13, 2015.
[5] Gualberto Buela-Casal, Pandelis Perakakis, Michael Taylor and Purificacion Checa, Measuring Internationality: Reflections And Perspectives On Academic Journals, Scientometrics, 67 (1) (2006)
[6] Ludo Waltman, Nees Jan van Eck, Thed N. van Leeuwen, Martijn S. Visser, Some modifications to the SNIP journal impact indicator, Journal of Informetrics 7 (2013) 272–285

