- Journal details
- First published
- 30 Mar 2017
- Publication frequency
- 4 times per year
- Open Access
The 1st International Conference on Data-driven Knowledge Discovery: When Data Science Meets Information Science. June 19–22, 2016, Beijing, China
Published online: 01 Sep 2017
Pages: 1 - 5
- Open Access
Pages: 79 - 94
This research aims to identify product search tasks in online shopping and analyze the characteristics of consumer multi-tasking search sessions.
The experimental dataset contains 8,949 queries of 582 users from 3,483 search sessions. A sequential comparison of the Jaccard similarity coefficient between two adjacent search queries and hierarchical clustering of queries is used to identify search tasks.
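The sequential comparison of adjacent queries can be sketched with the Jaccard similarity coefficient over query term sets. This is a minimal illustration; the example queries and the idea that a low score marks a task boundary are schematic, and the actual threshold used in the study is not shown here:

```python
def jaccard(q1: str, q2: str) -> float:
    """Jaccard similarity coefficient between the term sets of two queries."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Compare each pair of adjacent queries in a session: a low coefficient
# suggests a boundary between two different search tasks.
session = ["red running shoes", "running shoes nike", "coffee maker"]
scores = [jaccard(p, c) for p, c in zip(session, session[1:])]
```

In the paper this pairwise comparison is combined with hierarchical clustering of queries; the clustering step is not reproduced here.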
(1) Users issued a similar number of queries (1.43 to 1.47) of similar length (7.3–7.6 characters) per task in both mono-tasking and multi-tasking sessions, and (2) users spent more time on average in sessions with more tasks, but less time on each task as the number of tasks in a session increased.
The task identification method that relies only on query terms does not completely reflect the complex nature of consumer shopping behavior.
These results provide an exploratory understanding of the relationships among multiple shopping tasks, and can be useful for product recommendation and shopping task prediction.
The originality of this research lies in applying query clustering to the identification and analysis of online shopping tasks, and in characterizing product search sessions.
- Product search
- Shopping task identification
- Shopping task analysis
- Multitasking session
- Open Access
Predictive Characteristics of Co-authorship Networks: Comparing the Unweighted, Weighted, and Bipartite Cases
Pages: 59 - 78
This study aims to determine to what extent different types of networks can be used to predict future co-authorship among authors.
We compare three types of networks: unweighted networks, in which a link represents a past collaboration; weighted networks, in which links are weighted by the number of joint publications; and bipartite author-publication networks. The analysis investigates their relation to positive stability, as well as their potential in predicting links in future versions of the co-authorship network. Several hypotheses are tested.
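The difference between the unweighted projection and the bipartite author–publication network can be illustrated with a simple common-neighbours link prediction score. The authors and publications below are hypothetical toy data, and common-neighbours is only one of many scores used in such studies:

```python
from collections import defaultdict
from itertools import combinations

# Toy bipartite author–publication data (hypothetical names).
publications = {
    "p1": {"alice", "bob"},
    "p2": {"alice", "bob", "carol"},
    "p3": {"carol", "dave"},
}

# Project to an unweighted co-authorship network:
# a link means at least one joint publication.
unweighted = defaultdict(set)
for authors in publications.values():
    for a, b in combinations(sorted(authors), 2):
        unweighted[a].add(b)
        unweighted[b].add(a)

def common_neighbors_score(u: str, v: str) -> int:
    """Common-neighbours score in the unweighted projection."""
    return len(unweighted[u] & unweighted[v])

def bipartite_score(u: str, v: str) -> int:
    """Common neighbours in the bipartite graph: publications shared by u and v."""
    return sum(1 for authors in publications.values() if u in authors and v in authors)
```

Note how the projection discards information the bipartite network keeps: the unweighted graph cannot tell whether two authors collaborated once or many times, whereas shared publications remain countable in the bipartite case.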
Among other results, we find that weighted networks do not automatically lead to better predictions. Bipartite networks, however, outperform unweighted networks in almost all cases.
Only two relatively small case studies are considered.
The study suggests that future link prediction studies on co-occurrence networks should consider using the bipartite network as a training network.
This is the first systematic comparison of unweighted, weighted, and bipartite training networks in link prediction.
- Network evolution
- Link prediction
- Weighted networks
- Bipartite networks
- Two-mode networks
- Open Access
Pages: 42 - 58
The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.
The paper is centered on cleaning datasets gathered from publishers and online resources using specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested against manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) on two measures: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme. We also tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.
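A two-algorithm voting scheme with a coverage indicator can be sketched as follows. The labels and predictions are hypothetical toy data, not the paper's Web of Science records; the idea is that records on which the two classifiers disagree are left for manual coding:

```python
def combine(pred_a: str, pred_b: str):
    """Two-algorithm voting: accept a label only when both classifiers agree;
    disagreements return None and are flagged for manual coding."""
    return pred_a if pred_a == pred_b else None

# Hypothetical per-record outputs from two classifiers (e.g. SVM and Boosting).
svm      = ["related", "related", "unrelated", "related"]
boosting = ["related", "unrelated", "unrelated", "related"]

combined = [combine(a, b) for a, b in zip(svm, boosting)]
# Coverage: the share of records on which the classifiers agree.
coverage = sum(p is not None for p in combined) / len(combined)
```

Records with `None` in `combined` are the residue that still needs manual classification; a high coverage means little manual work remains.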
We found that the performance of the algorithms varies with the size of the training sample. However, for the classification exercise in this paper, the best performing algorithms were SVM and Boosting. The combination of these two algorithms achieved high agreement on coverage and was highly accurate. This combination also performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.
The dataset gathered has significantly more records related to the topic of interest compared to unrelated topics. This may affect the performance of some algorithms, especially in their identification of unrelated papers.
Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is large. With accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which opens the possibility of incorporating these algorithms into software specifically designed for data cleaning and classification.
We analyzed the performance of seven algorithms and whether combinations of these algorithms improve accuracy in data collection. Use of these algorithms could reduce time needed for manual data cleaning.
- Machine learning
- Data cleaning
- Open Access
Pages: 27 - 41
In this contribution, we want to detect the document type profiles of the three prestigious journals
Using relative values based on fractional counting, we investigate the distribution of publications across document types at both the journal and country level, and we use (cosine) document type profile similarity values to compare pairs of publication years within countries.
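The cosine similarity between two document type profiles can be sketched as follows. The profiles are given as shares of publications per document type; the figures below are hypothetical, not the journals' actual profiles:

```python
from math import sqrt

def cosine(p: dict, q: dict) -> float:
    """Cosine similarity between two document type profiles,
    each a mapping from document type to share of publications."""
    types = set(p) | set(q)
    dot = sum(p.get(t, 0.0) * q.get(t, 0.0) for t in types)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Hypothetical profiles of one journal in two publication years.
year_a = {"Article": 0.70, "Letter": 0.20, "Editorial": 0.10}
year_b = {"Article": 0.65, "Letter": 0.25, "Editorial": 0.10}
```

Because the shares are relative (fractional) values, the cosine compares the shape of the profile rather than publication volume; identical profiles score 1.0 and profiles with no document types in common score 0.0.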
The main limitation of this research concerns the Web of Science classification of publications into document types. Since the analysis of the paper is based on document types of Web of Science, the classification in question is not free from errors, and the accuracy of the analysis might be affected.
Results show that
Results highlight the importance of other document types than
- Document type profile
- Open Access
The Power-weakness Ratios (PWR) as a Journal Indicator: Testing the “Tournaments” Metaphor in Citation Impact Studies
Pages: 6 - 26
Ramanujacharyulu developed the Power-weakness Ratio (PWR) for scoring tournaments. The PWR algorithm has been advocated (and used) for measuring the impact of journals. We show how such a newly proposed indicator can be empirically tested.
PWR values can be found by recursively multiplying the citation matrix by itself until convergence is reached in both the cited and citing dimensions; the quotient of these two values is defined as PWR. We study the effectiveness of PWR using journal ecosystems drawn from the Library and Information Science (LIS) set of the Web of Science (83 journals) as an example. Pajek is used to compute PWRs for the full set, and Excel for the computation in the case of the two smaller sub-graphs: (1)
A test using the set of 83 journals converged, but did not provide interpretable results. Further decomposition of this set into homogeneous sub-graphs shows that—like most other journal indicators—PWR can perhaps be used within homogeneous sets, but not across citation communities. We conclude that PWR does not work as a journal impact indicator; journal impact, for example, is not a tournament.
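The recursive computation described above can be sketched in plain Python. This is a minimal illustration with a hypothetical 2×2 citation matrix, and it assumes the citation graph is strongly connected and aperiodic (e.g. journals have self-citations) so that the iteration converges:

```python
def pwr(C, iterations=200):
    """Power-weakness Ratio (Ramanujacharyulu) for a citation matrix C,
    where C[i][j] is the number of citations from journal i to journal j.
    Iterates the 'power' (cited) and 'weakness' (citing) scores via repeated
    multiplication and returns their quotient per journal."""
    n = len(C)
    power = [1.0] * n
    weak = [1.0] * n
    for _ in range(iterations):
        # Cited dimension: power flows along incoming citations.
        power = [sum(C[i][j] * power[i] for i in range(n)) for j in range(n)]
        s = sum(power)
        power = [x / s for x in power]
        # Citing dimension: weakness flows along outgoing citations.
        weak = [sum(C[i][j] * weak[j] for j in range(n)) for i in range(n)]
        s = sum(weak)
        weak = [x / s for x in weak]
    return [p / w for p, w in zip(power, weak)]

# Hypothetical citation matrix: journal 1 is cited more and cites less,
# so its PWR should exceed that of journal 0.
ratios = pwr([[1, 2], [1, 1]])
```

In a perfectly symmetric citation matrix, power and weakness coincide and every journal's PWR is 1.0; deviations from 1 indicate a net surplus or deficit of "being cited" over "citing."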
Journals that are not represented on the “citing” dimension of the matrix-for example, because they no longer appear, but are still registered as “cited” (e.g.
The association of “cited” with “power” and “citing” with “weakness” can be considered as a metaphor. In our opinion, referencing is an actor category and can be studied in terms of behavior, whereas “citedness” is a property of a document with an expected dynamics very different from that of “citing.” From this perspective, the PWR model is not valid as a journal indicator.
Arguments for using PWR are: (1) its symmetrical handling of the rows and columns in the asymmetrical citation matrix, (2) its recursive algorithm, and (3) its mathematical elegance. In this study, PWR is discussed and critically assessed.