Publications

  1. Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation Michal Lukasik,   Lin Chen,   Harikrishna Narasimhan,   Aditya Menon,   Wittawat Jitkrittum,   Felix Yu,   Sashank Reddi,   Gang Fu,   Mohammadhossein Bateni,   Sanjiv Kumar Forty-second International Conference on Machine Learning 2025
    Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem—loss aggregation and label aggregation—by characterizing their Bayes-optimal solutions. We show that while both approaches can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
  2. Cost-Aware Routing for Efficient Text-To-Image Generation Qinchan Li,   Kenneth Chen,   Changyue Su,   Wittawat Jitkrittum,   Qi Sun,   Patsorn Sangkloy arXiv 2025
    Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
  3. Faster Cascades via Speculative Decoding Harikrishna Narasimhan,   Wittawat Jitkrittum,   Ankit Rawat,   Seungyeon Kim,   Neha Gupta,   Aditya Menon,   Sanjiv Kumar The Thirteenth International Conference on Learning Representations 2025
    Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches interleave two models, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for “hard” inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel scoring mode. These mechanisms offer different benefits: empirically, cascades offer compelling cost-quality trade-offs, often even outperforming the large model; speculative decoding offers impressive speed-ups, while guaranteeing quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Experiments with Gemma and T5 models on a range of language benchmarks show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
  4. I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning Stephan Rabanser,   Nathalie Rauschmayr,   Achin Kulshrestha,   Petra Poklukar,   Wittawat Jitkrittum,   Sean Augenstein,   Congchao Wang,   Federico Tombari arXiv 2025
    Large-scale machine learning models deliver strong performance across a wide range of tasks but come with significant computational and resource constraints. To mitigate these challenges, local smaller models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across various tasks and domains without any architectural changes. We evaluate our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
  5. Universal Model Routing for Efficient LLM Inference Wittawat Jitkrittum,   Harikrishna Narasimhan,   Ankit Rawat,   Jeevesh Juneja,   Zifeng Wang,   Chen-Yu Lee,   Pradeep Shenoy,   Rina Panigrahy,   Aditya Menon,   Sanjiv Kumar arXiv 2025
    Large language models' significant advances in capabilities are accompanied by significant increases in inference costs. Model routing is a simple technique for reducing inference cost, wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective strategies, relying on cluster-based routing and a learned cluster map respectively. We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors. Experiments on a range of public benchmarks show the effectiveness of the proposed strategies in routing amongst more than 30 unseen LLMs.
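    For intuition, a minimal, hypothetical sketch of cluster-based routing in the spirit of the abstract (not the paper's implementation): each LLM is summarized by its per-cluster accuracy on a small set of representative prompts, and a new prompt is routed to the cheapest LLM whose estimated accuracy in the prompt's cluster clears a target bar. The function names, accuracy bar, and fallback rule below are illustrative assumptions.

```python
# Illustrative sketch of cluster-based routing among (possibly unseen) LLMs.
import numpy as np
from sklearn.cluster import KMeans

def build_router(prompt_embeddings, n_clusters=8, seed=0):
    """Cluster a set of representative prompt embeddings."""
    return KMeans(n_clusters=n_clusters, random_state=seed).fit(prompt_embeddings)

def llm_feature_vector(correct_flags, cluster_ids, n_clusters):
    """Per-cluster accuracy of one LLM on the representative prompts (its 'feature vector')."""
    return np.array([
        correct_flags[cluster_ids == c].mean() if np.any(cluster_ids == c) else 0.0
        for c in range(n_clusters)
    ])

def route(prompt_embedding, kmeans, llm_features, llm_costs, min_accuracy=0.7):
    """Pick the cheapest LLM whose estimated accuracy in this prompt's cluster clears the bar."""
    c = int(kmeans.predict(prompt_embedding[None, :])[0])
    candidates = [i for i, f in enumerate(llm_features) if f[c] >= min_accuracy]
    if not candidates:  # fallback: most accurate model for this cluster
        return int(np.argmax([f[c] for f in llm_features]))
    return min(candidates, key=lambda i: llm_costs[i])
```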
  6. A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs Ankit Rawat,   Veeranjaneyulu Sadhanala,   Afshin Rostamizadeh,   Ayan Chakrabarti,   Wittawat Jitkrittum,   Vladimir Feinberg,   Seungyeon Kim,   Hrayr Harutyunyan,   Nikunj Saunshi,   Zachary Nado,   Rakesh Shivanna,   Sashank Reddi,   Aditya Menon,   Rohan Anil,   Sanjiv Kumar arXiv 2024
    A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
  7. Cascade-Aware Training of Language Models Congchao Wang,   Sean Augenstein,   Keith Rush,   Wittawat Jitkrittum,   Harikrishna Narasimhan,   Ankit Rawat,   Aditya Menon,   Alec Go arXiv 2024
    Reducing serving cost and latency is a fundamental concern for the deployment of language models (LMs) in business applications. To address this, cascades of LMs offer an effective solution that conditionally employs smaller models for simpler queries. Cascaded systems are typically built with independently trained models, neglecting the advantages of considering inference-time interactions of the cascaded LMs during training. In this paper, we present cascade-aware training (CAT), an approach to optimizing the overall quality-cost performance tradeoff of a cascade of LMs. We achieve inference-time benefits by training the small LM with awareness of its place in a cascade and downstream capabilities. We demonstrate the value of the proposed method on over 60 LM tasks from the SuperGLUE, WMT22, and FLAN2021 datasets.
  8. USTAD: Unified Single-model Training Achieving Diverse Scores for Information Retrieval Seungyeon Kim,   Ankit Rawat,   Manzil Zaheer,   Wittawat Jitkrittum,   Veeranjaneyulu Sadhanala,   Sadeep Jayasumana,   Aditya Menon,   Rob Fergus,   Sanjiv Kumar Forty-first International Conference on Machine Learning 2024
    Modern information retrieval (IR) systems consist of multiple stages, such as retrieval and ranking, with Transformer-based models achieving state-of-the-art performance at each stage. In this paper, we challenge the tradition of using separate models for different stages and ask if a single Transformer encoder can provide the relevance scores needed in each stage. We present USTAD – a new unified approach to train a single network that can provide powerful ranking scores as a cross-encoder (CE) model as well as factorized embeddings for large-scale retrieval as a dual-encoder (DE) model. Empirically, we find a single USTAD model to be competitive to separate ranking CE and retrieval DE models. Furthermore, USTAD combines well with a novel embedding matching-based distillation, significantly improving CE to DE distillation. It further motivates novel asymmetric architectures for student models to ensure a better embedding alignment between the student and the teacher while ensuring small online inference cost. On standard benchmarks like MSMARCO, we demonstrate that USTAD with our proposed distillation method leads to asymmetric students with only 1/10th of the trainable parameters while retaining 95-97% of the teacher performance.
  9. Language Model Cascades: Token-Level Uncertainty And Beyond Neha Gupta,   Harikrishna Narasimhan,   Wittawat Jitkrittum,   Ankit Rawat,   Aditya Menon,   Sanjiv Kumar The Twelfth International Conference on Learning Representations 2024
    Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. A simple strategy to achieve more favorable cost-quality tradeoffs is cascading: here, a small model is invoked for most “easy” instances, while a few “hard” instances are deferred to the large model. While the principles underpinning effective cascading are well-studied for classification tasks — with deferral based on predicted class uncertainty favored theoretically and practically — a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either over- or under-emphasizing outputs based on their lengths. This is because LMs produce a sequence of uncertainty values, one for each output token; and moreover, the number of output tokens is variable across different examples. To mitigate the length bias, we propose to exploit the richer token-level uncertainty information implicit in generative LMs. We argue that naive predicted sequence uncertainty corresponds to a simple aggregation of these uncertainties. By contrast, we show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform such simple aggregation strategies, via experiments on a range of natural language benchmarks with FLAN-T5 models. We further show that incorporating embeddings from the smaller model and intermediate layers of the larger model can give an additional boost in the overall cost-quality tradeoff.
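    To illustrate the length-bias point, the following sketch contrasts the raw sequence log-probability with simple length-insensitive aggregates of token-level uncertainty (mean and lower quantile). These are generic aggregators for intuition only, not the learned post-hoc deferral rules studied in the paper.

```python
# Length bias in LM cascade deferral: the raw sequence log-probability scales
# with output length, whereas per-token aggregates (mean, quantile) do not.
import numpy as np

def sequence_logprob(token_logprobs):
    """Sum of token log-probabilities; systematically lower for longer outputs."""
    return float(np.sum(token_logprobs))

def mean_token_logprob(token_logprobs):
    """Length-normalized confidence."""
    return float(np.mean(token_logprobs))

def quantile_token_logprob(token_logprobs, q=0.1):
    """Focus on the least confident tokens (lower tail of the per-token scores)."""
    return float(np.quantile(token_logprobs, q))

def defer(token_logprobs, threshold, aggregator=mean_token_logprob):
    """Defer to the large model when aggregated confidence falls below a threshold."""
    return aggregator(token_logprobs) < threshold
```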
  10. Plugin estimators for selective classification with out-of-distribution detection Harikrishna Narasimhan,   Aditya Menon,   Wittawat Jitkrittum,   Sanjiv Kumar The Twelfth International Conference on Learning Representations 2024
    Real-world classifiers can benefit from the option of abstaining from predicting on samples where they have low confidence. Such abstention is particularly useful on samples which are close to the learned decision boundary, or which are outliers with respect to the training sample. These settings have been the subject of extensive but disjoint study in the selective classification (SC) and out-of-distribution (OOD) detection literature. Recent work on selective classification with OOD detection (SCOD) has argued for the unified study of these problems; however, the formal underpinnings of this problem are still nascent, and existing techniques are heuristic in nature. In this paper, we propose new plugin estimators for SCOD that are theoretically grounded, effective, and generalise existing approaches from the SC and OOD detection literature. In the course of our analysis, we formally explicate how naïve use of existing SC and OOD detection baselines may be inadequate for SCOD. We empirically demonstrate that our approaches yield competitive SC and OOD detection trade-offs compared to common baselines.
  11. When Does Confidence-Based Cascade Deferral Suffice? Wittawat Jitkrittum,   Neha Gupta,   Aditya Menon,   Harikrishna Narasimhan,   Ankit Rawat,   Sanjiv Kumar NeurIPS 2023
    Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade -- e.g., not modelling the errors of downstream models -- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.
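    As a concrete reference point, here is a minimal sketch of confidence-based deferral, assuming two arbitrary classifiers that return logits; the threshold is a free parameter.

```python
# Minimal sketch of confidence-based cascade deferral: invoke the next model
# only when the current model's maximum softmax probability is below a threshold.
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

def cascade_predict(x, model_1, model_2, threshold=0.9):
    p1 = softmax(model_1(x))
    if p1.max() >= threshold:          # confident: terminate early
        return int(np.argmax(p1)), "small"
    p2 = softmax(model_2(x))           # otherwise defer to the larger model
    return int(np.argmax(p2)), "large"
```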
  12. Learning to Reject for Balanced Error and Beyond Harikrishna Narasimhan,   Aditya Menon,   Wittawat Jitkrittum,   Neha Gupta,   Sanjiv Kumar The Twelfth International Conference on Learning Representations 2024
    Learning to reject (L2R) is a classical problem where one seeks a classifier capable of abstaining on low-confidence samples. Most prior work on L2R has focused on minimizing the standard misclassification error. However, in many real-world applications, the label distribution is highly imbalanced, necessitating alternate evaluation metrics such as the balanced error or the worst-group error that enforce equitable performance across both the head and tail classes. In this paper, we establish that traditional L2R methods can be grossly sub-optimal for such metrics, and show that this is due to an intricate dependence in the objective between the label costs and the rejector. We then derive the form of the Bayes-optimal classifier and rejector for the balanced error, propose a novel plug-in approach to mimic this solution, and extend our results to general evaluation metrics. Through experiments on benchmark image classification tasks, we show that our approach yields better trade-offs in both the balanced and worst-group error compared to L2R baselines.
  13. On Bias-Variance Alignment in Deep Models Lin Chen,   Michal Lukasik,   Wittawat Jitkrittum,   Chong You,   Sanjiv Kumar The Twelfth International Conference on Learning Representations 2024
    Classical wisdom in machine learning holds that the generalization error can be decomposed into bias and variance, and these two terms exhibit a trade-off. However, in this paper, we show that for an ensemble of deep learning based classification models, bias and variance are aligned at a sample level, where squared bias is approximately equal to variance for correctly classified sample points. We present empirical evidence confirming this phenomenon in a variety of deep learning models and datasets. Moreover, we study this phenomenon from two theoretical perspectives: calibration and neural collapse. We first show theoretically that under the assumption that the models are well calibrated, we can observe the bias-variance alignment. Second, starting from the picture provided by the neural collapse theory, we show an approximate correlation between bias and variance.
  14. A Kernel Stein Test for Comparing Latent Variable Models Heishiro Kanagawa,   Wittawat Jitkrittum,   Lester Mackey,   Kenji Fukumizu,   Arthur Gretton Journal of the Royal Statistical Society Series B: Statistical Methodology 2023
    We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al. (2016), Proceedings of the 33rd International Conference on Machine Learning, pp. 276–284; Chwialkowski et al. (2016), Proceedings of the 33rd International Conference on Machine Learning, pp. 2606–2615; Yang et al. (2018), Proceedings of the 35th International Conference on Machine Learning, pp. 5561–5570) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structures and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test, which is based on samples from the models and does not exploit the latent structure.
  15. Post-hoc estimators for learning to defer to an expert Harikrishna Narasimhan,   Wittawat Jitkrittum,   Aditya Menon,   Ankit Rawat,   Sanjiv Kumar NeurIPS 2022
    Many practical settings allow a learner to defer predictions to one or more costly experts. For example, the learning to defer paradigm allows a learner to defer to a human expert, at some monetary cost. Similarly, the adaptive inference paradigm allows a base model to defer to one or more large models, at some computational cost. The goal in these settings is to learn classification and deferral mechanisms to optimise a suitable accuracy-cost tradeoff. To achieve this, a central issue studied in prior work is the design of a coherent loss function for both mechanisms. In this work, we demonstrate that existing losses have two subtle limitations: they can encourage underfitting when there is a high cost of deferring, and the deferral function can have a weak dependence on the base model predictions. To resolve these issues, we propose a post-hoc training scheme: we train a deferral function on top of a base model, with the objective of predicting to defer when the base model's error probability exceeds the cost of the expert model. This may be viewed as applying a partial surrogate to the ideal deferral loss, which can lead to a tighter approximation and thus better performance. Empirically, we verify the efficacy of post-hoc training on benchmarks for learning to defer and adaptive inference.
  16. A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch Patsorn Sangkloy,   Wittawat Jitkrittum,   Diyi Yang,   James Hays European Conference on Computer Vision 2022
    We address the problem of retrieving in-the-wild images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.
  17. Discussion of Multiscale Fisher's Independence Test for Multivariate Dependence Antonin Schrab,   Wittawat Jitkrittum,   Zoltán Szabó,   Dino Sejdinovic,   Arthur Gretton Biometrika 2022
    We discuss how MultiFIT, the Multiscale Fisher's Independence Test for Multivariate Dependence proposed by Gorsky and Ma (2022), compares to existing linear-time kernel tests based on the Hilbert-Schmidt independence criterion (HSIC). We highlight the fact that the levels of the kernel tests at any finite sample size can be controlled exactly, as it is the case with the level of MultiFIT. In our experiments, we observe some of the performance limitations of MultiFIT in terms of test power.
  18. ELM: Embedding and Logit Margins for Long-Tail Learning Wittawat Jitkrittum,   Aditya Menon,   Ankit Rawat,   Sanjiv Kumar 2022
    Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners. Several recent approaches for the problem have proposed enforcing a suitable margin in logit space. Such techniques are intuitive analogues of the guiding principle behind SVMs, and are equally applicable to linear models and neural models. However, when applied to neural models, such techniques do not explicitly control the geometry of the learned embeddings. This can be potentially sub-optimal, since embeddings for tail classes may be diffuse, resulting in poor generalization for these classes. We present Embedding and Logit Margins (ELM), a unified approach to enforce margins in logit space, and regularize the distribution of embeddings. This connects losses for long-tail learning to proposals in the literature on metric embedding, and contrastive learning. We theoretically show that minimising the proposed ELM objective helps reduce the generalisation gap. The ELM method is shown to perform well empirically, and results in tighter tail class embeddings.
  19. ABCDP: Approximate Bayesian Computation with Differential Privacy Mijung Park,   Margarita Vinaroz,   Wittawat Jitkrittum Entropy 2021
    We developed a novel approximate Bayesian computation (ABC) framework, ABCDP, which produces differentially private (DP) and approximate posterior samples. Our framework takes advantage of the sparse vector technique (SVT), widely studied in the differential privacy literature. SVT incurs the privacy cost only when a condition (whether a quantity of interest is above/below a threshold) is met. If the condition is sparsely met during the repeated queries, SVT can drastically reduce the cumulative privacy loss, unlike the usual case where every query incurs the privacy loss. In ABC, the quantity of interest is the distance between observed and simulated data, and only when the distance is below a threshold can we take the corresponding prior sample as a posterior sample. Hence, applying SVT to ABC is an organic way to transform an ABC algorithm into a privacy-preserving variant with minimal modification, while yielding posterior samples with a high privacy level. We theoretically analyzed the interplay between the noise added for privacy and the accuracy of the posterior samples. We apply ABCDP to several data simulators and show the efficacy of the proposed framework.
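    The following is a schematic, not the paper's calibrated algorithm, of how a sparse-vector-style mechanism can gate ABC acceptances: Laplace noise is added to both the acceptance threshold and each simulated-vs-observed distance, and privacy budget is consumed mainly by the (sparse) acceptance events. The noise scales and the threshold-resampling rule below are placeholders and would have to follow the paper's privacy analysis.

```python
# Schematic sparse-vector-style thresholding for ABC acceptance (sketch only).
import numpy as np

def abcdp_sketch(prior_sampler, simulator, distance, y_obs,
                 threshold, eps, sensitivity, n_iter,
                 rng=np.random.default_rng(0)):
    accepted = []
    noisy_threshold = threshold + rng.laplace(scale=2 * sensitivity / eps)
    for _ in range(n_iter):
        theta = prior_sampler(rng)
        d = distance(simulator(theta, rng), y_obs)
        noisy_d = d + rng.laplace(scale=4 * sensitivity / eps)
        if noisy_d <= noisy_threshold:   # sparse event: keep theta as a posterior sample
            accepted.append(theta)
            noisy_threshold = threshold + rng.laplace(scale=2 * sensitivity / eps)
    return accepted
```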
  20. Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces Ankit Rawat,   Aditya Menon,   Wittawat Jitkrittum,   Sadeep Jayasumana,   Felix Yu,   Sashank Reddi,   Sanjiv Kumar ICML 2021
    Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels. Further, we provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance. We empirically verify our findings on long-tail classification and retrieval benchmarks.
  21. A Witness Two-Sample Test Jonas Kübler,   Wittawat Jitkrittum,   Bernhard Schölkopf,   Krikamol Muandet Proceedings of The 25th International Conference on Artificial Intelligence and Statistics 2022
    The Maximum Mean Discrepancy (MMD) has been the state-of-the-art nonparametric test for tackling the two-sample problem. Its statistic is given by the difference in expectations of the witness function, a real-valued function defined as a weighted sum of kernel evaluations on a set of basis points. Typically the kernel is optimized on a training set, and hypothesis testing is performed on a separate test set to avoid overfitting (i.e., control type-I error). That is, the test set is used to simultaneously estimate the expectations and define the basis points, while the training set only serves to select the kernel and is discarded. In this work, we propose to use the training data to also define the weights and the basis points for better data efficiency. We show that 1) the new test is consistent and has a well-controlled type-I error; 2) the optimal witness function is given by a precision-weighted mean in the reproducing kernel Hilbert space associated with the kernel; and 3) the test power of the proposed test is comparable or exceeds that of the MMD and other modern tests, as verified empirically on challenging synthetic and real problems (e.g., Higgs data).
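    A hedged sketch of the overall recipe (kernel choice, the precision weighting of the witness, and the paper's exact null calibration are omitted): fit an MMD-style witness on the training split, then test whether its mean differs across the held-out samples.

```python
# Sketch of a witness-based two-sample test with an MMD-style witness.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import ttest_ind

def rbf_kernel(A, B, bandwidth):
    return np.exp(-cdist(A, B, "sqeuclidean") / (2 * bandwidth ** 2))

def witness_test(X_tr, Y_tr, X_te, Y_te, bandwidth):
    # Witness f(t) = mean_i k(x_i, t) - mean_j k(y_j, t), fitted on training data.
    fX = rbf_kernel(X_te, X_tr, bandwidth).mean(axis=1) - rbf_kernel(X_te, Y_tr, bandwidth).mean(axis=1)
    fY = rbf_kernel(Y_te, X_tr, bandwidth).mean(axis=1) - rbf_kernel(Y_te, Y_tr, bandwidth).mean(axis=1)
    # Compare the witness means on the held-out split with a two-sample t-test.
    return ttest_ind(fX, fY, equal_var=False)
```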
  22. Kernel Distributionally Robust Optimization: Generalized Duality Theorem and Stochastic Approximation Jia-Jie Zhu,   Wittawat Jitkrittum,   Moritz Diehl,   Bernhard Schölkopf Proceedings of The 24th International Conference on Artificial Intelligence and Statistics 2021
    We propose kernel distributionally robust optimization (Kernel DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks.
  23. More Powerful Selective Kernel Tests for Feature Selection Jen Lim,   Makoto Yamada,   Wittawat Jitkrittum,   Yoshikazu Terada,   Shigeyuki Matsui,   Hidetoshi Shimodaira Proceedings of The 23rd International Conference on Artificial Intelligence and Statistics 2020
    Refining one’s hypotheses in light of data is a commonplace scientific practice; however, this approach introduces selection bias and can lead to specious statistical analysis. One approach to addressing this phenomenon is to condition on the selection procedure, i.e., how we have used the data to generate our hypotheses, which prevents information from being used again after selection. Many selective inference (a.k.a. post-selection inference) algorithms typically take this approach but will “over-condition” for the sake of tractability. While this practice obtains well-calibrated $p$-values, it can incur a major loss in power. In our work, we extend two recent proposals for selecting features using the Maximum Mean Discrepancy and the Hilbert-Schmidt Independence Criterion to condition on the minimal conditioning event. We show how recent advances in the multiscale bootstrap make this possible, and demonstrate our proposal over a range of synthetic and real-world experiments. Our results show that our proposed test is indeed more powerful in most scenarios.
  24. Learning Kernel Tests Without Data Splitting Jonas Kübler,   Wittawat Jitkrittum,   Bernhard Schölkopf,   Krikamol Muandet NeurIPS 2020
    Modern large-scale kernel-based tests such as maximum mean discrepancy (MMD) and kernelized Stein discrepancy (KSD) optimize kernel hyperparameters on a held-out sample via data splitting to obtain the most powerful test statistics. While data splitting results in a tractable null distribution, it suffers from a reduction in test power due to smaller test sample size. Inspired by the selective inference framework, we propose an approach that enables learning the hyperparameters and testing on the full sample without data splitting. Our approach can correctly calibrate the test in the presence of such dependency, and yield a test threshold in closed form. At the same significance level, our approach's test power is empirically larger than that of the data-splitting approach, regardless of its split proportion.
  25. Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem Jia-Jie Zhu,   Wittawat Jitkrittum,   Moritz Diehl,   Bernhard Schölkopf 2020 59th IEEE Conference on Decision and Control (CDC) 2020
    In order to anticipate rare and impactful events, we propose to quantify the worst-case risk under distributional ambiguity using a recent development in kernel methods - the kernel mean embedding. Specifically, we formulate the generalized moment problem whose ambiguity set (i.e., the moment constraint) is described by constraints in the associated reproducing kernel Hilbert space in a nonparametric manner. We then present the tractable approximation and its theoretical justification. As a concrete application, we numerically test the proposed method in characterizing the worst-case constraint violation probability in the context of a constrained stochastic control system.
  26. Testing Goodness of Fit of Conditional Density Models with Kernels Wittawat Jitkrittum,   Heishiro Kanagawa,   Bernhard Schölkopf The Conference on Uncertainty in Artificial Intelligence (UAI) 2020
    We propose two nonparametric statistical tests of goodness of fit for conditional distributions: given a conditional probability density function p(y|x) and a joint sample, decide whether the sample is drawn from p(y|x) r_x(x) for some density r_x. Our tests, formulated with a Stein operator, can be applied to any differentiable conditional density model, and require no knowledge of the normalizing constant. We show that 1) our tests are consistent against any fixed alternative conditional model; 2) the statistics can be estimated easily, requiring no density estimation as an intermediate step; and 3) our second test offers an interpretable test result providing insight on where the conditional model does not fit well in the domain of the covariate. We demonstrate the interpretability of our test on a task of modeling the distribution of New York City's taxi drop-off location given a pick-up point. To our knowledge, our work is the first to propose such conditional goodness-of-fit tests that simultaneously have all these desirable properties.
  27. Kernel Conditional Moment Test via Maximum Moment Restriction Krikamol Muandet,   Wittawat Jitkrittum,   Jonas Kübler The Conference on Uncertainty in Artificial Intelligence (UAI) 2020
    We propose a new family of specification tests called kernel conditional moment (KCM) tests. Our tests are built on conditional moment embeddings (CMME)---a novel representation of conditional moment restrictions in a reproducing kernel Hilbert space (RKHS). After transforming the conditional moment restrictions into a continuum of unconditional counterparts, the test statistic is defined as the maximum moment restriction within the unit ball of the RKHS. We show that the CMME fully characterizes the original conditional moment restrictions, leading to consistency in both hypothesis testing and parameter estimation. The proposed test also has an analytic expression that is easy to compute as well as closed-form asymptotic distributions. Our empirical studies show that the KCM test has a promising finite-sample performance compared to existing tests.
  28. Kernel Stein Tests for Multiple Model Comparison Jen Lim,   Makoto Yamada,   Bernhard Schölkopf,   Wittawat Jitkrittum NeurIPS 2019
    We address the problem of nonparametric multiple model comparison: given $l$ candidate models, decide whether each candidate is as good as the best one(s) in the list (negative), or worse (positive). We propose two statistical tests, each controlling a different notion of decision errors. The first test, building on the post selection inference framework, provably controls the fraction of best models that are wrongly declared worse (false positive rate). The second test is based on multiple correction, and controls the fraction of the models declared worse that are in fact as good as the best (false discovery rate). We prove that under some conditions the first test can yield a higher true positive rate than the second. Experimental results on toy and real (CelebA, Chicago Crime data) problems show that the two tests have high true positive rates with well-controlled error rates. By contrast, the naive approach of choosing the model with the lowest score without correction leads to a large number of false positives.
  29. Fisher Efficient Inference of Intractable Models Song Liu,   Takafumi Kanamori,   Wittawat Jitkrittum,   Yu Chen NeurIPS 2019
    The maximum likelihood estimator (MLE) has many good properties. For example, the asymptotic variance of the MLE solution attains the asymptotic Cramér-Rao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such an MLE solution requires calculating the likelihood function, which may not be tractable due to the normalization term of the density model. In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion, implemented via a density ratio estimation procedure and a Stein operator. We study the problem of model inference using DLE. We prove its consistency and show that the asymptotic variance of its solution can also attain the efficiency bound under mild regularity conditions. We also propose a dual formulation of DLE which can be easily optimized. Numerical studies validate our asymptotic theorems and we give an example where DLE successfully estimates an intractable model constructed using a pre-trained deep neural network.
  30. Generate Semantically Similar Images with Kernel Mean Matching Wittawat Jitkrittum,   Patsorn Sangkloy,   Muhammad Gondal,   Amit Raj,   James Hays,   Bernhard Schölkopf Women in Computer Vision Workshop, CVPR 2019
    Oral presentation. 6 out of 64 accepted papers.
  31. Kernel Mean Matching for Content Addressability of GANs Wittawat Jitkrittum,   Patsorn Sangkloy,   Muhammad Gondal,   Amit Raj,   James Hays,   Bernhard Schölkopf ICML 2019
    Long oral presentation
    We propose a novel procedure which adds "content-addressability" to any given unconditional implicit model e.g., a generative adversarial network (GAN). The procedure allows users to control the generative process by specifying a set (arbitrary size) of desired examples based on which similar samples are generated from the model. The proposed approach, based on kernel mean matching, is applicable to any generative models which transform latent vectors to samples, and does not require retraining of the model. Experiments on various high-dimensional image generation problems (CelebA-HQ, LSUN bedroom, bridge, tower) show that our approach is able to generate images which are consistent with the input set, while retaining the image quality of the original model. To our knowledge, this is the first work that attempts to construct, at test time, a content-addressable generative model from a trained marginal model.
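    A conceptual sketch of kernel mean matching at test time (not the released code; the generator G, the feature extractor, and the kernel bandwidth are placeholders): latent vectors are optimized so that generated samples match the kernel mean embedding of a user-provided input set, while the pretrained generator is left untouched.

```python
# Conceptual sketch: optimize latent vectors to match the kernel mean of an input set.
import torch

def rbf_mmd2(X, Y, bandwidth):
    """Biased squared-MMD estimate with an RBF kernel."""
    def k(A, B):
        d2 = torch.cdist(A, B) ** 2
        return torch.exp(-d2 / (2 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def match_latents(G, featurizer, target_images, n_samples=8, steps=200, lr=0.05,
                  latent_dim=128, bandwidth=10.0):
    target_feats = featurizer(target_images).detach()
    z = torch.randn(n_samples, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = rbf_mmd2(featurizer(G(z)), target_feats, bandwidth)
        loss.backward()
        opt.step()
    return z.detach()   # latents whose generated samples resemble the input set
```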
  32. Kernel-Guided Training of Implicit Generative Models with Stability Guarantees Arash Mehrjou,   Wittawat Jitkrittum,   Krikamol Muandet,   Bernhard Schölkopf ArXiv 2019
    Modern implicit generative models such as generative adversarial networks (GANs) are generally known to suffer from issues such as instability, uninterpretability, and difficulty in assessing their performance. If we see these implicit models as dynamical systems, some of these issues are caused by being unable to control their behavior in a meaningful way during the course of training. In this work, we propose a theoretically grounded method to guide the training trajectories of GANs by augmenting the GAN loss function with a kernel-based regularization term that controls local and global discrepancies between the model and true distributions. This control signal allows us to inject prior knowledge into the model. We provide theoretical guarantees on the stability of the resulting dynamical system and demonstrate different aspects of it via a wide range of experiments.
  33. Informative Features for Model Comparison Wittawat Jitkrittum,   Heishiro Kanagawa,   Patsorn Sangkloy,   James Hays,   Bernhard Schölkopf,   Arthur Gretton NeurIPS 2018
    Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.
  34. Large Sample Analysis of the Median Heuristic Damien Garreau,   Wittawat Jitkrittum,   Motonobu Kanagawa ArXiv 2018
    In kernel methods, the median heuristic has been widely used as a way of setting the bandwidth of RBF kernels. While its empirical performances make it a safe choice under many circumstances, there is little theoretical understanding of why this is the case. Our aim in this paper is to advance our understanding of the median heuristic by focusing on the setting of kernel two-sample test. We collect new findings that may be of interest for both theoreticians and practitioners. In theory, we provide a convergence analysis that shows the asymptotic normality of the bandwidth chosen by the median heuristic in the setting of kernel two-sample test. Systematic empirical investigations are also conducted in simple settings, comparing the performances based on the bandwidths chosen by the median heuristic and those by the maximization of test power.
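    For reference, the median heuristic itself is a one-liner: set the RBF bandwidth to the median pairwise distance of the pooled sample (some variants use the median of squared distances instead; the paper analyzes its behavior in the two-sample setting).

```python
# The median heuristic: bandwidth = median pairwise Euclidean distance of the pooled data.
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_bandwidth(X, Y):
    Z = np.vstack([X, Y])          # pooled two-sample data, shape (n, d)
    return np.median(pdist(Z))     # median inter-point distance
```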
  35. A Linear-Time Kernel Goodness-of-Fit Test Wittawat Jitkrittum,   Wenkai Xu,   Zoltan Szabo,   Kenji Fukumizu,   Arthur Gretton NeurIPS 2017
    NeurIPS 2017 Best paper. 3 out of 3240 submissions.
    We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.
  36. Kernel-Based Distribution Features for Statistical Tests and Bayesian Inference Wittawat Jitkrittum 2017
    My PhD Thesis. Gatsby Unit, University College London.
    The kernel mean embedding is known to provide a data representation which preserves full information of the data distribution. While typically computationally costly, its nonparametric nature has an advantage of requiring no explicit model specification of the data. At the other extreme are approaches which summarize data distributions into a finite-dimensional vector of hand-picked summary statistics. This explicit finite-dimensional representation offers a computationally cheaper alternative. Clearly, there is a trade-off between cost and sufficiency of the representation, and it is of interest to have a computationally efficient technique which can produce a data-driven representation, thus combining the advantages from both extremes. The main focus of this thesis is on the development of linear-time mean-embedding-based methods to automatically extract informative features of data distributions, for statistical tests and Bayesian inference. In the first part on statistical tests, several new linear-time techniques are developed. These include a new kernel-based distance measure for distributions, a new linear-time nonparametric dependence measure, and a linear-time discrepancy measure between a probabilistic model and a sample, based on a Stein operator. These new measures give rise to linear-time and consistent tests of homogeneity, independence, and goodness of fit, respectively. The key idea behind these new tests is to explicitly learn distribution-characterizing feature vectors, by maximizing a proxy for the probability of correctly rejecting the null hypothesis. We theoretically show that these new tests are consistent for any finite number of features. In the second part, we explore the use of random Fourier features to construct approximate kernel mean embeddings, for representing messages in expectation propagation (EP) algorithm. The goal is to learn a message operator which predicts EP outgoing messages from incoming messages. We derive a novel two-layer random feature representation of the input messages, allowing online learning of the operator during EP inference.
  37. Cognitive Bias in Ambiguity Judgements: Using Computational Models to Dissect the Effects of Mild Mood Manipulation in Humans Kiyohito Iigaya,   Aurelie Jolivald,   Wittawat Jitkrittum,   Iain Gilchrist,   Peter Dayan,   Elizabeth Paul,   Michael Mendl PLOS ONE 2016
    Positive and negative moods can be treated as prior expectations over future delivery of rewards and punishments. This provides an inferential foundation for the cognitive (judgement) bias task, now widely-used for assessing affective states in non-human animals. In the task, information about affect is extracted from the optimistic or pessimistic manner in which participants resolve ambiguities in sensory input. Here, we report a novel variant of the task aimed at dissecting the effects of affect manipulations on perceptual and value computations for decision-making under ambiguity in humans. Participants were instructed to judge which way a Gabor patch (250ms presentation) was leaning. If the stimulus leant one way (e.g. left), pressing the REWard key yielded a monetary WIN whilst pressing the SAFE key failed to acquire the WIN. If it leant the other way (e.g. right), pressing the SAFE key avoided a LOSS whilst pressing the REWard key incurred the LOSS. The size (0–100 UK pence) of the offered WIN and threatened LOSS, and the ambiguity of the stimulus (vertical being completely ambiguous) were varied on a trial-by-trial basis, allowing us to investigate how decisions were affected by differing combinations of these factors. Half the subjects performed the task in a ‘Pleasantly’ decorated room and were given a gift (bag of sweets) prior to starting, whilst the other half were in a bare ‘Unpleasant’ room and were not given anything. Although these treatments had little effect on self-reported mood, they did lead to differences in decision-making. All subjects were risk averse under ambiguity, consistent with the notion of loss aversion. Analysis using a Bayesian decision model indicated that Unpleasant Room subjects were (‘pessimistically’) biased towards choosing the SAFE key under ambiguity, but also weighed WINS more heavily than LOSSes compared to Pleasant Room subjects. These apparently contradictory findings may be explained by the influence of affect on different processes underlying decision-making, and the task presented here offers opportunities for further dissecting such processes.
  38. An Adaptive Test of Independence with Analytic Kernel Embeddings Wittawat Jitkrittum,   Zoltán Szabó,   Arthur Gretton ICML 2017
    A new computationally efficient dependence measure, and an adaptive statistical test of independence, are proposed. The dependence measure is the difference between analytic embeddings of the joint distribution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bound on the test power, resulting in a test that is data-efficient, and that runs in linear time (with respect to the sample size n). The optimized features can be interpreted as evidence to reject the null hypothesis, indicating regions in the joint domain where the joint distribution and the product of the marginals differ most. Consistency of the independence test is established, for an appropriate choice of features. In real-world benchmarks, independence tests using the optimized features perform comparably to the state-of-the-art quadratic-time HSIC test, and outperform competing O(n) and O(n log n) tests.
  39. Interpretable Distribution Features with Maximum Testing Power Wittawat Jitkrittum,   Zoltán Szabó,   Kacper Chwialkowski,   Arthur Gretton NeurIPS 2016
    Oral presentation. 1.8% of total submissions.
    Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e., features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using these features. The result is a parsimonious and interpretable indication of how and where two distributions differ locally. An empirical estimate of the test power criterion converges with increasing sample size, ensuring the quality of the returned features. In real-world benchmarks on high-dimensional text and image data, linear-time tests using the proposed semimetrics achieve comparable performance to the state-of-the-art quadratic-time maximum mean discrepancy test, while returning human-interpretable features that explain the test results.
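    A sketch of a mean-embedding-style statistic of this flavor (the optimization of locations and bandwidth, which is the paper's key ingredient, is omitted, and equal sample sizes are assumed): differences of kernel evaluations at J test locations are combined into a Hotelling-type statistic that is asymptotically chi-squared with J degrees of freedom under the null.

```python
# Sketch of a mean-embedding statistic evaluated at J fixed test locations.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import chi2

def me_statistic(X, Y, locations, bandwidth, reg=1e-5):
    kX = np.exp(-cdist(X, locations, "sqeuclidean") / (2 * bandwidth ** 2))
    kY = np.exp(-cdist(Y, locations, "sqeuclidean") / (2 * bandwidth ** 2))
    Z = kX - kY                          # assumes len(X) == len(Y)
    zbar = Z.mean(axis=0)
    S = np.cov(Z, rowvar=False) + reg * np.eye(Z.shape[1])
    stat = len(Z) * zbar @ np.linalg.solve(S, zbar)
    pval = chi2.sf(stat, df=Z.shape[1])  # asymptotic chi-squared null with J dof
    return stat, pval
```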
  40. K2-ABC: Approximate Bayesian Computation with Infinite Dimensional Summary Statistics via Kernel Embeddings Mijung Park,   Wittawat Jitkrittum,   Dino Sejdinovic AISTATS 2016
    Oral presentation. 6.5% of total submissions.
    Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the similarity between simulated and observed data in terms of a chosen set of summary statistics. However, there is no general rule to construct sufficient summary statistics for complex models. Insufficient summary statistics will "leak" information, which leads to ABC algorithms yielding samples from an incorrect (partial) posterior. In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. Our approach, K2-ABC, uses maximum mean discrepancy (MMD) as a dissimilarity measure between the distributions over observed and simulated data. MMD is easily estimated as the squared difference between their empirical kernel embeddings. Experiments on a simulated scenario and a real-world biological problem illustrate the effectiveness of the proposed algorithm.
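    In code, the core loop of a K2-ABC-style sampler is short. This is a sketch with a biased MMD estimate and a fixed bandwidth; the choice of epsilon, kernel, and MMD estimator in the paper may differ.

```python
# Sketch: weight each prior draw by a soft acceptance kernel on the MMD
# between its simulated data and the observed data.
import numpy as np
from scipy.spatial.distance import cdist

def mmd2_biased(X, Y, bandwidth):
    kxx = np.exp(-cdist(X, X, "sqeuclidean") / (2 * bandwidth ** 2)).mean()
    kyy = np.exp(-cdist(Y, Y, "sqeuclidean") / (2 * bandwidth ** 2)).mean()
    kxy = np.exp(-cdist(X, Y, "sqeuclidean") / (2 * bandwidth ** 2)).mean()
    return kxx + kyy - 2 * kxy

def k2_abc(prior_sampler, simulator, y_obs, n_draws, epsilon, bandwidth,
           rng=np.random.default_rng(0)):
    thetas, weights = [], []
    for _ in range(n_draws):
        theta = prior_sampler(rng)
        y_sim = simulator(theta, rng)
        weights.append(np.exp(-mmd2_biased(y_sim, y_obs, bandwidth) / epsilon))
        thetas.append(theta)
    weights = np.array(weights)
    return np.array(thetas), weights / weights.sum()   # weighted posterior sample
```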
  41. Bayesian Manifold Learning: The Locally Linear Latent Variable Model Mijung Park,   Wittawat Jitkrittum,   Ahmad Qamar,   Zoltán Szabó,   Lars Buesing,   Maneesh Sahani NeurIPS 2015
    We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for non-linear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. The model allows straightforward variational optimisation of the posterior distribution on coordinates and locally linear maps from the latent space to the observation space given the data. Thus, the LL-LVM encapsulates the local-geometry preserving intuitions that underlie non-probabilistic methods such as locally linear embedding (LLE). Its probabilistic semantics make it easy to evaluate the quality of hypothesised neighbourhood relationships, select the intrinsic dimensionality of the manifold, construct out-of-sample extensions and to combine the manifold model with additional probabilistic models that capture the structure of coordinates within the manifold.
  42. Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages Wittawat Jitkrittum,   Arthur Gretton,   Nicolas Heess,   S. Eslami,   Balaji Lakshminarayanan,   Dino Sejdinovic,   Zoltán Szabó UAI 2015
    We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages, and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates, and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.
  43. Performance of synchrony and spectral-based features in early seizure detection: exploring feature combinations and effect of latency Vincent Adam,   Joana Soldado-Magraner,   Wittawat Jitkrittum,   Heiko Strathmann,   Balaji Lakshminarayanan,   Alessandro Ialongo,   Gergo Bohner,   Ben Huh,   Lea Goetz,   Shaun Dowling,   Iulian Serban,   Matthieu Louis International Workshop on Seizure Prediction (IWSP) 2015: Epilepsy Mechanisms, Models, Prediction and Control 2015
  44. High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso Makoto Yamada,   Wittawat Jitkrittum,   Leonid Sigal,   Eric Xing,   Masashi Sugiyama Neural Computation 2014
    The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that with particular choices of kernel functions, nonredundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.
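    For context, the biased empirical HSIC referenced above has a compact closed form, HSIC(X, Y) = trace(K H L H) / (n - 1)^2 with centered Gram matrices. A sketch follows; the feature-wise kernels and the Lasso formulation of the paper are not shown.

```python
# Biased empirical HSIC with RBF kernels on both variables.
import numpy as np
from scipy.spatial.distance import cdist

def rbf_gram(A, bandwidth):
    return np.exp(-cdist(A, A, "sqeuclidean") / (2 * bandwidth ** 2))

def hsic_biased(X, Y, bw_x, bw_y):
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K, L = rbf_gram(X, bw_x), rbf_gram(Y, bw_y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```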
  45. Feature Selection via L1-Penalized Squared-Loss Mutual Information Wittawat Jitkrittum,   Hirotaka Hachiya,   Masashi Sugiyama IEICE Transactions 2013
    Feature selection is a technique to screen out less important features. Many existing supervised feature selection algorithms use redundancy and relevancy as the main criteria to select features. However, feature interaction, potentially a key characteristic in real-world problems, has not received much attention. As an attempt to take feature interaction into account, we propose L1-LSMI, an L1-regularization based algorithm that maximizes a squared-loss variant of mutual information between selected features and outputs. Numerical results show that L1-LSMI performs well in handling redundancy, detecting non-linear dependency, and considering feature interaction.
  46. Squared-loss Mutual Information Regularization: A Novel Information-theoretic Approach to Semi-supervised Learning Gang Niu,   Wittawat Jitkrittum,   Bo Dai,   Hirotaka Hachiya,   Masashi Sugiyama ICML 2013
    We propose squared-loss mutual information regularization (SMIR) for multi-class probabilistic classification, following the information maximization principle. SMIR is convex under mild conditions and thus improves the nonconvexity of mutual information regularization. It offers all of the following four abilities to semi-supervised algorithms: Analytical solution, out-of-sample/multi-class classification, and probabilistic output. Furthermore, novel generalization error bounds are derived. Experiments show SMIR compares favorably with state-of-the-art methods.
  47. QAST: Question Answering System for Thai Wikipedia Wittawat Jitkrittum,   Choochart Haruechaiyasak,   Thanaruk Theeramunkong Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering Questions 2009
    We propose an open-domain question answering system using Thai Wikipedia as the knowledge base. Two types of information are used for answering a question: (1) structured information extracted and stored in the form of Resource Description Framework (RDF), and (2) unstructured texts stored as a search index. For the structured information, SPARQL transformed query is applied to retrieve a short answer from the RDF base. For the unstructured information, keyword-based query is used to retrieve the shortest text span containing the question's key terms. From the experimental results, the system which integrates both approaches could achieve an average MRR of 0.47 based on 215 test questions.
  48. Implementing News Article Category Browsing Based on Text Categorization Technique Choochart Haruechaiyasak,   Wittawat Jitkrittum,   Chatchawal Sangkeettrakarn,   Chaianun Damrongrat Web Intelligence/IAT Workshops 2008
    We propose a feature called category browsing to enhance the full-text search function of Thai-language news article search engine. The category browsing allows users to browse and filter search results based on some predefined categories. To implement the category browsing feature, we applied and compared among several text categorization algorithms including decision tree, Naive Bayes (NB) and Support Vector Machines (SVM). To further increase the performance of text categorization, we performed evaluation among many feature selection techniques including document frequency thresholding (DF), information gain (IG) and χ² (CHI). Based on our experiments using a large news corpus, the SVM algorithm with the IG feature selection yielded the best performance with the F1 measure equal to 95.42%.
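    A hedged, modern illustration of this kind of pipeline (not the paper's Thai-news system; scikit-learn's chi-squared filter stands in for the IG/CHI selection, and the parameters are arbitrary):

```python
# Filter-style term selection followed by a linear SVM on bag-of-words counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_classifier(k_features=2000):
    return make_pipeline(
        CountVectorizer(),                 # bag-of-words term counts
        SelectKBest(chi2, k=k_features),   # filter-style feature selection (chi-squared here)
        LinearSVC(),                       # linear SVM classifier
    )

# Usage sketch: clf = build_classifier(); clf.fit(train_texts, train_labels)
```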
  49. Proximity-Based Semantic Relatedness Measurement on Thai Wikipedia Wittawat Jitkrittum,   Thanaruk Theeramunkong,   Choochart Haruechaiyasak International Conference on Knowledge, Information and Creativity Support Systems (KICSS) 2008
  50. Managing Offline Educational Web Contents with Search Engine Tools Choochart Haruechaiyasak,   Chatchawal Sangkeettrakarn,   Wittawat Jitkrittum International Conference on Asian Digital Libraries 2007
    In this paper, we describe our ongoing project to help alleviate the digital divide problem among high schools in rural areas of Thailand. The idea is to select, organize, index and distribute useful educational Web contents to schools where the Internet connection is not available. These Web contents can be used by teachers and students to enhance the teaching and learning for many class subjects. We have collaborated with a group of teachers from different high schools in order to gather the requirements for designing our software tools. One of the challenging issues is the variation in computer hardware and network configuration found in different schools. Some schools have PCs connected to the school's server via the Local Area Network (LAN), while some other schools have low-performance PCs without any network connection. To support both cases, we provide two solutions via two different search engine tools. These tools support content administrators, e.g., teachers, with the features to organize and index the contents. The tools also provide general users with the features to browse and search for needed information. Since the contents and index are locally stored on hard disk or some removable media such as CD-ROM, the Internet connection is not needed.